When we have a large amount of structured data, numerical or categorical, a simple dense neural network often gives strong results. A CNN architecture handles image data well, and an RNN or LSTM architecture is a common choice for predictions on text. But what do we do when a problem requires us to handle all of these data types together?
Let us consider the example of the Avito Demand Prediction Challenge that took place a year ago. Avito is Russia’s largest classified advertisements website. A primary concern for most sellers is to understand the demand and supply equation for their products. Such insights can help sellers increase prices when demand is high, or work on improving the advertisement or even the product itself when demand is low. The task in this competition was to learn from the properties of each ad, its context and the historical demand for similar ads in order to predict the demand the ad can generate. Once we have a good estimate of the demand, Avito can work with the sellers on its platform to better optimize their ad listings.
This was a good opportunity to design an all-inclusive architecture.
The basic idea of an all-inclusive architecture is to build separate neural network architectures for different types of data and join them together at the end to create a bigger neural network. The final output is then our prediction, which in this case is the saleability of the product. Let us see how it could be done.
First, let us look at the numerical values. We had the item price and a few engineered features, such as the dimensions of the ad image and the length of the description. These features could be fed to a dense neural network with one hidden layer.
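To make the numerical branch concrete, here is a minimal framework-free sketch of a forward pass through one hidden layer. The feature values, the hidden size of 8, the ReLU activation and the random weights are all assumptions made for illustration; in practice this branch would be a dense layer in a deep learning library, with weights learned during training.

```python
import random


def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one output per weight row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)

# Hypothetical, already scaled features for one ad:
# price, image width, image height, description length.
numeric_features = [0.12, 0.80, 0.60, 0.35]

hidden_weights, hidden_biases = init_layer(len(numeric_features), 8, rng)

# The hidden activations are what this branch hands to the rest of the model.
numeric_branch_output = dense(numeric_features, hidden_weights, hidden_biases, relu)
```

The output of this branch is a short vector of hidden activations rather than a prediction, which is what allows it to be joined with the other branches later.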
Second, we had the categorical values: region, city, the ad's category, its parent category, some parameter features and the ad image category. We could apply one-hot encoding or label encoding to each of these features, concatenate the result with the numerical values and feed everything to the dense network we had already designed. The catch was that a couple of these features, namely ad category and image category, had hundreds of levels. So instead of treating all categorical features the same way, we handled them separately. Features with fewer than 20 levels were grouped together, one-hot encoded and given their own neural network. High-cardinality features such as ad category (category_name) were passed through a categorical embedding and fed to their own dedicated network. The same was done with the image category feature, image_top_1. At this point we had four individual neural networks, and we had yet to cater to the text and image data.
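The routing of categorical features described above can be sketched in plain Python. The toy table, the column name parent_category, the embedding size of 4 and the random embedding values are assumptions for illustration only; the threshold of 20 levels and the name category_name come from the text. In a real model the embedding table would be a trainable layer.

```python
import random


def split_by_cardinality(columns, threshold=20):
    """Separate column names into low- and high-cardinality groups."""
    low, high = [], []
    for name, values in columns.items():
        (low if len(set(values)) < threshold else high).append(name)
    return low, high


def one_hot(value, levels):
    """Vector with a single 1.0 at the position of `value` in `levels`."""
    return [1.0 if value == level else 0.0 for level in levels]


# Toy data: 25 ads, with made-up level names.
n_rows = 25
columns = {
    "region": [f"region_{i % 3}" for i in range(n_rows)],
    "parent_category": [f"parent_{i % 5}" for i in range(n_rows)],
    "category_name": [f"category_{i}" for i in range(n_rows)],
}

low_cardinality, high_cardinality = split_by_cardinality(columns)

# Low-cardinality features: one-hot encode each and concatenate (first ad shown).
one_hot_row = []
for name in low_cardinality:
    levels = sorted(set(columns[name]))
    one_hot_row += one_hot(columns[name][0], levels)

# High-cardinality feature: map each level to an index, then look up a dense
# vector. The table is random here; a real model learns it during training.
rng = random.Random(0)
embedding_dim = 4
levels = sorted(set(columns["category_name"]))
level_index = {level: i for i, level in enumerate(levels)}
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)] for _ in levels
]
embedded_category = embedding_table[level_index[columns["category_name"][0]]]
```

With one-hot encoding, the input width grows with the number of levels, whereas the embedding keeps the width fixed at embedding_dim no matter how many levels the feature has. That is the motivation for treating the two groups differently.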
For the text fields, Title and Description, we created two more neural networks. This time, instead of a dense network, we designed an LSTM network to deal with the sequential nature of text. For the images, we created a simple CNN architecture with two convolution layers, each accompanied by dropout and normalization layers.
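To show why an LSTM suits sequential input, here is a minimal sketch of a standard LSTM cell run over a tokenized title, written without any framework. The vocabulary, the example title, the embedding size of 3, the hidden size of 5 and the random weights are all assumptions for illustration; a real text branch would use a library LSTM layer with learned parameters.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, biases, vector):
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def init_params(input_size, hidden_size, rng):
    """Random weights and zero biases for the four LSTM gates (untrained)."""
    def gate():
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size + input_size)]
                   for _ in range(hidden_size)]
        return weights, [0.0] * hidden_size
    return {name: gate() for name in ("input", "forget", "output", "candidate")}


def lstm_step(x, h, c, params):
    """One time step: update cell state c and hidden state h from input x."""
    hx = h + x  # concatenate previous hidden state and current input
    i = [sigmoid(z) for z in affine(*params["input"], hx)]
    f = [sigmoid(z) for z in affine(*params["forget"], hx)]
    o = [sigmoid(z) for z in affine(*params["output"], hx)]
    g = [math.tanh(z) for z in affine(*params["candidate"], hx)]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def encode_text(token_ids, embeddings, params, hidden_size):
    """Feed tokens in order; the final hidden state summarizes the sequence."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for token_id in token_ids:
        h, c = lstm_step(embeddings[token_id], h, c, params)
    return h


rng = random.Random(0)
vocab = {"<unk>": 0, "red": 1, "bicycle": 2, "sale": 3}  # hypothetical vocabulary
embedding_size, hidden_size = 3, 5
embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(embedding_size)] for _ in vocab]
params = init_params(embedding_size, hidden_size, rng)

# Words missing from the vocabulary fall back to the "<unk>" index.
token_ids = [vocab.get(token, 0) for token in "red bicycle for sale".split()]
title_vector = encode_text(token_ids, embeddings, params, hidden_size)
```

Because the hidden and cell states are carried from one token to the next, the final vector depends on word order, which a dense network over a bag of words would not capture.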
Now we have seven networks, each handling the behaviour of its own input type in a more focused way. To combine them, we take the output of each network and connect all of them to one dense layer. The complete network is then trained with a binary cross-entropy loss function.
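The fusion step can be sketched as follows. The branch names, their output values, the fixed weights and the target of 1.0 are made up for illustration, and the final dense layer is reduced to a single sigmoid unit to keep the example short; this is a sketch of the idea, not the exact model used in the competition.

```python
import math


def concatenate(branch_outputs):
    """Join the output vectors of all branches into one flat vector."""
    return [value for branch in branch_outputs for value in branch]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(target, prediction, eps=1e-7):
    """Loss for one example; the prediction is clipped to avoid log(0)."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Hypothetical outputs of the individual branches for one ad.
branch_outputs = {
    "numeric": [0.2, 0.7],
    "one_hot_categoricals": [0.1, 0.4],
    "category_embedding": [0.3, 0.0],
    "image_category_embedding": [0.5, 0.2],
    "title_lstm": [0.1, -0.3],
    "description_lstm": [0.4, 0.6],
    "image_cnn": [0.0, 0.9],
}

merged = concatenate(branch_outputs.values())

# Final dense layer, shown as one sigmoid unit with fixed illustrative weights.
weights = [0.1] * len(merged)
bias = 0.0
prediction = sigmoid(sum(w * v for w, v in zip(weights, merged)) + bias)

# Compare the prediction with a made-up target and compute the loss.
loss = binary_cross_entropy(1.0, prediction)
```

Since the loss is computed on the output of the joined network, its gradient flows back through the final layer into every branch, so all branches are trained together rather than one at a time.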
The major drawback of an all-inclusive architecture like the one described above is its high level of complexity. Although the idea of handling each type of data differently may sound attractive to a data scientist, deploying such a model in production brings its own challenges. Having said that, as with any other neural network architecture, more work is needed in this new area of data science before the idea can be easily implemented in the real world.