Sparse codes are an effective representation of data: they provide high representational capacity without compromising fault tolerance or generalization. The advantages of sparse codes over both dense and entirely local coding schemes have been discussed extensively in the computational neuroscience literature [1, 2], which contains ample evidence for the presence of sparse codes in cortical computations. Sparse representations are also generated by many unsupervised learning methods.
In this work, Dr Kishore Konda, along with Vivek Bakaraju, Data Scientist at INSOFE, presents a more general understanding of the relationship between the data distribution and the sparsity of the learned representations. We hypothesize that when the data is distributed along a non-linear or discontinuous manifold, one way to represent it efficiently is to tile the input space and let non-overlapping subsets of neurons represent the tiles.
Sparsity in Neural Networks
Sparse representations can be observed in the hidden layers of multi-layer perceptrons (MLPs) as well as regularized autoencoder architectures. We present a hypothesis for the emergence of sparsity in hidden layers and show that it is related to the non-linear or discontinuous nature of the input manifold and to the number of neurons in the hidden layer.
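The exact sparsity metric used in the experiments is not specified here; one simple and common choice, shown below as an assumption, is the mean fraction of hidden units whose activation is (near) zero per sample:

```python
import numpy as np

def population_sparsity(H, eps=1e-6):
    """Mean fraction of (near-)zero hidden activations per sample.

    This is one plausible way to quantify hidden-layer sparsity; the
    metric actually used in the experiments may differ.
    H: (n_samples, n_hidden) matrix of hidden activations.
    """
    return float((np.abs(H) < eps).mean())
```

A value near 1 means most units are silent for any given sample; a value near 0 means the code is dense.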
To validate our hypothesis we created a toy dataset in which every sample is a 3D point on a unit sphere. The sphere is divided into N+1 latitudinal (horizontal) cross-sections and M longitudinal (vertical) sectors, partitioning its surface into M*(N+1) regions. Each partition is randomly assigned one of C classes, ensuring the data is distributed along a discontinuous manifold. For each experiment, 5000 such points belonging to 10 classes (C = 10) are generated. A neural network with a single hidden layer was trained on the dataset with a varying number of hidden units. The sparsity of the hidden representations from multiple experiments is presented in the figure below. It can be observed that the sparsity of the network increases with the number of input clusters to be mapped.
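A dataset of this kind can be generated in a few lines. The sketch below is our reconstruction (function name and parameters are hypothetical, not from the original experiments): sample points uniformly on the unit sphere, bucket them into latitude bands and longitude sectors, and assign each partition a random class label.

```python
import numpy as np

def make_sphere_dataset(n_samples=5000, n_bands=4, n_sectors=8,
                        n_classes=10, seed=0):
    """Toy dataset: labeled points on the unit sphere.

    The sphere's surface is split into n_bands * n_sectors partitions,
    and each partition is randomly assigned one of n_classes labels,
    so the class regions form a discontinuous manifold.
    """
    rng = np.random.default_rng(seed)
    # Normalizing Gaussian samples gives points uniform on the sphere.
    xyz = rng.normal(size=(n_samples, 3))
    xyz /= np.linalg.norm(xyz, axis=1, keepdims=True)
    # Band index from the polar angle, sector index from the azimuth.
    theta = np.arccos(np.clip(xyz[:, 2], -1.0, 1.0))   # [0, pi]
    phi = np.arctan2(xyz[:, 1], xyz[:, 0]) + np.pi     # [0, 2*pi]
    band = np.minimum((theta / np.pi * n_bands).astype(int), n_bands - 1)
    sector = np.minimum((phi / (2 * np.pi) * n_sectors).astype(int),
                        n_sectors - 1)
    # Random class per partition -> discontinuous class regions.
    partition_class = rng.integers(0, n_classes, size=(n_bands, n_sectors))
    labels = partition_class[band, sector]
    return xyz, labels
```

Neighboring partitions generally carry different labels, which is what forces the classifier to tile the input space rather than learn a smooth decision boundary.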
New Regularization Method
We propose a new regularization term named One-Vs-Rest (OVR) loss, based on the observation that sparsity can be achieved by forcing a neural network model to minimize the overlap between the hidden representations of random samples from the dataset. Let H be the matrix whose rows are the hidden-layer representations of the samples in a randomly sampled mini-batch. The OVR-loss term is then defined as,
Minimizing the OVR-loss reduces the overlap between the representations of samples within the mini-batch that come from different input clusters or regions. In the case where all samples in a mini-batch come from the same cluster or region (highly unlikely when the batch is randomly sampled), OVR-loss should not be used, as it would force different representations onto samples from a single local region of the input space, much like the original data representation. OVR-loss can be employed as a regularization term in both autoencoder and MLP architectures simply by adding it to the model's objective function, weighted by a hyper-parameter λ.
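As a concrete sketch, assuming the OVR-loss takes the form of the average pairwise overlap (dot product) between the hidden representations of distinct samples in the mini-batch (this formulation is our reconstruction, since the equation itself appears in the paper), it could be computed as:

```python
import numpy as np

def ovr_loss(H):
    """Average pairwise overlap between distinct rows of H.

    One plausible formulation of OVR-loss (an assumption, not the
    paper's exact equation).
    H: (batch_size, n_hidden) matrix of hidden activations.
    """
    G = H @ H.T                          # Gram matrix of pairwise overlaps
    off_diag = G - np.diag(np.diag(G))   # drop each sample's self-overlap
    b = H.shape[0]
    return off_diag.sum() / (b * (b - 1))
```

As regularization, the term would simply be added to the task loss, e.g. `total_loss = task_loss + lam * ovr_loss(H)`, with `lam` playing the role of λ.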
From the figure above, it can be observed that increasing λ (the regularization strength for OVR-loss) up to a certain level increases both the sparsity of the representations and the classification accuracy. Increasing λ beyond 10<sup>-4</sup> still increases the sparsity of the representations, but at the cost of classification accuracy.
We also ran supervised classification experiments on the CIFAR-10 dataset using single-hidden-layer MLP models with OVR-loss as the regularizer; the results are presented in the figure above. The plots show that the sparsity of the hidden-layer representation increases with the regularization strength, and that with OVR-loss as the regularizer we achieve performance on par with a network using Dropout.
A Single Layered Encoder Model
We propose a new single-layer network, the “OVR-Encoder”, which learns representations using OVR-loss as its cost function. With OVR-loss alone as the cost, one trivial solution is for all hidden activations to go to zero, which drives the cost to zero. To prevent this, we add another term L (Equation 2) to the cost function that encourages non-zero activations in the network.
The final cost function J is thus a weighted sum of the OVR-loss and L from Equation 2. The OVR-loss-based update rule for the weight vector w<sub>k</sub>, corresponding to the k<sup>th</sup> neuron in the hidden layer, is,
where N is the batch of input samples x, and h<sub>ki</sub> is the response of neuron k to input x<sub>i</sub> from the batch. Interestingly, while single-layer models like online k-means update their weights/centroids in the direction of the input, our model pushes the weights, as shown in Equation 4, away from the weighted mean of the rest of the samples in the batch. We intend to study this aspect of our model further in future work.
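Since Equation 4 is not reproduced here, the following NumPy sketch shows one way the described update could look, under the assumption of linear activations (h<sub>ki</sub> = w<sub>k</sub> · x<sub>i</sub>); the learning rate and the gradient form are our reconstruction. Each weight vector is pushed away from the batch inputs, with each input weighted by the neuron's summed response to the *rest* of the batch:

```python
import numpy as np

def ovr_update(W, X, lr=0.01):
    """Hebbian-like sketch of an OVR-style weight update (assumed form).

    W: (n_neurons, dim) weight matrix, one row per hidden neuron.
    X: (batch, dim) mini-batch of inputs.
    With linear activations, the gradient of the pairwise-overlap loss
    w.r.t. w_k is sum_i (sum_{j != i} h_kj) * x_i, so gradient descent
    moves w_k away from each input in proportion to the neuron's
    response to the rest of the batch.
    """
    H = X @ W.T                               # (batch, n_neurons) responses
    rest = H.sum(axis=0, keepdims=True) - H   # sum_{j != i} h_kj
    grad = rest.T @ X                         # (n_neurons, dim)
    return W - lr * grad
```

Note that the rule is local: updating w<sub>k</sub> needs only neuron k's own responses and the batch inputs, with no error back-propagated from other layers.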
To evaluate the effectiveness of the representations learned by the OVR-Encoder, we use the same dataset and experimental pipeline described in the section above. We trained an OVR-Encoder with 8192 neurons to learn representations of PCA-reduced CIFAR-10 data. During training, a batch size of 128 and the Adam optimizer with an initial learning rate of 0.001 were used. The regularization strength λ was varied between 10<sup>-5</sup> and 5×10<sup>-4</sup>. The performance of a logistic regression model on hidden representations from the best OVR-Encoder is presented in the table below, along with the performance of logistic regression models trained on representations from other models. The OVR-Encoder performs poorly when the training batch size is too small, since OVR-loss operates by computing the overlap across hidden representations of multiple input samples.
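The evaluation pipeline described above can be sketched as follows. Here `encode` is a stand-in for the trained OVR-Encoder's forward pass (a hypothetical callable, not the actual model), and the scikit-learn components mirror the PCA reduction and logistic-regression probe from the experiments:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def evaluate_representations(encode, X_train, y_train, X_test, y_test,
                             n_components=64):
    """PCA-reduce the raw data, encode it with a trained model, then
    score a logistic-regression classifier on the encoded features.

    `encode` maps PCA-reduced inputs to hidden representations; it is a
    placeholder for the trained OVR-Encoder.
    """
    pca = PCA(n_components=n_components).fit(X_train)
    Z_train = encode(pca.transform(X_train))
    Z_test = encode(pca.transform(X_test))
    clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
    return clf.score(Z_test, y_test)
```

Probing with a linear classifier like this is a standard way to compare representation quality across models while holding the downstream learner fixed.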
Many approaches have been proposed for learning sparse representations, and their importance in neural networks is well established. In this article we presented a rationale for the emergence of sparsity and its relation to the input data distribution: the more discontinuous the input manifold, the sparser the representations. By employing OVR-loss for regularization in autoencoders and MLPs, we achieved encouraging results that support our hypothesis on sparsity. We showed that efficient representations of data can be learned by minimizing OVR-loss alone, and proposed a single-layer, encoder-only model (the OVR-Encoder) that uses Hebbian-like local learning rules for training and does not require error back-propagation. In future work, we intend to understand the novel learning approach of the OVR-Encoder better and explore its usability for continual learning problems. Understanding sparsity in multi-layer models and using OVR-loss to train multi-layer networks will also be part of our future steps.
- [1] Peter Földiák and Malcolm P. Young. Sparse coding in the primate cortex. The Handbook of Brain Theory and Neural Networks, 1:1064–1068, 1995.
- [2] Philippe Tigreat. Sparsity, Redundancy and Robustness in Artificial Neural Networks for Learning and Memory. PhD thesis, École nationale supérieure Mines-Télécom Atlantique, 2017.