
Regularizing CNNs with Locally Constrained Decorrelations (ICLR 2017)

TL;DR We propose to locally decorrelate the feature weights of CNNs. When the proposed method, which we call OrthoReg, is used to regularize the 40 layers of a Wide Residual Network, we obtain state-of-the-art results on CIFAR and SVHN. Here is an example of the effects of our local regularizer:

[Interactive toy: 2D feature weights under the baseline decorrelation regularizer vs. ours, with controls for dataset size, step size, lambda, max iterations, and delta stop; the plots report the average nearest-neighbor distance.]

I encourage the reader to continue reading for a better understanding of this example.

Deep learning models are becoming bigger and deeper, beating the state of the art in numerous tasks. But how is it possible to grow these neural net architectures while preventing them from completely memorizing the training data (overfitting), so that they still perform well on unseen examples? Although we need to rethink generalization, the current answer to this question involves multiple factors, such as bigger datasets, clever architectures, and good regularization methods. Our work is centered on the latter. Concretely, we focus on regularization methods that prevent the co-adaptation of feature detectors, thus reducing redundancy in the models and increasing their generalization power.

Probably the most popular way to prevent feature co-adaptation is Dropout. However, dropping techniques such as Dropout or DropConnect reduce the capacity of the models, which then need to be bigger to compensate. A less common method for preventing co-adaptation is to add a regularization term to the loss function of the model so as to penalize the correlation between neurons. This idea has been explored several times for supervised learning on neural networks, first under the name of incoherent training, and later as DeCov (Cogswell et al., ICLR 2016).

Although the decorrelation methods presented above proved to perform well, they are still far from being commonly used in state-of-the-art models. In part, this is because they were applied to rather shallow models, and it is not clear that the computational overhead these regularizers introduce in state-of-the-art models is compensated by the reduction in overfitting. Our work aims to address these concerns by:

  • reducing the computational cost by regularizing the network weights,
  • increasing the performance margins by imposing locality constraints,
  • successfully regularizing all the layers of a state-of-the-art residual network.

Why impose locality when decorrelating feature detectors? Since a toy is more amusing than a thousand words, please play with the example I provide at the beginning of the post (if you have not already tried it ;) ). Note that although it is a toy example with nice-to-plot 2D features, similar behavior was found in actual CNN features, especially in bottleneck layers, where the number of filters matches their dimensionality. The intuition is that regularizing negatively correlated feature detectors is counter-productive. Namely, in the left plot, each dot (feature detector) “tries to be different” from all the other dots, even those in the opposite quadrant. We propose to make each dot sensitive only to its nearest neighbors, thus increasing the average minimum angle (and decreasing the linear correlation) between the feature detectors. We achieve this locality by means of a squashing function:

\[C(\theta) = \sum_{i=1}^{n}\sum_{j=1,j\ne i}^{n} \log\left(1 + e^{\lambda (\cos(\theta_i,\theta_j)-1)}\right) = \sum_{i=1}^{n}\sum_{j=1,j\ne i}^{n} \log\left(1 + e^{\lambda (\langle \theta_i,\theta_j\rangle -1)}\right), \quad ||\theta_i||=||\theta_j||=1,\]

where $\lambda$ is a coefficient that controls the angle-of-influence of the regularizer, i.e. the angle beyond which there is no longer a gradient pushing two feature weights apart. Again, a toy is worth a thousand words:

[Interactive plot: $\cos^2(x)$ compared with the squashing function $\log(1 + e^{\lambda (x-1)})$ for an adjustable $\lambda$ (shown here with $\lambda=10$).]
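
If you prefer numbers to sliders, the following minimal NumPy sketch (my own illustration, not part of the paper's code) evaluates the squashing function and its derivative with respect to the cosine at a few angles:

```python
import numpy as np

LAM = 10.0  # lambda, the locality coefficient

def squash(x, lam=LAM):
    """log(1 + e^{lambda (x - 1)}), with x = cos(theta_i, theta_j) <= 1."""
    return np.log1p(np.exp(lam * (x - 1.0)))

def squash_grad(x, lam=LAM):
    """d/dx log(1 + e^{lambda (x - 1)}) = lambda * sigmoid(lambda (x - 1))."""
    return lam / (1.0 + np.exp(-lam * (x - 1.0)))

for angle in (0, 30, 60, 90, 120, 180):
    cosine = np.cos(np.deg2rad(angle))
    print(f"{angle:3d} deg -> cost {squash(cosine):.4f}, gradient {squash_grad(cosine):.1e}")
# With lambda = 10, the gradient is about 5.0 at 0 deg, 7e-2 at 60 deg,
# 5e-4 at 90 deg, and 2e-8 at 180 deg: beyond orthogonality it is negligible.
```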

As can be seen, when $\lambda \in [7,10]$ the gradient of the squashing function is approximately zero for feature weights that are more than $\pi / 2$ (90°) apart. In other words, the regularizer pushes feature weights towards orthogonality, but no further. As a result, the linear correlation between feature detectors decreases, which translates into better generalization, as shown in the following figure:

Note that OrthoReg is able to reduce overfitting even in the presence of Dropout and Batch Normalization. A possible hypothesis is that different regularizers act on different aspects of the network, which shows how much there is still to explore before we can reach Deep Learning’s full potential.
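
For a concrete picture of how OrthoReg fits into a standard training loop, here is a minimal PyTorch-style sketch. It is written for this post rather than taken from the released code, and for simplicity it adds the decorrelation cost of every convolutional layer to the task loss; the `gamma` coefficient and the helper names are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def orthoreg_cost(weight: torch.Tensor, lam: float = 10.0) -> torch.Tensor:
    """Locally constrained decorrelation cost for one layer:
    sum over pairs i != j of log(1 + e^{lam (<w_i, w_j> - 1)}),
    with the filters flattened and normalized to unit norm."""
    w = F.normalize(weight.reshape(weight.size(0), -1), dim=1)  # (n_filters, fan_in)
    gram = w @ w.t()                                            # pairwise cosines
    off_diag = 1.0 - torch.eye(gram.size(0), device=gram.device)
    return (F.softplus(lam * (gram - 1.0)) * off_diag).sum()    # softplus = squashing function

def regularized_loss(model: nn.Module, task_loss: torch.Tensor,
                     lam: float = 10.0, gamma: float = 1e-4) -> torch.Tensor:
    """Add the decorrelation cost of every Conv2d layer to the task loss."""
    reg = sum(orthoreg_cost(m.weight, lam)
              for m in model.modules() if isinstance(m, nn.Conv2d))
    return task_loss + gamma * reg
```

Because the cost only involves the weights, and not the feature activations as in DeCov, its overhead does not grow with the batch size, which is what the first bullet point above refers to.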

I am delighted that you, the reader, have arrived at this line, for it means I have managed to capture your attention and you have decided to spend some moments of your precious time giving meaning to this work by reading it (and I hope you played a little too).

We provide the code on GitHub.

Acknowledgements

Thanks to @pepgonfaus, @gcucurull, and Jordi Gonzàlez for their comments and suggestions.