Learning to Learn by Gradient Descent with Rebalancing

Neural networks, as the name implies, comprise many little neurons, often arranged in multiple layers. How many? A quick search for “number of layers” or “number of neurons in a layer” leaves a strong impression that there are no good answers.

That first impression is right. There are plenty of recipes on the web, and the most popular, often-repeated rules of thumb boil down to “keep adding layers until you start to overfit” (Hinton) or “until the test error does not improve anymore” (Bengio).

Part I of this post stipulates that selecting the optimal neural network architecture is, or rather can be, a search problem. There are techniques for massive searches. Training a neural network (NN) can be counted as one such technique, where the search target belongs to the function space defined by both the given environment and the given NN architecture. The latter includes a fixed number of layers and a fixed number of neurons per layer. The question then is: would it be possible to use a neural network to search for the optimal NN architecture? To search for it in the entire NN domain, defined solely by the given environment?

Ignoratio elenchi

Aggregating multiple neural networks into one (super) architecture comes with a number of tradeoffs and risks, including one called ignoratio elenchi: missing the point. Indeed, a super net (Figure 1) would likely have its own neurons, layers (including hidden ones), and activation functions. (Even a very cursory acquaintance with neural networks would allow one to draw this conclusion.)

This means that training the super net would inevitably train its own internal “wiring” instead of, or maybe at the expense of, helping to select the best NN, for instance one of the 9 shown in Figure 1. And that would be missing the point, big time.

Fig. 1. Super network that combines 9 neural nets to generate 4 (green) outputs

The primary goal remains: not to train the super network per se but rather to use it to search the vast NN domain for an optimal architecture. This text describes one way to circumvent the aforementioned ignoratio elenchi.

I call it a Weighted Mixer (Figure 2):

Fig. 2. Weighted sum of NN outputs

Essentially, a weighted mixer, or WM, is a weighted sum of the outputs of the contained neural nets, with a couple of important distinctions…
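To make the "weighted sum" part concrete, here is a minimal sketch in Python with NumPy. It covers only the plain weighted sum and does not model the distinctions mentioned above. The names weighted_mix, outputs and weights are illustrative, not taken from the supporting code, and the sketch assumes one weight per output per network.

```python
import numpy as np

def weighted_mix(outputs, weights):
    """Combine the outputs of several networks into one output vector.

    outputs: shape (n_nets, n_outputs), one row per constituent network
    weights: shape (n_nets, n_outputs), one weight per output per network
             (an assumption made for this sketch)
    """
    outputs = np.asarray(outputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Element-wise weighting, then a sum across the networks (axis 0)
    return np.sum(weights * outputs, axis=0)

# Hypothetical numbers: 3 networks, each producing 4 outputs
outputs = np.array([[1.0, 2.0, 3.0, 4.0],
                    [2.0, 2.0, 2.0, 2.0],
                    [0.0, 1.0, 0.0, 1.0]])
weights = np.array([[0.50] * 4,
                    [0.25] * 4,
                    [0.25] * 4])

mixed = weighted_mix(outputs, weights)
print(mixed)
```

A network whose weights grow relative to the others contributes more to the mixed output, which is what makes the weights usable as a grade.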

TL;DR – WM can automatically grade multiple network architectures (a link to the white paper and the supporting code follows below):

Animated Weights

One picture is worth a thousand words. It shows a set of NN architectures, with hidden-layer sizes ranging from 16 to 56 and the number of hidden layers ranging from 1 to 3. The column charts (Figure 8) depict the running weights that the WM assigns to each of the 16 outputs of each constituent net that it drives through training:

Fig. 8. Weight updates in progress

The winner here happens to be the (56, 2) network, with the (48, 2) NN coming a close second. This result, and the fact that the 56-neuron, 2-layer architecture indeed converges faster and produces a better MSE, can be separately confirmed by running each of the 18 NNs in isolation…
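For reference, the 18 candidates are consistent with a simple grid of (hidden-layer size, number of hidden layers) pairs. The sketch below enumerates such a grid. The step of 8 between sizes is an assumption: the text gives only the range 16 to 56 and the total of 18 networks. The variable names are illustrative.

```python
from itertools import product

# Assumption: hidden-layer sizes go 16, 24, ..., 56 (a step of 8).
# The text states only the range and the total count of 18.
hidden_sizes = range(16, 57, 8)
num_hidden_layers = range(1, 4)  # 1, 2 or 3 hidden layers

# Each architecture is a (size, layers) pair, e.g. (56, 2)
architectures = list(product(hidden_sizes, num_hidden_layers))

for size, layers in architectures:
    print(f"({size}, {layers})")
```

Under this assumption the grid has 6 sizes times 3 depths, and it includes both the (56, 2) and (48, 2) networks named above.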