Authors: Andreas Veit, Serge Belongie
Year: 2018
For any given image, the majority of the features a CNN computes are high-level features that are not relevant to that image.
Why do convolutional networks have a fixed feed-forward structure?
Instead of passing an image through hundreds of layers, why couldn't it be directed first through the initial layers (general high-level features) and then only through the layers specialized in the fine-grained differences of its class?
The paper's premise: the more a network already knows about an image, the better it should be at deciding which layer to compute next. The proposed architecture, ConvNet-AIG, therefore decides on the fly, for each input image, which layers are needed. What if, after identifying that an image contains a bird, a ConvNet could move directly to a layer that can distinguish different bird species, without executing intermediate layers that specialize in unrelated aspects?
Proposed solution: attach a gating mechanism to each residual layer that makes a discrete decision on whether the image should pass through that layer or skip it. The idea is supported by a study showing that even after a few residual blocks are removed from a fully trained model, its performance does not drop much.
The architecture follows the basic structure of a ResNet with the key difference that instead of executing all layers, the network determines for each input image which subset of layers to execute. In particular, with layers focusing on different subgroups of categories, it can select only those layers necessary for the specific input.
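The gated forward pass described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `gated_residual_forward`, `residual_fns` and `gate_fns` are hypothetical, features are plain lists of floats, and each gate is assumed to return 1 (execute) or 0 (skip).

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def gated_residual_forward(
    x: Vector,
    residual_fns: Sequence[Callable[[Vector], Vector]],
    gate_fns: Sequence[Callable[[Vector], int]],
) -> Tuple[Vector, List[int]]:
    """Run a stack of gated residual layers on one input.

    Assumption (hypothetical interface): gate_fns[l] looks at the current
    features and returns 1 to execute layer l or 0 to skip it.
    """
    decisions: List[int] = []
    for residual_fn, gate_fn in zip(residual_fns, gate_fns):
        z = gate_fn(x)  # discrete per-image decision for this layer
        decisions.append(z)
        if z == 1:
            # Executed layer: add the residual branch to the identity path.
            fx = residual_fn(x)
            x = [a + b for a, b in zip(x, fx)]
        # Skipped layer: x flows through the identity shortcut unchanged,
        # and the residual branch is never computed.
    return x, decisions


# Toy usage: the first layer is executed, the second is skipped.
layers = [lambda v: [1.0 for _ in v], lambda v: [10.0 for _ in v]]
gates = [lambda v: 1, lambda v: 0]
out, used = gated_residual_forward([0.0, 0.0], layers, gates)
```

Because a skipped layer's residual branch is never evaluated, the saving is in actual computation, not just in masking the output.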
The gate can be split into 2 parts
Problems with these gates -
*For the math and the Gumbel sampling, have a look at the paper.
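For a rough feel of what Gumbel sampling does in a gate, the generic Gumbel-max trick for a two-way execute/skip choice might look like the sketch below. It is not taken from the paper: the function names and the two-logit interface (`logit_skip`, `logit_execute`) are assumptions, and it covers only the forward sampling, not how gradients are passed through the discrete choice during training.

```python
import math
import random


def sample_gumbel(rng: random.Random) -> float:
    """Draw one sample from a standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_max_gate(logit_skip: float, logit_execute: float,
                    rng: random.Random) -> int:
    """Return 1 (execute) or 0 (skip) using the Gumbel-max trick.

    Adding independent Gumbel noise to each logit and taking the argmax
    samples from the softmax distribution over the two options.
    """
    noisy_skip = logit_skip + sample_gumbel(rng)
    noisy_execute = logit_execute + sample_gumbel(rng)
    return 1 if noisy_execute > noisy_skip else 0


# Toy usage: equal logits give roughly a 50/50 split between execute and skip.
rng = random.Random(0)
rate = sum(gumbel_max_gate(0.0, 0.0, rng) for _ in range(2000)) / 2000
```

The noise keeps the decision stochastic, so a layer with a slightly lower score still gets executed some of the time instead of being switched off permanently.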
They also introduce a target rate loss, which pushes the network to execute a target fraction t of all layers. This prevents the degenerate scenarios where the model uses all of the layers or only very few of them.
Let z'_l denote the fraction of images within a mini-batch for which layer l is executed.
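A small sketch of how such a loss could be computed from the gate decisions of one mini-batch. It assumes a squared-error penalty between each layer's execution fraction and the target rate, summed over layers, with t written as a fraction in [0, 1]; check the paper for the exact form. The names `execution_rates` and `target_rate_loss` are hypothetical.

```python
from typing import List


def execution_rates(gate_decisions: List[List[int]]) -> List[float]:
    """Per-layer fraction of images in the mini-batch that executed the layer.

    gate_decisions[i][l] is 1 if image i executed layer l, else 0.
    """
    batch_size = len(gate_decisions)
    num_layers = len(gate_decisions[0])
    return [
        sum(row[l] for row in gate_decisions) / batch_size
        for l in range(num_layers)
    ]


def target_rate_loss(gate_decisions: List[List[int]], target_rate: float) -> float:
    """Assumed form: sum over layers of (execution fraction - target rate)^2."""
    rates = execution_rates(gate_decisions)
    return sum((rate - target_rate) ** 2 for rate in rates)


# Toy mini-batch of 4 images and 2 layers:
# layer 0 is always executed, layer 1 is executed for half of the images.
decisions = [[1, 0], [1, 1], [1, 0], [1, 1]]
loss = target_rate_loss(decisions, target_rate=0.5)  # (1.0 - 0.5)^2 + 0 = 0.25
```

Note that the penalty is applied to batch-level fractions, so an individual image remains free to use more or fewer layers than the target.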
ConvNet-AIG 110 clearly outperforms ResNet 110 while only using a subset of 82% of the layers.
This image shows that even within the same category, an image that is difficult to classify uses many more layers than an easy image.
It can be seen that birds (and living animals in general) need one fewer residual block than man-made objects, which implies less computation time.