Lottery Ticket Hypothesis
Neural network pruning removes unnecessary weights from a trained network. But currently the sparse networks that pruning produces cannot be trained from scratch.
Question: Why aren't pruned networks trained from scratch?
The Hypothesis
Lottery Ticket Hypothesis: Dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that, when trained in isolation, reach test accuracy comparable to the original network in a similar number of iterations.
Algorithm 1: Pruning
- Randomly initialize a neural network $f(x; \theta_{0})$, where $\theta_{0} \sim D_{\theta}$.
- Train the network for $j$ iterations, arriving at parameters $\theta_{j}$.
- Prune $p\%$ of the parameters in $\theta_{j}$ (through some pruning algorithm), creating a mask $m$.
- Reset the remaining weights to their values in $\theta_{0}$, creating the winning ticket $f(x;\ m \odot \theta_{0})$.
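The steps of Algorithm 1 can be sketched in NumPy. This is a minimal sketch rather than the paper's implementation: it assumes the pruning algorithm is magnitude pruning (drop the smallest-magnitude weights), and the names `magnitude_mask`, `find_ticket`, and `train_fn` are hypothetical. `train_fn` is a stand-in for $j$ iterations of real training.

```python
import numpy as np

def magnitude_mask(theta, p):
    """Return a 0/1 mask that drops the fraction p of smallest-|weight| entries."""
    k = int(round(p * theta.size))              # number of weights to prune
    mask = np.ones(theta.size)
    if k > 0:
        # flat indices of the k smallest magnitudes
        drop = np.argsort(np.abs(theta).ravel(), kind="stable")[:k]
        mask[drop] = 0.0
    return mask.reshape(theta.shape)

def find_ticket(theta0, train_fn, p):
    """Train, prune a fraction p of the weights, reset the survivors to theta0."""
    theta_j = train_fn(theta0)                  # train for j iterations
    m = magnitude_mask(theta_j, p)              # prune p of the trained weights
    return m, m * theta0                        # reset survivors to their initial values

rng = np.random.default_rng(0)
theta0 = rng.normal(size=(4, 5))                # random initialization
# Assumption: this lambda only perturbs the weights; it is not real training.
train_fn = lambda th: th + 0.1 * rng.normal(size=th.shape)
m, ticket = find_ticket(theta0, train_fn, p=0.2)
```

Here `ticket` holds $m \odot \theta_{0}$: the surviving weights carry their original initial values, not their trained ones.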
data:image/s3,"s3://crabby-images/7a73f/7a73fa6b49c88da38fb6eb9b3260e8cc4808967b" alt="Lottery%20Ticket%20Hypothesis%2072c1972c2b9547a1be108c45c99d9774/Screenshot_2020-06-21_at_5.34.14_PM.png"
Figure 1: The legend gives the sparsity mask, defined as 100% minus the percentage of weights pruned, so a 100% sparsity mask corresponds to 0% of weights pruned. The pruned network with a 21% sparsity mask performs best.
data:image/s3,"s3://crabby-images/98a19/98a19e0ec3711288dcc8f8e9d2a5faa986918b5e" alt="Lottery%20Ticket%20Hypothesis%2072c1972c2b9547a1be108c45c99d9774/Screenshot_2020-06-21_at_5.53.44_PM.png"
Figure 2: The early-stopping iteration increases with higher sparsity.
Algorithm 2: Iterative Pruning
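- Run one round of Algorithm 1 (train, prune, reset), pruning 20% of the weights that are still unpruned.
- Repeat, starting each round from the previous round's winning ticket.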
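A rough sketch of this loop, assuming magnitude pruning and a 20% rate applied to the weights that survive each round. The function names and `train_fn` are hypothetical, and `train_fn` is only a placeholder for real training. After $n$ rounds a fraction $0.8^{n}$ of the weights remains, so seven rounds leave about 21%.

```python
import numpy as np

def prune_surviving(theta, mask, rate):
    """Mask out the `rate` fraction of smallest-magnitude weights among the survivors."""
    alive = np.flatnonzero(mask)                        # flat indices of surviving weights
    k = int(round(rate * alive.size))                   # survivors to prune this round
    order = np.argsort(np.abs(theta.ravel()[alive]), kind="stable")
    new_mask = mask.copy().ravel()
    new_mask[alive[order[:k]]] = 0.0
    return new_mask.reshape(mask.shape)

def iterative_pruning(theta0, train_fn, rate=0.2, rounds=7):
    mask = np.ones_like(theta0)
    for _ in range(rounds):
        # Train the current ticket from its original initialization;
        # pruned weights are held at zero.
        theta_j = mask * train_fn(mask * theta0)
        mask = prune_surviving(theta_j, mask, rate)
    return mask, mask * theta0

rng = np.random.default_rng(0)
theta0 = rng.normal(size=(100, 100))
# Assumption: stand-in for training that only perturbs the weights.
train_fn = lambda th: th + 0.1 * rng.normal(size=th.shape)
mask, ticket = iterative_pruning(theta0, train_fn)
remaining = mask.mean()                                 # fraction of weights left
```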
Algorithm 3: One Shot Pruning
- Essentially Algorithm 1: the full $p\%$ is pruned in a single round.
data:image/s3,"s3://crabby-images/70017/70017bc53c43e4d7efcd37ea3c198d78f08e1815" alt="Lottery%20Ticket%20Hypothesis%2072c1972c2b9547a1be108c45c99d9774/Screenshot_2020-06-21_at_5.51.36_PM.png"
Figure 3: One shot pruned tickets are also winning tickets.
Convolutional Neural Networks
What's the difference?
- Weights are shared across spatial positions, so each weight contributes to many outputs. Hence the computation is very sensitive to pruning.
data:image/s3,"s3://crabby-images/b4ef2/b4ef204dd5dc064af8e063d23ae4bcda26fdfc2d" alt="Lottery%20Ticket%20Hypothesis%2072c1972c2b9547a1be108c45c99d9774/Screenshot_2020-06-21_at_5.56.32_PM.png"
Figure 4: CNNs also have winning tickets!
Dropout
What's the difference?
- Connections are randomly removed during training.
data:image/s3,"s3://crabby-images/23ad9/23ad9c924a15588442e03c78ce960814ae3d3263" alt="Lottery%20Ticket%20Hypothesis%2072c1972c2b9547a1be108c45c99d9774/Screenshot_2020-06-21_at_6.00.20_PM.png"
- Training with dropout gives even better winning tickets than the usual ones (rigging the lottery).
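For reference, a minimal sketch of standard (inverted) dropout. It zeroes activations at random, which removes every outgoing connection of the dropped units for that training step. The function name, the `rate` value, and the input are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` during training."""
    if not training or rate == 0.0:
        return h                                 # no-op at evaluation time
    keep = rng.random(h.shape) >= rate           # True for units that survive this step
    return h * keep / (1.0 - rate)               # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
h = np.ones((1000, 50))                          # dummy activations
out = dropout(h, rate=0.5, rng=rng)
```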
Discussion
- Do these tickets come from having the same initial distribution as the final one? No. Check Appendix F.
- Reinitializing to the same distribution is almost the same as having an inductive bias.
- Winning tickets exceed the performance of the original networks? This is a well-known kind of result: the literature on quantized networks also reports them performing better than the original overparameterised networks.
- Deeper networks require learning rate warmup.
data:image/s3,"s3://crabby-images/f5dd4/f5dd4364e33ea49c16b35ccb0091b96cf2346167" alt="Lottery%20Ticket%20Hypothesis%2072c1972c2b9547a1be108c45c99d9774/Screenshot_2020-06-21_at_6.07.31_PM.png"
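A learning rate warmup can be sketched as a linear ramp from 0 to the base learning rate over the first few iterations. The function name and the values of `base_lr` and `warmup_steps` below are hypothetical, chosen only for illustration.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate from 0 to base_lr, then hold it constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps

# Learning rate at each of the first six iterations (illustrative values).
schedule = [warmup_lr(s, base_lr=0.1, warmup_steps=4) for s in range(6)]
```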
Future work
- Structured pruning
- Non-magnitude based pruning
- Extension to Language Models