Submitted by galaxy_dweller t3_zqstlm in MachineLearning
Are you an AI researcher itching to test Hinton's Forward-Forward Algorithm? I was too, but I couldn't find any full implementation, so I decided to code it myself, from scratch. Here's the GitHub repo, and don't forget to leave a star if you enjoy the project.
As soon as I read the paper, I started to wonder how AI stands to benefit from Hinton’s FF algorithm (FF = Forward-Forward). I got particularly interested in the following concepts:
- Local training. Each layer can be trained just by comparing its outputs for the positive and negative streams (see the sketch after this list).
- No need to store the activations. Activations are needed during backpropagation to compute gradients, and storing them often results in nasty Out of Memory errors.
- Faster layer weight updates. Once the output of a layer has been computed, its weights can be updated right away, i.e. there is no need to wait for the full forward (and part of the backward) pass to complete.
- Alternative goodness metrics. Hinton's paper uses the sum of the squared outputs as the goodness metric, but I expect alternative metrics to pop up in the scientific literature over the coming months.
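To make the local-training idea concrete, here is a minimal sketch of a single Forward-Forward layer in PyTorch. The sum-of-squares goodness and the threshold come from the paper; the class name `FFLayer`, the softplus form of the loss, and the hyperparameters are my own choices for illustration, not necessarily what the repo does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One fully connected layer trained locally with the Forward-Forward rule."""

    def __init__(self, in_features, out_features, theta=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.theta = theta  # goodness threshold from the paper
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalize the input so only the direction of the previous layer's
        # activity is passed on, as suggested in the paper.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # Goodness = sum of squared activities (the metric used in the paper).
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        # Push positive goodness above theta and negative goodness below it.
        loss = (F.softplus(self.theta - g_pos) + F.softplus(g_neg - self.theta)).mean()
        self.opt.zero_grad()
        loss.backward()  # the gradient never crosses layer boundaries
        self.opt.step()
```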
Hinton's paper proposes two different Forward-Forward algorithms, which I called Base and Recurrent. Let's see why, despite the name, Base is actually the more performant of the two.
As shown in the chart, the Base FF algorithm can be much more memory efficient than classical backprop, with up to 45% memory savings for deep networks. I am still investigating why Base FF underperforms on "thin" networks; if you have any ideas, let's talk.
Unlike Base FF, Recurrent FF does not have a clear memory advantage over backprop for deep networks (15+ layers). That's by design: the recurrent network must store each layer's state at time t to compute the outputs of the layers above and below at time t+1. While scientifically relevant, Recurrent FF is clearly less performant memory-wise than Base FF.
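For reference, here is a rough sketch of the synchronous update I mean, based on the paper's description that each hidden layer at time t+1 is driven by the normalized activities of the layers below and above at time t. The function and argument names (`recurrent_ff_step`, `bottom_up`, `top_down`) are hypothetical, for illustration only:

```python
import torch.nn.functional as F

def recurrent_ff_step(x, states, bottom_up, top_down):
    """One synchronous update of all hidden layers.

    `states[l]` is the (normalized) activity of hidden layer l at time t;
    `bottom_up[l]` / `top_down[l]` are the weights feeding layer l from the
    layer below / above. All names here are hypothetical.
    """
    new_states = []
    for l, _ in enumerate(states):
        below = x if l == 0 else states[l - 1]
        pre = below @ bottom_up[l].T
        if l + 1 < len(states):  # top-down input, if there is a layer above
            pre = pre + states[l + 1] @ top_down[l].T
        h = F.relu(pre)
        # Pass on only the direction of the activity vector.
        new_states.append(h / (h.norm(dim=1, keepdim=True) + 1e-8))
    # Every entry of `states` (time t) had to stay in memory to build the
    # states at time t+1, which is the memory cost mentioned above.
    return new_states
```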
What’s next?
The most interesting question is why the Base FF model's memory consumption keeps increasing with the number of layers. That's surprising, given that the model is trained one layer at a time, i.e. each layer is treated as a mini-model and trained separately from the rest of the network. I will explore this and let you know over the coming days.
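For context, this is the kind of greedy layer-by-layer loop I have in mind (reusing the `FFLayer` sketch from earlier in the post; again an illustration, not the repo's exact code). In principle only one layer's activations should be alive at a time, which is why the growth in memory is puzzling:

```python
import torch

def train_base_ff(layers, x_pos, x_neg, steps=100):
    """Greedily train a stack of FFLayer objects, one layer at a time."""
    for layer in layers:
        for _ in range(steps):
            layer.train_step(x_pos, x_neg)  # purely local update
        # Detach the outputs so the next layer sees plain tensors and no
        # autograd graph (or stored activations) survives from this layer.
        with torch.no_grad():
            x_pos, x_neg = layer(x_pos), layer(x_neg)
    return layers
```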
Deep-Station-1746 t1_j12m6fe wrote
Looking forward to updates. :)