Submitted by galaxy_dweller t3_zqstlm in MachineLearning
Are you an AI researcher itching to test Hinton's Forward-Forward Algorithm? I was too, but I couldn't find any full implementation, so I decided to code it myself, from scratch. Here's the GitHub repo, and don't forget to leave a star if you enjoy the project.
As soon as I read the paper, I started to wonder how AI stands to benefit from Hinton’s FF algorithm (FF = Forward-Forward). I got particularly interested in the following concepts:
- Local training. Each layer can be trained just by comparing its outputs for the positive and negative streams (see the sketch after this list).
- No need to store the activations. Activations are needed during backpropagation to compute gradients, and storing them often results in nasty Out of Memory errors.
- Faster layer weight updates. Once the output of a layer has been computed, its weights can be updated right away, i.e. there is no need to wait for the full forward (and part of the backward) pass to complete.
- Alternative goodness metrics. Hinton's paper uses the sum of the squared outputs as the goodness metric, but I expect alternative metrics to pop up in the scientific literature over the coming months.
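To make the local-training idea concrete, here is a minimal sketch of a single Forward-Forward layer in PyTorch. The sum-of-squares goodness and the threshold come from the paper; the class name `FFLayer`, the softplus form of the loss, and the hyperparameters are my own choices for illustration, not necessarily what the repo does.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One fully connected layer trained locally with the Forward-Forward rule."""

    def __init__(self, in_features, out_features, theta=2.0, lr=0.03):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.theta = theta  # goodness threshold from the paper
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # Normalize the input so only the direction of the previous layer's
        # activity is passed on, as suggested in the paper.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.linear(x))

    def train_step(self, x_pos, x_neg):
        # Goodness = sum of squared activities (the metric used in the paper).
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)
        # Push positive goodness above theta and negative goodness below it.
        loss = (F.softplus(self.theta - g_pos) + F.softplus(g_neg - self.theta)).mean()
        self.opt.zero_grad()
        loss.backward()  # the gradient never crosses layer boundaries
        self.opt.step()
```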
Hinton's paper proposes two different Forward-Forward algorithms, which I called Base and Recurrent. Let's see why, despite the name, Base is actually the more performant of the two.
As shown in the chart, the Base FF algorithm can be much more memory efficient than classical backprop, with up to 45% memory savings for deep networks. I am still investigating why Base FF underperforms on "thin" networks; if you have any ideas, let's talk.
Unlike Base FF, Recurrent FF does not have a clear memory advantage over backprop for deep networks (15+ layers). That's by design: the recurrent network must store each layer's state at time t to compute the outputs of the layers above and below at time t+1. While scientifically relevant, Recurrent FF is clearly less performant memory-wise than Base FF.
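For reference, here is a rough sketch of the synchronous update I mean, based on the paper's description that each hidden layer at time t+1 is driven by the normalized activities of the layers below and above at time t. The function and argument names (`recurrent_ff_step`, `bottom_up`, `top_down`) are hypothetical, for illustration only:

```python
import torch.nn.functional as F

def recurrent_ff_step(x, states, bottom_up, top_down):
    """One synchronous update of all hidden layers.

    `states[l]` is the (normalized) activity of hidden layer l at time t;
    `bottom_up[l]` / `top_down[l]` are the weights feeding layer l from the
    layer below / above. All names here are hypothetical.
    """
    new_states = []
    for l, _ in enumerate(states):
        below = x if l == 0 else states[l - 1]
        pre = below @ bottom_up[l].T
        if l + 1 < len(states):  # top-down input, if there is a layer above
            pre = pre + states[l + 1] @ top_down[l].T
        h = F.relu(pre)
        # Pass on only the direction of the activity vector.
        new_states.append(h / (h.norm(dim=1, keepdim=True) + 1e-8))
    # Every entry of `states` (time t) had to stay in memory to build the
    # states at time t+1, which is the memory cost mentioned above.
    return new_states
```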
What’s next?
The most interesting question is why the Base FF model's memory consumption keeps increasing with the number of layers. That's surprising, given that the model is trained one layer at a time, i.e. each layer is treated as a mini-model and trained separately from the rest of the network. I will explore this and let you know over the coming days.
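For context, this is the kind of greedy layer-by-layer loop I have in mind (reusing the `FFLayer` sketch from earlier in the post; again an illustration, not the repo's exact code). In principle only one layer's activations should be alive at a time, which is why the growth in memory is puzzling:

```python
import torch

def train_base_ff(layers, x_pos, x_neg, steps=100):
    """Greedily train a stack of FFLayer objects, one layer at a time."""
    for layer in layers:
        for _ in range(steps):
            layer.train_step(x_pos, x_neg)  # purely local update
        # Detach the outputs so the next layer sees plain tensors and no
        # autograd graph (or stored activations) survives from this layer.
        with torch.no_grad():
            x_pos, x_neg = layer(x_pos), layer(x_neg)
    return layers
```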
Deep-Station-1746 t1_j12m6fe wrote
Looking forward to updates. :)