Submitted by abhitopia t3_ytbky9 in MachineLearning

Hello,

I am new to this community. I am an ML researcher and a computer scientist. I have been interested in category theory and functional programming (Haskell in particular). I am also very interested in brain-inspired computation and do not believe that current deep learning systems are the way to go.

In recent years, a few papers have suggested how predictive coding can replace backpropagation-based systems.

While initial research focussed on MLPs only, it has recently been applied to arbitrary computation graphs, including CNNs, LSTMs, etc.

As is typical of ML practitioners, I don't have a neuroscience background. However, I found this amazing tutorial to understand predictive coding and how it can be used for actual computation.

A tutorial on the free-energy framework for modelling perception and learning

To the best of my knowledge, no mainstream ML libraries (PyTorch or TensorFlow) currently support predictive coding efficiently.

As such, I am interested in building a highly parallel and extensive framework to do just that. I think a future "artificial brain" will be like a server that is never turned off and can be scaled up (horizontally or vertically) on demand. After reading up, I found that Erlang is a perfect language for that, as it natively supports distributed computing, with millions of small independent processes that communicate with each other using lightweight IPC.

Digging further, it seems that someone even wrote a 1000-page book, Handbook of Neuroevolution Through Erlang. This book was written in 2012, before the advent of deep learning, and focusses on evolutionary techniques (like genetic algorithms).

My proposal is to take these ideas and build a general-purpose, highly parallel, scalable artificial neural network library (with first-class support for online/continual learning) using Erlang. I am looking for any feedback or advice here, as well as for collaborators. So if interested, please reach out!

UPDATE [22-11-2022]: Considering using Rust and Actix library instead for performance reasons.

76

Comments

karius85 t1_iw4v3ya wrote

It seems that this aims to show that predictive coding is equivalent to backprop. Just something to note before diving into your project.

7

CireNeikual t1_iw5fuqx wrote

Predictive coding is a good place to start, but I think it's also important to embrace sparsity to permit computationally efficient fully online/incremental learning. As is, predictive coding is mostly just used as a drop-in replacement for backpropagation, without really providing too many additional advantages. Predictive coding by itself doesn't permit online learning.

5

maizeq t1_iw5mh0v wrote

I will save you a significant amount of wasted time and tell you now that predictive coding (as it has been described, more or less, for 20 years in the neuroscience literature) is not equivalent to backpropagation in the way that Millidge, Tschantz, Song and co have been suggesting for the last two years.

It is extremely disheartening to see them continue to make this claim when they are clearly using a heavily modified version of predictive coding (called FPA PC, or fixed prediction assumption PC), which is so distinct from PC that it is a significant stretch to lend it the same name.

For one, predictive coding under the FPA no longer corresponds to MAP estimation on a probabilistic model (gradient descent on the log joint probability), so it loses its interpretation as a variational Bayes algorithm (something that afaik has not been explicitly mentioned by them thus far).

Secondly, if you spend any appreciable time on predictive coding you will realise that the computational complexity of FPA PC is guaranteed to be at best equal to backpropagation (and in most cases significantly worse).

Thirdly, FPA-PC requires "inverted" PC models in order to form this connection with backpropagation. These are models where high dimensional observations (such as images), parameterise latent states - no longer rendering them generative models in the traditional sense.

FPA PC can really be understood as just a dynamic implementation of backprop (with very little actual connection to predictive coding). This implementation of backpropagation is in many ways practically inefficient and meaningless. Let me use an analogy to make this more clear: Let's say you want to assign the variable a to f(x). You could either do a = f(x). Or you could set up a to update based on da/dt = f(x) - a, the fixed/convergence point of which is a = f(x). But if you think about it, if you already have the value of f(x) (say, 25), this is just a roundabout method of assigning a.
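The analogy above can be sketched in a few lines. This is only an illustration: the function `f`, the input `x`, the step size `dt` and the number of steps are all made-up choices, and the dynamics are written in the convergent form da/dt = f(x) - a.

```python
# Relaxation sketch: reach a = f(x) by integrating da/dt = f(x) - a
# instead of assigning it directly. f, x, dt and steps are illustrative.

def f(x):
    return x * x  # any function works; squaring is just an example


def relax_to_target(target, a0=0.0, dt=0.1, steps=200):
    """Euler-integrate da/dt = target - a, starting from a0."""
    a = a0
    for _ in range(steps):
        a += dt * (target - a)
    return a


x = 5.0
direct = f(x)                    # one evaluation, one assignment
relaxed = relax_to_target(f(x))  # many small steps toward the same value
gap = abs(direct - relaxed)
print(direct, relaxed, gap)
```

Note that the relaxation still has to evaluate f(x) first, so the extra iterations buy nothing over the direct assignment; that is the point of the analogy.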

In the case of backpropagation "a" corresponds to backpropagated errors, and the dynamical update equation corresponds to the recursive equations which define backpropagation. I.e. we are assigning "a" to the value of dL/dz, for a loss L. (it's a little more than this, but I'm drunk so I'll leave that to you to discern). If you look at the equations more closely you find that it basically cannot be any more efficient than backpropagation because the error information still has to propagate backwards, albeit indirectly. I would check out this paper by Robert Rosenbaum, which I think is quite fantastic if you want more nitty-gritty details, and which deflates a lot of the connections espoused between the two works, particularly from a practical perspective.

I don't mean to be dismissive of the work of Millidge and co! Indeed, I think the original 2017 paper by Whittington and Bogacz was extremely interesting and a true nugget of insight (in terms of how PC with certain variance relationships between layers can approximate backprop etc. - something which makes complete sense when you think about it), but the flurry of subsequent work that has capitalised on this subtle relationship has been (in my honest opinion) very misleading.

Also, I would also not take any of what I've said as a dismissal of predictive coding in general. PC for generative modeling (in the brain) is extremely interesting, and may be promising still.

34

abhitopia OP t1_iw6l77r wrote

Thanks for the response.

I am yet to read the work of Millidge, Tschantz, Song in detail. I agree that this is not PC in the sense that came out of the neuroscience literature. I have only thoroughly read the Bogacz 2017 paper, and next on my list is Can the Brain Do Backpropagation? —Exact Implementation of Backpropagation in Predictive Coding Networks (also from Bogacz).

>If you look at the equations more closely you find that it basically can not be any more efficient than backpropagation

The interesting bit for me is not the exact correspondence with PC (as described in neuroscience) but rather the property that makes it suitable for asynchronous parallelisation, namely local synaptic plasticity, which I believe still holds. The problem with backprop is not that it is inefficient; in fact, it is highly efficient. I just cannot see how backprop systems can be scaled and made to do online and continual learning.

>In the case of backpropagation "a" corresponds to backpropagated errors, and the dynamical update equation corresponds to the recursive equations which defines backpropagation. I.e. we are assigning "a" to the value of dL/dz, for a loss L. (it's a little more than this, but I'm drunk so I'll leave that to you to discern). If you look at the equations more closely you find that it basically can not be any more efficient than backpropagation because the error information still has to propagate backwards, albeit indirectly.

Can't we make a first-order approximation, like we do in any gradient descent algorithm? Again, emphasising that the issue is not only speed of learning.

I will certainly check out the paper by Robert Rosenbaum, and thanks for sharing that. I will comment more once I have read this paper.

2

simonthefoxsays t1_iw6wznc wrote

The paper you link talks about the advantages of predictive coding coming from hardware architectures that colocate compute and memory in many small, somewhat independent units. Erlang will not give you that. The BEAM VM uses one scheduler thread per core, limiting its parallelism to the number of CPUs, and even in that context it is designed for concurrency (allowing many tasks to make progress on one thread), which is in tension with data locality to the processor. In contrast, modern backprop implementations may have limitations on their parallelism compared to ideal-state predictive coding, but they do rely heavily on GPUs for much greater parallelism than CPUs can allow.

Predictive coding looks very interesting, but to be useful it needs fundamentally different hardware than commodity computers today, not just a language with good parallel semantics.

2

mardabx t1_iw8j67a wrote

I am not an ML scientist by any means, but I do know enough about programming to give my 3 cents.

Erlang is very horizontally scalable and resilient, but not very performant. To scale upwards you will most likely need r/Rust, which already has some efforts put into ML and efficient horizontal scaling. If you insist on using Erlang/Elixir as a base, do note that you can use Rust to speed up performance-sensitive parts of your project.

2

abhitopia OP t1_iw8udxk wrote

Thanks Mardabx for sharing your 3 cents. :) Very helpful.

Today's ML systems lack scalability and fault tolerance, which in my mind are more critical than training speed. Remember, biological brains are not as fast either, but they are highly resilient and fault tolerant. And biological brains' learning still surpasses some of the best AI currently trained on millions of human-equivalent lifetimes. This is the direction I wanna go in, where a predictive-coding-based system runs continually and is scaled on demand, but is never stopped.

Such a system would already be better than biological brain in the sense that brain is not scalable, but there is no such restriction on computer hardware systems.

Having said that, it is really impressive how performance gains can be had by using Rust (I didn't know it was even possible) and I am definitely open to using Rust to implement core functionality as NIFs (perhaps as optimisation). Thanks again for sharing.

2

abhitopia OP t1_iw8xv1o wrote

You are right. Neuromorphic hardware would be better. Right now everything would run on top of the BEAM in Erlang, but I am hoping that we can use Rust to implement core functions as NIFs, as u/mardabx pointed out. https://discord.com/blog/using-rust-to-scale-elixir-for-11-million-concurrent-users

Having said that, I also do not think that speed is really the most critical problem to solve here. (For example, human brains are not even as fast as single BEAM threads.) Petaflops of compute are needed today because modern DL uses dense representations (unlike the brain) and needs to be retrained from scratch (it lacks continual learning). If a resilient and fault-tolerant system (say, written in Erlang/Elixir) existed which could learn continuously and was optimised (say, using sparse representations), it would eventually surpass the competition.

1

liukidar t1_iwc2tbm wrote

Hello. Since it may be relevant for the conversation, I'd like to specify that the work by Song doesn't use FPA (except here, where they mathematically prove the identity between FPA PC and BP), and all the experimental results in his other papers are obtained via "normal" PC, where the prediction is updated at every iteration using gradient descent on the log joint probability (so, as far as my understanding of the theory is correct, it corresponds to MAP estimation on a probabilistic model). I'm not 100% sure about which papers by Millidge do and don't, but I'm quite confident that the majority don't (like here, where the predictions seem to be updated at every iteration; however, in the paper cited by abhitopia, apparently, they use FPA). Unfortunately, I'm not familiar with the work by Tschantz, so I cannot comment on that.
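For readers following the FPA vs "normal" PC distinction, here is a sketch of the log joint being referred to, written for a Gaussian hierarchical model with the bottom layer clamped. The notation ($x_l$, $f_l$, $\theta_l$, $\Sigma_l$, $\varepsilon_l$, $J_l$) is an assumption of this sketch rather than the notation of any one of the cited papers.

```latex
\begin{aligned}
\varepsilon_l &= x_l - f_l(x_{l-1};\theta_l), \qquad l = 1,\dots,L \\
F &= -\log p(x_0,\dots,x_L)
   = \sum_{l=1}^{L} \tfrac{1}{2}\,\varepsilon_l^{\top}\Sigma_l^{-1}\varepsilon_l + \text{const} \\
\frac{dx_l}{dt} &= -\frac{\partial F}{\partial x_l}
   = -\Sigma_l^{-1}\varepsilon_l + J_{l+1}^{\top}\Sigma_{l+1}^{-1}\varepsilon_{l+1},
   \qquad J_{l+1} = \left.\frac{\partial f_{l+1}}{\partial x_l}\right|_{x_l},
   \qquad 1 \le l < L
\end{aligned}
```

In "normal" PC the predictions $f_{l+1}(x_l)$ and Jacobians $J_{l+1}$ are re-evaluated at the current $x_l$ on every iteration, so the dynamics are a descent on $F$. Under the FPA they are held fixed at their forward-pass values, which is the modification being debated in this thread.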

1

liukidar t1_iwc3rkb wrote

> The interesting bit for me is not the exact correspondence with PC (as described in neuroscience) but rather the property that makes it suitable for asynchronous parallelisation, namely local synaptic plasticity, which I believe still holds

Indeed this still holds with all the definitions of PC out there (I guess that's why very different implementations such as FPA are still called PC). In theory, therefore, it is possible to parallelise all the computations across different layers.

However, it seems that deep learning frameworks such as PyTorch and JAX are not able to do this kind of parallelization on a single GPU (I would be very very glad if someone who knows more about this would like to have a chat on the topic; maybe I'm lucky and some JAX/PyTorch/CUDA developers stumble upon this comment :P)

3

maizeq t1_iwcimaq wrote

Thanks for the reply, there was some nuance left out of my comment since it was getting long enough, but if you take a closer look you'll find they all more or less adopt similar assumptions to make the two equivalent, and all suffer from the same points.

To be more specific:

The Millidge paper, which most of the BP = PC literature is based on, uses the FPA, and is not a descent on the log joint. (It also uses inverted models, as I mentioned.)

This paper by Song, which was published in NeurIPS, doesn't use FPA-PC "directly", but achieves effectively the same thing by requiring the weight update to occur at a precise inference step, requiring that the nodes are initialised to a feed-forward pass, and requiring the inference learning rate to be exactly 1. (All are required for equivalence.)

Does this sound familiar? That's right, this is literally computationally equivalent to backprop! (a forward pass and a sequential coordinated backward pass). This is intuitively obvious if you read the paper but you can see the Rosenbaum paper to see it play out experimentally also.
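A minimal sketch of that claim, under assumptions that are mine and not from the papers: a toy scalar chain where layer l predicts `w[l] * tanh(v[l-1])`, values initialised to the forward pass, predictions and derivatives frozen at forward-pass values (the FPA), and an inference step size of exactly 1. The weights, input and target are arbitrary. With synchronous updates, the errors match the negated backprop deltas only after the information has crossed every hidden layer.

```python
import math


def dtanh(z):
    return 1.0 - math.tanh(z) ** 2


# Hypothetical toy chain: layer l predicts mu[l] = w[l] * tanh(v[l-1]).
w = [None, 0.7, -1.3, 0.9, 1.1]  # w[0] is unused; four layers
x, y = 0.5, 0.2                  # arbitrary input and target
L = len(w) - 1

# Forward pass: these values double as the frozen FPA predictions.
ff = [x]
for l in range(1, L + 1):
    ff.append(w[l] * math.tanh(ff[l - 1]))

# Backprop on loss 0.5 * (ff[L] - y)**2: delta[l] = dLoss/dv[l].
delta = [0.0] * (L + 1)
delta[L] = ff[L] - y
for l in range(L - 1, 0, -1):
    delta[l] = delta[l + 1] * w[l + 1] * dtanh(ff[l])


def errors(v):
    """Prediction errors against the frozen forward-pass predictions."""
    return [0.0] + [v[l] - ff[l] for l in range(1, L + 1)]


# FPA-PC inference: start at the forward pass, clamp the output to y.
v = ff[:L] + [y]
steps_needed = None
for t in range(1, 2 * L):
    e = errors(v)  # every layer reads the same old errors (synchronous)
    for l in range(1, L):
        v[l] += -e[l] + e[l + 1] * w[l + 1] * dtanh(ff[l])
    e = errors(v)
    if all(abs(e[l] + delta[l]) < 1e-12 for l in range(1, L + 1)):
        steps_needed = t  # errors now equal the negated backprop deltas
        break

print(steps_needed, L - 1)
```

In this toy setting the number of inference steps needed equals the number of hidden layers, i.e. the error signal moves back one layer per step, much like a sequential backward pass.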

The Salvatori paper you linked uses the algorithm from the aforementioned Song paper, and so the same points apply. Note how they do not empirically evaluate "IL", which, in their terminology, corresponds to the actual PC algorithm.

Finally the Kinghorn paper you linked refers to standard uninverted (generative) PC, and isn't part of the BP=PC literature. (Note how label accuracy for MNIST is 80%, whereas in the inverted PC=BP models it can reach 97%).

From my practical experience in implementing a PC library the subpar performance of supervised generative PC for classification remains a difficulty. What's more, when using standard PC (in both inverted and uninverted settings), you have to be far more careful (vs. FPA) on account of the dynamics during inference being more complex; since standard PC takes in to account the current top-down beliefs at every time-step, something that is not done by the FPA.

As such you can easily experience divergence, or a failure to converge. This is likely why I haven't seen a single example of standard PC evaluated on a deep/complex inverted model. All the instances you see of "PC" evaluated on RNNs, CNNs, deep MLPs, etc. are FPA-PC (or the alternatives I mentioned above).

3

liukidar t1_iwcnuo2 wrote

Hello. Thank you for your reply. I will go into the details as well since I think we're creating a good review of PC that may help all different kinds of people that are interested.

I think we should divide the literature into two sets: FPA PC and PC. All the papers we cited (Salvatori, Song, Millidge) indeed belong to the FPA PC set. The aim of those papers was basically to give theoretical proof that PC is able to replicate BP in the brain (despite using a lot of assumptions on how this can be done).

However, note that the goal of the papers you have cited is to provide an equivalence or approximation between PC and BP, and not to use PC with FPA as a general-purpose algorithm. In fact, the same authors have since released several papers that do NOT use FPA and are applied to different machine learning tasks. I believe that the original idea of creating a general library to run these experiments is more focused on applications, and not on reimplementing the experiments that show equivalence and approximations of PC. Something interesting to replicate, still from the same authors, is the following: https://arxiv.org/pdf/2201.13180.pdf. And I am not aware of any library that has implemented something similar in an efficient way.

In relation to the accuracy, I'm not sure about what is reported by Kinghorn, but already in Whittington 2017 you can see that they get 98% accuracy on MNIST with standard PC. So the performance of PC on that task is not to be doubted.

I agree there's a lack of evaluations on deeper and more complex architectures. However here you can see an example of what you called IL can do: https://arxiv.org/abs/2211.03481 .

3

maizeq t1_iwcv5uh wrote

Thanks, and yes I agree, this might be useful to others.

As an aside, I have no qualms with standard generative PC (such as the paper you linked, and any other papers they have released in that vein; indeed I'm a fan!). However, the discussion in this thread is about the equating of BP with PC, and in this regard, arguing "PC approximates backpropagation" when you really mean "this other heavily modified algorithm that was inspired by PC approximates backprop" is misleading. It is akin to saying an apple looks like an orange, if you throw away the apple and buy another orange.

It feels particularly egregious, when it turns out this modified algorithm is computationally equivalent to backpropagation, and as such the various neuroscientific justifications one applies may no longer hold (e.g. generative modelling is more sample efficient, or cortical hierarchies in the brain are characterised by top-down non-linear effects).

>In relation to the accuracy, I'm not sure about what reported by Kinghorn, but already in Whittington 2017, you can see that they get a 98% accuracy on MNIST with standard PC. So the performance of PC on those it's not to be doubted.

Yes, this is the 97% value I referred to in my comment, if you look at the Whittington 2017 paper you will see this refers to an inverted architecture. In this case for a small ANN trained with standard PC without the FPA assumption.

Again, it's important to distinguish between the BP=PC literature, which this thread is related to, and other PC literature. I have no doubt plenty of interesting papers and insights exist in the latter!

2

Ambitious_Smile_981 t1_iwdrmam wrote

I don't see the problem of differentiating inverted and non-inverted architectures, as they are both generative models. The difference lies in what you are generating. In one case, you generate the label, and give as prior information the image, in the other, you generate the image giving the label as prior information.
Both have their advantages and disadvantages, but I don't see why the 'inverted' one is not interesting.

As for the BP = PC literature, I think that showing that by simply introducing a temporal scheduling for the weight updates of PC, we are able to obtain exact BP is interesting. I agree that this variation of PC loses all the advantages that PC has over BP, but it is still important to know that it is possible to derive exact backprop from a variational free energy.

1

miguelstar98 t1_ixasxko wrote

Alright, This is gonna be a long reply. 

TLDR; Probably a dead end (well at least current implementations are), but my goodness was it fun to research. In fact, it was a blast! I only had a single day off so take everything with a truck of salt.

Software Designer's perspective: 

Erlang is definitely the right tool to use if you want a programming language that was built to perform distributed parallel computing, with fault detection, repair, and consistency built in (see "Systems that run forever self-heal and scale" by Joe Armstrong (2013)), though to be honest the creator of Erlang himself has said that everything that makes Erlang what it is can be implemented in other languages. But should you? Well, in my opinion, after looking at the Google Trends for Erlang, Clojure, and Rust (languages that are similarly built to solve specific problems), it might be more worthwhile to just use a language that was built with ease of use, simplicity, and high popularity in mind, because who actually wants to learn to build and maintain the software in these languages? You can always make a program faster; you can't always make it easier to learn, read, or maintain.

From a hardware design perspective: Honestly silicon itself might be a dead end. We seem to be converging on carbon based switches both in terms of computers and as meat brains.

From the Biologist's perspective: The whole concept of neural networks was biologically inspired. Taking inspiration from biology, specifically the human brain is obviously the correct course of action, not because biology or the brain is special but because when you attempt to solve any problem odds are good that there is already a solution. It might be a buggy approximation, but a solution nonetheless and to be perfectly honest evolution by natural selection is terrible at making good solutions, it’s better to just idealize away evolution’s solution and just do better.

From my personal perspective: I hope you can help clear up my understanding but what is the difference between predictive coding and model ensembles? I know that probably sounds like a dumb question, but can’t we just take a bunch of models that are really good at particular tasks and have a software layer that controls when to use which model and then combine their outputs to solve any general problem? Also if I need fault tolerance or I need to run inference, can’t I just use a cluster computer, why not 2? Isn’t this a solved problem when training large language models? 

Connected Papers

https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html

1

miguelstar98 t1_ixatok2 wrote

Finally, I put a lot of effort into this reply, solely because I could FEEL your enthusiasm in your post OP. It's the journey and the people we meet along the way that matters. Research is hard. Be Passionate. Personally, I'm now kinda interested in what would happen if we were to train an AI to be an operating system. It's all just function optimizations in the end anyway...I think...I probably shouldn't have sacrificed my sanity to write this, but I have no regrets. Also I only skimmed everyone’s comments so there might be overlap.

More links that may or may not be helpful

"Probabilistic scripts for automating common-sense tasks" by Alexander Lew

"We Really Don't Know How to Compute!" - Gerald Sussman (2011)

https://www.erlang-factory.com/upload/presentations/735/NeuroevolutionThroughErlang.pdf

This one is a PDF of the Handbook of Neuroevolution Through Erlang. Library Genesis can be sketchy sometimes

http://library.lol/main/68e1c70d7ad3dad73727820bcffaccaf

2

abhitopia OP t1_ixc6b1m wrote

Hey u/miguelstar98, OP here and still very enthusiastic. I have spent the last 2 weeks studying predictive coding and am still going through a lot of nuances. The more I think and read about it, the more confident I am about the utility of this project.

Btw, do you know what the comment was that got deleted by the moderator?

I shared the "Neuroevolution Through Erlang" book in my original post too. I really still think that (having read up on predictive coding) coding this in Erlang is so much easier if you want to make it fully asynchronous and scalable, and to worry about optimisation only later (e.g. using Rust NIFs or trying to use CUDA kernels).

>"Probabilistic scripts for automating common-sense tasks" by Alexander Lew
>"We Really Don't Know How to Compute!" - Gerald Sussman (2011)

Haven't watched these. Do you mind sharing what you have in mind?

1

miguelstar98 t1_ixdfxqw wrote

The comment wasn't deleted by a moderator; it was my first attempt at the original reply, before I realized that I had made a mistake (I have a tendency to ignore everything anyone has to say and just try to figure things out on my own first, not because I arrogantly believe I'm better but because I know that I can look at things differently than other people), so I deleted it and quickly skimmed the other comments because I was running low on time.

The Library Genesis link is there because others might not have access to an institution or $$$, and paying for information before you even know if it's useful is inefficient for learning.

I included those two videos and all of the other links because (given the information I can glean from you) I can and did reasonably predict that at least some of the information within those links is outside of your comfort zone. Which would mean that after watching them you'll have explored the solution space of your particular problem more thoroughly. Exploring down rabbit holes should probably be done early on, while it's still easy to change your mind.

The videos by Alexander Lew and Gerald Sussman are the first things I thought of when thinking about your problem. Will they be helpful? Maybe but I could be wrong.

What really interests me is that even after reading my reply you are confident, which means you think I'm wrong (which is so exciting!) but you haven't really answered my questions, or explained the source of your confidence or perhaps I haven't fully grasped enough of the nuances of the problem to even have useful responses for you. I'd love to help you, but I just don't see how it's not a dead end.

Don't worry about replying if you think I'm crazy just ignore everything I've said

1

abhitopia OP t1_ixdkoos wrote

Hey u/miguelstar98

> but you haven't really answered my questions, or explained the source of your confidence or perhaps I haven't fully grasped enough of the nuances of the problem to even have useful responses for you.

I am not sure which questions? Did you mean what you mentioned in your deleted post (which wasn't accessible to me)?

Anyways, I can see your original post now. Thanks for undeleting it.

>Software Designer's perspective:

I think the actor model just makes a lot of sense for asynchronous concurrent computations. Having said that, since Erlang is slow, I am actually considering using the Actix library in Rust. (The first step is for me to just write pseudocode of the algorithm based on message passing.)

>From a hardware design perspective:

I am not sure what you want to say. The difference here is not hardware but change in algorithm (BP vs PC). Afaik, BP requires synchronised forward and backward passes.

>From the Biologist's perspective:

I am not sure again. The intention isn't to say biologically plausible is superior or we MUST imitate nature. It is rather something that current ML libraries don't do but seems doable in light of new PC research.

>From my personal perspective: I hope you can help clear up my understanding but what is the difference between predictive coding and model ensembles? I know that probably sounds like a dumb question, but can’t we just take a bunch of models that are really good at particular tasks and have a software layer that controls when to use which model and then combine their outputs to solve any general problem? Also if I need fault tolerance or I need to run inference, can’t I just use a cluster computer, why not 2? Isn’t this a solved problem when training large language models?

Hmm. Model ensembles and learning algorithms to train those models are two different topics. The focus here is not on the "inference" (FP) part which current libraries are really good at but the "learning" (BP) part. Not sure what else to say.
I highly recommend reading this tutorial on PC (and contrast against BP)

2

abhitopia OP t1_ixdnki4 wrote

u/maizeq - I have finished reading the Rosenbaum paper. It is certainly a very accessible and useful paper for understanding the details and nuances of the various PC implementations. So thank you for sharing that.

The objective of the author seems to be to compare various versions of the algorithm and highlight their subtle differences, and the paper does a great job at it. It does not, however, exploit local synaptic plasticity in its implementation (it uses loops), which is exactly where I think the limitation of PyTorch, JAX, and TensorFlow lies.

For instance, one could imagine each node and each weight in a PC (non-FPA) MLP network as a standalone process, communicating with other node and weight processes only via message passing and running completely asynchronously. Furthermore, we can limit the amount of computation by thresholding the value of error nodes (so that weight updates for connected weight processes happen only when the error exceeds the threshold), in a sense enforcing sparsity.
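A rough single-process sketch of this idea, to make the locality concrete. Everything here is an assumption for illustration: the toy scalar chain (layer l predicts `w[l] * tanh(v[l-1])`), the step sizes, the threshold, and the use of random update order as a stand-in for asynchronous scheduling. It is not an Erlang or Actix implementation, and real message passing is replaced by direct reads of neighbouring values.

```python
import math
import random


def local_error(v, w, l):
    """Error at layer l; needs only v[l], v[l-1] and w[l]."""
    return v[l] - w[l] * math.tanh(v[l - 1])


def energy(v, w):
    """Sum of squared prediction errors, F = 0.5 * sum_l e_l**2."""
    return 0.5 * sum(local_error(v, w, l) ** 2 for l in range(1, len(v)))


def run_async(v, w, steps=5000, lr_v=0.1, lr_w=0.05, threshold=0.01, seed=0):
    """Update one randomly chosen value node or weight at a time."""
    rng = random.Random(seed)
    L = len(v) - 1
    skipped = 0
    for _ in range(steps):
        l = rng.randint(1, L)
        if rng.random() < 0.5 and l < L:
            # Hidden value node l: uses its own error and the one above.
            e_here = local_error(v, w, l)
            e_above = local_error(v, w, l + 1)
            slope = 1.0 - math.tanh(v[l]) ** 2
            v[l] -= lr_v * (e_here - e_above * w[l + 1] * slope)
        else:
            # Weight l: update only if its error node is above threshold.
            e = local_error(v, w, l)
            if abs(e) < threshold:
                skipped += 1
                continue
            w[l] += lr_w * e * math.tanh(v[l - 1])
    return skipped


x, y = 0.5, 0.2                  # input and target, both clamped
w = [None, 0.7, -1.3, 0.9]       # w[0] is unused
v = [x, 0.0, 0.0, y]
for l in (1, 2):                 # start hidden nodes at the forward pass
    v[l] = w[l] * math.tanh(v[l - 1])

f_before = energy(v, w)
skipped = run_async(v, w)
f_after = energy(v, w)
print(f_before, f_after, skipped)
```

Each update touches only a node's immediate neighbours, so in principle every branch of the loop could live in its own process. The `skipped` counter shows how many weight updates the threshold suppressed in this run.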

Maybe I am wrong, but I do not (yet) see why in this simple MLP it should not be possible to add new nodes (in a hot fashion); for example, if the activity in any node increases beyond a certain threshold, then scale up automatically, preserving 2% activity per layer.

Contrast this with GPU-based backward passes: a lot of wasteful computation can be prevented. At the very least, the backward pass doesn't need to wait for the forward pass in the EM-like learning algorithm that PC is.

P.S. - My motivation isn't PC==BP, but rather can PC replace BP and is it worth it.

1

BerenMillidge t1_iy814ur wrote

Hi, author of some of the papers linked here. Broadly, Maizeq is right to distinguish between FPA-PC and ‘standard PC’ (the ‘inverted’ vs ‘generative’ direction of the PC net is a different, orthogonal distinction). The equivalence between PC and BP only holds exactly in the case with the FPA (or some equivalent set of assumptions; for instance, in the original Whittington paper they use the precision ratio tending to 0). Of course, all of these limits are in some sense extreme and eliminate some (but not all) of the major advantages of PC (in some sense this was inevitable, since if they exactly equal BP then they must very roughly have the same advantages/disadvantages as it). The way to view these works, at least as I have come to view them, is as an idealised exploration of a specific limit of PC. In recent work (https://arxiv.org/pdf/2206.02629), we expand on this limit idea and show that all current EBM approximations to BP, such as PC, Equilibrium-prop and Contrastive Hebbian learning, can be expressed as a single ‘infinitesimal inference limit’.

Overall I disagree that the work in this vein is particularly misleading, although this is a subjective assessment. It is upfront about the assumptions you need to make to obtain equivalence to backprop, as well as how this departs from standard PC.

Of course, from a neuroscientific perspective, this limit is perhaps not the most realistic, and so we are also exploring the ML performance of more ‘standard’ PC versions which are more biologically plausible and which don’t approximate backprop, as well as specifically understanding the special advantages and disadvantages of these algorithms. For instance, in a recent paper -- https://www.biorxiv.org/content/biorxiv/early/2022/05/18/2022.05.17.492325.full.pdf --, we propose a new understanding of standard PC as ‘prospective configuration’ and demonstrate how this version of PC can outperform backprop in a number of its properties. We also have a more theoretical analysis of standard PC (https://arxiv.org/pdf/2207.12316) where we show that although it differs from backprop, it can also converge to minima of a supervised loss function, and has close links to target-propagation and hence Gauss-Newton optimization. Our groups have also explored other potential advantages of PC over BP, including its ability to learn arbitrary recurrent computation graphs (https://arxiv.org/pdf/2201.13180), the fact that you can significantly speed it up with incremental variants, and that you can get PC to perform a mix of iterative and amortised inference (https://arxiv.org/pdf/2204.02169).

In terms of the hardware, I have also looked into this a little, and my feeling is that while PC has better parallelism properties than BP, it is unlikely to outperform BP on a GPU due to the need to iteratively perform the inference phase, while BP just has a sequential forward and backward pass. GPUs are now getting very highly optimised for the exact style of computations needed in BP for large-scale ANNs. PC does possess a much higher degree of parallelism and locality than BP and on a sufficiently distributed architecture may eventually prove better, especially once we start building proper ‘neuromorphic’ processor-in-memory architectures. However, this seems likely to be many years away. I haven’t read much about Erlang, so I’m not sure if it possesses the necessary degree of parallelism. One possibility is that Erlang with PC might allow you to move to a different point on the Pareto frontier: having lots of CPUs and developing learning algorithms comparable in performance with doing BP on a single GPU. I haven’t run any Fermi-style estimates of whether this is feasible or not. We have some calculations about this in a forthcoming paper, but this is on a highly abstract computation model of ‘parallel matrix multiplications’ and I haven’t figured out what the actual equivalent calculations for realistic hardware would look like.

2