kompootor t1_ja24h53 wrote
I will not debate the thesis. I will admit that every blog post using the premise of "The latest advancement in neural nets brings us one step closer to the Singularity" begins with an eyeroll of dread from me, but I will at least pick apart the first section for problems, which hopefully will be informative for those evaluating for themselves how seriously to take this person.
>Intelligence, as defined in this article, is the ability to compress data describing past events, in order to predict future outcomes and take actions that achieve a desired objective.
The first two elements of that are the definition of any model, which is exactly what both AI and deterministic regression algorithms do. I think "take actions" would imply that the AI model makes explicit recommendations, except that no premises are given in the definition to provide context for those recommendations (to whom? for what?). Regardless, it seems to me less a definition of "intelligence", in any useful sense, than of "model".
>Since its introduction, the theory of compression progress has been applied to a wide range of fields, including psychology, neuroscience, computer science, art, and music.
The problem is that Schmidhuber 2008 only exists as a preprint and later as a conference paper -- it was never peer-reviewed. The paper itself argues that compression progress is a widely applicable concept, but it wasn't apparent to me from a search that the theory has actually been applied by anyone to one of these fields in any substantial way. I'm not saying it's a bad paper or theory, but this essay doesn't really justify why it brings it up so much (particularly given the very limited definition of intelligence above, and just the way everybody already knows ANNs work), and giving a real example of the theory being useful would have helped.
>The equation E = mc^2
For the newbies out there, this is what's called a red flag.
The next paragraph actually helpfully links to Towards Data Science's page on the Transformer, which is really good in that it illustrates, complete with animations, the mechanics of ANNs. So definitely check it out. The next sentence of the essay, however, literally links to the paper that defined the Transformer, using the defining phrase from that paper -- as if that would enlighten the reader of the essay somehow? The final sentence of the paragraph is once again a completely generic description of all ANNs ever.
>The weights that define it’s behavior only take up 6GB, meaning it stores only about 1 byte of information per image.
This is completely the wrong way to think about it if you're trying to understand these things, so I hope he actually knows this.
The next few paragraphs seem to be ok descriptors. Then we get to here:
>With just a small amount of data and scale, the model will learn basic word and sentence structure. Add in more data and scale, and it learns grammar and punctuation.
First, this is the connectionist problem/fallacy from early AI and cog sci -- the notion that because small neuronal systems could be emulated somewhat with neural nets, and because neural nets could do useful biological-looking things, the limiting factor to intelligence/ability must be simple scale: more nodes, more connections, more power. Obviously this wasn't correct for either ANNs or BNNs. Further, in this paragraph he seems to have lost track of whether he was talking about the objective function in ChatGPT. Either way, that's definitely not how any NLP works at all. Unfortunately this paragraph only gets worse. That's disappointing, since the preceding paragraphs had otherwise indicated to me that the writer probably knew a little about neural nets in practice.
>Just last week, a paper was published arguing that theory of mind may have spontaneously emerged
PREPRINT. Not published. No peer review yet. I won't comment on the paper myself as I am not a peer in the field. It's a dramatic claim and it will have proper evaluation.
That's all I'll go through, as it's a long essay and I think there's enough here for you all to judge for yourselves from what I've evaluated of the first few paragraphs.
jamesj OP t1_ja2acnm wrote
Hey, I appreciate you taking the time to engage with the article and provide your thoughts. I'll respond to a few things.
>The first two elements of that are the definition of any model, which is exactly what both AI and deterministic regression algorithms do.
Yes, under the framework used in the article, an agent using linear regression might be a little intelligent. It can take past state data and use it to make predictions about the future state, and use those predictions to act. That would be more intelligent than an agent which makes random actions.
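To make that concrete, here's a minimal sketch of my own (not from the article) of an "agent" that fits a linear regression to its past experience, predicts the next state for each candidate action, and picks the action whose predicted outcome is closest to a goal:

```python
import numpy as np

# Past experience: state, action taken, and the resulting next state.
# Toy 1-D example; the "true" dynamics are next_state = state + 0.5*action.
states = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
actions = np.array([1, -1, 1, 1, -1])
next_states = states + 0.5 * actions

# "Compress" the past into two coefficients via least squares:
# next_state ~ w_s * state + w_a * action
X = np.column_stack([states, actions])
w, *_ = np.linalg.lstsq(X, next_states, rcond=None)

def act(current_state, goal=10.0):
    """Pick the action whose predicted next state is closest to the goal."""
    candidates = [-1, 1]
    predictions = [w @ np.array([current_state, a]) for a in candidates]
    return candidates[int(np.argmin([abs(p - goal) for p in predictions]))]

print(act(4.0))  # -> 1, since moving up gets closer to the goal
```

That's all the framework requires: compress past data into a small model, use it to predict, and act on the prediction. It's a very low bar, which is the point.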
>I'm not saying it's a bad paper or theory, but that this essay doesn't really justify why it brings it up so much
Yes, that is a fair point. I was worried that spending more time on it would have made the essay even longer than it already was. But one justification is that it's a good, practical definition of intelligence that demystifies intelligence down to the kind of information processing that must be taking place. It builds on information-theory work on information bottlenecks, and is directly related to the motivation for autoencoders.
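For anyone who hasn't seen the autoencoder connection, here's a rough sketch of my own (assuming PyTorch; the article doesn't include code) of the bottleneck idea: the network has to squeeze each input through a much smaller latent vector and reconstruct it, so whatever survives the bottleneck is necessarily a compressed description of the data.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Squeeze 784-dim inputs (e.g. flattened 28x28 images) through a
    32-dim bottleneck, then try to reconstruct the original input."""
    def __init__(self, input_dim=784, bottleneck_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, bottleneck_dim),      # the compressed code
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
x = torch.rand(16, 784)                      # a fake batch of inputs
loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
loss.backward()                              # training would minimize this
```

Minimizing reconstruction error through the bottleneck is exactly the "compress the past" part of the definition.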
>The problem is that Schmidhuber 2008 only exists as a preprint and later as a conference paper -- it was never peer-reviewed.
The paper isn't an experiment with data; it was first presented at a conference to put forward an interpretation. It's been cited 189 times. I think it is worth reading, and the ideas can be understood pretty easily. But it isn't the only paper that discusses the connection between compression, prediction, and intelligence. Not everyone talks in the language of compression; they may use words like elegance, parameter efficiency, information bottlenecks, or whatever, but we are talking about the same ideas. This paper has some good references; it states, "Several authors [1,5,6,11,7,9] have suggested the relevance of compression to intelligence, especially the inductive inferential (or inductive learning) part of intelligence. M. Hutter even proposed a compression contest (the Hutter prize) which was 'motivated by the fact that being able to compress well is closely related to acting intelligently'."
>The equation E = mc^2
>
>For the newbies out there, this is what's called a red flag.
I was trying to use an example that people would be familiar with. All the example is pointing out is that the equations of physics are highly compressed representations of the data of past physical measurements, which allow us to predict lots of future physical measurements. The same could be said of Maxwell's equations or the Standard Model or any successful physical theory. Most physicists prefer more compressed mathematical descriptions, though they would usually call them more elegant rather than use the language of compression.
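As a toy version of that claim (my own example, not the article's): a whole table of hypothetical mass/energy measurements "compresses" into a single fitted constant, and that one number then predicts measurements you haven't made yet.

```python
import numpy as np

# Hypothetical "past measurements": masses (kg) and energies (J),
# generated here with a little noise purely for illustration.
c = 299_792_458.0
masses = np.array([1e-3, 2e-3, 5e-3, 1e-2])
energies = masses * c**2 * (1 + 1e-6 * np.random.randn(4))

# Compress the whole table into one number: the best-fit k in E = k*m.
k = np.sum(masses * energies) / np.sum(masses**2)

print(k)         # ~ c**2, recovered from the data
print(k * 3e-3)  # predicted energy for a 3 g mass we never measured
```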
>This is completely the wrong way to think about it if you're trying to understand these things, so I hope he actually knows this.
I don't think it is wrong to say that what the transformer "knows" about the images in its dataset has been compressed into its weights. In a very real sense, a transformer is a very lossy compression algorithm which takes in a huge dataset and learns weights that represent patterns in that dataset. So no, I'm not saying that literally every image in the dataset was compressed down to 1.2 bytes each. I'm saying that whatever SD learned about the relationships of the pixels in an image to their text labels is stored in its weights at about 1.2 bytes per dataset image. And you can actually use those weights as a good image compression codec. The fact that it has to do this in a limited number of parameters is one of the things that forces it to learn higher-level patterns rather than rely on memorization or other simpler strategies. Ilya Sutskever talks about this, and was part of a team that published on it, basically showing that there is a sweet spot in the data/parameter ratio: giving the model more parameters improves performance up to a point, but adding even more beyond that decreases performance. His explanation is that by limiting the number of parameters, the model is forced to generalize. So in Schmidhuber's language, the network is forced to form more compressed representations, so it overfits less and generalizes better.
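For the back-of-the-envelope version of that bytes-per-image figure (my own arithmetic, assuming a LAION-scale training set of roughly 5 billion image-text pairs, which the article doesn't spell out):

```python
weights_bytes = 6e9      # ~6 GB of weights, per the article
dataset_images = 5e9     # assumption: roughly LAION-5B-scale training data

bytes_per_image = weights_bytes / dataset_images
print(bytes_per_image)   # ~1.2 bytes of weight capacity per training image
```

The point isn't that any single image is stored in 1.2 bytes; it's that the capacity budget per training example is so small that memorization can't be the main strategy.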
>First, this is the connectionist problem/fallacy from early AI and cog sci -- the notion that because small neuronal systems could be emulated somewhat with neural nets, and because neural nets could do useful biological-looking things, the limiting factor to intelligence/ability must be simple scale
My argument about this doesn't come from ML systems mimicking biology. It comes from looking at exponential graphs of cost, performance, model parameters, and so on, and projecting that exponential growth will likely continue for a while. The first airplane didn't fly like a bird, it did something a lot simpler than that. In the same way, I'd bet the first AGI will be a lot simpler than a brain. I could be wrong about that.
But I'm not even claiming that scaling transformers will lead to AGI, or that AGI will definitely be developed soon. All I'm saying is that there is significant expert uncertainty about when AGI will be developed, and it is possible that it could be developed soon. If it were, that would probably be the most difficult type of AGI to align, which is a concern.