Submitted by Singularian2501 t3_y3d3lw in MachineLearning

Paper: https://arxiv.org/abs/2210.05359

Abstract:

>Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
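The pipeline the abstract describes — simulate the physical question first, then prepend the simulation outcome to the LM's prompt — can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function names are invented, and the simulation step is stubbed out where the paper would run MuJoCo via a text-to-code model.

```python
# Hypothetical sketch of the Mind's Eye idea: ground an LM's reasoning by
# injecting a physics-simulation result into the prompt. All names here are
# illustrative; the paper's actual implementation uses MuJoCo and a
# text-to-code LM for the simulation step.

def simulate_outcome(question: str) -> str:
    """Stand-in for the simulation step.

    A real implementation would translate `question` into simulation code
    and execute it; here we return a canned result to show the interface.
    """
    if "released from rest at the same height" in question:
        return "Simulation result: both objects hit the ground at the same time."
    return "Simulation result: unavailable."

def build_grounded_prompt(question: str) -> str:
    # The core idea: the simulation outcome becomes extra context that the
    # language model conditions on before answering.
    return f"{simulate_outcome(question)}\nQuestion: {question}\nAnswer:"

prompt = build_grounded_prompt(
    "Two baseballs X and Y are released from rest at the same height. "
    "X is heavier than Y. Which baseball will fall to the ground faster?"
)
print(prompt)
```

The grounded prompt is then passed to an ordinary LM; the paper's gains come entirely from this added context, not from changing the model.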

https://preview.redd.it/ie7jdqhwmnt91.jpg?width=1092&format=pjpg&auto=webp&s=ebff5cab2c805549e85fb2eccfdadd0644d95d9f

https://preview.redd.it/3wrxbnhwmnt91.jpg?width=1180&format=pjpg&auto=webp&s=09aa2773a853ab564cfbb11811a18d21165a06e4

https://preview.redd.it/7frgfxhwmnt91.jpg?width=991&format=pjpg&auto=webp&s=0bfcb01b5707d6e1892fc3100b960a5f9c203707

https://preview.redd.it/k6mm4rhwmnt91.jpg?width=1191&format=pjpg&auto=webp&s=00c452f58e79b6ea003883826e50653f907221d5

134

Comments


londons_explorer t1_is9c1xs wrote

This seems to basically be injecting a tiny amount of rule-based decision making into a language model...

The physics model is so limited that it can only work in a very tiny number of cases, and the results might as well be hardcoded prompts to inject.

26

thunderdome t1_is9cigc wrote

Really interesting. I wish they included more details on the text-to-code models. They share the table showing that increasing the size increases the accuracy, but they apparently never train it higher than 1.5B parameters? It would be interesting to know how much of the remaining error on their benchmark is due to error in the text-to-code generation vs the foundational models. Or even just a standalone accuracy measure.

Regardless, super cool to see the number 100 popping up in benchmarks like this.

5

NextAGI t1_is9kymu wrote

Not so surprising to me if you have read the paper showing that language models can even transfer reasoning ability to language after being trained on something like code: https://arxiv.org/abs/2201.11473

7

rePAN6517 t1_is9rrfh wrote

Maybe we could use the new version of Codex to program a human simulator, and let LLMs use the human simulator to help answer any questions related to people.

2

Co0k1eGal3xy t1_is9yf8p wrote

>Two baseballs X and Y are released from rest at the same height.
>
>X is heavier than Y.
>
>Which baseball will fall to the ground faster?

Isn't Mind's Eye the ONLY wrong answer?

Acceleration due to gravity is constant, but the opposing force from air resistance is roughly proportional to the air displaced and does not change with mass.

I mean, this is the whole point of Apollo 15's test on the moon.

All of them have wrong explanations, but Mind's Eye is the only one that incorrectly claims they will fall at the same rate under normal real-world conditions.


Proof : Brian Cox visits the world's biggest vacuum | Human Universe - BBC
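The commenter's drag argument can be checked numerically. Below is a rough sketch under assumed baseball-like parameters (drag coefficient, radius, and the two masses are my own illustrative values, not from the paper): with quadratic air drag, the heavier of two same-sized balls does land first.

```python
import math

# Numerical check of the air-resistance argument: drag depends on size and
# speed but not mass, so the heavier of two identical-sized balls decelerates
# proportionally less. Simple Euler integration; parameters are assumptions.

def fall_time(mass_kg, height_m=50.0, dt=1e-4):
    g = 9.81             # gravitational acceleration, m/s^2
    rho = 1.225          # air density at sea level, kg/m^3
    cd = 0.47            # drag coefficient of a smooth sphere
    radius = 0.0366      # regulation baseball radius, m
    area = math.pi * radius ** 2
    v, y, t = 0.0, height_m, 0.0
    while y > 0:
        drag = 0.5 * rho * cd * area * v * v   # opposes the motion
        a = g - drag / mass_kg                 # net downward acceleration
        v += a * dt
        y -= v * dt
        t += dt
    return t

t_heavy = fall_time(0.25)   # heavier ball (assumed 250 g)
t_light = fall_time(0.10)   # lighter ball (assumed 100 g)
print(t_heavy, t_light)     # the heavier ball lands slightly earlier
```

Both times exceed the vacuum value sqrt(2h/g) ≈ 3.19 s, and the gap grows with drop height, matching the Apollo 15 intuition that only in vacuum do the times coincide exactly.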

1

Even_Tangerine_800 t1_is9zy95 wrote

This is what I got: GPT-3 Answer.

Apparently, the model arrives at the wrong answer without mentioning air resistance. I have tried many times and the results are consistent.

Considering the free-fall rules should be encoded in some textbooks (which should have been included in the pre-training datasets), these results are even more striking to me.

7

Even_Tangerine_800 t1_isa18wf wrote

Are the questions as simple as a = F/m = mg / m = g?
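The cancellation in that formula can be made explicit; a trivial check (masses chosen arbitrarily for illustration):

```python
# In the idealized no-drag case, mass cancels: a = F/m = (m*g)/m = g,
# so the acceleration is the same for any mass.
g = 9.81
for m in (0.10, 0.25, 1.0):
    a = (m * g) / m   # net acceleration under gravity alone
    print(m, a)       # always g, independent of m
```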

Anyway, if humans put effort into optimizing a tool for accurate simulation, we can treat it more as an alignment problem rather than pure scientific judgment.

You can update the knowledge in the physics engine if you want.

3

Co0k1eGal3xy t1_isa1gbn wrote

Oh I agree 100%. This paper is fantastic! (and it's an easy fix)

I definitely want to see further research in this, but the comparison they show here is probably not the comparison they wanted to show haha.

1

Lajamerr_Mittesdine t1_isa34ib wrote

All the answers are incomplete because they don't provide the assumptions necessary to arrive at a complete solution.

A more complete answer would look like this.

>Assuming just gravitational forces, both the lighter and heavier baseballs would fall at the same rate and reach the surface at approximately the same time. This can be impacted, however, by additional forces that may be present, such as an atmosphere providing additional resistance based on the surface area, density, and total mass of each object.

Though even that is an incomplete answer.

1

Co0k1eGal3xy t1_isa44ws wrote

>because they don't provide the assumptions necessary to arrive at a complete solution.

I agree, but when the atmosphere is not mentioned, the default should be updated to STP (0°C temperature and 101.325 kPa pressure) in future.

2

Co0k1eGal3xy t1_isa8g7i wrote

>current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning.

That is my whole point. This paper is trying to avoid "planet Question" and make language models work in the real world instead.

I'm not interested in arguing over this. The paper is good, it just needs a minor correction in a future revision.

2

Icy-Pause-574 t1_isb5gai wrote

But we are using a physics engine / game engine to create the virtual world, right?

I think this paper shows some potential of using the parallel world to help understand the real world, which is amazing.

I want to call it a milestone for mixed-reality LMs.

7

AskMoreQuestionsOk t1_isb66lb wrote

Actually, I think you make a good point. If you think about understanding conversations, stories, and problems like this, you need a model of what it is you are talking about to even begin to make an accurate prediction of the next state. We make an incredible number of assumptions from our own experience when we build those internal models. How do we know if air friction is important to this problem?

1

yazriel0 t1_isbv4hp wrote

Why aren't we doing this for the code domain? Generate programs, try to run them, and auto-correct the model.

This can probably be iterated with far more samples than a physical simulator.
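The loop being suggested can be sketched as follows. The LM call is a stub (returning a deliberately buggy program first, a fixed one after feedback) purely so the control flow is runnable; the generate-execute-repair structure is the actual point.

```python
import traceback

# Sketch of a generate-run-repair loop for the code domain. `generate_code`
# stands in for a code LM (e.g. Codex); here it is a hard-coded stub that
# "fixes" its program once it sees a NameError in the feedback.

def generate_code(task: str, feedback: str = "") -> str:
    # Hypothetical LM call, stubbed for illustration.
    if "NameError" in feedback:
        return "result = sum(range(10))"
    return "result = sum(rang(10))"  # deliberate bug: 'rang' is undefined

def run_with_repair(task: str, max_tries: int = 3):
    feedback = ""
    for _ in range(max_tries):
        code = generate_code(task, feedback)
        scope = {}
        try:
            exec(code, scope)             # execute the candidate program
            return scope["result"]        # success: return its output
        except Exception:
            feedback = traceback.format_exc()  # feed the error trace back
    raise RuntimeError("no working program found")

print(run_with_repair("sum the numbers 0..9"))  # 45
```

The appeal of the idea is exactly what the comment notes: an interpreter gives cheap, automatic, unlimited feedback, whereas a physics simulator must be built and aligned by hand.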

5

Co0k1eGal3xy t1_isc5m6k wrote

>But what about the basketball and the bowling ball? Shouldn't they have different accelerations? Technically, yes.
>
>[...]
>
>it turns out that there are many situations where a heavier object does indeed hit the ground before a lighter object (because of air resistance).

Your link says the heavy baseball and the light baseball would fall at different rates.

1

master3243 t1_iscb9mu wrote

My link also says that heavier objects can fall slower than lighter objects, as with the styrofoam board that was heavier than the small ball yet fell slower.

In the absence of more detail such as the dynamics of the shapes and the inclusion of air drag or not, it is fair to say that the most correct answer to the "which" question is "both". I would only count the "heavy first" answer as correct IF it included the discussion on air drag, otherwise the correct answer is "both". But that's my opinion and not objectively the only way to interpret this.

Especially given a model that has so many physics articles/materials included in its dataset, it's a pretty big failure that it can't answer this properly.

1

Co0k1eGal3xy t1_iscd0no wrote

>In the absence of more detail such as the dynamics of the shapes

Baseballs have a standard diameter and shape.

It's theoretically possible that the heavier baseball has a "furry" surface or something like that, but it's such an unlikely case I didn't consider it when reading the paper.


>it's a pretty big fail that it can't answer this properly.

I emailed the authors and they said "there could be some pre conditions we have not presented in the screenshot" and that they would address it when they released a dataset.

Sounds like it's all sorted out now. No harm done.

1