Abstract:

>Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
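
The pipeline in the abstract (simulate first, then put the result in the prompt) can be sketched roughly as below. This is not the authors' code and does not use MuJoCo: `vacuum_fall_time`, `summarize`, `grounded_prompt`, the masses, and the drop height are all hypothetical stand-ins chosen for illustration, and the "simulator" is a closed-form vacuum drop.

```python
import math
from typing import Callable, Dict


def vacuum_fall_time(height_m: float, mass_kg: float, g: float = 9.81) -> float:
    """Toy stand-in for a physics engine: fall time from rest with no air.

    mass_kg is accepted so other simulators can be swapped in, but is unused here.
    """
    return math.sqrt(2.0 * height_m / g)


def summarize(times: Dict[str, float], tol: float = 1e-6) -> str:
    """Render two simulated fall times as a short natural-language hint."""
    (a, ta), (b, tb) = sorted(times.items(), key=lambda kv: kv[1])
    if abs(ta - tb) <= tol:
        return f"Simulation hint: {a} and {b} reach the ground at the same time."
    return f"Simulation hint: {a} reaches the ground before {b}."


def grounded_prompt(
    question: str,
    masses_kg: Dict[str, float],
    height_m: float,
    simulator: Callable[[float, float], float] = vacuum_fall_time,
) -> str:
    """Prepend the simulation hint to the question before it goes to a language model."""
    times = {name: simulator(height_m, m) for name, m in masses_kg.items()}
    return summarize(times) + "\n\nQuestion: " + question + "\nAnswer:"


prompt = grounded_prompt(
    "Two baseballs X and Y are released from rest at the same height. "
    "X is heavier than Y. Which baseball will fall to the ground faster?",
    masses_kg={"X": 0.160, "Y": 0.145},  # assumed masses, not from the paper
    height_m=10.0,                        # assumed drop height
)
print(prompt)
```

Because the simulator is passed in as a parameter, the hint is only as good as the physics behind it, which is what much of the discussion below turns on.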

https://preview.redd.it/ie7jdqhwmnt91.jpg?width=1092&format=pjpg&auto=webp&s=ebff5cab2c805549e85fb2eccfdadd0644d95d9f

https://preview.redd.it/3wrxbnhwmnt91.jpg?width=1180&format=pjpg&auto=webp&s=09aa2773a853ab564cfbb11811a18d21165a06e4

https://preview.redd.it/7frgfxhwmnt91.jpg?width=991&format=pjpg&auto=webp&s=0bfcb01b5707d6e1892fc3100b960a5f9c203707

https://preview.redd.it/k6mm4rhwmnt91.jpg?width=1191&format=pjpg&auto=webp&s=00c452f58e79b6ea003883826e50653f907221d5

Comments


[deleted] t1_is9855m wrote on October 14, 2022 at 5:48 AM

#100,137

[deleted]

FirstOrderCat t1_is99yzw wrote on October 14, 2022 at 6:10 AM

#100,204

Is the benchmark available somewhere?

londons_explorer t1_is9c1xs wrote on October 14, 2022 at 6:36 AM

#100,284

This seems to basically be injecting a tiny amount of rule-based decision making into a language model...

The physics model is so limited that it can only work in a very tiny number of cases, and the results might as well be hardcoded prompts to inject.

thunderdome t1_is9cigc wrote on October 14, 2022 at 6:42 AM

#100,297

Really interesting. I wish they included more details on the text-to-code models. They share the table showing that increasing the size increases the accuracy, but they apparently never train it higher than 1.5B parameters? It would be interesting to know how much of the remaining error on their benchmark is due to error in the text-to-code generation vs the foundational models. Or even just a standalone accuracy measure.

Regardless, super cool to see the number 100 popping up in benchmarks like this.

NextAGI t1_is9kymu wrote on October 14, 2022 at 8:42 AM

#100,651

Not so surprising to me if you have read the paper showing that language models can transfer reasoning ability to language after being trained on something like code. https://arxiv.org/abs/2201.11473

visarga t1_is9mk63 wrote on October 14, 2022 at 9:06 AM

#100,706

Replying to [deleted] (#100,137)

Not just simulation, LLMs can also benefit from other toys: search, code execution/REPL, sub-requests, calling external APIs.

visarga t1_is9mzus wrote on October 14, 2022 at 9:13 AM

#100,730

Replying to londons_explorer (#100,284)

We need a learned physics model. There's so much video to train on, and it's one of the most neglected modalities.

rePAN6517 t1_is9rrfh wrote on October 14, 2022 at 10:20 AM

#100,928

Maybe we could use the new version of Codex to program a human simulator and let LLMs use the human simulator to help answer any questions related to people.

Co0k1eGal3xy t1_is9yf8p wrote on October 14, 2022 at 11:36 AM

#101,311

>Two baseballs X and Y are released from rest at the same height.
>
>X is heavier than Y.
>
>Which baseball will fall to the ground faster?

Isn't Mind's Eye the ONLY wrong answer?

Acceleration due to gravity is constant, but the opposing force from air resistance is roughly proportional to the air displaced and does not change with mass.

I mean, this is the whole point of Apollo 15's test on the moon.

All of them have wrong explanations, but Mind's Eye is the only one that incorrectly claims they will fall at the same rate under normal real-world conditions.

Proof: Brian Cox visits the world's biggest vacuum | Human Universe - BBC
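
The air-resistance argument above can be checked with a rough numerical sketch. Everything numeric here is an assumption for illustration (a 10 m drop, a 74 mm ball, a drag coefficient of 0.35, sea-level air density, masses of 160 g and 145 g), and `fall_time` is a hypothetical helper using a simple quadratic-drag model, not anything from the paper.

```python
import math


def fall_time(mass_kg, height_m=10.0, diameter_m=0.074, drag_coeff=0.35,
              air_density=1.225, g=9.81, dt=1e-4):
    """Integrate dv/dt = g - (k/m) * v**2 from rest until the ball has fallen height_m.

    k = 0.5 * air_density * drag_coeff * cross-sectional area, so the drag force
    depends on the ball's size and shape but not on its mass.
    """
    area = math.pi * (diameter_m / 2.0) ** 2
    k = 0.5 * air_density * drag_coeff * area
    v = 0.0  # downward speed, m/s
    y = 0.0  # distance fallen, m
    t = 0.0  # elapsed time, s
    while y < height_m:
        a = g - (k / mass_kg) * v * v  # same drag force, divided by a different mass
        v += a * dt
        y += v * dt
        t += dt
    return t


t_heavy = fall_time(0.160)             # assumed heavier baseball X
t_light = fall_time(0.145)             # assumed lighter baseball Y
t_vacuum = math.sqrt(2 * 10.0 / 9.81)  # closed-form time with no air

print(f"heavy: {t_heavy:.4f} s, light: {t_light:.4f} s, vacuum: {t_vacuum:.4f} s")
```

Under these assumptions both balls take longer than in a vacuum, and the heavier one lands marginally sooner. The gap is small for a short drop, which is consistent with both sides of the argument in this thread: the effect is real, but easy to ignore.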

Even_Tangerine_800 t1_is9zy95 wrote on October 14, 2022 at 11:51 AM

#101,383

Replying to Co0k1eGal3xy (#101,311)

This is what I got: GPT-3 Answer.

Apparently, the model arrives at the wrong answer without mentioning air resistance. I have tried many times and the results are consistent.

Considering the free-fall rules should be encoded in some textbooks (which should have been included in the pre-training datasets), these results are even more striking to me.

Co0k1eGal3xy t1_isa0dvo wrote on October 14, 2022 at 11:56 AM

#101,407

Replying to Even_Tangerine_800 (#101,383)

The heavier baseball falling to the ground faster is the correct answer. Maybe you misread my post?

It is a shame none of them mention air resistance.

Even_Tangerine_800 t1_isa18wf wrote on October 14, 2022 at 12:04 PM

#101,436

Replying to Co0k1eGal3xy (#101,407)

Are the questions as simple as a = F/m = mg / m = g?

Anyways. If humans put effort into optimizing a tool for accurate simulation, we can treat it more like an alignment problem rather than pure scientific judgment.

You can update the knowledge in the physics engine if you want.
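
The a = F/m = mg/m = g step above holds only when gravity is the sole force. As a sketch, assuming a standard quadratic-drag model (the symbols ρ for air density, C_d for drag coefficient, A for cross-sectional area, and v for downward speed are introduced here and are not from the thread), the mass no longer cancels:

```latex
\begin{aligned}
m\,\frac{dv}{dt} &= m g - k v^{2}, \qquad k = \tfrac{1}{2}\,\rho\, C_d\, A \\
a = \frac{dv}{dt} &= g - \frac{k}{m}\, v^{2} \\
k = 0 \;&\Rightarrow\; a = g \quad \text{(vacuum: independent of } m\text{)} \\
a = 0 \;&\Rightarrow\; v_t = \sqrt{\frac{m g}{k}} \quad \text{(terminal speed: grows with } m\text{)}
\end{aligned}
```

For two balls with the same size and shape, k is identical, so the heavier ball has the larger acceleration at any nonzero speed. An object with a much larger area has a larger k, which is how a heavier but broader object can still fall more slowly.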

Co0k1eGal3xy t1_isa1gbn wrote on October 14, 2022 at 12:06 PM

#101,441

Replying to Even_Tangerine_800 (#101,436)

Oh I agree 100%. This paper is fantastic! (and it's an easy fix)

I definitely want to see further research in this, but the comparison they show here is probably not the comparison they wanted to show haha.

Lajamerr_Mittesdine t1_isa34ib wrote on October 14, 2022 at 12:21 PM

#101,499

Replying to Co0k1eGal3xy (#101,407)

All the answers are incomplete because they don't provide the assumptions necessary to arrive at a complete solution.

A more complete answer would look like this.

>Assuming only gravitational forces, the lighter and heavier baseballs would fall at the same rate and reach the surface at approximately the same time. This can be affected, however, by additional forces that may be present, such as an atmosphere providing resistance based on the surface area, density, and total mass of each object.

Though even that is an incomplete answer.

Co0k1eGal3xy t1_isa44ws wrote on October 14, 2022 at 12:30 PM

#101,547

Replying to Lajamerr_Mittesdine (#101,499)

>because they don't provide the assumptions necessary to arrive at a complete solution.

I agree, but when atmosphere is not mentioned, the default should be updated to STP (0°C temperature and 101.325 kPa pressure) in the future.

eigenlaplace t1_isa6z73 wrote on October 14, 2022 at 12:54 PM

#101,680

Replying to Co0k1eGal3xy (#101,311)

It’s a simple question, no mention of air anywhere… The correct answer is they fall at the same rate.

Co0k1eGal3xy t1_isa7e7b wrote on October 14, 2022 at 12:57 PM

#101,705

Replying to eigenlaplace (#101,680)

I live on planet Earth, where most places have air. It is assumed that there is air unless stated otherwise.

eigenlaplace t1_isa7xzy wrote on October 14, 2022 at 1:01 PM

#101,730

Replying to Co0k1eGal3xy (#101,705)

I live on planet Question where most places have no air. Where is your god now?

Co0k1eGal3xy t1_isa8g7i wrote on October 14, 2022 at 1:05 PM

#101,763

Replying to eigenlaplace (#101,730)

>current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning.

That is my whole point. This paper is trying to avoid "planet Question" and make language models work in the real world instead.

I'm not interested in arguing over this. The paper is good, it just needs a minor correction in a future revision.

Icy-Pause-574 t1_isb5gai wrote on October 14, 2022 at 4:53 PM

#103,319

Replying to londons_explorer (#100,284)

But we are using a physics engine / game engine to create the virtual world, right?

I think this paper shows some potential of using the parallel world to help understand the real world, which is amazing.

I want to call it a milestone for mixed-reality LMs.

AskMoreQuestionsOk t1_isb66lb wrote on October 14, 2022 at 4:58 PM

#103,347

Replying to Co0k1eGal3xy (#101,763)

Actually, I think you make a good point. If you think about understanding conversations and stories and problems like this, you need a model of what you are talking about to even begin to make an accurate assumption about what the next state will be. We make an incredible number of assumptions from our own experience when we build those internal models. How do we know if air friction is important to this problem?

yazriel0 t1_isbv4hp wrote on October 14, 2022 at 7:44 PM

#104,665

Why aren't we doing this for the code domain? Generate programs, try to run them, auto-correct the model?

This can probably be iterated with far more samples than a physical simulator.
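
The generate, run, and correct idea above could look roughly like the loop below. This is only a sketch: `fake_model` is a hypothetical stand-in for a code model (it returns a buggy draft first and a fixed one once it sees feedback), and `run_candidate`, `repair_loop`, and `check_add` are illustrative names, not an existing API.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_candidate(source: str, test: Callable[[dict], None]) -> Optional[str]:
    """Execute candidate source in a fresh namespace, then run the test on it.

    Returns None on success, otherwise the error text to feed back to the generator.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        test(namespace)
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def repair_loop(generate: Callable[[Optional[str]], str],
                test: Callable[[dict], None],
                max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Ask the generator for code, run it, and pass any error back until the test passes."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = generate(feedback)
        feedback = run_candidate(source, test)
        if feedback is None:
            return source, attempt
    return None, max_attempts


def fake_model(feedback: Optional[str]) -> str:
    """Hypothetical code model: the first draft is buggy, the retry is corrected."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def check_add(namespace: dict) -> None:
    """Execution-based check that plays the role of the simulator's verdict."""
    assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"


source, attempts = repair_loop(fake_model, check_add)
print(f"passed after {attempts} attempt(s)")
```

In a real setting the generated code would need to run in a sandbox rather than through a bare `exec`, and the feedback would go into the model's next prompt or into a training signal.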

master3243 t1_isc3kp7 wrote on October 14, 2022 at 8:41 PM

#105,068

Replying to Co0k1eGal3xy (#101,311)

They fall at the same rate

https://www.wired.com/2013/10/do-heavier-objects-really-fall-faster/

Co0k1eGal3xy t1_isc5m6k wrote on October 14, 2022 at 8:55 PM

#105,197

Replying to master3243 (#105,068)

>But what about the basketball and the bowling ball? Shouldn't they have different accelerations? Technically, yes.
>
>[...]
>
>it turns out that there are many situations where a heavier object does indeed hit the ground before a lighter object (because of air resistance).

Your link says the heavy baseball and the light baseball would fall at different rates.

[deleted] t1_isc9s5a wrote on October 14, 2022 at 9:23 PM

#105,395

[deleted]

master3243 t1_iscb9mu wrote on October 14, 2022 at 9:34 PM

#105,493

Replying to Co0k1eGal3xy (#105,197)

My link also says that heavier objects can fall slower than lighter objects, as with the styrofoam board that was heavier than the small ball yet fell slower.

In the absence of more detail such as the dynamics of the shapes and the inclusion of air drag or not, it is fair to say that the most correct answer to the "which" question is "both". I would only count the "heavy first" answer as correct IF it included the discussion on air drag, otherwise the correct answer is "both". But that's my opinion and not objectively the only way to interpret this.

Especially given a model that has so many physics articles/material included in its dataset, it's a pretty big fail that it can't answer this properly.

Co0k1eGal3xy t1_iscd0no wrote on October 14, 2022 at 9:46 PM

#105,579

Replying to master3243 (#105,493)

>In the absence of more detail such as the dynamics of the shapes

Baseballs have a standard diameter and shape.

It's theoretically possible that the heavier baseball has a "furry" surface or something like that, but it's such an unlikely case I didn't consider it when reading the paper.

>it's a pretty big fail that it can't answer this properly.

I emailed the authors and they said "there could be some pre conditions we have not presented in the screenshot" and that they would address it when they released a dataset.

Sounds like it's all sorted out now. No harm done.

master3243 t1_isch3m9 wrote on October 14, 2022 at 10:15 PM

#105,788

Replying to Co0k1eGal3xy (#105,579)

Great