Submitted by Balance- t3_124eyso in MachineLearning

GPT-4 and professional benchmarks: the wrong answer to the wrong question

OpenAI may have tested on the training data. Besides, human benchmarks are meaningless for bots.

Problem 1: training data contamination

To benchmark GPT-4’s coding ability, OpenAI evaluated it on problems from Codeforces, a website that hosts coding competitions. Surprisingly, Horace He pointed out that GPT-4 solved 10/10 pre-2021 problems and 0/10 recent problems in the easy category. The training data cutoff for GPT-4 is September 2021. This strongly suggests that the model is able to memorize solutions from its training set — or at least partly memorize them, enough that it can fill in what it can’t recall.

As further evidence for this hypothesis, we tested it on Codeforces problems from different times in 2021. We found that it could regularly solve problems in the easy category before September 5, but none of the problems after September 12.

In fact, we can definitively show that it has memorized problems in its training set: when prompted with the title of a Codeforces problem, GPT-4 includes a link to the exact contest where the problem appears (and the round number is almost correct: it is off by one). Note that GPT-4 cannot access the Internet, so memorization is the only explanation.

925

Comments

You must log in or register to comment.

rfxap t1_jdzfxd1 wrote

There are other benchmarks to look at though. Microsoft Research tried an early version of GPT-4 on LeetCode problems that were published after the training data cutoff date, and they got results similar to human performance in all difficulty categories: https://arxiv.org/abs/2303.12712 (page 21)

What should we make of that?

277

abc220022 t1_jdzrsbu wrote

Part of the sales pitch behind LeetCode is that you are working on problems that are used in real coding interviews at tech companies. I believe that most LeetCode problems were invented well before they were published on the LeetCode website, so they still could appear in some form in their training data.

350

keepthepace t1_jdzp4ge wrote

Could some parts of the dataset be copied into the LeetCode problem or is there a guarantee that these problems are 100% novel?

53

londons_explorer t1_jdzwcfo wrote

Problems like this are never 100% novel.

There are always elements and concepts of the problem and solution that have been copied from other problems.

The easiest way to see this is to ask a non-programmer to come up with a 'programming puzzle'. They'll probably come up with something like "Make an app to let me know when any of my instagram friends are passing nearby and are up for hanging out".

Compare that to a typical leetcode problem, and you'll soon see how leetcode problems are really only a tiny tiny corner of what is possible to do with computers.

93

currentscurrents t1_je13kdr wrote

True! But also, problems in general are never 100% novel. That's why metalearning works.

You can make up for poor reasoning abilities with lots of experience. This isn't bad exactly, but it makes testing their reasoning abilities tricky.

15

cegras t1_je0gfd7 wrote

If you google most leetcode problems I would bet a coffee that they've existed on the internet long before leetcode came into existence.

27

MrFlamingQueen t1_je0j29h wrote

It feels like majority of the people in this discussion have no idea what computer science is and what LeetCode tests.

As you mentioned, there are hundreds of websites devoted to teaching the leetcode design patterns and entire books devoted to learning and practicing these problems.

33

TheEdes t1_je149kf wrote

Yeah but if you were to come up with a problem in your head that didn't exist word for word then GPT-4 would be doing what they're advertising, however, if the problem was word for word anywhere in the training data then the testing data is contaminated. If the model can learn the design patterns for leetcode style questions by looking at examples of them, then it's doing something really good, if it can only solve problems that it has seen before, then it's nothing special, they just overfit a trillion parameters on a comparatively very small dataset.

8

cegras t1_je2k9dr wrote

ChatGPT is great at learning the nuances of english, i.e. synonyms and metaphors. But if you feed it a reworded leetcode question and it finds the answer within its neural net, has it learned to conceptualize? No, it just learned that synonym ...

8

TheEdes t1_je6tweq wrote

Sure but what's being advertised isn't sentience per se, at least with the leetcode part of their benchmarks. The issue here is that they claim that it can do X% on leetcode, but it seems like it's much less on new data. Even if it learned to find previous solutions and replace it with changes it should be able to perform well due to the nature of the problems.

1

MrFlamingQueen t1_je3kywp wrote

Agreed. It's very likely contamination. Even "new" LeetCode problems existed before they were published on the website.

2

cegras t1_je0jsud wrote

Do you know if ChatGPT was allowed to ingest PDFs found on the internet? Even if not, I'm sure there are many sections of famous textbooks reproduced in HTML or parsable form.

2

ianitic t1_je0mjqx wrote

Oh I haven't tested this on textbooks, but I have asked chatGPT to give me pages of a novel and it did word for word. I suspect it had to have trained on PDFs? I'm highly surprised I haven't seen any news of authors/publishers suing yet tbh.

It is obvious when a book is a part of its training set or not though based on the above test.

10

currentscurrents t1_je12d3k wrote

Nobody knows exactly what it was trained on, but there exist several datasets of published books.

>I'm highly surprised I haven't seen any news of authors/publishers suing yet tbh.

They still might. But they don't have a strong motivation; it doesn't really directly impact their revenue because nobody's going to sit in the chatgpt window and read a 300-page book one prompt at a time.

6

mcilrain t1_je1a7cl wrote

Current tech could be used to allow you to ask an AI assistant to read you a book.

3

MrFlamingQueen t1_je0w3ut wrote

Not sure on the training corpus, but like you mentioned, there's ton of other forms of textbooks and solution manuals to textbook problems on things like github, stackexchange, etc.

3

mcilrain t1_je19vif wrote

Even if it didn't ingest PDFs it probably ingested websites that scraped PDFs to spam search engine results.

1

SzilvasiPeter t1_je4pknf wrote

Should I bet a coffee? No way... that is too much of a deal.

1

VodkaHaze t1_je06t03 wrote

> LeetCode problems that were published after the training data cutoff date

A variation of those problems is likely on github before they're posted?

25

milktoasttraitor t1_jdzuw0z wrote

If you look at the prompt they show, they clearly gave it hints which tell it the exact approach to use in order to solve the problem. The problem is also a very slight derivative of another existing, very popular problem on the platform (“Unique Paths”).

This is impressive in another way, but not in the way they were trying to show. They didn’t show the other questions it got right, so no way of telling how good or bad the methodology was overall or what hints they gave it. For that question at least, it’s not good and it makes me skeptical of the results.

20

RubenC35 t1_je0ra56 wrote

Would they be a little bias? I mean Microsoft has spent loads of money in the idea of being the best.

2

salgat t1_je3eqx5 wrote

GPT4 is the world's best googler. As long as a similar solution existed on the internet in the past, there's a good chance GPT4 can pick it up, even if it's not on leetcode yet.

1

Nhabls t1_je93uvg wrote

The way they defined human performance there is just funny.

Dividing the number of accepted answers by total users.. might as well just make up a number

1

wazis t1_jdz4v8g wrote

If it is true (too lazy to check), it is not surprizing. If it is not than it is also not surprising

109

Seankala t1_jdz6kty wrote

Yeah I read through the whole thing and it's not surprising. Train-test contamination has been a problem for a while now.

67

hadaev t1_jdzcowi wrote

Well we usually expect it from not really ds peoples like biologists using ds methods and making such a trivial mistake.

It doesnt seems hard to search matches in text. Unlike other data types.

13

master3243 t1_jdzec5r wrote

Seeing how they made sure the bar exam and the math olympiad tests were recent ones that were explicitly stated to not be in the training dataset to avoid contamination, I trusted that all the other reported tests were also as carefully picked to avoid contamination.

14

MotionTwelveBeeSix t1_jdzurlg wrote

The bar exams recycle the same questions every year, there’s very little original about them. Its a test of pure memorization

26

jrkirby t1_jdzx1ef wrote

I'm guessing the hard part is that you can't "untrain" a model. They hadn't thought "I want to benchmark on these problems later" when they started. Then they spent 20K$+ compute on training. Then they wanted to test it. You can easily find the stuff you want to test on in your training dataset, sure. But you can't so easily remove it and train everything again from scratch.

7

Thorusss t1_je1z0ib wrote

>Then they spent 20K$+ compute on training.

Your estimate is a few magnitudes too low

9

AuspiciousApple t1_je2aij3 wrote

Idk, thousands of GPUs going brrrr for months, how much can it cost?

$10?

2

jrkirby t1_je2f63r wrote

2 million dollars or 20 million dollars is greater than 20 thousand. And it makes the main thesis more salient - the more money you've spent training, the less willing you'll be to retrain the entire model from scratch just to run some benchmarks the "proper" way.

1

wazis t1_jdzzs1q wrote

Well they can, but it is expensive

3

RossoMarra t1_je16mod wrote

I really think you are underestimating biologists.

2

is_it_fun t1_jdzs7dw wrote

Biologists are such trash nowadays when it comes to any kind of computational / math methods. Back in our grandfather's days they were really hardcore.

−8

mlresearchoor t1_je1mvf7 wrote

OpenAI blatantly ignored the norm to not train on the ~200 tasks collaboratively prepared by the community for BIG-bench. GPT-4 knows the BIG-bench canary ID afaik, which removes the validity of GPT-4 eval on BIG-bench.

OpenAI is cool, but they genuinely don't care about academic research standards or benchmarks carefully created over years by other folks.

92

obolli t1_je4juzh wrote

I think they used to. Things change when you come under the pressure of returning profits.

21

ghostfaceschiller t1_jdz6vzn wrote

I think this was shown awhile ago (like a week ago, which just feels like ten years)

While I do think this is important for several reasons, personally I don't see it as all that impactful for what I consider AI capable of going forward.

That's bc pretty much all my assumptions for the next couple years are based on the idea of systems that can loop and reflect on their own actions, re-edit code based on error messages, etc. Which they are very good at

76

Riboflavius t1_jdzb56p wrote

I was reading your reply and couldn't help thinking that the italics and then the missing period make it look like the end of it is already red-shifted because we're accelerating so fast.

24

-xXpurplypunkXx- t1_je315e6 wrote

In my experience, gpt tends to hallucinate the same incorrect response and refuses to make the directed corrections to code.

6

Simcurious t1_jdzatox wrote

That's not correct, the benchmark they used only contained codeforce problems from after 2021.

From Horace's tweets: >Considering the codeforces results in the paper (very poor!), they might have only evaluated it on recent problems.

45

muskoxnotverydirty t1_jdzi41h wrote

It's correct and it's not correct. The article mentions this, but then they say that it's likely that they weren't able to cleanly separate pre-2021 questions on non-coding benchmarks.

11

bjj_starter t1_jdzo3zq wrote

But that's pure speculation. They showed that a problem existed with training data, and OpenAI had already dealt with that problem and wasn't hiding it at all - GPT-4 wasn't tested on any of that data. Moreover, it's perfectly fine for problems like the ones it will be tested on to be in the training data, as in past problems. What's important is that what it's actually tested on is not in the training data. There is no evidence that it was tested on training data, at this point.

Moreover, the Microsoft Research team was able to repeat some impressive results in a similar domain on tests that didn't exist before the training data cut-off. There isn't any evidence that this is a problem with a widespread effect on performance. It's also worth noting that it seems pretty personal for the guy behind this paper, judging by the way he wrote his tweet.

3

muskoxnotverydirty t1_je027xh wrote

Yeah it's speculation. I agree.

> There is no evidence that it was tested on training data, at this point.

I think what the author is trying to say is that for some of these tests there's no evidence it was tested on training data but there's no evidence that it wasn't. But then the ability to generalize in the specific domain of the tests depends on that difference. If nothing else, it would be nice for those who publish test results to show how much they knew whether test data was in the training data. It seems to me that they could automate a search within the training set to see if exact wordage is used.

11

bjj_starter t1_je2ckb0 wrote

>If nothing else, it would be nice for those who publish test results to show how much they knew whether test data was in the training data.

Yes, we need this and much more information about how it was actually built, what the architecture is, what the training data was, etc. They're not telling us because trade secrets, which sucks. "Open" AI.

1

sb1729 t1_jdzgfff wrote

They mention that in the article.

3

Simcurious t1_jdzhbyu wrote

The title implies that they evaluated on data from before 2021 while the source says they didn't.

15

bjj_starter t1_jdzoafq wrote

This title is misleading. The only thing they found was that GPT-4 was trained on code questions it wasn't tested on.

42

Nhabls t1_je94xwx wrote

Not misleading. The fact it performs so differently on easy problems it has seen Vs not , specially when it fails so spectacularly on the latter does raise big doubts about how corrupted and unreliable their benchmarks might be

6

bjj_starter t1_je98wdx wrote

Okay, but an external team tested it on coding problems which only came into existence after its training finishes, and found human level performance. I don't think your theory explains how that could be the case.

1

Nhabls t1_je9anrq wrote

Which team is that? The one at Microsoft that made up the human performance figures in a completely ridiculous way? Basically "We didn't like that pass rates were too high for humans for the hard problems that the model fails on completely so we just divided the accepted number by the entire user base" oh yeah brilliant

The "human" pass rates are also composed of people learning to code trying to see if their solution works. Its a completely idiotic metric, why not go test randos on the street and declare that represents the human coding performance metric while we're at it

1

mrpickleby t1_jdzg5e8 wrote

Implies that AI will speed the dissemination of information but not necessarily be helpful in creating new thinking.

23

cegras t1_je0g90p wrote

How does the AI perform any better than a Google search? I'd say the AI is even more dangerous as it gives a single, authoritative sounding answer that you have to go to Google and secondary sources to verify anyways!

12

WarmSignificance1 t1_je1pdz9 wrote

I think that ChatGPT has shown how bad so many people are at Googling. And granted, sometimes ChatGPT is just far superior.

But when people say things like "I can ask it how to use a library and it's made me 10x faster over using Google", it just blows my mind. I can usually find the official docs and figure out how to use a library in about the same time as ChatGPT can tell me, without the risk of errors.

12

AquaBadger t1_je2c68z wrote

to be fair, google has gotten slower to find useful information due to the mass of ads and bought results clogging up searches now. But yes, google is still faster than chatgpt and if cleaned up would be even better

9

polygon_primitive t1_je0x04y wrote

For finding answers it's about the same as Google, sometimes better if you then verify the result with external sources, but that's mainly because Google has so badly corrupted their core search product while chasing profit. It's been pretty useful for me for doing the grunt work writing boiler plate code and refactoring stuff tho

3

currentscurrents t1_je14pi5 wrote

Clearly, the accuracy is going to have to get better before it can replace Google. It's pretty accurate when it knows what it's talking about, but if you go "out of bounds" the accuracy drops off a cliff without warning.

But the upside is that it can integrate information from multiple sources and you can interactively ask it questions. Google can't do that.

3

SWESWESWEh t1_je33t7z wrote

I've had a lot more luck solving novel coding problems with the GPT-4 version of chatGPT then Google. If you stick to older tech and libraries like Java and Spring that have been around forever, it's really good at solving fairly difficult problems if you just keep providing context. With Google, it's basically has someone done this exact thing on SO and gotten an answer, if not oh well

2

thelastpizzaslice t1_jdzv7pu wrote

I once asked it for a parody of Miss American Pie about Star Wars Episode 1 and it gave me Weird Al's song verbatim.

9

Gunhild t1_jdzzw8j wrote

Clearly a sign of intelligence; even the AI knows you don't mess with perfection.

25

nixed9 t1_je0cugt wrote

The next logical prompt would be “try again, and make it original.” What happened then?

3

thelastpizzaslice t1_je1pphc wrote

I asked it to write another one from Darth Maul's perspective after that and it did a ducking amazing job.

1

mrdevlar t1_jdzdi2t wrote

Proof that no matter where you go, it is always going to be possible to make simple mistakes.

7

sigmatrophic t1_jdzpk9m wrote

Honestly I paid for GTP 4... It's a bit better but felt like gtp3 before they dumbed it down.

5

krali_ t1_je053hd wrote

I'm considering it, if only for plugin support. Wolfram in particular.

2

currentscurrents t1_je15i85 wrote

That's still on a waitlist unfortunately.

GPT-4 is good but slow, at least for now I mostly still use the GPT-3.5 model.

2

ArnoF7 t1_je0dzqg wrote

Funnily, I actually found GPT-4 far worse than what I expected in terms of coding, especially after I looked at its impressive performance on other exams. I guess it’s still a progress in terms of LLM for coding, maybe just a little underwhelming compared to other standardized tests it aces? GPT-4’s performance on codeforces is borderline abhorrent.

And now you are telling me there is data leakage, so the actual performance would be even worse than what’s on paper???

5

meister2983 t1_je0s90f wrote

GPT-4 is an extremely good pattern matcher - probably one of the best ever made. Most exams made seem to be able to executed with straight-forward pattern matching (with no backtracking). The same thing applies to basic coding questions - it reasonably performs at the level of a human gluing stack overflow solutions together (with the obvious variable renaming/moving lines around/removing dead code/etc.)

It struggles at logical reasoning (when it can't "pattern match" the logical reasoning to something it's trained on).

Coding example:

  • Had no problem writing a tax calculator for ordinary income with progressive tax brackets
  • It struggles to write a program to calculate tax on long term capital gains (US tax code), which is very similar to the above, except has an offset (you start bracket indexing at ordinary income). I'd think this is actually pretty easy for a CS student especially if they saw the solution above -- GPT4 struggled though as it doesn't really "reason" about code the way a human would and would generate solutions obviously wrong to a human.
14

DaBobcat t1_je12b4q wrote

Here OpenAI and Microsoft were evaluating GPT4 on medical problems. In section 6.2 they specifically said that they found strong evidence that it was trained on "popular datasets like SQuAD 2.0 and the Newsgroup Sentiment Analysis datasets". In the appendix section B they explain how they measured whether it saw something in the training data. Point is, I think benchmarks are quite pointless if the training dataset is private and no one can verify that they did not train it on the test set, which they specifically said that in many cases it did

5

pale2hall t1_je2begu wrote

Chat GPT-4 can't remember it's writing a FireFox add-on not a Chrome Extension.

It's like the most amazing coder ever, but always half-drunk, and completely confident, and always. Here's how almost every single response started after the first....

  • Apologies for the incomplete response.
  • Apologies for the confusion. The Express server I provided earlier ...
  • I apologize for the inconvenience. After reviewing the code, I've noticed some inconsistencies in the code
  • I apologize for the confusion. It appears that the context menu was removed due to a typo in the content.js file.
  • I apologize for the confusion. To make the changes you requested, follow the steps below:
  • Apologies for the confusion, and thank you for providing the additional information. Here's an updated implementation that should resolve the issues:
  • I apologize for the confusion. Here's an updated solution that should display the response in the popup window and clear the input field on submit. Additionally, I added an indicator that shows the addon is thinking.
  • Apologies for the confusion, and thank you for the clarification. Based on your requirement, you can make the following changes:
  • Apologies for the confusion. You are correct that you cannot trigger the reviseMyComment() function in the content script without sending a message from the background script.
  • My apologies for the confusion. The error you are encountering is because the sendToOpenAI() function is not available in the content script content.js
  • Apologies for the confusion. I made an error in my previous response.
3

VertexMachine t1_jdzehvy wrote

Interesting. Potentially something that might be also used in the ongoing lawsuit against copilot?

2

visarga t1_jdzr4tp wrote

This paper scared me more than any other ML paper. I hoped we have 2-3 more years until what they show in there.

2

ReasonablyBadass t1_je0rwr3 wrote

Is it possible the older questions are by now about better known problems so more training data existed for them and the newer are about newer concepts, not really represented on the net yet?

1

nomadiclizard t1_je0u5ex wrote

Haha amateurs. I learned not to make that mistake when I split a pose estimation visual dataset into training and validation, but lots of the frames were almost-duplicates so it got contaminated that way. >.<

1

fiftyfourseventeen t1_je0z514 wrote

That's exactly what happened here lol, they only deduplicated by exact duplicate text so there was lots of similar data in both sets

2

thorax t1_je107vs wrote

I'm working on an extreme usage model for leveraging GPT4 to generate code, and it's rather good. Not perfect, but impressive is an understatement.

1

regalalgorithm t1_je1eu1e wrote

FYI, the GPT 4 paper has a whole section on contamination in the appendix - I found it to be pretty convince. Removing contaminatimg data did make it worse at some benchmarks, but also better at others, and overall it wasn't a huge effect.

1

StellaAthena t1_je3tz04 wrote

I found this analysis incredibly unconvincing. They used a weaker standard for deduplication than is standard as well as a weaker analysis than the one they did for the GPT-3 paper.

4

notforrob t1_je1lowh wrote

This inspired me to ask GPT-4:
"Can you generate a leetcode easy problem that has never been seen?"

And then ask it to solve the problem it creates. In the few cases I tried it failed miserably.

1

pmirallesr t1_je2tf2v wrote

Idk, the procedure to check for contamination described in the release report sounded solid at first glance, and I don't see how this news changes that

1

_sbmaruf t1_je369s5 wrote

Sorry for self posting my work here. But you can take a look at our recent work, https://arxiv.org/abs/2303.03004

1

WarmSignificance1 t1_je58y3c wrote

Looks interesting. Have you tried any of the GPT models against this benchmark?

1

_sbmaruf t1_je8iuvl wrote

We just released the dataset last week. We are in the process of training some autoregressive models.

1

Coffee_Crisis t1_je392lv wrote

If you search GitHub for unusual variable names or keywords you will often find code that looks very similar to the stuff GPT spits out, in some domains it’s much more copy paste than people think

1

purplebrown_updown t1_je3xwqa wrote

Question. I’m guessing they want to continuously feed more data to gpt so how do they avoid using up all their training. Is this what’s called data leakage?

1

joeiyoma t1_je42q2t wrote

Chatgpt always have the potential for error, 4 version has a reduced potential for error. My biggest worry is what it will do our creativity. Autopilot all the time!

1

Calamero t1_je4doo0 wrote

It will enable creative people to bring their ideas to reality. It won’t make people less creative. AI technology democratizes the execution part, making it easier for people from all walks of life to transform their visions into reality. It will augment human creativity rather than stifling it.

1

joeiyoma t1_je9f2ma wrote

That is the utopia, and we all want it!

2

NoRip7374 t1_je6e6rd wrote

At least some good news!

1

joeiyoma t1_je9fegm wrote

There is a lot of Buzz about prompt engineering, can it cut as a skill-set going forward or is just a hype that will out with time.

1

Seankala t1_jdz53mn wrote

It'd be nice to see the qualifications of the authors.

0

currentscurrents t1_je17y5v wrote

>Why are deep learning technologists so overconfident

>A Narayanan, S Kapoor

>Substack newsletter. AI Snake Oil

You can get your blogposts listed on Google Scholar?

1

Puzzleheaded_Acadia1 t1_je11l0o wrote

So does that mean that gpt 4 can't think critically? and if not can we make a new kind of ML like LLMs and llama that can think critically and integrated to gpt 4 so it becomes a multimodel that can "see" and think critically.

0

pengo t1_je7vr2t wrote

Yes, it can think critically, it just doesn't tell you whether it is or isn't at any one time.

1

HonkyTonkPolicyWonk t1_je1mqdp wrote

Well, yeah, ChatGTP is auto-suggest on steroids. It can’t create anything de novo. It reframes and regurgitates what others have done.

No surprises here

0

trajo123 t1_je2gie9 wrote

How much of the code that devs write on a typical day is truly novel and not just a rehash / combination / adaptation of existing stuff?

He who has not copied code from stackoverflow, let him cast the first insult at ChatGPT.

0

WarmSignificance1 t1_je57s2a wrote

So I actually think that senior devs copy and paste a lot less than everyone imagines.

I can’t remember the last time I’ve copied code from StackOverflow. Actually, I rarely even use StackOverflow at this point. Going directly to the official docs is always best.

1

kesisci123 t1_je0breh wrote

Big memorization machine.

−1

plocco-tocco t1_jdzpyf8 wrote

I do not see any evidence of this happening in the article. Also, OpenAI claims to have checked for contamination in every benchmark, so I don't see what the author's are trying to show here.

−2

Wtiaw t1_je04aq2 wrote

> Note that GPT-4 cannot access the Internet, so memorization is the only explanation

this is not true, it was shown through jailbreaks that it could access the internet

−5

truchisoft t1_je05j63 wrote

The funny thing about these posts is that this is clearly propaganda aimed to low effort people.

Anyone caring about this is either blinded by their own prejudice or just too dumb to even try GPT once themselves.

Everyone else does not need someone telling them that even GPT3.5 is incredible for coding (and a lot of other stuff), it is not perfect but it goes a long way, heck, I was even able to make a simple game in less than 3 hours using 99% GPT3.5 code and DALL-E sprites.

−10

austacious t1_je0g6oi wrote

A healthy skepticism in AIML from those in the field is incredibly important and relatively hard to come by. Having the attitude that 'This is great and everything is wonderful' does not lead to meaningful progress addressing very real issues. It's very productive to point out shortcomings of otherwise highly effective models.

12

truchisoft t1_je0un8g wrote

Oh no no, not my argument here, but the whole title wording looks like a sleazy attack, this is not criticism but seems like a hit piece, since like other commenters mention, other independent tests were ran on GPT4 already and people is already using GPT4 for coding.

0

st8ic t1_je0lxgn wrote

"bro it's great trust me" isn't exactly a scientific way to think about these issues.

9

visarga t1_je0zqxm wrote

ML people spend all day thinking about model limitations and errors, it's only normal that we are not so easily swayed by a non-peer reviewed paper declaring first contact with AGI. Especially from MS who owns 50% of OpenAI

6

truchisoft t1_je1huuz wrote

Point taken, this article tho is also filled with holes tho.

0