rfxap t1_jdzfxd1 wrote on March 28, 2023 at 9:29 AM

There are other benchmarks to look at though. Microsoft Research tried an early version of GPT-4 on LeetCode problems that were published after the training data cutoff date, and they got results similar to human performance in all difficulty categories: https://arxiv.org/abs/2303.12712 (page 21)

What should we make of that?

abc220022 t1_jdzrsbu wrote on March 28, 2023 at 11:54 AM

Part of the sales pitch behind LeetCode is that you are working on problems that are used in real coding interviews at tech companies. I believe that most LeetCode problems were invented well before they were published on the LeetCode website, so they still could appear in some form in their training data.

neonwatty t1_je06w3v wrote on March 28, 2023 at 1:58 PM

absolutely

keepthepace t1_jdzp4ge wrote on March 28, 2023 at 11:26 AM

Could some parts of the dataset be copied into the LeetCode problem or is there a guarantee that these problems are 100% novel?

londons_explorer t1_jdzwcfo wrote on March 28, 2023 at 12:36 PM

Problems like this are never 100% novel.

There are always elements and concepts of the problem and solution that have been copied from other problems.

The easiest way to see this is to ask a non-programmer to come up with a 'programming puzzle'. They'll probably come up with something like "Make an app to let me know when any of my instagram friends are passing nearby and are up for hanging out".

Compare that to a typical leetcode problem, and you'll soon see how leetcode problems are really only a tiny tiny corner of what is possible to do with computers.

currentscurrents t1_je13kdr wrote on March 28, 2023 at 5:30 PM

True! But also, problems in general are never 100% novel. That's why metalearning works.

You can make up for poor reasoning abilities with lots of experience. This isn't bad exactly, but it makes testing their reasoning abilities tricky.

cegras t1_je0gfd7 wrote on March 28, 2023 at 3:03 PM

If you google most leetcode problems I would bet a coffee that they've existed on the internet long before leetcode came into existence.

MrFlamingQueen t1_je0j29h wrote on March 28, 2023 at 3:20 PM

It feels like majority of the people in this discussion have no idea what computer science is and what LeetCode tests.

As you mentioned, there are hundreds of websites devoted to teaching the leetcode design patterns and entire books devoted to learning and practicing these problems.

TheEdes t1_je149kf wrote on March 28, 2023 at 5:34 PM

Yeah but if you were to come up with a problem in your head that didn't exist word for word then GPT-4 would be doing what they're advertising, however, if the problem was word for word anywhere in the training data then the testing data is contaminated. If the model can learn the design patterns for leetcode style questions by looking at examples of them, then it's doing something really good, if it can only solve problems that it has seen before, then it's nothing special, they just overfit a trillion parameters on a comparatively very small dataset.

cegras t1_je2k9dr wrote on March 28, 2023 at 11:09 PM

ChatGPT is great at learning the nuances of english, i.e. synonyms and metaphors. But if you feed it a reworded leetcode question and it finds the answer within its neural net, has it learned to conceptualize? No, it just learned that synonym ...

TheEdes t1_je6tweq wrote on March 29, 2023 at 8:46 PM

Sure but what's being advertised isn't sentience per se, at least with the leetcode part of their benchmarks. The issue here is that they claim that it can do X% on leetcode, but it seems like it's much less on new data. Even if it learned to find previous solutions and replace it with changes it should be able to perform well due to the nature of the problems.

MrFlamingQueen t1_je3kywp wrote on March 29, 2023 at 3:54 AM

Agreed. It's very likely contamination. Even "new" LeetCode problems existed before they were published on the website.

cegras t1_je0jsud wrote on March 28, 2023 at 3:25 PM

Do you know if ChatGPT was allowed to ingest PDFs found on the internet? Even if not, I'm sure there are many sections of famous textbooks reproduced in HTML or parsable form.

ianitic t1_je0mjqx wrote on March 28, 2023 at 3:42 PM

Oh I haven't tested this on textbooks, but I have asked chatGPT to give me pages of a novel and it did word for word. I suspect it had to have trained on PDFs? I'm highly surprised I haven't seen any news of authors/publishers suing yet tbh.

It is obvious when a book is a part of its training set or not though based on the above test.

currentscurrents t1_je12d3k wrote on March 28, 2023 at 5:22 PM

Nobody knows exactly what it was trained on, but there exist several datasets of published books.

>I'm highly surprised I haven't seen any news of authors/publishers suing yet tbh.

They still might. But they don't have a strong motivation; it doesn't really directly impact their revenue because nobody's going to sit in the chatgpt window and read a 300-page book one prompt at a time.

mcilrain t1_je1a7cl wrote on March 28, 2023 at 6:11 PM

Current tech could be used to allow you to ask an AI assistant to read you a book.

DreamWithinAMatrix t1_je3c6kl wrote on March 29, 2023 at 2:39 AM

There was that time Google was taken to court for scanning and indexing books for Google Books or whatever and Google won:

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,_Inc.

MrFlamingQueen t1_je0w3ut wrote on March 28, 2023 at 4:43 PM

Not sure on the training corpus, but like you mentioned, there's ton of other forms of textbooks and solution manuals to textbook problems on things like github, stackexchange, etc.

mcilrain t1_je19vif wrote on March 28, 2023 at 6:09 PM

Even if it didn't ingest PDFs it probably ingested websites that scraped PDFs to spam search engine results.

SzilvasiPeter t1_je4pknf wrote on March 29, 2023 at 12:15 PM

Should I bet a coffee? No way... that is too much of a deal.

VodkaHaze t1_je06t03 wrote on March 28, 2023 at 1:57 PM

> LeetCode problems that were published after the training data cutoff date

A variation of those problems is likely on github before they're posted?

milktoasttraitor t1_jdzuw0z wrote on March 28, 2023 at 12:23 PM

If you look at the prompt they show, they clearly gave it hints which tell it the exact approach to use in order to solve the problem. The problem is also a very slight derivative of another existing, very popular problem on the platform (“Unique Paths”).

This is impressive in another way, but not in the way they were trying to show. They didn’t show the other questions it got right, so no way of telling how good or bad the methodology was overall or what hints they gave it. For that question at least, it’s not good and it makes me skeptical of the results.

keepthepace t1_jdzm2ic wrote on March 28, 2023 at 10:52 AM

That articles with peer-review is not something that should be avoided, even by Microsoft AI, sorry, "Open"AI

hardmaru t1_jdznq2v wrote on March 28, 2023 at 11:11 AM

Not sure if this article has been peer reviewed

But saw some “peer reviews” on Twitter :)

See: https://twitter.com/sleepinyourhat/status/1638988283018465300

RubenC35 t1_je0ra56 wrote on March 28, 2023 at 4:12 PM

Would they be a little bias? I mean Microsoft has spent loads of money in the idea of being the best.

salgat t1_je3eqx5 wrote on March 29, 2023 at 3:00 AM

GPT4 is the world's best googler. As long as a similar solution existed on the internet in the past, there's a good chance GPT4 can pick it up, even if it's not on leetcode yet.

Nhabls t1_je93uvg wrote on March 30, 2023 at 9:14 AM

The way they defined human performance there is just funny.

Dividing the number of accepted answers by total users.. might as well just make up a number

wazis t1_jdz4v8g wrote on March 28, 2023 at 6:47 AM

If it is true (too lazy to check), it is not surprizing. If it is not than it is also not surprising

Seankala t1_jdz6kty wrote on March 28, 2023 at 7:11 AM

Yeah I read through the whole thing and it's not surprising. Train-test contamination has been a problem for a while now.

hadaev t1_jdzcowi wrote on March 28, 2023 at 8:41 AM

Well we usually expect it from not really ds peoples like biologists using ds methods and making such a trivial mistake.

It doesnt seems hard to search matches in text. Unlike other data types.

master3243 t1_jdzec5r wrote on March 28, 2023 at 9:06 AM

Seeing how they made sure the bar exam and the math olympiad tests were recent ones that were explicitly stated to not be in the training dataset to avoid contamination, I trusted that all the other reported tests were also as carefully picked to avoid contamination.

MotionTwelveBeeSix t1_jdzurlg wrote on March 28, 2023 at 12:21 PM

The bar exams recycle the same questions every year, there’s very little original about them. Its a test of pure memorization

jrkirby t1_jdzx1ef wrote on March 28, 2023 at 12:41 PM

I'm guessing the hard part is that you can't "untrain" a model. They hadn't thought "I want to benchmark on these problems later" when they started. Then they spent 20K$+ compute on training. Then they wanted to test it. You can easily find the stuff you want to test on in your training dataset, sure. But you can't so easily remove it and train everything again from scratch.

Thorusss t1_je1z0ib wrote on March 28, 2023 at 8:44 PM

>Then they spent 20K$+ compute on training.

Your estimate is a few magnitudes too low

AuspiciousApple t1_je2aij3 wrote on March 28, 2023 at 9:59 PM

Idk, thousands of GPUs going brrrr for months, how much can it cost?

$10?

jrkirby t1_je2f63r wrote on March 28, 2023 at 10:32 PM

2 million dollars or 20 million dollars is greater than 20 thousand. And it makes the main thesis more salient - the more money you've spent training, the less willing you'll be to retrain the entire model from scratch just to run some benchmarks the "proper" way.

wazis t1_jdzzs1q wrote on March 28, 2023 at 1:04 PM

Well they can, but it is expensive

RossoMarra t1_je16mod wrote on March 28, 2023 at 5:49 PM

I really think you are underestimating biologists.

[deleted] t1_jdzh4h0 wrote on March 28, 2023 at 9:47 AM

[deleted]

is_it_fun t1_jdzs7dw wrote on March 28, 2023 at 11:58 AM

Biologists are such trash nowadays when it comes to any kind of computational / math methods. Back in our grandfather's days they were really hardcore.

[deleted] t1_jdztsmd wrote on March 28, 2023 at 12:13 PM

[removed]

ppff01 t1_jdzhmhs wrote on March 28, 2023 at 9:54 AM

then*

marr75 t1_je14tki wrote on March 28, 2023 at 5:38 PM

me irl

Historical-Tree9132 t1_je17bln wrote on March 28, 2023 at 5:53 PM

0/24 0/12 on code problems it never seen before really surprised me

CollectionLeather292 t1_jdzzazr wrote on March 28, 2023 at 1:00 PM

Tl:dr

mlresearchoor t1_je1mvf7 wrote on March 28, 2023 at 7:30 PM

OpenAI blatantly ignored the norm to not train on the ~200 tasks collaboratively prepared by the community for BIG-bench. GPT-4 knows the BIG-bench canary ID afaik, which removes the validity of GPT-4 eval on BIG-bench.

OpenAI is cool, but they genuinely don't care about academic research standards or benchmarks carefully created over years by other folks.

obolli t1_je4juzh wrote on March 29, 2023 at 11:18 AM

I think they used to. Things change when you come under the pressure of returning profits.

mr_house7 t1_je5iuk0 wrote on March 29, 2023 at 3:47 PM

Microsoft is the one in charge now.

ghostfaceschiller t1_jdz6vzn wrote on March 28, 2023 at 7:15 AM

I think this was shown awhile ago (like a week ago, which just feels like ten years)

While I do think this is important for several reasons, personally I don't see it as all that impactful for what I consider AI capable of going forward.

That's bc pretty much all my assumptions for the next couple years are based on the idea of systems that can loop and reflect on their own actions, re-edit code based on error messages, etc. Which they are very good at

Riboflavius t1_jdzb56p wrote on March 28, 2023 at 8:17 AM

I was reading your reply and couldn't help thinking that the italics and then the missing period make it look like the end of it is already red-shifted because we're accelerating so fast.

ghostfaceschiller t1_jdzduen wrote on March 28, 2023 at 8:58 AM

Looolllll

-xXpurplypunkXx- t1_je315e6 wrote on March 29, 2023 at 1:14 AM

In my experience, gpt tends to hallucinate the same incorrect response and refuses to make the directed corrections to code.

ghostfaceschiller t1_je3abdo wrote on March 29, 2023 at 2:24 AM

Really? I def had that some with 3.5 but 4 has been v good. Not perfect obviously

joeiyoma t1_je42w1a wrote on March 29, 2023 at 7:23 AM

So you can imagine, when you are using it and have no clue!

ghostfaceschiller t1_je44ke8 wrote on March 29, 2023 at 7:47 AM

What?

Nhabls t1_je951xn wrote on March 30, 2023 at 9:32 AM

Are they now? Why are you writing empty stuff. Why is this inane stuff so upvoted. Jfc

ghostfaceschiller t1_jeagl0y wrote on March 30, 2023 at 4:23 PM

What?

Simcurious t1_jdzatox wrote on March 28, 2023 at 8:13 AM

That's not correct, the benchmark they used only contained codeforce problems from after 2021.

From Horace's tweets: >Considering the codeforces results in the paper (very poor!), they might have only evaluated it on recent problems.

muskoxnotverydirty t1_jdzi41h wrote on March 28, 2023 at 10:01 AM

It's correct and it's not correct. The article mentions this, but then they say that it's likely that they weren't able to cleanly separate pre-2021 questions on non-coding benchmarks.

bjj_starter t1_jdzo3zq wrote on March 28, 2023 at 11:16 AM

But that's pure speculation. They showed that a problem existed with training data, and OpenAI had already dealt with that problem and wasn't hiding it at all - GPT-4 wasn't tested on any of that data. Moreover, it's perfectly fine for problems like the ones it will be tested on to be in the training data, as in past problems. What's important is that what it's actually tested on is not in the training data. There is no evidence that it was tested on training data, at this point.

Moreover, the Microsoft Research team was able to repeat some impressive results in a similar domain on tests that didn't exist before the training data cut-off. There isn't any evidence that this is a problem with a widespread effect on performance. It's also worth noting that it seems pretty personal for the guy behind this paper, judging by the way he wrote his tweet.

muskoxnotverydirty t1_je027xh wrote on March 28, 2023 at 1:24 PM

Yeah it's speculation. I agree.

> There is no evidence that it was tested on training data, at this point.

I think what the author is trying to say is that for some of these tests there's no evidence it was tested on training data but there's no evidence that it wasn't. But then the ability to generalize in the specific domain of the tests depends on that difference. If nothing else, it would be nice for those who publish test results to show how much they knew whether test data was in the training data. It seems to me that they could automate a search within the training set to see if exact wordage is used.

bjj_starter t1_je2ckb0 wrote on March 28, 2023 at 10:14 PM

>If nothing else, it would be nice for those who publish test results to show how much they knew whether test data was in the training data.

Yes, we need this and much more information about how it was actually built, what the architecture is, what the training data was, etc. They're not telling us because trade secrets, which sucks. "Open" AI.

sb1729 t1_jdzgfff wrote on March 28, 2023 at 9:37 AM

They mention that in the article.

Simcurious t1_jdzhbyu wrote on March 28, 2023 at 9:50 AM

The title implies that they evaluated on data from before 2021 while the source says they didn't.

bjj_starter t1_jdzoafq wrote on March 28, 2023 at 11:18 AM

This title is misleading. The only thing they found was that GPT-4 was trained on code questions it wasn't tested on.

Nhabls t1_je94xwx wrote on March 30, 2023 at 9:31 AM

Not misleading. The fact it performs so differently on easy problems it has seen Vs not , specially when it fails so spectacularly on the latter does raise big doubts about how corrupted and unreliable their benchmarks might be

bjj_starter t1_je98wdx wrote on March 30, 2023 at 10:25 AM

Okay, but an external team tested it on coding problems which only came into existence after its training finishes, and found human level performance. I don't think your theory explains how that could be the case.

Nhabls t1_je9anrq wrote on March 30, 2023 at 10:48 AM

Which team is that? The one at Microsoft that made up the human performance figures in a completely ridiculous way? Basically "We didn't like that pass rates were too high for humans for the hard problems that the model fails on completely so we just divided the accepted number by the entire user base" oh yeah brilliant

The "human" pass rates are also composed of people learning to code trying to see if their solution works. Its a completely idiotic metric, why not go test randos on the street and declare that represents the human coding performance metric while we're at it

[deleted] t1_je15c4c wrote on March 28, 2023 at 5:41 PM

[deleted]

All-DayErrDay t1_je1g2d8 wrote on March 28, 2023 at 6:47 PM

Exactly!

mrpickleby t1_jdzg5e8 wrote on March 28, 2023 at 9:33 AM

Implies that AI will speed the dissemination of information but not necessarily be helpful in creating new thinking.

cegras t1_je0g90p wrote on March 28, 2023 at 3:01 PM

How does the AI perform any better than a Google search? I'd say the AI is even more dangerous as it gives a single, authoritative sounding answer that you have to go to Google and secondary sources to verify anyways!

WarmSignificance1 t1_je1pdz9 wrote on March 28, 2023 at 7:45 PM

I think that ChatGPT has shown how bad so many people are at Googling. And granted, sometimes ChatGPT is just far superior.

But when people say things like "I can ask it how to use a library and it's made me 10x faster over using Google", it just blows my mind. I can usually find the official docs and figure out how to use a library in about the same time as ChatGPT can tell me, without the risk of errors.

AquaBadger t1_je2c68z wrote on March 28, 2023 at 10:11 PM

to be fair, google has gotten slower to find useful information due to the mass of ads and bought results clogging up searches now. But yes, google is still faster than chatgpt and if cleaned up would be even better

polygon_primitive t1_je0x04y wrote on March 28, 2023 at 4:49 PM

For finding answers it's about the same as Google, sometimes better if you then verify the result with external sources, but that's mainly because Google has so badly corrupted their core search product while chasing profit. It's been pretty useful for me for doing the grunt work writing boiler plate code and refactoring stuff tho

currentscurrents t1_je14pi5 wrote on March 28, 2023 at 5:37 PM

Clearly, the accuracy is going to have to get better before it can replace Google. It's pretty accurate when it knows what it's talking about, but if you go "out of bounds" the accuracy drops off a cliff without warning.

But the upside is that it can integrate information from multiple sources and you can interactively ask it questions. Google can't do that.

SWESWESWEh t1_je33t7z wrote on March 29, 2023 at 1:34 AM

I've had a lot more luck solving novel coding problems with the GPT-4 version of chatGPT then Google. If you stick to older tech and libraries like Java and Spring that have been around forever, it's really good at solving fairly difficult problems if you just keep providing context. With Google, it's basically has someone done this exact thing on SO and gotten an answer, if not oh well

hardmaru t1_jdz5yb8 wrote on March 28, 2023 at 7:02 AM

Hi /u/Balance-

Can you fix the formatting of the URL in your post?

The URL should be https://aisnakeoil.substack.com/p/gpt-4-and-professional-benchmarks

Balance- OP t1_jdz6brh wrote on March 28, 2023 at 7:07 AM

Whoops, done!

hardmaru t1_jdz6md2 wrote on March 28, 2023 at 7:11 AM

Thanks!

thelastpizzaslice t1_jdzv7pu wrote on March 28, 2023 at 12:26 PM

I once asked it for a parody of Miss American Pie about Star Wars Episode 1 and it gave me Weird Al's song verbatim.

Gunhild t1_jdzzw8j wrote on March 28, 2023 at 1:05 PM

Clearly a sign of intelligence; even the AI knows you don't mess with perfection.

nixed9 t1_je0cugt wrote on March 28, 2023 at 2:39 PM

The next logical prompt would be “try again, and make it original.” What happened then?

currentscurrents t1_je1ai1i wrote on March 28, 2023 at 6:13 PM

I asked it for a parody and got something similar to, but different from Weird Al's song: https://pastebin.com/FKrZiEi9

When I asked it to be original I got quite different lyrics: https://pastebin.com/uwpqAnyz

Here's the actual lyrics for reference. This reminds me of how you can get LLMs to be less toxic/biased just by telling them to treat people fairly.

thelastpizzaslice t1_je1pphc wrote on March 28, 2023 at 7:47 PM

I asked it to write another one from Darth Maul's perspective after that and it did a ducking amazing job.

mrdevlar t1_jdzdi2t wrote on March 28, 2023 at 8:53 AM

Proof that no matter where you go, it is always going to be possible to make simple mistakes.

jabowery t1_je107nj wrote on March 28, 2023 at 5:09 PM

See these entries in the Hutter Prize FAQ:

Why aren't cross-validation or train/test-set used for evaluation?

Why is (sequential) compression superior to other learning paradigms?

Why is Compressor Length superior to other Regularizations?

Why not use Perplexity, as most big language models do?

Is Ockham's razor and hence compression sufficient for AGI?

sigmatrophic t1_jdzpk9m wrote on March 28, 2023 at 11:31 AM

Honestly I paid for GTP 4... It's a bit better but felt like gtp3 before they dumbed it down.

krali_ t1_je053hd wrote on March 28, 2023 at 1:45 PM

I'm considering it, if only for plugin support. Wolfram in particular.

currentscurrents t1_je15i85 wrote on March 28, 2023 at 5:42 PM

That's still on a waitlist unfortunately.

GPT-4 is good but slow, at least for now I mostly still use the GPT-3.5 model.

ArnoF7 t1_je0dzqg wrote on March 28, 2023 at 2:46 PM

Funnily, I actually found GPT-4 far worse than what I expected in terms of coding, especially after I looked at its impressive performance on other exams. I guess it’s still a progress in terms of LLM for coding, maybe just a little underwhelming compared to other standardized tests it aces? GPT-4’s performance on codeforces is borderline abhorrent.

And now you are telling me there is data leakage, so the actual performance would be even worse than what’s on paper???

meister2983 t1_je0s90f wrote on March 28, 2023 at 4:18 PM

GPT-4 is an extremely good pattern matcher - probably one of the best ever made. Most exams made seem to be able to executed with straight-forward pattern matching (with no backtracking). The same thing applies to basic coding questions - it reasonably performs at the level of a human gluing stack overflow solutions together (with the obvious variable renaming/moving lines around/removing dead code/etc.)

It struggles at logical reasoning (when it can't "pattern match" the logical reasoning to something it's trained on).

Coding example:

Had no problem writing a tax calculator for ordinary income with progressive tax brackets
It struggles to write a program to calculate tax on long term capital gains (US tax code), which is very similar to the above, except has an offset (you start bracket indexing at ordinary income). I'd think this is actually pretty easy for a CS student especially if they saw the solution above -- GPT4 struggled though as it doesn't really "reason" about code the way a human would and would generate solutions obviously wrong to a human.

DaBobcat t1_je12b4q wrote on March 28, 2023 at 5:22 PM

Here OpenAI and Microsoft were evaluating GPT4 on medical problems. In section 6.2 they specifically said that they found strong evidence that it was trained on "popular datasets like SQuAD 2.0 and the Newsgroup Sentiment Analysis datasets". In the appendix section B they explain how they measured whether it saw something in the training data. Point is, I think benchmarks are quite pointless if the training dataset is private and no one can verify that they did not train it on the test set, which they specifically said that in many cases it did

pale2hall t1_je2begu wrote on March 28, 2023 at 10:05 PM

Chat GPT-4 can't remember it's writing a FireFox add-on not a Chrome Extension.

It's like the most amazing coder ever, but always half-drunk, and completely confident, and always. Here's how almost every single response started after the first....

Apologies for the incomplete response.
Apologies for the confusion. The Express server I provided earlier ...
I apologize for the inconvenience. After reviewing the code, I've noticed some inconsistencies in the code
I apologize for the confusion. It appears that the context menu was removed due to a typo in the content.js file.
I apologize for the confusion. To make the changes you requested, follow the steps below:
Apologies for the confusion, and thank you for providing the additional information. Here's an updated implementation that should resolve the issues:
I apologize for the confusion. Here's an updated solution that should display the response in the popup window and clear the input field on submit. Additionally, I added an indicator that shows the addon is thinking.
Apologies for the confusion, and thank you for the clarification. Based on your requirement, you can make the following changes:
Apologies for the confusion. You are correct that you cannot trigger the reviseMyComment() function in the content script without sending a message from the background script.
My apologies for the confusion. The error you are encountering is because the sendToOpenAI() function is not available in the content script content.js
Apologies for the confusion. I made an error in my previous response.

VertexMachine t1_jdzehvy wrote on March 28, 2023 at 9:08 AM

Interesting. Potentially something that might be also used in the ongoing lawsuit against copilot?

Nhabls t1_je94npl wrote on March 30, 2023 at 9:26 AM

Idk why people downvoted you, you are right.

VertexMachine t1_je9dyb9 wrote on March 30, 2023 at 11:25 AM

Because reddit? :D

visarga t1_jdzr4tp wrote on March 28, 2023 at 11:47 AM

This paper scared me more than any other ML paper. I hoped we have 2-3 more years until what they show in there.

ReasonablyBadass t1_je0rwr3 wrote on March 28, 2023 at 4:16 PM

Is it possible the older questions are by now about better known problems so more training data existed for them and the newer are about newer concepts, not really represented on the net yet?

nomadiclizard t1_je0u5ex wrote on March 28, 2023 at 4:30 PM

Haha amateurs. I learned not to make that mistake when I split a pose estimation visual dataset into training and validation, but lots of the frames were almost-duplicates so it got contaminated that way. >.<

fiftyfourseventeen t1_je0z514 wrote on March 28, 2023 at 5:02 PM

That's exactly what happened here lol, they only deduplicated by exact duplicate text so there was lots of similar data in both sets

thorax t1_je107vs wrote on March 28, 2023 at 5:09 PM

I'm working on an extreme usage model for leveraging GPT4 to generate code, and it's rather good. Not perfect, but impressive is an understatement.

regalalgorithm t1_je1eu1e wrote on March 28, 2023 at 6:40 PM

FYI, the GPT 4 paper has a whole section on contamination in the appendix - I found it to be pretty convince. Removing contaminatimg data did make it worse at some benchmarks, but also better at others, and overall it wasn't a huge effect.

StellaAthena t1_je3tz04 wrote on March 29, 2023 at 5:28 AM

I found this analysis incredibly unconvincing. They used a weaker standard for deduplication than is standard as well as a weaker analysis than the one they did for the GPT-3 paper.

notforrob t1_je1lowh wrote on March 28, 2023 at 7:22 PM

This inspired me to ask GPT-4:
"Can you generate a leetcode easy problem that has never been seen?"

And then ask it to solve the problem it creates. In the few cases I tried it failed miserably.

pmirallesr t1_je2tf2v wrote on March 29, 2023 at 12:16 AM

Idk, the procedure to check for contamination described in the release report sounded solid at first glance, and I don't see how this news changes that

_sbmaruf t1_je369s5 wrote on March 29, 2023 at 1:52 AM

Sorry for self posting my work here. But you can take a look at our recent work, https://arxiv.org/abs/2303.03004

WarmSignificance1 t1_je58y3c wrote on March 29, 2023 at 2:43 PM

Looks interesting. Have you tried any of the GPT models against this benchmark?

_sbmaruf t1_je8iuvl wrote on March 30, 2023 at 4:37 AM

We just released the dataset last week. We are in the process of training some autoregressive models.

Coffee_Crisis t1_je392lv wrote on March 29, 2023 at 2:14 AM

If you search GitHub for unusual variable names or keywords you will often find code that looks very similar to the stuff GPT spits out, in some domains it’s much more copy paste than people think

jer_pint t1_je3ffzz wrote on March 29, 2023 at 3:06 AM

on a sort of related note, I tested gpt4's ability to play wordle, and it was pretty bad. I think it has to do with the fact that wordle only existed after gpt cutoff: https://www.jerpint.io/blog/gpt-wordle/

purplebrown_updown t1_je3xwqa wrote on March 29, 2023 at 6:16 AM

Question. I’m guessing they want to continuously feed more data to gpt so how do they avoid using up all their training. Is this what’s called data leakage?

joeiyoma t1_je42q2t wrote on March 29, 2023 at 7:20 AM

Chatgpt always have the potential for error, 4 version has a reduced potential for error. My biggest worry is what it will do our creativity. Autopilot all the time!

Calamero t1_je4doo0 wrote on March 29, 2023 at 10:02 AM

It will enable creative people to bring their ideas to reality. It won’t make people less creative. AI technology democratizes the execution part, making it easier for people from all walks of life to transform their visions into reality. It will augment human creativity rather than stifling it.

joeiyoma t1_je9f2ma wrote on March 30, 2023 at 11:37 AM

That is the utopia, and we all want it!

NoRip7374 t1_je6e6rd wrote on March 29, 2023 at 7:05 PM

At least some good news!

[deleted] t1_je6q2mw wrote on March 29, 2023 at 8:21 PM

[removed]

joeiyoma t1_je9fegm wrote on March 30, 2023 at 11:40 AM

There is a lot of Buzz about prompt engineering, can it cut as a skill-set going forward or is just a hype that will out with time.

Seankala t1_jdz53mn wrote on March 28, 2023 at 6:50 AM

It'd be nice to see the qualifications of the authors.

hardmaru t1_jdz62h3 wrote on March 28, 2023 at 7:03 AM

https://scholar.google.com/citations?user=0Bi5CMgAAAAJ&hl=en&oi=ao

https://scholar.google.com/citations?user=MsKX_6kAAAAJ&hl=en&oi=ao

Seankala t1_jdz64gw wrote on March 28, 2023 at 7:04 AM

Thanks!

currentscurrents t1_je17y5v wrote on March 28, 2023 at 5:57 PM

>Why are deep learning technologists so overconfident

>A Narayanan, S Kapoor

>Substack newsletter. AI Snake Oil

You can get your blogposts listed on Google Scholar?

AsliReddington t1_je0mg2b wrote on March 28, 2023 at 3:41 PM

It's a smarter talking parrot is all.

Puzzleheaded_Acadia1 t1_je11l0o wrote on March 28, 2023 at 5:17 PM

So does that mean that gpt 4 can't think critically? and if not can we make a new kind of ML like LLMs and llama that can think critically and integrated to gpt 4 so it becomes a multimodel that can "see" and think critically.

pengo t1_je7vr2t wrote on March 30, 2023 at 1:20 AM

Yes, it can think critically, it just doesn't tell you whether it is or isn't at any one time.

HonkyTonkPolicyWonk t1_je1mqdp wrote on March 28, 2023 at 7:29 PM

Well, yeah, ChatGTP is auto-suggest on steroids. It can’t create anything de novo. It reframes and regurgitates what others have done.

No surprises here

trajo123 t1_je2gie9 wrote on March 28, 2023 at 10:42 PM

How much of the code that devs write on a typical day is truly novel and not just a rehash / combination / adaptation of existing stuff?

He who has not copied code from stackoverflow, let him cast the first insult at ChatGPT.

WarmSignificance1 t1_je57s2a wrote on March 29, 2023 at 2:35 PM

So I actually think that senior devs copy and paste a lot less than everyone imagines.

I can’t remember the last time I’ve copied code from StackOverflow. Actually, I rarely even use StackOverflow at this point. Going directly to the official docs is always best.

kesisci123 t1_je0breh wrote on March 28, 2023 at 2:32 PM

Big memorization machine.

plocco-tocco t1_jdzpyf8 wrote on March 28, 2023 at 11:35 AM

I do not see any evidence of this happening in the article. Also, OpenAI claims to have checked for contamination in every benchmark, so I don't see what the author's are trying to show here.

Wtiaw t1_je04aq2 wrote on March 28, 2023 at 1:39 PM

> Note that GPT-4 cannot access the Internet, so memorization is the only explanation

this is not true, it was shown through jailbreaks that it could access the internet

gorobotgorobot t1_je0eptw wrote on March 28, 2023 at 2:51 PM

Really? Can you link to examples of that?

truchisoft t1_je05j63 wrote on March 28, 2023 at 1:48 PM

The funny thing about these posts is that this is clearly propaganda aimed to low effort people.

Anyone caring about this is either blinded by their own prejudice or just too dumb to even try GPT once themselves.

Everyone else does not need someone telling them that even GPT3.5 is incredible for coding (and a lot of other stuff), it is not perfect but it goes a long way, heck, I was even able to make a simple game in less than 3 hours using 99% GPT3.5 code and DALL-E sprites.

austacious t1_je0g6oi wrote on March 28, 2023 at 3:01 PM

A healthy skepticism in AIML from those in the field is incredibly important and relatively hard to come by. Having the attitude that 'This is great and everything is wonderful' does not lead to meaningful progress addressing very real issues. It's very productive to point out shortcomings of otherwise highly effective models.

truchisoft t1_je0un8g wrote on March 28, 2023 at 4:34 PM

Oh no no, not my argument here, but the whole title wording looks like a sleazy attack, this is not criticism but seems like a hit piece, since like other commenters mention, other independent tests were ran on GPT4 already and people is already using GPT4 for coding.

st8ic t1_je0lxgn wrote on March 28, 2023 at 3:38 PM

"bro it's great trust me" isn't exactly a scientific way to think about these issues.

visarga t1_je0zqxm wrote on March 28, 2023 at 5:06 PM

ML people spend all day thinking about model limitations and errors, it's only normal that we are not so easily swayed by a non-peer reviewed paper declaring first contact with AGI. Especially from MS who owns 50% of OpenAI

truchisoft t1_je1huuz wrote on March 28, 2023 at 6:58 PM

Point taken, this article tho is also filled with holes tho.

Comments