Submitted by mvujas t3_zo5imc in MachineLearning

I was reading a little bit about ChatGPT's training, which led me to realize how smart a move making it free to use actually is. We know that during training ChatGPT uses human feedback, which is relatively expensive to collect. However, making it free to use and giving users an option to provide feedback opens the door to massive amounts of training data at a relatively low price per training sample (the cost of running the servers). This approach is quite fascinating to me and makes me wonder about other similar examples, so I would like to hear them in the comments if you have any.
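Rough sketch of what I mean by "turning free usage into training data" (all field names and the storage format here are hypothetical, not OpenAI's actual pipeline):

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class FeedbackRecord:
    """One interaction from a free user, reusable later as a training sample."""
    prompt: str
    response: str
    feedback: int      # +1 thumbs up, -1 thumbs down, 0 no explicit signal
    timestamp: float

def log_interaction(prompt: str, response: str, feedback: int,
                    path: str = "feedback.jsonl") -> None:
    """Append one record to a JSONL file; a real system would use a database."""
    record = FeedbackRecord(prompt, response, feedback, time.time())
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example: a user clicks thumbs-up on an answer
log_interaction("Explain RLHF in one sentence.",
                "RLHF fine-tunes a model against a reward model trained on human preferences.",
                feedback=+1)
```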

28

Comments

CriticalTemperature1 t1_j0l6du5 wrote

Most people aren't labelling outputs as good or bad, so how do they get any reward or training signal from these beta users?

17

mvujas OP t1_j0l9aht wrote

That is true, but it's a similar case with crowdsourcing: they have some clever techniques there, such as honeypots and weighted expertise scores (or whatever they are called), to make the most of the data. But I would even argue that continuing a conversation, or even coming back to the website, is a form of positive feedback.

6

Nameless1995 t1_j0lri08 wrote

I just had a thought. I think resampling with the "try again" button itself can be used as feedback (a noisy signal that the user didn't like the earlier version). Moreover, if a user switches back to the earlier sample, that can be another feedback signal (the earlier version being preferred). They can get a lot of data from these. I expect users to use "try again" far more frequently than upvotes/downvotes.
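A rough sketch of how such "try again" events could be turned into preference pairs for a reward model (the event format is made up for illustration, not anything OpenAI has published):

```python
from typing import Iterable

def regenerations_to_pairs(events: Iterable[dict]) -> list[dict]:
    """Turn 'try again' events into (chosen, rejected) preference pairs.

    Assumed event format (hypothetical):
      {"prompt": ..., "first": ..., "second": ..., "user_kept": "first" | "second"}
    Keeping the regenerated answer suggests it was preferred; switching back
    to the original suggests the original was preferred. Both are noisy signals.
    """
    pairs = []
    for e in events:
        kept_first = e["user_kept"] == "first"
        pairs.append({
            "prompt": e["prompt"],
            "chosen": e["first"] if kept_first else e["second"],
            "rejected": e["second"] if kept_first else e["first"],
        })
    return pairs

pairs = regenerations_to_pairs([{
    "prompt": "Write a haiku about winter.",
    "first": "Snow falls on the pines...",
    "second": "Cold wind, quiet night...",
    "user_kept": "second",
}])
print(pairs)  # one noisy preference pair, usable for reward-model training
```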

9

Aggravating-Act-1092 t1_j0lhqqx wrote

I’d agree. You can probably even ask ChatGPT to review the follow-up someone gives it and assign a score based on that.

Personally, if it gives me buggy code I point that out and try to fix it, for example; that's a clear negative. I also sometimes write "thank you" to it when I'm happy with its answer.
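As a toy illustration of that idea (a real system would presumably use a trained classifier or the model itself, not keyword matching):

```python
def score_followup(followup: str) -> int:
    """Map a user's follow-up message to a crude implicit reward.

    +1 for gratitude/acceptance, -1 for complaints about bugs or wrong answers,
    0 when the follow-up tells us nothing. A real pipeline would more likely
    use a learned classifier instead of keywords like these.
    """
    text = followup.lower()
    if any(w in text for w in ("thank", "thanks", "perfect", "works now")):
        return +1
    if any(w in text for w in ("buggy", "doesn't work", "wrong", "error")):
        return -1
    return 0

print(score_followup("Thanks, that fixed it!"))          # +1
print(score_followup("This code is buggy, it crashes"))  # -1
```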

6

fimari t1_j0rpmx0 wrote

Probably the same way Google detects good search results: people stop searching when the result is good, and they stop fiddling around once they have what they want.

2

mvujas OP t1_j0llkjy wrote

Oh, just looking at the answers again, there is actually a feedback button in the top-right corner of each answer. But I would assume that even if only a small percentage of users use this button, it ends up costing less than paying people to do this manually.

3

humanbeingmusic t1_j0nc16t wrote

Was thinking the same thing; that is a reinforcement signal for sure, and there's lots of other data to infer from.

1

mettle t1_j0mrkqp wrote

Lots of implicit signals to look at based on what the user does afterwards.

2

30katz t1_j0m3iuj wrote

Just analyzing questions and gleaning what could be going on would be a gold mine

I’m sure Google can come up with a lot of very profitable metrics

1

RandomIsAMyth t1_j0s0ydc wrote

I don't think that's right. Human inputs are great training signals. Fine-tuning ChatGPT on them (basically trying to predict what the human would have said) has a pretty high value.

They are running ChatGPT for something like $100k a day but getting millions of data points. They think the data they get is worth that $100k. A new version will come soon, and they will probably be able to make better and better training data out of this crowdsourcing experiment.

If supervised learning is the way to go, make the labelling as large-scale as possible. For free, on the simplest website ever. I think they nailed it.
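Taking those figures at face value (both the $100k/day and the interaction count are rough guesses, not published numbers), the back-of-envelope cost per sample would be:

```python
daily_cost_usd = 100_000        # rough figure from this thread, not an official number
daily_interactions = 5_000_000  # hypothetical; "millions of data points" per day

cost_per_sample = daily_cost_usd / daily_interactions
print(f"~${cost_per_sample:.2f} per interaction")  # ~$0.02

# Even a cheap crowdsourced comparison label (say $0.10-$1.00 each)
# would be roughly 5-50x more expensive per sample.
```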

1

ChuckSeven t1_j0u2grw wrote

> But I would even argue that continuing a conversation, or even coming back to the website, is a form of positive feedback.

It is way cheaper to take real conversations and have a crowdworker label it for being a good conversation or a bad conversation.

1

jasondads1 t1_j0kzfdk wrote

They took the same approach with dalle2 before charging for it.

5

mvujas OP t1_j0l0l1n wrote

Does dalle2 use human feedback in any form other than labeling false positives? I haven't played much with dalle2 to be honest, but I can definitely see how they could have been collecting data for a future iteration of the model that may use reinforcement learning in some form.

2

rikliem t1_j0l8wa6 wrote

When generating an image, the one you download is taken as positive feedback. My theory is that if you repeat a prompt twice or more, they can probably label it as a bad result. They could also use the enlarging of pictures after they are generated as additional feedback.

5

DoctorFuu t1_j0lp9js wrote

"Click all the images of cars to prove you're not a bot"

3

mvujas OP t1_j0lrh8u wrote

That's a good one!

2

bartvanh t1_j0mkewt wrote

I wonder how long that will stay around, since by now we must be pretty close to the point (if not over it) where a bot is better at spotting cars than meatbags

4

Agreeable_Bid7037 t1_j0lb3ww wrote

ChatGPT can probably provide some similar examples lol.

2

akRonkIVXX t1_j0nzbkl wrote

Pretty sure this is how Google voice-to-text got so good.

2

ChuckSeven t1_j0u2e2c wrote

It reminded me of Tesla's data engine.

2

mvujas OP t1_j0v81dn wrote

I hadn’t thought much about that, but after reading up on it, it seems like exactly the example I was looking for! Very clever approach!

1

CalligrapherFine6407 t1_j0o20wo wrote

Side Question:

Why does ChatGPT always sound so confident even when it's wrong?

1

Nameless1995 t1_j0olnhi wrote

One reason for confident-sounding responses could be that internet data (which it is trained on) generally consists of confident-sounding answers. Many humans also confidently think they are right while being wrong. Besides, it doesn't have the ability, nor is it exactly trained, to model "truthfulness". So it may just maintain the confident-sounding style indiscriminately, whether it's speaking truth or fiction (although it can probably adopt a "less confident" attitude if explicitly asked to role-play as such, but then it may just be less confident indiscriminately).

That said, OpenAI may have found some ways to make it more cautious (not necessarily adopting less confident styles, but declining to respond when it's more "uncertain", probably based on perplexity or something; I don't know exactly how they enforce cautiousness; a toy sketch of the perplexity idea is below the quote):

See:

https://openai.com/blog/chatgpt/

> ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers. Fixing this issue is challenging, as: (1) during RL training, there’s currently no source of truth; (2) training the model to be more cautious causes it to decline questions that it can answer correctly; and (3) supervised training misleads the model because the ideal answer depends on what the model knows, rather than what the human demonstrator knows.
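If cautiousness really were gated on something like perplexity, a minimal version might look like this (the threshold and the per-token log-probs are placeholders; OpenAI hasn't said this is what they do):

```python
import math

def should_decline(token_logprobs: list[float], threshold_ppl: float = 20.0) -> bool:
    """Decline to answer when the model's own output looks 'surprising' to it.

    Perplexity is exp of the average negative log-likelihood per token; a high
    value means the model assigned low probability to its own output, which can
    serve as a crude uncertainty proxy.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    perplexity = math.exp(avg_nll)
    return perplexity > threshold_ppl

# Example with made-up per-token log-probs (natural log):
confident = [-0.1, -0.3, -0.2, -0.15]
unsure = [-3.2, -4.1, -2.8, -3.9]
print(should_decline(confident))  # False: low perplexity, answer as normal
print(should_decline(unsure))     # True: high perplexity, decline or hedge
```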

3