
dojoteef t1_j1uwubj wrote

Nice job!

Though, to produce a better comparison it's best to show two examples side-by-side (one by a human, the other by the model, in a randomized order, of course). The reason is that most people are not trained to analyze short snippets of text out of context. People trained to do that, e.g. English teachers, can better distinguish generated text without a baseline to compare against, but most people (as in a crowd-sourced evaluation) will likely produce a very biased analysis that does not reflect humans' real ability to distinguish between the two.
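The paired setup described above can be sketched as follows (a minimal, hypothetical helper, assuming you already have matched lists of human-written and model-generated texts; position is shuffled so raters can't infer the source from placement):

```python
import random

def make_pairs(human_texts, model_texts, seed=0):
    """Pair each human-written text with a model-generated one,
    randomizing left/right presentation order per pair."""
    rng = random.Random(seed)
    pairs = []
    for h, m in zip(human_texts, model_texts):
        items = [("human", h), ("model", m)]
        rng.shuffle(items)  # hide which side is which
        pairs.append(items)
    return pairs
```

Raters then see both texts at once, with the source labels stored only for later analysis.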

For a more thorough investigation of this phenomenon you can check out our research:

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

26

respeckKnuckles t1_j1v440q wrote

I'm not sure how the side-by-side comparison answers the same research question. If they are told one is AI and the other isn't, the reasoning they use will be different. It's not so much "is this AI?" as "which is more AI-like?"

20

dojoteef t1_j1v4j4r wrote

You don't need to tell them one is AI or model generated. It could be two model-generated texts or two human-written texts. Merely having another text for comparison allows people to better frame the task, since otherwise they essentially need to imagine a baseline for comparison, which people rarely do.

−3

respeckKnuckles t1_j1v66iq wrote

You say it allows them to "better frame the task", but is your goal to have them maximize their accuracy, or to capture how well they can distinguish AI from human text in real-world conditions? If the latter, then this establishing of a "baseline" leads to a task with questionable ecological validity.

7

Ulfgardleo t1_j1vcqri wrote

  1. you are asking humans to solve this task untrained, which is not the same as the human ability to distinguish the two.

  2. you are then also making it harder by phrasing the task in a way that makes it difficult for the human brain to solve it.

2

respeckKnuckles t1_j1vempm wrote

> you are asking humans to solve this task untrained, which is not the same as the human ability to distinguish the two.

This is exactly my point. There are two different research questions being addressed by the two different methods. One needs to be aware of which they're addressing.

> you are then also making it harder by phrasing the task in a way that makes it difficult for the human brain to solve it.

In studying human reasoning, sometimes this is exactly what you want. In fact, in some work studying Type 1 vs. Type 2 reasoning, we actually make the task harder (e.g. by adding working-memory or attentional constraints) in order to elicit certain types of reasoning. You want to see how people perform in conditions where they're not given help. Not every study is about maximizing human performance. Again, you need to be aware of what your study design is actually meant to do.

7

Ulfgardleo t1_j1vjc6q wrote

I don't think this is one of those cases. The question we want to answer is whether the texts are good enough that humans will not pick up on them. Making the task as hard as possible for humans is not indicative of real-world performance once people are presented with these texts more regularly.

1

londons_explorer t1_j1vecdr wrote

You could get a similar outcome by discarding the results of the first two or so examples in each session as "practice" trials, then recording data from the rest.
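The practice-trial idea above amounts to a simple filter over each rater's session (a hypothetical sketch, assuming responses are stored per rater in presentation order):

```python
def drop_practice(responses, n_practice=2):
    """Discard each rater's first few judgments as practice trials,
    keeping only the remaining responses for analysis."""
    return {rater: judgments[n_practice:]
            for rater, judgments in responses.items()}
```

Raters still get no explicit baseline, but their early, least-calibrated judgments don't contaminate the measured accuracy.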

2