Submitted by LettucePrime t3_119s0zp in Futurology
Regarding AI cheating in academia & the human effort it takes to discern AI-generated text from human-written text:
A lot of very smart people are doing lots of good work writing AI-assisted AI-detector bots or digitally watermarking AI text, both projects beyond my feeble human ken. What I haven't seen discussed: shouldn't the onus of delineating man from machine be on the side providing the AI chatbot? Shouldn't they be keeping a public record of the raw text generated by their public toy in a database, easily checked & cross-referenced by existing plagiarism tools?
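Even a toy version of the record I mean is simple: hash overlapping word windows ("shingles") of everything the bot emits into a public table, then score suspect text by how many of its shingles show up there. A sketch, with every name & parameter in it hypothetical rather than any real provider's API:

```python
# Toy sketch of the public record I'm proposing: the provider logs a hash
# of every overlapping 8-word window ("shingle") of generated text, and
# anyone can score a suspect document against the log. All names here are
# hypothetical; no provider exposes anything like this today.
import hashlib

SHINGLE = 8  # words per window; wider = fewer coincidental matches

def shingles(text: str):
    words = text.lower().split()
    for i in range(len(words) - SHINGLE + 1):
        window = " ".join(words[i : i + SHINGLE])
        yield hashlib.sha256(window.encode()).hexdigest()

public_log: set = set()  # the provider would populate this on every response

def record(generated_text: str) -> None:
    public_log.update(shingles(generated_text))

def overlap(candidate_text: str) -> float:
    """Fraction of the candidate's shingles found in the log (0.0-1.0)."""
    hashes = list(shingles(candidate_text))
    return sum(h in public_log for h in hashes) / len(hashes) if hashes else 0.0
```

A plagiarism checker would just call overlap() on a submitted essay & flag high scores. Exact-match shingles are brittle against paraphrasing, sure, but it shows how small the engineering ask is: hashes instead of raw text keeps the record compact.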
I know it's not beyond any of these companies: for all their sci-fi machinations, language models ultimately return a few KBs of output, & we're talking about the likes of Microsoft, Alphabet, & Meta. They built the infrastructure for the social media era.
If security is an issue, sell your clients a secure platform for your chatbot, managed by their own organization. AI is already difficult to monetize as it is; it's part of why Silicon Valley largely ignored LLMs for the entire 2010s. Am I missing something in my assessment? This seems like a no-brainer solution, & these firms should be pressured to adopt it, for the good of society if nothing else.
adt t1_j9nv4zj wrote
>shouldn't the onus of delineating man from machine be on the side providing the AI chatbot?
It is.
Here's a very long read, but it explains how OpenAI is building in watermarking for use by governments, themselves, & maybe academia:
https://scottaaronson.blog/?p=6823
>'to watermark, instead of selecting the next token randomly, the idea will be to select it pseudorandomly, using a cryptographic pseudorandom function, whose key is known only to OpenAI. That won’t make any detectable difference to the end user, assuming the end user can’t distinguish the pseudorandom numbers from truly random ones. But now you can choose a pseudorandom function that secretly biases a certain score—a sum over a certain function g evaluated at each n-gram (sequence of n consecutive tokens), for some small n—which score you can also compute if you know the key for this pseudorandom function'
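In code, the core trick looks something like this. A toy sketch of the scheme the quote describes, not OpenAI's actual implementation: the key, the vocabulary, & the stand-in "model" below are all made up.

```python
# Toy sketch of PRF-biased sampling as the quote describes it: bias token
# choice via a keyed pseudorandom function, detect via the n-gram score.
# Hypothetical throughout: the key, VOCAB, and fake_model() standing in
# for a real language model.
import hashlib
import hmac
import math
import random

SECRET_KEY = b"known-only-to-the-provider"  # hypothetical key
N = 3  # score each token against the n-gram it completes

def prf(ngram: tuple) -> float:
    """Keyed pseudorandom function: n-gram -> float in (0, 1)."""
    d = hmac.new(SECRET_KEY, repr(ngram).encode(), hashlib.sha256).digest()
    return (int.from_bytes(d[:8], "big") + 1) / (2**64 + 2)

def pick_token(context: list, probs: dict) -> str:
    """Instead of sampling from probs directly, pick the token t that
    maximizes r_t ** (1 / p_t), where r_t = prf(n-gram ending in t).
    If the r_t were truly uniform this would be an exact sample from
    probs, so the end user can't tell the difference."""
    prev = tuple(context[-(N - 1):])
    return max(probs, key=lambda t: prf(prev + (t,)) ** (1.0 / probs[t]))

def watermark_score(tokens: list) -> float:
    """Detection, for whoever holds the key: average ln(1/(1-r)) over all
    n-grams. Unwatermarked text averages ~1.0; watermarked text scores
    noticeably higher, because generation favored tokens with large r."""
    ngrams = [tuple(tokens[i - N + 1 : i + 1]) for i in range(N - 1, len(tokens))]
    return sum(math.log(1 / (1 - prf(g))) for g in ngrams) / len(ngrams)

# Demo with a stand-in "model" emitting some context-dependent distribution:
VOCAB = ["the", "first", "hundred", "prime", "numbers", "of"]

def fake_model(context: list) -> dict:
    rng = random.Random(repr(context))
    w = [rng.random() + 0.05 for _ in VOCAB]
    s = sum(w)
    return {t: x / s for t, x in zip(VOCAB, w)}

marked, plain = ["<s>", "<s>"], ["<s>", "<s>"]
for _ in range(300):
    marked.append(pick_token(marked, fake_model(marked)))
    plain.append(random.choices(VOCAB, weights=list(fake_model(plain).values()))[0])

print("watermarked score:", round(watermark_score(marked), 2))  # well above 1.0
print("plain score:      ", round(watermark_score(plain), 2))   # about 1.0
```

The "won't make any detectable difference to the end user" claim rests entirely on the PRF values looking uniform to anyone without the key, which is exactly the cryptographic assumption the quote spells out.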
And why they wouldn't just stick it in a database of logs:
>'Some might wonder: if OpenAI controls the server, then why go to all the trouble to watermark? Why not just store all of GPT’s outputs in a giant database, and then consult the database later if you want to know whether something came from GPT? Well, the latter could be done, and might even have to be done in high-stakes cases involving law enforcement or whatever. But it would raise some serious privacy concerns: how do you reveal whether GPT did or didn’t generate a given candidate text, without potentially revealing how other people have been using GPT? The database approach also has difficulties in distinguishing text that GPT uniquely generated, from text that it generated simply because it has very high probability (e.g., a list of the first hundred prime numbers).'