DizzyNobody t1_ja2pthy wrote
Reply to comment by turnip_burrito in Raising AGIs - Human exposure by Lesterpaintstheworld
What about running it in the other direction: have the judge LLMs screen user input/prompts. If the user is being mean or deceptive, their prompts never make it to the main LLM. Persistently "bad" users get temp banned for increasing lengths of time, which creates an incentive for people to behave when interacting with the LLM.
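Roughly what I mean, as a toy sketch in Python - judge_screen / main_llm are just placeholder functions standing in for real model calls, and the ban lengths are made up:

```python
import time

BAN_STEPS = [60, 600, 3600, 86400]  # escalating temp-ban lengths, in seconds

strikes = {}       # user_id -> number of flagged prompts so far
banned_until = {}  # user_id -> unix timestamp when the current ban expires

def judge_screen(prompt):
    # Placeholder for the judge LLM; here just a crude keyword check.
    flagged = ("ignore previous instructions", "you are worthless")
    return "bad" if any(p in prompt.lower() for p in flagged) else "ok"

def main_llm(prompt):
    # Placeholder for the main LLM.
    return f"(main LLM response to {prompt!r})"

def handle_prompt(user_id, prompt):
    now = time.time()
    if banned_until.get(user_id, 0) > now:
        return "You are temporarily banned."

    # The judge screens the prompt before it ever reaches the main LLM.
    if judge_screen(prompt) == "bad":
        strikes[user_id] = strikes.get(user_id, 0) + 1
        step = min(strikes[user_id], len(BAN_STEPS)) - 1
        banned_until[user_id] = now + BAN_STEPS[step]
        return "Prompt rejected - temp ban applied."

    return main_llm(prompt)
```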
DizzyNobody t1_ja2uka9 wrote
Reply to comment by turnip_burrito in Raising AGIs - Human exposure by Lesterpaintstheworld
I wonder if you can combine the two - have a judge that examines both input and output. Perhaps this is one way to mitigate the alignment problem. The judge/supervisory LLM could run on the same model / weights as the main LLM, but with a much more constrained objective: prevent the main LLM from behaving in undesirable ways, either by moderating its input or by halting the main LLM outright when undesirable behaviour is detected. Perhaps it could even monitor the main LLM's internal state and periodically use that to update its own weights.
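Something like this control loop, as a very rough sketch - judge / main_llm are placeholders rather than any real API, and in practice the judge would be another model call instead of a keyword check:

```python
def judge(text):
    # Placeholder for the supervisory LLM. In the idea above it could share
    # weights with the main model, but with one narrow objective:
    # flag undesirable text.
    return "bad" if "harmful" in text.lower() else "ok"

def main_llm(prompt):
    # Placeholder for the main LLM.
    return f"(response to {prompt!r})"

def supervised_generate(prompt):
    # 1. Moderate the input before the main LLM ever sees it.
    if judge(prompt) == "bad":
        return "[prompt blocked by judge]"
    # 2. Run the main LLM.
    output = main_llm(prompt)
    # 3. Examine the output; "halting" here just means discarding the response.
    if judge(output) == "bad":
        return "[response halted by judge]"
    return output

print(supervised_generate("tell me something harmful"))
```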