Submitted by coconautico t3_11c1hzc in MachineLearning
Hey Reddit,
tl;dr: To democratize the technology behind virtual assistants, we can play a Q&A game to build a collaborative dataset that will enable the creation of culturally and politically unbiased virtual assistants.
As AI becomes more ubiquitous in our lives, we need to democratize it, ensuring that the next generation of virtual assistants, such as ChatGPT or BingChat, is not solely controlled by one company, group, or country. Such control would make it easier to skew our reality at large scale by deploying politically and culturally biased assistants, as we have already seen with OpenAI.
While one could argue that over time companies and startups will emerge and create their own alternatives, these could be few, because building such a virtual assistant is not only a matter of massive raw data and computation: it also requires very specific datasets (many of them created by experts from multiple fields) to "fine-tune" Large Language Models (LLMs) into virtual assistants.
Because of this, there is an international collaborative effort to create a public, multilingual, high-quality dataset through a Q&A game, one that will enable the creation of virtual assistants outside the control of these companies.
At this very moment, we already have more data than OpenAI had when it launched the first version of ChatGPT. However, the current dataset is strongly biased towards Spanish and English speakers, as they are the only ones who have contributed to it so far. Therefore, we need to encourage people from other countries and cultures to play this Q&A game, so we can create a truly multilingual dataset with expert knowledge of all kinds, from all over the world. (This would even allow the virtual assistant to answer questions that have never been answered in a given language.)
For Spanish and English, this is already a reality. Let's make it a reality for other languages too by writing a few questions and answers in the OpenAssistant game!
firejak308 t1_ja16y0h wrote
My main concern with this is how the "Reply as Assistant" texts are generated. That task is orders of magnitude more difficult than labeling an existing reply/prompt or coming up with a new prompt, because it often requires doing background research about the question and summarizing it effectively. If I were to actually try to fill out one of the Reply as Assistant tasks, I would much rather just copy-paste the Google Knowledge Panel or the Wikipedia summary or the ChatGPT output. How do we know that people aren't doing those kinds of things, which could introduce plagiarism concerns?