Submitted by fromnighttilldawn t3_y11a7r in MachineLearning
freezelikeastatue t1_irvgsg4 wrote
I can say this: out of all the AI and ML research papers I’ve read, the data sources folks are using, such as The Pile or Pushshift.io for Reddit data, are not particularly valid.
I’ve been poring over a lot of the raw data and have found so many errors that I think it would be difficult or disingenuous to say that the models created from those data sets are viable for use. Now, when you look at overall correctness, you’ll find that statistically the AI and ML architecture can overcome those issues. However, when it comes to the reliability and fidelity of the data, it’s either too inconsistent or wildly wrong in its assertions. Another way to say it: the validity of outcomes produced by AI and ML architectures that use public raw data should be questioned.
Just because you disagree or have never heard of it from this point of view, doesn’t mean it’s wrong ….
pm_me_your_pay_slips t1_irw5s8a wrote
What are those errors you have observed in the datasets like the Pile?
freezelikeastatue t1_irwan1x wrote
Grammatical errors, and mischaracterizing the context of 1-to-2-token words such as acronyms, slang, etc. Additionally, I think the way the raw data is structured gets in the way of true optimization. That’s more a theory of mine than anything, but I’ve built models from scratch and they’ve outperformed these models for my specific application every time.
My personal raw data is what you would call curated, but more than that, every cell was meticulously verified and validated. Additionally, there aren’t stray variables or extra characters such as spaces or underscores that could be confused as part of the real data. I know AI has done an exceptional job at cleansing data, but it still isn’t 100%. I’m still better at manually cleansing data than any software in existence, and I’ve used a majority of them.
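To make the "stray spaces or underscores" point concrete, here is a minimal sketch of the kind of check that could flag such cells before a manual review. It is not the commenter's actual process; the `STRAY` pattern, the function names, and the sample rows are all hypothetical, and what counts as "stray" would depend on the data set.

```python
import re

# Assumption: "stray" means leading/trailing whitespace or underscores,
# or a run of two or more whitespace characters inside the cell.
STRAY = re.compile(r"^[\s_]+|[\s_]+$|\s{2,}")

def find_dirty_cells(rows):
    """Return (row_index, column_name, raw_value) for every suspicious string cell."""
    dirty = []
    for i, row in enumerate(rows):
        for col, value in row.items():
            if isinstance(value, str) and STRAY.search(value):
                dirty.append((i, col, value))
    return dirty

def clean_cell(value):
    """Strip edge whitespace/underscores and collapse inner whitespace runs."""
    return re.sub(r"\s+", " ", value.strip(" \t\r\n_"))

# Hypothetical sample data for illustration only.
rows = [
    {"term": "LLM", "meaning": "large language model"},
    {"term": " GPU_", "meaning": "graphics  processing unit"},
]

for i, col, value in find_dirty_cells(rows):
    print(i, col, repr(value), "->", repr(clean_cell(value)))
```

A check like this only surfaces candidates; deciding whether an underscore is noise or part of a real identifier still takes the kind of manual validation described above.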
shoegraze t1_irwlpoz wrote
But surely with a dataset as large as the Pile and enough weights, the model will be able to learn at least decently well how to interpret misspellings and abbreviations. If anything, wouldn’t this data “issue” help improve an LLM’s robustness? Not sure I see what the issue is in the context of LLMs, but to be fair I agree with you if you’re trying to train a small model on a small amount of context-specific text data (but then you shouldn’t be using the Pile, should you?)
freezelikeastatue t1_irx3brk wrote
Yeah, so this gets pretty philosophical and theoretical real quick. Also, interpretation of data is unique to every individual. I did limit my claim to my own purposes, which admittedly don’t require such large models, and for those I can achieve similar if not better results with a smaller, more narrowly defined model.
I also have not created a curated data set on the level of CLIP or OpenAI or OPT. I tried scaling my data by applying a text generator to each parameter of data that I had and replicating faux variables exponentially, aiming for roughly 1/1000th of the number of parameters in GPT-3’s model, but got noise in return.
My summary is that the viability of the model depends wholly on the unique properties and ensured individuality of each variable. I can say I have achieved higher benchmarks in few- and zero-shot settings, with the highest being 89.2% on few-shot, but it was a very specific data set.
_Arsenie_Boca_ t1_irx6bvg wrote
I guess this is part of the bitter lesson. Sacrificing some quality for quantity seems to pay off in many cases
freezelikeastatue t1_irx78wt wrote
It pays off in the general sense of text generation and image generation. The errors and chaos are what make it beautiful. I’m not sure how others are using the data for more technical applications, but it seems to be working, whatever they’re doing. My warning to everybody who reads this: download all the scripts and code you can for the diffusers, encoders, decoders, and models, because all that shit is going to become proprietary very soon. You must understand that while those who created the source code did release it under open licenses that make it free, they have the absolute authority to remove it, as we are slowly starting to see.
_Arsenie_Boca_ t1_irx86or wrote
I see your point, but I wouldn’t see it too pessimistically. If anything, the not-so-open policy of OpenAI has led to many initiatives that aim to democratize AI. If they decide to go commercial as well, others will take their place.
freezelikeastatue t1_irx9anz wrote
Agreed, and I think what the civilian developer core has done in spite of OpenAI’s promise of OPEN AI is a testament. But we cannot forget invention, patents, and capitalism. We’re early in understanding just what this technology does, but we as individuals don’t have the computational resources capitalistic organizations do. The models that are out now and freely available are so fucking lucrative, it’s not even funny. If one were so inclined, which I am, one could develop software without a single software developer. Simple code, yes, but multiple instances of simplicity compounded become quite complex. And wouldn’t you agree that building a software system is best done incrementally and in an object-oriented way?
_Arsenie_Boca_ t1_irxaphg wrote
While I am optimistic about the openness of AI, I am much more pessimistic regarding its capabilities. I don’t believe AI could replace a team of software engineers anytime soon.
visarga t1_irzod3c wrote
Not a whole team, not even a whole job, but plenty of tasks can be automated. By averaging over many developers there is a cumulative impact.
But on the other hand, software has been cannibalising itself for 70 years and we're still accelerating; there's always space at the top.
freezelikeastatue t1_irxh2lr wrote
Not for anything new, no