freezelikeastatue t1_iseurai wrote

Excellent summation! You do raise an interesting point, though: don’t use it to make up data, use it to predict growth, especially for tumors in the brain. I’m not sure the proper data sets exist to feed a model that predicts biological growth, but I think that would be an application for the specific use case you described above.

2

freezelikeastatue t1_irx9anz wrote

Agreed, and I think what the civilian developer corps has done in spite of OpenAI’s promise of OPEN AI is a testament. But we cannot forget invention, patents, and capitalism. We’re early in understanding just what this technology does, but we as individuals don’t have the computational resources capitalistic organizations do. The models that are out now and freely available are so fucking lucrative, it’s not even funny. If one were so inclined, which I am, you can develop software without a single software developer. Simple code, yes, but many instances of simplicity compounded become quite complex. And wouldn’t you agree that building a software system is best done incrementally and object-oriented?

1

freezelikeastatue t1_irx78wt wrote

It pays off in the general sense of text generation and image generation. The errors and chaos are what make it beautiful. I’m not sure how others are using the data for more technical applications, but it seems to be working, whatever they’re doing. My warning to everybody who reads this: download all the scripts and code you can of the diffusers and encoders and decoders and models, because all that shit is going to become proprietary very soon. You must understand that while those who created the source code did release it under the open licenses that make it free, they have the absolute authority to remove it from distribution, as we are slowly starting to see.
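One way to act on that warning is to mirror the repos locally. A minimal sketch; the repo names below are illustrative examples, not a recommended list, and the commands need network access:

```shell
# Illustrative only: mirror whichever open-source model/code repos you depend on.
# Shallow clones keep the local archive small; drop --depth 1 to keep full history.
for repo in diffusers transformers; do
  git clone --depth 1 "https://github.com/huggingface/${repo}.git"
done
```

Model weights hosted alongside the code (e.g. via Git LFS) usually need a separate pull, so check each repo’s own download instructions.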

1

freezelikeastatue t1_irx3brk wrote

Yeah, so this gets pretty philosophical and theoretical real quick. Also, interpretation of data is unique to every individual. I did place the constraint of my purposes only, which, admittedly, don’t require such large model sets; I can achieve similar if not better results with a smaller, more defined model.

I also have not created a curated data set on the level of CLIP or OpenAI or OPT. I’ve tried scaling my data by applying a text generator to each parameter of data that I had, replicating a faux variable exponentially to grow the parameter count to 1/1000th of the number of parameters in GPT-3’s model, but I got noise in return.

My summation is that the viability of the model is wholly dependent upon the unique properties and ensured individuality of each variable. I can say I have achieved higher benchmarks with regard to few- and zero-shot settings, the highest being 89.2% on few-shot, but it was a very specialized data set.

−1

freezelikeastatue t1_irwan1x wrote

Grammatical errors, and mischaracterizing the context of one- to two-token words such as acronyms, slang, etc. Additionally, I think how the raw data is structured is prohibitive of true optimization. That’s more a theory of mine than anything, but I’ve built models from scratch and they’ve outperformed these models for my specific application every time.

My personal raw data is what you would call curated, but more so: every cell was meticulously verified and validated. Additionally, there aren’t stray variables or additional characters, such as spaces or underscores, that could be confused as part of the real data. I know AI has done an exceptional job at cleansing data, but it still isn’t 100%. I’m still better at manually cleansing data than any software in existence, and I’ve used a majority of them.
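The kind of cell-level check described above can be approximated in code. A minimal sketch, assuming tabular string data; the specific checks and sample rows are my own illustration, not the commenter’s actual pipeline:

```python
import re

def clean_cell(cell: str) -> str:
    """Remove stray whitespace and underscores that could be mistaken for data."""
    cleaned = cell.strip()                  # leading/trailing spaces
    cleaned = re.sub(r"\s+", " ", cleaned)  # collapse internal runs of whitespace
    return cleaned.strip("_")               # stray underscores around the value

def validate_cell(cell: str) -> bool:
    """Flag cells still containing characters likely to confuse a model."""
    return cell != "" and not re.search(r"[_\x00-\x1f]", cell)

# Hypothetical messy rows with the artifacts mentioned above.
rows = [["  GPT-3 ", "175B_"], ["CLIP", "image\t text pairs"]]
cleaned = [[clean_cell(c) for c in row] for row in rows]
assert all(validate_cell(c) for row in cleaned for c in row)
```

Automated passes like this catch the mechanical artifacts; the semantic verification the commenter describes (is the value actually correct?) still has to be done by hand.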

−1

freezelikeastatue t1_irvgsg4 wrote

I can say this: out of all the AI and ML research papers I’ve read, the data sources folks are using, such as The Pile or Pushshift.io for Reddit data, are not particularly valid.

I’ve been poring over a lot of the raw data and have found so many errors that I think it would be difficult, or disingenuous, to say that the models created from those data sets are viable for use. Now, when you look at overall correctness, you’ll find that statistically the AI and ML architecture can overcome those issues. However, when it comes to the reliability and fidelity of the data, it’s either too inconsistent or wildly wrong in its assertions. Another way to say it: the validity of outcomes produced by AI and ML architectures that utilize public raw data should be questioned.

Just because you disagree, or have never heard it from this point of view, doesn’t mean it’s wrong…

19