Submitted by hapliniste t3_10g5r52 in MachineLearning

With ChatGPT going mainstream and the general push to make products out of LMs, a problem remains: the cost of running such models.

To me, it seems counterproductive to put both language modelling and knowledge inside the model weights.

Is it time to shift to retrieval LMs like RETRO to keep costs down while offering the same products?

It could allow Google or others to offer a free assistant service, using embedding similarity search to retrieve results from the Internet, so the model itself might even run on edge devices.

What are your thoughts about that subject?

38

Comments


hapliniste OP t1_j50pe93 wrote

Also, I think this could help improve the actual "logic" of the model by focusing the small LM on that task, while the search component serves as the knowledge base.

Another benefit could be the ability to cite its sources.

It really seems like a no-brainer to me.

12

currentscurrents t1_j525hto wrote

Retrieval language models do have some downsides. Keeping a copy of the training data around is suboptimal for a few reasons:

  • Training data is huge. RETRO's retrieval database is 1.75 trillion tokens. This isn't a very efficient way of storing knowledge, since a lot of the text is irrelevant or redundant.

  • Training data is still a mix of knowledge and language. You haven't achieved separation of the two types of information, so it doesn't help you perform logic on ideas and concepts.

  • Most training data is copyrighted. It's currently legal to train a model on copyrighted data, but distributing a copy of the training data with the model puts you on much less firm ground.

Ideally I think you want to condense the knowledge from the training data down into a structured representation, perhaps a knowledge graph. Knowledge graphs are easy to perform logic on and can be human-editable. There's also already an entire sub-field studying them.
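For intuition, here's the triple-store idea in miniature (purely illustrative; not any particular KG library):

```python
# Facts as (subject, relation, object) triples: human-editable and easy
# to run simple logic over. Toy example only; real KGs handle cycles,
# schemas, and scale.
triples = {
    ("RETRO", "is_a", "retrieval language model"),
    ("retrieval language model", "is_a", "language model"),
    ("RETRO", "developed_by", "DeepMind"),
}

def is_a(entity: str, category: str) -> bool:
    # Transitive "is_a" lookup, the kind of inference a KG makes cheap.
    if (entity, "is_a", category) in triples:
        return True
    parents = {o for s, r, o in triples if s == entity and r == "is_a"}
    return any(is_a(p, category) for p in parents)

print(is_a("RETRO", "language model"))  # True, via transitivity
```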

19

BadassGhost t1_j55rxme wrote

I think the main reason to use retrieval is to attack the two biggest problems:

  • Hallucination
  • Long-term memory

Make the retrieval database MUCH smaller than RETRO's, and constrain it to reputable sources (textbooks, nonfiction books, scientific papers, and Wikipedia). You could either skip the textbooks/books or make deals with publishers. Then add to the dataset (or have a second dataset) everything the model sees in a certain context in production. For example, add all user chat history to the dataset for ChatGPT.

You could use cross-attention as in RETRO (maybe with some RLHF, like ChatGPT), or just engineer some prompt manipulation based on embedding similarities, along the lines of the sketch below.
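The prompt-manipulation route can be as simple as this (a sketch; the retrieval step is assumed to happen elsewhere, and the prompt format is illustrative):

```python
# Stuff the most similar trusted passages into the prompt with source tags,
# so the model can ground its answer and cite where it came from.
def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    # passages: (source_id, text) pairs, already ranked by embedding similarity
    context = "\n\n".join(f"[{src}] {text}" for src, text in passages)
    return (
        "Answer using ONLY the sources below, citing them as [source].\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
```

A side benefit of the source tags is exactly the citation ability OP mentioned.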

You could imagine ChatGPT variants that have specialized knowledge that you can pay for. Maybe an Accounting ChatGPT has accounting textbooks and documents in its retrieval dataset, and accounting companies pay a premium for it.

1

Ok-Cartoonist8114 t1_j52mjrw wrote

Here is a great paper from IBM following the retriever-reader paradigm. Love those "light" models that can be specialized by switching the index.

IMO the loss of ChatGPT is still interesting for retriever-reader approaches to generate either human-like or structured answers from input documents.

Here is a tool I made to create a retriever-reader pipeline in a minute: Cherche. I'd also recommend Haystack on GitHub!

7

IntrepidTieKnot t1_j547lq2 wrote

I made a tool that chops documents into chunks, creates embeddings for the chunks via GPT-3, and stores the embeddings in a Redis database. When I make a query, I create an embedding for it and look up my stored embeddings via cosine similarity.
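In code, that pipeline is roughly the following (a minimal untested sketch, assuming the pre-1.0 `openai` client and redis-py with brute-force scoring; the model name, key scheme, and chunking are illustrative):

```python
import json

import numpy as np
import openai  # assumes OPENAI_API_KEY is set in the environment
import redis

r = redis.Redis()
EMBED_MODEL = "text-embedding-ada-002"  # illustrative choice

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model=EMBED_MODEL, input=text)
    return np.array(resp["data"][0]["embedding"], dtype=np.float32)

def index_document(doc_id: str, text: str, chunk_size: int = 1000) -> None:
    # Chop the document into fixed-size chunks and store each chunk's
    # text and embedding under its own Redis key.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    for i, chunk in enumerate(chunks):
        record = {"text": chunk, "vec": embed(chunk).tolist()}
        r.set(f"chunk:{doc_id}:{i}", json.dumps(record))

def query(question: str, top_k: int = 3) -> list[str]:
    # Embed the query, then rank all stored chunks by cosine similarity.
    q = embed(question)
    scored = []
    for key in r.scan_iter("chunk:*"):
        rec = json.loads(r.get(key))
        v = np.array(rec["vec"], dtype=np.float32)
        sim = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((sim, rec["text"]))
    return [text for _, text in sorted(scored, reverse=True)[:top_k]]
```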

My question is: isn't that the same thing your tool does? In other words, what can Cherche do that my setup can't? Is it that I wouldn't need GPT-3 for the same result? Or what is it?

2

Ok-Cartoonist8114 t1_j54l5yh wrote

Your pipeline is fine! Cherche is not fancy; it just lets you create hybrid pipelines that rely on both language models and lexical matching, which can help a lot. Also, Cherche is primarily designed for computing embeddings with Sentence Transformers, which offer a better precision-to-parameter-count ratio.
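The hybrid idea, roughly (a conceptual sketch, not Cherche's actual API; the model name and weighting are illustrative):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small but strong for its size

def lexical_score(query: str, doc: str) -> float:
    # Crude stand-in for a lexical retriever such as TF-IDF or BM25.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_rank(query: str, docs: list[str], alpha: float = 0.5) -> list[str]:
    # Blend dense (semantic) and lexical scores with a simple weighted sum.
    q_vec = model.encode(query, normalize_embeddings=True)
    d_vecs = model.encode(docs, normalize_embeddings=True)
    dense = d_vecs @ q_vec  # cosine similarity, since vectors are normalized
    lex = np.array([lexical_score(query, d) for d in docs])
    scores = alpha * dense + (1 - alpha) * lex
    return [docs[i] for i in np.argsort(-scores)]
```

Lexical matching catches exact terms (names, codes, rare words) that dense embeddings sometimes miss, which is where the hybrid gain comes from.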

3

blimpyway t1_j51wv3h wrote

Retrieval should also work on the entire interaction history with a particular user: not only tracking beyond the token window, but having all the "interesting stuff" from the user's perspective available.

5

sammysammy1234 t1_j50vmjq wrote

The advantage of using ChatGPT is that it can give more human-like answers, and doing prompt engineering is much easier than labeling a lot of data.

However, I do agree that it is a very costly model, and in many applications a simpler one could be enough.

I don't know for sure, because ChatGPT's capabilities are still being explored and other models are coming up, so there is no telling what the scenario will be in a few months. Maybe we will just switch to using third-party models, similarly to how no one writes their own compilers.

3

wind_dude t1_j50x6ad wrote

Yeah, unless they master continual learning, the models will get stale quickly or will need to rely on iterative retraining, which is very expensive and slow. I don't see hardware catching up soon.

I think you'll still need to run a fairly sophisticated LLM as the base model for a query-based architecture. But you can probably reduce the cost of running it by distilling it and by curating the input data. I actually don't think there has been a ton of research on curating the input data before training (OpenAI did something similar by curating responses in ChatGPT with RLHF, so it's a related concept), although concerns/critiques may arise over what counts as junk, which may be why it hasn't been looked at in depth before. I believe SD did this in its latest checkpoint by removing anything "pornographic", which is over-censorship.
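For the distillation part, the generic recipe looks something like this (a sketch of standard knowledge distillation, not any specific lab's setup):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 T: float = 2.0) -> torch.Tensor:
    # Soften both distributions with temperature T and minimize
    # KL(teacher || student); the T^2 factor keeps gradient scale stable.
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```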

Look at something like Common Crawl (CC), which makes up a fairly large portion of the training data: run it through a classifier to remove junk before training. Even in CC text, a lot of it is probably landing-type pages or paywall-block messaging. To my knowledge, the share of these in CC hasn't even been measured, let alone trimmed from the training datasets used.
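A junk filter could start as cheaply as this (a hedged sketch; the phrases and thresholds are made-up illustrations, not from any published pipeline):

```python
# Cheap heuristics to drop landing pages and paywall notices from
# Common Crawl-style text before training. A trained classifier would
# replace looks_like_junk() in a serious pipeline.
PAYWALL_PHRASES = (
    "subscribe to continue reading",
    "create a free account",
    "log in to read the full article",
)

def looks_like_junk(text: str) -> bool:
    lower = text.lower()
    if any(p in lower for p in PAYWALL_PHRASES):
        return True  # likely paywall-blocked messaging
    words = lower.split()
    if len(words) < 50:
        return True  # stubs and landing pages are usually very short
    # Nav/CTA-heavy pages repeat a small vocabulary over and over.
    if len(set(words)) / len(words) < 0.3:
        return True
    return False

def curate(docs: list[str]) -> list[str]:
    return [d for d in docs if not looks_like_junk(d)]
```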

3

dancingnightly t1_j52k7sv wrote

Yup, I fully believe retrieval of sources will go up in value over time, in addition to the benefits you've outlined. When lots of content is AI-generated, being able to trust and see a source has value (even for an AI-summarized answer, say).

1