michaelthwan_ai OP t1_jct4sdj wrote on March 19, 2023 at 10:54 AM

#2,269,331

Demo page: https://searchgpt-demo.herokuapp.com/

Github : https://github.com/michaelthwan/searchGPT

searchGPT is a search engine or question-answer bot based on LLM to give natural language answers. You may see the footnote which is the reference of sources from the web. Below there is a explainability view to show how the response is related to the sources.

Why Grounded though?

Because it is impossible for the LLM to learn everything during the training, thus real-time factual information is needed for reference.
This project tried to reproduce work like Bing and perplexity AI which have external references to support the answer of LLM.

Some examples of good grounded answer from searchGPT and wrong ungrounded answer from ChatGPT is mentioned in the github.

taci7c0ff33 t1_jcth2g9 wrote on March 19, 2023 at 1:11 PM

#2,269,933

Very cool!

michaelthwan_ai OP t1_jcthy7x wrote on March 19, 2023 at 1:19 PM

#2,269,992

Replying to taci7c0ff33 (#2,269,933)

Thank you! :D

squareOfTwo t1_jctkqt2 wrote on March 19, 2023 at 1:43 PM

#2,270,133

someone had to do this as OSS MIT, great, thx!

ProbablyBunchofAtoms t1_jctnndu wrote on March 19, 2023 at 2:07 PM

#2,270,262

Looks dope

michaelthwan_ai OP t1_jctooa2 wrote on March 19, 2023 at 2:15 PM

#2,270,307

Replying to squareOfTwo (#2,270,133)

Of course we do! Open-source projects are cool

KingsmanVince t1_jctpr5l wrote on March 19, 2023 at 2:24 PM

#2,270,360

Replying to michaelthwan_ai (#2,269,331)

Not sure this is frontend problem or not, but the python code is printed without identation.

rowleboat t1_jctpu8c wrote on March 19, 2023 at 2:24 PM

#2,270,364

Replying to michaelthwan_ai (#2,269,331)

Can this use a SQL database as an external reference?

Tostino t1_jctq5az wrote on March 19, 2023 at 2:27 PM

#2,270,381

Replying to rowleboat (#2,270,364)

Look into llama-index

[deleted] t1_jctq5xu wrote on March 19, 2023 at 2:27 PM

#2,270,383

[deleted]

ramtingxf t1_jctqpvo wrote on March 19, 2023 at 2:31 PM

#2,270,416

Google need to be worried

michaelthwan_ai OP t1_jcturwz wrote on March 19, 2023 at 3:01 PM

#2,270,619

Replying to KingsmanVince (#2,270,360)

I believe it is a frontend problem. We are not frontend developers thus but we think that Gradio is too plain to show the result, thus we built a minimal UI.
That markdown (``` <code> ```) is currently not supported to pretty print like ChatGPT one.

michaelthwan_ai OP t1_jctv2tm wrote on March 19, 2023 at 3:03 PM

#2,270,635

Replying to Tostino (#2,270,381)

Thank you.
Due to people close to me and my googling, my choices of indexer is like this

pyterrier -> faiss -> native embedding

Then I found llama-index, but it currently won't give extra values to me so I didn't adopt.

I have stories on pros/cons on those lib...

michaelthwan_ai OP t1_jctvcas wrote on March 19, 2023 at 3:05 PM

#2,270,647

Replying to rowleboat (#2,270,364)

Theoretically yes but in exact the objective you want to do is crucial.

SQL database don't support similarity/elastic search, which is very useful in natural language. It may limit what you can do or make your product less good.

michaelthwan_ai OP t1_jctvmqe wrote on March 19, 2023 at 3:07 PM

#2,270,656

Replying to [deleted] (#2,270,383)

Cool! Thanks for the sharing.

During my development, I've also found 5+ projects, some open-source and some are closed, where they are doing similar things.

In exact, it is called retrieval-based language model.

Some discussion on that:

https://ai.stanford.edu/blog/retrieval-based-NLP/

michaelthwan_ai OP t1_jctw0gx wrote on March 19, 2023 at 3:09 PM

#2,270,670

Replying to ramtingxf (#2,270,416)

It is a little bit exaggerated but thanks! I believe Bing and some companies are using similar tech, but a highly polished one, to solve similar issues.

derek_ml t1_jcudkzh wrote on March 19, 2023 at 5:12 PM

#2,271,537

Any chance you will add this as a Hugging Face Space?

Educational_Ice151 t1_jcueag5 wrote on March 19, 2023 at 5:16 PM

#2,271,573

This looks great 👍

Shared to r/aipromptprogramming

I_will_delete_myself t1_jcuofiw wrote on March 19, 2023 at 6:22 PM

#2,272,003

Nice job

fuzwz t1_jcupmn6 wrote on March 19, 2023 at 6:30 PM

#2,272,061

How many websites did you index in your search engine?

nenkoru t1_jcus6rg wrote on March 19, 2023 at 6:47 PM

#2,272,195

Replying to [deleted] (#2,270,383)

Made a few issues and a pull request for changes in the source code adding support for DuckDuckGo. So if anyone willing to ditch Bing as a dependency and OpenAI(in the future) make sure to keep an eye on this project.

I liked the idea that it's all within a terminal. No need to open a browser and ask for questions. Pretty useful for searching without switching a cognitive context from a vim tab with the code to a browser. In december I did something similar with just a wrapper around OpenAI completion and was asking questions about coding. In combination with codequestion it was pretty useful. This one(XOXO) makes it a much pleasant experience.

Cheers!

[deleted] t1_jcuxs8m wrote on March 19, 2023 at 7:24 PM

#2,272,453

Replying to nenkoru (#2,272,195)

[deleted]

rjog74 t1_jcv2mee wrote on March 19, 2023 at 7:56 PM

#2,272,710

This is great !!!

Secret-Fox-5238 t1_jcv4t2r wrote on March 19, 2023 at 8:11 PM

#2,272,825

But you haven’t written a search engine????

Secret-Fox-5238 t1_jcv5dhh wrote on March 19, 2023 at 8:14 PM

#2,272,848

Replying to michaelthwan_ai (#2,270,647)

This is completely false. Elastic was invented by SQL. You use things like “LIKE” and a few other choice keywords. Just google them or go to Microsoft directly and look at sql select statements. You can string together CTE’s which immediately gives you elasticity. So, sorry, but this is a nonsensical response

BalorNG t1_jcv99cz wrote on March 19, 2023 at 8:42 PM

#2,273,068

Just like humans, LLMs learn patterns and relationships, not "facts" unless you make it memorize it by repeating training data over and over, but it makes other aspects of the system to degrade.

So, LLMs should be given all the tools humans use to augment their thought - spreadsheets, calculators, databases, CADs, etc and allow them to interface them quickly and efficiently.

[deleted] t1_jcvbkg2 wrote on March 19, 2023 at 8:58 PM

#2,273,205

[deleted]

egoistpizza t1_jcvcl5h wrote on March 19, 2023 at 9:05 PM

#2,273,266

Hi! Your project and other projects based on this topic constitute a valid response to active curiosity on this subject. It will be in the interest of society for AI-powered search engines to enter the active development process and gather their unique user base. The only doubt is that as OpenAI and other AI "for-profit" companies close their projects to external analysis and development over time (see GPT-4), AI-powered applications will become closed boxes and the development potential of these projects will be limited. The active protest reactions that we can show on this issue can lose its effect over time, the masses can close their eyes in the face of hype and demand products that are harmful to us in the long run. For this reason, I think that the protest in this area should be made as a mass as soon as possible.

I may have stretched the subject a bit too much, I liked your project and other similar projects quite a lot. Not only did it answer the test question I just asked, it also corrected my grammatical errors in the question, causing me to be a little surprised swh. My request is that we, as a society, do not forget about the potential that we are losing by getting immersed in leading projects. AI-powered applications are great, but we must not forget our rights that these companies take away from us day by day.

phazei t1_jcvcn08 wrote on March 19, 2023 at 9:05 PM

#2,273,269

Replying to michaelthwan_ai (#2,270,619)

If you can have it add a class and add "white-space: pre" to the css, it should probably fix it if it's just a frontend issue.

VelvetyPenus t1_jcveijf wrote on March 19, 2023 at 9:18 PM

#2,273,408

Encountered error You exceeded your current quota, please check your plan and billing details.

hassan789_ t1_jcvuoze wrote on March 19, 2023 at 11:22 PM

#2,274,414

My first try .. I got an error:

Encountered error You exceeded your current quota, please check your plan and billing details.

michaelthwan_ai OP t1_jcw1cnv wrote on March 20, 2023 at 12:11 AM

#2,274,870

Replying to hassan789_ (#2,274,414)

added some credits to it . Used up all. I will monitor the usage

Trolann t1_jcwc0n0 wrote on March 20, 2023 at 1:33 AM

#2,275,598

Replying to nenkoru (#2,272,195)

A (scarier) command line tool that may be up your alley is pls. I'd only use it in a safe VM for now, but minor edits could give you a way to confirm before execution.

michaelthwan_ai OP t1_jcws6h8 wrote on March 20, 2023 at 3:48 AM

#2,276,680

Replying to Secret-Fox-5238 (#2,272,848)

ChatGPT said what I want to say.

>I apologize for any confusion or misinformation in my previous response. You are correct that SQL databases do support various text search and similarity matching features, including the use of keywords like LIKE and CTE (Common Table Expressions) to enable more flexible and efficient querying.
>
>While it's true that specialized tools like Elasticsearch, Solr, or Algolia may offer additional features and performance benefits for certain natural language processing tasks, SQL databases can still be a powerful and effective tool for storing and querying structured and unstructured data, including text data.
>
>Thank you for bringing this to my attention and allowing me to clarify my previous response.

mrcet007 t1_jcxge76 wrote on March 20, 2023 at 9:02 AM

#2,277,804

Replying to derek_ml (#2,271,537)

whats the benefit?

nenkoru t1_jcxjhy6 wrote on March 20, 2023 at 9:47 AM

#2,277,925

Replying to Trolann (#2,275,598)

Yep, understandable. Made a Dockerfile for the project so that it could be run in an isolated environment. Checkout a pull request from me

michaelthwan_ai OP t1_jcxrcfu wrote on March 20, 2023 at 11:28 AM

#2,278,308

Replying to BalorNG (#2,273,068)

I agree with you. 3 thoughts from me

- I think one direction of the so-called safety AI to give a genuine answer, is to give it factual/external info. I mean 1) a Retrieval-based model like searchGPT 2) API calling like toolformer (e.g. check weather API)

- LLM, is essentially a compression problem (I got the idea in lambdalabs). But it cannot remember everything. Therefore an efficient way to solve so are retrieval methods to search a very large space (like pagerank/google search), then obtain a smaller result set and let the LLM organize and filter related content from it.

- Humans are basically like that right? But if we got a query, we may need to read books (external retrieval) which is pretty slow. However, humans have a cool feature, long-term memory, to store things permanently. Imagine if an LLM can select appropriate things during your queries/chat and store them as a text or knowledge base inside it, then it is a knowledge profile to permanently remember the context bonded between you and the AI, instead of the current situation that ChatGPT will forget everything after a restart.

michaelthwan_ai OP t1_jcxrilh wrote on March 20, 2023 at 11:30 AM

#2,278,319

Replying to Secret-Fox-5238 (#2,272,825)

haha. Nice point.

I'm not sure whether it fulfil the definition of a search engine, but this work essentially mimics your experiences during googling: Google->got n websites->surf and find info one by one.

SearchGPT (or e.g. new Bing) attempted to automate this process. (Thus Google is unhappy)

michaelthwan_ai OP t1_jcxrjbm wrote on March 20, 2023 at 11:30 AM

#2,278,320

Replying to Educational_Ice151 (#2,271,573)

Thank you!

michaelthwan_ai OP t1_jcxsd0x wrote on March 20, 2023 at 11:39 AM

#2,278,379

Replying to egoistpizza (#2,273,266)

Thank you for your comprehensive input.
- I have mixed feeling about opening/closing the technology. There are pros/cons to it. For example, we, especially people in this field have a strong curiosity about how giant technology solves their problems (like chatgpt). Therefore open-sourcing them will bring us rapid development in related fields (like the current AI development). However, I also understand that, malicious usage is also highly possible when doing so. For example, switching the reward function from chatgpt model from positive to negative may make a safe AI into the worst AI ever.
- Humans seem to not be able to stop technological advancement. Those technologies will come sooner or later.
- Yes I agree to preserve our rights today and the society should carefully think about how to deal with this unavoidable (AI-powered) future.

BalorNG t1_jcxtq26 wrote on March 20, 2023 at 11:54 AM

#2,278,441

Replying to michaelthwan_ai (#2,278,308)

There is a problem with context length, but than given the fact that us humans have even less context length and can get carried away in conversation... I think 32kb context length is actually much greater leap in GPT4 than other metrics if you want it to tackle more complex tasks, but it is "double gated". Again, even humans have problems with long context even in pretty "undemanding" tasks like reading fiction, that's why books have chapters I presume :) Btw, anterograde amnesia is a good example how humans would look like w/o longterm memory, heh.

Anyway, I'm sure a set of more compact models trained on much more high-quality data is the way to go - or at least fine-tuned by high-quality data, coupled with APIs and other symbolic tools, and multimodality (sketches, graphs, charts) as input AND output is absolutely nessesary to have a system that can be more than "digital assistant".

michaelthwan_ai OP t1_jcxx6ib wrote on March 20, 2023 at 12:28 PM

#2,278,639

Replying to BalorNG (#2,278,441)

Yeah great summary related to the memory.

My next target may be related to compact models (which preserve good results), as I also believe it is the way to go :D

BalorNG t1_jcy0trr wrote on March 20, 2023 at 1:01 PM

#2,278,896

Replying to michaelthwan_ai (#2,278,639)

Yea, I'm sure that compact-ish distilled, specialised models trained on high quality, multimodal data is the way to go.

What's interesting, once generative models get good enough to produce synthetic data that is OF HIGHER QUALITY than laion/common crawl/etc, it should improve model quality which should allow to generate better synthetic data... not exactly singularity, but certainly one aspect of it :)

egoistpizza t1_jcy1jxt wrote on March 20, 2023 at 1:07 PM

#2,278,946

Replying to michaelthwan_ai (#2,278,379)

Thanks for your reply. If technological developments are opened to the masses, as you said, the speed of development will jump. We're talking about a much higher rate of technological development than a closed development environment can provide. It will never reach its potential for development under the monopoly of companies that use technology and science like a cow for profit.

On the other hand, the current developments and potentials under the monopoly of these companies are more conducive to malicious use. The company, which, like OpenAI, was built on the axis of control and good purposes in the development of artificial intelligence, has now become Microsoft's cow. Microsoft, which fired the ethics team before the introduction of GPT-4, and similar companies prefer to use artificial intelligence to gain power and worship power in unethical ways from the very beginning.

Rather than protecting the public against a potential that could be used for malicious purposes, these companies may use this potential to serve "their" unethical purposes for their own profit. In this case, they turn into "bad guys" in order to prevent malicious people from using the technological potential for their own benefit.

Artificial intelligence and technological development potential should not be monopolized by anyone. In this way, we are responsible for raising awareness ourselves and raising the awareness of the masses by doing our part. The current hype should not blind people.

michaelthwan_ai OP t1_jcy73od wrote on March 20, 2023 at 1:51 PM

#2,279,304

Replying to BalorNG (#2,278,896)

Your idea sounds like GAN - maybe one model will generate high-quality synthetic data and another one try to 'discriminate' it, then they may output an ultra-high quality one finally (for another model to eat). And an AI model community is formed to self-improve...

BalorNG t1_jcy7l5d wrote on March 20, 2023 at 1:55 PM

#2,279,330

Replying to michaelthwan_ai (#2,279,304)

Yea, in a way something like this was already done with LLAMA-Alpaca finetune - they used chatgpt to generate instuct finetune dataset, what, while far from pefrect, worked pretty damn well.

derek_ml t1_jcyin91 wrote on March 20, 2023 at 3:19 PM

#2,280,001

Replying to mrcet007 (#2,277,804)

Pros:

The code and the app are closer, its easier for users to duplicate, explore, make issues/prs etc.
Its easier to discover given there is a large community there
Deployment is easier

Cons:

Github is a bit more advanced for prs/issues etc
Heroku is probably more configurable

michaelthwan_ai OP t1_jcyo94y wrote on March 20, 2023 at 3:57 PM

#2,280,371

Added an "examples of prompts" on the top for showcases!

kalakau t1_jd6uha4 wrote on March 22, 2023 at 7:08 AM

#2,300,998

Replying to Trolann (#2,275,598)

that's got to be my favorite disclaimer i've ever read

"it's unpredictable and could do anything"

[deleted] t1_jdgvy9w wrote on March 24, 2023 at 9:07 AM

#2,332,181

Replying to michaelthwan_ai (#2,269,331)

[removed]

Comments