Submitted by GoodluckH t3_10bzll1 in MachineLearning

Many models these days focus on code generation. But I was wondering if there's anything for understanding an existing codebase?

I know that Codex or ChatGPT can understand what a function does, but what about a complex codebase with imports and nested calls? Are these models capable of understanding the relationship between functions?

I'm trying to build a side project where you give it a production-level codebase, it does some magic, and then I can ask the AI anything about this codebase and get highly accurate answers.

15

Comments

IntrepidTieKnot t1_j4dihgh wrote

I'd also like to do this. At the moment I think it is only possible if you retrain the model on the codebase in question. But I am happy to hear how it can be done from someone with more knowledge.

2

seventyducks t1_j4din09 wrote

You should check out https://gptduck.com

2

m98789 t1_j4e5du6 wrote

Gptduck is a cool project, but it only extracts embeddings of portions of the code which are typically just used for search, clustering or recommendation.

That is, the system will convert your question into an embedding, then simply do something like a dot product against every code embedding to rank the code chunks by semantic similarity to your query. The top one would be presented as the answer.

So it would feel more like an advanced search rather than a ChatGPT-like Q&A experience.
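
As a rough illustration of that ranking step, here is a minimal sketch. The three-dimensional vectors are hand-made stand-ins for real embeddings, and `rank_chunks` and the table contents are hypothetical names invented for the example, not anything from GPTDuck or OpenAI.

```python
# Toy example: rank code chunks against a query by dot product.
# The vectors are hand-made stand-ins, not real model embeddings.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_chunks(query_vec, table):
    """Return (score, code) pairs sorted from most to least similar."""
    scored = [(dot(query_vec, vec), code) for code, vec in table]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical table: each row pairs a code chunk with its embedding.
table = [
    ("def load_config(path): ...", [0.9, 0.1, 0.0]),
    ("def send_email(to, body): ...", [0.0, 0.2, 0.9]),
    ("def parse_yaml(text): ...", [0.7, 0.6, 0.1]),
]

query_vec = [1.0, 0.2, 0.0]  # pretend embedding of "how is config loaded?"
ranking = rank_chunks(query_vec, table)
best_score, best_code = ranking[0]
print(best_code)
```

If the embeddings are normalized to unit length, ranking by dot product gives the same order as ranking by cosine similarity.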

More info on OpenAI’s GPT embeddings:

https://beta.openai.com/docs/guides/embeddings/what-are-embeddings

4

GoodluckH OP t1_j4espcc wrote

Wow, that's really cool. But I can actually ask things like "what does XYZ do?", and it can give me some explanations like ChatGPT.

Clearly, they are using more than OpenAI's embeddings to make this possible. I read on Twitter that GPTDuck also uses LangChain, which I'm not so familiar with.

Any idea how they're able to go from advanced search to conversational?

Thank you for your insight!

1

m98789 t1_j4eutfz wrote

Can you please link me to the tweet you are referring to?

My understanding of Q&A in LangChain is that it can answer “what” questions like “What did XYZ say…” but not “why” questions, because the “what” questions are really just text similarity searches.

But maybe there is more to it, so I’d like to see the tweet.

1

m98789 t1_j4f135j wrote

Got it, this is how I believe it was implemented:

  • Stage 0: split all the code into chunks, compute an embedding for each chunk, and save both in one table for lookups, e.g., code in one field and its embedding in the adjacent field.
  • Stage 1: semantic search to find code. Take your query and encode it into an embedding. Then apply a dot product against all the code embeddings in the table to find semantically similar code chunks.
  • Stage 2: combine the top-K most similar chunks into one string or list we can call the “context”.
  • Stage 3: stuff the context into a prompt as a preamble, then append the actual question you want to ask.
  • Stage 4: send the prompt to an LLM like GPT-3, collect the answer, and show it to the user.
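
The stages above can be sketched roughly as follows. This is only a guess at the shape of such a pipeline, not GPTDuck's actual code: `embed` is a stand-in bag-of-words vectorizer rather than a real embedding model, `fake_llm` is a placeholder for a real model call, and every function name and the prompt wording are made up for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in embedding: a unit-length bag-of-words vector (as a dict).
    A real system would call an embedding model here."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def dot(a, b):
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

def build_index(chunks):
    # Stage 0: store each chunk next to its embedding.
    return [(chunk, embed(chunk)) for chunk in chunks]

def top_k(index, question, k=2):
    # Stage 1: embed the query and rank chunks by dot product.
    q = embed(question)
    ranked = sorted(index, key=lambda row: dot(q, row[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(context_chunks, question):
    # Stages 2 and 3: join the chunks into a context preamble,
    # then append the question.
    context = "\n\n".join(context_chunks)
    return (
        "Use the code below to answer.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(index, question, llm, k=2):
    # Stage 4: send the prompt to the model and return its reply.
    return llm(build_prompt(top_k(index, question, k), question))

def fake_llm(prompt):
    # Placeholder model: it only reports the prompt size.
    return f"(model reply to a {len(prompt)}-character prompt)"

chunks = [
    "def load_config(path):\n    return parse_yaml(open(path).read())",
    "def send_email(to, body):\n    smtp.send(to, body)",
    "def parse_yaml(text):\n    return yaml.safe_load(text)",
]
index = build_index(chunks)
question = "What does load_config do with the path?"
prompt = build_prompt(top_k(index, question), question)
print(answer(index, question, fake_llm))
```

With k=2 the unrelated `send_email` chunk is left out of the prompt, so the model only sees the code that is most likely to matter for the question.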

2

GoodluckH OP t1_j4hiypg wrote

Ahh this makes a lot of sense. Regarding stage 0, how do you split the code? Just by lines, or is there some method for extracting functions and classes?

I wrote a script that extracts Python functions using regex, but this is definitely not scalable to other languages…
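
For the Python case at least, one possible alternative to regex is the standard library's `ast` module, which parses the file and reports where each function and class sits. A minimal sketch follows; `extract_definitions` and the sample source are hypothetical names for illustration, and this still does not solve the multi-language problem.

```python
import ast

def extract_definitions(source):
    """Return (kind, name, code) for every function and class in `source`."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue
        # get_source_segment (Python 3.8+) recovers the exact source text.
        found.append((kind, node.name, ast.get_source_segment(source, node)))
    return found

sample = '''
import os

def greet(name):
    return f"hello {name}"

class Repo:
    def path(self):
        return os.getcwd()
'''

for kind, name, code in extract_definitions(sample):
    print(kind, name, len(code.splitlines()), "lines")
```

Note that a method shows up twice here, once on its own and once inside its class, so a real chunker would have to pick one granularity.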

1

Naive-Progress4549 t1_j4f9x8o wrote

Professor Romain Robbes has a research group focusing on this; you might look at his papers or contact him!

1

ApolloniusOfPerga420 t1_j4fqa2g wrote

You could probably just do this with Codex. Its zero-shot performance is very high.

1

MysteryInc152 t1_j4lv0d5 wrote

Codex and ChatGPT can understand more than just functions. The issue with them is the limited token window: a large codebase won't fit into a single prompt.
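
One way to work within that limit might be to send only the highest-ranked chunks that fit a token budget. The sketch below assumes the chunks are already ranked by relevance; `estimate_tokens` uses a crude four-characters-per-token guess rather than a real tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text):
    # Crude assumption: about one token per 4 characters.
    # A real tokenizer would give the exact count.
    return max(1, len(text) // 4)

def pack_context(ranked_chunks, budget):
    """Greedily keep the best-ranked chunks that fit in `budget` tokens."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue  # too big for the remaining space; a later one may fit
        kept.append(chunk)
        used += cost
    return kept, used

# Three pretend chunks, already sorted from most to least relevant.
ranked = ["a" * 400, "b" * 4000, "c" * 200]
kept, used = pack_context(ranked, budget=200)
print(len(kept), used)
```

Here the oversized second chunk is skipped and the smaller third one still fits, so the prompt stays under the budget.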

1