Submitted by niclas_wue t3_10cgm8d in MachineLearning

Hi there,

I wanted to share my new project with you: arxiv-summary.com. Right now, I find it really difficult to keep up with all the important new publications in our field. In particular, it is sometimes difficult to get an overview of a paper to decide if it's worth reading. I really like arxiv-sanity by Andrej Karpathy, but even with that, it can still take some time to understand the main ideas and contributions from the abstract. With arxiv-summary, my goal is to make ML research papers more "human-parsable".

The website works by fetching new papers daily from arxiv.org, using PapersWithCode to filter out the most relevant ones. Then, I parse the papers' pdf and LaTeX source code to extract relevant sections and subsections. GPT-3 then summarizes each section and subsection as bullet points, which are finally compiled into a blog post and uploaded to the site.

You can check out the site at arxiv-summary.com and see for yourself. There's also a search page and an archive page where you can get a chronological overview. If you have any feedback or questions, I'd be happy to hear them. Also, if you work at OpenAI and could gift me some more tokens, that would be much appreciated :D

Thanks and happy reading!

40

Comments

ml-research t1_j4fpav0 wrote

Thanks for sharing!

> The website works by fetching new papers daily from arxiv.org, using PapersWithCode to filter out the most relevant ones.

What do you mean by "relevant"? What kinds of papers do you fetch?

7

niclas_wue OP t1_j4fqqy6 wrote

Thanks for asking! My first prototype collected all new arxiv papers in certain ML-related categories via the API; however, I quickly realized that this would be way too costly. Right now, I collect all papers from PapersWithCode's "Top" tab (last 30 days) and the "Social" tab, which is based on Twitter likes and retweets. Finally, I filter using this formula:

p.number_of_likes + p.number_of_retweets > 20 or p.number_github_stars > 100

In rare cases, when the paper is really long or not parsable with "grobid", I will exclude the paper for now.
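As a rough illustration, the filter rule above could be written as a small Python predicate. The `Paper` class and its `title` field are hypothetical; only the three counter names are taken from the formula in the comment.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    # Hypothetical record type; only the three counters appear in the formula.
    title: str
    number_of_likes: int = 0
    number_of_retweets: int = 0
    number_github_stars: int = 0


def is_relevant(p: Paper) -> bool:
    """Keep a paper if it has enough Twitter engagement or enough GitHub stars."""
    return (
        p.number_of_likes + p.number_of_retweets > 20
        or p.number_github_stars > 100
    )


def filter_papers(papers: list[Paper]) -> list[Paper]:
    """Return only the papers that pass the relevance rule, in input order."""
    return [p for p in papers if is_relevant(p)]
```

Note that both comparisons are strict, so a paper with exactly 20 likes plus retweets and exactly 100 stars would be dropped.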

10

transgalpower t1_j4fwztn wrote

It could be nice if you let people donate compute power. That way, as a community, we could keep it running.

Also would be nice if it worked on mobile. Idk why but it says the domain isn't safe

5

niclas_wue OP t1_j4fyzia wrote

Yes, in the long run, there needs to be some sort of monetization to afford the API tokens. For now, I just want to see if people find it useful at all.

Thanks for letting me know. For me it works on mobile, but I will look into that.

3

Iunaml t1_j4g88q7 wrote

Sometimes I wish we could just write papers directly in a summarized manner (and not automatically)

I'm starting my bullet-point conference soon..

11

Reasonable_Ladder922 t1_j4hewm2 wrote

Your arxiv-summary.com project sounds like a great idea and a very useful tool for people in the field of machine learning. It's great that you're using PapersWithCode to filter out the most relevant papers, and that you're using GPT-3 to summarize the papers' sections and subsections.

Because the website fetches new papers daily, parses their pdf and LaTeX source code to extract relevant sections and subsections, and then summarizes those with GPT-3, it will make it easier for people to quickly understand a paper's main ideas and contributions.

It's great to hear that you have a search page and an archive page where users can get a chronological overview; this will help people to keep track of new publications in their field.

I wish you the best of luck with your project and I'm sure it will be a great resource for many people in the field of machine learning.

1

RuairiSpain t1_j4i2q1b wrote

I suspect companies would pay a subscription for this! Individuals no, especially me ;)

Great work, thank you.

3

RuairiSpain t1_j4i4t54 wrote

I left academia in the 1990s. When did paper titles become so vague? "In my day", you had a good idea what the paper was about just from the title. Reading the first 30-40 papers here, what are authors trying to do? Be comedians?

I need a more up-to-date buzzword thesaurus of research fields and fashions, so I can interpret the context/semantics of these titles! I feel old 😫

5

niclas_wue OP t1_j4i9r9w wrote

Thanks for your ideas. Building a paid experience for companies is a great idea; I will consider it.

Category tagging like "computer vision", "natural language processing" etc. should be relatively straightforward. Will implement this in the next couple of days :)

More paper-specific tags could be generated using GPT-3. I think that would make sense once the database is a bit larger. Right now, I would guess that most tags would be unique to a single paper.

3

kroust2020 t1_j4j6vw1 wrote

Looks really cool. Great job. Is the code open-sourced? I'd like to learn how to do that sort of thing

2

blueSGL t1_j4jy4w4 wrote

Don't need to check the comment history; that style of formatting shines like a red light now. Woe betide anyone who naturally types like that. They'll need to change things up to avoid being accused of being a bot!

(Or, at the other extreme, people get so used to reading things formatted with that cadence that it starts being unconsciously replicated in their own work.)

3

niclas_wue OP t1_j4k17s2 wrote

Thank you, I am glad you like it! At the moment, only the web server is public. It lives at https://github.com/niclaswue/arxiv-smry and is a Hugo server with a blog theme. Every blog post is a markdown file. When a new file is pushed to git, it automatically gets published on the blog.

The rest is basically a bunch of (messy) Python scripts for extracting the text, then asking GPT-3 for a summary and compiling the answers to a markdown file. Finally, I use GitPython to automatically push new summaries to the repo.
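A minimal sketch of the "compile the answers to a markdown file" step might look like the following. It is not the author's script: the function name `build_post`, its arguments, and the choice of `title`/`date` front matter fields are assumptions for illustration, and the GPT-3 call and the GitPython push are left out.

```python
import json
from datetime import date


def build_post(title, sections, published):
    """Render bullet-point section summaries as a Hugo-style markdown post.

    `sections` is a list of (heading, bullets) pairs, where `bullets` is a
    list of summary strings (assumed to come from the summarization step).
    """
    lines = [
        "---",
        # json.dumps yields a double-quoted string, which is also valid YAML.
        f"title: {json.dumps(title)}",
        f"date: {published.isoformat()}",
        "---",
        "",
    ]
    for heading, bullets in sections:
        lines.append(f"## {heading}")
        lines.append("")
        lines.extend(f"- {bullet}" for bullet in bullets)
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    post = build_post(
        "An Example Paper",
        [("Introduction", ["First point", "Second point"])],
        date(2023, 1, 15),
    )
    print(post)
```

Writing the returned string to a file under the Hugo content directory and committing it would then trigger the publish step described above.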

3

FreddieM007 t1_j4l58y4 wrote

Great idea! There is a lot of potential! The biggest challenge for me is not just reading the most important papers but finding them. You already did the heavy lifting by downloading papers and computing the GPT-3 embeddings. With that you can build an index and add searching. You could cluster papers into categories to let the user browse. You could UMAP the papers, etc. In the long term I would want it to be comprehensive and include all papers. In terms of costs, perhaps you can partner with arxiv directly. They should be interested in using your project...
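To make the search suggestion concrete, here is a small sketch of nearest-neighbour lookup over embeddings using cosine similarity. It assumes each paper already has an embedding vector, which the thread does not confirm; the function names and the dictionary-based index are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=3):
    """Return the titles of the top_k papers most similar to query_vec.

    `index` maps paper title -> embedding vector (assumed precomputed).
    """
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]
```

A linear scan like this is fine for a few hundred papers; a comprehensive index of all arxiv papers would likely need an approximate nearest-neighbour structure instead.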

4

Yidam t1_j4pr9p9 wrote

Can I upload papers to use its wizardry?

3

niclas_wue OP t1_j4q13uw wrote

That’s actually a really good idea. Would you be willing to pay for such a feature? Something like 1$ per paper? That would cover the cost for the GPT tokens

3

Yidam t1_j4qohth wrote

>That’s actually a really good idea. Would you be willing to pay for such a feature? Something like 1$ per paper? That would cover the cost for the GPT tokens

That would bankrupt me (though I'm already basically bankrupt); others may find that acceptable, however. Can it be applied to books? Book chapters? How does it deal with equations? Does the input need to be in LaTeX, or is PDF OK too? Does the PDF need to be converted to text?

1

niclas_wue OP t1_j4r5wb1 wrote

Yes, it can be applied to any document. A book would be more expensive because it has more text and thus more input tokens. The PDF needs to be converted to text because the API only accepts text. Equations that can be written using Unicode are passed directly to the network, which can understand them; other equations are currently skipped. So far I have spent almost 100$ in tokens to summarize the papers, so there need to be some paid features in the near future or a reduction in the number of papers.
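The "more text, more input tokens, more cost" relationship can be sketched as simple arithmetic. Both default values below are assumptions, not figures from the thread: roughly 4 characters per token is a common rule of thumb for English text, and the price per 1,000 tokens is a placeholder to be replaced with the current rate. Output tokens are ignored.

```python
def estimate_input_cost_usd(text, chars_per_token=4.0, usd_per_1k_tokens=0.02):
    """Rough cost of sending `text` to a token-priced API (input tokens only).

    chars_per_token and usd_per_1k_tokens are assumed placeholder values.
    """
    tokens = len(text) / chars_per_token
    return tokens / 1000 * usd_per_1k_tokens
```

Under these assumptions the cost scales linearly with length, so a book ten times longer than a paper costs about ten times as much to summarize.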

1

randomusername11010 t1_j4yljwd wrote

Could you parse the citations to find which papers are cited the most to determine the most relevant papers, rather than relying on PapersWithCode?

2

niclas_wue OP t1_j4yukoz wrote

Yes, it is possible to use citations as a measure of a paper's impact. However, when a paper is newly published, there are typically no citations yet, so this would result in a delayed signal. Retweets and GitHub stars provide a faster indication of a paper's impact. I believe that speed is important because, as a paper becomes older, there are already many reviews and articles written by humans that (at least for now) provide a better summary of the paper.

2