Submitted by usc-ur t3_11r45r8 in coolgithubprojects
Comments
usc-ur OP t1_jc73mgl wrote
Aphix t1_jc799qb wrote
All good, might want to link the other ones you posted in those other post comments, too (Smarty-GPT, etc) -- most of the time you can just link the GitHub directly from the post here.
usc-ur OP t1_jc7guwj wrote
>All good, might want to link the other ones you posted in those other post comments, too (Smarty-GPT, etc) -- most of the time you can just link the GitHub directly from the post here.
My bad, quite a newbie: https://github.com/citiususc/Smarty-GPT
gargolito t1_jc889m6 wrote
so... what are perplexity filters?
usc-ur OP t1_jc9ss05 wrote
A perplexity filter lets you remove sentences based on their likelihood under a given language model. You need to "play" with the threshold parameter there.
Aphix t1_jca204m wrote
Mind elaborating?
usc-ur OP t1_jca4ot7 wrote
Sure! The idea is that you build a language model from a given corpus (say, the BNC) and then use a similarity measure, in this case perplexity (though it could be another one), to test how well your sample (sentence) "fits" the model's distribution. Since we assume the distribution is correct, this lets us identify malformed sentences. You can also check the paper here: https://www.cambridge.org/core/journals/natural-language-engineering/article/an-unsupervised-perplexitybased-method-for-boilerplate-removal/5E589D838F1D1E0736B4F52001150339#article
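To make the idea concrete, here is a minimal sketch of the technique, not the repo's actual implementation: it trains a toy add-alpha-smoothed unigram model (a stand-in for whatever LM you build from your corpus), scores sentences by per-token perplexity, and drops those above a chosen threshold. All function names and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, alpha=1.0):
    """Train an add-alpha smoothed unigram model; returns a probability function.

    Toy stand-in for a real LM (n-gram, neural, etc.) built from a corpus.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    def prob(token):
        return (counts.get(token, 0) + alpha) / (total + alpha * vocab)
    return prob

def perplexity(sentence_tokens, prob):
    """Per-token perplexity of a sentence under the model: exp(-mean log p)."""
    log_sum = sum(math.log(prob(t)) for t in sentence_tokens)
    return math.exp(-log_sum / len(sentence_tokens))

def perplexity_filter(sentences, prob, threshold):
    """Keep sentences whose perplexity under the model is below the threshold."""
    return [s for s in sentences if perplexity(s.split(), prob) < threshold]

# Sentences that "fit" the corpus distribution get low perplexity and survive;
# out-of-distribution (malformed, boilerplate) sentences get high perplexity.
corpus = "the cat sat on the mat the dog sat on the rug".split()
model = train_unigram(corpus)
kept = perplexity_filter(
    ["the cat sat on the mat", "zebra quantum flux"],
    model,
    threshold=15.0,  # assumed value; in practice tuned per corpus/model
)
```

The threshold is the parameter you have to "play" with: too low and you discard good sentences, too high and malformed ones slip through.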
Aphix t1_jc739fl wrote
Link to GitHub?