Aphix t1_jca204m wrote on March 15, 2023 at 11:12 AM

Mind elaborating?

usc-ur OP t1_jca4ot7 wrote on March 15, 2023 at 11:41 AM

Sure! The idea is that you create a language model from a given corpus (let's say BNC) and then you use a similarity measure, in this case, perplexity, but can be another one to test how well your sample (sentence) "fits" into the model distribution. Since we assume the distribution is correct, this allows us to identified malformed sentences. You can also check the paper here: https://www.cambridge.org/core/journals/natural-language-engineering/article/an-unsupervised-perplexitybased-method-for-boilerplate-removal/5E589D838F1D1E0736B4F52001150339#article