matth0x01 t1_j2x49gm wrote on January 4, 2023 at 3:48 PM

Reply to comment by unkz in [R] Massive Language Models Can Be Accurately Pruned in One-Shot by starstruckmon

Thanks - I think I got it. It's kind of new to me that language models report perplexity instead of log-likelihood, given that log-likelihood is a monotonic function of perplexity.

From Wikipedia it seems that perplexity is expressed in units of "words" rather than "nats/bits", which might be more interpretable.

Are there other advantages I'm overlooking?

unkz t1_j2x7ggd wrote on January 4, 2023 at 4:09 PM

That’s basically it. Cross entropy (the average negative log-likelihood per token) and perplexity are related by

Perplexity = 2^(cross entropy), with cross entropy measured in bits (use e instead of 2 if it is in nats)

So the two main things are interpretability (perplexity roughly measures how many equally likely words the model is choosing from at each position) and scale (because the relationship is exponential, small changes in cross entropy show up as large changes in perplexity).
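
A minimal sketch of that relationship in Python, assuming we already have the probability the model assigned to each observed token. The function names and the example probabilities are made up for illustration and do not come from any particular library:

```python
import math


def cross_entropy_bits(token_probs):
    """Average negative log2-probability per token (cross entropy in bits)."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def cross_entropy_nats(token_probs):
    """Same quantity with the natural log (cross entropy in nats)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity = 2 ** (cross entropy in bits)."""
    return 2 ** cross_entropy_bits(token_probs)


# Hypothetical model that gives every observed token probability 1/4,
# i.e. it behaves as if picking uniformly among 4 words at each step.
uniform = [0.25] * 8
print(cross_entropy_bits(uniform))  # 2.0 bits per token
print(perplexity(uniform))          # 4.0 "words"

# The choice of base does not matter as long as it is used consistently.
print(math.isclose(math.exp(cross_entropy_nats(uniform)), perplexity(uniform)))

# Scale: a 1-bit drop in cross entropy halves the perplexity.
print(2 ** 5.0, 2 ** 4.0)  # 32.0 16.0
```

The uniform case is where the "number of words being chosen from" reading is exact; for a non-uniform model, perplexity is better read as an effective number of choices.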