Submitted by starstruckmon t3_1027geh in MachineLearning
bloc97 t1_j2s05hy wrote
It's curious that a 40% pruning of OPT-175 decreases perplexity, but the same effect is not seen in BLOOM... Could be a fluke but might warrant further investigation.
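(For readers unfamiliar with what "40% pruning" means in practice, here is a minimal sketch of generic unstructured magnitude pruning in PyTorch. This is not necessarily the method used in the linked paper; the layer size and sparsity level are purely illustrative.)

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.4) -> torch.Tensor:
    """Zero out the given fraction of weights with the smallest absolute values."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # The k-th smallest |w| becomes the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

# Illustrative use: prune one linear layer's weights to ~40% sparsity.
layer = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.4))
```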
omniron t1_j2stl7w wrote
Just shows we have a huge amount to learn about how these systems actually work
mycall t1_j50h4l7 wrote
It's definitely complicated. There are many DAGs (paths through the network) that can produce similar or repeating patterns, and some connections are suboptimal and thus never needed. How do you choose which to keep and which to delete?
learn-deeply t1_j2u53ek wrote
My unsubstantiated hypothesis: BLOOM is severely undertrained, so most neurons aren't contributing at all to the final result compared to OPT-175.
ElectronicCress3132 t1_j2v4vy4 wrote
Could you elaborate on what you mean by undertrained?
learn-deeply t1_j2vac5q wrote
The model hasn't reached convergence, and/or the train dataset was too small.
matth0x01 t1_j2u5rwm wrote
Sorry - What's meant by perplexity here?
prototypist t1_j2uskwt wrote
It's a metric comparing the model's generative probabilities / text predictions vs. the actual text.
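In code terms, perplexity is the exponentiated average negative log-likelihood the model assigns to the tokens that actually occurred. A minimal, self-contained sketch in plain PyTorch (the toy logits and targets are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """exp(average negative log-likelihood the model assigns to the actual tokens).

    logits:  (seq_len, vocab_size) raw predictions at each position
    targets: (seq_len,) ids of the tokens that actually occurred
    """
    nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood, in nats
    return torch.exp(nll).item()            # lower = better text predictions

# Toy example: 5 positions over a 100-token vocabulary.
logits = torch.randn(5, 100)
targets = torch.randint(0, 100, (5,))
print(perplexity(logits, targets))  # random predictions -> high perplexity
```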
matth0x01 t1_j2vxl6g wrote
Thanks! Hm, seems to be a measure of sharpness for the predicted words?
unkz t1_j2v9edv wrote
matth0x01 t1_j2vx7z4 wrote
Yes, I know the concept, but where's the connection to the pruning approach here?
unkz t1_j2wzgf3 wrote
Perplexity is one of the key evaluation metrics for how well a language model understands language. Pruning one of the models (OPT-175) decreases its perplexity (i.e., makes the model better), which is interesting.
matth0x01 t1_j2x49gm wrote
Thanks - I think I got it. What's new to me is why language models report perplexity instead of log-likelihood, which is a monotonic function of perplexity.
From Wikipedia it seems that perplexity is in units of "words" instead of nats/bits, which might be more interpretable.
Are there other advantages I'm overlooking?
unkz t1_j2x7ggd wrote
That’s basically it. Cross-entropy (the average negative log-likelihood per word) and perplexity are related by
perplexity = 2^(cross-entropy), with cross-entropy measured in bits (in nats it's e^(cross-entropy)).
So the main two things are interpretability (perplexity is roughly how many words the model is effectively choosing from at any point) and scale (small changes in cross-entropy result in large changes in perplexity).
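To make both points concrete, a small worked example (the numbers are illustrative only):

```python
import math

# Interpretability: perplexity 20 means the model is, on average, as uncertain
# as if it were picking uniformly among 20 words at each step.
cross_entropy_bits = math.log2(20)   # ~4.32 bits per word
print(2 ** cross_entropy_bits)       # -> 20.0

# Scale: a small drop in cross-entropy is a much larger relative drop in perplexity.
print(2 ** 4.32, 2 ** 4.12)          # ~19.97 vs ~17.39: 0.2 bits ~= 13% fewer effective choices
```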