
bloc97 t1_j2s05hy wrote

It's curious that a 40% pruning of OPT-175 decreases perplexity, but the same effect is not seen in BLOOM... Could be a fluke but might warrant further investigation.

29

omniron t1_j2stl7w wrote

Just shows we have a huge amount to learn about how these systems actually work

22

mycall t1_j50h4l7 wrote

It's probably very complicated. There are many DAGs that reach similar or repeating patterns, or connections that are suboptimal and thus never needed. How do you choose which to keep and which to delete?

1

learn-deeply t1_j2u53ek wrote

My unsubstantiated hypothesis: BLOOM is severely undertrained, so most neurons aren't contributing at all to the final result compared to OPT-175.

13

matth0x01 t1_j2u5rwm wrote

Sorry - What's meant by perplexity here?

3

prototypist t1_j2uskwt wrote

It's a metric comparing the model's generative probabilities / text predictions vs. the actual text.
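For what it's worth, a minimal sketch of that computation in plain Python, with made-up per-token probabilities rather than any real model's outputs: take the probability the model assigned to each actual next token, average the negative log-probabilities, and exponentiate.

```python
import math

# Hypothetical probabilities a model assigned to the *actual* next token
# at each position of a held-out text (illustrative numbers only).
token_probs = [0.25, 0.10, 0.60, 0.05, 0.30]

# Cross entropy: average negative log-probability per token (in nats).
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity: exponentiate the cross entropy.
perplexity = math.exp(cross_entropy)

print(f"cross entropy: {cross_entropy:.3f} nats, perplexity: {perplexity:.2f}")
# Lower perplexity means the model put more probability mass on the text it actually saw.
```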

4

matth0x01 t1_j2vxl6g wrote

Thanks! Hm, seems to be a measure of sharpness for the predicted words?

1

unkz t1_j2v9edv wrote

1

matth0x01 t1_j2vx7z4 wrote

Yes, I know the concept, but where's the connection to the pruning approach here?

2

unkz t1_j2wzgf3 wrote

Perplexity is one of the key evaluation metrics for how well a language model understands language. Pruning one of the models decreases its perplexity (i.e., makes the model better), which is interesting.

1

matth0x01 t1_j2x49gm wrote

Thanks - I think I got it. It's still new to me why language models use perplexity instead of log-likelihood, which is a monotonic function of perplexity.

From Wikipedia it seems that perplexity is in units of "words" instead of "nats/bits", which might be more interpretable.

Are there other advantages I'm overlooking?

1

unkz t1_j2x7ggd wrote

That’s basically it. Cross entropy (the average negative log likelihood per token) and perplexity are related by

perplexity = 2^(cross entropy)

with the cross entropy measured in bits. So the two main things are interpretability (perplexity is roughly how many words the model is choosing among at any point) and scale (small changes in cross entropy result in large changes in perplexity).
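As a rough illustration of the scale point, a small plain-Python sketch with made-up cross-entropy numbers (not figures from the paper):

```python
# Two hypothetical models evaluated on the same held-out text,
# with cross entropies in bits per token (illustrative numbers only).
ce_dense = 2.30   # bits/token
ce_pruned = 2.20  # bits/token, slightly lower after pruning

ppl_dense = 2 ** ce_dense    # perplexity = 2^(cross entropy in bits)
ppl_pruned = 2 ** ce_pruned

print(f"dense:  CE={ce_dense} bits -> PPL={ppl_dense:.2f}")
print(f"pruned: CE={ce_pruned} bits -> PPL={ppl_pruned:.2f}")
# A 0.10-bit drop in cross entropy already shows up as a clearly lower
# perplexity, i.e. the model is effectively choosing among fewer
# equally likely words at each step.
```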

1