Submitted by starstruckmon t3_1027geh in MachineLearning
bloc97 t1_j2s05hy wrote
It's curious that a 40% pruning of OPT-175 decreases perplexity, but the same effect is not seen in BLOOM... Could be a fluke but might warrant further investigation.
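(For readers unfamiliar with what "40% pruning" means in practice, here is a minimal sketch of generic unstructured magnitude pruning in PyTorch. This is not necessarily the method used in the linked paper; the layer size and sparsity level are purely illustrative.)

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.4) -> torch.Tensor:
    """Zero out the given fraction of weights with the smallest absolute values."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    # The k-th smallest |w| becomes the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

# Illustrative use: prune one linear layer's weights to ~40% sparsity.
layer = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.4))
```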
omniron t1_j2stl7w wrote
Just shows we have a huge amount to learn about how these systems actually work
mycall t1_j50h4l7 wrote
It's definitely complicated. There are many DAGs (paths through the network) that can produce similar or repeating patterns, and some connections are suboptimal and thus never needed. How do you choose which to keep and which to delete?
learn-deeply t1_j2u53ek wrote
My unsubstantiated hypothesis: BLOOM is severely undertrained, so most neurons aren't contributing at all to the final result compared to OPT-175.
ElectronicCress3132 t1_j2v4vy4 wrote
Could you elaborate on what you mean by undertrained?
learn-deeply t1_j2vac5q wrote
The model hasn't reached convergence, and/or the train dataset was too small.
matth0x01 t1_j2u5rwm wrote
Sorry - What's meant by perplexity here?
prototypist t1_j2uskwt wrote
It's a metric comparing the model's generative probabilities / text predictions vs. the actual text.
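In code terms, perplexity is the exponentiated average negative log-likelihood the model assigns to the tokens that actually occurred. A minimal, self-contained sketch in plain PyTorch (the toy logits and targets are made up purely for illustration):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """exp(average negative log-likelihood the model assigns to the actual tokens).

    logits:  (seq_len, vocab_size) raw predictions at each position
    targets: (seq_len,) ids of the tokens that actually occurred
    """
    nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood, in nats
    return torch.exp(nll).item()            # lower = better text predictions

# Toy example: 5 positions over a 100-token vocabulary.
logits = torch.randn(5, 100)
targets = torch.randint(0, 100, (5,))
print(perplexity(logits, targets))  # random predictions -> high perplexity
```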
matth0x01 t1_j2vxl6g wrote
Thanks! Hm, seems to be a measure of sharpness for the predicted words?
unkz t1_j2v9edv wrote
matth0x01 t1_j2vx7z4 wrote
Yes, I know the concept, but where's the connection to the pruning approach here?
unkz t1_j2wzgf3 wrote
Perplexity is one of the key evaluation metrics for how well a language model understands language. Pruning one of the models (OPT-175) decreases its perplexity (i.e., makes the model better), which is interesting.
matth0x01 t1_j2x49gm wrote
Thanks - I think I got it. What's new to me is why language models report perplexity instead of log-likelihood, which is a monotonic function of perplexity.
From Wikipedia it seems that perplexity is in units of "words" instead of nats/bits, which might be more interpretable.
Are there other advantages I'm overlooking?
unkz t1_j2x7ggd wrote
That’s basically it. Cross-entropy (the average negative log-likelihood per word) and perplexity are related by
perplexity = 2^(cross-entropy), with cross-entropy measured in bits (in nats it's e^(cross-entropy)).
So the main two things are interpretability (perplexity is roughly how many words the model is effectively choosing from at any point) and scale (small changes in cross-entropy result in large changes in perplexity).
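To make both points concrete, a small worked example (the numbers are illustrative only):

```python
import math

# Interpretability: perplexity 20 means the model is, on average, as uncertain
# as if it were picking uniformly among 20 words at each step.
cross_entropy_bits = math.log2(20)   # ~4.32 bits per word
print(2 ** cross_entropy_bits)       # -> 20.0

# Scale: a small drop in cross-entropy is a much larger relative drop in perplexity.
print(2 ** 4.32, 2 ** 4.12)          # ~19.97 vs ~17.39: 0.2 bits ~= 13% fewer effective choices
```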