maizeq t1_jd6u4kb wrote on March 22, 2023 at 7:03 AM

Reply to comment by artsybashev in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101

The sparsity they describe in this link entails randomly pruning weights (i.e. not specific channels like depthwise convolutions), which is what Graphcore is calling "unstructured".

osdd_alt_123 t1_jd6ufjz wrote on March 22, 2023 at 7:07 AM

Nvidia has 2:4 structured sparsity in hardware starting with the Ampere architecture (it was introduced there rather than in earlier generations), if memory serves.

So in every block of 4 consecutive weights, 2 have to be dropped (zeroed) and 2 retained. That's how they claim their 2x throughput at the hardware level.
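To make the 2-of-4 constraint concrete, here is a minimal sketch in plain Python. The function name `two_four_mask` is made up for illustration, and keeping the two largest-magnitude weights per block is an assumption about how the survivors are chosen; the hardware only requires that 2 of every 4 are zero.

```python
def two_four_mask(weights):
    """Return a 0/1 mask keeping 2 of every 4 consecutive weights.

    Assumption: the two largest-magnitude weights in each block survive.
    """
    if len(weights) % 4 != 0:
        raise ValueError("number of weights must be a multiple of 4")
    mask = []
    for start in range(0, len(weights), 4):
        block = weights[start:start + 4]
        # Positions of the two largest magnitudes within this block.
        keep = sorted(range(4), key=lambda i: abs(block[i]), reverse=True)[:2]
        mask.extend(1 if i in keep else 0 for i in range(4))
    return mask


weights = [0.1, -0.9, 0.4, 0.05, 2.0, -0.3, 0.2, 1.5]
mask = two_four_mask(weights)
print(mask)  # [0, 1, 1, 0, 1, 0, 0, 1]
```

Every block of 4 in the mask sums to exactly 2, which is the regularity the hardware relies on.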

You can, however, emulate sparsity in a variety of other ways above the hardware level, i.e. in software. Hope this helps.

maizeq t1_jd76a7x wrote on March 22, 2023 at 9:57 AM

Ah I see, thank you for the clarification.

brownmamba94 t1_jd8lqry wrote on March 22, 2023 at 4:46 PM

Also, the N:M sparsity structure is much more constrained in terms of mask diversity than unstructured sparsity. In Table 1 of the N:M Transposable sparsity paper, they compare mask diversity across different sparsity techniques (both unstructured and structured), and as expected unstructured sparsity achieves the highest diversity. I think this is important especially for dynamic sparse training, because the algorithm then has a much larger search space of sparse subnetworks to explore. Also, imposing structured sparsity like N:M tends to reduce the expressivity of a weight matrix at higher sparsity levels, which can be a limitation if you want high compression ratios.
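As a rough illustration of the gap, the sketch below counts how many distinct masks each constraint admits at the same sparsity level. It assumes "mask diversity" means exactly that count; the function names are hypothetical and the numbers are computed here, not taken from the paper's table.

```python
from math import comb


def unstructured_mask_count(num_weights, num_kept):
    """Masks that keep any num_kept of num_weights positions."""
    return comb(num_weights, num_kept)


def n_m_mask_count(num_weights, n_keep, m):
    """Masks that keep exactly n_keep weights in every block of m."""
    if num_weights % m != 0:
        raise ValueError("num_weights must be a multiple of m")
    # Each block is chosen independently, so the counts multiply.
    return comb(m, n_keep) ** (num_weights // m)


# 50% sparsity over 16 weights: unstructured vs. 2:4
print(unstructured_mask_count(16, 8))  # 12870
print(n_m_mask_count(16, 2, 4))        # 1296
```

Even at 16 weights the unstructured count is about 10x larger, and the ratio keeps growing with tensor size.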