GamerMinion t1_jddeprr wrote
When you say "FLOP-equivalent, does that also mean compute-time equivalent?
I ask because on GPUs, models like EfficientNet, which technically have far fewer FLOPs and parameters, can be much slower than a standard ResNet of the same accuracy, because they parallelize far less efficiently.
Did you look into inference latency on GPUs in your paper?
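For example, a quick timing sketch like the one below (my own illustration, assuming torchvision models and a CUDA device) usually makes the FLOPs-vs-latency gap visible:

```python
# Rough timing sketch (not from the paper): compare GPU inference latency of
# EfficientNet-B0 and ResNet-50, which have very different FLOP counts.
import torch
import torchvision.models as models

def measure_latency(model, batch_size=32, image_size=224, iters=50):
    """Average forward-pass wall-clock time (ms/batch) on the current CUDA device."""
    model = model.eval().cuda()
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    with torch.no_grad():
        # Warm-up so cuDNN autotuning and allocation don't skew the timings.
        for _ in range(10):
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for name, net in [("efficientnet_b0", models.efficientnet_b0()),
                  ("resnet50", models.resnet50())]:
    print(f"{name}: {measure_latency(net):.2f} ms/batch")
```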
brownmamba94 t1_jddhxdb wrote
Hi, yes, this is a great question. By FLOP-equivalent we mean that on ideal hardware that can accelerate unstructured weight sparsity, the total compute time would also be equivalent. Moreover, we show that we can actually improve the accuracy of the original dense model for the same compute budget with these Sparse Iso-FLOP Transformations (e.g., Sparse Wide, Sparse Parallel, etc.).
In Section 4 of our paper, we compare inference and training on hardware with and without support for sparsity acceleration.
In theory, there should be no increase in wall-clock time, but on today's GPUs there would be a significant increase. However, emerging hardware accelerators like the Cerebras CS-2 are doing hardware-software co-design for sparse techniques, which lets us take advantage of sparse acceleration during training.
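To make the FLOP accounting concrete, here is a rough back-of-the-envelope sketch (just for intuition, not our actual implementation) of the Sparse Wide idea: widen a linear layer by a factor k while keeping only a 1/k^2 fraction of its weights non-zero, so the non-zero FLOPs match the dense baseline on hardware that can skip zeros.

```python
# Illustrative sketch of a FLOP-matched "Sparse Wide" linear layer.
import torch
import torch.nn as nn

def sparse_wide_linear(d_in, d_out, k):
    """Widen a linear layer by k and apply a random unstructured mask at density 1/k**2."""
    layer = nn.Linear(int(d_in * k), int(d_out * k), bias=False)
    density = 1.0 / (k * k)
    mask = (torch.rand_like(layer.weight) < density).float()
    with torch.no_grad():
        layer.weight *= mask  # zero out the pruned weights
    return layer, mask

d_in, d_out, k = 512, 512, 2.0
dense = nn.Linear(d_in, d_out, bias=False)
sparse, mask = sparse_wide_linear(d_in, d_out, k)

dense_flops = 2 * d_in * d_out             # multiply-accumulates x 2
sparse_flops = 2 * int(mask.sum().item())  # only non-zero weights count on ideal hardware
print(f"dense FLOPs:  {dense_flops}")
print(f"sparse FLOPs: {sparse_flops} (approximately equal in expectation)")
```

On a GPU the widened layer is still a bigger dense matmul, which is where the wall-clock gap comes from; the equivalence only holds on hardware that actually skips the zeroed weights.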
GamerMinion t1_jddlqit wrote
Yes, theory is one thing, but you can't build ASICs for everything due to the cost involved.
Did you look into sparsity at latency-equivalent scales, i.e., same latency but a bigger, sparser model?
I would be very interested to see results like that, especially for GPU-like accelerators (e.g., Nvidia's AGX computers, which use their Ampere GPU architecture), as latency is a primary concern in high-value computer vision applications such as autonomous driving.