Submitted by CS-fan-101 t3_11yzsz6 in MachineLearning
Note: Thank you r/MachineLearning for providing so many awesome naming alternatives! We'll revisit the acronym and update accordingly.
Note #2: We are revising the name to Sparse-IFT. We appreciate the candid feedback and look forward to hearing any additional feedback you have on our research.
We are excited to announce that our paper on Sparse Iso-FLOP Transformations (Sparse-IFT) is now available on arXiv. Sparse-IFT uses sparsity to increase accuracy while maintaining the same FLOPs as the dense model. In this research, we replace dense layers with Sparse-IFT and significantly improve accuracy on computer vision and natural language processing tasks without modifying training hyperparameters.
Highlights of this work include a 3.5% accuracy improvement for ResNet-18 on ImageNet and a 0.4 perplexity reduction for GPT-3 Small on WikiText-103, in both cases matching larger dense variants that use 2x or more FLOPs.
Sparse-IFT is simple to use, provides a larger search space for finding optimal sparse masks, and is parameterized by a single hyperparameter: the sparsity level. A minimal sketch of the iso-FLOP idea follows.
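To make the iso-FLOP idea concrete, here is a minimal PyTorch sketch in the spirit of one Sparse-IFT family (the Sparse Wide transformation): widen a linear layer, then apply a static random sparse mask so the nonzero weight count, and therefore the FLOPs, matches the original dense layer. The class name, the random mask, and the single-layer scaling are illustrative assumptions, not the paper's exact recipe; in the full method, widths are scaled consistently across the network so every layer stays on the dense FLOP budget.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseWideLinear(nn.Module):
    """Hypothetical sketch of an iso-FLOP sparse replacement for nn.Linear:
    widen the layer, then apply a static random mask so the nonzero weight
    count (and hence the FLOPs) matches the original dense layer."""

    def __init__(self, in_features: int, out_features: int, sparsity: float = 0.75):
        super().__init__()
        # Widen the output so that (1 - sparsity) * widened == out_features,
        # keeping nonzero params roughly equal to the dense in * out count.
        widened = round(out_features / (1.0 - sparsity))
        self.linear = nn.Linear(in_features, widened)
        # Static unstructured random mask; the paper searches over better
        # mask choices, this one just fixes the FLOP budget.
        mask = (torch.rand(widened, in_features) > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.linear.weight * self.mask, self.linear.bias)

# Usage: a dense nn.Linear(512, 512) and this sparse-wide version cost about
# the same FLOPs, but the sparse layer has a 2048-wide representation.
layer = SparseWideLinear(512, 512, sparsity=0.75)
out = layer(torch.randn(8, 512))  # -> shape (8, 2048)
```

Note that the widened output feeds the next layer, so in practice each downstream layer is transformed as well to keep the whole network's FLOPs iso with the dense baseline.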
This is independent of the research we posted yesterday, which demonstrates the ability to reduce pre-training FLOPs while maintaining accuracy on downstream tasks.
This is the first work (that we know of!) to demonstrate the use of sparsity for improving the accuracy of models via a set of sparse transformations.
mouldygoldie t1_jdaa3nv wrote
I think I'd look for a different acronym to SIFT, given that's a very well known feature detector and descriptor in computer vision...