Hey, I'm a casual observer of the DL space. What are the biggest technique changes or discoveries that are now used everywhere? From my view:

Pretraining - reuse large data sets in the same domain (2010)
ReLU - simple to train non-linear function (2010)
Data Augmentation - how to make up more data (including noise, random erasing) (2012-)
Dropout - how to not overfit (2014)
Attention - how to model long range dependencies (2014)
Batch normalisation - how to avoid a class of training issues (2015)
Residual connections - how to go deeper (2015)
Layer normalisation - how to avoid a class of training issues (2016)
Transformers - how to do sequence modelling (2017)
Large Language Models - how to use implicit knowledge in language (2019)
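
A few of the items above are small enough to sketch in a handful of lines. The snippet below is only an illustration in NumPy, not a reference implementation: `relu`, `dropout` and `residual_block` are hypothetical helper names, the shapes and the dropout rate are arbitrary assumptions, and dropout is written in the common "inverted" form.

```python
import numpy as np

def relu(x):
    # ReLU: pass positive values through, clamp negatives to zero.
    return np.maximum(x, 0.0)

def dropout(x, rate, rng, training=True):
    # Inverted dropout: zero a random fraction of units during training and
    # rescale the survivors so the expected activation stays the same.
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def residual_block(x, w1, w2, rng, rate=0.1, training=True):
    # Residual connection: y = x + F(x). The identity path lets the signal
    # (and the gradient) skip the block entirely.
    h = relu(x @ w1)
    h = dropout(h, rate, rng, training)
    return x + h @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))               # (batch, features), arbitrary sizes
w1 = rng.normal(scale=0.1, size=(8, 16))
w2 = np.zeros((16, 8))                    # zeroed second layer: F(x) is 0

y = residual_block(x, w1, w2, rng)        # block reduces to the identity
```

With the second layer zeroed, the block returns its input unchanged, which is one way to see why stacking many such blocks does not have to hurt: each one can start out as a no-op.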

What are the other improvements or discoveries? The more general the idea, the better.

Edit: added attention, pretraining, data augmentation, batch normalisation, contrastive methods

Comments

CremeEmotional6561 t1_iuyt5ae wrote on November 4, 2022 at 12:40 AM

#438,914

LSTMs - how to train sequences (1997)

ziad_amerr t1_iuyxzxz wrote on November 4, 2022 at 1:15 AM

#439,124

Check out GANs and one-shot learning. Read about CoAtNets, RoBERTa, StyleGAN, XLNet, DoubleU Net and others.

mhddjazz t1_iuz3px3 wrote on November 4, 2022 at 1:56 AM

#439,365

NeRF, Diffusion

JackandFred t1_iuzb389 wrote on November 4, 2022 at 2:51 AM

#439,689

I feel like if you're going to include transformers you should include the "Attention Is All You Need" paper.

cautioushedonist t1_iuzeog4 wrote on November 4, 2022 at 3:20 AM

#439,820

Not as famous and might not qualify as a 'trick' but I'll mention "Geometric Deep Learning" anyway.

It tries to explain all the successful neural nets (CNN, RNN, Transformers) on a unified, universal mathematical framework. The most exciting extrapolation of this being that we'll be able to quickly discover new architectures using this framework.

Link - https://geometricdeeplearning.com/

BeatLeJuce t1_iuzz1ku wrote on November 4, 2022 at 7:11 AM

#440,539

Layer norm is not about fitting better, but training more easily (activations don't explode which makes optimization more stable).
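
To make the "activations don't explode" point concrete, here is a rough NumPy sketch. Everything in it is an assumption chosen for illustration: the width, depth and deliberately oversized weight scale are made up so the effect is visible, `layer_norm` and `run_stack` are hypothetical helpers, and the learnable gain and bias of real layer norm are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each example across its features (last axis).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def run_stack(x, weights, use_layer_norm):
    # Forward pass through a stack of linear + ReLU layers,
    # optionally with layer norm before each ReLU.
    for w in weights:
        x = x @ w
        if use_layer_norm:
            x = layer_norm(x)
        x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
width, depth = 64, 20
# Assumption: weight std of 0.5 is far too large for this width, on purpose.
weights = [rng.normal(0.0, 0.5, size=(width, width)) for _ in range(depth)]
x = rng.normal(size=(8, width))

plain = run_stack(x, weights, use_layer_norm=False)
normed = run_stack(x, weights, use_layer_norm=True)

print("max activation without layer norm:", np.abs(plain).max())
print("max activation with layer norm:   ", np.abs(normed).max())
```

Without normalisation the activation scale grows with every layer under this init. With layer norm each row is rescaled to unit variance before the ReLU, so the output stays bounded no matter how deep the stack is.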

Is your list limited to "discoveries that are now used everywhere"? Because there are a lot of things that would've made it onto your list if you'd compiled it at different points in time but are now discarded (i.e., I'd say they are fads). E.g. GANs.

Other things are currently hyped but it's not clear how they'll end up long term:

Diffusion models are another thing that is currently hot.

Combining multimodal inputs, which I'd call "CLIP-like things".

There's self-supervision as a topic as well (with "contrastive methods" having been a thing).

Federated learning is likely here to stay.

NeRF will likely have a lasting impact, too.

FoundationPM t1_iv018k3 wrote on November 4, 2022 at 7:44 AM

#440,612

Quite clean. 2020-2022 is empty. Is that because you don't see progress in these years?

Gere1 t1_iv0505o wrote on November 4, 2022 at 8:41 AM

#440,740

Does someone know a good ablation study of the mentioned techniques? I've seen results where neither dropout nor layer normalization did much. So I wonder if these two techniques are just a belief or still crucial.

PassionatePossum t1_iv05451 wrote on November 4, 2022 at 8:42 AM

#440,743

Replying to JackandFred (#439,689)

I would only include it as a historical reference. It is certainly not a "must read" paper. It is written so poorly that you are better off just looking at the code.

ukshin-coldi t1_iv0593t wrote on November 4, 2022 at 8:45 AM

#440,747

Your dates are wrong, these were all discovered by Schmidhuber in the 90s.

carlthome t1_iv0gzvw wrote on November 4, 2022 at 11:20 AM

#441,237

Interesting to mention layer normalisation over batch normalisation. I thought the latter was "the thing" and that layernorm, groupnorm, instancenorm etc. were follow-ups.
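
For anyone unsure how the two differ, the core of it is just which axis the statistics are computed over. A minimal NumPy sketch under some assumptions: the input is a plain `(batch, features)` matrix, `normalize` is a hypothetical helper, and the learnable scale/shift and the running statistics that batch norm uses at inference time are omitted.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    # Subtract the mean and divide by the standard deviation along `axis`.
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 16))  # (batch, features)

# Batch norm: one mean/variance per feature, computed across the batch.
batch_normed = normalize(x, axis=0)

# Layer norm: one mean/variance per example, computed across its features.
layer_normed = normalize(x, axis=1)

print(batch_normed.mean(axis=0).round(6))  # each feature column is centred
print(layer_normed.mean(axis=1).round(6))  # each example row is centred
```

Because layer norm never looks at other examples, its output for one example does not depend on the batch size or batch contents, which is part of why it suits sequence models.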

and1984 t1_iv0qjbs wrote on November 4, 2022 at 12:50 PM

#441,826

Replying to cautioushedonist (#439,820)

TIL

ukshin-coldi t1_iv0qocf wrote on November 4, 2022 at 12:51 PM

#441,838

Replying to PassionatePossum (#440,743)

Any good resources for writing a well-written ML paper?

acertainmoment t1_iv1ddh0 wrote on November 4, 2022 at 3:33 PM

#443,149

Replying to carlthome (#441,237)

yup, same thoughts. BatchNorm was the OG norm. The cousins came later

Intelligent-Aioli-43 t1_iv1lgvq wrote on November 4, 2022 at 4:26 PM

#443,548

Replying to ukshin-coldi (#441,838)

Check out MLRC

samlhuillier3 t1_iv2wzqy wrote on November 4, 2022 at 9:41 PM

#445,972

Diffusion and GANs!!

redditrantaccount t1_iv3oxg8 wrote on November 5, 2022 at 1:16 AM

#447,146

Data augmentation to more explicitly define invariant transformations as well as to reduce dataset labeling costs.
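
As a rough sketch of what "defining invariances" looks like in code: every transform below changes the pixels but is assumed not to change the label. This is illustrative NumPy only. `augment` is a hypothetical helper, the image is a made-up 2D grayscale array, and the noise level and patch size are arbitrary assumptions.

```python
import numpy as np

def augment(image, rng, noise_std=0.05, erase_size=4):
    # Return a randomly transformed copy; the label is assumed unchanged.
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]                      # horizontal flip
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # additive noise
    h, w = out.shape
    top = rng.integers(0, h - erase_size + 1)   # random erasing: zero a patch
    left = rng.integers(0, w - erase_size + 1)
    out[top:top + erase_size, left:left + erase_size] = 0.0
    return out

rng = np.random.default_rng(0)
image = rng.random((16, 16))   # stand-in for a real grayscale image
label = 3                      # hypothetical class label

# One labeled example becomes several training examples with the same label.
augmented_batch = [(augment(image, rng), label) for _ in range(4)]
```

Which transforms are safe depends on the task. A horizontal flip is fine for most object photos but would change the label for, say, handwritten digits or text.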

windoze OP t1_iv4y3f6 wrote on November 5, 2022 at 10:03 AM

#448,760

Replying to FoundationPM (#440,612)

It's empty because I've not kept up to date, and also impact won't be seen until more people build on it.

flaghacker_ t1_iv5jf05 wrote on November 5, 2022 at 1:57 PM

#449,694

Replying to PassionatePossum (#440,743)

What's wrong with it? They explain all the components of their model in enough detail (in particular the multi head attention stuff), provide intuition behind certain decisions, include clear results, they have nice pictures, ... What could have been improved about it?

BrisklyBrusque t1_iv6negg wrote on November 5, 2022 at 6:38 PM

#451,381

Replying to cautioushedonist (#439,820)

Is this different from the premise that neural networks are universal function approximators?

BrisklyBrusque t1_iv6ogqg wrote on November 5, 2022 at 6:45 PM

#451,431

2007-2010: Deep learning begins to win computer vision competitions. In my eyes, this is what put deep learning on the map for a lot of people, and kicked off the renaissance we see today.

2016ish: categorical embeddings/entity embeddings. For tabular data with categorical variables, categorical embeddings are faster and more accurate than one-hot encoding, and preserve the natural relationships between factors by mapping them to a low-dimensional space.
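
Mechanically, an embedding is a lookup into a small table that is equivalent to multiplying a one-hot vector by that table, just without materialising the one-hot vector. A minimal NumPy sketch, with made-up categories and a random table standing in for weights that would normally be learned. `one_hot` and `embed` are hypothetical helpers.

```python
import numpy as np

categories = ["red", "green", "blue", "yellow"]   # hypothetical feature values
index = {name: i for i, name in enumerate(categories)}

rng = np.random.default_rng(0)
embedding_dim = 2
# In practice this table is trained with the rest of the network;
# random values are used here purely for illustration.
embedding_table = rng.normal(size=(len(categories), embedding_dim))

def one_hot(name):
    # Sparse representation: one dimension per category.
    vec = np.zeros(len(categories))
    vec[index[name]] = 1.0
    return vec

def embed(name):
    # Dense representation: a row lookup in the embedding table.
    return embedding_table[index[name]]

batch = ["blue", "red", "blue"]
dense = np.stack([embed(name) for name in batch])
via_one_hot = np.stack([one_hot(name) for name in batch]) @ embedding_table
```

Both paths give the same vectors, but the lookup skips the large sparse matrix multiply, and after training, categories that behave similarly can end up close together in the low-dimensional space.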

BrisklyBrusque t1_iv6otss wrote on November 5, 2022 at 6:47 PM

#451,443

Replying to BeatLeJuce (#440,539)

I recall that experimenters disagreed on why batchnorm worked in the first place. Has a consensus been reached?

BeatLeJuce t1_iv7co26 wrote on November 5, 2022 at 9:36 PM

#452,351

Replying to BrisklyBrusque (#451,443)

No. But we all agree that it's not due to internal covariate shift.

cautioushedonist t1_ivcx548 wrote on November 7, 2022 at 1:08 AM

#460,462

Replying to BrisklyBrusque (#451,381)

Yes, it's different.

Universal function approximation sort of guarantees/implies that you can approximate any mapping function given the right config/weights of neural nets. It doesn't really guide us to the correct config.

blunzegg t1_iwl1d81 wrote on November 16, 2022 at 12:46 PM

#544,198

- Kernel tricks: How can purely mathematical approaches beat neural networks in terms of efficiency? (This has actually been an open problem for a long time; you can check Neural Tangent Kernels and Reproducing Kernel Hilbert Spaces for examples, and the Universal Approximation Property for neural networks.)

- I was mainly here for Geometric Deep Learning but another user has already posted it. You should definitely check http://geometricdeeplearning.com. As a mathematician-to-be, I strongly believe that this is the future of ML/DL. Hit me up if you wanna discuss this statement further.

[D] What are the major general advances in techniques?