ganzzahl

ganzzahl t1_jdovu3h wrote

I'm also very interested in this – does anyone have papers similar to Chinchilla, but without the training FLOPs restriction, and instead comparing identical dataset sizes?

An aside: I feel like I remember some older MT papers where LSTMs outperformed Transformers for some low-resource languages, but I think that's outdated – with transfer learning, multilingual models and synthetic data, I'm fairly certain Transformers always come out ahead nowadays.

1

ganzzahl t1_jdouip7 wrote

You're definitely missing the entire T5 (encoder-decoder) family of models. From the UL2 paper, it seems encoder-decoder models are more powerful than decoder-only models (such as the GPT family), especially if you're most interested in inference latency.

I do very much wonder if OpenAI has tested equally-sized T5 models, and if there's some secret reason they have found as to why we should stick with GPT models, or if they are just doubling down on "their" idea, even if it is slightly inferior. Or maybe there are newer papers I don't know about.

10

ganzzahl t1_ittzcuq wrote

Ahh, I missed that when reading your post. What a time to be alive!

My quick question for you is just this: Why is it that we don't see any projects with similar speedups using custom CUDA kernels or custom ONNX operators? Is there any inherent speed advantage to using Triton, or is the high barrier to entry for writing CUDA kernels the reason that no one has "gotten around" to doing something like this in pure CUDA?

5