ganzzahl t1_jdovu3h wrote on March 26, 2023 at 1:00 AM

Reply to comment by itshouldjustglide in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700

I'm also very interested in this – does anyone have papers similar to Chinchilla, but without the training FLOPs restriction, and instead comparing identical dataset sizes?

An aside: I feel like I remember some older MT papers where LSTMs outperformed Transformers for some low resource languages, but I think that's outdated – using transfer learning, multilingual models and synthetic data, I'm fairly certain Transformers always outperform nowadays.

ganzzahl t1_jdouip7 wrote on March 26, 2023 at 12:50 AM

Reply to comment by michaelthwan_ai in [N] March 2023 - Recent Instruction/Chat-Based Models and their parents by michaelthwan_ai

You're definitely missing the entire T5 (encoder-decoder) family of models. From the UL2 paper , it seems encoder-decoder models are more powerful than decoder-only models (such as the GPT family), especially if you're most interested in inference latency.

I do very much wonder if OpenAI has tested equally-sized T5 models, and if there's some secret reason they have found as to why we should stick with GPT models, or if they just are doubling down on "their" idea, even if it is slightly inferior. Or maybe there are newer papers I don't know about.

ganzzahl t1_itwsfkh wrote on October 26, 2022 at 9:39 PM

Reply to comment by pommedeterresautee in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee

This is perhaps an entirely dumb question that I will be able to answer for myself after I read through the Triton docs, but I'll ask anyway: Could one implement custom ONNX operators using Triton, or can it only be used in a Python environment?

ganzzahl t1_ittzcuq wrote on October 26, 2022 at 8:38 AM

Reply to comment by pommedeterresautee in [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee

Ahh, I missed that when reading your post. What a time to be alive!

My quick question for you is just this: Why is it that we don't see any projects with similar speedups using custom CUDA kernels or custom ONNX operators? Is there any inherent speed advantage of using Triton or is the high barrier of entry to writing CUDA kernels the reason that no one has "gotten around" to doing something like this in pure CUDA?

ganzzahl t1_ittx10s wrote on October 26, 2022 at 8:04 AM

Reply to [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee

Any work on using this with autoregressive encoder-decoder transformers? I always love reading and trying out your work!