Submitted by benanne in r/MachineLearning
londons_explorer wrote:
> too early to consider diffusion as a serious alternative to autoregression for generative language modelling at scale
This blog post explores lots of ideas and conjectures about why they may or may not work...
But it seems this stuff could just be tried: burn up some TPU credits, run each of the model types you talk about, and see which does best.
Hard numbers are better than conjecture. Then focus future efforts on improving the best numbers.
benanne (OP) wrote:
My blog posts are mostly shower thoughts expanded into long form, so naturally they tend to be a bit speculative. I have in fact tried a bunch of stuff in the diffusion language modelling space, which culminated in the CDCD paper: https://arxiv.org/abs/2211.15089, as well as this theoretical note on simplex diffusion: https://arxiv.org/abs/2210.14784 -- if the style of the blog post isn't your cup of tea, these might be more to your liking :)
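(For anyone who wants the gist without reading the paper: CDCD lifts discrete tokens into a continuous embedding space, runs Gaussian diffusion there, and trains the denoiser with a cross-entropy loss over the vocabulary rather than a regression loss. A minimal sketch of one training step -- the names, shapes, and `denoiser` interface here are illustrative assumptions, not the paper's actual code:

```python
import torch
import torch.nn.functional as F

def cdcd_style_loss(tokens, embed, denoiser, sigma_max=10.0):
    # tokens:   (batch, seq) integer token ids
    # embed:    nn.Embedding mapping ids to continuous vectors
    # denoiser: hypothetical network, (noisy embeddings, noise level) -> vocab logits
    x0 = embed(tokens)                                      # clean embeddings, (batch, seq, dim)
    sigma = torch.rand(tokens.shape[0], 1, 1) * sigma_max   # per-example noise level
    xt = x0 + sigma * torch.randn_like(x0)                  # Gaussian corruption in embedding space
    logits = denoiser(xt, sigma)                            # predict the original token at each position
    # cross-entropy against the clean tokens, instead of an L2 score-matching loss
    return F.cross_entropy(logits.transpose(1, 2), tokens)
```

At sampling time the categorical prediction is mapped back to an expected embedding ("score interpolation" in the paper), so standard continuous-diffusion samplers can be reused; the paper's additions like time warping and self-conditioning are omitted from this sketch.)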
Completely agree re: hard numbers, by the way (I spent quite a bit of time Kaggling during my PhD, see some of my earlier blog posts), but a single researcher can only do so many experiments. Part of the motivation for writing these blog posts is to draw attention to areas of research I think are interesting, and hopefully encourage some people to delve deeper into them as well! Pointing out open questions can be quite conducive to that, in my experience.