starstruckmon t1_j6d3lsr wrote

I can guarantee the next paper out of this Google team is going to be a diffusion model (instead of AudioLM) conditioned on MuLan embeddings.

The strength of the Google model is its text understanding, which comes from the MuLan embeddings, while the strength of the work you highlighted is the quality that comes from its diffusion model.

It's the obvious next step, following the same path as DALL-E 1 -> DALL-E 2.
