starstruckmon t1_j6d3lsr wrote on January 29, 2023 at 2:22 PM

Reply to comment by Maximum-Nectarine-13 in [D] MusicLM: Generating Music From Text by carlthome

I can guarantee the next paper out of this Google team is going to be a diffusion model (instead of AudioLM) conditioned on MuLan embeddings.

The strength of the Google model is its text understanding, which comes from the MuLan embeddings, while the strength of the work you highlighted is the quality that comes from the diffusion model.

It's the obvious next step, following the same path as DALL-E 1 -> DALL-E 2.