aperrien t1_j0dsyu9 wrote on December 15, 2022 at 10:30 PM

I can't believe that running Fourier-transformed sound through Stable Diffusion and transforming it back into sound actually works. At this point, I really am calling into question what the SD model is actually capturing. Creativity? Pattern consistency? This technology may have legs far beyond what I initially assumed.
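
A minimal sketch of that round trip, for illustration only: it takes a naive DFT of one short frame, stores the magnitudes as 8-bit pixel values, then decodes them back into samples. The helper names (`dft`, `idft`, `to_pixels`, `from_pixels`) are made up for this example, and the true phase is kept to keep things short. A spectrogram image carries only magnitudes, so a real pipeline would have to estimate the phase on the way back.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse of dft(); returns only the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def to_pixels(mags, peak):
    """Scale magnitudes to integers in 0..255, like one column of a grayscale image."""
    return [round(255 * m / peak) for m in mags]

def from_pixels(pixels, peak):
    """Undo to_pixels(), up to quantization error."""
    return [p * peak / 255 for p in pixels]

# One 64-sample frame containing two tones (DFT bins 4 and 10).
N = 64
signal = [math.sin(2 * math.pi * 4 * t / N) + 0.5 * math.sin(2 * math.pi * 10 * t / N)
          for t in range(N)]

spectrum = dft(signal)
mags = [abs(c) for c in spectrum]
phases = [cmath.phase(c) for c in spectrum]  # assumption: phase is kept, not estimated
peak = max(mags)

pixels = to_pixels(mags, peak)               # this is the "image" column
decoded = [cmath.rect(m, p) for m, p in zip(from_pixels(pixels, peak), phases)]
restored = idft(decoded)

max_error = max(abs(a - b) for a, b in zip(signal, restored))
print(f"max reconstruction error: {max_error:.5f}")
```

The only loss in this toy version comes from rounding magnitudes to 8 bits, which is why the restored frame stays so close to the original.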

xoexohexox t1_j0dv1fc wrote on December 15, 2022 at 10:44 PM

It's just a table of weighted averages my dude

visarga t1_j0mzjox wrote on December 17, 2022 at 10:12 PM

Let me tell you one weird trick all artists hate. It's actually averages of gradients collected from training examples, not averages of the training examples themselves. Gradients represent what has been learned from each example, and can be added together regardless of the content of the examples without becoming all jumbled up.

For instance, one can add the gradient derived from an image of a duck to that derived from an image of a horse. This is only possible in the space of gradients, as opposed to the space of images. If it weren't for this trick we would not be discussing art in this sub.

But are gradients derived from an image subject to copyright restrictions, even when all mixed up over billions of examples? Individual influences are almost entirely "averaged out" by the sheer number of examples. That's how SD breaks training examples down into first principles and can then generate an astronaut on a horse even though it has never seen one - only possible if you go all the way back to basic concepts.
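
A toy sketch of the gradient-averaging point above, assuming a three-weight linear model with squared-error loss. This is not how Stable Diffusion is trained; the `duck` and `horse` examples and all function names are hypothetical stand-ins. It only shows that per-example gradients can be averaged and applied in one update, and that the model ends up fitting both examples.

```python
def predict(w, x):
    """Linear model: dot product of weights and features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, x, y):
    """Squared-error loss for one example."""
    return 0.5 * (predict(w, x) - y) ** 2

def gradient(w, x, y):
    """Gradient of the squared-error loss with respect to the weights."""
    err = predict(w, x) - y
    return [err * xi for xi in x]

def average(grads):
    """Element-wise mean of several gradient vectors."""
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]

# Two unrelated toy examples as (features, target) pairs.
duck = ([1.0, 0.0, 2.0], 1.0)
horse = ([0.0, 3.0, 1.0], -2.0)

w = [0.0, 0.0, 0.0]
lr = 0.05
before = loss(w, *duck) + loss(w, *horse)

for _ in range(200):
    # Gradients are averaged; the raw examples themselves are never blended.
    g = average([gradient(w, *duck), gradient(w, *horse)])
    w = [wi - lr * gi for wi, gi in zip(w, g)]

after = loss(w, *duck) + loss(w, *horse)
print(f"total loss before: {before:.3f}, after: {after:.2e}")
```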

blueSGL t1_j0e5p87 wrote on December 15, 2022 at 11:59 PM

Called it 7 months ago.

I bet if you do a log plot it just destroys the bass.

Edit: Thinking on it, this is one-dimensional data with a second dimension of time, so you could slice the audio into three frequency bands and use RGB encoding to get 3x the frequency range (or fidelity) without having to change the context window size.
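
A rough sketch of the band-packing idea above, assuming the spectrogram is a plain grid of 0-255 magnitudes (rows are frequency bins, columns are time steps). `pack_bands` and `unpack_bands` are hypothetical helpers, not part of any existing tool, and whether a diffusion model would keep the three channels cleanly separated is untested.

```python
def pack_bands(spec):
    """Pack a (3*H x T) grid of magnitudes into an H x T grid of (R, G, B) pixels.

    The lowest third of the frequency rows goes to R, the middle third to G,
    and the highest third to B.
    """
    if len(spec) % 3 != 0:
        raise ValueError("number of frequency rows must be divisible by 3")
    h = len(spec) // 3
    low, mid, high = spec[:h], spec[h:2 * h], spec[2 * h:]
    return [list(zip(low[r], mid[r], high[r])) for r in range(h)]

def unpack_bands(image):
    """Inverse of pack_bands(): split the channels and stack them back up."""
    low = [[px[0] for px in row] for row in image]
    mid = [[px[1] for px in row] for row in image]
    high = [[px[2] for px in row] for row in image]
    return low + mid + high

# Fake spectrogram: 6 frequency rows x 4 time steps.
spec = [[(7 * r + 3 * t) % 256 for t in range(4)] for r in range(6)]

image = pack_bands(spec)        # 2 rows x 4 columns of RGB triples
restored = unpack_bands(image)  # back to 6 rows x 4 columns
print(len(image), len(image[0]), restored == spec)
```

The image height drops to a third for the same number of frequency bins, which is where the extra range at a fixed image size would come from.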

[deleted] t1_j0ej7y9 wrote on December 16, 2022 at 1:40 AM

[deleted]

blueSGL t1_j0elvgk wrote on December 16, 2022 at 2:00 AM

I've not got the hardware needed for fine-tuning Stable Diffusion (or even DreamBooth), so I can't test it.

I've only got 10 GB of VRAM, not the 16 GB minimum needed.

Umbristopheles t1_j0dpnht wrote on December 15, 2022 at 10:08 PM

It sings like The Sims....

Kinexity t1_j0de3x5 wrote on December 15, 2022 at 8:52 PM

This sounds quite analogous to running Doom on a Samsung smart fridge or running a Turing machine in PowerPoint. It's not useful, but it's definitely pretty cool.

TFenrir OP t1_j0dk5rr wrote on December 15, 2022 at 9:31 PM

I mostly agree, but I think there is some opportunity here. Using img2img in real time to extend audio forever, and the relationship between images and audio in general, are quite interesting. Would a model that is only trained on these images provide a "better" result? Would different fine-tuned models give different experiences? How is this impacted by other improvements to models?

Sigura83 t1_j0h41pl wrote on December 16, 2022 at 4:29 PM

Holy shit.

It sounds pretty good, and this is just version 1.

Converting sound to images and using diffusion models on those is brilliant. It does a great beat, that's for sure.

Sigura83 t1_j0h6f0k wrote on December 16, 2022 at 4:44 PM

"Trance inspired by rain falling". OMG NON STOP MELODIES.

I'm living in the Future!!!

Oh yeah, a shit ton of stuff is spectrographic data. Things like molecules, for instance. This could be used for drug generation, I think... uh... damn my lack of skills...

Riffusion: Stable diffusion fine tuned on spectrograms (image representations of music) creates prompt based music, in real time
