Comments

aperrien t1_j0dsyu9 wrote

I can't believe that running Fourier transforms of audio through Stable Diffusion and transforming the results back into sound actually works. At this point, I really am calling into question what the SD model is actually capturing. Creativity? Pattern consistency? This technology may have legs far beyond what I initially assumed.

23
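The round trip described above can be sketched without any model in the loop: render audio as a magnitude spectrogram "image", throw the phase away (as an image-editing step effectively does), and reconstruct audio with Griffin-Lim. A minimal NumPy-only sketch, assuming a plain Hann-windowed STFT rather than Riffusion's actual mel-spectrogram pipeline:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    # Frame the signal and take a windowed FFT of each frame.
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.array([np.fft.rfft(f) for f in frames]).T  # (freq, time)

def istft(S, n_fft=512, hop=128):
    # Overlap-add inverse of the framing above.
    win = np.hanning(n_fft)
    out = np.zeros(hop * S.shape[1] + n_fft)
    norm = np.zeros_like(out)
    for t in range(S.shape[1]):
        frame = np.fft.irfft(S[:, t], n_fft)
        out[t * hop:t * hop + n_fft] += frame * win
        norm[t * hop:t * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=32, n_fft=512, hop=128):
    # Recover a plausible phase for a magnitude-only spectrogram "image".
    phase = np.exp(2j * np.pi * np.random.rand(*mag.shape))
    for _ in range(n_iter):
        audio = istft(mag * phase, n_fft, hop)
        S = stft(audio, n_fft, hop)
        phase = np.exp(1j * np.angle(S[:, :mag.shape[1]]))
    return istft(mag * phase, n_fft, hop)

# Round trip: audio -> magnitude spectrogram (what SD would edit) -> audio.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
mag = np.abs(stft(tone))   # this 2-D array is the "image"
restored = griffin_lim(mag)
```

The surprising part is that a diffusion model edits only the magnitude picture; phase is re-invented afterward, and the result still sounds coherent.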

xoexohexox t1_j0dv1fc wrote

It's just a table of weighted averages my dude

6

visarga t1_j0mzjox wrote

Let me tell you one weird trick all artists hate. It's actually averages of gradients collected from training examples, not averages of the training examples themselves. Gradients represent what has been learned from each example, and can be added together regardless of the content of the examples without becoming all jumbled up.

For instance, one can add the gradient derived from an image of a duck to that derived from an image of a horse. This is only possible in the space of gradients, as opposed to the space of images. If it weren't for this trick we would not be discussing art in this sub.

But are gradients derived from an image subject to copyright restrictions, even when all mixed up over billions of examples? All individual influences are almost "averaged out" by the large number of examples. That's how SD breaks training examples down into first principles and can then generate an astronaut on a horse even though it has never seen that - only possible if you go all the way back to basic concepts.

3
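The point about gradients living in a shared parameter space can be shown with a toy model: gradients from two unrelated examples can simply be summed, whereas averaging the raw inputs would just blend them into mush. A minimal sketch (toy linear model and made-up "duck"/"horse" data, not anything from SD's actual training):

```python
import numpy as np

# Toy model: y = w . x with squared-error loss. Gradients from unrelated
# examples live in the same parameter space, so they can be added together.
rng = np.random.default_rng(0)
w = rng.normal(size=4)

def grad(w, x, y):
    # d/dw of 0.5 * (w.x - y)^2
    return (w @ x - y) * x

x_duck, y_duck = rng.normal(size=4), 1.0
x_horse, y_horse = rng.normal(size=4), -1.0

# One gradient step using the *sum* of the two gradients: both examples
# leave their imprint on the same weights without being mixed as data.
g = grad(w, x_duck, y_duck) + grad(w, x_horse, y_horse)
w_new = w - 0.05 * g
```

After the step, the combined loss over both examples drops, even though the two gradients came from entirely different "content".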

blueSGL t1_j0e5p87 wrote

Called it 7 months ago.

I bet if you do a log plot it just destroys the bass.

Edit: Thinking on it, this is one-dimensional data with a second dimension of time; you could slice the audio into three frequency bands and use RGB encoding to 3x the frequency range fidelity without having to change the context window size.

17
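The band-packing idea above can be sketched directly: split the spectrogram's frequency rows into three equal bands (the equal low/mid/high split here is an assumption) and stack them as R, G and B channels, tripling the frequency rows that fit in the same image height:

```python
import numpy as np

# A grayscale spectrogram image spends one channel on one frequency range.
# Packing three bands into R, G and B fits 3x the frequency rows into the
# same image height, so the diffusion context window stays unchanged.
freq_bins, time_steps = 768, 512
spec = np.random.rand(freq_bins, time_steps)  # stand-in spectrogram

band = freq_bins // 3
rgb = np.stack([spec[0:band],          # R: low band
                spec[band:2 * band],   # G: mid band
                spec[2 * band:]],      # B: high band
               axis=-1)                # shape (256, 512, 3)

# Unpacking restores the original full-range spectrogram exactly.
unpacked = np.concatenate([rgb[..., 0], rgb[..., 1], rgb[..., 2]], axis=0)
```

The trade-off is that the model must now learn that the three channels are adjacent frequency bands rather than correlated color planes.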

[deleted] t1_j0ej7y9 wrote

[deleted]

8

blueSGL t1_j0elvgk wrote

I've not got the hardware needed for fine-tuning Stable Diffusion (or even DreamBooth), so I can't test it.

I've only got 10 GB of VRAM, not the 16 GB minimum needed.

3

Kinexity t1_j0de3x5 wrote

This sounds quite analogous to running Doom on a Samsung smart fridge or running a Turing machine in PowerPoint. It's not useful, but it's definitely pretty cool.

4

TFenrir OP t1_j0dk5rr wrote

I mostly agree, but I think there is some opportunity here. Using img2img in real time to extend audio forever, and the relationship between images and audio in general, is quite interesting - would a model trained only on these images give a "better" result? Would different fine-tuned models give different experiences? How is this affected by other improvements to the models?

11
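The "extend audio forever" idea could look like a sliding window over the spectrogram: keep the tail of the current image as context, let img2img repaint it into a continuation, and append. The `img2img` function below is a hypothetical placeholder standing in for the real model call, not the Riffusion or Stable Diffusion API:

```python
import numpy as np

def img2img(init_image, strength=0.6, rng=np.random.default_rng(0)):
    # Placeholder for the model: blend the init image with noise, roughly
    # what img2img's noising step does, and return it as the "continuation".
    noise = rng.random(init_image.shape)
    return (1 - strength) * init_image + strength * noise

def extend(spectrogram, n_chunks=4):
    # Sliding-window extension: each new chunk is seeded by the previous
    # chunk, so consecutive segments stay loosely coherent.
    chunks = [spectrogram]
    context = spectrogram[:, spectrogram.shape[1] // 2:]  # keep last half
    for _ in range(n_chunks):
        nxt = img2img(context)   # continuation conditioned on the context
        chunks.append(nxt)
        context = nxt            # slide the window forward
    return np.concatenate(chunks, axis=1)

seed = np.random.rand(256, 128)
long_spec = extend(seed, n_chunks=4)  # 128 + 4 * 64 time columns
```

With a real model, `strength` would control how far each continuation is allowed to drift from the previous segment.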

Sigura83 t1_j0h41pl wrote

Holy shit.

It sounds pretty good, and this is just version 1.

Converting sound to images and using diffusion models on those is brilliant. It does a great beat, that's for sure.

2

Sigura83 t1_j0h6f0k wrote

"Trance inspired by rain falling". OMG NON STOP MELODIES.

I'm living in the Future!!!

Oh yeah, a shit ton of stuff is spectrographic data. Molecules, for instance. This could be used for drug generation, I think... uh... damn my lack of skills...

1