Viewing a single comment thread. View all comments

MohamedRashad OP t1_irvng87 wrote

prompt of stable diffusion (for example) is the text that will result in the image I want.

the text that I will get from an Image Captioning model doesn't have to be the correct prompt to get the same image from stable diffusion (I hope I am explaining what I am thinking right).

11

Blutorangensaft t1_irvo011 wrote

Wikipedia: " Stable Diffusion is a deep learning, text-to-image model released by startup StabilityAI in 2022. It is primarily used to generate detailed images conditioned on text descriptions"

If we take the example prompt "a photograph of an astronaut riding a horse" (see Wiki), I don't see how that is much different from an image caption. I guess the only difference is it specifies the visual medium, so eg a photo, painting, or the like.

I don't think there is a difference between prompt and caption and you might be overthinking this. However, you could always make captions sound more like prompts (if the specified medium is the only difference) by looking for specific datasets with a certain wording or manually adapting the data yourself.

5

MohamedRashad OP t1_irvovxd wrote

Maybe you are right (maybe I am overthinking the problem) I will give Image Captioning another try and see if it will work.

6

_Arsenie_Boca_ t1_irw1ti7 wrote

No, I believe you are right to think that an arbitrary image captioning model cannot accurately generate prompts that actually lead to a very similar image. Afterall, the prompts are very model-dependent.

Maybe you could use something similar to prompt tuning. Use a number of randomly initialized prompt embeddings, generate an image and backprop the distance between your target image and the generated image. After convergence, you can perform a nearest neighbor search to find the words closest to the embeddings.

Not sure if this has been done, but I think it should work reasonably well

19

MohamedRashad OP t1_irw4soy wrote

This is actually the first idea that came to me when thinking about this problem ... Backpropgating the output image until I reach the text representation that made it happen then use the distance function to get the closest words to this representation.

My biggest problem with this idea was the variable length of input words ... The search space for the best words to describe the image will be huge if there is no limit on the number of words that I can use to describe the image.

​

What are your thoughts about this (I would love to hear them)?

8

_Arsenie_Boca_ t1_irwat46 wrote

Thats a fair point. You would have a fixed length for the prompt.

Not sure if this makes sense but you could use an LSTM with arbitrary constant input to generate a variable-length sequence of embeddings and optimize the LSTM rather than the embeddings directly.

6