KingsmanVince t1_irvmgnx wrote

In image captioning, to train the model you have to provide text that describes the images. By this definition, "the prompt that made the image" absolutely falls in. One text can produce many images, and one image can be described by many texts: image and text have a many-to-many relationship.

For example, to caption a picture of a running dog, people can describe the whole process. That's still a caption.

For example, I prompt "running dog" and DALL-E 2 draws a running dog for me. Yes, that's a freaking caption.

3

m1st3r_c t1_irvn1d5 wrote

OP is looking for a way to take a piece of AI-generated art and reverse-engineer the model that created it, to find out what prompt terms, weightings, etc. were used to create it.

7

MohamedRashad OP t1_irvmz2q wrote

But in this case, I will need to train an image-captioning model on text-to-image data and hope that it provides me with the correct prompt to recreate the image using the text-to-image model.

I think a better solution is to use backpropagation through the text-to-image model to get the prompt that made the image (an inversion step or something like it).
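
Roughly what I have in mind, as a toy sketch only (it assumes a frozen, differentiable generator; `ToyGenerator`, the embedding size, and the random target image are placeholders I made up, and a real diffusion model's sampling loop makes this much harder):

```python
# Toy sketch: treat the prompt as a continuous embedding and optimise it by
# gradient descent through a frozen, differentiable text-to-image generator.
# `ToyGenerator` is a hypothetical stand-in for the real frozen model.
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Hypothetical frozen text-to-image model: prompt embedding -> 3x64x64 image."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * 64 * 64), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z).view(-1, 3, 64, 64)

generator = ToyGenerator().eval()
for p in generator.parameters():            # freeze the generator
    p.requires_grad_(False)

# Stand-in for the image whose prompt we want to recover, values in [-1, 1].
target_image = torch.rand(1, 3, 64, 64) * 2 - 1

# The "prompt" we recover is a soft embedding, not discrete text. Mapping it
# back to actual tokens (e.g. nearest neighbours in the text encoder's
# embedding table) is a separate, lossy step.
prompt_embedding = torch.zeros(1, 128, requires_grad=True)
optimizer = torch.optim.Adam([prompt_embedding], lr=1e-2)

for step in range(500):
    optimizer.zero_grad()
    generated = generator(prompt_embedding)
    loss = torch.nn.functional.mse_loss(generated, target_image)
    loss.backward()                          # gradients flow into the embedding only
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: reconstruction loss {loss.item():.4f}")
```

Whether this stays tractable for an actual diffusion model is the open question for me.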

1

KlutzyLeadership3652 t1_irwt908 wrote

Don't know how feasible this would be for you, but you could create a surrogate model that learns image-to-text. Use your original text-to-image model to generate images given text (open caption-generation datasets can give you good examples of captions), and train the surrogate model to generate the text/caption back. This would be model-centric, so you don't need to worry about the many-to-many issue mentioned above.

This can be made more robust than a backpropagation approach.
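
Very roughly something like this (toy sketch: `text_to_image` stands in for your real frozen generator, the captioner is a tiny CNN+GRU rather than a serious architecture, and the random token prompts stand in for captions drawn from an open dataset):

```python
# Toy sketch of the surrogate idea: render prompts with the (fake) text-to-image
# model, then train an image-to-text model to recover the prompt.
import torch
import torch.nn as nn

VOCAB, MAX_LEN = 1000, 16                    # toy vocabulary and caption length

def text_to_image(token_ids):
    """Placeholder for the frozen text-to-image model (returns 3x64x64 images)."""
    g = torch.Generator().manual_seed(int(token_ids.sum()))
    return torch.rand(token_ids.size(0), 3, 64, 64, generator=g)

class SurrogateCaptioner(nn.Module):
    """Image -> caption model trained on (generated image, prompt) pairs."""
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, 2, 1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 16 * 16, hidden),
        )
        self.embed = nn.Embedding(VOCAB, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, VOCAB)

    def forward(self, images, captions):
        h0 = self.encoder(images).unsqueeze(0)           # image feature as initial state
        out, _ = self.decoder(self.embed(captions), h0)  # teacher-forced decoding
        return self.head(out)

model = SurrogateCaptioner()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for step in range(1000):
    # Sample prompts (random token ids here; in practice, real captions),
    # render them with the generator, and train the surrogate to recover them.
    prompts = torch.randint(0, VOCAB, (8, MAX_LEN))
    images = text_to_image(prompts)
    logits = model(images, prompts[:, :-1])              # predict the next token
    loss = criterion(logits.reshape(-1, VOCAB), prompts[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

At inference you'd run the surrogate on the image you want to invert and decode (greedy or beam search) to get a candidate prompt for that specific text-to-image model.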

3