Submitted by im-so-stupid-lol t3_10pa18k in singularity
Text2image models are generating impressive output right now, but I see two major issues with their use at the moment.
The first issue is the required dataset size. If you show a human 4 or 5 images of an animal they've never seen before, they'll generally be able to draw it quite well, but Stable Diffusion needs far more images to train on.
But the second and more prevalent issue is "prompt engineering". Every damn portrait prompt has some weird garbage like "4k ultra high definition high fidelity great lighting award winning beautiful stunning photo" or some bullshit. This suggests to me that we are still not communicating very well with the model. Hell, negative prompts show this as well: some people have super long negative prompts full of things like "6 toes" and "deformed face", so the AI still needs to be told explicitly "don't draw a person with a deformed face". A rough sketch of what this looks like in practice is below.
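For context, here's roughly what I mean in code. This is a minimal sketch assuming the Hugging Face diffusers library, a CUDA GPU, and the runwayml/stable-diffusion-v1-5 checkpoint; the prompt strings are just typical examples of the keyword stuffing and negative-prompt lists I'm complaining about:

```python
# Minimal sketch using Hugging Face diffusers (assumes a CUDA GPU and the
# runwayml/stable-diffusion-v1-5 checkpoint; the prompt text is illustrative).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The "prompt engineering" issue: the actual subject is one short phrase,
# the rest is a pile of quality keywords bolted on to steer the model.
prompt = (
    "portrait of a woman, 4k, ultra high definition, high fidelity, "
    "great lighting, award winning, beautiful, stunning photo"
)

# The negative prompt: explicitly listing everything the model should NOT draw.
negative_prompt = "deformed face, extra fingers, 6 toes, blurry, bad anatomy"

image = pipe(prompt=prompt, negative_prompt=negative_prompt).images[0]
image.save("portrait.png")
```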
What do you see as the solution to these issues going forward? Will it be a combination of models like ChatGPT and models like Stable Diffusion, so you can talk the text2image model through what you want in a more natural way?
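Something like this, maybe: a hypothetical sketch where a chat model sits in front of the image model and expands a plain-English description into the keyword-heavy prompt for you. It assumes the openai Python package with an API key set in the environment; the model name and system prompt are just placeholders, not a real product:

```python
# Hypothetical sketch: a chat model as a natural-language front-end that turns a
# plain description into a detailed Stable Diffusion prompt (assumes the openai
# package and OPENAI_API_KEY set; model name is a placeholder).
from openai import OpenAI

client = OpenAI()

def expand_prompt(description: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rewrite the user's description as a detailed Stable Diffusion "
                    "prompt, adding style, lighting, and quality keywords."
                ),
            },
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content

# The expanded text would then be passed to the text2image pipeline above.
print(expand_prompt("a cozy cabin in the woods at dusk"))
```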
fluffy_assassins t1_j6jh00j wrote
People will have to develop the skill of composing the proper prompts to get what they really want.