Submitted by im-so-stupid-lol t3_10pa18k in singularity
Right now, text2image models are generating impressive output, but I see two major issues with how they're being used.
The first issue is the required dataset size. If you show a human 4 or 5 images of an animal they've never seen before, they'll generally be able to draw it quite well. But Stable Diffusion needs far more images to train on.
But the second and more prevalent issue is "prompt engineering". Every damn portrait shot has some weird garbage about "4k ultra high definition high fidelity great lighting award winning beautiful stunning photo" or some bullshit. This suggests to me that we are still not communicating very well with the model. Hell, negative prompts show this as well: some people have super long negative prompts full of things like "6 toes" and "deformed face". So the AI still needs to be told explicitly "don't draw a person with a deformed face".
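For concreteness, this is roughly what a typical generation call looks like with the Hugging Face diffusers library. The model id and the prompt text are just illustrative examples, but the pattern of stuffing quality keywords into the prompt and failure modes into the negative prompt is exactly what I mean:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion checkpoint (model id is just an example).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The actual subject is one short phrase; everything after it is quality boilerplate.
prompt = (
    "portrait photo of an elderly fisherman, 4k, ultra high definition, "
    "high fidelity, great lighting, award winning, stunning photo"
)

# The negative prompt explicitly lists failure modes the model should already avoid.
negative_prompt = "deformed face, extra fingers, 6 toes, blurry, low quality, watermark"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("portrait.png")
```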
What do you see as the solution to these issues going forward? Will it be a combination of models like ChatGPT and models like Stable Diffusion, so you can talk the text2image model through what you want in a more natural way?
starstruckmon t1_j6l1k5l wrote
>take a human and show them 4 or 5 images of an animal they've never seen before they'll generally be able to draw it quite well
4-5 is actually enough to fine-tune a pretrained SD model, which is the correct comparison, since we're already pretrained. Even if you ignore all the data up to that point in your life, even newborn brains are pretrained by evolution. They aren't initialised from random weights. It's easier to notice this in other animals that can start walking right after birth.
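As a rough sketch of what that few-image fine-tune looks like in practice, here's a stripped-down DreamBooth-style training loop using the diffusers library. The file names, placeholder token, learning rate and step count are assumptions for illustration, not a tuned recipe:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from diffusers import StableDiffusionPipeline, DDPMScheduler

device = "cuda"
model_id = "runwayml/stable-diffusion-v1-5"  # example base checkpoint

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float32).to(device)
unet, vae, text_encoder, tokenizer = pipe.unet, pipe.vae, pipe.text_encoder, pipe.tokenizer
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# 4-5 photos of the unseen animal, captioned with a rare placeholder token.
image_paths = ["okapi_1.jpg", "okapi_2.jpg", "okapi_3.jpg", "okapi_4.jpg"]  # hypothetical files
caption = "a photo of sks animal"

preprocess = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])

# Only the UNet gets updated; the VAE and text encoder stay frozen.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-6)

for step in range(200):  # a few hundred passes is typically plenty for a handful of images
    for path in image_paths:
        pixels = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        # Encode the image into the latent space the diffusion model works in.
        latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
        # Add noise at a random timestep and train the UNet to predict that noise.
        noise = torch.randn_like(latents)
        t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,), device=device)
        noisy_latents = noise_scheduler.add_noise(latents, noise, t)
        ids = tokenizer(
            caption, padding="max_length",
            max_length=tokenizer.model_max_length, return_tensors="pt",
        ).input_ids.to(device)
        encoder_hidden_states = text_encoder(ids)[0]
        noise_pred = unet(noisy_latents, t, encoder_hidden_states).sample
        loss = F.mse_loss(noise_pred, noise)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

After something like this, a prompt such as "a photo of sks animal standing in a field" renders the new concept, which is the sense in which 4-5 images really is enough for a model that's already pretrained.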