Submitted by ThoughtOk5558 t3_xvcman in deeplearning
I generated CIFAR10 images using energy-based models, sampling from the joint distribution of the "airplane" (class 0) and "bird" (class 2) classes. As can be seen below, the generated images can't be visually classified as any of the CIFAR10 classes, i.e., the prediction should roughly be a uniform distribution.
Sampled from the joint distribution of the CIFAR10 "airplane" and "bird" classes.
However, when I run inference using a pre-trained CIFAR10 model (link), the confidence scores of the predicted classes are very high.
I am aware of adversarial attacks, and this is a kind of adversarial attack.
So here is my opinion (and question): I believe CNNs, or any network, should consider visual quality when making a prediction.
Should / can CNNs be improved to act this way?
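One way to make the "prediction should roughly be uniform" intuition concrete is to measure the entropy of the softmax output: a well-calibrated model should produce high-entropy (near-uniform) predictions on ambiguous inputs. A minimal sketch below, using NumPy and hypothetical logits (not the actual model's outputs), also shows temperature scaling, a standard post-hoc calibration trick that flattens overconfident distributions:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; T > 1 flattens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predictive_entropy(probs):
    """Entropy in nats; a uniform 10-class prediction gives log(10) ≈ 2.30."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

# Hypothetical logits for an ambiguous airplane/bird sample:
# high scores on classes 0 and 2, low elsewhere.
logits = np.array([5.0, 0.1, 4.8, 0.2, 0.1, 0.0, 0.1, 0.2, 0.1, 0.3])

p_raw = softmax(logits, temperature=1.0)
p_cal = softmax(logits, temperature=3.0)

print(f"max confidence (T=1): {p_raw.max():.3f}, entropy: {predictive_entropy(p_raw):.3f}")
print(f"max confidence (T=3): {p_cal.max():.3f}, entropy: {predictive_entropy(p_cal):.3f}")
```

Calibration alone doesn't solve the deeper issue the post raises, though: temperature scaling rescales confidence globally, but it can't make the model recognize that a specific image is off-manifold.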
Thank you.
XecutionStyle t1_ir0dv73 wrote
How do you propose we define quality?