Submitted by i_sanitize_my_hands t3_11wreix in MachineLearning

Hello ML sub,

How does one evaluate the quality of training images before actually training a model? Training a model is expensive. Is there a way to assess, at least roughly, the image quality of a training set for a particular task (say, object detection or semantic segmentation)? It doesn't have to be perfect, just some kind of hint...

Could you please point me to some papers or studies or discussions on this ?

There are objective metrics like PSNR or SSIM, but they need a reference image.
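For a reference-free starting point, no-reference statistics such as blur, exposure, and contrast can at least flag obviously degraded images before training. A minimal sketch with OpenCV (the function name and thresholds are my own, heuristic choices, not an established method):

```python
import cv2

def quick_quality_hints(path):
    """Rough, reference-free quality hints for a single training image.
    These are heuristics for flagging suspect images, not a task-specific evaluation."""
    img = cv2.imread(path)
    if img is None:
        return None  # an unreadable/corrupt file is itself a useful signal
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return {
        "blur": cv2.Laplacian(gray, cv2.CV_64F).var(),  # low variance -> likely blurry
        "brightness": float(gray.mean()),               # near 0 or 255 -> under-/over-exposed
        "contrast": float(gray.std()),                  # low -> flat, low-information image
        "resolution": img.shape[:2],                    # tiny images often hurt detection/segmentation
    }
```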

5

Comments


Joel_Duncan t1_jczkbc0 wrote

I keep seeing this getting asked like people are expecting a magic bullet solution.


In general you can only get out something within the realm of what you put in.

There are intelligent ways to structure training and models, but you can't fill in expected gaps without training with a reference or a close approximation of what those gaps are.

My best suggestion is to limit your input data or muxed model to specific high resolution subsets.

e.g. you can train a LoRA on a small, focused subset of data.

0

i_sanitize_my_hands OP t1_jczyrsl wrote

Not expecting a magic bullet solution. Been in the field long enough to know that.

However, any written record of the intelligent ways you mentioned is valuable and worth going through.

One of the reasons it gets asked a lot is that image quality analysis doesn't seem to get enough air time. There are only a few papers, some as old as 2016, and they don't reflect the trends since 'Attention Is All You Need'.

1

wind_dude t1_jd012ru wrote

I'm not big into image generation, but... some thoughts...

- SSIM - I believe the issue here has to do with the quality of the image captions. Perhaps merging captions on images could help.

- You could try training boolean classifiers for both images and captions (e.g. `is_junk`) and then using those models to remove junk from the training data; a rough sketch follows below.
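A minimal sketch of that filtering idea, assuming you have hand-labelled a small subset and precomputed an embedding per image (e.g. from CLIP or any pretrained backbone); all function names here are hypothetical:

```python
from sklearn.linear_model import LogisticRegression

def fit_junk_filter(labelled_embeddings, labels):
    """Train a simple is_junk classifier on a small hand-labelled subset (1 = junk, 0 = keep)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(labelled_embeddings, labels)
    return clf

def filter_dataset(clf, embeddings, paths, junk_threshold=0.9):
    """Drop images the classifier is confident are junk; keep everything else."""
    p_junk = clf.predict_proba(embeddings)[:, 1]
    return [path for path, prob in zip(paths, p_junk) if prob < junk_threshold]
```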

1

xEdwin23x t1_jd0pc6v wrote

Active learning deals with selecting a small subset of representative images that should perform about as well as a larger set of uncurated images. You could consider looking into that.
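One of the simplest pool-based active learning loops is margin-based uncertainty sampling: train on what you have, then have the model pick the next images to label. A sketch, assuming a classifier with an sklearn-style `predict_proba` interface (function name is mine):

```python
import numpy as np

def select_batch_by_uncertainty(model, unlabelled_pool, batch_size=100):
    """Pool-based active learning step: score the unlabelled pool with the current
    model and pick the examples it is least certain about for labelling."""
    probs = model.predict_proba(unlabelled_pool)   # shape (N, num_classes)
    sorted_probs = np.sort(probs, axis=1)          # ascending per row
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]  # top-1 minus top-2 confidence
    return np.argsort(margin)[:batch_size]         # smallest margin = most uncertain
```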

3

Keiny t1_jd2vgtk wrote

Someone suggested active learning, but it may be more suitable to look into the subfield of data valuation.

Data valuation broadly aims to assign values to data points that represent their contribution to a model's overall performance. Many methods are based on game-theoretic solution concepts such as the Shapley value and are therefore very expensive to compute. In practical settings, I would suggest the Shapley value over a kNN surrogate by Jia et al. (2019) or LAVA by Just et al. (2023).
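For a flavour of the kNN surrogate approach, here is a sketch of the closed-form recursion as I recall it from Jia et al. (2019); please check the formula against the original paper before relying on it, and note the function name and interface are my own:

```python
import numpy as np

def knn_shapley_single(train_X, train_y, x_test, y_test, K=5):
    """Shapley values of training points for a KNN surrogate w.r.t. one validation point
    (my recollection of the recursion in Jia et al., 2019). Expects numpy arrays.
    In practice, average the returned values over a whole validation set."""
    N = len(train_y)
    order = np.argsort(np.linalg.norm(train_X - x_test, axis=1))  # nearest first
    match = (train_y[order] == y_test).astype(float)
    s = np.zeros(N)
    s[N - 1] = match[N - 1] / N
    for i in range(N - 2, -1, -1):  # walk from the farthest point back to the nearest
        s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, i + 1) / (i + 1)
    values = np.zeros(N)
    values[order] = s  # map back to the original training indices
    return values
```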

You can find more papers at the GitHub repo awesome-data-valuation.

Hope that helps!

1

paulgavrikov t1_jd3jf74 wrote

Currently there are no good methods to do this. There's a discussion of existing methods and many insights into the problem in this paper: https://arxiv.org/abs/2206.14486

TL;DR: which images you should remove depends on the ratio of samples to parameters; no current method works anywhere near the ideal, but you may see improvements if you choose the most expensive methods.
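In the spirit of that paper, one of the cheaper pruning scores is distance to a cluster prototype in a self-supervised embedding space: far-from-centroid examples are "hard", near ones are "easy". A rough sketch (the function name, cluster count, and thresholds are my own illustrative choices, not the paper's exact recipe):

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_prototype_distance(embeddings, keep_fraction=0.8, n_clusters=100, keep_hard=True):
    """Score each image by distance to its nearest k-means centroid in embedding space,
    then keep the hardest (far) examples when data is abundant relative to the model,
    or the easiest (near) examples when data is scarce."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    order = np.argsort(dists)                  # near (easy) -> far (hard)
    n_keep = int(keep_fraction * len(embeddings))
    return order[-n_keep:] if keep_hard else order[:n_keep]
```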

1