Submitted by Waterfront_xD t3_ydc9n1 in MachineLearning

I'm training a machine learning model using YOLOv5 from Ultralytics (arch: YOLOv5s6). The task is to detect and identify laundry symbols. For that, I've scraped and labeled 600 images from Google.

Using this dataset, I get an mAP of around 0.6.

But 600 images is a tiny dataset, and it's imbalanced: for some laundry symbols I have only 1-4 training images, while for others I have 100 or more.

So I started writing a Python script which generates more images of laundry symbols. The script basically takes a background image and adds 1-10 randomly positioned laundry symbols in different colors and rotations. No background is used twice. With that script, I generated around 6,000 entirely different images, so that every laundry symbol appears at least 800 times in the dataset.
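For illustration, here's a minimal sketch of that kind of paste-based generator using PIL. The directory layout, rotation range, and class indexing are illustrative assumptions, not the actual script:

```python
import random
from pathlib import Path
from PIL import Image

SYMBOLS = sorted(Path("symbols").glob("*.png"))          # assumed: RGBA symbol crops
BACKGROUNDS = sorted(Path("backgrounds").glob("*.jpg"))  # assumed: each used only once

def generate(bg_path: Path, out_stem: str) -> None:
    bg = Image.open(bg_path).convert("RGB")
    labels = []
    for _ in range(random.randint(1, 10)):
        cls = random.randrange(len(SYMBOLS))
        sym = Image.open(SYMBOLS[cls]).convert("RGBA")
        sym = sym.rotate(random.uniform(-45, 45), expand=True)
        # Assumes every symbol crop is smaller than the background.
        x = random.randint(0, bg.width - sym.width)
        y = random.randint(0, bg.height - sym.height)
        bg.paste(sym, (x, y), mask=sym)  # the alpha channel acts as the paste mask
        # YOLO label format: class x_center y_center width height (all normalized).
        # The box covers the rotated canvas, so it is slightly loose.
        labels.append(f"{cls} {(x + sym.width / 2) / bg.width:.6f} "
                      f"{(y + sym.height / 2) / bg.height:.6f} "
                      f"{sym.width / bg.width:.6f} {sym.height / bg.height:.6f}")
    bg.save(f"{out_stem}.jpg")
    Path(f"{out_stem}.txt").write_text("\n".join(labels))
```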

Here are examples of the generated data: Link 1 Link 2

I combined the scraped and generated datasets and retrained the model with the same configuration. The result is really bad: the mAP dropped to 0.15 and the model overfits. The confusion matrix told me why: Confusion matrix

Why is the model learning the background instead of the objects?

At first I thought my annotations might be wrong, but the training script from Ultralytics saves a few examples of training batch images, and there the boxes are drawn perfectly around the generated symbols.

For completeness, here are more analytics about the training:

More analytics

Labels · Curves · More examples from the dataset

3

Comments


emotional_nerd_ t1_itrn8sd wrote

Hmm. I'm wondering whether this might be due to the added variety of backgrounds, causing the model to be easily confused by the details.

2

Fapaak t1_itrsfcg wrote

A common problem in object detection is an unbalanced dataset: if you cut out squares from the image and, say, 100 squares are background and 5 are the symbol, the model may not learn to distinguish them.

Try balancing the dataset if that's your issue, and try using Focal Loss.
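For reference, a rough sketch of the binary focal loss in PyTorch (standalone code, not YOLOv5's internal loss; the alpha/gamma values are the usual defaults from the paper):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    # Per-element BCE with no reduction, so each element can be reweighted.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)              # prob. of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)  # class balancing term
    # Easy examples (p_t near 1) get down-weighted, so hard ones dominate the loss.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```

If I remember right, YOLOv5 also exposes this through the fl_gamma hyperparameter in its hyp YAML files, so you may not need custom code at all.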

4

mearco t1_itrsfny wrote

In my opinion, the symbols stand out too easily against the backgrounds you are using. The synthetic images you make are too different from the images you really want to perform well on.
I would work on trying to collect more data or do more classic data augmentation.

It would be quite difficult to generate more realistic synthetic examples. One issue you have is the square color background around the object. You should try using a background remover tool so that you just have the black lines of the symbol.
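Since the symbols are dark strokes on a light rectangle, even simple thresholding can stand in for a background remover. A minimal sketch; the threshold value is a guess to tune per source image:

```python
import numpy as np
from PIL import Image

def strokes_to_rgba(path: str, threshold: int = 80) -> Image.Image:
    """Turn 'dark strokes on a light rectangle' into a transparent-background RGBA."""
    gray = np.array(Image.open(path).convert("L"))
    rgba = np.zeros((*gray.shape, 4), dtype=np.uint8)  # strokes stay black (0, 0, 0)
    rgba[..., 3] = np.where(gray < threshold, 255, 0)  # opaque only on the strokes
    return Image.fromarray(rgba, "RGBA")
```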

7

StephaneCharette t1_itsddz6 wrote

Note from the Darknet + YOLO FAQ: "Can I train a neural network using synthetic images?"

>No.
>
>Or, to be more precise, you'll probably end up with a neural network that is great at detecting your synthetic images, but unable to detect much in real-world images.

Source: https://www.ccoderun.ca/programming/darknet_faq/#synthetic_images

I made that statement several years ago, and after all this time, I still think the correct answer is "no". Every time I try to use synthetic images, it never works out as I had planned.

Looking at your "Link1" and "Link2", it is immediately obvious this is not going to work. You cannot crop your objects: https://www.ccoderun.ca/programming/darknet_faq/#crop_training_images

Darknet/YOLO (and under the covers, I believe that Ultralytics is using Darknet) learns from context, not only what is in the bounding boxes. So if you are trying to detect snowboarders with those symbols, then you'll do OK. But if you are expecting to pass in images or video frames with clothes, then that snowboarder and bus are doing nothing to help you.

Want proof? Here is a YOLO neural network video I happened to upload to youtube today: https://www.youtube.com/watch?v=m3Trxxt9RzE

Note the "6" and "9" on those cards. They are correctly recognized, no confusion even though the font used makes those 2 numbers look identical when rotated 180 degrees. YOLO really does look at much more than just the bounding box.

6

AllDogsAreBlue t1_itsfh37 wrote

I think the problem is probably that you have wrapped the symbols in little rectangles (the blue/green/red/white backgrounds) before pasting them onto the random backgrounds, so the network is learning that they have to be wrapped in rectangles to count as objects, and thus fails on your real data in which they are not wrapped in rectangles.

It's not clear what the best solution is, but one possible approach would be to extract just the black strokes that make up the symbol and randomly paste those onto the random background images. (This might be a little more technically challenging to implement: you would have to store the symbols as images with an alpha/transparency channel, with an alpha of 1 where the symbols' strokes are and an alpha of 0 everywhere else. Then, when pasting, do alpha compositing; see the sketch below.)
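A minimal sketch of that compositing step with NumPy (the array shapes and uint8 convention are assumptions):

```python
import numpy as np

def alpha_composite(bg: np.ndarray, sym_rgba: np.ndarray, x: int, y: int) -> np.ndarray:
    """Composite an RGBA symbol onto an RGB background at (x, y).

    bg: HxWx3 uint8 array; sym_rgba: hxwx4 uint8 array whose alpha is
    255 on the symbol's strokes and 0 everywhere else.
    """
    h, w = sym_rgba.shape[:2]
    region = bg[y:y + h, x:x + w].astype(np.float32)
    alpha = sym_rgba[..., 3:4].astype(np.float32) / 255.0
    strokes = sym_rgba[..., :3].astype(np.float32)
    # Standard "over" compositing: out = a * foreground + (1 - a) * background
    bg[y:y + h, x:x + w] = (alpha * strokes + (1.0 - alpha) * region).astype(np.uint8)
    return bg
```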

Also, you probably want a wider variety of random rotations and scales, and probably even perspective transforms.
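For those wider augmentations, something like this applied to the RGBA symbol crop before pasting could work. A sketch using torchvision; all the ranges here are guesses to tune:

```python
import random
from PIL import Image
from torchvision import transforms as T

# Applied to the RGBA symbol crop *before* pasting onto the background.
symbol_aug = T.Compose([
    T.RandomRotation(degrees=180, expand=True, fill=0),        # any orientation
    T.RandomPerspective(distortion_scale=0.4, p=0.5, fill=0),  # fake viewing angles
])

def augment_symbol(sym: Image.Image, min_scale=0.3, max_scale=1.5) -> Image.Image:
    s = random.uniform(min_scale, max_scale)                   # random scale
    sym = sym.resize((max(1, round(sym.width * s)), max(1, round(sym.height * s))))
    return symbol_aug(sym)
```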

3

Waterfront_xD OP t1_ituvw37 wrote

Initially, my thinking went in this direction: "if I train with such random backgrounds but clearly visible symbols on top, the production model will later find laundry symbols in any kind of image the user sends".

At first I had the symbols transparent on the background, but it was already super hard for a human to find the symbols, and I figured that in the real dataset the laundry symbols will always be on a single-colored background. That's why I started adding random background colors.

1


Waterfront_xD OP t1_ituyep7 wrote

Thank you very much for this answer!

I understand now that I shouldn't select my model architecture based only on the performance numbers and reviews I read in blog posts. It requires digging deeper into the architecture and understanding how it works to find the right one for the use case.

1