Submitted by vocdex t3_y6abd5 in MachineLearning

I trained a SINGLE class instance segmentation model with Detectron2 and YOLACT.

Both perform quite well.

What I want to do next:

  1. Crop out detected instances (see the cropping sketch after this list).
  2. Obtain image embeddings using PCA or (variational) autoencoders (any suggestions?).
  3. Do some sort of clustering on those embeddings (e.g. k-means, possibly after PCA).
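
For step 1, here is the kind of masked crop I have in mind (a minimal sketch, assuming the masks are boolean numpy arrays, e.g. Detectron2's `pred_masks` moved off the GPU):

```python
import numpy as np

def crop_instance(image, mask):
    """Crop one detected instance to its bounding box and zero out
    background pixels using the instance mask.

    image: HxWx3 uint8 array; mask: HxW boolean array (e.g. one slice
    of Detectron2's instances.pred_masks, converted to numpy).
    """
    ys, xs = np.where(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1].copy()
    crop[~mask[y0:y1, x0:x1]] = 0  # suppress background inside the box
    return crop
```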

Does anyone think this pipeline makes sense? Could you provide any suggestions for image embedding techniques?

I expect this pipeline to group the objects into two categories based on shape: straight and bent. This feature is the most visible one to the human eye, but I'm not sure whether it will work.

Thanks a lot!

# Edit: the object is asparagus in a greenhouse farm. I'm using instance segmentation to avoid background/foreground pixels, so that I can later use the segmentation mask for point cloud generation (with corresponding depth maps).

6

Comments


Grove_street_home t1_iso2syg wrote

I have no direct answer, but if you want to separate the straight and bent objects, you could also try comparing them to their convex hull images (assuming that bent objects actually have concave segmentation masks).

So, for example, if the convex hull area of a mask is 20% larger or more, the object is bent; otherwise, it's straight.
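
Something like this with OpenCV (untested sketch; the 20% threshold is a guess that would need tuning on real masks):

```python
import cv2
import numpy as np

def is_bent(mask, ratio_threshold=1.2):
    """Flag a binary instance mask as bent if its convex hull area is
    noticeably larger than the mask's own contour area."""
    mask_u8 = mask.astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)  # largest component
    hull = cv2.convexHull(contour)
    return cv2.contourArea(hull) >= ratio_threshold * cv2.contourArea(contour)
```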

4

ThrowThisShitAway10 t1_iso5tdz wrote

Sounds reasonable to me. I just wouldn't run PCA on the image data directly; I would feed the cropped images through a pre-trained ResNet backbone or similar, and then apply PCA/t-SNE to those embeddings.
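
For example, something along these lines (a sketch assuming the crops are uint8 numpy arrays; ResNet-50 is just one choice of backbone):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA

# Frozen ResNet-50 with the classification head removed,
# so each crop maps to a 2048-d embedding.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(crops):
    batch = torch.stack([preprocess(c) for c in crops])
    return backbone(batch).numpy()

# embeddings = embed(list_of_crops)
# reduced = PCA(n_components=50).fit_transform(embeddings)  # then t-SNE / k-means
```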

1

jake_1001001 t1_iso7foq wrote

Do you have a labeled dataset for training? (Bent or straight)

Why use segmentation? Please clarify the task definition; it is currently quite vague. Plain object detection should be adequate for cropping your objects, since most DL frameworks take rectangular inputs, though even that may be unnecessary depending on your dataset and input. If you are worried about the background, you shouldn't be: that information may help the model determine relative shape, or it will just be treated as noise if your training set is large enough and matches your expected input distribution.

For embeddings, you could use a pretrained contrastive or supervised image encoder such as ViT.

Clustering can be done by training a linear classifier with a CE loss on bent/straight labeled images via fine-tuning, linear probing, or domain adaptation (adapters, or retraining the norms). The loss will find the class centroids for you and provide a nice probability output. Of course, you could train a k-means classifier on the embeddings instead if you'd like.
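
A minimal linear-probe sketch in PyTorch (assuming frozen 2048-d embeddings, e.g. from a ResNet; data loading is omitted):

```python
import torch
import torch.nn as nn

# Single linear layer trained with cross-entropy on frozen
# embeddings (straight=0, bent=1).
probe = nn.Linear(2048, 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(embeddings, labels):
    # embeddings: (N, 2048) float tensor; labels: (N,) long tensor
    optimizer.zero_grad()
    loss = criterion(probe(embeddings), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# probs = torch.softmax(probe(embeddings), dim=1)  # the "nice probability output"
```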

1

vocdex OP t1_isofmvp wrote

Thanks for the suggestions.

I don't have a labeled dataset but I can create one, for sure. The object here is asparagus in a greenhouse farm.

Here's the situation: I am using segmentation because, in the future, I want to combine it with depth maps to create point clouds. I tried doing this with only bounding box detections, but due to the presence of background and foreground pixels (different depth image values), I was getting quite bad point clouds. Then I applied a simple depth-value-based filter to crop out only the object without any background/foreground. This works but doesn't generalize well to all situations.

I thought that instance segmentation would give me only the object pixels, and that I could fuse these with the depth values to get point clouds.
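
Something like this pinhole back-projection is what I have in mind (a sketch assuming a depth map aligned with the mask, in metres, and known camera intrinsics fx, fy, cx, cy):

```python
import numpy as np

def mask_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project only the masked depth pixels into 3D using the
    pinhole camera model; returns an (N, 3) array of points."""
    ys, xs = np.where(mask & (depth > 0))  # valid object pixels only
    z = depth[ys, xs]
    x = (xs - cx) * z / fx
    y = (ys - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```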

Moreover, there could be clusters other than bent vs. straight. So, I want the clustering algorithm to find those clusters in an unsupervised fashion. If this doesn't work, then yes, I guess I'll have to create a dataset and train a separate bent-vs-straight classification model.

Thanks for reading till here!

1

jake_1001001 t1_isoot8f wrote

Aha, OK, using the segmentations to extract the object point cloud seems good; I have used a similar approach for face reconstruction.

Have you tried 3D approaches (rigid and non-rigid alignment)? How similar are the objects? You could use the dense alignment error to determine whether an object matches a straight one.
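
For the rigid case, a rough sketch with Open3D's ICP (max_dist and any threshold on the residual would need tuning; non-rigid alignment needs other tooling):

```python
import numpy as np
import open3d as o3d

def alignment_error(points, template_points, max_dist=0.01):
    """Rigidly align an instance point cloud to a straight template
    with ICP and return the residual RMSE as a rough 'bentness' score."""
    src = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))
    tgt = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(template_points))
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.inlier_rmse  # a large residual suggests a bent sample
```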

But going back to image-based methods: if your segmentation model is good, its encoder may already provide good embeddings. You could take those embeddings and compute their distance to the embeddings of templates (straight, bent, etc.). K-means may not cluster as you expect if there is high variance in the samples (shape, size, color, etc.), which is why supervised methods could be preferred. Templates provide a prototype for your class to compute distance/similarity to (Euclidean, cosine similarity). It is crude, but it could work in constrained settings.
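
The template comparison could be as simple as this (a sketch; the templates would be embeddings of a few hand-picked straight/bent examples):

```python
import numpy as np

def nearest_template(embedding, templates):
    """Assign an instance embedding to the closest class template by
    cosine similarity. `templates` is a dict like
    {"straight": vec, "bent": vec} built from hand-picked crops."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(templates, key=lambda name: cos(embedding, templates[name]))
```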

2

vocdex OP t1_isrjl7v wrote

Ah, I hadn't considered 3D approaches, but I'll definitely check them out. The objects are quite similar (green color; just the shape is different). Thank you for your help!

1