Submitted by kingdroopa t3_10f7dyr in MachineLearning
[removed]
Have you already tried different GAN variants for more stable training?
Architecturally, probably some form of U-Net is best. It's the architecture of choice for things like segmentation, so I imagine it would be good for IR as well.
Could you recommend any SOTA models using U-NET?
I've tried CycleGAN, CUT (an improvement on CycleGAN), NEGCUT (similar to CUT), and ACL-GAN.
Hmm, interesting! Do you have any papers/article/sources supporting this claim?
+1 for UNets. Since IR will be a single channel you could use a single class semantic segmentation-type model (i.e. a UNet with a 1-channel output passed through a sigmoid). Something like this would get you started:
    import segmentation_models as sm

    # 1-channel output passed through a sigmoid, so each pixel is a continuous value in [0, 1]
    model = sm.Unet('resnet34', classes=1, activation='sigmoid')
Edit: Forgot the link for the package I'm referencing: https://github.com/qubvel/segmentation_models
Many of the most popular encoders/backbones are implemented in that package
Edit 2: Is the FOV important? If you could resize the images so that the RGB & IR FOV are equivalent then that would make things a lot simpler
Thanks a lot! Will look into it, but seems like the U-NET outputs are segmentation masks, whilst I want it to actually output (generate) IR image equivalents of the RGB image. Is there some idea that I'm missing, perhaps?
Sorry, I was wrong. Modern deep VAEs can match SOTA GAN performance for image super-resolution (https://arxiv.org/abs/2203.09445), but I don't have evidence for recoloring.
But diffusion models have been shown to outperform GANs on multiple image-to-image translation tasks, e.g. https://deepai.org/publication/palette-image-to-image-diffusion-models
You could probably reframe your problem as an image colorization task (https://paperswithcode.com/task/colorization), and the SOTA there is still Palette, linked above.
Thanks :) I noticed Palette uses paired images, whilst mine are a bit unaligned. Would you consider it a paired image set, or unpaired? They look closely similar, but don't share pixel information in the top/bottom of the images.
That depends on the extent to which the pixel information is misaligned I think. If cropping your images is not a solution and a large portion of your images have this issue, the model wouldn't be able to generate the right pixel information for the misaligned sections. But it's worth giving a try with Palette if the misalignment is not significant.
The Unet I described will output a continuous number for each pixel between 0 & 1, which you can use as a proxy for your IR image.
People often apply a threshold to this output (e.g. 0.5) to create a mask, which might be where you are getting confused.
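A rough sketch of the difference, assuming a trained Keras-style model like the one above and a hypothetical rgb_batch array:

    import numpy as np

    # pred has shape (batch, H, W, 1) with continuous values in [0, 1]
    pred = model.predict(rgb_batch)

    # Segmentation use case: threshold to a binary mask
    mask = (pred > 0.5).astype(np.uint8)

    # Your use case: keep the continuous values and rescale to 8-bit IR-like intensities
    ir_proxy = (pred[..., 0] * 255).astype(np.uint8)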
I think one important part here is the "misalignment" of the images. Have you tried to cut and resize the images, so that they show the same region? You don't need a GAN then
Maybe you could also turn the RGB image into grayscale and use it as an additional supervised loss for regularization and maybe more stable training.
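Something like this, as a rough PyTorch-style sketch (G, D, adversarial_loss, rgb and the 0.1 weight are all placeholders for whatever your GAN setup already uses):

    import torch.nn.functional as F

    def rgb_to_gray(rgb):
        # Standard luminance weights; rgb is (batch, 3, H, W) in [0, 1]
        return 0.299 * rgb[:, 0:1] + 0.587 * rgb[:, 1:2] + 0.114 * rgb[:, 2:3]

    fake_ir = G(rgb)                                  # generator output, (batch, 1, H, W)
    adv_loss = adversarial_loss(D(fake_ir))           # existing GAN loss
    gray_loss = F.l1_loss(fake_ir, rgb_to_gray(rgb))  # extra supervised regularizer
    total_loss = adv_loss + 0.1 * gray_loss           # weight is arbitrary, tune it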
You cannot just translate visible light to IR. No matter what machine learning you use, this is physically impossible.
Correct, it's not physically possible. This is a research project to find to what degree it IS possible :)
Interesting! I will for sure write that down in my TODO list, thanks!
The GAN models I've tested are based on the 'unaligned' approach (e.g. CycleGAN). I still haven't tested cutting and resizing the images to make them show the same region. My immediate thought would be that the top and bottom of both images might disappear, but perhaps it's OK still?
Ahh, I see. Thanks! I'll write it down in my TODO list. Might have to investigate seg masks a bit more :)
Okay, in that case, I'll try to be a bit more helpful lol.
I think you absolutely need to use something like YOLO for object identification/classification.
Humans and animals are warmer than the environment
Cars and other vehicles are warmer than the environment
Glass blocks IR but not visible light
You could get the overall "look" with just image-based networks, but to make it really convincing (more like COD's thermal vision) you need classification in order to make objects look hot that are supposed to be hot.
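Something along these lines, purely as a sketch (the detector choice, class list and blending strategy are all assumptions, not a tested recipe):

    import numpy as np
    from ultralytics import YOLO

    detector = YOLO('yolov8n.pt')
    HOT_CLASSES = {'person', 'car', 'truck', 'bus', 'dog'}  # made-up list of "warm" classes

    def hot_mask(rgb_image):
        # Mark regions that should look warm in the synthesized IR image
        mask = np.zeros(rgb_image.shape[:2], dtype=np.float32)
        result = detector(rgb_image)[0]
        for box, cls_id in zip(result.boxes.xyxy, result.boxes.cls):
            if result.names[int(cls_id)] in HOT_CLASSES:
                x1, y1, x2, y2 = map(int, box)
                mask[y1:y2, x1:x2] = 1.0
        return mask  # blend into the generated IR image to brighten these regions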
If the two cameras are rigidly fixed, then you can calibrate them like one calibrates a stereo pair, and at least align the orientation and intrinsics. The points very far from the camera will be well aligned; the ones very close will remain unaligned.
The calibration process will involve you marking corresponding points by hand, but the maths for the correction is very simple after that.
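If a full stereo calibration feels like overkill, even a single homography fitted to a handful of hand-picked point pairs gets you most of the way for distant scenery (rough OpenCV sketch; the coordinates and image variables are placeholders):

    import cv2
    import numpy as np

    # Corresponding points clicked by hand in the RGB and IR images (placeholder values)
    pts_rgb = np.array([[120, 80], [600, 95], [590, 430], [130, 415]], dtype=np.float32)
    pts_ir  = np.array([[100, 70], [580, 85], [575, 420], [115, 405]], dtype=np.float32)

    H, _ = cv2.findHomography(pts_rgb, pts_ir)

    # Warp the RGB image into the IR frame; distant points align well,
    # close objects keep some parallax error
    h, w = ir_image.shape[:2]
    rgb_aligned = cv2.warpPerspective(rgb_image, H, (w, h))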
BlazeObsidian t1_j4v495i wrote
Autoencoders like VAEs should work better than any other models for image-to-image translation. Maybe you can try different VAE models and compare their performance. Edit: I was wrong.