Submitted by alkibijad t3_10a6whe in MachineLearning
Are there smaller/distilled versions of CLIP? Or some other (smaller) models that connect text and images?
For my use case, the model needs to be small: ideally < 20 MB, fine < 60 MB, okay < 100 MB.
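For scale: stock CLIP ViT-B/32 is around 300 MB even at fp16, so the budget really does require something distilled or much smaller. A quick way to check any candidate against the budget is to estimate serialized size from the parameter count; a minimal sketch, assuming the Hugging Face transformers CLIP classes (bytes-per-parameter is just a precision assumption):

```python
import torch
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

def estimated_size_mb(module: torch.nn.Module, bytes_per_param: int = 2) -> float:
    """Rough on-disk size if weights are stored at the given precision
    (2 bytes/param ~ fp16, 4 ~ fp32, 1 ~ int8)."""
    n_params = sum(p.numel() for p in module.parameters())
    return n_params * bytes_per_param / 1e6

print(f"full CLIP   : {estimated_size_mb(model):.0f} MB at fp16")
print(f"vision only : {estimated_size_mb(model.vision_model):.0f} MB at fp16")
print(f"text only   : {estimated_size_mb(model.text_model):.0f} MB at fp16")
```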
LetterRip t1_j43v3yi wrote
This group did such a distillation and got it down to 24 MB, but they didn't share the weights:
https://www.reddit.com/r/MachineLearning/comments/p1o2bd/research_we_distilled_clip_model_vit_only_from/
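As the thread title says, that was a ViT-only distillation: train a small student image encoder to reproduce the frozen CLIP image embeddings, so it can still be paired with the original CLIP text encoder. A rough sketch of that general recipe (not their code; the timm student and the image loader are stand-ins you'd swap for your own):

```python
import timm
import torch
import torch.nn.functional as F
from transformers import CLIPModel

# Frozen teacher: full CLIP, used only to produce target image embeddings.
teacher = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Small student (~5.7M params, well under 20 MB at fp16) with a 512-d head
# to match CLIP ViT-B/32's projected embedding size.
student = timm.create_model("vit_tiny_patch16_224", pretrained=False, num_classes=512)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

for pixel_values in image_loader:  # placeholder: any large unlabeled image dataset
    with torch.no_grad():
        t = teacher.get_image_features(pixel_values=pixel_values)  # teacher embeddings
    s = student(pixel_values)                                      # student embeddings
    # Match the teacher embedding in both direction and magnitude.
    loss = (1 - F.cosine_similarity(s, t).mean()) + F.mse_loss(s, t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```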
LAION or stability.ai or huggingface might be willing to provide free compute to distill one of the openCLIP models.
Come to think of it, stability.ai should be releasing the distilled Stable Diffusion later this month (a week or two?), and it will presumably include a distilled CLIP.