Submitted by alkibijad t3_10a6whe in MachineLearning
Are there smaller/distilled versions of CLIP? Or some other (smaller) models that connect text and images?
For my use case, the model needs to be small: ideally < 20MB, fine at < 60MB, OK at < 100MB.
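To put those size targets in terms of model capacity, here is a rough back-of-the-envelope sketch that converts a file-size budget into a maximum parameter count. The assumptions are mine, not from the post: 1 MB is taken as 10**6 bytes, the weights are assumed to dominate the file size, and `max_params` is just a hypothetical helper name.

```python
# Rough parameter budget for a target model file size.
# Assumptions (not from the post): 1 MB = 10**6 bytes, and the stored
# weights account for essentially the whole file.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def max_params(size_mb: float, dtype: str) -> int:
    """Largest number of parameters that fits in size_mb at the given precision."""
    return int(size_mb * 1_000_000) // BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    for size_mb in (20, 60, 100):
        budget = ", ".join(
            f"{dtype}: {max_params(size_mb, dtype) / 1e6:.0f}M"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{size_mb} MB -> {budget}")
```

Under these assumptions, the 20MB target leaves room for about 5M parameters at fp32, or about 20M if the weights can be stored as int8, so the storage precision matters as much as the architecture choice.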
suflaj t1_j42i6pu wrote
Nope. The authors experimented with smaller versions but said performance is lost. You can try replacing the transformers with ResNet50, but you'll have to do it yourself AFAIK.