I believe a robust backbone model is crucial, since the backbone is the feature extractor and determines how good your embeddings are. So I suggest using CLIP from OpenAI, a very OP model that works well for zero-shot learning tasks.
I personally use it, and it surprisingly outperformed others on a text-image retrieval task. I highly recommend you try it out.
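
If it helps, here is a minimal sketch of what the retrieval step looks like once you have embeddings. It assumes you have already run your images and your query text through CLIP's image and text encoders. The vectors below are made-up toy values standing in for real CLIP outputs, and `l2_normalize` and `rank_images` are helper names I picked for illustration, not part of any library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_images(text_embedding, image_embeddings):
    """Return (image_id, cosine_similarity) pairs, best match first."""
    query = l2_normalize(text_embedding)
    scores = []
    for image_id, embedding in image_embeddings.items():
        unit = l2_normalize(embedding)
        similarity = sum(q * v for q, v in zip(query, unit))
        scores.append((image_id, similarity))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-d vectors (assumption): real CLIP embeddings are much higher-dimensional.
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Pretend this is the text encoder's output for "a photo of a dog".
text_embedding = [0.8, 0.2, 0.1]

ranking = rank_images(text_embedding, image_embeddings)
print(ranking[0][0])  # prints "dog.jpg"
```

Because both sides are normalized first, the ranking depends only on the angle between the text and image vectors. That is why the quality of the backbone's embeddings matters so much here.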
NonFocusNorm t1_itq4vts wrote
Reply to Combining image and text embedding [P] by External_Oven_6379