NonFocusNorm t1_itq4vts wrote on October 25, 2022 at 2:20 PM

Reply to Combining image and text embedding [P] by External_Oven_6379

I believe a robust backbone model is crucial, since the backbone is the feature extractor and determines how good your embeddings are. So I suggest using CLIP from OpenAI, a very strong model that works well for zero-shot learning tasks. I personally use it, and it surprisingly outperformed the others in a text-image retrieval task. Highly recommend you try it out.
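
If it helps, here is a rough sketch of what text-image retrieval with CLIP could look like. This is not my exact setup: it assumes the Hugging Face `transformers` port of CLIP, the `openai/clip-vit-base-patch32` checkpoint, and Pillow for image loading. The helper names `clip_embed` and `rank_images` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(text_vec, image_vecs):
    """Return (image_index, score) pairs, best match for the text first."""
    scores = [(i, cosine_similarity(text_vec, v)) for i, v in enumerate(image_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def clip_embed(texts, image_paths, checkpoint="openai/clip-vit-base-patch32"):
    """Embed texts and images into CLIP's shared space.

    Assumption: uses the Hugging Face `transformers` CLIP classes and Pillow.
    Heavy imports are kept inside the function so the ranking helpers above
    work without them. Downloads the checkpoint on first use.
    """
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)
    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # text_embeds / image_embeds are the projected embeddings in the shared space.
    return outputs.text_embeds.tolist(), outputs.image_embeds.tolist()
```

Usage would be something like `text_vecs, image_vecs = clip_embed(["a photo of a dog"], paths)` followed by `rank_images(text_vecs[0], image_vecs)`, where `paths` is your own list of image files. Because text and images land in the same space, a plain cosine ranking is enough for a first retrieval baseline.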