由 Jina AI 训练的嵌入模型系列。
Jina CLIP:您的 CLIP 模型同时也是文本检索工具!
jina-clip-v1 是一款先进的英文多模态(文本-图像)嵌入模型。
传统的文本嵌入模型,例如 jina-embeddings-v2-base-en,在文本到文本检索方面表现出色,但无法完成跨模态任务。而像 openai/clip-vit-base-patch32 这类模型虽能有效对齐图像与文本嵌入,却因其训练方法和上下文限制,未能针对文本到文本检索进行优化。
jina-clip-v1 弥合了这一差距,在两个领域均展现出卓越性能。
其文本组件可媲美 jina-embeddings-v2-base-en 的检索效率,而整体架构更为跨模态检索树立了新的基准。
这种双重能力使其成为多模态检索增强生成(MuRAG)应用的理想工具,可在单一模型内实现无缝的文本到文本及文本到图像搜索。
!pip install transformers einops timm pillow
from transformers import AutoModel
# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v1', trust_remote_code=True)
# New meaningful sentences
sentences = ['A blue cat', 'A red cat']
# Public image URLs
image_urls = [
'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]
# Encode text and images
text_embeddings = model.encode_text(sentences)
image_embeddings = model.encode_image(image_urls) # also accepts PIL.image, local filenames, dataURI
# Compute similarities
print(text_embeddings[0] @ text_embeddings[1].T) # text embedding similarity
print(text_embeddings[0] @ image_embeddings[0].T) # text-image cross-modal similarity
print(text_embeddings[0] @ image_embeddings[1].T) # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[0].T) # text-image cross-modal similarity
print(text_embeddings[1] @ image_embeddings[1].T)# text-image cross-modal similarity或 sentence-transformers:
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
# Initialize the model
model = SentenceTransformer('jinaai/jina-clip-v1', trust_remote_code=True)
# New meaningful sentences
sentences = ['A blue cat', 'A red cat']
# Public image URLs
image_urls = [
'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
]
text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)npm install xenova/transformers.js#v3 从源代码安装 Transformers.js v3。import { AutoTokenizer, CLIPTextModelWithProjection, AutoProcessor, CLIPVisionModelWithProjection, RawImage, cos_sim } from '@xenova/transformers';
// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('jinaai/jina-clip-v1');
const text_model = await CLIPTextModelWithProjection.from_pretrained('jinaai/jina-clip-v1');
// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch32');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('jinaai/jina-clip-v1');
// Run tokenization
const texts = ['A blue cat', 'A red cat'];
const text_inputs = tokenizer(texts, { padding: true, truncation: true });
// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);
// Read images and run processor
const urls = [
'https://i.pinimg.com/600x315/21/48/7e/21487e8e0970dd366dafaed6ab25d8d8.jpg',
'https://i.pinimg.com/736x/c9/f2/3e/c9f23e212529f13f19bad5602d84b78b.jpg'
];
const image = await Promise.all(urls.map(url => RawImage.read(url)));
const image_inputs = await processor(image);
// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);
// Compute similarities
console.log(cos_sim(text_embeds[0].data, text_embeds[1].data)) // text embedding similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[0].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[0].data, image_embeds[1].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cross-modal similarity
console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity| 名称 | Flickr 图像检索 R@1 | Flickr 图像检索 R@5 | Flickr 文本检索 R@1 | Flickr 文本检索 R@5 |
|---|---|---|---|---|
| ViT-B-32 | 0.597 | 0.8398 | 0.781 | 0.938 |
| ViT-B-16 | 0.6216 | 0.8572 | 0.822 | 0.966 |
| jina-clip | 0.6748 | 0.8902 | 0.811 | 0.965 |
| 名称 | MSCOCO 图像检索 R@1 | MSCOCO 图像检索 R@5 | MSCOCO 文本检索 R@1 | MSCOCO 文本检索 R@5 |
|---|---|---|---|---|
| ViT-B-32 | 0.342 | 0.6001 | 0.5234 | 0.7634 |
| ViT-B-16 | 0.3309 | 0.5842 | 0.5242 | 0.767 |
| jina-clip | 0.4111 | 0.6644 | 0.5544 | 0.7904 |
| 名称 | STS12 | STS15 | STS17 | STS13 | STS14 | STS16 | STS22 | STSBenchmark | SummEval |
|---|---|---|---|---|---|---|---|---|---|
| jina-embeddings-v2 | 0.7427 | 0.8755 | 0.8888 | 0.833 | 0.7917 | 0.836 | 0.6346 | 0.8404 | 0.3056 |
| jina-clip | 0.7352 | 0.8746 | 0.8976 | 0.8323 | 0.7868 | 0.8377 | 0.6583 | 0.8493 | 0.3048 |
| 名称 | ArguAna | FiQA2018 | NFCorpus | Quora | SCIDOCS | SciFact | TRECCOVID |
|---|---|---|---|---|---|---|---|
| jina-embeddings-v2 | 0.4418 | 0.4158 | 0.3245 | 0.882 | 0.1986 | 0.6668 | 0.6591 |
| jina-clip | 0.4933 | 0.3827 | 0.3352 | 0.8789 | 0.2024 | 0.6734 | 0.7161 |
加入我们的 Discord 社区,与其他社区成员交流想法。
如果您在研究中发现 jina-clip-v1 有用,请引用以下论文:
@misc{2405.20204,
Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
Year = {2024},
Eprint = {arXiv:2405.20204},
}ValueError: The model class you are passing has a `config_class` attribute that is not consistent with the config class you passed (model has <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_clip.JinaCLIPConfig'> and you passed <class 'transformers_modules.jinaai.jina-clip-implementation.7f069e2d54d609ef1ad2eb578c7bf07b5a51de41.configuration_cli.JinaCLIPConfig'>. Fix one of those so they match!Transformers 库在 4.40.x 至 4.41.1 版本之间存在一个 bug。你可以将 transformers 更新至 >4.41.2 或 <=4.40.0 版本。
我们的实证研究表明,文本-文本余弦相似度通常大于文本-图像余弦相似度! 如果你想合并这两个分数,我们推荐两种方法:
combined_scores = sim(text, text) + lambda * sim(text, image) # optimal lambda depends on your dataset, but in general lambda=2 can be a good choice.# pseudo code
query_document_mean = np.mean(cos_sim_text_texts)
query_document_std = np.std(cos_sim_text_texts)
text_image_mean = np.mean(cos_sim_text_images)
text_image_std = np.std(cos_sim_text_images)
query_document_sim_normalized = (cos_sim_query_documents - query_document_mean) / query_document_std
text_image_sim_normalized = (cos_sim_text_images - text_image_mean) / text_image_std