1同济大学,2微软公司
*同等贡献
†通讯作者:yifanyang@microsoft.com
在本文中,我们提出了LLM2CLIP,这是一种新颖的方法,它借助大型语言模型(LLM)的能力来释放CLIP的潜力。通过在标题空间中使用对比学习对LLM进行微调,我们将其文本能力提取到输出嵌入中,显著提高了输出层的文本辨别能力。然后,我们设计了一个高效的训练过程,其中微调后的LLM充当CLIP视觉编码器的强大教师。由于LLM的存在,我们现在可以整合更长、更复杂的标题,而不受传统CLIP文本编码器的上下文窗口和能力限制。我们的实验表明,这种方法在跨模态任务中带来了显著的改进。我们的方法直接将之前的SOTA模型EVA02在长文本和短文本检索任务上的性能提升了16.5%,将仅在英文数据上训练的CLIP模型转变为最先进的跨语言模型。此外,当与Llama 1.5等模型集成到多模态训练中时,它在几乎所有基准测试中都持续优于CLIP,展示了全面的性能提升。
图像嵌入
from PIL import Image
from transformers import AutoModel
from transformers import CLIPImageProcessor
import torch
image_path = "CLIP.png"
model_name_or_path = "LLM2CLIP-Openai-L-14-336" # or /path/to/local/LLM2CLIP-Openai-L-14-336
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = AutoModel.from_pretrained(
model_name_or_path,
torch_dtype=torch.float16,
trust_remote_code=True).to('cuda').eval()
image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
with torch.no_grad(), torch.cuda.amp.autocast():
outputs = model.get_image_features(input_pixels)检索
from PIL import Image
from transformers import AutoModel, AutoConfig, AutoTokenizer
from transformers import CLIPImageProcessor
import torch
from llm2vec import LLM2Vec
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-336" # or /path/to/local/LLM2CLIP-Openai-L-14-336
model = AutoModel.from_pretrained(
model_name_or_path,
torch_dtype=torch.bfloat16,
trust_remote_code=True).to('cuda').eval()
llm_model_name = 'microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned'
config = AutoConfig.from_pretrained(
llm_model_name, trust_remote_code=True
)
llm_model = AutoModel.from_pretrained(llm_model_name, torch_dtype=torch.bfloat16, config=config, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model.config._name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct' # Workaround for LLM2VEC
l2v = LLM2Vec(llm_model, tokenizer, pooling_mode="mean", max_length=512, doc_max_length=512)
captions = ["a diagram", "a dog", "a cat"]
image_path = "CLIP.png"
image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')
with torch.no_grad(), torch.cuda.amp.autocast():
image_features = model.get_image_features(input_pixels)
text_features = model.get_text_features(text_features)
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
@misc{huang2024llm2clippowerfullanguagemodel,
title={LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation},
author={Weiquan Huang and Aoqi Wu and Yifan Yang and Xufang Luo and Yuqing Yang and Liang Hu and Qi Dai and Xiyang Dai and Dongdong Chen and Chong Luo and Lili Qiu},
year={2024},
eprint={2411.04997},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.04997},
}