LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned

LLM2CLIP: Extending the Capability Boundaries of CLIP through Large Language Models

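Weiquan Huang1*, Aoqi Wu1*, Yifan Yang2†, Xufang Luo2, Yuqing Yang2, Liang Hu1, Qi Dai2, Xiyang Dai2, Dongdong Chen2, Chong Luo2, Lili Qiu2

1Tongji University, 2Microsoft Corporation
*Equal contribution
†Corresponding author: yifanyang@microsoft.com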

[📂 GitHub] [🆕 Blog] [📜 LLM2CLIP]

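In this paper, we propose LLM2CLIP, a novel approach that uses the power of large language models (LLMs) to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the textual discriminability of the output layer. We then design an efficient training process in which the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM, we can now incorporate longer and more complex captions without being restricted by the context window and capability limits of the vanilla CLIP text encoder. Our experiments show that this approach brings substantial improvements in cross-modal tasks. Our method directly boosted the performance of the previous SOTA model, EVA02, by 16.5% on both long-text and short-text retrieval tasks, turning a CLIP model trained solely on English data into a state-of-the-art cross-lingual model. Moreover, when integrated into multimodal training with models such as Llava 1.5, it consistently outperformed CLIP across nearly all benchmarks, demonstrating comprehensive performance improvements.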
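To make the contrastive step concrete, here is a minimal sketch of a one-directional InfoNCE objective in plain Python. It is a generic illustration of contrastive learning over paired embeddings, not the authors' training code: the function name `info_nce_loss`, the toy 2-D vectors, and the temperature of 0.07 are all assumptions made for this example.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.07):
    """Mean cross-entropy of matching anchors[i] to positives[i] within a batch.

    Hypothetical helper for illustration; both inputs are lists of equal-length
    float vectors, and row i of each list forms a positive pair.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the true pair (index i).
        total += log_denominator - logits[i]
    return total / len(a)


# Toy batch: matched pairs give a loss near zero, swapped pairs a large loss.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Minimizing this quantity pulls each embedding toward its paired caption and pushes it away from the other captions in the batch, which is the general mechanism behind the improved text discriminability described above.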

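LLM2CLIP performance

[Figure: summary_tab (summary table of results)]
**Note that all results presented in the paper were evaluated using PyTorch weights. Performance may differ when using the Hugging Face (hf) models.**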

Model Details

  • Model type: vision foundation model, feature backbone
  • Pretraining datasets: CC3M, CC12M, YFCC15M, and Recap-DataComp-1B (30M subset)

Usage

Huggingface Version

Image Embeddings

from PIL import Image
from transformers import AutoModel
from transformers import CLIPImageProcessor
import torch

image_path = "CLIP.png"
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-336" # or /path/to/local/LLM2CLIP-Openai-L-14-336

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = AutoModel.from_pretrained(
    model_name_or_path, 
    torch_dtype=torch.float16,
    trust_remote_code=True).to('cuda').eval()

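# Preprocess the image into the pixel tensor expected by the vision encoder
image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')

# torch.autocast("cuda") replaces the deprecated torch.cuda.amp.autocast()
with torch.no_grad(), torch.autocast("cuda"):
    outputs = model.get_image_features(input_pixels)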
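The tensor returned by `get_image_features` is usually compared with other embeddings by cosine similarity. The sketch below shows that comparison in plain Python on stand-in vectors; `l2_normalize`, `cosine_similarity`, and the toy values are hypothetical names and data for illustration, not part of the model's API. With real output you would first convert a row to a list, for example with `outputs[0].float().tolist()`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two unit-length vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Stand-in embeddings; real image features have many more dimensions.
emb_a = [3.0, 4.0]
emb_b = [6.0, 8.0]   # same direction as emb_a
emb_c = [-4.0, 3.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

Because the vectors are normalized first, the score depends only on direction, so embeddings that differ only in magnitude are treated as identical.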

Retrieval

from PIL import Image
from transformers import AutoModel, AutoConfig, AutoTokenizer
from transformers import CLIPImageProcessor
import torch
from llm2vec import LLM2Vec
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-336" # or /path/to/local/LLM2CLIP-Openai-L-14-336
model = AutoModel.from_pretrained(
    model_name_or_path, 
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).to('cuda').eval()

llm_model_name = 'microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned'
config = AutoConfig.from_pretrained(
    llm_model_name, trust_remote_code=True
)
llm_model = AutoModel.from_pretrained(llm_model_name, torch_dtype=torch.bfloat16, config=config, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model.config._name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct' #  Workaround for LLM2VEC
l2v = LLM2Vec(llm_model, tokenizer, pooling_mode="mean", max_length=512, doc_max_length=512)

captions = ["a diagram", "a dog", "a cat"]
image_path = "CLIP.png"

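image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')
# Encode the captions with the fine-tuned LLM (mean-pooled embeddings)
text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')

# torch.autocast("cuda") replaces the deprecated torch.cuda.amp.autocast()
with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.get_image_features(input_pixels)
    # Project the LLM caption embeddings into the shared image-text space
    text_features = model.get_text_features(text_features)

    # L2-normalize so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities by 100 and turn them into per-caption probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)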
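The final line of the retrieval example turns cosine similarities into probabilities by multiplying them by 100 and applying a softmax. The plain-Python sketch below reproduces that arithmetic on made-up similarity values to show how strongly the scale factor sharpens the distribution; `label_probs` and the numbers in `sims` are assumptions for illustration only.

```python
import math


def label_probs(cosine_sims, scale=100.0):
    """Softmax over scaled similarities (hypothetical helper)."""
    logits = [scale * s for s in cosine_sims]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up cosine similarities between one image and three captions.
sims = [0.30, 0.25, 0.20]

sharp = label_probs(sims)            # scale of 100, as in the example above
flat = label_probs(sims, scale=1.0)  # unscaled, for comparison

print(sharp)
print(flat)
```

With a scale of 100, a similarity gap of only 0.05 is enough to give the top caption about 99% of the probability mass, whereas without scaling the three captions would look nearly equally likely. Small differences in `text_probs` should therefore be read with that amplification in mind.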

BibTeX & Citation

@misc{huang2024llm2clippowerfullanguagemodel,
      title={LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation}, 
      author={Weiquan Huang and Aoqi Wu and Yifan Yang and Xufang Luo and Yuqing Yang and Liang Hu and Qi Dai and Xiyang Dai and Dongdong Chen and Chong Luo and Lili Qiu},
      year={2024},
      eprint={2411.04997},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.04997}, 
}