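mobileclip_s2: a model for zero-shot image classification. The text model and the image model each extract features, and the similarity between those features determines the class. The project is based on Apple ML-MobileCLIP and provides ONNX weights that are compatible with the Transformers.js ecosystem. [This summary was generated by AI]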

https://github.com/apple/ml-mobileclip with ONNX weights to be compatible with Transformers.js.

Usage (Transformers.js)

If you haven't already, you can install the Transformers.js JavaScript library from NPM using:

npm i @huggingface/transformers

Example: Perform zero-shot image classification.

import {
  AutoTokenizer,
  CLIPTextModelWithProjection,
  AutoProcessor,
  CLIPVisionModelWithProjection,
  RawImage,
  dot,
  softmax,
} from '@huggingface/transformers';

const model_id = 'Xenova/mobileclip_s2';

// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await CLIPTextModelWithProjection.from_pretrained(model_id);

// Load processor and vision model
const processor = await AutoProcessor.from_pretrained(model_id);
const vision_model = await CLIPVisionModelWithProjection.from_pretrained(model_id);

// Run tokenization
const texts = ['cats', 'dogs', 'birds'];
const text_inputs = tokenizer(texts, { padding: 'max_length', truncation: true });

// Compute text embeddings
const { text_embeds } = await text_model(text_inputs);
const normalized_text_embeds = text_embeds.normalize().tolist();

// Read image and run processor
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
const image = await RawImage.read(url);
const image_inputs = await processor(image);

// Compute vision embeddings
const { image_embeds } = await vision_model(image_inputs);
const normalized_image_embeds = image_embeds.normalize().tolist();

// Compute probabilities
const probabilities = normalized_image_embeds.map(
  x => softmax(normalized_text_embeds.map(y => 100 * dot(x, y)))
);
console.log(probabilities); // [[ 0.9999973851268408, 0.000002399646544186113, 2.1522661499262862e-7 ]]
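
The final step above turns embeddings into probabilities: each embedding is scaled to unit length, the dot product of an image embedding with each text embedding gives a cosine similarity, the similarities are multiplied by 100, and a softmax converts them into a distribution over the labels. The sketch below reproduces that arithmetic in plain JavaScript so it can be run without downloading the model. The helpers `l2Normalize`, `dotProduct`, `softmaxOf`, and `classify` are hypothetical names written for this illustration, not part of Transformers.js, and the 3-dimensional embeddings are made-up toy values rather than real model output.

```javascript
// Illustrative stand-ins for the library's normalize/dot/softmax helpers.
// All function names and the toy embeddings are assumptions for this sketch.

// Scale a vector to unit length (L2 norm of 1).
function l2Normalize(v) {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map(x => x / norm);
}

// Dot product of two equal-length vectors.
function dotProduct(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Numerically stable softmax: subtract the max before exponentiating.
function softmaxOf(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - max));
  const total = exps.reduce((sum, x) => sum + x, 0);
  return exps.map(x => x / total);
}

// Score one image embedding against every text embedding and pick the best label.
// `scale` mirrors the factor of 100 used in the example above.
function classify(imageEmbed, textEmbeds, labels, scale = 100) {
  const img = l2Normalize(imageEmbed);
  const logits = textEmbeds.map(t => scale * dotProduct(img, l2Normalize(t)));
  const probs = softmaxOf(logits);
  const best = probs.indexOf(Math.max(...probs));
  return { label: labels[best], probs };
}

// Toy data: the image embedding points mostly along the first text embedding.
const labels = ['cats', 'dogs', 'birds'];
const textEmbeds = [[1, 0, 0], [0, 1, 0], [0, 0, 1]];
const imageEmbed = [0.9, 0.3, 0.1];

const result = classify(imageEmbed, textEmbeds, labels);
console.log(result.label); // cats
console.log(result.probs);
```

With real model output, `imageEmbed` would be one row of `image_embeds.tolist()` and `textEmbeds` the rows of `text_embeds.tolist()`. The large scale factor sharpens the softmax, which is why the probabilities in the example above sit so close to 0 and 1.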
