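Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data in both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process that combines supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.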
🏡 Phi-3 Portal
📰 Phi-3 Microsoft Blog
📖 Phi-3 Technical Report
👩‍🍳 Phi-3 Cookbook
🖥️ Try It
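The model is intended for broad commercial and research use in English. It supports general-purpose AI systems and applications with visual and text input capabilities, particularly those with the following requirements:
Our model is designed to accelerate research on language and multimodal models and to serve as a building block for generative AI powered features.
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate for accuracy, safety, and fairness before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to the laws and regulations applicable to their use case (including privacy and trade compliance laws).
Nothing contained in this model card should be interpreted as or deemed a restriction or modification to the license the model is released under.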
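In this release, the model enables multi-frame image understanding and reasoning, based on valuable customer feedback. Typical multi-frame applications include detailed image comparison, multi-image summarization/storytelling, and video summarization, all of which have broad applications in Office scenarios. We also observed performance improvements on most single-image benchmarks: for example, MMMU improved from 40.2 to 43.0, MMBench from 80.5 to 81.9, and the document understanding benchmark TextVQA from 70.9 to 72.0. We believe most use cases will benefit from this release, but we still encourage users to test the new model in their AI applications. We appreciate the enthusiastic adoption of the Phi-3 model family and continue to welcome all feedback from the community.
Below are the comparison results on existing multi-image benchmarks. Overall, our model outperforms competitor models of the same size and is competitive with much larger models on multi-frame capabilities and video summarization.
BLINK: a benchmark of 14 visual tasks that humans can solve quickly but that remain challenging for current multimodal large language models (MLLMs).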
| Benchmark | Phi-3.5-vision-instruct | LLaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
|---|---|---|---|---|---|---|---|---|---|
| Art Style | 87.2 | 62.4 | 55.6 | 52.1 | 64.1 | 70.1 | 59.8 | 70.9 | 73.3 |
| Counting | 54.2 | 56.7 | 54.2 | 66.7 | 51.7 | 55.0 | 59.2 | 65.0 | 65.0 |
| Forensic Detection | 92.4 | 31.1 | 40.9 | 34.1 | 54.5 | 38.6 | 67.4 | 60.6 | 75.8 |
| Functional Correspondence | 29.2 | 34.6 | 24.6 | 24.6 | 33.1 | 26.9 | 33.8 | 31.5 | 43.8 |
| IQ Test | 25.3 | 26.7 | 26.0 | 30.7 | 25.3 | 29.3 | 26.0 | 34.0 | 19.3 |
| Jigsaw | 68.0 | 86.0 | 55.3 | 52.7 | 71.3 | 72.7 | 57.3 | 68.0 | 67.3 |
| Multi-View Reasoning | 54.1 | 44.4 | 48.9 | 42.9 | 48.9 | 48.1 | 55.6 | 49.6 | 46.6 |
| Object Localization | 49.2 | 54.9 | 53.3 | 54.1 | 44.3 | 57.4 | 62.3 | 65.6 | 68.0 |
| Relative Depth | 69.4 | 77.4 | 63.7 | 67.7 | 57.3 | 58.1 | 71.8 | 76.6 | 71.0 |
| Relative Reflectance | 37.3 | 34.3 | 32.8 | 38.8 | 32.8 | 27.6 | 36.6 | 38.8 | 40.3 |
| Semantic Correspondence | 36.7 | 31.7 | 31.7 | 22.3 | 32.4 | 31.7 | 45.3 | 48.9 | 54.0 |
| Spatial Relation | 65.7 | 75.5 | 78.3 | 78.3 | 55.9 | 81.1 | 60.1 | 79.0 | 84.6 |
| Visual Correspondence | 53.5 | 40.7 | 34.9 | 33.1 | 29.7 | 52.9 | 72.1 | 81.4 | 86.0 |
| Visual Similarity | 83.0 | 91.9 | 48.1 | 45.2 | 47.4 | 77.8 | 84.4 | 81.5 | 88.1 |
| Overall | 57.0 | 53.1 | 45.9 | 45.4 | 45.8 | 51.9 | 56.5 | 61.0 | 63.2 |
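Video-MME: a comprehensive benchmark that assesses how well multimodal large language models (MLLMs) process video data, covering a wide range of visual domains, temporal durations, and data modalities.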
| Benchmark | Phi-3.5-vision-instruct | LLaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
|---|---|---|---|---|---|---|---|---|---|
| Short (<2 min) | 60.8 | 62.3 | 60.7 | 61.7 | 72.2 | 70.1 | 66.3 | 73.3 | 77.7 |
| Medium (4-15 min) | 47.7 | 47.1 | 46.4 | 49.6 | 62.7 | 59.6 | 54.7 | 61.2 | 68.0 |
| Long (30-60 min) | 43.8 | 41.2 | 42.6 | 46.6 | 52.1 | 53.9 | 46.6 | 53.2 | 59.6 |
| Overall | 50.8 | 50.2 | 49.9 | 52.6 | 62.3 | 61.2 | 55.9 | 62.6 | 68.4 |
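A quick arithmetic check on the Video-MME table: for every model, the Overall score appears to equal the unweighted mean of the short, medium, and long scores, rounded to one decimal place. The source does not state how Overall is computed, so treat this as an observation, not the benchmark's definition. The sketch below copies the table values into a hypothetical `VIDEO_MME` dictionary and checks the relationship.

```python
# Scores copied from the Video-MME table: (short, medium, long, overall).
# The dictionary name and helper are illustrative, not part of any benchmark tooling.
VIDEO_MME = {
    "Phi-3.5-vision-instruct": (60.8, 47.7, 43.8, 50.8),
    "LLaVA-Interleave-Qwen-7B": (62.3, 47.1, 41.2, 50.2),
    "InternVL-2-4B": (60.7, 46.4, 42.6, 49.9),
    "InternVL-2-8B": (61.7, 49.6, 46.6, 52.6),
    "Gemini-1.5-Flash": (72.2, 62.7, 52.1, 62.3),
    "GPT-4o-mini": (70.1, 59.6, 53.9, 61.2),
    "Claude-3.5-Sonnet": (66.3, 54.7, 46.6, 55.9),
    "Gemini-1.5-Pro": (73.3, 61.2, 53.2, 62.6),
    "GPT-4o": (77.7, 68.0, 59.6, 68.4),
}


def mean_matches_overall(short, medium, long, overall, tol=0.051):
    """Return True if the unweighted mean of the three duration buckets
    is within one-decimal rounding distance of the reported overall score."""
    return abs((short + medium + long) / 3 - overall) < tol


# Collect any model whose Overall score is not explained by the simple mean.
mismatches = [name for name, scores in VIDEO_MME.items()
              if not mean_matches_overall(*scores)]
print(mismatches)  # prints []
```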
The currently installed transformers version can be verified with `pip list | grep transformers`.
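The same check can be done from Python, which is handy inside a notebook or a setup script. This is a minimal sketch using only the standard library; the helper names `installed_version` and `version_tuple` are hypothetical, and the 4.43.0 threshold is taken from the `transformers==4.43.0` pin in this card's package list.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_tuple(text):
    """Parse the leading numeric fields of a version string into a tuple of ints.

    Parsing stops at the first field that does not start with a digit,
    so a suffix such as 'dev0' is ignored.
    """
    parts = []
    for field in text.split("."):
        digits = ""
        for ch in field:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


found = installed_version("transformers")
if found is None:
    print("transformers is not installed")
else:
    ok = version_tuple(found) >= (4, 43, 0)
    print(f"transformers {found} ({'meets' if ok else 'is older than'} 4.43.0)")
```

For strict comparison of pre-release versions, a dedicated version parser would be more robust than this simple tuple comparison.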
Example of required packages:
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.1.0
torch-npu==2.1.0
torchvision==0.16.0
transformers==4.43.0
accelerate==0.30.0

Phi-3.5-vision-Instruct is also available in Azure AI Studio.
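Given the nature of the training data, the Phi-3.5-vision model is best suited for prompts using the following chat format:

Single image:
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n

Multi-turn conversations:
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n

For multi-image usage, add multiple image placeholders at the front of the prompt. The <|image_{}|> index should start from 1. An example prompt:
<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n

After obtaining the Phi-3.5-vision-instruct model checkpoint, users can run inference with the full sample script in this section.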
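Before the full inference script, the chat formats above can be reproduced with a small helper that needs no model weights. This is a minimal sketch: `image_placeholders` and `build_prompt` are hypothetical names, not part of the model's API, and the inference script itself relies on `processor.tokenizer.apply_chat_template` instead. The helper only illustrates where the placeholders and the `<|end|>` markers go.

```python
def image_placeholders(num_images):
    """Return one image placeholder per line, with indices starting at 1."""
    return "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))


def build_prompt(turns, num_images=1):
    """Build a chat-format prompt string.

    turns: list of (user_text, assistant_text) pairs. Use None as the
    assistant_text of the final pair so the prompt ends with the
    assistant marker, ready for generation.
    """
    parts = []
    for index, (user_text, assistant_text) in enumerate(turns):
        # Image placeholders go at the front of the first user message only.
        prefix = image_placeholders(num_images) if index == 0 else ""
        parts.append(f"<|user|>\n{prefix}{user_text}<|end|>\n<|assistant|>\n")
        if assistant_text is not None:
            parts.append(f"{assistant_text}<|end|>\n")
    return "".join(parts)


# Single image, single turn.
single = build_prompt([("What is shown in this image?", None)])

# Multi-turn conversation about one image.
multi_turn = build_prompt([("Describe the chart.", "It is a bar chart."),
                           ("Which bar is tallest?", None)])

# Four images, single turn.
multi_image = build_prompt([("Compare these images.", None)], num_images=4)
print(multi_image)
```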
from PIL import Image
import requests
from openmind import AutoModelForCausalLM
from openmind import AutoProcessor

# If using a pre-downloaded model, please set model_id to the local model path.
model_id = "/path/to/Phi-3.5-vision-instruct"

# Note: set _attn_implementation=None if you don't have flash_attn installed
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="npu:0",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation=None
)

# For best performance, use num_crops=4 for multi-frame, num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
    num_crops=4
)

images = []
placeholder = ""

# Note: if you run out of memory, consider reducing the number of frames in this example.
# range(1, 20) loads slides 1 through 19.
for i in range(1, 20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg"
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"

messages = [
    {"role": "user", "content": placeholder + "Summarize the deck of slides."},
]

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("npu:0")

generation_args = {
    "max_new_tokens": 1000,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args
)

# Remove the input tokens so that only the newly generated tokens are decoded
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(response)

Notes:
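The comments in the inference script mention two settings: `num_crops` (4 for multi-frame input, 16 for single-frame input) and `_attn_implementation` (set to `None` when flash_attn is not installed). A minimal sketch of choosing both programmatically follows. The helper names are hypothetical, and the `"flash_attention_2"` value is an assumption based on the usual transformers naming; it does not appear in this card, so verify it against your installed versions.

```python
import importlib.util


def choose_num_crops(num_images):
    """Return 16 for a single frame and 4 for multiple frames,
    following the recommendation in the script's comment."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    return 16 if num_images == 1 else 4


def choose_attn_implementation(flash_attn_installed=None):
    """Return 'flash_attention_2' (assumed value) when flash_attn is importable, else None.

    Pass a boolean to override the automatic detection, e.g. in tests.
    """
    if flash_attn_installed is None:
        # find_spec returns None when the top-level package cannot be found.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if flash_attn_installed else None


# Settings for the 19-slide example, forcing the "not installed" branch.
print(choose_num_crops(19), choose_attn_implementation(False))  # prints: 4 None
```

The results would be passed as `num_crops=` to `AutoProcessor.from_pretrained` and as `_attn_implementation=` to `AutoModelForCausalLM.from_pretrained`, mirroring the script.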
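Like other models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade). Important areas for consideration include: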
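Architecture: Phi-3.5-vision has 4.2B parameters and contains an image encoder, a connector, a projector, and the Phi-3 Mini language model.
Inputs: text and images. The model is best suited for prompts using the chat format.
Context length: 128K tokens
GPUs: 256 A100-80G
Training time: 6 days
Training data: 500B tokens (vision tokens + text tokens)
Outputs: generated text in response to the input
Dates: trained between July and August 2024
Status: this is a static model trained on an offline text dataset with a cutoff date of March 15, 2024. Future versions of the tuned models may be released as the models improve.
Release date: August 2024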
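Our training data comes from a wide variety of sources, mainly falling into the following categories:
The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data. More details about the data can be found in the Phi-3 Technical Report.
We recommend that users refer to the Phi-3 Cookbook: fine-tuning recipe for vision models.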
import os
import pandas as pd
from datasets import load_dataset
import requests
from PIL import Image
from io import BytesIO


def download_image(image_url, save_path):
    """Download an image and save it to save_path; return True on success."""
    try:
        response = requests.get(image_url)
        response.raise_for_status()  # Check if the request was successful
        image = Image.open(BytesIO(response.content))
        image.save(save_path)
        return True
    except Exception as e:
        print(f"Failed to download {image_url}: {e}")
        return False


# Download the dataset from Hugging Face
dataset = load_dataset('Insert_Your_Dataset')

# Convert the Hugging Face dataset to a Pandas DataFrame
df = dataset['train'].to_pandas()

# Create directories to save the dataset and images
dataset_dir = './data/DataSetName'
images_dir = os.path.join(dataset_dir, 'images')
os.makedirs(images_dir, exist_ok=True)

# Filter out rows where image download fails
filtered_rows = []
for idx, row in df.iterrows():
    image_url = row['imageurl']
    image_name = f"{row['product_code']}.jpg"
    image_path = os.path.join(images_dir, image_name)
    if download_image(image_url, image_path):
        row['local_image_path'] = image_path
        filtered_rows.append(row)

# Create a new DataFrame with the filtered rows
filtered_df = pd.DataFrame(filtered_rows)

# Save the updated dataset to disk
dataset_path = os.path.join(dataset_dir, 'Dataset.csv')
filtered_df.to_csv(dataset_path, index=False)

print(f"Dataset and images saved to {dataset_dir}")

transformers==4.43.0
peft==0.11.1
datasets
accelerate==0.30.0
deepspeed==0.13.1
Levenshtein
PyAV==12.3.0

# Import necessary libraries
# Code originally from https://wandb.ai/byyoung3/mlnews3/reports/How-to-fine-tune-Phi-3-vision-on-a-custom-dataset--Vmlldzo4MTEzMTg3
# Credits to: Brett Young https://github.com/bdytx5/
import os
import torch
from torch.utils.data import Dataset, DataLoader, random_split
from openmind import AutoModelForCausalLM, AutoProcessor
from torchvision import transforms
from PIL import Image
import pandas as pd
import random
import numpy as np
from torchvision.transforms.functional import resize, to_pil_image
# import wandb
import torch.optim as optim
import torch.nn.functional as F

torch.manual_seed(3)


# Custom Dataset class for Burberry Product Prices and Images
class BurberryProductDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, image_size):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.tokenizer.padding_side = 'left'  # Set padding side to left
        self.max_length = max_length

    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        # Get the row at the given index
        row = self.dataframe.iloc[idx]

        # Create the text input for the model
        text = f"<|user|>\n<|image_1|>What is shown in this image?<|end|><|assistant|>\nProduct: {row['title']}, Category: {row['category3_code']}, Full Price: {row['full_price']}<|end|>"

        # Get the image path from the row
        image_path = row['local_image_path']

        # Tokenize the text input
        encodings = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length)

        try:
            # Load and transform the image
            image = Image.open(image_path).convert("RGB")
            image = self.image_transform_function(image)
        except (FileNotFoundError, IOError):
            # Skip the sample if the image is not found.
            # Note: the default DataLoader collate function cannot batch None,
            # so a custom collate_fn is needed for this skip to take effect.
            return None

        # Add the image and price information to the encodings dictionary
        encodings['pixel_values'] = image
        encodings['price'] = row['full_price']

        return {key: torch.tensor(val) for key, val in encodings.items()}

    def image_transform_function(self, image):
        # Convert the image to a numpy array
        image = np.array(image)
        return image
# Load dataset from disk
dataset_path = './data/burberry_dataset/burberry_dataset.csv'
df = pd.read_csv(dataset_path)

# Initialize processor and tokenizer for the pre-trained model
model_id = "microsoft/Phi-3-vision-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, attn_implementation=False)
tokenizer = processor.tokenizer

# Split dataset into training and validation sets
train_size = int(0.9 * len(df))
val_size = len(df) - train_size
train_indices, val_indices = random_split(range(len(df)), [train_size, val_size])
train_indices = train_indices.indices
val_indices = val_indices.indices
train_df = df.iloc[train_indices]
val_df = df.iloc[val_indices]

# Create dataset and dataloader for training set
train_dataset = BurberryProductDataset(train_df, tokenizer, max_length=512, image_size=128)
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

# Create dataset and dataloader for validation set
val_dataset = BurberryProductDataset(val_df, tokenizer, max_length=512, image_size=128)
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False)

# Initialize the pre-trained model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="npu", trust_remote_code=True, torch_dtype="auto")

# Set the device to NPU if available, otherwise use CPU
device = torch.device("npu" if torch.npu.is_available() else "cpu")
model.to(device)

# Initialize the optimizer
optimizer = optim.AdamW(model.parameters(), lr=5e-5)

# Training configuration
num_epochs = 1
eval_interval = 150  # Evaluate every 'eval_interval' steps
loss_scaling_factor = 1000.0  # Variable to scale the loss by a certain amount
save_dir = './saved_models'
step = 0
accumulation_steps = 64  # Accumulate gradients over this many steps

# Create a directory to save the best model
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

best_val_loss = float('inf')
best_model_path = None

# Select 10 random images from the validation set for logging
num_log_samples = 10
log_indices = random.sample(range(len(val_dataset)), num_log_samples)


# Function to extract the predicted price from model predictions
def extract_price_from_predictions(predictions, tokenizer):
    # Assuming the price is at the end of the text and separated by a space
    predicted_text = tokenizer.decode(predictions[0], skip_special_tokens=True)
    try:
        predicted_price = float(predicted_text.split()[-1].replace(',', ''))
    except ValueError:
        predicted_price = 0.0
    return predicted_price
# Function to evaluate the model on the validation set
def evaluate(model, val_loader, device, tokenizer, step, log_indices, max_samples=None):
    model.eval()
    total_loss = 0
    total_price_error = 0
    log_images = []
    log_gt_texts = []
    log_pred_texts = []
    # table = wandb.Table(columns=["Image", "Ground Truth Text", "Predicted Text"])

    with torch.no_grad():
        for i, batch in enumerate(val_loader):
            if max_samples and i >= max_samples:
                break

            if batch is None:  # Skip if the batch is None
                continue

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            pixel_values = batch['pixel_values'].to(device)
            labels = input_ids.clone().detach()
            actual_price = batch['price'].item()

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                pixel_values=pixel_values,
                labels=labels
            )
            loss = outputs.loss
            total_loss += loss.item()

            # Calculate price error
            predictions = torch.argmax(outputs.logits, dim=-1)
            predicted_price = extract_price_from_predictions(predictions, tokenizer)
            price_error = abs(predicted_price - actual_price)
            total_price_error += price_error

            # Log images, ground truth texts, and predicted texts
            if i in log_indices:
                log_images.append(pixel_values.cpu().squeeze().numpy())
                log_gt_texts.append(tokenizer.decode(labels[0], skip_special_tokens=True))
                log_pred_texts.append(tokenizer.decode(predictions[0], skip_special_tokens=True))

                # Convert image to PIL format
                pil_img = to_pil_image(resize(torch.from_numpy(log_images[-1]).permute(2, 0, 1), (336, 336))).convert("RGB")

                # Add data to the table
                # table.add_data(wandb.Image(pil_img), log_gt_texts[-1], log_pred_texts[-1])

    # Log the table incrementally
    # wandb.log({"Evaluation Results step {}".format(step): table, "Step": step})

    avg_loss = total_loss / (i + 1)  # i+1 to account for the loop index
    avg_price_error = total_price_error / (i + 1)
    model.train()

    return avg_loss, avg_price_error
# Set the model to training mode
model.train()

# Training loop for the specified number of epochs
for epoch in range(num_epochs):
    total_train_loss = 0
    total_train_price_error = 0
    batch_count = 0

    for batch in train_loader:
        step += 1

        if batch is None:  # Skip if the batch is None
            continue

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        pixel_values = batch['pixel_values'].to(device)
        labels = input_ids.clone().detach()
        actual_price = batch['price'].float().to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            labels=labels
        )
        loss = outputs.loss
        total_loss = loss
        predictions = torch.argmax(outputs.logits, dim=-1)
        predicted_price = extract_price_from_predictions(predictions, tokenizer)

        total_loss.backward()

        # Apply an optimizer step once every 'accumulation_steps' batches,
        # averaging the accumulated gradients first
        if (step % accumulation_steps) == 0:
            for param in model.parameters():
                if param.grad is not None:
                    param.grad /= accumulation_steps
            optimizer.step()
            optimizer.zero_grad()

        total_train_loss += total_loss.item()
        total_train_price_error += abs(predicted_price - actual_price.item())
        batch_count += 1

        # Log batch loss to Weights & Biases
        # wandb.log({"Batch Loss": total_loss.item(), "Step": step})
        print(f"Epoch: {epoch}, Step: {step}, Batch Loss: {total_loss.item()}")

        if step % eval_interval == 0:
            val_loss, val_price_error = evaluate(model, val_loader, device, tokenizer=tokenizer, log_indices=log_indices, step=step)
            # wandb.log({
            #     "Validation Loss": val_loss,
            #     "Validation Price Error (Average)": val_price_error,
            #     "Step": step
            # })
            print(f"Step: {step}, Validation Loss: {val_loss}, Validation Price Error (Normalized): {val_price_error}")

            # Save the best model based on validation loss
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_model_path = os.path.join(save_dir, "best_model")
                model.save_pretrained(best_model_path, safe_serialization=False)
                tokenizer.save_pretrained(best_model_path)

    avg_train_loss = total_train_loss / batch_count
    avg_train_price_error = total_train_price_error / batch_count
    # wandb.log({
    #     "Epoch": epoch,
    #     "Average Training Loss": avg_train_loss,
    #     "Average Training Price Error": avg_train_price_error
    # })
    print(f"Epoch: {epoch}, Average Training Loss: {avg_train_loss}, Average Training Price Error: {avg_train_price_error}")

# Log the best model to Weights & Biases.
# This needs an active wandb run object named `run`, which is not defined
# while the wandb calls in this script are commented out.
# if best_model_path:
#     run.log_model(
#         path=best_model_path,
#         name="phi3-v-burberry",
#         aliases=["best"],
#     )

# Finish the Weights & Biases run
# wandb.finish()

To understand the model's capabilities, we compared Phi-3.5-vision with a set of models on a variety of zero-shot benchmarks using our internal benchmark platform. Below is a high-level overview of model quality on representative benchmarks:
| Category | Benchmark | Phi-3.5-vision-instruct | Intern-VL-2-4B | Intern-VL-2-8B | Gemini-1.5-Flash | GPT-4o-mini 2024-7-18 | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o 2024-5-13 |
|---|---|---|---|---|---|---|---|---|---|
| Popular aggregated benchmark | MMMU (val) | 43.0 | 44.22 | 46.33 | 49.33 | 52.1 | 52.67 | 54.11 | 61.78 |
| | MMBench (dev-en) | 81.9 | 83.4 | 87.0 | 85.7 | 83.8 | 82.3 | 87.9 | 88.4 |
| Visual scientific knowledge reasoning | ScienceQA (img-test) | 91.3 | 94.9 | 95.9 | 84.5 | 84.0 | 73.8 | 86.0 | 88.5 |
| Visual math reasoning | MathVista (testmini) | 43.9 | 53.7 | 51.1 | 55.3 | 38.8 | 54.0 | 57.4 | 54.4 |
| | InterGPS (test) | 36.3 | 45.6 | 53.2 | 39.4 | 39.9 | 45.6 | 58.2 | 46.9 |
| Chart reasoning | AI2D (test) | 78.1 | 77.3 | 81.4 | 78.4 | 75.2 | 68.9 | 75.6 | 82.8 |
| | ChartQA (test) | 81.8 | 78.8 | 80.4 | 57.6 | 54.5 | 73.2 | 68.2 | 64.0 |
| Document intelligence | TextVQA (val) | 72.0 | 66.2 | 68.8 | 67.4 | 70.9 | 70.5 | 64.5 | 75.6 |
| Object visual presence verification | POPE (test) | 86.1 | 83.3 | 84.2 | 86.1 | 83.6 | 76.6 | 89.3 | 87.0 |
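Approach: The Phi-3 family of models uses a robust safety post-training approach that draws on a variety of open-source and in-house generated datasets. The overall safety alignment technique combines supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), using human-labeled and synthetic English-language datasets, including publicly available datasets focused on helpfulness and harmlessness as well as various question-and-answer data targeting multiple safety categories.
Safety evaluation: We used several evaluation techniques, including red teaming, adversarial conversation simulations, and safety evaluation benchmark datasets, to assess the Phi-3.5 models' propensity to produce undesirable outputs across multiple risk categories. Several approaches were used to compensate for the limitations of any single one. Please refer to the technical report for more details on our safety alignment.
Note that by default, the Phi-3.5-Mini-Instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following types of hardware:
The model is licensed under the MIT license.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.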