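Phi-3.5-vision-instruct: a lightweight open model for image understanding, OCR, chart and table parsing, multi-image comparison, and video summarization. It supports a 128K-token context length and multi-frame image reasoning, and has been rigorously tuned for instruction adherence and safety. [This summary was AI-generated.]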

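Model Summary

Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a 128K context length (in tokens). The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.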

🏡 Phi-3 Portal
📰 Phi-3 Microsoft Blog
📖 Phi-3 Technical Report
👩‍🍳 Phi-3 Cookbook
🖥️ Try It

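Intended Uses

Primary Use Cases

The model is intended for broad commercial and research use in English. It supports general-purpose AI systems and applications with visual and text input capabilities, particularly those that require:

Memory/compute constrained environments
Latency-bound scenarios
General image understanding
Optical character recognition
Chart and table understanding
Multiple image comparison
Multi-image or video clip summarization

The model is designed to accelerate research on language and multimodal models, and to serve as a building block for generative-AI-powered features.

Use Case Considerations

The model is not specifically designed or evaluated for all downstream purposes. Developers should consider the common limitations of language models when selecting use cases, and should evaluate and mitigate for accuracy, safety, and fairness before using the model in a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to the applicable laws and regulations (including privacy and trade compliance laws) that are relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.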

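Release Notes

In this release, the model adds multi-frame image understanding and reasoning, based on valuable customer feedback. Typical applications of the multi-frame capability include detailed image comparison, multi-image summarization/storytelling, and video summarization, all of which have broad applications in Office scenarios. We also observed performance improvements on most single-image benchmarks: for example, MMMU rose from 40.2 to 43.0, MMBench from 80.5 to 81.9, and the document understanding benchmark TextVQA from 70.9 to 72.0. We believe most use cases will benefit from this release, but we encourage users to test the new model in their AI applications. We appreciate the enthusiastic adoption of the Phi-3 model family and continue to welcome all feedback from the community.

Below are the comparison results on existing multi-image benchmarks. Overall, our model outperforms competitor models of the same size and is competitive with much larger models on multi-frame capabilities and video summarization.

BLINK: a benchmark with 14 visual tasks that humans can solve very quickly but that remain challenging for current multimodal large language models (MLLMs).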

Benchmark	Phi-3.5-vision-instruct	LLaVA-Interleave-Qwen-7B	InternVL-2-4B	InternVL-2-8B	Gemini-1.5-Flash	GPT-4o-mini	Claude-3.5-Sonnet	Gemini-1.5-Pro	GPT-4o
Art Style	87.2	62.4	55.6	52.1	64.1	70.1	59.8	70.9	73.3
Counting	54.2	56.7	54.2	66.7	51.7	55.0	59.2	65.0	65.0
Forensic Detection	92.4	31.1	40.9	34.1	54.5	38.6	67.4	60.6	75.8
Functional Correspondence	29.2	34.6	24.6	24.6	33.1	26.9	33.8	31.5	43.8
IQ Test	25.3	26.7	26.0	30.7	25.3	29.3	26.0	34.0	19.3
Jigsaw	68.0	86.0	55.3	52.7	71.3	72.7	57.3	68.0	67.3
Multi-View Reasoning	54.1	44.4	48.9	42.9	48.9	48.1	55.6	49.6	46.6
Object Localization	49.2	54.9	53.3	54.1	44.3	57.4	62.3	65.6	68.0
Relative Depth	69.4	77.4	63.7	67.7	57.3	58.1	71.8	76.6	71.0
Relative Reflectance	37.3	34.3	32.8	38.8	32.8	27.6	36.6	38.8	40.3
Semantic Correspondence	36.7	31.7	31.7	22.3	32.4	31.7	45.3	48.9	54.0
Spatial Relation	65.7	75.5	78.3	78.3	55.9	81.1	60.1	79.0	84.6
Visual Correspondence	53.5	40.7	34.9	33.1	29.7	52.9	72.1	81.4	86.0
Visual Similarity	83.0	91.9	48.1	45.2	47.4	77.8	84.4	81.5	88.1
Overall	57.0	53.1	45.9	45.4	45.8	51.9	56.5	61.0	63.2

Video-MME: a benchmark that comprehensively assesses the ability of multimodal large language models (MLLMs) to process video data, covering a wide range of visual domains, temporal durations, and data modalities.

Benchmark	Phi-3.5-vision-instruct	LLaVA-Interleave-Qwen-7B	InternVL-2-4B	InternVL-2-8B	Gemini-1.5-Flash	GPT-4o-mini	Claude-3.5-Sonnet	Gemini-1.5-Pro	GPT-4o
Short (<2 min)	60.8	62.3	60.7	61.7	72.2	70.1	66.3	73.3	77.7
Medium (4-15 min)	47.7	47.1	46.4	49.6	62.7	59.6	54.7	61.2	68.0
Long (30-60 min)	43.8	41.2	42.6	46.6	52.1	53.9	46.6	53.2	59.6
Overall	50.8	50.2	49.9	52.6	62.3	61.2	55.9	62.6	68.4

Usage

Requirements

The installed transformers version can be verified with: pip list | grep transformers.

Examples of required packages:

numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.1.0
torch-npu==2.1.0
torchvision==0.16.0
transformers==4.43.0
accelerate==0.30.0
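
As an alternative to reading `pip list` by eye, the pinned versions can be compared against the installed packages from Python. The following is a minimal sketch using only the standard library; the helper names `parse_pins` and `check_pins` are illustrative (not part of any package), and it only understands simple `name==version` lines. Builds that carry a local suffix (for example a `+cpu` tag) would be reported as mismatches by this strict comparison.

```python
from importlib import metadata

# Pinned versions copied from the package list in this section.
REQUIREMENTS = """
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.1.0
torch-npu==2.1.0
torchvision==0.16.0
transformers==4.43.0
accelerate==0.30.0
"""


def parse_pins(text):
    """Parse 'name==version' lines into a dict; other lines are ignored."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_pins(parse_pins(REQUIREMENTS)).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```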

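Phi-3.5-vision-instruct is also available in Azure AI Studio.

Input Formats

Given the nature of the training data, the Phi-3.5-vision model is best suited for prompts using the following chat format:

Single image: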

<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n

Multi-turn conversations:

<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
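
The multi-turn template above can be assembled programmatically. The sketch below is illustrative only: `build_multi_turn_prompt` is a hypothetical helper, not an API of the model or of any library, and it assumes a single image attached to the first user turn, as in the template.

```python
def build_multi_turn_prompt(history, next_prompt, image_tag="<|image_1|>"):
    """Assemble a multi-turn prompt string in the chat format shown above.

    history: list of (user_text, assistant_text) pairs for completed turns.
    next_prompt: the new user message the model should answer.
    The image placeholder is attached to the very first user turn only.
    """
    parts = []
    for turn_index, (user_text, assistant_text) in enumerate(history):
        prefix = f"{image_tag}\n" if turn_index == 0 else ""
        parts.append(f"<|user|>\n{prefix}{user_text}<|end|>\n")
        parts.append(f"<|assistant|>\n{assistant_text}<|end|>\n")
    # With no history, the new message is the first turn and carries the image.
    prefix = f"{image_tag}\n" if not history else ""
    parts.append(f"<|user|>\n{prefix}{next_prompt}<|end|>\n<|assistant|>\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = build_multi_turn_prompt(
        [("What is in the image?", "A bar chart of monthly sales.")],
        "Which month has the highest value?",
    )
    print(prompt)
```

With an empty `history`, the helper reduces to the single-image format.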

For multi-image usage, add multiple image placeholders at the front of the prompt. The <|image_{}|> index should start from 1. An example prompt:

<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n
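
To make the placeholder numbering concrete, here is a small sketch that generates the placeholders for an arbitrary number of images. The function names are hypothetical and exist only for illustration; in real use, the tokenizer's chat template can produce the surrounding `<|user|>` and `<|assistant|>` markers from a message list.

```python
def image_placeholders(num_images):
    """Return '<|image_1|>\\n<|image_2|>\\n...' with indices starting at 1."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    return "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))


def build_multi_image_prompt(num_images, prompt):
    """Place all image placeholders at the front of a single user turn."""
    return f"<|user|>\n{image_placeholders(num_images)}{prompt}<|end|>\n<|assistant|>\n"


if __name__ == "__main__":
    print(build_multi_image_prompt(4, "Compare these images."))
```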

Loading the Model Locally

After obtaining the Phi-3.5-vision-instruct model checkpoints, users can run inference with the following sample code.

from PIL import Image 
import requests 
from openmind import AutoModelForCausalLM 
from openmind import AutoProcessor 

model_id = "/path/to/Phi-3.5-vision-instruct" 
# If using a pre-downloaded model, please set model_id to the local model path

# Note: set _attn_implementation=None if you don't have flash_attn installed
model = AutoModelForCausalLM.from_pretrained(
  model_id, 
  device_map="npu:0", 
  trust_remote_code=True, 
  torch_dtype="auto", 
  _attn_implementation=None    
)

# for best performance, use num_crops=4 for multi-frame, num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(model_id, 
  trust_remote_code=True, 
  num_crops=4
) 

images = []
placeholder = ""

# Note: if you run out of memory (OOM), consider reducing the number of frames in this example.
for i in range(1,20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg" 
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"

messages = [
    {"role": "user", "content": placeholder+"Summarize the deck of slides."},
]

prompt = processor.tokenizer.apply_chat_template(
  messages, 
  tokenize=False, 
  add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("npu:0") 

generation_args = { 
    "max_new_tokens": 1000, 
    "temperature": 0.0, 
    "do_sample": False, 
} 

generate_ids = model.generate(**inputs, 
  eos_token_id=processor.tokenizer.eos_token_id, 
  **generation_args
)

# remove input tokens 
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, 
  skip_special_tokens=True, 
  clean_up_tokenization_spaces=False)[0] 

print(response)
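
The example's comment suggests using fewer frames when memory runs out. One possible way to do that, offered here as an assumption rather than a documented recommendation, is to keep evenly spaced frames so the retained images still span the whole deck or video. `evenly_spaced_indices` is a hypothetical helper name.

```python
def evenly_spaced_indices(total, keep):
    """Pick `keep` indices spread evenly across range(total).

    Returns all indices when keep >= total, and always includes the first
    and last frame when keep >= 2.
    """
    if total <= 0 or keep <= 0:
        return []
    if keep >= total:
        return list(range(total))
    if keep == 1:
        return [0]
    step = (total - 1) / (keep - 1)
    return [round(i * step) for i in range(keep)]


if __name__ == "__main__":
    frames = [f"frame_{n}" for n in range(19)]  # stand-ins for loaded images
    kept = [frames[i] for i in evenly_spaced_indices(len(frames), 4)]
    # Placeholders are renumbered from 1 for the frames that remain.
    placeholder = "".join(f"<|image_{n}|>\n" for n in range(1, len(kept) + 1))
    print(kept)
    print(placeholder)
```

After sub-sampling, the placeholder string must be rebuilt so that its indices run from 1 to the number of images actually passed to the processor.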

Notes:

For best performance, set num_crops=4 for multi-frame inputs and num_crops=16 for single-frame inputs.
To turn off flash attention, set _attn_implementation='eager' (or None, as in the example above).
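
The num_crops guidance can be captured in a tiny helper so that the choice follows the number of images automatically. This is only a sketch of that rule; `recommended_num_crops` is a hypothetical name, and the returned value would be passed as `num_crops` when creating the processor, as in the inference example.

```python
def recommended_num_crops(num_images):
    """Return 16 for a single image and 4 for multi-frame input."""
    if num_images < 1:
        raise ValueError("at least one image is required")
    return 16 if num_images == 1 else 4


if __name__ == "__main__":
    for count in (1, 2, 19):
        print(count, recommended_num_crops(count))
```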

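Responsible AI Considerations

Like other models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

Quality of Service: The Phi models are trained primarily on English text. Languages other than English will experience worse performance. English language varieties with less representation in the training data might experience worse performance than standard American English.
Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, or the prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make them inappropriate to deploy in sensitive contexts without additional mitigations specific to the use case.
Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
Limited Scope for Code: The majority of Phi-3 training data is based on Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that use other packages, or scripts in other languages, we strongly recommend that users manually verify all API uses.

Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include:

Allocation: Models may not be suitable for scenarios that could have a consequential impact on legal status or on the allocation of resources or life opportunities (e.g. housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable, or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (e.g. legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end users that they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case-specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
Identification of Individuals: Models with vision capabilities may have the potential to uniquely identify individuals in images. Safety post-training steers the model to refuse such requests, but developers should consider and implement, as appropriate, additional mitigations or user consent flows as required in their respective jurisdiction (e.g. measures to blur faces in image inputs before processing).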

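Training

Model

Architecture: Phi-3.5-vision has 4.2B parameters and contains an image encoder, connector, projector, and the Phi-3 Mini language model.
Inputs: Text and image. Best suited for prompts using the chat format.
Context length: 128K tokens
GPUs: 256 A100-80G
Training time: 6 days
Training data: 500B tokens (vision tokens + text tokens)
Outputs: Generated text in response to the input
Dates: Trained between July and August 2024
Status: This is a static model trained on an offline text dataset with a cutoff date of March 15, 2024. Future versions of the tuned models may be released as the models are improved.
Release date: August 2024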
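
For a rough sense of scale, the figures above can be turned into a compute budget with simple arithmetic. The sketch below assumes, purely for illustration, that all 256 GPUs were busy for the full 6 days; the model card does not state utilization, so the throughput figure is only an order-of-magnitude estimate.

```python
NUM_GPUS = 256            # from the model card
TRAINING_DAYS = 6         # from the model card
TRAINING_TOKENS = 500e9   # 500B vision + text tokens, from the model card

# Assumption: every GPU runs for the entire training period.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
tokens_per_gpu_second = TRAINING_TOKENS / (gpu_hours * 3600)

print(f"GPU-hours: {gpu_hours}")  # 36864
print(f"Approximate tokens per GPU per second: {tokens_per_gpu_second:.0f}")
```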

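Data Overview

Our training data includes a wide variety of sources and is a combination of:

Publicly available documents filtered rigorously for quality, selected high-quality educational data, and code;
Selected high-quality image-text interleaved data;
Newly created synthetic, "textbook-like" data for teaching math, coding, common-sense reasoning, and general knowledge of the world (science, daily activities, theory of mind, etc.), newly created image data (e.g. charts/tables/diagrams/slides), and newly created multi-image and video data (e.g. short video clips/pairs of two similar images);
High-quality chat-format supervised data covering various topics, to reflect human preferences on aspects such as instruction following, truthfulness, honesty, and helpfulness.

The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data. More details about the data can be found in the Phi-3 Technical Report.

How to Fine-tune?

We recommend that users refer to the Phi-3 Cookbook fine-tuning recipe for vision.

Preparing Sample Data for Fine-tuning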

import os
import pandas as pd
from datasets import load_dataset
import requests
from PIL import Image
from io import BytesIO


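def download_image(image_url, save_path):
    """Download one image and save it to save_path; return True on success."""
    try:
        # A timeout keeps one stalled server from hanging the whole run
        response = requests.get(image_url, timeout=30)
        response.raise_for_status()  # Check if the request was successful
        # JPEG cannot store alpha or palette modes, so normalize to RGB first
        image = Image.open(BytesIO(response.content)).convert("RGB")
        image.save(save_path)
        return True
    except Exception as e:
        print(f"Failed to download {image_url}: {e}")
        return False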


# Download the dataset from Hugging Face
dataset = load_dataset('Insert_Your_Dataset')


# Convert the Hugging Face dataset to a Pandas DataFrame
df = dataset['train'].to_pandas()


# Create directories to save the dataset and images
dataset_dir = './data/DataSetName'
images_dir = os.path.join(dataset_dir, 'images')
os.makedirs(images_dir, exist_ok=True)


# Filter out rows where image download fails
filtered_rows = []
for idx, row in df.iterrows():
    image_url = row['imageurl']
    image_name = f"{row['product_code']}.jpg"
    image_path = os.path.join(images_dir, image_name)
    if download_image(image_url, image_path):
        row['local_image_path'] = image_path
        filtered_rows.append(row)


# Create a new DataFrame with the filtered rows
filtered_df = pd.DataFrame(filtered_rows)


# Save the updated dataset to disk
dataset_path = os.path.join(dataset_dir, 'Dataset.csv')
filtered_df.to_csv(dataset_path, index=False)


print(f"Dataset and images saved to {dataset_dir}")
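
Before moving on to fine-tuning, it can help to confirm that the saved CSV contains the columns the fine-tuning script reads (`title`, `category3_code`, `full_price`, and `local_image_path`). The following standard-library sketch does that check; `missing_columns` is a hypothetical helper, and the column list is taken from the fine-tuning script in this document.

```python
import csv
import io

# Columns accessed by the fine-tuning script's dataset class.
REQUIRED_COLUMNS = ("title", "category3_code", "full_price", "local_image_path")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    present = set(reader.fieldnames or [])
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # In-memory stand-in for the saved dataset file.
    sample = io.StringIO("title,full_price,local_image_path\nCoat,1290.0,a.jpg\n")
    print(missing_columns(sample))
```

For a file on disk, the same function can be called with `open(dataset_path, newline="")`.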

Environment Setup

transformers==4.43.0
peft==0.11.1
datasets
accelerate==0.30.0
deepspeed==0.13.1
Levenshtein
PyAV==12.3.0

Fine-tuning Script

# Import necessary libraries
# Code originally from https://wandb.ai/byyoung3/mlnews3/reports/How-to-fine-tune-Phi-3-vision-on-a-custom-dataset--Vmlldzo4MTEzMTg3
# Credits to: Brett Young https://github.com/bdytx5/

import os
import torch
from torch.utils.data import Dataset, DataLoader, random_split
from openmind import AutoModelForCausalLM, AutoProcessor
from torchvision import transforms
from PIL import Image
import pandas as pd
import random
import numpy as np
from torchvision.transforms.functional import resize, to_pil_image
# import wandb
import torch.optim as optim
import torch.nn.functional as F

torch.manual_seed(3)


# Custom Dataset class for Burberry Product Prices and Images
class BurberryProductDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length, image_size):
        self.dataframe = dataframe
        self.tokenizer = tokenizer
        self.tokenizer.padding_side = 'left'  # Set padding side to left
        self.max_length = max_length
        
    def __len__(self):
        return len(self.dataframe)

    def __getitem__(self, idx):
        # Get the row at the given index
        row = self.dataframe.iloc[idx]
        
        # Create the text input for the model
        text = f"<|user|>\n<|image_1|>What is shown in this image?<|end|><|assistant|>\nProduct: {row['title']}, Category: {row['category3_code']}, Full Price: {row['full_price']}<|end|>"
        
        # Get the image path from the row
        image_path = row['local_image_path']
        
        # Tokenize the text input
        encodings = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length)
        
        try:
            # Load and transform the image
            image = Image.open(image_path).convert("RGB")
            image = self.image_transform_function(image)
        except (FileNotFoundError, IOError):
            # Skip the sample if the image is not found
            return None
        
        # Add the image and price information to the encodings dictionary
        encodings['pixel_values'] = image
        encodings['price'] = row['full_price']
        
        return {key: torch.tensor(val) for key, val in encodings.items()}

    def image_transform_function(self, image):
        # Convert the image to a numpy array
        image = np.array(image)
        return image

# Load dataset from disk
dataset_path = './data/burberry_dataset/burberry_dataset.csv'
df = pd.read_csv(dataset_path)

# Initialize processor and tokenizer for the pre-trained model
model_id = "microsoft/Phi-3-vision-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, attn_implementation=False)
tokenizer = processor.tokenizer

# Split dataset into training and validation sets
train_size = int(0.9 * len(df))
val_size = len(df) - train_size
train_indices, val_indices = random_split(range(len(df)), [train_size, val_size])
train_indices = train_indices.indices
val_indices = val_indices.indices
train_df = df.iloc[train_indices]
val_df = df.iloc[val_indices]

# Create dataset and dataloader for training set
train_dataset = BurberryProductDataset(train_df, tokenizer, max_length=512, image_size=128)
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

# Create dataset and dataloader for validation set
val_dataset = BurberryProductDataset(val_df, tokenizer, max_length=512, image_size=128)
val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False)

# Initialize the pre-trained model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="npu", trust_remote_code=True, torch_dtype="auto")

# Use the Ascend NPU if available, otherwise fall back to CPU
device = torch.device("npu" if torch.npu.is_available() else "cpu")
model.to(device)

# Initialize the optimizer
optimizer = optim.AdamW(model.parameters(), lr=5e-5)

# Training loop
num_epochs = 1
eval_interval = 150  # Evaluate every 'eval_interval' steps
loss_scaling_factor = 1000.0  # Variable to scale the loss by a certain amount
save_dir = './saved_models'
step = 0
accumulation_steps = 64  # Accumulate gradients over this many steps

# Create a directory to save the best model
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

best_val_loss = float('inf')
best_model_path = None

# Select 10 random images from the validation set for logging
num_log_samples = 10
log_indices = random.sample(range(len(val_dataset)), num_log_samples)

# Function to extract the predicted price from model predictions
def extract_price_from_predictions(predictions, tokenizer):
    # Assuming the price is at the end of the text and separated by a space
    predicted_text = tokenizer.decode(predictions[0], skip_special_tokens=True)
    try:
        predicted_price = float(predicted_text.split()[-1].replace(',', ''))
    except ValueError:
        predicted_price = 0.0
    return predicted_price

# Function to evaluate the model on the validation set
def evaluate(model, val_loader, device, tokenizer, step, log_indices, max_samples=None):
    model.eval()
    total_loss = 0
    total_price_error = 0
    log_images = []
    log_gt_texts = []
    log_pred_texts = []
    # table = wandb.Table(columns=["Image", "Ground Truth Text", "Predicted Text"])

    with torch.no_grad():
        for i, batch in enumerate(val_loader):
            if max_samples and i >= max_samples:
                break

            if batch is None:  # Skip if the batch is None
                continue

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            pixel_values = batch['pixel_values'].to(device)
            labels = input_ids.clone().detach()
            actual_price = batch['price'].item()

            outputs = model(
                input_ids=input_ids, 
                attention_mask=attention_mask, 
                pixel_values=pixel_values, 
                labels=labels
            )
            loss = outputs.loss
            total_loss += loss.item()

            # Calculate price error
            predictions = torch.argmax(outputs.logits, dim=-1)
            predicted_price = extract_price_from_predictions(predictions, tokenizer)
            price_error = abs(predicted_price - actual_price)
            total_price_error += price_error

            # Log images, ground truth texts, and predicted texts
            if i in log_indices:
                log_images.append(pixel_values.cpu().squeeze().numpy())
                log_gt_texts.append(tokenizer.decode(labels[0], skip_special_tokens=True))
                log_pred_texts.append(tokenizer.decode(predictions[0], skip_special_tokens=True))

                # Convert image to PIL format
                pil_img = to_pil_image(resize(torch.from_numpy(log_images[-1]).permute(2, 0, 1), (336, 336))).convert("RGB")
                
                # Add data to the table
                # table.add_data(wandb.Image(pil_img), log_gt_texts[-1], log_pred_texts[-1])

                # Log the table incrementally
    # wandb.log({"Evaluation Results step {}".format(step): table, "Step": step})

    avg_loss = total_loss / (i + 1)  # i+1 to account for the loop index
    avg_price_error = total_price_error / (i + 1)
    model.train()

    return avg_loss, avg_price_error

# Set the model to training mode
model.train()

# Training loop for the specified number of epochs
for epoch in range(num_epochs):
    total_train_loss = 0
    total_train_price_error = 0
    batch_count = 0

    for batch in train_loader:
        step += 1

        if batch is None:  # Skip if the batch is None
            continue

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        pixel_values = batch['pixel_values'].to(device)
        labels = input_ids.clone().detach()
        actual_price = batch['price'].float().to(device)

        outputs = model(
            input_ids=input_ids, 
            attention_mask=attention_mask, 
            pixel_values=pixel_values, 
            labels=labels
        )
        loss = outputs.loss
        total_loss = loss
        predictions = torch.argmax(outputs.logits, dim=-1)            
        predicted_price = extract_price_from_predictions(predictions, tokenizer)

        total_loss.backward()

        if (step % accumulation_steps) == 0:
            for param in model.parameters():
                if param.grad is not None:
                    param.grad /= accumulation_steps
            optimizer.step()
            optimizer.zero_grad()

        total_train_loss += total_loss.item()
        total_train_price_error += abs(predicted_price - actual_price.item())
        batch_count += 1

        # Log batch loss to Weights & Biases
        # wandb.log({"Batch Loss": total_loss.item(), "Step": step})

        print(f"Epoch: {epoch}, Step: {step}, Batch Loss: {total_loss.item()}")

        if step % eval_interval == 0:
            val_loss, val_price_error = evaluate(model, val_loader, device, tokenizer=tokenizer, log_indices=log_indices, step=step )
            # wandb.log({
            #     "Validation Loss": val_loss,
            #     "Validation Price Error (Average)": val_price_error,
            #     "Step": step
            # })
            print(f"Step: {step}, Validation Loss: {val_loss}, Validation Price Error (Average): {val_price_error}")

            # Save the best model based on validation loss
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_model_path = os.path.join(save_dir, f"best_model")
                model.save_pretrained(best_model_path, safe_serialization=False)
                tokenizer.save_pretrained(best_model_path)

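    # Compute epoch-level averages once all batches of the epoch are done.
    # (Inside the evaluation branch these names would be undefined whenever
    # an epoch has fewer than eval_interval steps.)
    avg_train_loss = total_train_loss / max(batch_count, 1)
    avg_train_price_error = total_train_price_error / max(batch_count, 1)
    # wandb.log({
    #     "Epoch": epoch,
    #     "Average Training Loss": avg_train_loss,
    #     "Average Training Price Error": avg_train_price_error
    # })

    print(f"Epoch: {epoch}, Average Training Loss: {avg_train_loss}, Average Training Price Error: {avg_train_price_error}")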

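    # Log the best model to Weights & Biases.
    # Left commented out like the other wandb calls: `run` is only defined
    # when a wandb run has been started.
    # if best_model_path:
    #     run.log_model(
    #         path=best_model_path,
    #         name="phi3-v-burberry",
    #         aliases=["best"],
    #     )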

# Finish the Weights & Biases run
# wandb.finish()
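
The script's `extract_price_from_predictions` takes the last whitespace-separated token of the decoded text, which yields 0.0 whenever anything follows the price. As a possible alternative (a swapped-in technique, not the original author's method), the price can be located with a regular expression keyed on the "Full Price:" label used in the training text. `extract_price` is a hypothetical name.

```python
import re

# Matches the number after "Full Price:", allowing thousands separators
# and an optional decimal part, e.g. "Full Price: 1,290.0".
_PRICE_PATTERN = re.compile(r"Full Price:\s*([0-9][0-9,]*(?:\.[0-9]+)?)")


def extract_price(text, default=0.0):
    """Return the first 'Full Price' value found in text, else default."""
    match = _PRICE_PATTERN.search(text)
    if match is None:
        return default
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    decoded = "Product: Trench Coat, Category: Coats, Full Price: 1,290.0"
    print(extract_price(decoded))
```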

Benchmarks

To understand the model's capabilities, we compared Phi-3.5-vision with a set of models over a variety of zero-shot benchmarks using our internal benchmark platform. Below is a high-level overview of model quality on representative benchmarks:

Category	Benchmark	Phi-3.5-vision-instruct	Intern-VL-2-4B	Intern-VL-2-8B	Gemini-1.5-Flash	GPT-4o-mini 2024-7-18	Claude-3.5-Sonnet	Gemini-1.5-Pro	GPT-4o 2024-5-13
Popular aggregated benchmark	MMMU (val)	43.0	44.22	46.33	49.33	52.1	52.67	54.11	61.78
	MMBench (dev-en)	81.9	83.4	87.0	85.7	83.8	82.3	87.9	88.4
Visual scientific knowledge reasoning	ScienceQA (img-test)	91.3	94.9	95.9	84.5	84.0	73.8	86.0	88.5
Visual math reasoning	MathVista (testmini)	43.9	53.7	51.1	55.3	38.8	54.0	57.4	54.4
	InterGPS (test)	36.3	45.6	53.2	39.4	39.9	45.6	58.2	46.9
Chart reasoning	AI2D (test)	78.1	77.3	81.4	78.4	75.2	68.9	75.6	82.8
	ChartQA (test)	81.8	78.8	80.4	57.6	54.5	73.2	68.2	64.0
Document intelligence	TextVQA (val)	72.0	66.2	68.8	67.4	70.9	70.5	64.5	75.6
Object visual presence verification	POPE (test)	86.1	83.3	84.2	86.1	83.6	76.6	89.3	87.0

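Safety Evaluation and Red-Teaming

Approach: The Phi-3 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique for safety alignment combines SFT (Supervised Fine-Tuning) and RLHF (Reinforcement Learning from Human Feedback), using human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted at multiple safety categories.

Safety Evaluation: We used several evaluation techniques, including red teaming, adversarial conversation simulations, and safety evaluation benchmark datasets, to evaluate the Phi-3.5 models' propensity to produce undesirable outputs across multiple risk categories. Multiple approaches were used to compensate for the limitations of any single one. Please refer to the technical report for more details on our safety alignment.

Software

Hardware

Note that by default, the Phi-3.5-vision-instruct model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following hardware types:

NVIDIA A100
NVIDIA A6000
NVIDIA H100
Ascend NPU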
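
Because flash attention depends on both the hardware and an installed `flash_attn` package, a script may want to pick the attention implementation at run time. The sketch below is one hedged way to do so: it only checks whether `flash_attn` is importable and does not verify that the current device actually supports it. `pick_attn_implementation` is a hypothetical helper; `"flash_attention_2"` and `"eager"` are the values used by the transformers library, and the result would be passed as `_attn_implementation` when loading the model.

```python
import importlib.util


def pick_attn_implementation():
    """Return 'flash_attention_2' if flash_attn is importable, else 'eager'.

    This checks package availability only, not hardware compatibility.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


if __name__ == "__main__":
    print(pick_attn_implementation())
```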

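License

The model is licensed under the MIT license.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.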