Ascend-SACT/Qwen3-VL-30B-A3B-Instruct-MindSpeed-MM

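Qwen3-VL-30B-A3B-Instruct MindSpeed-MM Fine-Tuning Guide

1. Model Overview and Scenario

1.1 Qwen3-VL Highlights

Qwen3-VL is the most powerful vision-language model in the Qwen series to date.

This generation is upgraded across the board: stronger text understanding and generation, deeper visual perception and reasoning, a longer context window, better understanding of spatial relationships and video dynamics, and stronger agent interaction capabilities.

It is available in dense and MoE architectures that scale from edge to cloud, and in Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.

Key enhancements:

  • Visual agent: operates PC and mobile GUIs by recognizing elements, understanding their functions, invoking tools, and completing tasks.
  • Visual coding boost: generates Draw.io/HTML/CSS/JS from images and videos.
  • Advanced spatial perception: judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and supports 3D grounding for spatial reasoning and embodied AI.
  • Long context and video understanding: native 256K context, expandable to 1M; handles books and hours-long videos with full recall and second-level indexing.
  • Enhanced multimodal reasoning: excels at STEM and math, with causal analysis and logical, evidence-based answers.
  • Upgraded visual recognition: broader, higher-quality pretraining lets the model "recognize everything", including celebrities, anime, products, landmarks, flora, and fauna.
  • Expanded OCR: supports 32 languages (up from 19); robust in low light, blur, and tilt; handles rare and ancient characters and jargon better; improved parsing of long-document structure.
  • Text understanding on par with pure LLMs: seamless text-vision fusion for lossless, unified comprehension.

This case uses Ascend A2 machines and the MindSpeed-MM framework to fine-tune Qwen3-VL-30B-A3B-Instruct.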

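2. Preparing the Runtime Environment

2.1 Hardware Environment

| Hardware | Configuration |
| --- | --- |
| Machine model | A2 |
| Test cluster | 2 nodes (16 cards) |
| Operating system | ARM |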

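2.2 Software Versions

| Software | Version | Deployed on |
| --- | --- | --- |
| Driver | Ascend HDK 25.2.0 | Host |
| Firmware | Ascend HDK 25.2.0 | Host |
| Docker image OS | Ubuntu 20.04.6 | Container |
| Python | 3.10.18 | Container |
| CANN | 8.3.RC1 | Container |
| Torch | 2.7.1 | Container |
| Torch_npu | 2.7.1 | Container |
| transformers | master (c0dbe09) | Container |
| MindSpeed | 0.12.1 | Container |
| Megatron-LM | core_v0.12.1 | Container |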

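2.3 Image Preparation

The environment can be built by adapting this image.

2.3.1 Remove Unrelated Environments

# Exit the current environment first (return to the base environment)
conda deactivate
# Remove the environment named mindspeed-mm-1030 (any other unrelated environment can be removed the same way)
conda remove --name mindspeed-mm-1030 --all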

2.3.2 Create a New conda Environment and Install torch and torch-npu

# Create the mindspeed-mm environment
source /root/miniconda3/etc/profile.d/conda.sh
export MindSpeed_MM_ENV_NAME=mindspeed-mm-1225
conda create -n ${MindSpeed_MM_ENV_NAME} python=3.10.18 -y
conda activate ${MindSpeed_MM_ENV_NAME}

# Configure the pip mirror
pip config set global.index-url http://mirrors.aliyun.com/pypi/simple
pip config set global.trusted-host mirrors.aliyun.com

# Install the basic Python dependencies
pip install --no-cache-dir attrs cython decorator \
    sympy cffi pyyaml pathlib2 psutil protobuf==3.20.0 scipy \
    requests absl-py

# Configure Git (global settings); disabling SSL verification is insecure, so use it only on a trusted network
git config --global http.sslverify false && \
git config --global https.sslverify false && \
git config --global http.postBuffer 2000000000


# Download PyTorch and torch_npu and install them into the conda environment
wget --no-check-certificate "https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp310-cp310-manylinux_2_28_aarch64.whl"
wget --no-check-certificate "https://gitcode.com/Ascend/pytorch/releases/download/v7.2.0-pytorch2.7.1/torch_npu-2.7.1-cp310-cp310-manylinux_2_28_aarch64.whl"
pip install --no-cache-dir torch-2.7.1*.whl
pip install --no-cache-dir torch_npu-2.7.1*.whl
pip install --no-cache-dir tensorboard tensorboard-data-server wheel
rm -f *.whl

git clone -b master https://gitcode.com/Ascend/apex.git
cd apex/
bash scripts/build.sh --python=3.10
pip uninstall -y apex
pip install --no-cache-dir apex/dist/apex*.whl
cd ..
rm -rf apex /root/.cache/pip

# Configure bashrc so that interactive shells activate the environment automatically;
# if the file already activates another environment, remove that entry first
echo "source /root/miniconda3/etc/profile.d/conda.sh" >> /root/.bashrc && \
echo "conda activate ${MindSpeed_MM_ENV_NAME}" >> /root/.bashrc

2.3.3 Install MindSpeed-MM and Related Code Repositories

git clone https://gitcode.com/Ascend/MindSpeed-MM.git
cd MindSpeed-MM

# On x86 machines, run:
bash scripts/install.sh --arch x86 --msid d76dbddd4517d48a2fc1cd494de8b9a6cfdbfbab && pip install -r examples/qwen3vl/requirements.txt

# On ARM machines, run:
bash scripts/install.sh --arch arm --msid d76dbddd4517d48a2fc1cd494de8b9a6cfdbfbab && pip install -r examples/qwen3vl/requirements.txt

After the script finishes, the following directories exist:

drwxr-xr-x 14 root root 4096 Dec 25 10:17 Megatron-LM/
drwxr-xr-x 10 root root 4096 Dec 25 10:12 MindSpeed/
drwxr-xr-x 20 root root 4096 Dec 25 10:20 MindSpeed-MM/
drwxr-xr-x  2 root root 4096 Dec 25 10:15 ckpt/
drwxr-xr-x  2 root root 4096 Dec 25 10:15 data/
drwxr-xr-x  2 root root 4096 Dec 25 10:15 logs/

Check that the dependencies are installed:

(mindspeed-mm-1225) root@dl-c8820:/src# pip list | grep -i mindspeed
mindspeed                 0.12.1      /src/MindSpeed-MM/MindSpeed
mindspeed-mm              0.1         /src/MindSpeed-MM
transformers              4.57.0.dev0 /src/MindSpeed-MM/src/transformers
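
The same check can be scripted. The following is a minimal sketch, not part of MindSpeed-MM, that compares installed distributions against the versions listed in section 2.2. The `EXPECTED` mapping and both helper names are illustrative assumptions, and a prefix match is used because local builds may carry suffixes such as `+cpu`.

```python
from importlib import metadata

# Versions copied from the software table in section 2.2 (illustrative subset).
EXPECTED = {
    "torch": "2.7.1",
    "torch_npu": "2.7.1",
    "mindspeed": "0.12.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected):
    """Map each package to (installed_version, matches_expected_prefix)."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        report[name] = (found, found is not None and found.startswith(wanted))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_versions(EXPECTED).items():
        print(f"{name:12s} installed={found} expected={EXPECTED[name]} ok={ok}")
```

A package reported as `installed=None` is missing from the active conda environment.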

Image reference: https://gitcode.com/Ascend/MindSpeed-RL/blob/2.2.0/rl-plugin/README.MD

Environment installation reference: https://gitcode.com/Ascend/MindSpeed-MM/blob/master/examples/qwen3vl/README.md

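3. Run Guide

3.1 Dataset Preparation

3.1.1 Dataset Overview

COCO2017 is one of the most important benchmark datasets in computer vision; COCO stands for "Common Objects in Context". It is a large-scale dataset released by the Microsoft COCO consortium for tasks such as object detection, image segmentation, and keypoint detection.

Dataset size

COCO2017 contains about 164,000 images and 1.8 million annotated instances, distributed as follows:

  • Training set (train2017): 118,287 images
  • Validation set (val2017): 5,000 images
  • Test set (test2017): 40,670 images (test-dev 20K, test-challenge 20K)
  • Unannotated images: about 41,739 (25% of the total)

Dataset characteristics

  1. Complex scenes: the images show complex everyday scenes with objects in their natural context
  2. High-quality annotation: precise pixel-level annotations
  3. Diversity: covers a wide range of object sizes, viewpoints, and scenes

Main tasks

1. Object detection: identify the objects in an image and mark their positions with bounding boxes

2. Instance segmentation: detect objects and also segment the pixel-level outline of each object instance

3. Keypoint detection / pose estimation: detect 17 human keypoints, including eyes, nose, shoulders, elbows, and knees

4. Panoptic segmentation: a unified segmentation task that handles both "things" (countable objects) and "stuff" (background classes)

5. Image captioning: generate a natural-language description of an image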

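3.1.2 Dataset Download and Conversion

COCO2017 dataset: https://cocodataset.org/#download

Training set: http://images.cocodataset.org/zips/train2017.zip. Download it and **extract** the archive.

Obtain the annotation file for the image dataset (LLaVA-Instruct-150K) and place it in the target directory.

Run the data conversion script examples/qwen2vl/llava_instruct_2_mllm_demo_format.py (qwen3vl and qwen2vl share the same script). In the script, mllm_format_json_path is the path of the converted output file. The command is: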

python llava_instruct_2_mllm_demo_format.py

3.2 Model Weight Preparation

3.2.1 Downloading the Model Weights

The weights can be downloaded with the Hugging Face CLI:

# Install the dependency
pip install -U huggingface_hub

# Set the mirror endpoint
export HF_ENDPOINT=https://hf-mirror.com

# Download the model
hf download Qwen/Qwen3-VL-30B-A3B-Instruct --local-dir ./weights_hf/Qwen3-VL-30B-A3B-Instruct

# If the download fails, use ModelScope instead
pip install modelscope
modelscope download --model Qwen/Qwen3-VL-30B-A3B-Instruct --local_dir ./weights_hf/Qwen3-VL-30B-A3B-Instruct
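
Interrupted downloads are easy to miss, so it can help to check the directory before converting the weights. The sketch below assumes the checkpoint uses the usual sharded Hugging Face layout, with a `config.json` and a `model.safetensors.index.json` whose `weight_map` lists the shard files. The function name is hypothetical.

```python
import json
from pathlib import Path


def missing_weight_files(model_dir):
    """Return the files that a sharded Hugging Face checkpoint directory lacks."""
    root = Path(model_dir)
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    index_path = root / "model.safetensors.index.json"
    if not index_path.is_file():
        # Assumption: the checkpoint is sharded and ships this index file.
        missing.append(index_path.name)
        return missing
    weight_map = json.loads(index_path.read_text(encoding="utf-8"))["weight_map"]
    for shard in sorted(set(weight_map.values())):
        if not (root / shard).is_file():
            missing.append(shard)
    return missing


if __name__ == "__main__":
    target = "./weights_hf/Qwen3-VL-30B-A3B-Instruct"
    print(missing_weight_files(target) or "all expected files are present")
```

An empty result means every shard named in the index is on disk; it does not verify file integrity.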

3.2.2 Weight Conversion

If the model is initialized with FSDP2 meta-device init, convert the weights first:

mm-convert Qwen3VLConverter hf_to_dcp \
  --hf_dir /root/work/filestorage/weights_hf/Qwen3-VL-30B-A3B-Instruct \
  --dcp_dir /root/work/filestorage/weights_dcp/Qwen3-VL-30B-A3B-Instruct-dcp

When the conversion finishes, the converted weights are written to dcp_dir:

Qwen3-VL-30B-A3B-Instruct-dcp/
drwxr-x--- 3 root root 4096 Nov 17 10:18 ./
drwxr-xr-x 3 root root 4096 Nov 17 10:15 ../
-rw-r----- 1 root root    7 Nov 17 10:18 latest_checkpointed_iteration.txt
drwxr-x--- 2 root root 4096 Nov 17 10:18 release/

When using the DCP weights, add the --init-model-with-meta-device argument to GPT_ARGS in examples/qwen3vl/finetune_qwen3vl.sh.
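
Before pointing LOAD_PATH at the converted weights, the output can be checked against the listing above. This is a minimal sketch: it only checks for the two entries that the listing shows (`latest_checkpointed_iteration.txt` and `release/`), and the function name is hypothetical.

```python
from pathlib import Path


def check_dcp_dir(dcp_dir):
    """Return a list of problems found in a converted DCP weight directory."""
    root = Path(dcp_dir)
    problems = []
    if not (root / "latest_checkpointed_iteration.txt").is_file():
        problems.append("missing latest_checkpointed_iteration.txt")
    release = root / "release"
    if not release.is_dir():
        problems.append("missing release/ directory")
    elif not any(release.iterdir()):
        problems.append("release/ directory is empty")
    return problems


if __name__ == "__main__":
    path = "/root/work/filestorage/weights_dcp/Qwen3-VL-30B-A3B-Instruct-dcp"
    print(check_dcp_dir(path) or "layout matches the expected listing")
```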

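3.3 Fine-Tuning

3.3.1 Modify the Fine-Tuning Script

Add the --init-model-with-meta-device argument to GPT_ARGS in the original script, and adjust NODE_RANK, MASTER_ADDR, the paths, and other settings to match your environment.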

#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /root/miniconda3/etc/profile.d/conda.sh
conda activate mindspeed-mm-1225

cd /src/MindSpeed-MM/

# This variable only bypasses Megatron's check on it; it has no effect on NPU
export CUDA_DEVICE_MAX_CONNECTIONS=2 # Must not be set to 1 when FSDP2 is enabled
export ASCEND_SLOG_PRINT_TO_STDOUT=0
export ASCEND_GLOBAL_LOG_LEVEL=3
export TASK_QUEUE_ENABLE=2
export COMBINED_ENABLE=1
export CPU_AFFINITY_CONF=1
export HCCL_CONNECT_TIMEOUT=1200
export NPU_ASD_ENABLE=0
export ASCEND_LAUNCH_BLOCKING=0
export ACLNN_CACHE_LIMIT=100000
export TOKENIZERS_PARALLELISM=false
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export MULTI_STREAM_MEMORY_REUSE=1

NPUS_PER_NODE=8
MASTER_ADDR=$(python /root/work/filestorage/scripts/get_master_ip.py)
MASTER_PORT=6000
NNODES=2
NODE_RANK=${RANK}
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))


MM_DATA="/root/work/filestorage/scripts/qwen3-vl/data_30B.json"
MM_MODEL="/root/work/filestorage/scripts/qwen3-vl/model_30B.json"
MM_TOOL="/root/work/filestorage/scripts/qwen3-vl/tools.json"
LOAD_PATH="/root/work/filestorage/weights_dcp/Qwen3-VL-30B-A3B-Instruct-dcp-1225/"
SAVE_PATH="/root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1"
FSDP2_PATH="/root/work/filestorage/scripts/qwen3-vl/fsdp2_config.yaml"
LOG_PATH="/root/work/filestorage/logs/qwen3-vl-30b/1226"

mkdir -p ${LOG_PATH}

TP=1
PP=1
CP=1
MBS=1
GRAD_ACC_STEP=1
SEQ_LEN=1024
DP=$(($WORLD_SIZE/$TP/$PP/$CP))
GBS=$(($MBS*$GRAD_ACC_STEP*$DP))


DISTRIBUTED_ARGS="
    --nproc_per_node $NPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

# Model-related parameters are configured in example/qwen2vl/model_xb.json; training-related parameters are configured here
GPT_ARGS="
    --use-mcore-models \
    --init-model-with-meta-device \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --context-parallel-size ${CP} \
    --context-parallel-algo ulysses_cp_algo \
    --micro-batch-size ${MBS} \
    --global-batch-size ${GBS} \
    --tokenizer-type NullTokenizer \
    --vocab-size 152064 \
    --seq-length ${SEQ_LEN} \
    --make-vocab-size-divisible-by 1 \
    --normalization RMSNorm \
    --use-fused-rmsnorm \
    --swiglu \
    --use-fused-swiglu \
    --no-masked-softmax-fusion \
    --lr 1.0e-5 \
    --lr-decay-style cosine \
    --weight-decay 0 \
    --train-iters 2000 \
    --lr-warmup-fraction 0.1 \
    --clip-grad 0.0 \
    --adam-beta1 0.9 \
    --adam-beta2 0.999 \
    --no-gradient-accumulation-fusion \
    --seed 42 \
    --load $LOAD_PATH \
    --use-flash-attn \
    --no-load-optim \
    --no-load-rng \
    --no-save-optim \
    --no-save-rng \
    --num-workers 8 \
    --use-torch-fsdp2 \
    --untie-embeddings-and-output-weights \
    --ckpt-format torch_dcp \
    --fsdp2-config-path $FSDP2_PATH \
    --optimizer-selection fused_torch_adamw \
    --use-cpu-initialization \
    --calculate-per-token-loss \
    --log-tps
"

MM_ARGS="
    --mm-data $MM_DATA \
    --mm-model $MM_MODEL \
    --mm-tool $MM_TOOL
"

OUTPUT_ARGS="
    --log-interval 1 \
    --save-interval 500 \
    --eval-interval 500 \
    --eval-iters 500 \
    --save $SAVE_PATH \
"
logfile=$(date +%Y%m%d)_$(date +%H%M%S)
torchrun $DISTRIBUTED_ARGS pretrain_transformers.py \
    $GPT_ARGS \
    $MM_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    2>&1 | tee ${LOG_PATH}/train_${logfile}-RANK${RANK}.log
# The log file name includes the node rank, so the post-processing below reads that same file
chmod 440 ${LOG_PATH}/train_${logfile}-RANK${RANK}.log
find $SAVE_PATH -type d -exec chmod 750 {} \;
find $SAVE_PATH -type f -exec chmod 640 {} \;
STEP_TIME=`grep "elapsed time per iteration" ${LOG_PATH}/train_${logfile}-RANK${RANK}.log | awk -F ':' '{print$5}' | awk -F '|' '{print$1}' | head -n 150 | tail -n 100 | awk '{sum+=$1} END {if (NR != 0) printf("%.1f",sum/NR)}'`
SAMPLES_PER_SECOND=`awk 'BEGIN{printf "%.3f\n", '${GBS}'*1000/'${STEP_TIME}'}'`
echo "Elapsed Time Per iteration: $STEP_TIME" >> ${LOG_PATH}/train_${logfile}-summary.log
echo "Average Samples per Second: $SAMPLES_PER_SECOND" >> ${LOG_PATH}/train_${logfile}-summary.log
LOG_TOKENS_PER_SECOND=`grep "tokens per sample" ${LOG_PATH}/train_${logfile}-RANK${RANK}.log`
if [ "$LOG_TOKENS_PER_SECOND" ]; then
    AVERAGE_TOKENS=`grep "tokens per sample" ${LOG_PATH}/train_${logfile}-RANK${RANK}.log | awk -F 'tokens per sample:' '{print$2}' | awk -F '|' '{print$1}' | head -n 150 | tail -n 100 | awk '{sum+=$1} END {if (NR != 0) printf("%.1f",sum/NR)}'`
    TOKENS_PER_SECOND=`awk 'BEGIN{printf "%.3f\n", '${SAMPLES_PER_SECOND}'*'${AVERAGE_TOKENS}'}'`
    echo "Consumed Tokens per Second: $TOKENS_PER_SECOND" >> ${LOG_PATH}/train_${logfile}-summary.log
fi
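
The script derives the data-parallel size and the global batch size from the parallel settings with shell arithmetic. The sketch below reproduces that arithmetic in Python as a sanity check; the function name is illustrative, and the divisibility check is an addition that the script itself does not perform.

```python
def parallel_layout(npus_per_node, nnodes, tp, pp, cp, mbs, grad_acc_step):
    """Mirror the script's WORLD_SIZE, DP and GBS shell arithmetic."""
    world_size = npus_per_node * nnodes
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("WORLD_SIZE must be divisible by TP * PP * CP")
    dp = world_size // model_parallel
    gbs = mbs * grad_acc_step * dp
    return {"world_size": world_size, "dp": dp, "gbs": gbs}


# Values from the script above: 2 nodes x 8 NPUs, TP=PP=CP=1, MBS=1, GRAD_ACC_STEP=1.
layout = parallel_layout(npus_per_node=8, nnodes=2, tp=1, pp=1, cp=1, mbs=1, grad_acc_step=1)
print(layout)  # {'world_size': 16, 'dp': 16, 'gbs': 16}
```

With the settings in this guide the global batch size is 16, which matches the `global batch size: 16` field in the training logs.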

3.3.2 Modify model_30B.json

In model_30B.json, set init_from_hf_path to the directory that holds the Hugging Face weights.

{
    "model_id": "qwen3_vl_moe",
    "init_from_hf_path": "/root/work/filestorage/weights_hf/Qwen3-VL-30B-A3B-Instruct/",
    "image_encoder": {
        "vision_encoder": {
            "model_id": "qwen3vit",
            "num_layers": 27,
            "hidden_size": 1152,
            "num_attention_heads": 16,
            "freeze": false,
            "attn_implementation": "flash_attention_2",
            "attn_layout": "TND",
            "synchronize_per_layer": true
        },
        "vision_projector": {
            "model_id": "lnmlp",
            "num_layers": 1,
            "freeze": false
        }
    },
    "text_decoder": {
        "model_id": "qwen3lm",
        "num_layers": 48,
        "hidden_size": 2048,
        "num_attention_heads": 32,
        "max_position_embeddings": 262144,
        "freeze": false,
        "use_npu_fused_moe": true,
        "attn_implementation": "flash_attention_2",
        "attn_layout": "TND",
        "is_causal": false,
        "activation_offload": false,
        "synchronize_per_layer": true
    },
    "loss_cfg": {
        "compute_mode": "default",
        "chunk_size": 1024,
        "router_aux_loss_coef": 0.0
    },
    "patch": {
        "clip_grad_async": true,
        "scale_grad": true
    }
}

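3.3.3 Modify data_30B.json

In data_30B.json, set the model weight path (model_name_or_path) and the dataset paths. dataset_dir is the directory where the images are stored, and dataset is the path of the dataset file. If the dataset contains images, they are looked up in dataset_dir.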

{
    "dataset_param": {
        "dataset_type": "huggingface",
        "preprocess_parameters": {
            "model_name_or_path": "/root/work/filestorage/weights_hf/Qwen3-VL-30B-A3B-Instruct",
            "use_fast_tokenizer": true,
            "split_special_tokens": false,
            "image_max_pixels": 262144,
            "image_min_pixels": 1024,
            "video_max_pixels": 16384,
            "video_min_pixels": 0,
            "video_fps": 2.0,
            "video_maxlen": 64
        },
        "basic_parameters": {
            "template": "qwen3_vl_nothink",
            "dataset_dir": "/root/work/filestorage/datasets/qwen3-vl/train2017",
            "dataset": "/root/work/filestorage/datasets/qwen3-vl/mllm_format_llava_instruct_data.json",
            "cache_dir": "",
            "enable_thinking": false,
            "overwrite_cache": false,
            "train_on_prompt": false,
            "mask_history": false,
            "preprocessing_batch_size": 1000,
            "preprocessing_num_workers": 16,
            "max_samples": null,
            "tool_format": null
        },
        "attr": {
            "system": null,
            "images": "images",
            "videos": null,
            "messages": "messages",
            "role_tag": "role",
            "content_tag": "content",
            "user_tag": "user",
            "assistant_tag": "assistant",
            "observation_tag": null,
            "function_tag": null,
            "system_tag": null
        }
    },
    "dataloader_param": {
        "dataloader_mode": "sampler",
        "drop_last": true,
        "sampler_type": "BaseRandomBatchSampler",
        "collate_param": {
            "model_name": "qwen3vl",
            "ignore_pad_token_for_loss": true
        },
        "pin_memory": true,
        "shuffle": true
    }
}

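The converted dataset looks like the following. images holds the image paths (a path that is not absolute is looked up under dataset_dir from data_30B.json). The field names and role values used in messages (role, content, user, assistant) must match the attr settings in data_30B.json.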

[
    {
        "images": [
            "/root/work/filestorage/datasets/qwen3-vl/train2017/000000033471.jpg"
        ],
        "messages": [
            {
                "role": "user",
                "content": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "role": "assistant",
                "content": "The bus in the image is white and red."
            }
        ]
    }
]
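
A quick pre-flight check over the converted file can catch mismatches before training starts. The sketch below is not part of MindSpeed-MM. It applies the lookup rule described above for relative image paths and, as an assumption, expects one `<image>` placeholder per listed image. `ATTR`, `resolve_image` and `check_record` are illustrative names.

```python
import os

# attr mapping copied from the data_30B.json example above.
ATTR = {
    "images": "images",
    "messages": "messages",
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
}


def resolve_image(path, dataset_dir):
    """Use absolute paths as-is; look up relative ones under dataset_dir."""
    return path if os.path.isabs(path) else os.path.join(dataset_dir, path)


def check_record(record, attr=ATTR):
    """Return a list of problems found in one dataset sample."""
    problems = []
    images = record.get(attr["images"]) or []
    if isinstance(images, str):  # some datasets store a single path as a string
        images = [images]
    messages = record.get(attr["messages"]) or []
    if not messages:
        problems.append("no messages")
    placeholders = 0
    for msg in messages:
        role = msg.get(attr["role_tag"])
        if role not in (attr["user_tag"], attr["assistant_tag"]):
            problems.append(f"unexpected role: {role!r}")
        placeholders += str(msg.get(attr["content_tag"], "")).count("<image>")
    # Assumption: every image is referenced by exactly one <image> placeholder.
    if placeholders != len(images):
        problems.append(f"{placeholders} <image> tags for {len(images)} images")
    return problems


sample = {
    "images": ["/root/work/filestorage/datasets/qwen3-vl/train2017/000000033471.jpg"],
    "messages": [
        {"role": "user", "content": "<image>\nWhat are the colors of the bus in the image?"},
        {"role": "assistant", "content": "The bus in the image is white and red."},
    ],
}
print(check_record(sample))  # []
```

Passing a different `attr` mapping lets the same check run against datasets that use other field names.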

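3.3.4 Start Fine-Tuning

Run the following on every node:

bash finetune_qwen3vl_30B.sh

To view the loss curve, drag the log file onto https://curryrice233.github.io/TrainingLogParser/.

Within the company network, https://traininglogparser.openx.huawei.com/ can be used instead.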

[2025-12-26 18:45:45] iteration       40/    2000 | consumed samples:          640 | elapsed time per iteration (ms): 3899.4 | learning rate: 1.950000E-06 | global batch size:    16 | tokens per sample: 1.737500E+01 | loss: 2.844364E+00 | loss scale: 1.0 | grad norm: 38.164 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-26 18:45:49] iteration       41/    2000 | consumed samples:          656 | elapsed time per iteration (ms): 3673.6 | learning rate: 2.000000E-06 | global batch size:    16 | tokens per sample: 1.393750E+01 | loss: 2.221318E+00 | loss scale: 1.0 | grad norm: 19.008 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
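
The same fields can also be extracted locally. The following sketch parses one log line in the format shown above and derives samples per second with the formula used by the fine-tuning script (GBS * 1000 / step time in ms). The regular expression and the function name are illustrative and assume the log format stays as shown.

```python
import re

LOG_LINE = re.compile(
    r"iteration\s+(?P<step>\d+)/\s*(?P<total>\d+)"
    r".*?elapsed time per iteration \(ms\):\s*(?P<ms>[0-9.]+)"
    r".*?global batch size:\s*(?P<gbs>\d+)"
    r".*?\| loss:\s*(?P<loss>[0-9.Ee+-]+)"
)


def parse_line(line):
    """Extract step, step time, batch size and loss from one training log line."""
    m = LOG_LINE.search(line)
    if m is None:
        return None
    ms = float(m["ms"])
    gbs = int(m["gbs"])
    return {
        "step": int(m["step"]),
        "total": int(m["total"]),
        "ms": ms,
        "gbs": gbs,
        "loss": float(m["loss"]),
        # Same formula as the script: GBS * 1000 / step time in ms.
        "samples_per_s": gbs * 1000 / ms,
    }


line = (
    "[2025-12-26 18:45:45] iteration       40/    2000 | consumed samples:          640 | "
    "elapsed time per iteration (ms): 3899.4 | learning rate: 1.950000E-06 | "
    "global batch size:    16 | tokens per sample: 1.737500E+01 | loss: 2.844364E+00 | "
    "loss scale: 1.0 |"
)
rec = parse_line(line)
print(rec["step"], rec["loss"], round(rec["samples_per_s"], 3))  # 40 2.844364 4.103
```

Lines that do not match the pattern return `None`, so the function can be mapped over a whole log file and the results filtered.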

[Figure: loss curve]

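3.3.5 Run Inference

After training, convert the weights saved under the training save directory (SAVE_PATH in the fine-tuning script) to Hugging Face format. The example below uses the Qwen3-VL-30B-A3B-Instruct checkpoint from this guide: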

mm-convert Qwen3VLConverter dcp_to_hf \
  --load_dir /root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1/iter_0000500/ \
  --save_dir /root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1/iter_0000500_hf/ \
  --model_assets_dir /root/work/filestorage/weights_hf/Qwen3-VL-30B-A3B-Instruct/

Here, iter_000xx is the checkpoint saved at step xx, --save_dir is the output path for the converted weights, and --model_assets_dir is the path of the original Hugging Face weights.
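
When several checkpoints have been saved, the directory to convert can be picked programmatically. This sketch assumes the seven-digit, zero-padded `iter_` naming seen in the paths above; both function names are hypothetical.

```python
import re
from pathlib import Path

ITER_DIR = re.compile(r"^iter_(\d{7})$")


def iter_dir_name(iteration):
    """Format an iteration number the way the saved directories are named."""
    return f"iter_{iteration:07d}"


def latest_iteration(save_path):
    """Return the highest iteration saved under save_path, or None."""
    root = Path(save_path)
    if not root.is_dir():
        return None
    steps = []
    for entry in root.iterdir():
        match = ITER_DIR.match(entry.name)
        # Converted outputs such as iter_0000500_hf do not match and are skipped.
        if entry.is_dir() and match:
            steps.append(int(match.group(1)))
    return max(steps, default=None)


print(iter_dir_name(500))  # iter_0000500
```

The result of `latest_iteration` can be passed to `iter_dir_name` to build the `--load_dir` argument.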

Once the weights are converted, inference can be run with the transformers library.

Inference script:

from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor

# default: Load the model on the available device(s)
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    "/root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1/iter_0000500_hf", dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("/root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1/iter_0000500_hf")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/root/work/filestorage/scripts/qwen3-vl/kite.jpg",
            },
            {"type": "text", "text": "描述这张照片"},  # prompt: "Describe this photo"
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)

inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Run the inference script; its output is shown below:

python gen.py

[Figure: inference output]

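4. FAQ

4.1 Size Mismatch During Fine-Tuning

[Figure: screenshot of the size-mismatch error]

This problem occurs when the weights were converted with a different version and no longer match. Re-running the conversion with the current version of mm-convert Qwen3VLConverter hf_to_dcp resolves it.

4.2 Inconsistent Data Format

The code supports the following dataset format: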

[
    {
        "images": [
            "/root/work/filestorage/datasets/qwen3-vl/train2017/000000033471.jpg"
        ],
        "messages": [
            {
                "role": "user",
                "content": "<image>\nWhat are the colors of the bus in the image?"
            },
            {
                "role": "assistant",
                "content": "The bus in the image is white and red."
            }
        ]
    }
]

Customer data may use a different format, for example:

[
    {
        "id": "GCC_train_001711524",
        "image": "GCC_train_001711524.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "Write a terse but informative summary of the picture.\n<image>"
            },
            {
                "from": "gpt",
                "value": "an old dinghy rests upon a muddy shore"
            }
        ]
    }
]

There are two ways to handle this:

  1. Use a conversion script to convert the customer dataset to the supported format (recommended)

  2. Modify the attr section of data_30B.json

    "attr": {
        "system": null,
        "images": "image",
        "videos": null,
        "messages": "conversations",
        "role_tag": "from",
        "content_tag": "value",
        "user_tag": "human",
        "assistant_tag": "gpt",
        "observation_tag": null,
        "function_tag": null,
        "system_tag": null
    }
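
For the first option, a conversion script can be quite small. The sketch below is an illustration written for the two formats shown in this section, not the project's llava_instruct_2_mllm_demo_format.py. `ROLE_MAP`, `convert_record` and `convert_file` are hypothetical names, and the `id` field is dropped because the supported format does not use it.

```python
import json

# Left: role names in the customer format shown above.
# Right: role names in the supported format.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def convert_record(record):
    """Convert one customer-format sample to the images/messages layout."""
    image = record.get("image")
    return {
        "images": [image] if image else [],
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in record.get("conversations", [])
        ],
    }


def convert_file(src_path, dst_path):
    """Convert a JSON list of samples and write the result; return the count."""
    with open(src_path, encoding="utf-8") as fh:
        records = json.load(fh)
    converted = [convert_record(r) for r in records]
    with open(dst_path, "w", encoding="utf-8") as fh:
        json.dump(converted, fh, ensure_ascii=False, indent=4)
    return len(converted)


sample = {
    "id": "GCC_train_001711524",
    "image": "GCC_train_001711524.jpg",
    "conversations": [
        {"from": "human", "value": "Write a terse but informative summary of the picture.\n<image>"},
        {"from": "gpt", "value": "an old dinghy rests upon a muddy shore"},
    ],
}
print(json.dumps(convert_record(sample), indent=4))
```

Relative image names such as `GCC_train_001711524.jpg` are kept as they are, so dataset_dir in data_30B.json must point to the directory that contains them.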

4.3 use_npu_fused_moe Performance Comparison

Enabling the fused MoE operator improves training performance. To enable it, set the use_npu_fused_moe field to true in model_30B.json.
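
For repeated comparison runs, the field can be toggled from a script instead of by hand. A minimal sketch, assuming the field sits under `text_decoder` as in the model_30B.json shown in section 3.3.2; the function name and the command-line handling are illustrative.

```python
import json
import sys


def set_fused_moe(model_cfg_path, enabled=True):
    """Set text_decoder.use_npu_fused_moe in a model JSON file; return the old value."""
    with open(model_cfg_path, encoding="utf-8") as fh:
        cfg = json.load(fh)
    previous = cfg["text_decoder"].get("use_npu_fused_moe")
    cfg["text_decoder"]["use_npu_fused_moe"] = enabled
    with open(model_cfg_path, "w", encoding="utf-8") as fh:
        json.dump(cfg, fh, indent=4)
        fh.write("\n")
    return previous


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python toggle_moe.py /path/to/model_30B.json [off]
    enable = not (len(sys.argv) > 2 and sys.argv[2] == "off")
    print("previous value:", set_fused_moe(sys.argv[1], enable))
```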

Disabled:

[2025-12-26 18:22:17] iteration       28/    2000 | consumed samples:          448 | elapsed time per iteration (ms): 75573.3 | learning rate: 1.350000E-06 | global batch size:    16 | tokens per sample: 1.612500E+01 | loss: 3.324713E+00 | loss scale: 1.0 | grad norm: 25.632 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-26 18:23:32] iteration       29/    2000 | consumed samples:          464 | elapsed time per iteration (ms): 75544.4 | learning rate: 1.400000E-06 | global batch size:    16 | tokens per sample: 1.362500E+01 | loss: 3.193691E+00 | loss scale: 1.0 | grad norm: 26.713 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |

Enabled:

[2025-12-26 18:44:54] iteration       28/    2000 | consumed samples:          448 | elapsed time per iteration (ms): 4279.7 | learning rate: 1.350000E-06 | global batch size:    16 | tokens per sample: 1.612500E+01 | loss: 3.296955E+00 | loss scale: 1.0 | grad norm: 41.643 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2025-12-26 18:44:59] iteration       29/    2000 | consumed samples:          464 | elapsed time per iteration (ms): 5274.1 | learning rate: 1.400000E-06 | global batch size:    16 | tokens per sample: 1.362500E+01 | loss: 3.168727E+00 | loss scale: 1.0 | grad norm: 25.991 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 |

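With the option enabled, each iteration is roughly 15x faster: about 75.5 s per iteration when disabled versus 4.3 to 5.3 s when enabled, in the logs above.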
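
The figure can be reproduced from the four per-iteration times in the logs above. This is a small worked calculation over two iterations only, so it indicates the order of magnitude and is not a benchmark.

```python
from statistics import mean

# Per-iteration times in ms for iterations 28 and 29, copied from the logs above.
BASELINE_MS = [75573.3, 75544.4]  # use_npu_fused_moe disabled
FUSED_MS = [4279.7, 5274.1]       # use_npu_fused_moe enabled

per_iteration = [round(b / f, 2) for b, f in zip(BASELINE_MS, FUSED_MS)]
overall = round(mean(BASELINE_MS) / mean(FUSED_MS), 2)
print(per_iteration, overall)  # [17.66, 14.32] 15.82
```

The per-iteration ratios range from about 14x to 18x, and the ratio of the mean step times is about 15.8x.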