[toc]
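Qwen3-VL is the most powerful vision-language model in the Qwen series to date.
This generation is upgraded across the board: stronger text understanding and generation, deeper visual perception and reasoning, a longer context length, better understanding of spatial relationships and video dynamics, and stronger agent interaction capabilities.
It is available in dense and MoE architectures that scale from edge to cloud, and in Instruct and reasoning-enhanced Thinking editions for flexible, on-demand deployment.
Key enhancements:
This walkthrough fine-tunes Qwen3-VL-30B-A3B-Instruct with the MindSpeed-MM framework on Ascend A2 machines.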
| Hardware | Configuration |
|---|---|
| Machine model | A2 |
| Test cluster | 2 nodes (16 NPUs) |
| Operating system | ARM |
| Software | Version | Deployed on |
|---|---|---|
| Driver | AscendHDK 25.2.0 | Host |
| Firmware | AscendHDK 25.2.0 | Host |
| Docker image OS | Ubuntu 20.04.6 | Container |
| Python | 3.10.18 | Container |
| CANN | 8.3.RC1 | Container |
| Torch | 2.7.1 | Container |
| Torch_npu | 2.7.1 | Container |
| transformers | master(c0dbe09) | Container |
| MindSpeed | 0.12.1 | Container |
| Megatron-LM | core_v0.12.1 | Container |
You can use this image as a base and adapt it.
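# 1. Exit the current environment first (return to the base environment)
conda deactivate
# Remove the environment named mindspeed-mm-1030 (any other unrelated environments can be removed as well)
conda remove --name mindspeed-mm-1030 --all
# Create the mindspeed-mm environment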
source /root/miniconda3/etc/profile.d/conda.sh
export MindSpeed_MM_ENV_NAME=mindspeed-mm-1225
conda create -n ${MindSpeed_MM_ENV_NAME} python=3.10.18 -y
conda activate ${MindSpeed_MM_ENV_NAME}
# Configure the pip mirror
pip config set global.index-url http://mirrors.aliyun.com/pypi/simple
pip config set global.trusted-host mirrors.aliyun.com
# Install the basic Python dependencies
pip install --no-cache-dir attrs cython decorator \
sympy cffi pyyaml pathlib2 psutil protobuf==3.20.0 scipy \
requests absl-py
# Configure Git (global settings)
git config --global http.sslverify false && \
git config --global https.sslverify false && \
git config --global http.postBuffer 2000000000
# Download PyTorch and torch_npu and install them into the conda environment
wget --no-check-certificate "https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp310-cp310-manylinux_2_28_aarch64.whl"
wget --no-check-certificate "https://gitcode.com/Ascend/pytorch/releases/download/v7.2.0-pytorch2.7.1/torch_npu-2.7.1-cp310-cp310-manylinux_2_28_aarch64.whl"
pip install --no-cache-dir torch-2.7.1*.whl
pip install --no-cache-dir torch_npu-2.7.1*.whl
pip install --no-cache-dir tensorboard tensorboard-data-server wheel
rm -f *.whl
git clone -b master https://gitcode.com/Ascend/apex.git
cd apex/
bash scripts/build.sh --python=3.10
pip uninstall -y apex
pip install --no-cache-dir apex/dist/apex*.whl
cd ..
rm -rf apex /root/.cache/pip
# Configure .bashrc so that interactive shells activate the environment automatically; if the file already activates another environment, remove that entry first
echo "source /root/miniconda3/etc/profile.d/conda.sh" >> /root/.bashrc && \
echo "conda activate ${MindSpeed_MM_ENV_NAME}" >> /root/.bashrc
git clone https://gitcode.com/Ascend/MindSpeed-MM.git
cd MindSpeed-MM
# On x86 machines, run:
bash scripts/install.sh --arch x86 --msid d76dbddd4517d48a2fc1cd494de8b9a6cfdbfbab && pip install -r examples/qwen3vl/requirements.txt
# On ARM machines, run:
bash scripts/install.sh --arch arm --msid d76dbddd4517d48a2fc1cd494de8b9a6cfdbfbab && pip install -r examples/qwen3vl/requirements.txt
After the installation finishes, the following directories are created:
drwxr-xr-x 14 root root 4096 Dec 25 10:17 Megatron-LM/
drwxr-xr-x 10 root root 4096 Dec 25 10:12 MindSpeed/
drwxr-xr-x 20 root root 4096 Dec 25 10:20 MindSpeed-MM/
drwxr-xr-x 2 root root 4096 Dec 25 10:15 ckpt/
drwxr-xr-x 2 root root 4096 Dec 25 10:15 data/
drwxr-xr-x 2 root root 4096 Dec 25 10:15 logs/
Check that the dependencies are installed:
(mindspeed-mm-1225) root@dl-c8820:/src# pip list | grep -i mindspeed
mindspeed 0.12.1 /src/MindSpeed-MM/MindSpeed
mindspeed-mm 0.1 /src/MindSpeed-MM
transformers 4.57.0.dev0 /src/MindSpeed-MM/src/transformers
Image reference: https://gitcode.com/Ascend/MindSpeed-RL/blob/2.2.0/rl-plugin/README.MD
Environment setup reference: https://gitcode.com/Ascend/MindSpeed-MM/blob/master/examples/qwen3vl/README.md
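The `pip list` check above can also be done from Python. The following is a minimal sketch, assuming the distribution names `torch`, `torch_npu`, and `mindspeed` and the versions from the software table; `EXPECTED`, `installed_version`, and `check_versions` are illustrative names, not part of MindSpeed-MM.

```python
from importlib import metadata

# Expected versions, copied from the software table in this document.
EXPECTED = {
    "torch": "2.7.1",
    "torch_npu": "2.7.1",
    "mindspeed": "0.12.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        # A local build tag such as "2.7.1+cpu" still counts as a match.
        if found is None or found.split("+")[0] != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

An empty result means every listed package matches. The `lookup` parameter exists only so the comparison can be exercised without the packages installed.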
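COCO2017 is one of the most important benchmark datasets in computer vision. The name stands for "Common Objects in Context". It is a large-scale dataset released by the Microsoft COCO consortium for tasks such as object detection, image segmentation, and keypoint detection.
Dataset size
COCO2017 contains about 164,000 images and 1.8 million annotated instances, distributed as follows:
Dataset characteristics
Main tasks
1. Object detection: identify the objects in an image and mark their positions with bounding boxes.
2. Instance segmentation: detect objects and also segment the pixel-level outline of each object instance.
3. Keypoint detection (pose estimation): detect 17 human keypoints, including the eyes, nose, shoulders, elbows, and knees.
4. Panoptic segmentation: a unified segmentation task that handles both "things" (countable objects) and "stuff" (background classes).
5. Image captioning: generate a natural-language description of an image.
COCO2017 dataset: https://cocodataset.org/#download
Training images: http://images.cocodataset.org/zips/train2017.zip. **Extract** the archive after downloading.
Obtain the description file for the image dataset (LLaVA-Instruct-150K) and place it in the target directory.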
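For orientation, LLaVA-style records (an `image` field plus `conversations` with `from`/`value`, like the customer-format example later in this document) need to become records with `images` and `messages` using `role`/`content`. The following is a standalone sketch of that mapping under those assumptions. It is not the repository's `llava_instruct_2_mllm_demo_format.py`, and `ROLE_MAP`, `llava_to_mllm`, and `convert_records` are hypothetical names.

```python
import os

# Assumed role mapping between the two formats shown in this document.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def llava_to_mllm(record, image_dir=""):
    """Convert one LLaVA-style record into the images/messages layout."""
    images = []
    image = record.get("image")
    if image:
        # Prefix the image directory when one is given; otherwise keep the
        # relative name so it can be resolved against dataset_dir later.
        images.append(os.path.join(image_dir, image) if image_dir else image)
    messages = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]
    return {"images": images, "messages": messages}


def convert_records(records, image_dir=""):
    """Convert a list of LLaVA-style records."""
    return [llava_to_mllm(record, image_dir) for record in records]
```

A record with an unknown `from` value raises `KeyError`, which surfaces unexpected roles early instead of silently dropping turns.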
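Run the data conversion script examples/qwen2vl/llava_instruct_2_mllm_demo_format.py (qwen3vl and qwen2vl share the same script). In the script, mllm_format_json_path is the path of the converted output file. The command is:
python examples/qwen2vl/llava_instruct_2_mllm_demo_format.py
You can download the model with the HuggingFace CLI: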
# Install dependencies
pip install -U huggingface_hub
# Set the environment variable
export HF_ENDPOINT=https://hf-mirror.com
# Download the model
hf download Qwen/Qwen3-VL-30B-A3B-Instruct --local-dir ./weights_hf/Qwen3-VL-30B-A3B-Instruct
# If the download fails, use ModelScope instead
pip install modelscope
modelscope download --model Qwen/Qwen3-VL-30B-A3B-Instruct --local_dir ./weights_hf/Qwen3-VL-30B-A3B-Instruct
If you initialize the model with FSDP2 meta init, first convert the weights as follows:
mm-convert Qwen3VLConverter hf_to_dcp \
--hf_dir /root/work/filestorage/weights_hf/Qwen3-VL-30B-A3B-Instruct \
--dcp_dir /root/work/filestorage/weights_dcp/Qwen3-VL-30B-A3B-Instruct-dcp
When the conversion finishes, the converted weights are generated in dcp_dir:
Qwen3-VL-30B-A3B-Instruct-dcp/
drwxr-x--- 3 root root 4096 Nov 17 10:18 ./
drwxr-xr-x 3 root root 4096 Nov 17 10:15 ../
-rw-r----- 1 root root 7 Nov 17 10:18 latest_checkpointed_iteration.txt
drwxr-x--- 2 root root 4096 Nov 17 10:18 release/
When using DCP weights, add the --init-model-with-meta-device argument to GPT_ARGS in examples/qwen3vl/finetune_qwen3vl.sh.
Also adjust NODE_RANK, MASTER_ADDR, the paths, and similar settings in the script to match your environment.
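The launch script below derives the data-parallel size and the global batch size from the cluster shape and the parallelism settings. The following sketch reproduces that arithmetic so the numbers can be checked before editing the script; `derive_batch_sizes` is an illustrative helper, not part of MindSpeed-MM.

```python
def derive_batch_sizes(npus_per_node, nnodes, tp, pp, cp, mbs, grad_acc_step):
    """Mirror the WORLD_SIZE, DP, and GBS arithmetic of the launch script."""
    world_size = npus_per_node * nnodes
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("WORLD_SIZE must be divisible by TP * PP * CP")
    dp = world_size // model_parallel
    gbs = mbs * grad_acc_step * dp
    return world_size, dp, gbs


# Values used in this walkthrough: 2 nodes x 8 NPUs, TP=PP=CP=1, MBS=1, GRAD_ACC_STEP=1.
print(derive_batch_sizes(8, 2, 1, 1, 1, 1, 1))  # (16, 16, 16)
```

The resulting global batch size of 16 matches the `global batch size: 16` field in the training logs shown later in this document.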
#!/bin/bash
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
source /root/miniconda3/etc/profile.d/conda.sh
conda activate mindspeed-mm-1225
cd /src/MindSpeed-MM/
# This variable only exists to satisfy a Megatron check; it has no effect on NPU
export CUDA_DEVICE_MAX_CONNECTIONS=2 # Must not be set to 1 when FSDP2 is enabled
export ASCEND_SLOG_PRINT_TO_STDOUT=0
export ASCEND_GLOBAL_LOG_LEVEL=3
export TASK_QUEUE_ENABLE=2
export COMBINED_ENABLE=1
export CPU_AFFINITY_CONF=1
export HCCL_CONNECT_TIMEOUT=1200
export NPU_ASD_ENABLE=0
export ASCEND_LAUNCH_BLOCKING=0
export ACLNN_CACHE_LIMIT=100000
export TOKENIZERS_PARALLELISM=false
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export MULTI_STREAM_MEMORY_REUSE=1
NPUS_PER_NODE=8
MASTER_ADDR=$(python /root/work/filestorage/scripts/get_master_ip.py)
MASTER_PORT=6000
NNODES=2
NODE_RANK=${RANK}
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
MM_DATA="/root/work/filestorage/scripts/qwen3-vl/data_30B.json"
MM_MODEL="/root/work/filestorage/scripts/qwen3-vl/model_30B.json"
MM_TOOL="/root/work/filestorage/scripts/qwen3-vl/tools.json"
LOAD_PATH="/root/work/filestorage/weights_dcp/Qwen3-VL-30B-A3B-Instruct-dcp-1225/"
SAVE_PATH="/root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1"
FSDP2_PATH="/root/work/filestorage/scripts/qwen3-vl/fsdp2_config.yaml"
LOG_PATH="/root/work/filestorage/logs/qwen3-vl-30b/1226"
mkdir -p ${LOG_PATH}
TP=1
PP=1
CP=1
MBS=1
GRAD_ACC_STEP=1
SEQ_LEN=1024
DP=$(($WORLD_SIZE/$TP/$PP/$CP))
GBS=$(($MBS*$GRAD_ACC_STEP*$DP))
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
# Model-related parameters are configured in the model JSON file (MM_MODEL, here model_30B.json); training-related parameters are configured here
GPT_ARGS="
--use-mcore-models \
--init-model-with-meta-device \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--context-parallel-size ${CP} \
--context-parallel-algo ulysses_cp_algo \
--micro-batch-size ${MBS} \
--global-batch-size ${GBS} \
--tokenizer-type NullTokenizer \
--vocab-size 152064 \
--seq-length ${SEQ_LEN} \
--make-vocab-size-divisible-by 1 \
--normalization RMSNorm \
--use-fused-rmsnorm \
--swiglu \
--use-fused-swiglu \
--no-masked-softmax-fusion \
--lr 1.0e-5 \
--lr-decay-style cosine \
--weight-decay 0 \
--train-iters 2000 \
--lr-warmup-fraction 0.1 \
--clip-grad 0.0 \
--adam-beta1 0.9 \
--adam-beta2 0.999 \
--no-gradient-accumulation-fusion \
--seed 42 \
--load $LOAD_PATH \
--use-flash-attn \
--no-load-optim \
--no-load-rng \
--no-save-optim \
--no-save-rng \
--num-workers 8 \
--use-torch-fsdp2 \
--untie-embeddings-and-output-weights \
--ckpt-format torch_dcp \
--fsdp2-config-path $FSDP2_PATH \
--optimizer-selection fused_torch_adamw \
--use-cpu-initialization \
--calculate-per-token-loss \
--log-tps
"
MM_ARGS="
--mm-data $MM_DATA \
--mm-model $MM_MODEL \
--mm-tool $MM_TOOL
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 500 \
--eval-interval 500 \
--eval-iters 500 \
--save $SAVE_PATH \
"
logfile=$(date +%Y%m%d)_$(date +%H%M%S)
# Per-node log file; the post-processing below reads the same path
LOG_FILE=${LOG_PATH}/train_${logfile}-RANK${RANK}.log
torchrun $DISTRIBUTED_ARGS pretrain_transformers.py \
$GPT_ARGS \
$MM_ARGS \
$OUTPUT_ARGS \
--distributed-backend nccl \
2>&1 | tee ${LOG_FILE}
chmod 440 ${LOG_FILE}
find $SAVE_PATH -type d -exec chmod 750 {} \;
find $SAVE_PATH -type f -exec chmod 640 {} \;
# Average the step time over up to 100 iterations taken from the first 150 logged
STEP_TIME=`grep "elapsed time per iteration" ${LOG_FILE} | awk -F ':' '{print$5}' | awk -F '|' '{print$1}' | head -n 150 | tail -n 100 | awk '{sum+=$1} END {if (NR != 0) printf("%.1f",sum/NR)}'`
SAMPLES_PER_SECOND=`awk 'BEGIN{printf "%.3f\n", '${GBS}'*1000/'${STEP_TIME}'}'`
echo "Elapsed Time Per iteration: $STEP_TIME" >> ${LOG_PATH}/train_${logfile}-summary.log
echo "Average Samples per Second: $SAMPLES_PER_SECOND" >> ${LOG_PATH}/train_${logfile}-summary.log
LOG_TOKENS_PER_SECOND=`grep "tokens per sample" ${LOG_FILE}`
if [ "$LOG_TOKENS_PER_SECOND" ]; then
AVERAGE_TOKENS=`grep "tokens per sample" ${LOG_FILE} | awk -F 'tokens per sample:' '{print$2}' | awk -F '|' '{print$1}' | head -n 150 | tail -n 100 | awk '{sum+=$1} END {if (NR != 0) printf("%.1f",sum/NR)}'`
TOKENS_PER_SECOND=`awk 'BEGIN{printf "%.3f\n", '${SAMPLES_PER_SECOND}'*'${AVERAGE_TOKENS}'}'`
echo "Consumed Tokens per Second: $TOKENS_PER_SECOND" >> ${LOG_PATH}/train_${logfile}-summary.log
fi
In model_30B.json, set the model weight path to the HuggingFace weights.
{
"model_id": "qwen3_vl_moe",
"init_from_hf_path": "/root/work/filestorage/weights_hf/Qwen3-VL-30B-A3B-Instruct/",
"image_encoder": {
"vision_encoder": {
"model_id": "qwen3vit",
"num_layers": 27,
"hidden_size": 1152,
"num_attention_heads": 16,
"freeze": false,
"attn_implementation": "flash_attention_2",
"attn_layout": "TND",
"synchronize_per_layer": true
},
"vision_projector": {
"model_id": "lnmlp",
"num_layers": 1,
"freeze": false
}
},
"text_decoder": {
"model_id": "qwen3lm",
"num_layers": 48,
"hidden_size": 2048,
"num_attention_heads": 32,
"max_position_embeddings": 262144,
"freeze": false,
"use_npu_fused_moe": true,
"attn_implementation": "flash_attention_2",
"attn_layout": "TND",
"is_causal": false,
"activation_offload": false,
"synchronize_per_layer": true
},
"loss_cfg": {
"compute_mode": "default",
"chunk_size": 1024,
"router_aux_loss_coef": 0.0
},
"patch": {
"clip_grad_async": true,
"scale_grad": true
}
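}
In data_30B.json, set the model weight path and the dataset paths: dataset_dir is the directory where the images are stored, and dataset is the path of the dataset file. If the dataset contains images, they are looked up under dataset_dir.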
{
"dataset_param": {
"dataset_type": "huggingface",
"preprocess_parameters": {
"model_name_or_path": "/root/work/filestorage/weights_hf/Qwen3-VL-30B-A3B-Instruct",
"use_fast_tokenizer": true,
"split_special_tokens": false,
"image_max_pixels": 262144,
"image_min_pixels": 1024,
"video_max_pixels": 16384,
"video_min_pixels": 0,
"video_fps": 2.0,
"video_maxlen": 64
},
"basic_parameters": {
"template": "qwen3_vl_nothink",
"dataset_dir": "/root/work/filestorage/datasets/qwen3-vl/train2017",
"dataset": "/root/work/filestorage/datasets/qwen3-vl/mllm_format_llava_instruct_data.json",
"cache_dir": "",
"enable_thinking": false,
"overwrite_cache": false,
"train_on_prompt": false,
"mask_history": false,
"preprocessing_batch_size": 1000,
"preprocessing_num_workers": 16,
"max_samples": null,
"tool_format": null
},
"attr": {
"system": null,
"images": "images",
"videos": null,
"messages": "messages",
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant",
"observation_tag": null,
"function_tag": null,
"system_tag": null
}
},
"dataloader_param": {
"dataloader_mode": "sampler",
"drop_last": true,
"sampler_type": "BaseRandomBatchSampler",
"collate_param": {
"model_name": "qwen3vl",
"ignore_pad_token_for_loss": true
},
"pin_memory": true,
"shuffle": true
}
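}
The converted dataset looks like the following. images holds the image paths (a path that is not absolute is resolved against dataset_dir in data_30B.json). The names and values of role and content inside messages must match the attr tags in data_30B.json.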
[
{
"images": [
"/root/work/filestorage/datasets/qwen3-vl/train2017/000000033471.jpg"
],
"messages": [
{
"role": "user",
"content": "<image>\nWhat are the colors of the bus in the image?"
},
{
"role": "assistant",
"content": "The bus in the image is white and red."
}
]
}
]
Run on every node:
bash finetune_qwen3vl_30B.sh
You can drag the log file onto https://curryrice233.github.io/TrainingLogParser/ to view the loss curve.
For internal company use: https://traininglogparser.openx.huawei.com/
[2025-12-26 18:45:45] iteration 40/ 2000 | consumed samples: 640 | elapsed time per iteration (ms): 3899.4 | learning rate: 1.950000E-06 | global batch size: 16 | tokens per sample: 1.737500E+01 | loss: 2.844364E+00 | loss scale: 1.0 | grad norm: 38.164 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-26 18:45:49] iteration 41/ 2000 | consumed samples: 656 | elapsed time per iteration (ms): 3673.6 | learning rate: 2.000000E-06 | global batch size: 16 | tokens per sample: 1.393750E+01 | loss: 2.221318E+00 | loss scale: 1.0 | grad norm: 19.008 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
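If the log-parser site is not reachable, the same fields can be pulled out of the log with a few lines of Python. This is a minimal sketch that assumes every iteration line follows the `name: value |` layout of the two lines above; `parse_log_line` and `summarize` are illustrative names, and the sample lines are abbreviated copies of those two lines.

```python
import re

# Matches "iteration 40/ 2000" but not "elapsed time per iteration (ms)".
ITER_RE = re.compile(r"iteration\s+(\d+)/\s*(\d+)")


def parse_log_line(line):
    """Parse one iteration log line into a dict; return None for other lines."""
    match = ITER_RE.search(line)
    if match is None:
        return None
    record = {
        "iteration": int(match.group(1)),
        "total_iterations": int(match.group(2)),
    }
    # Every segment after the first "|" has the form "name: value".
    for segment in line.split("|")[1:]:
        key, sep, value = segment.partition(":")
        if sep:
            record[key.strip()] = float(value)
    return record


def summarize(lines):
    """Compute mean step time, throughput, and the loss curve from log lines."""
    records = [r for r in (parse_log_line(l) for l in lines) if r is not None]
    if not records:
        raise ValueError("no iteration lines found")
    mean_step_ms = sum(r["elapsed time per iteration (ms)"] for r in records) / len(records)
    gbs = records[-1]["global batch size"]
    return {
        "mean_step_ms": mean_step_ms,
        "samples_per_second": gbs * 1000 / mean_step_ms,
        "losses": [(r["iteration"], r["loss"]) for r in records],
    }


# Abbreviated copies of the two log lines above (some fields omitted).
SAMPLE = [
    "[2025-12-26 18:45:45] iteration 40/ 2000 | consumed samples: 640 | elapsed time per iteration (ms): 3899.4 | global batch size: 16 | loss: 2.844364E+00 |",
    "[2025-12-26 18:45:49] iteration 41/ 2000 | consumed samples: 656 | elapsed time per iteration (ms): 3673.6 | global batch size: 16 | loss: 2.221318E+00 |",
]
print(summarize(SAMPLE))
```

The `losses` list holds `(iteration, loss)` pairs that can be passed to any plotting tool. The throughput uses the same formula as the launch script: global batch size times 1000, divided by the step time in milliseconds.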
After training completes, convert the weights saved under the training save directory to HuggingFace format, using Qwen3-VL-30B-A3B as the example:
mm-convert Qwen3VLConverter dcp_to_hf \
--load_dir /root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1/iter_0000500/ \
--save_dir /root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1/iter_0000500_hf/ \
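--model_assets_dir /root/work/filestorage/weights_hf/Qwen3-VL-30B-A3B-Instruct/
Here, iter_000xx is the checkpoint saved at step xx, --save_dir is the output path for the converted weights, and --model_assets_dir is the path of the original HuggingFace weights.
Once the conversion is done, you can run inference with the transformers library.
Inference script: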
from transformers import Qwen3VLMoeForConditionalGeneration, AutoProcessor

# Fine-tuned weights converted to HuggingFace format by the dcp_to_hf command above
model_path = "/root/work/filestorage/ckpt/qwen3-vl-30b/tp1pp1cp1/iter_0000500_hf"

# default: Load the model on the available device(s)
model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
    model_path, dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "/root/work/filestorage/scripts/qwen3-vl/kite.jpg",
            },
            # Prompt text: "Describe this photo"
            {"type": "text", "text": "描述这张照片"},
        ],
    }
]
# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Running the inference script produces the following output:
python gen.py

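This issue is caused by a mismatch between weights converted with different versions. Re-converting the weights with the new version of mm-convert Qwen3VLConverter hf_to_dcp resolves it.
The dataset format supported by the code is as follows: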
[
{
"images": [
"/root/work/filestorage/datasets/qwen3-vl/train2017/000000033471.jpg"
],
"messages": [
{
"role": "user",
"content": "<image>\nWhat are the colors of the bus in the image?"
},
{
"role": "assistant",
"content": "The bus in the image is white and red."
}
]
}
]
Customer data may use a different format, for example:
[
{
"id": "GCC_train_001711524",
"image": "GCC_train_001711524.jpg",
"conversations": [
{
"from": "human",
"value": "Write a terse but informative summary of the picture.\n<image>"
},
{
"from": "gpt",
"value": "an old dinghy rests upon a muddy shore"
}
]
}
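]
There are two ways to handle this:
1. Use a conversion script to convert the customer dataset into the supported format (recommended).
2. Modify the attr section of data_30B.json: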
"attr": {
"system": null,
"images": "image",
"videos": null,
"messages": "conversations",
"role_tag": "from",
"content_tag": "value",
"user_tag": "human",
"assistant_tag": "gpt",
"observation_tag": null,
"function_tag": null,
"system_tag": null
}
use_npu_fused_moe performance comparison
Enabling the fused MoE operator improves training performance. To enable it, set the use_npu_fused_moe field to true in model_30B.json.
Disabled:
[2025-12-26 18:22:17] iteration 28/ 2000 | consumed samples: 448 | elapsed time per iteration (ms): 75573.3 | learning rate: 1.350000E-06 | global batch size: 16 | tokens per sample: 1.612500E+01 | loss: 3.324713E+00 | loss scale: 1.0 | grad norm: 25.632 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-26 18:23:32] iteration 29/ 2000 | consumed samples: 464 | elapsed time per iteration (ms): 75544.4 | learning rate: 1.400000E-06 | global batch size: 16 | tokens per sample: 1.362500E+01 | loss: 3.193691E+00 | loss scale: 1.0 | grad norm: 26.713 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
Enabled:
[2025-12-26 18:44:54] iteration 28/ 2000 | consumed samples: 448 | elapsed time per iteration (ms): 4279.7 | learning rate: 1.350000E-06 | global batch size: 16 | tokens per sample: 1.612500E+01 | loss: 3.296955E+00 | loss scale: 1.0 | grad norm: 41.643 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-26 18:44:59] iteration 29/ 2000 | consumed samples: 464 | elapsed time per iteration (ms): 5274.1 | learning rate: 1.400000E-06 | global batch size: 16 | tokens per sample: 1.362500E+01 | loss: 3.168727E+00 | loss scale: 1.0 | grad norm: 25.991 | num zeros: 0.0 | number of skipped iterations: 0 | number of nan iterations: 0 |
With the option enabled, the time per iteration drops from about 75.5 s to roughly 4.3 to 5.3 s, a speedup of around 15x (14x to 18x in these two iterations).
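The speedup can be checked directly from the `elapsed time per iteration (ms)` values in the four log lines above. This is a small sketch using only those four samples, so it indicates the order of magnitude rather than a benchmark result.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Per-iteration times in milliseconds for iterations 28 and 29, from the logs above.
disabled_ms = [75573.3, 75544.4]
enabled_ms = [4279.7, 5274.1]

# Speedup for each iteration, and for the mean over both iterations.
per_iteration = [d / e for d, e in zip(disabled_ms, enabled_ms)]
overall = mean(disabled_ms) / mean(enabled_ms)

print([f"{s:.1f}x" for s in per_iteration])
print(f"{overall:.1f}x")
```

Averaging over a longer window of iterations, as the launch script does for its step-time summary, would give a more stable estimate.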