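With the rapid development of large-model technology, general-purpose multimodal models such as the Qwen series have shown strong capabilities in dialogue, visual understanding, and complex reasoning. Among them, Qwen3.5-35B-A3B, a high-performance Mixture-of-Experts (MoE) model, strikes a good balance between model quality and inference efficiency and is becoming an important choice for enterprise applications and research.
To respond to urgent customer demand, this guide uses the MindSpeed-MM training framework to walk through fine-tuning Qwen3.5-35B-A3B on Ascend. Together with the vLLM inference engine, it covers the complete path from model weight processing and data preparation through training optimization to high-performance inference deployment.
The currently adapted scenario is a single A2 node with 8 cards, supporting fine-tuning and inference. The environment used in this guide is as follows: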
| Software / Component | Version |
|---|---|
| CANN | 8.2.RC1 |
| python | 3.10.20 |
| torch | 2.7.1 |
| torch_npu | 2.7.1 |
| Driver and firmware | 25.5.0 |
| transformers | 5.2.0.dev0 |
| MindSpeed | 0.12.1 |
| MindSpeed-MM | 0.1 |
| vllm-ascend | 0.17.0rc1 |
| vllm | 0.17.0+empty |
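
The Python package versions in the table can be compared against what is actually installed. The following is a minimal sketch, not part of MindSpeed-MM or vllm-ascend: the helper names `installed_version` and `find_mismatches` are hypothetical, the distribution names are assumed to match the names in the table, and CANN and the driver/firmware cannot be checked this way. Because the guide uses two Python environments (a conda environment for training and the container's own Python for inference), run it in whichever environment you want to inspect.

```python
from importlib import metadata

# Expected versions copied from the environment table.
# Assumption: the pip distribution names match the names used in the table.
EXPECTED = {
    "torch": "2.7.1",
    "torch_npu": "2.7.1",
    "transformers": "5.2.0.dev0",
    "vllm-ascend": "0.17.0rc1",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package whose version does not match to (expected, found).

    A prefix comparison is used so that local build suffixes such as
    "2.7.1+cpu" still count as a match for "2.7.1".
    """
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found is None or not found.startswith(wanted):
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found}")
```

An empty result means every listed package is present with a matching version prefix in the current environment.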
Qwen3.5 is first supported in vllm-ascend v0.17.0rc1, so obtain that image with the following commands:
docker pull quay.nju.edu.cn/ascend/vllm-ascend:v0.17.0rc1
export IMAGE=quay.nju.edu.cn/ascend/vllm-ascend:v0.17.0rc1
docker run \
--name vllm-ascend-qwen \
--shm-size=500g \
--net=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--privileged=true \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /tmp:/tmp \
-v /opt:/opt \
-it $IMAGE bash
After the container is created, make sure you are inside it. To enter the running container, use:
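docker exec -itu root vllm-ascend-qwen bash
First check whether conda is available in the current environment with conda --version. If it is not, install it. On the ARM architecture, download the installer with wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh -O miniconda.sh, make it executable with chmod +x miniconda.sh, and then run ./miniconda.sh to install.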
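
The check-then-install logic above can be sketched in Python. This is an illustration only: the helper names `conda_available` and `installer_url` are hypothetical, the aarch64 URL is the one given in this guide, and the x86_64 entry is an addition for hosts that are not ARM.

```python
import platform
import shutil

BASE_URL = "https://repo.anaconda.com/miniconda/"

# Installer file per CPU architecture as reported by platform.machine().
# The aarch64 entry comes from this guide; x86_64 is added for comparison.
INSTALLERS = {
    "aarch64": "Miniconda3-latest-Linux-aarch64.sh",
    "x86_64": "Miniconda3-latest-Linux-x86_64.sh",
}


def conda_available():
    """Return True if a conda executable is found on PATH."""
    return shutil.which("conda") is not None


def installer_url(machine=None):
    """Return the Miniconda installer URL for the given (or current) architecture."""
    machine = (machine or platform.machine()).lower()
    if machine not in INSTALLERS:
        raise ValueError(f"no installer configured for architecture {machine!r}")
    return BASE_URL + INSTALLERS[machine]


if __name__ == "__main__":
    if conda_available():
        print("conda is already installed")
    else:
        print("download the installer from:", installer_url())
```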
Load the conda shell setup, then create and activate the environment:
source /root/miniconda3/etc/profile.d/conda.sh
conda create -n mm_env python=3.10
conda activate mm_env
cd /home
git clone https://gitcode.com/Ascend/MindSpeed-MM.git
cd MindSpeed-MM
bash scripts/install.sh --msid eb10b92 && bash examples/qwen3_5/install_extensions.sh
Note: if installation is very slow because of the pip index, set the following environment variables:
export PIP_INDEX_URL=https://mirrors.aliyun.com/pypi/simple
export PIP_TRUSTED_HOST=mirrors.aliyun.com
mkdir -p MindSpeed-MM/ckpt/hf_path
mkdir -p MindSpeed-MM/ckpt/dcp_path
cd MindSpeed-MM
pip install huggingface_hub
HF_ENDPOINT=https://hf-mirror.com python -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='Qwen/Qwen3.5-35B-A3B',
local_dir='./ckpt/hf_path/Qwen3.5-35B-A3B',
resume_download=True,
local_dir_use_symlinks=False
)
"
If the model is initialized with FSDP2 meta init, first convert the weights according to the model type:
mm-convert Qwen35Converter hf_to_dcp \
--hf_dir ./ckpt/hf_path/Qwen3.5-35B-A3B \
--dcp_dir ./ckpt/dcp_path/Qwen3.5-35B-A3B
mkdir -p MindSpeed-MM/data/COCO2017
Reference link for fast download within mainland China: https://aistudio.baidu.com/datasetdetail/100602
Reference download link within mainland China: https://www.modelscope.cn/datasets/AI-ModelScope/LLaVA-Instruct-150K/files
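
To illustrate what the data conversion step has to produce, the sketch below maps one LLaVA-Instruct record to the `images` / `messages` layout named in the `attr` section of the training configuration (`role`, `content`, `user`, `assistant`). It is a hypothetical stand-in written for this guide, not the code of llava_instruct_2_mllm_demo_format.py, and the exact output of that script may differ; the function name `convert_record` and the image directory are assumptions.

```python
# LLaVA-Instruct uses "human" / "gpt" as speaker names; the training config
# in this guide expects "user" / "assistant" under the "role" key.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def convert_record(record, image_dir):
    """Convert one LLaVA-style record into an images/messages record.

    Assumption: `record` has a "conversations" list of {"from", "value"}
    turns and, optionally, an "image" file name relative to `image_dir`.
    """
    messages = [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]
    images = []
    if "image" in record:
        images.append(image_dir.rstrip("/") + "/" + record["image"])
    return {"images": images, "messages": messages}


if __name__ == "__main__":
    sample = {
        "id": "example",
        "image": "000000000001.jpg",  # hypothetical file name
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is shown in the picture?"},
            {"from": "gpt", "value": "A cat sitting on a sofa."},
        ],
    }
    print(convert_record(sample, "./data/COCO2017/train2017"))
```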
python examples/qwen2vl/llava_instruct_2_mllm_demo_format.py
After conversion, the reference data directory structure is as follows:
$playground
├── data
    ├── COCO2017
        ├── train2017
    ├── llava_instruct_150k.json
    ├── mllm_format_llava_instruct_data.json
    ...
Modify the paths in examples/qwen3_5/qwen3_5_35B_config.yaml, including model_name_or_path, dataset, load, and any other paths. The modified qwen3_5_35B_config.yaml is as follows:
Note: full-parameter fine-tuning on 8 910B cards runs out of memory (OOM), so the following configuration freezes the weights of some layers during training:
# Parallel strategy
parallel:
  tensor_parallel_size: 1
  fully_shard_parallel_size: auto
  fsdp_plan:
    apply_modules:
      - model.visual.blocks.{*}
      - model.visual
      - model.language_model.layers.{*}.linear_attn
      - model.language_model.layers.{*}.mlp
      - model.language_model.layers.{*}
      - model.language_model.embed_tokens
      - model.language_model.norm
      - model.language_model.rotary_emb
      - model.language_model
      - lm_head
    hook_modules:
      - model.language_model.layers.{*}
    param_dtype: bf16
    reduce_dtype: fp32
  recompute: true
  recompute_plan:
    apply_modules:
      - model.language_model.layers.{*}
      - model.visual.blocks.{*}
  ulysses_parallel_size: 1  # When ulysses-cp is enabled, set the model's attn_implementation to flash_attention_2
  expert_parallel_size: 1
  ep_plan:
    apply_modules:
      - model.language_model.layers.{*}.mlp.experts
# Data configuration
data:
  dataset_param:
    dataset_type: huggingface
    # Dataset attributes
    attr:
      images: images
      messages: messages
      role_tag: role
      content_tag: content
      user_tag: user
      assistant_tag: assistant
    # Data preprocessing
    preprocess_parameters:
      model_name_or_path: &HF_MODEL_LOAD_PATH ./ckpt/hf_path/Qwen3.5-35B-A3B  # Replace with the path to the original HF weights
      use_fast_tokenizer: true
      split_special_tokens: false
      image_max_pixels: 262144
      image_min_pixels: 1024
      video_max_pixels: 16384
      video_min_pixels: 0
      video_fps: 2.0
      video_maxlen: 64
    basic_parameters:
      cutoff_len: 1024
      template: qwen3_vl_nothink
      enable_thinking: false
      train_on_prompt: false
      mask_history: false
      # tool_format: null
      dataset_dir: ./data
      dataset: &DATASET_PATH ./data/mllm_format_llava_instruct_data.json
      cache_dir: ./cache_dir/
      overwrite_cache: false
      preprocessing_batch_size: 1000
      preprocessing_num_workers: 16
      max_samples: null
  # Data loading
  dataloader_param:
    pin_memory: true
    shuffle: false
    dataloader_mode: sampler
    drop_last: true
    sampler_type: BaseRandomBatchSampler
    num_workers: 8
    collate_param:
      model_name: qwen3vl
      ignore_pad_token_for_loss: true
# Model configuration
model:
  model_id: qwen3_5_moe
  model_name_or_path: *HF_MODEL_LOAD_PATH
  trust_remote_code: true
  attn_implementation: sdpa
  freeze:
    - model.visual
    - model.language_model.layers.{*}
  loss_cfg:
    loss_type: default  # If you want raw loss in model, loss_type can be set to "raw".
    router_aux_loss_coef: 0.0
    enable_chunk_loss: true
    chunkloss_plan:
      apply_module: lm_head
      chunk_size: 1024
  use_triton_gdn: true
  use_grouped_expert_matmul: true
# Training configuration
training:
  micro_batch_size: 1
  gradient_accumulation_steps: 1
  seed: 42
  lr: 1.0e-5
  lr_decay_style: cosine
  lr_warmup_ratio: 0.1
  weight_decay: 0
  train_iters: 10000
  clip_grad: 0.0
  init_model_with_meta_device: true
  optimizer: adamw
  adam_fused: true
  save_interval: 10000
  no_load_optim: true  # Do not load optimizer state; remove if loading is needed.
  no_load_rng: true  # Do not load RNG state; remove if loading is needed.
  no_save_optim: true  # Do not save optimizer state; remove if saving is needed.
  no_save_rng: true  # Do not save RNG state; remove if saving is needed.
  load: ./ckpt/dcp_path/Qwen3.5-35B-A3B  # Replace with the path to the converted DCP weights
  save: ./save_path
  use_deter_comp: false
  plugin:
    - mindspeed_mm/fsdp/models/qwen3_5_moe
    - mindspeed_mm/fsdp/data/datasets/huggingface
# Tool configuration
tools:
  profile:
    enable: false
    profile_type: static
    ranks: [0]
    static_param:
      level: level1
      with_stack: false
      with_memory: false
      record_shapes: false
      with_cpu: true
      save_path: ./profiling
      start_step: 10
      end_step: 11
      data_simplification: false
      aic_metrics_type: PipeUtilization
  memory_profile:
    enable: false
    start_step: 1
    end_step: 2
    save_path: ./memory_snapshot
    dump_ranks: [0]
    stacks: all
    max_entries: null
  mem_info: false
Note: the training configuration above freezes model.visual and model.language_model.layers.{*}; without freezing them, training on A2 runs out of memory (OOM).
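
To see which parameters remain trainable under these two freeze entries, the sketch below matches parameter names against the patterns. It rests on assumptions that are not confirmed by MindSpeed-MM: that `{*}` stands for a numeric layer index and that an entry freezes the named module together with everything beneath it. The helper names and the sample parameter names are hypothetical.

```python
import re

# Freeze entries copied from the `model.freeze` list in the configuration.
FREEZE_PATTERNS = ["model.visual", "model.language_model.layers.{*}"]


def pattern_to_regex(pattern):
    """Compile a module pattern; `{*}` is assumed to mean a numeric index."""
    escaped = re.escape(pattern).replace(re.escape("{*}"), r"\d+")
    # Match the module itself or any submodule/parameter beneath it.
    return re.compile(rf"^{escaped}(\.|$)")


def is_frozen(param_name, patterns=FREEZE_PATTERNS):
    """Return True if the parameter falls under any freeze pattern."""
    return any(pattern_to_regex(p).match(param_name) for p in patterns)


if __name__ == "__main__":
    # Hypothetical parameter names that follow the module paths in the config.
    names = [
        "model.visual.blocks.0.attn.qkv.weight",
        "model.language_model.layers.12.mlp.gate.weight",
        "model.language_model.embed_tokens.weight",
        "model.language_model.norm.weight",
        "lm_head.weight",
    ]
    for name in names:
        print(f"{name}: {'frozen' if is_frozen(name) else 'trainable'}")
```

Under these assumptions, only the embedding, the final norm, and `lm_head` would remain trainable, which is consistent with the memory motivation stated in the note.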
nohup bash examples/qwen3_5/finetune_qwen3_5_35B.sh &
After launch, check the training status with tail -f nohup.out. During normal training, the log prints the loss, grad norm, and other information for the current iteration.
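
For reference when reading the training output, the learning-rate schedule implied by `lr: 1.0e-5`, `lr_decay_style: cosine`, `lr_warmup_ratio: 0.1`, and `train_iters: 10000` can be sketched as follows. The sketch assumes linear warmup and cosine decay down to zero; the framework's actual warmup shape and minimum learning rate are not stated in this guide and may differ.

```python
import math


def warmup_iterations(train_iters, warmup_ratio):
    """Number of warmup iterations implied by the ratio."""
    return round(train_iters * warmup_ratio)


def lr_at(step, base_lr=1.0e-5, train_iters=10000, warmup_ratio=0.1, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay.

    Assumptions: warmup is linear and the decay ends at `min_lr` (zero here).
    """
    warmup = warmup_iterations(train_iters, warmup_ratio)
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, train_iters - warmup)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 999, 1000, 5500, 10000):
        print(step, lr_at(step))
```

With the values in this configuration, warmup covers the first 1000 iterations, and the learning rate is back down to half of its peak at iteration 5500.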
mm-convert Qwen35Converter dcp_to_hf \
--save_hf_dir /tmp/save_hf_path/Qwen3.5-35B-A3B \
--dcp_dir save_path/iter_0000100 \
--origin_hf_dir ckpt/hf_path/Qwen3.5-35B-A3B
Note: the command above saves the converted weights to /tmp/save_hf_path/Qwen3.5-35B-A3B; change the save_hf_dir argument to use a different output path.
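
The `--dcp_dir` argument must point at an iteration directory that actually exists under the `save` path. The following sketch finds the most recent one, assuming only that iteration directories are named `iter_` followed by a zero-padded number, as in the example path `save_path/iter_0000100`. The helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: checkpoint directories look like "iter_0000100".
ITER_DIR = re.compile(r"^iter_(\d+)$")


def iteration_dir_name(iteration):
    """Directory name for an iteration, padded to 7 digits as in the example."""
    return f"iter_{iteration:07d}"


def latest_checkpoint(save_path):
    """Return the Path of the highest-numbered iter_* directory, or None."""
    best = None
    for child in Path(save_path).iterdir():
        match = ITER_DIR.match(child.name)
        if match and child.is_dir():
            iteration = int(match.group(1))
            if best is None or iteration > best[0]:
                best = (iteration, child)
    return None if best is None else best[1]


if __name__ == "__main__":
    save_root = Path("save_path")
    if save_root.is_dir():
        print(latest_checkpoint(save_root))
```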
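The tokenizer files in the converted weights differ from those in the original weights, so running inference directly fails because transformers cannot find the corresponding tokenizer class when loading the tokenizer. Replace the tokenizer files in the converted weights with those from the original weights: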
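# Enter the converted weight directory
cd /tmp/save_hf_path/Qwen3.5-35B-A3B
# Replace tokenizer.json and tokenizer_config.json with the files from the original weights
cp /opt/data/weight/Qwen3.5-35B-A3B/tokenizer.json ./
cp /opt/data/weight/Qwen3.5-35B-A3B/tokenizer_config.json ./
Before inference, exit the current conda virtual environment so that the commands run in the container's own Python environment. The command to exit is: conda deactivate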
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
python -m vllm.entrypoints.openai.api_server \
--served-model-name 'qwen3_5' \
--model='/tmp/save_hf_path/Qwen3.5-35B-A3B' \
--port 8080 \
-tp 4 \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--compilation-config '{"cudagraph_capture_sizes":[4,10,16,64], "cudagraph_mode":"FULL_DECODE_ONLY"}' \
--gpu-memory-utilization 0.9 \
--max-num-seqs 64 \
--max-model-len 17408 \
--allowed-local-media-path

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
vllm serve /tmp/save_hf_path/test2/Qwen3.5-35B-A3B \
--host 0.0.0.0 \
--port 8000 \
--data-parallel-size 1 \
--tensor-parallel-size 4 \
--seed 1024 \
--served-model-name qwen3.5 \
--max-num-seqs 32 \
--max-model-len 133000 \
--max-num-batched-tokens 8096 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--async-scheduling

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3_5",
"prompt": "介绍一下你自己,用中文回答",
"max_tokens": 200,
"temperature": 0
}'
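
The same completion request can be sent from Python with only the standard library. This is a minimal sketch: the helper names are hypothetical, and the URL and model name assume the server started on port 8080 with `--served-model-name 'qwen3_5'`.

```python
import json
import urllib.request


def build_completion_payload(model, prompt, max_tokens=200, temperature=0):
    """Build the JSON body for the /v1/completions endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def build_request(url, payload):
    """Create a POST request with a UTF-8 encoded JSON body."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )


def complete(url, payload, timeout=120):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(url, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, once the server is running:
#   payload = build_completion_payload("qwen3_5", "介绍一下你自己,用中文回答")
#   result = complete("http://localhost:8080/v1/completions", payload)
#   print(result["choices"][0]["text"])
```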
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3_5",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
]}
]
}'

MODEL_NAME="qwen3_5"
MODEL_PATH="/tmp/save_hf_path/Qwen3.5-35B-A3B"
INPUT_LEN=2048
OUTPUT_LEN=2048
vllm bench serve --backend vllm --model $MODEL_NAME --host localhost --port 8080 --dataset-name random --random-input-len $INPUT_LEN --random-output-len $OUTPUT_LEN --tokenizer $MODEL_PATH --result-dir="./qwen35-35b-xn" --num-prompts=150 --max-concurrency 50 --ignore-eos
Note: replace MODEL_PATH according to your environment.
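
Before running the benchmark, it can help to check the token budget implied by its parameters. The sketch below uses the values from the command above together with `--max-model-len 17408` from the server launched on port 8080. The function name is hypothetical, and the "waves" figure is a simplification, since the benchmark dispatches a new request as soon as a slot frees up rather than in strict batches.

```python
import math


def benchmark_budget(num_prompts, input_len, output_len, max_concurrency, max_model_len):
    """Summarize the token load of a fixed-length random benchmark."""
    tokens_per_request = input_len + output_len
    return {
        "tokens_per_request": tokens_per_request,
        # Each request must fit in the server's context window.
        "fits_context": tokens_per_request <= max_model_len,
        "total_tokens": num_prompts * tokens_per_request,
        # Rough count of full rounds at the given concurrency.
        "waves": math.ceil(num_prompts / max_concurrency),
    }


if __name__ == "__main__":
    print(benchmark_budget(
        num_prompts=150,
        input_len=2048,
        output_len=2048,
        max_concurrency=50,
        max_model_len=17408,
    ))
```

With these settings each request uses 4096 tokens, which fits within the 17408-token context, and the full run processes 614400 tokens in roughly three rounds of 50 concurrent requests.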