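Reinforcement learning for the Qwen3-VL-30B-A3B model with FSDP as the training backend. Sequence lengths of up to 40K tokens are supported (8K prompt plus 32K response in the training script).
This document uses verl's fully_async_policy for training.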
| Component | Version |
|---|---|
| CANN | cann-8.5.1 |
| torch | 2.8.0+cpu |
| torch_npu | 2.8.0.post2 |
| transformers | 4.57.6 |
| vllm | 0.13.0 |
| vllm-ascend | 0.13.0 |
| megatron | core_v0.15.3 |
| MindSpeed | core_r0.15.3 |
| verl | 0.8.0 |
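
Before going further, it can help to confirm that the Python packages in the environment match the table. The following is a minimal sketch and not part of the original setup: `find_mismatches` and `installed_version` are hypothetical helpers, and the distribution names used as keys are assumptions that may differ from what `pip list` reports. Source installs can also report a local version suffix. CANN, Megatron and MindSpeed are left out because they are not checked through pip metadata here.

```python
from importlib import metadata

# Expected versions copied from the table above. The keys are assumed
# distribution names; adjust them to match `pip list` on your system.
EXPECTED = {
    "torch": "2.8.0+cpu",
    "torch_npu": "2.8.0.post2",
    "transformers": "4.57.6",
    "vllm": "0.13.0",
    "vllm-ascend": "0.13.0",
    "verl": "0.8.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {name: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        if found != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

An empty result means every listed package matches the table.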
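Hardware: 16 * 910B A2 nodes (8 NPUs per node, 128 NPUs in total).
In the following, /workspace is used as the root directory for source installations.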
Use the official vllm & vllm-ascend image:
docker pull quay.io/ascend/vllm-ascend:releases-v0.13.0-openeuler

After starting the container from the image:
cd /vllm-workspace

In vllm/model_executor/layers/fused_moe/gpt_oss_triton_kernels_moe.py, change logger.error to logger.info:

if has_triton_kernels():
    try:
        import triton_kernels.swiglu
        from triton_kernels.matmul_ogs import FnSpecs, FusedActivation, matmul_ogs
        from triton_kernels.routing import RoutingData, routing, routing_from_bitmatrix
        from triton_kernels.tensor import Bitmatrix
    except (AttributeError, ImportError) as e:
        # was logger.error
        logger.info(
            "Failed to import Triton kernels. Please make sure your triton "
            "version is compatible. Error: %s",
            e,
        )

In vllm/model_executor/layers/fused_moe/config.py, likewise change logger.error to logger.info:

if has_triton_kernels():
    try:
        from triton_kernels.matmul_ogs import PrecisionConfig
    except (ImportError, AttributeError) as e:
        # was logger.error
        logger.info(
            "Failed to import Triton kernels. Please make sure your triton "
            "version is compatible. Error: %s",
            e,
        )

Installing triton-ascend: the image already contains triton-ascend, so this step can be skipped when using the vllm image.
vllm installs triton automatically. If vllm has been reinstalled, uninstall triton and then install triton-ascend.
After uninstalling, if any leftover triton directories remain under /usr/local/lib/python3.10/dist-packages, delete them manually.
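
The leftover check can be scripted. The following is a minimal sketch, not part of the original setup: `find_triton_leftovers` is a hypothetical helper that only lists candidates and deletes nothing. It assumes that upstream triton leaves behind a directory named triton or a triton-<version> metadata directory. Run it after the uninstall command and before installing triton-ascend, because triton-ascend is assumed here to provide its own triton directory.

```python
from pathlib import Path


def find_triton_leftovers(site_packages):
    """List directories that look like leftovers of the upstream triton package.

    Matches a directory named exactly `triton` and `triton-<version>` metadata
    directories (.dist-info / .egg-info). Nothing is deleted.
    """
    root = Path(site_packages)
    if not root.is_dir():
        return []
    leftovers = []
    for entry in sorted(root.iterdir()):
        name = entry.name
        is_package_dir = name == "triton"
        is_metadata = name.startswith("triton-") and name.endswith(
            (".dist-info", ".egg-info")
        )
        if entry.is_dir() and (is_package_dir or is_metadata):
            leftovers.append(entry)
    return leftovers
```

For example, `find_triton_leftovers("/usr/local/lib/python3.10/dist-packages")` returns the paths to inspect and remove by hand.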
pip uninstall -y triton

Install triton-ascend:
pip install triton-ascend

Install the remaining Python dependencies:
pip install qwen_vl_utils
pip install transformers_modules
pip install ray==2.53.0
pip install tensordict==0.10.0
pip install mathruler
pip install latex2sympy2

Install Megatron-LM in the workspace directory:
cd /workspace
git clone --depth 1 --branch core_v0.15.3 https://githubfast.com/NVIDIA/Megatron-LM.git
pip install -e Megatron-LM

Install MindSpeed:
cd /workspace
git clone https://gitcode.com/longboat/MindSpeed.git
cd MindSpeed
git checkout origin/core_r0.15.3
pip install -e .
cd ..

To improve weight loading and saving performance, modify the following two PyTorch files.

torch/distributed/_shard/sharding_spec/_internals.py (see https://github.com/pytorch/pytorch/pull/167073):

def validate_non_overlapping_shards_metadata(shards: list[ShardMetadata]):
    """
    Ensures none of the shards overlap with each other.
    Args:
        shards(List[ShardMetadata]): List of :class:`ShardMetadata` objects representing
            each shard.
    Raises:
        ``ValueError`` if there's overlap in any two shards.
    """
    # Return immediately so that the overlap check is skipped.
    return

torch/distributed/checkpoint/default_planner.py (see https://github.com/pytorch/pytorch/pull/166820):

def _validate_global_plan(global_plan: list[SavePlan], metadata: Metadata) -> bool:
    # Skip plan validation and always report success.
    all_good = True
    return all_good

Install mbridge in the workspace directory:
cd /workspace
git clone https://githubfast.com/longboat2010/mbridge.git
cd mbridge
git checkout 4cfd6f5eab84ed5424a8202e1a282e6ac584fce5
pip install -e .
cd ..

Install verl in the workspace directory:
cd /workspace
git clone https://gitcode.com/longboat/verl8.git verl
cd verl
git checkout origin/qwen3_vl
pip install -e .
mkdir logs

The following transformers bug is fixed via monkey patch:
[fix device mismatch error for FSDP2 training](https://github.com/huggingface/transformers/pull/41536)
The changes made during adaptation are in this commit: https://gitcode.com/longboat/verl8/commit/b8edfa322e7213d631ae4ba10b3e100d426ca7bb?ref=qwen3_vl
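
For readers unfamiliar with the technique, the general shape of a monkey patch is shown below. This is a generic illustration under stated assumptions, not the code from the commit above: `HypotheticalLayer`, `apply_patch` and `make_fixed_forward` are made-up names, and the `float(x)` conversion merely stands in for whatever the real fix does.

```python
import functools


class HypotheticalLayer:
    """Stand-in for a library class; NOT a real transformers class."""

    def forward(self, x):
        return x + 1


def apply_patch(cls, method_name, make_replacement):
    """Replace cls.method_name with make_replacement(original), at most once."""
    original = getattr(cls, method_name)
    if getattr(original, "_is_patched", False):
        return  # already patched; keeps repeated calls harmless
    replacement = functools.wraps(original)(make_replacement(original))
    replacement._is_patched = True
    setattr(cls, method_name, replacement)


def make_fixed_forward(original):
    def fixed_forward(self, x):
        # Hypothetical pre-processing standing in for the real fix,
        # followed by a call to the untouched original method.
        return original(self, float(x))

    return fixed_forward


# Patch once at import time, before any instance is used.
apply_patch(HypotheticalLayer, "forward", make_fixed_forward)
```

The advantage of this approach is that the installed package stays unmodified; the patch lives in the training code and is applied when that code is imported.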
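
Download the model weights:
modelscope download --model Qwen/Qwen3-VL-30B-A3B-Thinking --local_dir ./Qwen3-VL-30B-A3B-Thinking

Prepare the geo3k dataset with the preprocessing script shipped with verl. The training script reads ${root_path}/datasets/geo3k/train.parquet and test.parquet, so the save directory must match that location.
cd verl
python ./examples/data_preprocess/geo3k_multiturn_w_tool.py --local_save_dir /datasets/geo3k

To speed up startup, the training script places the dataset and the weights in shared memory (data.use_shm=true, actor_rollout_ref.model.use_shm=true).
Before starting the nodes, set the shared memory of each node to 256G.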
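
Whether a node meets the shared memory requirement can be checked with a short script. The following is a minimal sketch under two assumptions that are not stated in the original: shared memory is mounted at /dev/shm, and 256G is read as 256 GiB. `shm_is_large_enough` and `check_shm` are hypothetical helper names.

```python
import shutil

GIB = 1024 ** 3


def shm_is_large_enough(total_bytes, required_gib=256):
    """Return True if a shared memory size in bytes meets the requirement."""
    return total_bytes >= required_gib * GIB


def check_shm(path="/dev/shm", required_gib=256):
    """Return (ok, total_gib) for the shared memory mount at `path`."""
    usage = shutil.disk_usage(path)
    return shm_is_large_enough(usage.total, required_gib), usage.total / GIB
```

Calling `check_shm()` on each node returns whether the mount is large enough together with its size in GiB.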
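
Run on the head node:
ray start --head

Run on the other nodes (assuming the head node IP is 10.119.5.215):
ray start --address='10.119.5.215:6379'

After all nodes have joined, check the Ray status. It should show 128 NPUs.
ray status

Start training on the head node.
Use nohup so that terminal problems do not cause the training process to hang.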
cd verl
nohup bash ./verl/experimental/fully_async_policy/shell/run_qwen_vl_npu_async.sh &
tail -f nohup.out

The script is as follows:
# Node Info
n_gpus_rollout=8
n_gpus_training=8
n_nodes_rollout=8
n_nodes_train=8 # $((NNODES - n_nodes_rollout))
echo n_gpus_rollout:$n_gpus_rollout, n_nodes_rollout:$n_nodes_rollout, n_gpus_training:$n_gpus_training, n_nodes_train:$n_nodes_train
# Project Configuration
project_name='async_qwen3_vl_fsdp'
experiment_name='npu_nnodes'$n_nodes_train
strategy=fsdp
root_path=/pathtoroot
logfile=${root_path}/verl/logs/${project_name}_$(date +%Y%m%d)_$(date +%H%M%S).log
START_TIME=$(date +%s)
START_DATE=$(date)
# Model Configuration
hf_model_path=${root_path}/models/Qwen3-VL-30B-A3B-Thinking
default_local_dir=${root_path}/models/train/${project_name}/${experiment_name} #save checkpoint
# Data Configuration
train_files=${root_path}/datasets/geo3k/train.parquet
val_files=${root_path}/datasets/geo3k/test.parquet
max_prompt_length=$((1024 * 8))
max_response_length=$((1024 * 32))
# Training Batch Configuration
rollout_n=4
ppo_micro_batch_size_per_gpu=4
ppo_mini_batch_size=$(($n_nodes_train * $n_gpus_training * $ppo_micro_batch_size_per_gpu))
ppo_max_token_len_per_gpu=$(($max_prompt_length + $max_response_length))
log_prob_micro_batch_size_per_gpu=4
log_prob_max_token_len_per_gpu=$(($max_prompt_length + $max_response_length))
ulysses_sequence_parallel_size=4
use_dynamic_bsz=true
training_steps=300
# Async configuration
staleness_threshold=0.5 # 0 0.3 1
require_batches=1
trigger_parameter_sync_step=1 #$((train_bsz / ( train_prompt_mini_bsz * require_batches))) # 8 16 32
total_rollout_steps=$((($ppo_mini_batch_size * $require_batches * $trigger_parameter_sync_step * $training_steps)))
partial_rollout=True
# Performance and Memory Management Configuration
enable_gradient_checkpointing=true
offload_policy=false
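param_offload=true # whether to offload model weights to CPU
optimizer_offload=true # whether to offload optimizer states to CPU
entropy_checkpointing=true # recompute the entropy calculation during training to lower peak device memory
entropy_from_logits_with_chunking=true # compute entropy in chunks to lower peak device memory
reshard_after_forward=true # parameter handling after the forward pass, trading memory against communication. Default true: reshard after forward, all-gather again in backward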
use_torch_compile=true
# vLLM Configuration
gpu_memory_utilization=0.9
vllm_max_model_len=$(($max_prompt_length + $max_response_length))
vllm_max_num_batched_tokens=$(($max_prompt_length + $max_response_length)) # maximum total number of tokens processed in one batch
vllm_max_num_seqs=32
tensor_model_parallel_size=2
data_parallel_size=1
expert_parallel_size=2
enforce_eager=false
temperature=1.0
top_k=-1
top_p=1.0
val_top_p=0.7
# Rollout Correction parameters
rollout_is=sequence
rollout_is_threshold=2.0
rollout_is_batch_normalize=true
# rollout_rs=token_k1
rollout_rs=null
rollout_rs_threshold=0.6_1.6
set -x
python3 -m verl.experimental.fully_async_policy.fully_async_main \
algorithm.adv_estimator=grpo \
algorithm.use_kl_in_reward=false \
algorithm.rollout_correction.rollout_is=${rollout_is} \
algorithm.rollout_correction.rollout_is_threshold=${rollout_is_threshold} \
algorithm.rollout_correction.rollout_is_batch_normalize=${rollout_is_batch_normalize} \
algorithm.rollout_correction.rollout_rs=${rollout_rs} \
algorithm.rollout_correction.rollout_rs_threshold=${rollout_rs_threshold} \
data.train_files=${train_files} \
data.val_files=${val_files} \
data.train_batch_size=0 \
data.max_prompt_length=$max_prompt_length \
data.max_response_length=$max_response_length \
data.filter_overlong_prompts=true \
data.truncation='error' \
data.return_raw_chat=true \
data.image_key=images \
data.seed=1234 \
data.use_shm=true \
actor_rollout_ref.hybrid_engine=false \
actor_rollout_ref.nccl_timeout=1200 \
actor_rollout_ref.model.path=${hf_model_path} \
actor_rollout_ref.actor.checkpoint.save_contents='["hf_model"]' \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.model.enable_gradient_checkpointing=$enable_gradient_checkpointing \
actor_rollout_ref.model.enable_activation_offload=false \
actor_rollout_ref.model.use_fused_kernels=true \
actor_rollout_ref.model.use_shm=true \
actor_rollout_ref.actor.use_kl_loss=true \
actor_rollout_ref.actor.kl_loss_coef=0.01 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=1 \
actor_rollout_ref.actor.use_rollout_log_probs=true \
actor_rollout_ref.actor.ppo_mini_batch_size=$ppo_mini_batch_size \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=$ppo_micro_batch_size_per_gpu \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$ppo_max_token_len_per_gpu \
actor_rollout_ref.actor.use_torch_compile=$use_torch_compile \
actor_rollout_ref.actor.use_dynamic_bsz=$use_dynamic_bsz \
actor_rollout_ref.actor.use_prefix_grouper=false \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=$ulysses_sequence_parallel_size \
actor_rollout_ref.actor.entropy_checkpointing=$entropy_checkpointing \
actor_rollout_ref.actor.entropy_from_logits_with_chunking=$entropy_from_logits_with_chunking \
actor_rollout_ref.actor.strategy=$strategy \
actor_rollout_ref.actor.fsdp_config.entropy_checkpointing=$entropy_checkpointing \
actor_rollout_ref.actor.fsdp_config.entropy_from_logits_with_chunking=$entropy_from_logits_with_chunking \
actor_rollout_ref.actor.fsdp_config.strategy=$strategy \
actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
actor_rollout_ref.actor.fsdp_config.dtype=bfloat16 \
actor_rollout_ref.actor.fsdp_config.use_torch_compile=$use_torch_compile \
actor_rollout_ref.actor.fsdp_config.forward_prefetch=true \
actor_rollout_ref.actor.fsdp_config.offload_policy=$offload_policy \
actor_rollout_ref.actor.fsdp_config.param_offload=$param_offload \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=$optimizer_offload \
actor_rollout_ref.actor.fsdp_config.reshard_after_forward=$reshard_after_forward \
actor_rollout_ref.actor.fsdp_config.ulysses_sequence_parallel_size=$ulysses_sequence_parallel_size \
actor_rollout_ref.actor.fsdp_config.use_orig_params=true \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.mode='async' \
actor_rollout_ref.rollout.n=$rollout_n \
actor_rollout_ref.rollout.checkpoint_engine.update_weights_bucket_megabytes=1024 \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=$use_dynamic_bsz \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=$log_prob_micro_batch_size_per_gpu \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu \
actor_rollout_ref.rollout.gpu_memory_utilization=$gpu_memory_utilization \
actor_rollout_ref.rollout.tensor_model_parallel_size=$tensor_model_parallel_size \
actor_rollout_ref.rollout.data_parallel_size=$data_parallel_size \
actor_rollout_ref.rollout.expert_parallel_size=$expert_parallel_size \
actor_rollout_ref.rollout.max_model_len=$vllm_max_model_len \
actor_rollout_ref.rollout.max_num_batched_tokens=$vllm_max_num_batched_tokens \
actor_rollout_ref.rollout.max_num_seqs=$vllm_max_num_seqs \
actor_rollout_ref.rollout.enable_chunked_prefill=true \
actor_rollout_ref.rollout.enable_prefix_caching=true \
+actor_rollout_ref.rollout.enable_sleep_mode=false \
actor_rollout_ref.rollout.disable_log_stats=true \
actor_rollout_ref.rollout.enforce_eager=$enforce_eager \
+actor_rollout_ref.rollout.engine_kwargs.vllm.compilation_config=\{'cudagraph_mode':'FULL_DECODE_ONLY','cudagraph_capture_sizes':[64,32,16,8,4,2,1]\} \
+actor_rollout_ref.rollout.engine_kwargs.vllm.async-scheduling=true \
actor_rollout_ref.rollout.free_cache_engine=false \
actor_rollout_ref.rollout.calculate_log_probs=true \
actor_rollout_ref.rollout.load_format="dummy" \
actor_rollout_ref.rollout.temperature=$temperature \
actor_rollout_ref.rollout.top_k=$top_k \
actor_rollout_ref.rollout.top_p=$top_p \
actor_rollout_ref.rollout.val_kwargs.temperature=$temperature \
actor_rollout_ref.rollout.val_kwargs.top_k=$top_k \
actor_rollout_ref.rollout.val_kwargs.top_p=$val_top_p \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=$use_dynamic_bsz \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=$log_prob_micro_batch_size_per_gpu \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$log_prob_max_token_len_per_gpu \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=$ulysses_sequence_parallel_size \
actor_rollout_ref.ref.strategy=$strategy \
actor_rollout_ref.ref.fsdp_config.param_offload=$param_offload \
critic.enable=false \
critic.strategy=$strategy \
rollout.n_gpus_per_node=$n_gpus_rollout \
rollout.nnodes=$n_nodes_rollout \
rollout.n=$rollout_n \
rollout.total_rollout_steps=$total_rollout_steps \
trainer.n_gpus_per_node=$n_gpus_training \
trainer.nnodes=$n_nodes_train \
trainer.critic_warmup=0 \
trainer.logger='["console"]' \
trainer.project_name=$project_name \
trainer.experiment_name=$experiment_name \
trainer.val_before_train=false \
trainer.device=npu \
trainer.save_freq=30 \
trainer.test_freq=-1 \
trainer.resume_mode=disable \
trainer.total_epochs=1 \
trainer.total_training_steps=$training_steps \
trainer.default_local_dir=${default_local_dir} \
trainer.rollout_data_dir=${default_local_dir}/train_rollout \
trainer.validation_data_dir=${default_local_dir}/val_rollout \
async_training.require_batches=${require_batches} \
async_training.staleness_threshold="${staleness_threshold}" \
async_training.trigger_parameter_sync_step="${trigger_parameter_sync_step}" \
async_training.partial_rollout="${partial_rollout}" \
async_training.use_trainer_do_validate=false \
+ray_kwargs.ray_init.runtime_env.env_vars=\{\
HCCL_OP_EXPANSION_MODE:AIV,\
HCCL_INTRA_PCIE_ENABLE:\"1\",\
HCCL_INTRA_ROCE_ENABLE:\"0\",\
HCCL_EXEC_TIMEOUT:\"1200\",\
HCCL_CONNECT_TIMEOUT:\"1200\",\
CUDA_DEVICE_MAX_CONNECTIONS:\"1\",\
MULTI_STREAM_MEMORY_REUSE:\"1\",\
TASK_QUEUE_ENABLE:\"1\",\
ASCEND_LAUNCH_BLOCKING:\"0\",\
HYDRA_FULL_ERROR:\"0\",\
VLLM_LOGGING_LEVEL:INFO,\
VLLM_USE_V1:\"1\",\
VLLM_ASCEND_ENABLE_NZ:\"0\",\
VLLM_ASCEND_ENABLE_PREFETCH_MLP:\"1\",\
VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE:\"1\",\
VLLM_ASCEND_ENABLE_FLASHCOMM1:\"1\",\
VLLM_ATTENTION_BACKEND:FLASH_ATTN,\
LD_PRELOAD:\"/usr/local/Ascend/ascend-toolkit/latest/arm64-linux/lib64/libjemalloc.so\",\
PYTORCH_NPU_ALLOC_CONF:\"expandable_segments:True\",\
CPU_AFFINITY_CONF:\"1\"\} \
2>&1 | tee ${logfile}
set +x
TOTAL_DURATION=$(($(date +%s) - START_TIME))
echo "Script start time: ${START_DATE}" | tee -a ${logfile}
echo "Script end time: $(date)" | tee -a ${logfile}
echo "Total script duration: ${TOTAL_DURATION} s" | tee -a ${logfile}
TOK=$(grep 'perf/total_num_tokens:' ${logfile} | awk -F 'perf/total_num_tokens:' '{print$2}' | awk -F ' ' '{print$1}' | tail -n 11 | tail -n +2 | awk '{sum+=$1} END {print sum/NR}')
TPS=$(grep 'perf/throughput:' ${logfile} | awk -F 'perf/throughput:' '{print$2}' | awk -F ' ' '{print$1}' | tail -n 11 | tail -n +2 | awk '{sum+=$1} END {print sum/NR}')
MFU=$(grep 'perf/mfu/actor:' ${logfile} | awk -F 'perf/mfu/actor:' '{print$2}' | awk -F ' ' '{print$1}' | tail -n 11 | tail -n +2 | awk '{sum+=$1} END {print sum/NR}')
response_length=$(grep 'response_length/mean:' ${logfile} | awk -F 'response_length/mean:' '{print$2}' | awk -F ' ' '{print$1}' | tail -n 11 | tail -n +2 | awk '{sum+=$1} END {print sum/NR}')
timing_step=$(grep 'timing_s/step:' ${logfile} | awk -F 'timing_s/step:' '{print$2}' | awk -F ' ' '{print$1}' | tail -n 11 | tail -n +2 | awk '{sum+=$1} END {print sum/NR}')
echo "Final Performance MFU : $MFU" | tee -a ${logfile}
echo "Final Performance TPS : $TPS" | tee -a ${logfile}
echo "Final Performance TOK : $TOK" | tee -a ${logfile}
echo "Final Performance response_length : $response_length" | tee -a ${logfile}
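echo "Final Performance timing_step : $timing_step" | tee -a ${logfile}

Notes on selected parameters:
| Parameter | Description |
|---|---|
| n_nodes_rollout | Number of inference (rollout) nodes. Changing it only affects rollout time. |
| n_nodes_train | Number of training nodes. |
| checkpoint.save_contents | How weights are saved. '["hf_model"]', as used in the script, saves only the HF weights; saving in torch DCP format takes considerably more space. |
| root_path | Set this to a shared path that all nodes can access. |
| +ray_kwargs.ray_init.runtime_env.env_vars | Environment variables set on all nodes through Ray. |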
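
Several values in the script are derived from the node settings by simple arithmetic. The sketch below reproduces those calculations in Python so that the effect of changing the node counts can be checked before launching. `derive_settings` is a hypothetical helper; the formulas and default values are copied from the script.

```python
def derive_settings(
    n_nodes_rollout=8,
    n_nodes_train=8,
    n_gpus_rollout=8,
    n_gpus_training=8,
    ppo_micro_batch_size_per_gpu=4,
    require_batches=1,
    trigger_parameter_sync_step=1,
    training_steps=300,
    max_prompt_length=1024 * 8,
    max_response_length=1024 * 32,
):
    """Recompute the derived values of the training script."""
    ppo_mini_batch_size = (
        n_nodes_train * n_gpus_training * ppo_micro_batch_size_per_gpu
    )
    return {
        # NPUs that `ray status` should report across all nodes
        "total_npus": n_nodes_rollout * n_gpus_rollout
        + n_nodes_train * n_gpus_training,
        "ppo_mini_batch_size": ppo_mini_batch_size,
        # used for ppo_max_token_len_per_gpu and vllm_max_model_len
        "max_token_len": max_prompt_length + max_response_length,
        "total_rollout_steps": ppo_mini_batch_size
        * require_batches
        * trigger_parameter_sync_step
        * training_steps,
    }
```

With the defaults this gives 128 NPUs, a mini batch size of 256, a maximum length of 40960 tokens and 76800 total rollout steps.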
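
For the 30B model, training with the fsdp backend performs well, with MFU reaching above 20%. With the current verl code, fsdp also uses less NPU memory than fsdp2.
With the current verl code, the megatron backend's training support for Qwen3-VL is incomplete: features such as CP and use_remove_padding have bugs. In batch size, MFU and other metrics it falls short of fsdp.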
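
Metrics such as MFU and throughput are summarized at the end of the training script: for each metric the awk pipelines keep the last 11 logged values, drop the first of them, and average the rest. A minimal Python sketch of the same calculation follows. `average_recent` is a hypothetical helper, and it assumes log lines of the form key:value with the value ending at the next whitespace.

```python
import re


def average_recent(log_text, key, window=10):
    """Average the last `window` values of `key`, mirroring the awk pipeline.

    Like `tail -n 11 | tail -n +2`, this keeps the last window + 1 values and
    drops the first of them. Returns None when no value remains.
    """
    pattern = re.compile(re.escape(key) + r"\s*(\S+)")
    values = []
    for line in log_text.splitlines():
        match = pattern.search(line)
        if match is None:
            continue
        try:
            values.append(float(match.group(1)))
        except ValueError:
            # Unlike awk, which counts a non-numeric token as 0, skip it.
            continue
    recent = values[-(window + 1):][1:]
    if not recent:
        return None
    return sum(recent) / len(recent)
```

For example, `average_recent(open(logfile).read(), "perf/mfu/actor:")` corresponds to the MFU value echoed by the script, where `logfile` is the path of the training log.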

Port already in use.
This is a known vLLM issue: https://github.com/vllm-project/vllm/pull/35977
File "/usr/local/python3.11.13/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 494, in __init__
self.input_socket = self.resources.input_socket = make_zmq_socket(
^^^^^^^^^^^^^^^^
File "/usr/local/python3.11.13/lib/python3.11/site-packages/vllm/utils/network_utils.py", line 307, in make_zmq_socket
socket.bind(path)
File "/usr/local/python3.11.13/lib/python3.11/site-packages/zmq/sugar/socket.py", line 320, in bind
super().bind(addr)
File "zmq/backend/cython/_zmq.py", line 1009, in zmq.backend.cython._zmq.Socket.bind
File "zmq/backend/cython/_zmq.py", line 190, in zmq.backend.cython._zmq._check_rc
zmq.error.ZMQError: Address already in use (addr='tcp://10.119.13.85:36909')

In addition, in the same environment, deploying Qwen3-VL-30B-A3B with vLLM for inference produces abnormal results.
返回结果:{"role":"assistant","content":null,"refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n…\n"},"
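Cause: in transformers 5.4 and later, the default value of tie_word_embeddings changed. As a result the embedding weights are corrupted and the model can no longer interpret the input tokens correctly.
https://github.com/huggingface/transformers/commit/39f751a538ca67932cab53e6eb5763243674ae2c
Fix: edit config.json and add "tie_word_embeddings": false inside the text_config section. A reference config is available at:
https://ai.gitcode.com/Ascend-SACT/Qwen3-VL-30B-A3B_verl/blob/main/config.json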
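
The config.json change can also be applied with a small script instead of editing by hand. The following is a minimal sketch: `set_untied_embeddings` is a hypothetical helper, and it assumes the file already contains a text_config section (it raises KeyError otherwise). Note that it rewrites the file with 2-space indentation.

```python
import json
from pathlib import Path


def set_untied_embeddings(config_path):
    """Set text_config.tie_word_embeddings to false in a config.json file.

    Returns True if the file was changed, False if it was already correct.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    text_config = config["text_config"]  # KeyError if the section is missing
    if text_config.get("tie_word_embeddings") is False:
        return False
    text_config["tie_word_embeddings"] = False
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return True
```

Run it once against the config.json in the model directory before starting vLLM; a second run leaves the file untouched.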