[toc]
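Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built on large-scale training, Qwen3 delivers major advances in reasoning, instruction following, agent capabilities, and multilingual support.

Qwen3-235B-A22B model link: https://huggingface.co/Qwen/Qwen3-235B-A22B

This case study uses Ascend A2 machines and the verl framework to run GRPO reinforcement learning on Qwen3-235B-A22B.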

| Hardware | Configuration |
|---|---|
| Machine model | A2 |
| Test cluster | 16 nodes (128 NPUs) |
| Operating system | ARM |

| Software | Version | Deployment |
|---|---|---|
| Driver | AscendHDK 25.2.0 | Host |
| Firmware | AscendHDK 25.2.0 | Host |
| Docker image OS | Ubuntu 20.04.6 | Container |
| Python | 3.10.18 | Container |
| CANN | 8.3.RC1 | Container |
| Torch | 2.7.1 | Container |
| Torch_npu | 2.7.1 | Container |
| vllm | master(38217877aa70041c0115ee367b75197af9cbc5ad) | Container |
| vllm-ascend | master(1de16ead8eecfec8903ec1b330b27a4fa2593c35) | Container |
| transformers | master(8365f70e925) | Container |
| MindSpeed | master(1cdd0abd75e40936ad31721c092f57c695dd72c4) | Container |
| Megatron-LM | core_v0.12.1 | Container |
| verl | master(796871d7d092f7cbc6a64e7f4a3796f7a2217f5e) | Container |
| MindSpeed-RL | 2.2.0 | Container |

The verl image has been published. Image download link: Ascend-SACT/ascend_rl_train_image
Reference: https://gitcode.com/Ascend/MindSpeed-RL/blob/2.2.0/rl-plugin/README.MD
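GSM8K (Grade School Math 8K) is a dataset of 8,500 high-quality, linguistically diverse grade school math word problems. It was created to support question answering on basic math problems that require multi-step reasoning.
These problems take between 2 and 8 steps to solve.
Solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer.
A bright middle school student should be able to solve every problem. From the paper: "Problems require no concepts beyond the level of early Algebra, and the vast majority of problems can be solved without explicitly defining a variable."
Solutions are provided in natural language rather than as pure math expressions. From the paper: "We believe this is the most generally useful data format, and we expect it to shed light on the properties of large language models' internal monologues."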
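Because the solutions are written in natural language, a rule-based reward has to pull the final number out of the text before it can compare answers. The following is a minimal sketch, assuming the dataset convention that each reference solution ends with a line of the form `#### <answer>`. The function name `extract_final_answer` and the sample text are illustrative and are not taken from verl.

```python
import re

# Assumption: a GSM8K reference solution ends with "#### <number>".
ANSWER_PATTERN = re.compile(r"#### (-?[0-9.,]+)")


def extract_final_answer(solution: str) -> str:
    """Return the final numeric answer from a GSM8K-style solution string."""
    match = ANSWER_PATTERN.search(solution)
    if match is None:
        raise ValueError("no '#### <answer>' marker found")
    # Drop thousands separators so "1,200" and "1200" compare equal.
    return match.group(1).replace(",", "")


# Illustrative sample written for this sketch, not a real dataset row.
sample = (
    "Tom has 3 boxes with 4 apples each, so he has 3 * 4 = 12 apples.\n"
    "He eats 2 of them, leaving 12 - 2 = 10 apples.\n"
    "#### 10"
)
print(extract_final_answer(sample))  # 10
```

Normalizing the extracted string (here, only by removing commas) keeps the comparison between a model response and the reference answer a plain string equality check.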
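Run the following script. It first downloads the dataset from https://huggingface.co/datasets/openai/gsm8k/tree/main and then preprocesses it into parquet format:
# Use a mirror inside mainland China
export HF_ENDPOINT=https://hf-mirror.com
cd /src/rl-verl-2.2.0/verl/
python3 examples/data_preprocess/gsm8k.py --local_dir /root/work/filestorage/datasets/gsm8k

After processing completes, the processed data is written to the directory specified by --local_dir: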
gsm8k/
|-- test.parquet
|-- train.parquet

Reference: https://verl.readthedocs.io/en/latest/start/quickstart.html
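The model can be downloaded with the HuggingFace CLI:
export HF_ENDPOINT=https://hf-mirror.com
pip install -U huggingface_hub
hf download Qwen/Qwen3-235B-A22B --local-dir ./Qwen3-235B-A22B

After the download completes, the directory contains the following files (there are 118 weight shards; only some are listed here to save space):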
Qwen3-235B-A22B/
├── LICENSE
├── README.md
├── config.json
├── generation_config.json
├── merges.txt
├── model-00001-of-00118.safetensors
├── model-00002-of-00118.safetensors
├── model-00118-of-00118.safetensors
├── model.safetensors.index.json
├── tokenizer.json
├── tokenizer_config.json
└── vocab.json

Weight size:
root@dl-4ed054c:~/work/filestorage/weights_hf# du -sh Qwen3-235B-A22B/
438G Qwen3-235B-A22B/

Convert the HuggingFace weights to Megatron weights, using a single node with 8 NPUs:
cd /src/rl-verl-2.2.0/verl
# --use_cpu_initialization only works for MoE models
python scripts/converter_hf_to_mcore.py \
--hf_model_path /root/work/filestorage/Qwen3-235B-A22B \
--output_path /root/work/filestorage/weights_mg/Qwen3-235B-A22B-verl \
--use_cpu_initialization

After a successful conversion, the following files are generated in the output directory:
Qwen3-235B-A22B-verl
|-- __0_0.distcp
|-- __0_1.distcp
|-- common.pt
|-- metadata.json

Reference: https://verl.readthedocs.io/en/latest/advance/checkpoint.html#checkpoint-page
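Before launching a multi-node job, it can help to confirm that the converted checkpoint directory has the layout listed above. The following is a minimal sketch: `check_dist_checkpoint` is a hypothetical helper, and the expected file names are taken only from the listing in this guide, so adjust them if your converter version writes different names.

```python
import tempfile
from pathlib import Path


def check_dist_checkpoint(ckpt_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the layout looks as expected."""
    root = Path(ckpt_dir)
    problems = []
    # File names below mirror the directory listing in this guide.
    for name in ("common.pt", "metadata.json"):
        if not (root / name).is_file():
            problems.append(f"missing {name}")
    if not any(root.glob("*.distcp")):
        problems.append("no *.distcp shard files found")
    return problems


# Demo on a throwaway directory that mimics the listing above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("__0_0.distcp", "__0_1.distcp", "common.pt", "metadata.json"):
        (Path(tmp) / name).touch()
    print(check_dist_checkpoint(tmp))  # []
```

The check only looks at file names. It does not verify that the shards are complete or loadable.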
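The steps here were carried out on a customer platform. Details may differ from site to site, but the overall commands are essentially the same:
Master node:
ray start --head --port 6166 --dashboard-host=0.0.0.0 --dashboard-port=8260 --block

Worker nodes:
ray start --address="${MASTER_IP}:6166" --resources='{"NPU": 8}' --block

Complete scripts:
Master node: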
bash ray_start_master.sh
export RAY_DEDUP_LOGS=0
export HYDRA_FULL_ERROR=1
# TASK_QUEUE_ENABLE: task dispatch optimization; set to 1 for graph mode, 2 for non-graph mode
export TASK_QUEUE_ENABLE=2
export HCCL_ASYNC_ERROR_HANDLING=0
export HCCL_EXEC_TIMEOUT=3600
export HCCL_CONNECT_TIMEOUT=3600
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH
ulimit -n 32768
SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0" # Replace with the correct interface name
export GLOO_SOCKET_IFNAME="eth0" # Replace with the correct interface name
ray start --head --port 6166 --dashboard-host=0.0.0.0 --dashboard-port=8260 --block

Worker nodes:
bash ray_start_worker.sh ${MASTER_IP}
export RAY_DEDUP_LOGS=0
export HYDRA_FULL_ERROR=1
# TASK_QUEUE_ENABLE: task dispatch optimization; set to 1 for graph mode, 2 for non-graph mode
export TASK_QUEUE_ENABLE=2
export HCCL_ASYNC_ERROR_HANDLING=0
export HCCL_EXEC_TIMEOUT=3600
export HCCL_CONNECT_TIMEOUT=3600
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH
ulimit -n 32768
MASTER_IP=$1
SOCKET_IFNAME="eth0"
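export HCCL_SOCKET_IFNAME="eth0" # Replace with the correct interface name
export GLOO_SOCKET_IFNAME="eth0" # Replace with the correct interface name
ray start --address="${MASTER_IP}:6166" --resources='{"NPU": 8}' --block

After entering the container, check the Ray status:
ray status

Check that the number of NPUs matches what you expect. The output below is for two nodes with 16 NPUs. Active lists the live nodes, and Total Usage shows used/total resources, including NPUs.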
Node status
---------------------------------------------------------------
Active:
1 node_de0674cc214e6f6bbb325bdcb5c294e4ea9b98b1ad44cdc82abeeb03
1 node_3eabfacf577fcf812a636a4931e51f2136cdcc3383a21bc0771858fe
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/320.0 CPU
0.0/16.0 NPU
0B/3.69TiB memory
0B/60.80GiB object_store_memory

The training script only needs to be launched on the master node:
nohup bash Qwen3-235B-A22B/test_grpo_qwen3_235b_A2.sh > qwen3-235b-1120-1.log 2>&1 &

The complete script is as follows:
HF_MODEL_PATH=/root/work/filestorage/Qwen3-235B-A22B
DIST_CKPT_PATH=/root/work/filestorage/weights_mg/Qwen3-235B-A22B-verl
TRAIN_DATA_PATH=/root/work/filestorage/datasets/gsm8k/train.parquet # gsm8k
TEST_DATA_PATH=/root/work/filestorage/datasets/gsm8k/test.parquet # gsm8k
CKPTS_DIR=/root/work/filestorage/ckpt/Qwen3-235B-A22B/
log_path=/root/work/filestorage/output
export CUDA_DEVICE_MAX_CONNECTIONS=1
offload=True
python3 -m verl.trainer.main_ppo --config-path=config \
--config-name='ppo_megatron_trainer.yaml' \
algorithm.adv_estimator=grpo \
data.train_files=$TRAIN_DATA_PATH \
data.val_files=$TEST_DATA_PATH \
data.train_batch_size=128 \
data.max_prompt_length=8192 \
data.max_response_length=4096 \
data.filter_overlong_prompts=False \
data.truncation='error' \
actor_rollout_ref.model.path=$HF_MODEL_PATH \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=128 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=8 \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=4 \
actor_rollout_ref.actor.megatron.use_dist_checkpointing=True \
actor_rollout_ref.actor.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
+actor_rollout_ref.rollout.dp_model_parallel_size=8 \
+actor_rollout_ref.rollout.rollout_world_size=128 \
actor_rollout_ref.rollout.name=vllm \
+actor_rollout_ref.rollout.enable_expert_parallel=True \
actor_rollout_ref.actor.megatron.param_offload=${offload} \
actor_rollout_ref.actor.megatron.optimizer_offload=${offload} \
actor_rollout_ref.actor.megatron.grad_offload=${offload} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.rollout.n=16 \
actor_rollout_ref.rollout.max_num_batched_tokens=1024 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=8 \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=4 \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=4 \
actor_rollout_ref.ref.megatron.param_offload=${offload} \
actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
actor_rollout_ref.ref.megatron.dist_checkpointing_path=$DIST_CKPT_PATH \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger='["console"]' \
trainer.project_name='verl_grpo_example_gsm8k_math' \
trainer.experiment_name='qwen3_30b_moe_megatron' \
trainer.device=npu \
trainer.n_gpus_per_node=8 \
trainer.nnodes=16 \
trainer.save_freq=50 \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.test_freq=5 \
trainer.total_epochs=15 \
trainer.val_before_train=false \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 \
++actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_first_pipeline_stage=11 \
++actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=11 \
+actor_rollout_ref.actor.megatron.override_transformer_config.use_flash_attn=True \
++actor_rollout_ref.ref.megatron.override_transformer_config.use_flash_attn=True 2>&1 | tee "${log_path}/verl_qwen3_235b_$(date +%Y%m%d_%H%M).log"

During training, the logs can be inspected with:
grep -nr "response" /tmp/ray/session_latest/logs
grep -nr "response" /tmp/ray/session_latest/logs > qwen3-235b-verl.log

The log file can be dragged into https://curryrice233.github.io/TrainingLogParser/ to view the reward curve.
Inside the company, use: https://traininglogparser.openx.huawei.com/

Log:
(TaskRunner pid=11069) step:1 - global_seqlen/min:13889 - global_seqlen/max:32712 - global_seqlen/minmax_diff:18823 - global_seqlen/balanced_min:22614 - global_seqlen/balanced_max:22617 - global_seqlen/mean:22615.296875 - actor/entropy:0.5020670294761658 - actor/kl_loss:0.01002851337671018 - actor/kl_coef:0.0010000000000000005 - actor/pg_loss:0.0033242334029637277 - actor/pg_clipfrac:0.03154947847724543 - actor/ppo_kl:9.132483211349296e-05 - actor/pg_clipfrac_lower:0.00040768300505078514 - actor/grad_norm:0.21204539154165722 - perf/mfu/actor:0.015722994628242382 - perf/max_memory_allocated_gb:57.76551055908203 - perf/max_memory_reserved_gb:60.91796875 - perf/cpu_memory_used_gb:465.78759765625 - actor/lr:1e-06 - training/global_step:1 - training/epoch:0 - critic/score/mean:0.44775390625 - critic/score/max:1.0 - critic/score/min:0.0 - critic/rewards/mean:0.44775390625 - critic/rewards/max:1.0 - critic/rewards/min:0.0 - critic/advantages/mean:-0.05481206253170967 - critic/advantages/max:3.7499847412109375 - critic/advantages/min:-2.561730146408081 - critic/returns/mean:-0.05481206253170967 - critic/returns/max:3.7499847412109375 - critic/returns/min:-2.561730146408081 - response_length/mean:1329.1591796875 - response_length/max:4096.0 - response_length/min:227.0 - response_length/clip_ratio:0.046875 - prompt_length/mean:84.296875 - prompt_length/max:149.0 - prompt_length/min:45.0 - prompt_length/clip_ratio:0.0 - timing_s/start_profile:0.0005802820669487119 - timing_s/generate_sequences:1311.4302978515625 - timing_s/reshard:205.53201293945312 - timing_s/gen:1549.5419146210188 - timing_s/reward:1.9772784940432757 - timing_s/old_log_prob:484.43600556300953 - timing_s/ref:133.7677788970759 - timing_s/adv:0.17963880999013782 - timing_s/update_actor:644.0539472099626 - timing_s/step:2814.4491687440313 - timing_s/stop_profile:0.0002078310353681445 - timing_per_token_ms/ref:0.046210349499708064 - timing_per_token_ms/gen:0.5692412726490985 - timing_per_token_ms/update_actor:0.22248973738390657 - timing_per_token_ms/adv:6.205658987388162e-05 - perf/total_num_tokens:2894758 - perf/time_per_step:2814.4491687440313 - perf/throughput:8.035425590966435

Key metrics:
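critic/rewards/mean: the mean reward. It should equal the sum of the reward values from the reward model divided by global_batch_size. The min and max values are the smallest and largest reward that the reward model and rule-based rewards assign to a single sample.
prompt_length/mean, prompt_length/max; response_length/mean, response_length/max: the mean and maximum input and output lengths. response_length strongly affects generation performance. Also watch the gap between response_length and max_tokens, so that not too many responses are truncated.
timing_s/[role]: the time spent by the corresponding process group. gen is the generation time, which usually makes up most of the total time; update_actor is the actor update time.
pg_loss, pg_clipfrac, ppo_kl, kl_loss: quantities computed in GRPO for the update. pg_loss reflects the overall state of the advantage function; if it is 0, the scores for the inputs do not differ and the model is not updated. ppo_kl reflects how much the output probabilities change across the update, and kl_loss reflects how far the model has moved from the reference model. A kl_loss that deviates from 0 at the first step can indicate a problem with weight loading.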
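The remark that pg_loss drops to 0 when all scores are equal follows from how GRPO normalizes rewards within each group of responses to the same prompt. The following is a minimal sketch of that group-relative normalization. It uses the sample standard deviation and a small epsilon as assumptions, and it is not a copy of verl's implementation.

```python
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's responses to zero mean and unit scale."""
    if len(rewards) < 2:
        # A single response has nothing to be compared against.
        return [0.0 for _ in rewards]
    mu = mean(rewards)
    sigma = stdev(rewards)  # sample standard deviation (assumption of this sketch)
    return [(r - mu) / (sigma + eps) for r in rewards]


# All responses scored the same: every advantage is zero, so there is no update signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))

# One correct response out of four gets a positive advantage, the rest negative.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
```

With rule-based 0/1 rewards, a prompt that every sampled response solves (or every response fails) contributes nothing to the gradient, which is why a pg_loss stuck at 0 points at uniform scores.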
Throughput:
End-to-end throughput: e2e_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_all
Training throughput: update_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_update
Inference throughput: vllm_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_gen
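The three formulas can be evaluated directly from a logged step. The following is a minimal sketch that parses the `name:value` pairs of a log line and applies them. It assumes that time_all, time_update, and time_gen correspond to the logged `timing_s/step`, `timing_s/update_actor`, and `timing_s/gen` fields; the function names are illustrative.

```python
import re

# Matches pairs such as "timing_s/gen:1549.54" or "actor/lr:1e-06".
PAIR = re.compile(r"([\w/]+):(-?\d+(?:\.\d+)?(?:e-?\d+)?)")


def parse_metrics(log_line: str) -> dict:
    """Parse 'name:value' pairs from one console log line into floats."""
    return {name: float(value) for name, value in PAIR.findall(log_line)}


def throughput(metrics, global_batch_size, n_samples_per_prompt, world_size):
    """Apply the three throughput formulas (tokens per second per device)."""
    tokens_per_device = (
        (metrics["response_length/mean"] + metrics["prompt_length/mean"])
        * global_batch_size * n_samples_per_prompt / world_size
    )
    return {
        "e2e_tps": tokens_per_device / metrics["timing_s/step"],
        "update_tps": tokens_per_device / metrics["timing_s/update_actor"],
        "vllm_tps": tokens_per_device / metrics["timing_s/gen"],
    }


# Excerpt of the step-1 log line, reduced to the fields the formulas need.
line = (
    "step:1 - response_length/mean:1329.1591796875 - prompt_length/mean:84.296875"
    " - timing_s/gen:1549.5419146210188 - timing_s/update_actor:644.0539472099626"
    " - timing_s/step:2814.4491687440313"
)
result = throughput(parse_metrics(line), global_batch_size=128,
                    n_samples_per_prompt=16, world_size=128)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

For this step, the computed e2e_tps agrees with the `perf/throughput:8.035425590966435` value in the log, which suggests that the logged throughput uses the same definition.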
Training script:
#!/usr/bin/env bash
# set -xeuo pipefail
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=True
kl_loss_coef=0.001
clip_ratio_low=0.2
clip_ratio_high=0.2
max_prompt_length=$((1024 * 8))
max_response_length=$((1024 * 4))
loss_agg_mode="token-mean"
train_prompt_bsz=${TRAIN_BS:-128}
n_resp_per_prompt=16
train_prompt_mini_bsz=128
# minimum number of nodes needed for Qwen3-235B-A22B
NNODES=16
# Paths
MODEL_PATH=/root/work/externalstorage/gpfsprd/OpenSourceModels/Qwen3-235B-A22B
DIST_CKPT_PATH="/root/work/externalstorage/gpfsprd/model_submit/Qwen3-235B-A22B-mcore"
TRAIN_FILE=/root/work/externalstorage/gpfsprd/verl/data/train.parquet
TEST_FILE=/root/work/externalstorage/gpfsprd/verl/data/test.parquet
# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameter
use_dynamic_bsz=True
actor_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 10 / 10))
infer_ppo_max_token_len=$(((max_prompt_length + max_response_length) * 1))
offload=True
OPTIM_OFFLOAD=${OPTIM_OFFLOAD:-True}
gen_tp=16
train_tp=${TP:-4}
train_pp=${PP:-8}
EP=${EP:-4}
ETP=1
CP=1
optimizer_offload_fraction=${OFFLOAD_FRACTION:-1.}
last_layer=${LAST_LAYER:-10}
project_name='Qwen3-235B-A22B-Huawei'
experiment_name='gsm8k-huawei-16nodes'
exp_dir="$(pwd)/outputs/${project_name}/${experiment_name}"
mkdir -p $exp_dir
# Copy this script into $exp_dir if it is not already there, to make reruns easier
if [ "$(cd "$(dirname "$0")" && pwd)" != "$exp_dir" ]; then
cp "$0" "$exp_dir"
fi
exp_log="${exp_dir}/train-$(date '+%Y%m%d_%H%M%S').log"
nohup python3 -m verl.trainer.main_ppo \
--config-path=config \
--config-name='ppo_megatron_trainer.yaml' \
data.train_files="${TRAIN_FILE}" \
data.val_files="${TEST_FILE}" \
data.prompt_key=prompt \
data.truncation='left' \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
data.train_batch_size=${train_prompt_bsz} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.enforce_eager=True \
algorithm.adv_estimator=${adv_estimator} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.model.use_fused_kernels=True \
actor_rollout_ref.actor.megatron.use_mbridge=True \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.actor.clip_ratio_c=10.0 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=2 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=4 \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.actor.optim.lr=1e-6 \
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=${optimizer_offload_fraction} \
+actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True \
+actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True \
+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.megatron.param_offload=${offload} \
actor_rollout_ref.actor.megatron.optimizer_offload=${OPTIM_OFFLOAD} \
actor_rollout_ref.actor.megatron.grad_offload=${offload} \
actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${train_pp} \
actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${train_tp} \
actor_rollout_ref.actor.megatron.expert_model_parallel_size=$EP \
actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=$ETP \
actor_rollout_ref.actor.megatron.context_parallel_size=${CP} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.optim.clip_grad=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.85 \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.enable_chunked_prefill=True \
actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k=${top_k} \
actor_rollout_ref.nccl_timeout=1200 \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${train_pp} \
actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${train_tp} \
actor_rollout_ref.ref.megatron.expert_model_parallel_size=$EP \
actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=$ETP \
actor_rollout_ref.ref.megatron.context_parallel_size=${CP} \
actor_rollout_ref.ref.megatron.param_offload=${offload} \
+actor_rollout_ref.actor.megatron.override_transformer_config.apply_rope_fusion=False \
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype=fp32 \
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_shared_expert_overlap=False \
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=True \
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_token_dispatcher_type=flex \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full \
+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1 \
+actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion=True \
+actor_rollout_ref.actor.megatron.override_transformer_config.moe_permute_fusion=True \
+actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=False \
+actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=False \
+actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=${last_layer} \
reward_model.reward_manager=naive \
trainer.logger='["tensorboard", "console"]' \
trainer.project_name="${project_name}" \
trainer.experiment_name="${experiment_name}" \
trainer.n_gpus_per_node=8 \
trainer.nnodes="${NNODES}" \
trainer.val_before_train=False \
trainer.test_freq=10 \
trainer.save_freq=50 \
trainer.total_epochs=10 \
trainer.default_local_dir="${exp_dir}" \
trainer.resume_mode=auto > $exp_log 2>&1 &
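The pipeline settings in both scripts can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming that the model has 94 transformer layers (check `num_hidden_layers` in your local config.json) and that pipeline stages without an explicit layer count share the remaining layers evenly. The helper names are hypothetical.

```python
def pipeline_layer_split(num_layers, pp_size, first=None, last=None):
    """Return per-stage layer counts; stages without a fixed count share the rest evenly."""
    fixed_stages = (first is not None) + (last is not None)
    remaining = num_layers - (first or 0) - (last or 0)
    middle_stages = pp_size - fixed_stages
    if middle_stages <= 0 or remaining % middle_stages != 0:
        raise ValueError("layers cannot be split evenly across the remaining stages")
    split = [remaining // middle_stages] * middle_stages
    if first is not None:
        split.insert(0, first)
    if last is not None:
        split.append(last)
    return split


def data_parallel_size(world_size, tp, pp, cp=1):
    """Data-parallel size left over after tensor, pipeline, and context parallelism."""
    return world_size // (tp * pp * cp)


NUM_LAYERS = 94  # assumption: verify against config.json

# First script: PP=8 with 11 layers in the first and last stage.
print(pipeline_layer_split(NUM_LAYERS, 8, first=11, last=11))
# Second script: PP=8 with last_layer=10.
print(pipeline_layer_split(NUM_LAYERS, 8, last=10))
# 16 nodes x 8 devices with TP=4, PP=8, CP=1.
print(data_parallel_size(128, tp=4, pp=8, cp=1))
```

Under these assumptions both layouts place 12 layers on each unconstrained stage, and the 128 devices form 4 data-parallel replicas.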

| Metric | Relative to H100 |
|---|---|
| End-to-end time timing_s/step | 0.29 |
| End-to-end throughput e2e_tps | 0.29 |
| Training throughput update_tps | 0.12 |
| Inference throughput vllm_tps | 0.36 |
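
The following error occurs while converting the HuggingFace weights to Megatron weights:

Solution:
The root cause is that in verl's verl/utils/device.py, torch.cuda.is_available() returns True, so execution takes the GPU branch. Set the flag to False directly: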
#is_cuda_available = torch.cuda.is_available()
is_cuda_available = False
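is_npu_available = is_torch_npu_available()

The validate_non_overlapping_shards_metadata function in python3.10/site-packages/torch/distributed/_shard/sharding_spec/_internals.py runs extremely slowly, and judging by its name it only performs validation. Making this function return immediately speeds up task startup significantly.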

The _validate_global_plan function in python3.10/site-packages/torch/distributed/checkpoint/default_planner.py performs validation when weights are saved. Make it return True immediately.

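Add trainer.val_before_train=false to the training script to skip the evaluation before training and shorten the run.
pip install tensordict==0.10.0

Set interleave=False to change how the batch is reordered when it is repeated. This works around inference timeouts caused by overly long inference sequences.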
GRPO algorithm
Modify the following code in /src/rl-verl-2.2.0/verl/verl/trainer/ppo/ray_trainer.py:
test_batch = test_batch.repeat(
repeat_times=self.config.actor_rollout_ref.rollout.val_kwargs.n, interleave=False
)
gen_batch = gen_batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=False)
batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=False)


DAPO algorithm
Modify the following code in /src/rl-verl-2.2.0/verl/recipe/dapo/dapo_ray_trainer.py:
gen_batch = gen_batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=False)
new_batch = new_batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=False)
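The effect of the `interleave` flag on sample order can be illustrated without any framework code. The following is a minimal sketch on plain lists, assuming that `interleave=True` places the copies of each prompt next to each other while `interleave=False` repeats the whole batch, and that the repeated batch is then split into contiguous chunks per rollout worker. `repeat_batch` and `split_contiguous` are illustrative stand-ins, not verl functions.

```python
def repeat_batch(prompts, repeat_times, interleave=True):
    """Repeat a batch either item by item (interleave) or as a whole."""
    if interleave:
        return [p for p in prompts for _ in range(repeat_times)]
    return [p for _ in range(repeat_times) for p in prompts]


def split_contiguous(items, num_workers):
    """Split a list into equal contiguous chunks, one per worker."""
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]


prompts = ["short", "long"]
for flag in (True, False):
    repeated = repeat_batch(prompts, repeat_times=2, interleave=flag)
    print(flag, split_contiguous(repeated, num_workers=2))
# True  [['short', 'short'], ['long', 'long']]
# False [['short', 'long'], ['short', 'long']]
```

In this toy case, `interleave=False` gives every worker a mix of prompts instead of concentrating all copies of the long one on a single worker, which is one plausible reading of why the change helps with the timeout.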
