Ascend-SACT/Qwen3-30B-A3B-verl


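Qwen3-30B-A3B verl Training Practice Guide

1. Model Overview and Scenario

1.1 Qwen3 Highlights

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built on large-scale training, Qwen3 delivers major advances in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

  • Uniquely supports seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient general-purpose dialogue) within a single model, ensuring optimal performance across scenarios.
  • Significantly enhanced reasoning, surpassing the earlier QwQ (in thinking mode) and Qwen2.5 instruct models (in non-thinking mode) on mathematics, code generation, and commonsense logical reasoning.
  • Superior human preference alignment, excelling in creative writing, role-play, multi-turn dialogue, and instruction following, for a more natural, engaging, and immersive conversational experience.
  • Strong agent capabilities, with precise integration of external tools in both thinking and non-thinking modes and leading performance among open-source models on complex agent tasks.
  • Supports more than 100 languages and dialects, with strong multilingual instruction following and translation.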

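1.2 Model Overview

Qwen3-30B-A3B has the following features:

  • Type: causal language model
  • Training stage: pretraining and post-training
  • Number of parameters: 30.5B in total, 3.3B activated
  • Number of parameters (non-embedding): 29.9B
  • Number of layers: 48
  • Number of attention heads (GQA): 32 for Q, 4 for KV
  • Number of experts: 128
  • Number of activated experts: 8
  • Context length: 32,768 tokens natively, extended to 131,072 tokens with YaRN.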
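
A few ratios follow directly from the figures in this list and help when reasoning about memory and parallelism. The sketch below only restates the listed numbers; the constant and function names are illustrative and do not come from any library.

```python
# Figures copied from the model overview list above.
TOTAL_PARAMS_B = 30.5
ACTIVE_PARAMS_B = 3.3
NUM_Q_HEADS = 32
NUM_KV_HEADS = 4
NUM_EXPERTS = 128
NUM_ACTIVE_EXPERTS = 8
NATIVE_CONTEXT = 32_768
EXTENDED_CONTEXT = 131_072


def derived_ratios() -> dict:
    """Derive simple ratios from the published model figures."""
    return {
        # Share of all parameters that take part in one forward pass.
        "active_param_fraction": ACTIVE_PARAMS_B / TOTAL_PARAMS_B,
        # With GQA, this many query heads share one key/value head.
        "q_heads_per_kv_head": NUM_Q_HEADS // NUM_KV_HEADS,
        # Share of experts routed to for each token.
        "active_expert_fraction": NUM_ACTIVE_EXPERTS / NUM_EXPERTS,
        # Ratio between the extended and the native context length.
        "context_extension_ratio": EXTENDED_CONTEXT // NATIVE_CONTEXT,
    }


if __name__ == "__main__":
    for name, value in derived_ratios().items():
        print(f"{name}: {value}")
```

Roughly one ninth of the parameters are active per token, which is why the rollout cost is closer to that of a small dense model while the weights still have to fit in memory in full.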

Model link: https://huggingface.co/Qwen/Qwen3-30B-A3B

This case uses Ascend A2 machines and the verl framework to run reinforcement learning on Qwen3-30B-A3B with the DAPO algorithm.

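2. Preparing the Runtime Environment

2.1 Hardware Environment

| Item | Configuration |
| --- | --- |
| Machine model | A2 |
| Test cluster | 4 nodes (32 cards) |
| Operating system | ARM |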

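2.2 Software Versions

| Software | Version | Deployment |
| --- | --- | --- |
| Driver | Ascend HDK 25.2.0 | Host |
| Firmware | Ascend HDK 25.2.0 | Host |
| Docker image OS | Ubuntu 20.04.6 | Container |
| Python | 3.10.18 | Container |
| CANN | 8.3.RC1 | Container |
| Torch | 2.7.1 | Container |
| Torch_npu | 2.7.1 | Container |
| vllm | master (38217877aa70041c0115ee367b75197af9cbc5ad) | Container |
| vllm-ascend | master (1de16ead8eecfec8903ec1b330b27a4fa2593c35) | Container |
| transformers | master (8365f70e925) | Container |
| MindSpeed | master (1cdd0abd75e40936ad31721c092f57c695dd72c4) | Container |
| Megatron-LM | core_v0.12.1 | Container |
| verl | master (796871d7d092f7cbc6a64e7f4a3796f7a2217f5e) | Container |
| MindSpeed-RL | 2.2.0 | Container |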
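
Before launching anything it can be worth confirming that the container really has the pinned release versions. The following is a minimal sketch using only the standard library; the distribution names `torch` and `torch_npu` are assumptions about how the packages are registered with pip, and packages built from a `master` commit are left out because their reported version strings vary.

```python
from importlib import metadata

# Release versions from the table above. The distribution names are assumptions;
# adjust them to whatever `pip list` shows inside your image.
EXPECTED = {"torch": "2.7.1", "torch_npu": "2.7.1"}


def check_versions(expected: dict) -> dict:
    """Return {package: (installed_or_None, expected, ok)} for each package."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        # Ignore local build tags such as "+cpu" when comparing.
        ok = have is not None and have.split("+")[0] == want
        report[name] = (have, want, ok)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={have} expected={want} {status}")
```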

2.3 Image Preparation

The verl image has been published. Image download link: Ascend-SACT/ascend_rl_train_image

Reference: https://gitcode.com/Ascend/MindSpeed-RL/blob/2.2.0/rl-plugin/README.MD

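3. Running Guide

3.1 Dataset Preparation

3.1.1 Dataset Overview

Training dataset: DAPO-Math-17k

  • DAPO-Math-17k is a large-scale dataset of 17,000 math problems, each paired with a single integer answer, used for benchmarking automated mathematical reasoning.

  • It supports advanced training methods such as Direct Preference Optimization (DPO) and self-aware iterative DPO driven by dynamic, error-focused sampling.

  • Its minimal, answer-only supervision makes it easier to combine supervised learning with reinforcement learning, improving model accuracy on math tasks.

Evaluation dataset: AIME 2024

This dataset contains the problems of the 2024 American Invitational Mathematics Examination (AIME), 30 problems in total. AIME is a prestigious high-school mathematics competition known for its challenging problems.

  • Covers multiple areas of mathematics (geometry, algebra, number theory, and so on)
  • Includes a detailed solution for each problem
  • Every problem has a specific numerical answer
  • High difficulty, suitable for testing advanced reasoning
  • Problems require multi-step reasoning and mathematical insight

3.1.2 Dataset Download

Training dataset download: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k/tree/main/data

Test dataset download: https://huggingface.co/datasets/BytedTsinghua-SIA/AIME-2024/tree/main/data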

3.2 Model Weight Preparation

3.2.1 Downloading the Model Weights

The weights can be downloaded with the HuggingFace CLI:

export HF_ENDPOINT=https://hf-mirror.com
pip install -U huggingface_hub
hf download Qwen/Qwen3-30B-A3B --local-dir ./Qwen3-30B-A3B

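After the download completes, the directory contains the following files (there are 16 weight shards; to save space, only some are listed here):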

Qwen3-30B-A3B/
├── LICENSE
├── README.md
├── config.json
├── generation_config.json
├── merges.txt
├── model-00001-of-00016.safetensors
├── model-00002-of-00016.safetensors
├── model-00003-of-00016.safetensors
├── model.safetensors.index.json
├── tokenizer.json
├── tokenizer_config.json
└── vocab.json

Weight size:

root@dl-de:~/work/filestorage/weights_hf# du -sh Qwen3-30B-A3B/
57G     Qwen3-30B-A3B/
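
An interrupted download can leave some shards missing even when the directory looks complete. The sketch below cross-checks the shard files against `model.safetensors.index.json`, whose `weight_map` field maps each parameter name to the shard that stores it. The function name is illustrative, and the path in the usage comment is the one used later in this guide.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list:
    """List shard files named in model.safetensors.index.json that are absent on disk."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # "weight_map" maps each parameter name to the shard file that stores it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]


# Example usage (path taken from the training script in this guide):
# print(missing_shards("/root/work/filestorage/weights_hf/Qwen3-30B-A3B"))
```

An empty list means every shard referenced by the index is present; it does not verify file integrity.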

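3.3 A2 Training

3.3.1 Starting the Ray Cluster

These steps are performed on the customer's platform. Details may vary between sites, but the commands are essentially the same:

Master node: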

ray start --head --port 6166 --dashboard-host=0.0.0.0 --dashboard-port=8260 --block

Worker nodes:

ray start --address="${MASTER_IP}:6166" --resources='{"NPU": 8}' --block

Complete scripts:

Master node, launched with:

bash ray_start_master.sh

Contents of ray_start_master.sh:

export RAY_DEDUP_LOGS=0
export HYDRA_FULL_ERROR=1
# TASK_QUEUE_ENABLE: task dispatch optimization; set to 1 for graph mode, 2 for non-graph mode
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
export VLLM_USE_V1=1
export VLLM_VERSION=0.10.0
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export HCCL_ASYNC_ERROR_HANDLING=0
export HCCL_EXEC_TIMEOUT=3600
export HCCL_CONNECT_TIMEOUT=3600
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH

ulimit -n 32768

SOCKET_IFNAME="eth0" # replace with the correct network interface name
export HCCL_SOCKET_IFNAME="${SOCKET_IFNAME}"
export GLOO_SOCKET_IFNAME="${SOCKET_IFNAME}"

ray start --head --port 6166 --dashboard-host=0.0.0.0 --dashboard-port=8260 --block 

Worker nodes, launched with:

bash ray_start_worker.sh ${MASTER_IP}

Contents of ray_start_worker.sh:

export RAY_DEDUP_LOGS=0
export HYDRA_FULL_ERROR=1
# TASK_QUEUE_ENABLE: task dispatch optimization; set to 1 for graph mode, 2 for non-graph mode
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
export VLLM_USE_V1=1
export VLLM_VERSION=0.10.0
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export HCCL_ASYNC_ERROR_HANDLING=0
export HCCL_EXEC_TIMEOUT=3600
export HCCL_CONNECT_TIMEOUT=3600
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH

ulimit -n 32768

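MASTER_IP="$1" # IP address of the master node, passed as the first argument
SOCKET_IFNAME="eth0" # replace with the correct network interface name
export HCCL_SOCKET_IFNAME="${SOCKET_IFNAME}"
export GLOO_SOCKET_IFNAME="${SOCKET_IFNAME}"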

ray start --address="${MASTER_IP}:6166" --resources='{"NPU": 8}' --block 

After entering the container, check the Ray status:

ray status

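Check that the card count matches expectations; the output below shows 4 nodes with 32 cards. Active lists the nodes in the cluster, and Total Usage shows used/total resources, including the NPU cards.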

(rl-verl-1030) root@dt-6b1cf:/workspace# ray status
======== Autoscaler status: 2025-12-20 16:37:16.050047 ========
Node status
---------------------------------------------------------------
Active:
 1 node_d7666ad2274f124c59ea91bb0f44ce379e70cc857766e48c0431778e
 1 node_d1d063dd3cec3561eec4c52e3927eeb932af6b3fd6c76348ef29b6c2
 1 node_d4eb95b7886d9e0fa0869e66bcdfbe31d3be4747d36e70a5f34c6412
 1 node_bef316e072da91a6014d998f7c1d937df7b33b32e3ca4890b0711c99
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/640.0 CPU
 0.0/32.0 NPU
 0B/7.38TiB memory
 0B/121.60GiB object_store_memory
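
The same check can be scripted so that a launch is refused when a node is missing. This is a minimal sketch that parses the text printed by `ray status`; it assumes the output keeps the layout shown above, and the function names are illustrative rather than part of Ray.

```python
import re


def parse_ray_resources(status_text: str) -> dict:
    """Map each countable resource in `ray status` output to (used, total)."""
    resources = {}
    # Matches lines such as " 0.0/32.0 NPU"; byte-sized entries such as
    # "0B/7.38TiB memory" do not match and are skipped on purpose.
    pattern = re.compile(r"^\s*([\d.]+)/([\d.]+)\s+(\w+)\s*$")
    for line in status_text.splitlines():
        match = pattern.match(line)
        if match:
            used, total, name = match.groups()
            resources[name] = (float(used), float(total))
    return resources


def count_nodes(status_text: str) -> int:
    """Sum the leading counts of lines shaped like ' 1 node_<hex id>'."""
    pattern = re.compile(r"^\s*(\d+)\s+node_[0-9a-f]+\s*$")
    total = 0
    for line in status_text.splitlines():
        match = pattern.match(line)
        if match:
            total += int(match.group(1))
    return total


# Example usage: feed it the captured output of `ray status`.
# import subprocess
# text = subprocess.run(["ray", "status"], capture_output=True, text=True).stdout
# assert parse_ray_resources(text)["NPU"][1] == 32.0 and count_nodes(text) == 4
```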

3.3.2 Launching the Training Job

Run the following on the master node only:

nohup bash test_dapo_qwen3_30b_fsdp_A+X_6k_sp1_eager.sh > qwen3-30b-a3b.log 2>&1 &

The complete script is as follows:

#!/usr/bin/env bash
set -xeuo pipefail

cd /src/rl-verl-2.2.0/verl/
project_name='DAPO'
exp_name='DAPO-Qwen3-30B'

adv_estimator=grpo

use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 6))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 4))
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

enable_filter_groups=True
filter_groups_metric=acc
max_num_gen_batches=10
train_prompt_bsz=32
gen_prompt_bsz=$((train_prompt_bsz * 3))
n_resp_per_prompt=16
max_num_seqs=1024
train_prompt_mini_bsz=32

# Ray
NNODES=4
# Paths

MODEL_PATH=/root/work/filestorage/weights_hf/Qwen3-30B-A3B
CKPTS_DIR=/root/work/filestorage/ckpt/Qwen3-30B-A3B/1220/
TRAIN_FILE=/root/work/filestorage/datasets/dapo-math-17k.parquet
TEST_FILE=/root/work/filestorage/datasets/aime-2024.parquet

# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7

# Performance Related Parameter
sp_size=1
use_dynamic_bsz=True
log_prob_micro_batch_size_per_gpu=1
ppo_micro_batch_size_per_gpu=1
actor_ppo_max_token_len=$((1024 * 16))
infer_ppo_max_token_len=$((1024 * 16))
offload=True
gen_tp=2
gen_dp=1
gen_world_size=$((NNODES*8))
enable_chunked_prefill=True

python3 -m recipe.dapo.main_dapo \
    data.train_files="${TRAIN_FILE}" \
    data.val_files="${TEST_FILE}" \
    data.prompt_key=prompt \
    data.truncation='left' \
    data.max_prompt_length=${max_prompt_length} \
    data.max_response_length=${max_response_length} \
    data.gen_batch_size=${gen_prompt_bsz} \
    data.train_batch_size=${train_prompt_bsz} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    actor_rollout_ref.rollout.max_num_seqs=${max_num_seqs} \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((1024 * 32)) \
    algorithm.adv_estimator=${adv_estimator} \
    algorithm.use_kl_in_reward=${use_kl_in_reward} \
    algorithm.kl_ctrl.kl_coef=${kl_coef} \
    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
    actor_rollout_ref.actor.clip_ratio_c=10.0 \
    algorithm.filter_groups.enable=${enable_filter_groups} \
    algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
    algorithm.filter_groups.metric=${filter_groups_metric} \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.model.path="${MODEL_PATH}" \
    +actor_rollout_ref.model.override_config.attention_dropout=0. \
    +actor_rollout_ref.model.override_config.embd_pdrop=0. \
    +actor_rollout_ref.model.override_config.resid_pdrop=0. \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
    actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.grad_clip=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.90 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
    +actor_rollout_ref.rollout.dp_model_parallel_size=${gen_dp} \
    +actor_rollout_ref.rollout.rollout_world_size=${gen_world_size} \
    +actor_rollout_ref.rollout.enable_expert_parallel=False \
    actor_rollout_ref.rollout.enable_chunked_prefill=${enable_chunked_prefill} \
    actor_rollout_ref.rollout.temperature=${temperature} \
    actor_rollout_ref.rollout.top_p=${top_p} \
    actor_rollout_ref.rollout.top_k="${top_k}" \
    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
    +actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.actor.strategy=fsdp \
    actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
    actor_rollout_ref.rollout.free_cache_engine=True \
    reward_model.reward_manager=dapo \
    reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
    reward_model.overlong_buffer.len=${overlong_buffer_len} \
    reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
    trainer.logger='["console"]' \
    trainer.project_name="${project_name}" \
    trainer.experiment_name="${exp_name}" \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes="${NNODES}" \
    trainer.device='npu' \
    trainer.val_before_train=False \
    trainer.test_freq=200 \
    trainer.save_freq=50 \
    trainer.total_epochs=100 \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.resume_mode=auto \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
    ++actor_rollout_ref.nccl_timeout=7200 \
    actor_rollout_ref.actor.use_torch_compile=False \
    actor_rollout_ref.ref.use_torch_compile=False "$@"

Script reference: https://gitcode.com/Ascend/MindSpeed-RL/blob/master/tests/verl_examples/configs/test_dapo_qwen3_30b_fsdp_A+X_6k.sh
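
Several quantities that matter for sizing are implied by the script's variables rather than written out. The sketch below recomputes them from the values in the A2 script; the key names are illustrative, and `rollout_replicas` rests on the assumption that each tensor-parallel group of `gen_tp` cards hosts one vLLM replica with `gen_dp=1`.

```python
# Values copied from the A2 training script above.
train_prompt_bsz = 32
gen_prompt_bsz = train_prompt_bsz * 3
n_resp_per_prompt = 16
max_prompt_length = 1024 * 2
max_response_length = 1024 * 6
nnodes = 4
npus_per_node = 8
gen_tp = 2


def derived_sizes() -> dict:
    """Recompute the sizes implied by the script's variables."""
    world_size = nnodes * npus_per_node
    return {
        "world_size": world_size,
        # Responses sampled in one generation batch.
        "responses_per_gen_batch": gen_prompt_bsz * n_resp_per_prompt,
        # Responses consumed by one training step.
        "responses_per_train_batch": train_prompt_bsz * n_resp_per_prompt,
        # Longest prompt plus longest response, in tokens.
        "max_sequence_tokens": max_prompt_length + max_response_length,
        # Assumption: one vLLM replica per tensor-parallel group.
        "rollout_replicas": world_size // gen_tp,
    }


if __name__ == "__main__":
    for name, value in derived_sizes().items():
        print(f"{name}: {value}")
```

The 8,192-token maximum sequence fits inside the 16K `ppo_max_token_len_per_gpu` budget, so with dynamic batch sizing at least one full-length sequence always fits in a micro-batch.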

During training, you can inspect the logs:

grep -nr "response" /tmp/ray/session_latest/logs

grep -nr "response" /tmp/ray/session_latest/logs > qwen3-30b-a3b-verl.log

Drag the log file onto https://curryrice233.github.io/TrainingLogParser/ to view the reward curve.

Inside the company, you can use: https://traininglogparser.openx.huawei.com/

[Figure: image-20251222222632106]

Log:

step:9 - batch/solve_none:37 - batch/solve_all:0 - batch/solve_partial:59 - global_seqlen/min:35980 - global_seqlen/max:100672 - global_seqlen/minmax_diff:64692 - global_seqlen/balanced_min:78230 - global_seqlen/balanced_max:78235 - global_seqlen/mean:78232.5 - actor/entropy:0.2248232364654541 - actor/pg_loss:0.06919658184051514 - actor/pg_clipfrac:0.0 - actor/ppo_kl:0.0 - actor/pg_clipfrac_lower:0.0 - actor/grad_norm:0.0859375 - perf/mfu/actor:0.020165570134103004 - perf/max_memory_allocated_gb:122.66994619369507 - perf/max_memory_reserved_gb:126.806640625 - perf/cpu_memory_used_gb:232.34037017822266 - actor/lr:8e-07 - critic/score/mean:-0.7969036102294922 - critic/score/max:1.0 - critic/score/min:-2.0 - critic/rewards/mean:-0.7969036102294922 - critic/rewards/max:1.0 - critic/rewards/min:-2.0 - critic/advantages/mean:-0.059114668518304825 - critic/advantages/max:3.745802402496338 - critic/advantages/min:-2.352499008178711 - critic/returns/mean:-0.059114668518304825 - critic/returns/max:3.745802402496338 - critic/returns/min:-2.352499008178711 - response_length/mean:4725.28125 - response_length/max:6144.0 - response_length/min:1439.0 - response_length/clip_ratio:0.310546875 - response_length_non_aborted/mean:4725.28125 - response_length_non_aborted/max:6144.0 - response_length_non_aborted/min:1439.0 - response_length_non_aborted/clip_ratio:0.310546875 - response/aborted_ratio:0.0 - prompt_length/mean:164.25 - prompt_length/max:383.0 - prompt_length/min:91.0 - prompt_length/clip_ratio:0.0 - timing_s/start_profile:8.850730955600739e-05 - timing_s/generate_sequences:142.09425354003906 - timing_s/reshard:10.942972183227539 - timing_s/generation_timing/max:148.48916625976562 - timing_s/generation_timing/min:131.9227752685547 - timing_s/generation_timing/topk_ratio:0.125 - timing_s/gen:161.7286260649562 - timing_s/reward:2.2595805628225207 - timing_s/old_log_prob:16.38148122187704 - timing_s/adv:0.017708600498735905 - timing_s/update_actor:130.07783674728125 - 
timing_s/step:310.60584014933556 - timing_s/stop_profile:9.483285248279572e-05 - timing_per_token_ms/adv:7.073706778966504e-06 - timing_per_token_ms/gen:0.06684813158647807 - timing_per_token_ms/update_actor:0.0519596382366988 - perf/total_num_tokens:2503440 - perf/time_per_step:310.60584014933556 - perf/throughput:251.87066657338687 - train/num_gen_batches:1
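
For quick offline analysis, a console line like the one above can be turned into a dictionary. This is a minimal sketch that assumes fields are separated by " - " and that each field is `key:value` with a numeric value; lines taken from the Ray log files may carry a worker prefix before `step`, which would need stripping first.

```python
def parse_metrics(line: str) -> dict:
    """Split a 'key:value - key:value' console line into a {key: float} dict."""
    metrics = {}
    for field in line.split(" - "):
        field = field.strip()
        if not field:
            continue
        # Keys contain '/', values never contain ':', so split on the last colon.
        key, sep, value = field.rpartition(":")
        if not sep:
            continue
        try:
            metrics[key] = float(value)
        except ValueError:
            continue  # skip fields whose value is not numeric
    return metrics


# Example usage with a shortened line:
# m = parse_metrics("step:9 - critic/score/min:-2.0 - timing_s/gen:161.7286260649562")
# m["timing_s/gen"]
```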

Key metrics:

  • prompt_length: inference input length
  • response_length: inference output length
  • reward: score
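
The reward also decides which prompts reach the trainer. With `enable_filter_groups=True` and `filter_groups_metric=acc`, DAPO's dynamic sampling drops prompt groups whose responses all score the same. The sketch below illustrates that idea only and is not verl's implementation; reading the log fields `batch/solve_none`, `batch/solve_all`, and `batch/solve_partial` as per-prompt group counts is an assumption, supported by 37 + 0 + 59 matching the generation batch of 96 prompts.

```python
from collections import defaultdict


def _group_by_prompt(samples):
    """Group accuracy scores by prompt id. `samples` is a list of (prompt_id, acc)."""
    groups = defaultdict(list)
    for prompt_id, acc in samples:
        groups[prompt_id].append(acc)
    return groups


def keep_mixed_groups(samples):
    """Keep only samples whose prompt group has both correct and incorrect responses."""
    groups = _group_by_prompt(samples)
    # Identical scores within a group give zero group-relative advantage.
    kept = {pid for pid, accs in groups.items() if len(set(accs)) > 1}
    return [(pid, acc) for pid, acc in samples if pid in kept]


def summarize_groups(samples):
    """Count groups where none, all, or some of the responses are correct."""
    counts = {"solve_none": 0, "solve_all": 0, "solve_partial": 0}
    for accs in _group_by_prompt(samples).values():
        if all(a > 0 for a in accs):
            counts["solve_all"] += 1
        elif not any(a > 0 for a in accs):
            counts["solve_none"] += 1
        else:
            counts["solve_partial"] += 1
    return counts
```

Under this reading, step 9 had 59 usable prompts from a single generation batch, more than the 32 needed for a training batch, which agrees with `train/num_gen_batches:1` in the log.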

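critic/rewards/mean: the mean reward, which should equal the sum of the reward model's rewards divided by global_batch_size; min and max are the smallest and largest rewards that the reward model and rule-based rewards assign to a single sample.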
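
A minimum of -2.0 can be explained by DAPO's soft overlong penalty stacking on top of a wrong-answer score. The sketch below follows the penalty as described for DAPO, using the script's settings as defaults; that the wrong-answer score is -1 and that verl's `dapo` reward manager applies exactly this formula are assumptions.

```python
def overlong_penalty(response_len: int,
                     max_response_len: int = 1024 * 6,
                     buffer_len: int = 1024 * 4,
                     penalty_factor: float = 1.0) -> float:
    """Soft length penalty: 0 up to the threshold, then linear down to -penalty_factor."""
    # Responses up to this length are not penalized (2048 tokens with the defaults).
    expected_len = max_response_len - buffer_len
    exceed_len = response_len - expected_len
    return min(-exceed_len / buffer_len * penalty_factor, 0.0)


if __name__ == "__main__":
    for length in (1439, 2048, 4096, 6144):
        print(length, overlong_penalty(length))
```

With these settings a response truncated at 6,144 tokens receives the full penalty of -1.0; added to an assumed -1 for an incorrect answer, that gives the -2.0 seen in `critic/score/min`.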

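prompt_length/mean, prompt_length/max; response_length/mean, response_length/max: the mean and maximum input and output lengths. response_length has a strong effect on generation performance; also watch its gap to max_tokens so that not too many responses are truncated.

timing_s/[role]: time spent by the corresponding process group. gen is the generation time, which usually makes up most of the total; update is the actor update time.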
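
To see where a step's time goes, the `timing_s/*` values can be expressed as shares of `timing_s/step`. This small sketch uses the step 9 values from the log above; the function name is illustrative.

```python
# timing_s/* values copied from the step 9 log line above, in seconds.
timing_s = {
    "gen": 161.7286260649562,
    "reward": 2.2595805628225207,
    "old_log_prob": 16.38148122187704,
    "adv": 0.017708600498735905,
    "update_actor": 130.07783674728125,
    "step": 310.60584014933556,
}


def phase_shares(timing: dict) -> dict:
    """Fraction of the whole step spent in each phase (excluding 'step' itself)."""
    total = timing["step"]
    return {name: seconds / total for name, seconds in timing.items() if name != "step"}


if __name__ == "__main__":
    for name, share in sorted(phase_shares(timing_s).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {share:.1%}")
```

For this step, generation takes about 52% of the time and the actor update about 42%, so both phases are worth tuning.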

Throughput:

End-to-end throughput: e2e_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_all

Training throughput: update_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_update

Inference throughput: vllm_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_gen
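
As a worked example, the three formulas can be evaluated with the step 9 log values: `train_prompt_bsz=32` as global_batch_size, `n_resp_per_prompt=16`, and 32 cards. Mapping time_all, time_update, and time_gen to `timing_s/step`, `timing_s/update_actor`, and `timing_s/gen` is an assumption based on the metric descriptions above.

```python
def tokens_per_device(prompt_len_mean: float, response_len_mean: float,
                      global_batch_size: int, n_samples_per_prompt: int,
                      world_size: int) -> float:
    """Average number of tokens handled by one device in one step."""
    total = (prompt_len_mean + response_len_mean) * global_batch_size * n_samples_per_prompt
    return total / world_size


# Step 9 values from the log above.
tokens = tokens_per_device(164.25, 4725.28125, 32, 16, 32)

e2e_tps = tokens / 310.60584014933556      # divided by timing_s/step
update_tps = tokens / 130.07783674728125   # divided by timing_s/update_actor
vllm_tps = tokens / 161.7286260649562      # divided by timing_s/gen

if __name__ == "__main__":
    print(f"tokens per device: {tokens}")
    print(f"e2e_tps={e2e_tps:.2f} update_tps={update_tps:.2f} vllm_tps={vllm_tps:.2f}")
```

The per-device token count equals `global_seqlen/mean` (78232.5) and e2e_tps reproduces `perf/throughput` (about 251.87) from the same log line, which suggests the formula matches how the framework reports throughput.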

3.4 H100 Training

Training script:

#!/usr/bin/env bash
set -xeuo pipefail

project_name='DAPO-myconfig'
exp_name='DAPO-Qwen3-30B'

adv_estimator=grpo

use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0

clip_ratio_low=0.2
clip_ratio_high=0.28

max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 6))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 4))
overlong_penalty_factor=1.0

loss_agg_mode="token-mean"

enable_filter_groups=True
filter_groups_metric=acc
max_num_gen_batches=10
train_prompt_bsz=32
gen_prompt_bsz=$((train_prompt_bsz * 3))
n_resp_per_prompt=16
max_num_seqs=1024
train_prompt_mini_bsz=32

# Ray
NNODES=4
# Paths
MODEL_PATH="/root/work/filestorage/GroupPostTrain/OpenSourceModels/Qwen3-30B-A3B"
CKPTS_DIR="/root/work/externalstorage/gpfsprd/verl/outputs/${project_name}/${exp_name}"
TRAIN_FILE="/root/work/externalstorage/gpfsprd/datasets/DAPO-Math-17k/raw/dapo-math-17k.parquet"
TEST_FILE="/root/work/externalstorage/gpfsprd/datasets/DAPO-Math-17k/dapo-aime24.parquet"

# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7

# Performance Related Parameter
sp_size=1
use_dynamic_bsz=True
log_prob_micro_batch_size_per_gpu=1
ppo_micro_batch_size_per_gpu=1
actor_ppo_max_token_len=$((1024 * 16))
infer_ppo_max_token_len=$((1024 * 32))
offload=True
gen_tp=2
enable_chunked_prefill=True

python3 -m recipe.dapo.main_dapo \
    data.train_files="${TRAIN_FILE}" \
    data.val_files="${TEST_FILE}" \
    data.prompt_key=prompt \
    data.truncation='left' \
    data.max_prompt_length=${max_prompt_length} \
    data.max_response_length=${max_response_length} \
    data.gen_batch_size=${gen_prompt_bsz} \
    data.train_batch_size=${train_prompt_bsz} \
    actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
    actor_rollout_ref.rollout.max_num_seqs=${max_num_seqs} \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((1024 * 32)) \
    algorithm.adv_estimator=${adv_estimator} \
    algorithm.use_kl_in_reward=${use_kl_in_reward} \
    algorithm.kl_ctrl.kl_coef=${kl_coef} \
    actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
    actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
    actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
    actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
    actor_rollout_ref.actor.clip_ratio_c=10.0 \
    algorithm.filter_groups.enable=${enable_filter_groups} \
    algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
    algorithm.filter_groups.metric=${filter_groups_metric} \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
    actor_rollout_ref.model.path="${MODEL_PATH}" \
    +actor_rollout_ref.model.override_config.attention_dropout=0. \
    +actor_rollout_ref.model.override_config.embd_pdrop=0. \
    +actor_rollout_ref.model.override_config.resid_pdrop=0. \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
    actor_rollout_ref.actor.optim.weight_decay=0.1 \
    actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
    actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.grad_clip=1.0 \
    actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
    actor_rollout_ref.rollout.enable_chunked_prefill=${enable_chunked_prefill} \
    actor_rollout_ref.rollout.temperature=${temperature} \
    actor_rollout_ref.rollout.top_p=${top_p} \
    actor_rollout_ref.rollout.top_k="${top_k}" \
    actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
    actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
    actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
    actor_rollout_ref.rollout.val_kwargs.do_sample=True \
    actor_rollout_ref.rollout.val_kwargs.n=1 \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
    +actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
    actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
    actor_rollout_ref.actor.strategy=fsdp \
    actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
    actor_rollout_ref.rollout.free_cache_engine=True \
    reward_model.reward_manager=dapo \
    reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
    reward_model.overlong_buffer.len=${overlong_buffer_len} \
    reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
    trainer.logger='["tensorboard", "console"]' \
    trainer.project_name="${project_name}" \
    trainer.experiment_name="${exp_name}" \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes="${NNODES}" \
    trainer.val_before_train=False \
    trainer.test_freq=200 \
    trainer.save_freq=50 \
    trainer.total_epochs=100 \
    trainer.default_local_dir="${CKPTS_DIR}" \
    trainer.resume_mode=auto \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
    ++actor_rollout_ref.nccl_timeout=7200 \
    actor_rollout_ref.actor.use_torch_compile=False \
    actor_rollout_ref.ref.use_torch_compile=False "$@"

4. Metric Comparison

4.1 Metric Curves

[Figure: image-20251222223407145]

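4.2 Performance Comparison

| Metric | vs. H100 |
| --- | --- |
| End-to-end time timing_s/step | 0.27 |
| Training throughput update_tps | 1.59 |
| Inference throughput vllm_tps | 0.16 |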