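Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built on large-scale training, Qwen3 delivers major advances in reasoning, instruction following, agent capabilities, and multilingual support.
Qwen3-30B-A3B model link: https://huggingface.co/Qwen/Qwen3-30B-A3B
This case uses Ascend A2 machines and the verl framework to run DAPO reinforcement learning on Qwen3-30B-A3B.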
| Hardware | Configuration |
|---|---|
| Machine model | A2 |
| Test cluster | 4 nodes (32 NPUs) |
| Operating system | ARM |
| Software | Version | Deployment |
|---|---|---|
| Driver | AscendHDK 25.2.0 | Host |
| Firmware | AscendHDK 25.2.0 | Host |
| Docker image OS | Ubuntu 20.04.6 | Container |
| Python | 3.10.18 | Container |
| CANN | 8.3.RC1 | Container |
| Torch | 2.7.1 | Container |
| Torch_npu | 2.7.1 | Container |
| vllm | master(38217877aa70041c0115ee367b75197af9cbc5ad) | Container |
| vllm-ascend | master(1de16ead8eecfec8903ec1b330b27a4fa2593c35) | Container |
| transformers | master(8365f70e925) | Container |
| MindSpeed | master(1cdd0abd75e40936ad31721c092f57c695dd72c4) | Container |
| Megatron-LM | core_v0.12.1 | Container |
| verl | master(796871d7d092f7cbc6a64e7f4a3796f7a2217f5e) | Container |
| MindSpeed-RL | 2.2.0 | Container |
The verl image has been published. Image download link: Ascend-SACT/ascend_rl_train_image
Reference: https://gitcode.com/Ascend/MindSpeed-RL/blob/2.2.0/rl-plugin/README.MD
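Training dataset: DAPO-Math-17k
DAPO-Math-17k is a large-scale dataset of 17,000 math problems, each paired with a single integer answer, used for benchmarking automated mathematical reasoning.
It supports advanced training methods such as direct preference optimization (DPO) and self-aware iterative DPO with dynamic, error-focused sampling.
Its minimal, answer-only supervision design makes it easier to combine supervised learning with reinforcement learning, improving model accuracy on math tasks.
Evaluation dataset: AIME 2024
This dataset contains the 30 problems of the 2024 American Invitational Mathematics Examination (AIME). AIME is a prestigious high-school mathematics competition known for its challenging problems.
Training dataset download: https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17k/tree/main/data
Test dataset download: https://huggingface.co/datasets/BytedTsinghua-SIA/AIME-2024/tree/main/data
Downloads can be done with the HuggingFace CLI: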
export HF_ENDPOINT=https://hf-mirror.com
pip install -U huggingface_hub
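hf download Qwen/Qwen3-30B-A3B --local-dir ./Qwen3-30B-A3B
After the download completes, the directory contains the following files (there are 16 weight shards; to save space, only some are listed here):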
Qwen3-30B-A3B/
├── LICENSE
├── README.md
├── config.json
├── generation_config.json
├── merges.txt
├── model-00001-of-00016.safetensors
├── model-00002-of-00016.safetensors
├── model-00003-of-00016.safetensors
├── model.safetensors.index.json
├── tokenizer.json
├── tokenizer_config.json
└── vocab.json
Weight size:
root@dl-de:~/work/filestorage/weights_hf# du -sh Qwen3-30B-A3B/
57G Qwen3-30B-A3B/
The following steps were performed on the customer's platform. Details may differ between sites, but the commands are largely the same:
Master node:
ray start --head --port 6166 --dashboard-host=0.0.0.0 --dashboard-port=8260 --block
Worker nodes:
ray start --address="${MASTER_IP}:6166" --resources='{"NPU": 8}' --block
Full scripts:
Master node:
bash ray_start_master.sh
Contents of ray_start_master.sh:
export RAY_DEDUP_LOGS=0
export HYDRA_FULL_ERROR=1
# TASK_QUEUE_ENABLE: task dispatch optimization; set to 1 for graph mode, 2 for non-graph mode
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
export VLLM_USE_V1=1
export VLLM_VERSION=0.10.0
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export HCCL_ASYNC_ERROR_HANDLING=0
export HCCL_EXEC_TIMEOUT=3600
export HCCL_CONNECT_TIMEOUT=3600
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH
ulimit -n 32768
SOCKET_IFNAME="eth0" # replace with the correct network interface name
export HCCL_SOCKET_IFNAME="${SOCKET_IFNAME}"
export GLOO_SOCKET_IFNAME="${SOCKET_IFNAME}"
ray start --head --port 6166 --dashboard-host=0.0.0.0 --dashboard-port=8260 --block
Worker nodes:
bash ray_start_worker.sh ${MASTER_IP}
Contents of ray_start_worker.sh:
export RAY_DEDUP_LOGS=0
export HYDRA_FULL_ERROR=1
# TASK_QUEUE_ENABLE: task dispatch optimization; set to 1 for graph mode, 2 for non-graph mode
export TASK_QUEUE_ENABLE=1
export CPU_AFFINITY_CONF=1
export VLLM_USE_V1=1
export VLLM_VERSION=0.10.0
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export HCCL_ASYNC_ERROR_HANDLING=0
export HCCL_EXEC_TIMEOUT=3600
export HCCL_CONNECT_TIMEOUT=3600
export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/driver:$LD_LIBRARY_PATH
ulimit -n 32768
MASTER_IP=$1
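SOCKET_IFNAME="eth0" # replace with the correct network interface name
export HCCL_SOCKET_IFNAME="${SOCKET_IFNAME}"
export GLOO_SOCKET_IFNAME="${SOCKET_IFNAME}"
ray start --address="${MASTER_IP}:6166" --resources='{"NPU": 8}' --block
After entering the container, check the Ray status:
ray status
Check that the NPU count matches expectations. The output below is for 4 nodes with 32 NPUs: Active lists the nodes, and Total Usage shows the number of NPUs in use out of the total.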
(rl-verl-1030) root@dt-6b1cf:/workspace# ray status
======== Autoscaler status: 2025-12-20 16:37:16.050047 ========
Node status
---------------------------------------------------------------
Active:
1 node_d7666ad2274f124c59ea91bb0f44ce379e70cc857766e48c0431778e
1 node_d1d063dd3cec3561eec4c52e3927eeb932af6b3fd6c76348ef29b6c2
1 node_d4eb95b7886d9e0fa0869e66bcdfbe31d3be4747d36e70a5f34c6412
1 node_bef316e072da91a6014d998f7c1d937df7b33b32e3ca4890b0711c99
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/640.0 CPU
0.0/32.0 NPU
0B/7.38TiB memory
0B/121.60GiB object_store_memory
The training job only needs to be launched on the master node:
nohup bash test_dapo_qwen3_30b_fsdp_A+X_6k_sp1_eager.sh > qwen3-30b-a3b.log 2>&1 &
The full script is as follows:
#!/usr/bin/env bash
set -xeuo pipefail
cd /src/rl-verl-2.2.0/verl/
project_name='DAPO'
exp_name='DAPO-Qwen3-30B'
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0
clip_ratio_low=0.2
clip_ratio_high=0.28
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 6))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 4))
overlong_penalty_factor=1.0
loss_agg_mode="token-mean"
enable_filter_groups=True
filter_groups_metric=acc
max_num_gen_batches=10
train_prompt_bsz=32
gen_prompt_bsz=$((train_prompt_bsz * 3))
n_resp_per_prompt=16
max_num_seqs=1024
train_prompt_mini_bsz=32
# Ray
NNODES=4
# Paths
MODEL_PATH=/root/work/filestorage/weights_hf/Qwen3-30B-A3B
CKPTS_DIR=/root/work/filestorage/ckpt/Qwen3-30B-A3B/1220/
TRAIN_FILE=/root/work/filestorage/datasets/dapo-math-17k.parquet
TEST_FILE=/root/work/filestorage/datasets/aime-2024.parquet
# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameter
sp_size=1
use_dynamic_bsz=True
log_prob_micro_batch_size_per_gpu=1
ppo_micro_batch_size_per_gpu=1
actor_ppo_max_token_len=$((1024 * 16))
infer_ppo_max_token_len=$((1024 * 16))
offload=True
gen_tp=2
gen_dp=1
gen_world_size=$((NNODES*8))
enable_chunked_prefill=True
python3 -m recipe.dapo.main_dapo \
data.train_files="${TRAIN_FILE}" \
data.val_files="${TEST_FILE}" \
data.prompt_key=prompt \
data.truncation='left' \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
data.gen_batch_size=${gen_prompt_bsz} \
data.train_batch_size=${train_prompt_bsz} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.max_num_seqs=${max_num_seqs} \
actor_rollout_ref.rollout.max_num_batched_tokens=$((1024 * 32)) \
algorithm.adv_estimator=${adv_estimator} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.actor.clip_ratio_c=10.0 \
algorithm.filter_groups.enable=${enable_filter_groups} \
algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
algorithm.filter_groups.metric=${filter_groups_metric} \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
+actor_rollout_ref.model.override_config.attention_dropout=0. \
+actor_rollout_ref.model.override_config.embd_pdrop=0. \
+actor_rollout_ref.model.override_config.resid_pdrop=0. \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.90 \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
+actor_rollout_ref.rollout.dp_model_parallel_size=${gen_dp} \
+actor_rollout_ref.rollout.rollout_world_size=${gen_world_size} \
+actor_rollout_ref.rollout.enable_expert_parallel=False \
actor_rollout_ref.rollout.enable_chunked_prefill=${enable_chunked_prefill} \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k="${top_k}" \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.rollout.enforce_eager=True \
actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.actor.strategy=fsdp \
actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
actor_rollout_ref.rollout.free_cache_engine=True \
reward_model.reward_manager=dapo \
reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
reward_model.overlong_buffer.len=${overlong_buffer_len} \
reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
trainer.logger='["console"]' \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.n_gpus_per_node=8 \
trainer.nnodes="${NNODES}" \
trainer.device='npu' \
trainer.val_before_train=False \
trainer.test_freq=200 \
trainer.save_freq=50 \
trainer.total_epochs=100 \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.resume_mode=auto \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
++actor_rollout_ref.nccl_timeout=7200 \
actor_rollout_ref.actor.use_torch_compile=False \
actor_rollout_ref.ref.use_torch_compile=False $@
During training, the logs can be inspected:
grep -nr "response" /tmp/ray/session_latest/logs
grep -nr "response" /tmp/ray/session_latest/logs > qwen3-235b-verl.log
The log file can be dragged into https://curryrice233.github.io/TrainingLogParser/ to view the reward curve.
Inside the company, the following can be used instead: https://traininglogparser.openx.huawei.com/

Log:
step:9 - batch/solve_none:37 - batch/solve_all:0 - batch/solve_partial:59 - global_seqlen/min:35980 - global_seqlen/max:100672 - global_seqlen/minmax_diff:64692 - global_seqlen/balanced_min:78230 - global_seqlen/balanced_max:78235 - global_seqlen/mean:78232.5 - actor/entropy:0.2248232364654541 - actor/pg_loss:0.06919658184051514 - actor/pg_clipfrac:0.0 - actor/ppo_kl:0.0 - actor/pg_clipfrac_lower:0.0 - actor/grad_norm:0.0859375 - perf/mfu/actor:0.020165570134103004 - perf/max_memory_allocated_gb:122.66994619369507 - perf/max_memory_reserved_gb:126.806640625 - perf/cpu_memory_used_gb:232.34037017822266 - actor/lr:8e-07 - critic/score/mean:-0.7969036102294922 - critic/score/max:1.0 - critic/score/min:-2.0 - critic/rewards/mean:-0.7969036102294922 - critic/rewards/max:1.0 - critic/rewards/min:-2.0 - critic/advantages/mean:-0.059114668518304825 - critic/advantages/max:3.745802402496338 - critic/advantages/min:-2.352499008178711 - critic/returns/mean:-0.059114668518304825 - critic/returns/max:3.745802402496338 - critic/returns/min:-2.352499008178711 - response_length/mean:4725.28125 - response_length/max:6144.0 - response_length/min:1439.0 - response_length/clip_ratio:0.310546875 - response_length_non_aborted/mean:4725.28125 - response_length_non_aborted/max:6144.0 - response_length_non_aborted/min:1439.0 - response_length_non_aborted/clip_ratio:0.310546875 - response/aborted_ratio:0.0 - prompt_length/mean:164.25 - prompt_length/max:383.0 - prompt_length/min:91.0 - prompt_length/clip_ratio:0.0 - timing_s/start_profile:8.850730955600739e-05 - timing_s/generate_sequences:142.09425354003906 - timing_s/reshard:10.942972183227539 - timing_s/generation_timing/max:148.48916625976562 - timing_s/generation_timing/min:131.9227752685547 - timing_s/generation_timing/topk_ratio:0.125 - timing_s/gen:161.7286260649562 - timing_s/reward:2.2595805628225207 - timing_s/old_log_prob:16.38148122187704 - timing_s/adv:0.017708600498735905 - timing_s/update_actor:130.07783674728125 - 
timing_s/step:310.60584014933556 - timing_s/stop_profile:9.483285248279572e-05 - timing_per_token_ms/adv:7.073706778966504e-06 - timing_per_token_ms/gen:0.06684813158647807 - timing_per_token_ms/update_actor:0.0519596382366988 - perf/total_num_tokens:2503440 - perf/time_per_step:310.60584014933556 - perf/throughput:251.87066657338687 - train/num_gen_batches:1
Key metrics:
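critic/rewards/mean: the mean reward. It should equal the sum of the rewards from the reward model divided by global_batch_size. The min and max values are the minimum and maximum reward that the reward model and rule-based rewards assign to a single sample.
prompt_length/mean, prompt_length/max; response_length/mean, response_length/max: the mean and maximum input and output lengths. response_length strongly affects generation performance. Also watch how close it is to max_tokens, so that not too many responses are truncated.
timing_s/[role]: the time spent by the corresponding process group. gen is the generation time, which usually makes up most of the total time; update is the actor update time.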
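To pull individual metrics out of a step line like the one above, a small awk helper is usually enough. The sketch below is an illustration and not part of verl: the function name `metric` is hypothetical, and it assumes fields are separated by ` - ` and written as `key:value`, as in the sample log. Lines read from Ray's log files may carry a process prefix that would need to be stripped first. The sample line is a shortened copy of the step-9 log.

```shell
#!/usr/bin/env bash
# metric NAME: read verl step lines on stdin and print the value of NAME.
# Hypothetical helper; assumes " - " separates fields and each field is "key:value".
metric() {
  awk -v key="$1" -F' - ' '{
    for (i = 1; i <= NF; i++) {
      pos = index($i, ":")               # the first colon ends the key
      if (pos > 0 && substr($i, 1, pos - 1) == key) {
        print substr($i, pos + 1)        # print the raw value text unchanged
      }
    }
  }'
}

# Shortened copy of the step-9 line above (only a few fields kept).
line='step:9 - critic/rewards/mean:-0.7969036102294922 - response_length/mean:4725.28125 - response_length/clip_ratio:0.310546875 - timing_s/gen:161.7286260649562 - timing_s/update_actor:130.07783674728125 - timing_s/step:310.60584014933556'

for name in critic/rewards/mean response_length/clip_ratio timing_s/gen timing_s/step; do
  printf '%s=%s\n' "$name" "$(printf '%s\n' "$line" | metric "$name")"
done
```

In this step response_length/clip_ratio is about 0.31 and response_length/max equals the configured max_response_length of 6144, which suggests that a sizeable share of responses hit the length limit.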
Throughput:
End-to-end throughput: e2e_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_all
Training throughput: update_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_update
Inference throughput: vllm_tps = (response_length_mean + prompt_length_mean) × global_batch_size × n_samples_per_prompt / world_size / time_gen
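As a worked check of the formulas above, the sketch below plugs in values from the step-9 log line. It rests on several assumptions: global_batch_size is taken to be train_prompt_bsz (32), n_samples_per_prompt to be n_resp_per_prompt (16), world_size to be the 32 NPUs, and time_all, time_update, and time_gen are mapped to timing_s/step, timing_s/update_actor, and timing_s/gen. The helper name `tps` is hypothetical.

```shell
#!/usr/bin/env bash
# tps RESP_MEAN PROMPT_MEAN GLOBAL_BSZ N_SAMPLES WORLD_SIZE TIME_S
# Per-device throughput in tokens/s, following the formulas above.
# Hypothetical helper; awk is used because bash has no floating-point arithmetic.
tps() {
  awk -v r="$1" -v p="$2" -v b="$3" -v n="$4" -v w="$5" -v t="$6" \
    'BEGIN { printf "%.2f\n", (r + p) * b * n / w / t }'
}

# Values from the step-9 log line and the training script (see assumptions above).
resp=4725.28125     # response_length/mean
prompt=164.25       # prompt_length/mean
bsz=32              # train_prompt_bsz
n=16                # n_resp_per_prompt
world=32            # 4 nodes x 8 NPUs

e2e_tps=$(tps "$resp" "$prompt" "$bsz" "$n" "$world" 310.60584014933556)     # timing_s/step
update_tps=$(tps "$resp" "$prompt" "$bsz" "$n" "$world" 130.07783674728125)  # timing_s/update_actor
vllm_tps=$(tps "$resp" "$prompt" "$bsz" "$n" "$world" 161.7286260649562)     # timing_s/gen

echo "e2e_tps=${e2e_tps} update_tps=${update_tps} vllm_tps=${vllm_tps}"
```

Under these assumptions e2e_tps comes out at about 251.87, which agrees with perf/throughput:251.87066657338687 in the same log line, and the token total (r + p) × 32 × 16 matches perf/total_num_tokens:2503440.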
Training script
#!/usr/bin/env bash
set -xeuo pipefail
project_name='DAPO-myconfig'
exp_name='DAPO-Qwen3-30B'
adv_estimator=grpo
use_kl_in_reward=False
kl_coef=0.0
use_kl_loss=False
kl_loss_coef=0.0
clip_ratio_low=0.2
clip_ratio_high=0.28
max_prompt_length=$((1024 * 2))
max_response_length=$((1024 * 6))
enable_overlong_buffer=True
overlong_buffer_len=$((1024 * 4))
overlong_penalty_factor=1.0
loss_agg_mode="token-mean"
enable_filter_groups=True
filter_groups_metric=acc
max_num_gen_batches=10
train_prompt_bsz=32
gen_prompt_bsz=$((train_prompt_bsz * 3))
n_resp_per_prompt=16
max_num_seqs=1024
train_prompt_mini_bsz=32
# Ray
NNODES=4
# Paths
MODEL_PATH="/root/work/filestorage/GroupPostTrain/OpenSourceModels/Qwen3-30B-A3B"
CKPTS_DIR="/root/work/externalstorage/gpfsprd/verl/outputs/${project_name}/${exp_name}"
TRAIN_FILE="/root/work/externalstorage/gpfsprd/datasets/DAPO-Math-17k/raw/dapo-math-17k.parquet"
TEST_FILE="/root/work/externalstorage/gpfsprd/datasets/DAPO-Math-17k/dapo-aime24.parquet"
# Algorithm
temperature=1.0
top_p=1.0
top_k=-1 # 0 for HF rollout, -1 for vLLM rollout
val_top_p=0.7
# Performance Related Parameter
sp_size=1
use_dynamic_bsz=True
log_prob_micro_batch_size_per_gpu=1
ppo_micro_batch_size_per_gpu=1
actor_ppo_max_token_len=$((1024 * 16))
infer_ppo_max_token_len=$((1024 * 32))
offload=True
gen_tp=2
enable_chunked_prefill=True
python3 -m recipe.dapo.main_dapo \
data.train_files="${TRAIN_FILE}" \
data.val_files="${TEST_FILE}" \
data.prompt_key=prompt \
data.truncation='left' \
data.max_prompt_length=${max_prompt_length} \
data.max_response_length=${max_response_length} \
data.gen_batch_size=${gen_prompt_bsz} \
data.train_batch_size=${train_prompt_bsz} \
actor_rollout_ref.rollout.n=${n_resp_per_prompt} \
actor_rollout_ref.rollout.max_num_seqs=${max_num_seqs} \
actor_rollout_ref.rollout.max_num_batched_tokens=$((1024 * 32)) \
algorithm.adv_estimator=${adv_estimator} \
algorithm.use_kl_in_reward=${use_kl_in_reward} \
algorithm.kl_ctrl.kl_coef=${kl_coef} \
actor_rollout_ref.actor.use_kl_loss=${use_kl_loss} \
actor_rollout_ref.actor.kl_loss_coef=${kl_loss_coef} \
actor_rollout_ref.actor.clip_ratio_low=${clip_ratio_low} \
actor_rollout_ref.actor.clip_ratio_high=${clip_ratio_high} \
actor_rollout_ref.actor.clip_ratio_c=10.0 \
algorithm.filter_groups.enable=${enable_filter_groups} \
algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
algorithm.filter_groups.metric=${filter_groups_metric} \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.ref.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=${use_dynamic_bsz} \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=${actor_ppo_max_token_len} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
actor_rollout_ref.model.path="${MODEL_PATH}" \
+actor_rollout_ref.model.override_config.attention_dropout=0. \
+actor_rollout_ref.model.override_config.embd_pdrop=0. \
+actor_rollout_ref.model.override_config.resid_pdrop=0. \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.optim.lr_warmup_steps=10 \
actor_rollout_ref.actor.optim.weight_decay=0.1 \
actor_rollout_ref.actor.ppo_mini_batch_size=${train_prompt_mini_bsz} \
actor_rollout_ref.actor.fsdp_config.param_offload=${offload} \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=${offload} \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.grad_clip=1.0 \
actor_rollout_ref.actor.loss_agg_mode=${loss_agg_mode} \
actor_rollout_ref.actor.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \
actor_rollout_ref.rollout.tensor_model_parallel_size=${gen_tp} \
actor_rollout_ref.rollout.enable_chunked_prefill=${enable_chunked_prefill} \
actor_rollout_ref.rollout.temperature=${temperature} \
actor_rollout_ref.rollout.top_p=${top_p} \
actor_rollout_ref.rollout.top_k="${top_k}" \
actor_rollout_ref.rollout.val_kwargs.temperature=${temperature} \
actor_rollout_ref.rollout.val_kwargs.top_p=${val_top_p} \
actor_rollout_ref.rollout.val_kwargs.top_k=${top_k} \
actor_rollout_ref.rollout.val_kwargs.do_sample=True \
actor_rollout_ref.rollout.val_kwargs.n=1 \
actor_rollout_ref.rollout.enforce_eager=False \
actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
+actor_rollout_ref.actor.fsdp_config.model_dtype=bfloat16 \
actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
actor_rollout_ref.actor.strategy=fsdp \
actor_rollout_ref.actor.fsdp_config.fsdp_size=-1 \
actor_rollout_ref.rollout.free_cache_engine=True \
reward_model.reward_manager=dapo \
reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
reward_model.overlong_buffer.len=${overlong_buffer_len} \
reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
trainer.logger='["tensorboard", "console"]' \
trainer.project_name="${project_name}" \
trainer.experiment_name="${exp_name}" \
trainer.n_gpus_per_node=8 \
trainer.nnodes="${NNODES}" \
trainer.val_before_train=False \
trainer.test_freq=200 \
trainer.save_freq=50 \
trainer.total_epochs=100 \
trainer.default_local_dir="${CKPTS_DIR}" \
trainer.resume_mode=auto \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${log_prob_micro_batch_size_per_gpu} \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=${ppo_micro_batch_size_per_gpu} \
++actor_rollout_ref.nccl_timeout=7200 \
actor_rollout_ref.actor.use_torch_compile=False \
actor_rollout_ref.ref.use_torch_compile=False $@
| Metric | Relative to H100 |
|---|---|
| End-to-end time timing_s/step | 0.27 |
| Training throughput update_tps | 1.59 |
| Inference throughput vllm_tps | 0.16 |