Ascend-SACT/train_Qwen3-8B-mindspeed-llm

Fine-Tuning Adaptation of Qwen3-8B on MindSpeed-LLM

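0. Model Overview

Qwen3 is the latest generation of large language models in the Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built on extensive training, Qwen3 delivers major advances in reasoning, instruction following, agent capabilities, and multilingual support, with the following key features:

  • Seamless switching, within a single model, between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient general-purpose dialogue), ensuring optimal performance across scenarios.
  • Significantly enhanced reasoning, surpassing the earlier QwQ (in thinking mode) and the Qwen2.5 instruct models (in non-thinking mode) in math, code generation, and commonsense logical reasoning.
  • Superior human preference alignment, excelling at creative writing, role-play, multi-turn dialogue, and instruction following, for a more natural, engaging, and immersive conversational experience.
  • Strong agent capabilities, with precise integration of external tools in both thinking and non-thinking modes and leading performance among open-source models on complex agent-based tasks.
  • Support for more than 100 languages and dialects, with strong multilingual instruction following and translation.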

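Model Overview

Qwen3-8B has the following characteristics:

  • Type: causal language model
  • Training stages: pre-training and post-training
  • Number of parameters: 8.2 billion
  • Number of non-embedding parameters: 6.95 billion
  • Number of layers: 36
  • Number of attention heads (GQA): 32 for Q and 8 for KV
  • Context length: 32,768 tokens natively and 131,072 tokens with YaRN.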
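The gap between the total and the non-embedding parameter counts can be checked by hand. The sketch below assumes an untied input embedding and output layer, a vocabulary of 151,936 tokens, and a hidden size of 4,096 (the values passed to the training scripts later in this document). The function name is illustrative and not part of any library.

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of MindSpeed-LLM): parameters held by the
# input embedding plus the untied output layer.
embedding_params() {
    local vocab_size=$1 hidden_size=$2
    echo $(( 2 * vocab_size * hidden_size ))
}

emb=$(embedding_params 151936 4096)
echo "embedding + output layer parameters: ${emb}"
```

The result is about 1.24 billion, which together with the 6.95 billion non-embedding parameters gives roughly 8.2 billion.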

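I. Training Environment

Hardware:

| Device model | NPU configuration |
| --- | --- |
| Atlas 800T A3 | 8*128G |

Software stack:

| Software | Version | Deployment |
| --- | --- | --- |
| Driver | AscendHDK 25.0.RC1 | Host |
| Firmware | AscendHDK 25.0.RC1 | Host |
| Python | 3.10 | In container |
| CANN | CANN 8.1.RC1 | In container |
| Torch | 2.6.0 | In container |
| Torch_npu | release v7.0.0 | In container |
| MindSpeed | 2.0.0_core_r0.8.0 | In container |
| MindSpeed-LLM | 2.1.0 | In container |
| Megatron-LM | core_r0.8.0 | In container |
| Docker image OS | Ubuntu 20.04.6 | / |

Image:

  • Image Dockerfile
  • Image tar package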

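II. Preparation

2.1 Training Script Preparation

Training and validation scripts for the Qwen3-8B model are provided, covering data preprocessing, model format conversion, full-parameter fine-tuning, and LoRA fine-tuning. They are located under scripts/qwen3-8b.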

scripts/qwen3-8b#
    ├── ckpt_convert_qwen3_hf2mcore.sh   # convert HF weights to Mcore format
    ├── ckpt_convert_qwen3_lora_merge.sh  # merge LoRA weights after LoRA fine-tuning
    ├── ckpt_convert_qwen3_mcore2hf_full.sh  # convert full-parameter fine-tuned Mcore weights to HF format
    ├── ckpt_convert_qwen3_mcore2hf_lora.sh  # convert LoRA fine-tuned Mcore weights to HF format
    ├── data_convert_qwen3_instruction_customed10k.sh # convert the qwen3 customed slow-thinking data
    ├── eval_qwen3_8b_full_customed10k.sh  # evaluation script for full-parameter fine-tuning
    ├── eval_qwen3_8b_lora_customed10k.sh  # evaluation script for LoRA fine-tuning
    ├── tune_qwen3_8b_4K_full_customed10k_ptd.sh   # full-parameter fine-tuning training script
    └── tune_qwen3_8b_4K_lora_customed10k_ptd.sh   # LoRA fine-tuning training script

2.2 Weight Preparation

Download the Qwen3-8B model weights from HuggingFace or ModelScope. They have already been placed in the models/Qwen3-8B directory.

models/Qwen3-8B# 
    ├── config.json
    ├── configuration.json
    ├── generation_config.json
    ├── merges.txt
    ├── model-00001-of-00005.safetensors
    ├── model-00002-of-00005.safetensors
    ├── model-00003-of-00005.safetensors
    ├── model-00004-of-00005.safetensors
    ├── model-00005-of-00005.safetensors
    ├── model.safetensors.index.json
    ├── README.md
    ├── tokenizer_config.json
    ├── tokenizer.json
    └── vocab.json

2.3 Dataset Preparation

Prepare the training and test datasets and place them in the datasets/ directory.

datasets#
    ├── customed_test_1k.jsonl    # test set
    └── customed_train_10k.jsonl  # training set
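Before launching any conversion job it can help to confirm that the weights and datasets are where the scripts expect them. The following is a minimal sketch, assuming the directory layout shown above; check_inputs is a hypothetical helper, not a script shipped in scripts/qwen3-8b.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight check; the function name is hypothetical.
# Prints every expected file that is missing and returns non-zero if any is.
check_inputs() {
    local root=$1 missing=0 f
    for f in \
        models/Qwen3-8B/config.json \
        models/Qwen3-8B/tokenizer.json \
        models/Qwen3-8B/model.safetensors.index.json \
        datasets/customed_train_10k.jsonl \
        datasets/customed_test_1k.jsonl
    do
        if [ ! -f "${root}/${f}" ]; then
            echo "missing: ${root}/${f}"
            missing=1
        fi
    done
    return "${missing}"
}

# Example: check the current directory, report but do not abort.
check_inputs . || echo "some inputs are missing"
```

Only a few representative files are checked; extend the list as needed for your environment.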

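III. Converting HF Weights to MG Format

Ascend MindSpeed-LLM requires model weights in Megatron-LM format, so the original HuggingFace weights are converted to Megatron-Mcore format here.

Use the conversion script to produce MG weights with the desired split. Here the weights are split as tp2pp2 (TP=2, PP=2). The key code of the conversion script ckpt_convert_qwen3_8B_hf2mcore.sh is as follows: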

OUTPUT_BASE_DIR=/data/private/qwen3
log_path="${OUTPUT_BASE_DIR}/8b/logs/ckpt_convert_qwen3_8B_hf2mcore.log"
mkdir -p ${OUTPUT_BASE_DIR}/8b/mg_weights/qwen3_8b_mcore_tp2pp2/
mkdir -p ${OUTPUT_BASE_DIR}/8b/logs

# Set the required weight-conversion parameters
python convert_ckpt.py \
    --use-mcore-models \
    --model-type GPT \
    --load-model-type hf \
    --save-model-type mg \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 2 \
    --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
    --load-dir models/Qwen3-8B/ \
    --save-dir ${OUTPUT_BASE_DIR}/8b/mg_weights/qwen3_8b_mcore_tp2pp2/ \
    --tokenizer-model models/Qwen3-8B/tokenizer.json \
    --params-dtype bf16 \
    --model-type-hf qwen3 \
        | tee $log_path

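Parameter reference

| Parameter | Description | Required |
| --- | --- | --- |
| --model-type GPT | Sets the model type to the GPT family | Yes |
| --use-mcore-models | Convert to Megatron-Mcore format | Yes |
| --target-tensor-parallel-size | Tensor parallel degree | Yes |
| --target-pipeline-parallel-size | Pipeline parallel degree | Yes |
| --tokenizer-model | Path to the tokenizer | Yes |
| --load-model-type | Format of the weights to load (hf or mg) | Yes |
| --save-model-type | Format of the weights to save (hf or mg) | Yes |
| --load-dir | Path to load the weight files from | Yes |
| --save-dir | Path to save the weight files to | Yes |
| --model-type-hf | HuggingFace model type; defaults to llama2 | No |
| --params-dtype | Precision of the converted weights; defaults to fp16. Set it to bf16 if the source weights are bf16 | Yes |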
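The target parallel sizes have to fit the model shape. As a rough sketch of the usual Megatron-style constraints (an assumption here, not something taken from convert_ckpt.py itself), the layer count should divide evenly across pipeline stages and the attention heads and KV groups across tensor-parallel ranks. check_parallel_config is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Illustrative check (hypothetical helper): a TP/PP split is only usable when
# the layer count divides evenly across pipeline stages and the attention
# heads and KV groups divide evenly across tensor-parallel ranks.
check_parallel_config() {
    local tp=$1 pp=$2 layers=$3 heads=$4 kv_groups=$5
    if (( layers % pp != 0 )); then
        echo "num-layers ${layers} is not divisible by PP=${pp}"; return 1
    fi
    if (( heads % tp != 0 || kv_groups % tp != 0 )); then
        echo "heads ${heads} / KV groups ${kv_groups} not divisible by TP=${tp}"; return 1
    fi
    echo "TP=${tp} PP=${pp}: $(( layers / pp )) layers per stage, $(( heads / tp )) query heads and $(( kv_groups / tp )) KV groups per rank"
}

# Qwen3-8B values from the model overview with the tp2pp2 split used here.
check_parallel_config 2 2 36 32 8
```

For the tp2pp2 split this reports 18 layers per pipeline stage, 16 query heads and 4 KV groups per tensor-parallel rank.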

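Submit the job through the platform's managed training service. A single node with a single NPU is enough; more resources make it faster. Run the script: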

cd scripts/qwen3-8b && bash ckpt_convert_qwen3_8B_hf2mcore.sh

Part of the log after running:

  wgrad_deferral_limit ............................ 0
  world_size ...................................... 4
  yaml_cfg ........................................ None
-------------------- end of MindSpeed-LLM Arguments ---------------------
 > padded vocab (size: 151936) with 0 dummy tokens (new size: 151936)
building GPT model ...
building GPT model ...
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [12:33<00:00, 150.73s/it]
building GPT model ...
set layer states: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:03<00:00,  9.01it/s]
INFO:root:sending embeddings
INFO:root:sending transformer layer 0
INFO:root:sending transformer layer 1
...
INFO:root:sending transformer layer 35
INFO:root:sending final norm
INFO:root:sending output layer
INFO:root:Waiting for saver to complete...
INFO:root:received transformer layer 7
...
INFO:root:received transformer layer 33
INFO:root:received transformer layer 34
INFO:root:received transformer layer 35
INFO:root:received final norm
INFO:root:received output layer
....
  successfully saved checkpoint from iteration       1 to /data/private/qwen3/8b/mg_weights/qwen3_8b_mcore_tp2pp2/
INFO:root:Done!

The conversion produces the following directory structure:

|-- iter_0000001
|   |-- mp_rank_00_000
|   |   `-- model_optim_rng.pt
|   |-- mp_rank_00_001
|   |   `-- model_optim_rng.pt
|   |-- mp_rank_01_000
|   |   `-- model_optim_rng.pt
|   `-- mp_rank_01_001
|       `-- model_optim_rng.pt
`-- latest_checkpointed_iteration.txt
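The shard directories follow an mp_rank_<tp>_<pp> pattern, so a tp2pp2 conversion should yield exactly four of them. A small sketch that lists the expected names for any split, inferred from the listing above rather than from the converter's source; expected_shards is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Illustrative helper: list the shard directories expected under iter_0000001
# for a given TP/PP split, following the mp_rank_<tp>_<pp> pattern seen above.
expected_shards() {
    local tp=$1 pp=$2 t p
    for (( t = 0; t < tp; t++ )); do
        for (( p = 0; p < pp; p++ )); do
            printf 'mp_rank_%02d_%03d\n' "${t}" "${p}"
        done
    done
}

expected_shards 2 2
```

Comparing this output with the contents of iter_0000001 is a quick way to spot an incomplete conversion.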

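IV. Converting the Fine-Tuning Dataset to bin Format

The dataset used here is the customed dataset. The raw dataset files are:

  • Training dataset: datasets/customed_train_10k.jsonl
  • Evaluation dataset: datasets/customed_test_1k.jsonl

A sample from the training dataset: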

{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "请你判读用户的意图。...交互历史:用户:广州塔为什么也叫小蛮腰\n用户最后一句话所属的技能和意图是:"}, {"role": "assistant", "content": "闲聊百科_知识查询"}]}

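Producing slow-thinking data requires the master-branch code of MindSpeed-LLM. That code has been placed in the image in advance, under /src/train25.1.0/MindSpeed-LLM.

The fine-tuning dataset conversion script data_convert_qwen3_instruction_customed10k_qwen3.sh looks roughly as follows: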

OUTPUT_BASE_DIR=/data/private/qwen3
mkdir -p ${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think
log_path="${OUTPUT_BASE_DIR}/8b/logs/data_convert_qwen3_instruction_customed10k_qwen3.log"
mkdir -p ${OUTPUT_BASE_DIR}/8b/logs

python ./preprocess_data.py \
    --input  datasets/customed_train_10k.jsonl \
    --tokenizer-name-or-path models/Qwen3-8B/ \
    --output-prefix ${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think/customed \
    --handler-name SharegptStyleInstructionHandler \
    --tokenizer-type PretrainedFromHF \
    --enable-thinking true \
    --workers 4 \
    --log-interval 1000 \
    --map-keys '{"messages":"messages", "tags":{"role_tag": "role","content_tag": "content","user_tag": "user","assistant_tag": "assistant","system_tag": "system"}}' \
    --prompt-type qwen3 \
        | tee $log_path

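Parameter description:

| Parameter | Description | Required |
| --- | --- | --- |
| --input | Path to the input data: a dataset directory or a single file. If it is a directory, all files in it are processed. Supported formats are .parquet, .csv, .json, .jsonl, .txt, and .arrow; all files in one directory must share the same format | Yes |
| --tokenizer-name-or-path | Path to the pretrained model used for the tokenizer | Yes |
| --output-prefix | Prefix of the output files. The preprocessed data is saved as several files (for example with .bin and .idx suffixes), and this parameter sets their common prefix path | Yes |
| --workers | Number of processes used for preprocessing (multiprocessing). More processes speed up processing | Yes |
| --tokenizer-type | Type of the tokenizer | Yes |
| --log-interval | Number of steps between progress updates | Yes |
| --enable-thinking | Switch for the fast/slow thinking template; accepts true, false, or none (default: none). For qwen3, when enabled, `<think>\n\n</think>\n\n` is added to the dataset label and takes part in the loss computation, and all data is treated as slow-thinking data; when disabled, `<think>\n\n</think>\n\n` is added to the prompt's `<im_start… | |
| --handler-name | Class name of the data handler. Common choices include AlpacaStyleInstructionHandler, SharegptStyleInstructionHandler, and AlpacaStylePairwiseHandler | Yes |
| --prompt-type | Model template, which helps a base model gain better conversational ability after fine-tuning. The available values can be found in the templates file | Yes |
| --map-keys | Configures the field mapping used to read the dataset. The keys "messages" and "tags" are the attribute names after mapping; they are fixed in the code and must not be changed. In the values, "conversations" is the dataset column name, "from" is the role field, "human", "gpt", "system", "observation", and "function_call" are the role kinds, and "value" is the content field. OpenAI style: '{"messages":"messages", "tags":{"role_tag": "role","content_tag": "content","user_tag": "user","assistant_tag": "assistant","system_tag": "system"}}'. ShareGPT style: '{"messages":"conversations", "tags":{"role_tag": "from","content_tag": "value","user_tag": "human","assistant_tag": "gpt","system_tag": "system", "observation_tag":"observation", "function_tag":"function_call"}}' | Yes |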

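Submit the job through the platform's managed training service. A single node with a single NPU is enough; more resources make it faster. Run the script: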

cd scripts/qwen3-8b && bash data_convert_qwen3_instruction_customed10k_qwen3.sh

The conversion produces the following files:

-rw-r--r--. 1 root root 18841600 Jan 20 21:20 customed_packed_attention_mask_document.bin
-rw-r--r--. 1 root root   200042 Jan 20 21:20 customed_packed_attention_mask_document.idx
-rw-r--r--. 1 root root 18841600 Jan 20 21:20 customed_packed_input_ids_document.bin
-rw-r--r--. 1 root root   200042 Jan 20 21:20 customed_packed_input_ids_document.idx
-rw-r--r--. 1 root root 18841600 Jan 20 21:20 customed_packed_labels_document.bin
-rw-r--r--. 1 root root   200042 Jan 20 21:20 customed_packed_labels_document.idx
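Note that --output-prefix (and later --data-path) is a file-name prefix, not a directory. The sketch below expands a prefix into the six file names seen in the listing above; the naming pattern is read off that listing, and dataset_files is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Illustrative helper: expand a --output-prefix / --data-path value into the
# six files the listing above shows for an instruction dataset.
dataset_files() {
    local prefix=$1 field ext
    for field in attention_mask input_ids labels; do
        for ext in bin idx; do
            echo "${prefix}_packed_${field}_document.${ext}"
        done
    done
}

dataset_files /data/private/qwen3/finetune_dataset_customed10k_qwen3_think/customed
```

Piping the output through `xargs ls -l` gives a quick check that every expected file was written.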

V. Full Parameter Fine-Tuning

5.1 Full Parameter Fine-Tuning Script

The key code of the launch script for full-parameter fine-tuning, tune_qwen3_8b_4K_full_customed10k_ptd.sh, is as follows:

NPUS_PER_NODE=16
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))

# please fill these path configurations
OUTPUT_BASE_DIR=/data/private/qwen3
mkdir -p ${OUTPUT_BASE_DIR}/8b/logs

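# Set the checkpoint save, checkpoint load, tokenizer, and dataset paths for your environment
CKPT_LOAD_DIR="${OUTPUT_BASE_DIR}/8b/mg_weights/qwen3_8b_mcore_tp2pp2/"  # checkpoint load path: the directory written by the weight conversion step
CKPT_SAVE_DIR="${OUTPUT_BASE_DIR}/8b/save_weights_full/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/"  # where the weights are saved after training
DATA_PATH="${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think/customed"  # dataset path: the output of data preprocessing, including the file-name prefix
TOKENIZER_PATH="/data/models/Qwen3-8B/" # tokenizer path: the directory of the downloaded open-source weights
log_path="${OUTPUT_BASE_DIR}/8b/logs/tune_qwen3_8b_4K_full_customed10k_ptd.log"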


TP=2
PP=2
MBS=8
GBS=128

SEQ_LENGTH=800
TRAIN_ITERS=390 # number of training steps

DISTRIBUTED_ARGS="
    --nproc_per_node $NPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

GPT_ARGS="
    --use-mcore-models \
    --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
    --kv-channels 128 \
    --qk-layernorm \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --sequence-parallel \
    --use-distributed-optimizer \
    --use-flash-attn \
    --num-layers 36 \
    --hidden-size 4096  \
    --use-rotary-position-embeddings \
    --num-attention-heads 32 \
    --ffn-hidden-size 12288 \
    --max-position-embeddings 32768 \
    --seq-length ${SEQ_LENGTH} \
    --make-vocab-size-divisible-by 1 \
    --padded-vocab-size 151936 \
    --rotary-base 1000000 \
    --micro-batch-size ${MBS} \
    --global-batch-size ${GBS} \
    --disable-bias-linear \
    --swiglu \
    --tokenizer-type PretrainedFromHF \
    --tokenizer-name-or-path ${TOKENIZER_PATH} \
    --normalization RMSNorm \
    --position-embedding-type rope \
    --norm-epsilon 1e-6 \
    --hidden-dropout 0 \
    --attention-dropout 0 \
    --no-gradient-accumulation-fusion \
    --attention-softmax-in-fp32 \
    --exit-on-missing-checkpoint \
    --no-masked-softmax-fusion \
    --group-query-attention \
    --untie-embeddings-and-output-weights \
    --num-query-groups 8 \
    --min-lr 1.25e-7 \
    --lr 1.25e-6 \
    --weight-decay 1e-1 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --initial-loss-scale 4096 \
    --no-load-optim \
    --no-load-rng \
    --seed 42 \
    --train-iters ${TRAIN_ITERS} \
    --bf16
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --split 100,0,0
"

OUTPUT_ARGS="
    --log-interval 1 \
    --save-interval ${TRAIN_ITERS} \
    --eval-interval ${TRAIN_ITERS} \
    --eval-iters 0 \
"

TUNE_ARGS="
    --finetune \
    --stage sft \
    --is-instruction-dataset \
    --prompt-type qwen3 \
    --variable-seq-lengths
"

torchrun $DISTRIBUTED_ARGS posttrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    $TUNE_ARGS \
    --tensorboard-dir ${OUTPUT_BASE_DIR}/8b/tb/full/customed10k \
    --distributed-backend nccl \
    --load ${CKPT_LOAD_DIR} \
    --save ${CKPT_SAVE_DIR} \
    | tee $log_path

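5.2 Training Parameter Comparison Table

The key training parameters for full-parameter fine-tuning map between MindSpeed-LLM and LLaMAFactory as follows:

| MindSpeed-LLM | LLaMAFactory | Notes |
| --- | --- | --- |
| SEQ_LENGTH=800 | cutoff_len=800 | Training sequence length |
| TP=2 | - | Tensor (model) parallel split |
| PP=2 | - | Pipeline parallel split |
| --log-interval=1 | logging_steps=10 | Logging frequency |
| MBS=8 | per_device_train_batch_size=2 | micro_batch_size; set it according to device memory |
| - | gradient_accumulation_steps=8 | MindSpeed-LLM derives gradient accumulation automatically from MBS, GBS, and DP. World_Size = TP * PP * DP, and GBS = MBS * MB * DP, where MB is the number of micro-batches accumulated per step |
| GBS=128 | - | Global batch size |
| TRAIN_ITERS=390 | num_train_epochs=5 | Number of training steps. Total steps = total samples * epochs / GBS. The number of steps affects the learning rate at each step |
| --lr 1.25e-6 | learning_rate=1.25e-6 | Learning rate |
| --bf16 | bf16=true | Train in bf16 |
| --tensorboard-dir ${OUTPUT_BASE_DIR}/8b/tb/full/customed10k | report_to: tensorboard | Enable TensorBoard logging |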
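The relations World_Size = TP * PP * DP and GBS = MBS * MB * DP can be worked through with the values from the full-parameter script (16 ranks, TP=2, PP=2, MBS=8, GBS=128). The step count assumes 10,000 training samples and 5 epochs, taken from the dataset name and the LLaMAFactory setting; this is a sketch of the arithmetic, not code from MindSpeed-LLM.

```shell
#!/usr/bin/env bash
# Values taken from the full-parameter fine-tuning script above.
WORLD_SIZE=16; TP=2; PP=2; MBS=8; GBS=128

DP=$(( WORLD_SIZE / (TP * PP) ))          # data-parallel size
MB=$(( GBS / (MBS * DP) ))                # micro-batches accumulated per step

# Assumption: 10,000 training samples repeated for 5 epochs, as in the
# LLaMAFactory setting; integer division drops the last partial batch.
TRAIN_ITERS=$(( 10000 * 5 / GBS ))

echo "DP=${DP} MB=${MB} TRAIN_ITERS=${TRAIN_ITERS}"
```

With these values each optimizer step accumulates 4 micro-batches on each of 4 data-parallel replicas, and 390 steps cover roughly five passes over the data.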

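5.3 Launching the Full-Parameter Job

On a single node with 8 cards (16 NPU ranks, matching NPUS_PER_NODE=16 in the script), run the script:

cd scripts/qwen3-8b && bash tune_qwen3_8b_4K_full_customed10k_ptd.sh

Part of the log after running: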

all tp gourps [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
all ep groups [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
all dp groups [[0, 2, 4, 6], [1, 3, 5, 7], [8, 10, 12, 14], [9, 11, 13, 15]]
all_dp_modulo_exp_group_ranks [[0, 2, 4, 6], [1, 3, 5, 7], [8, 10, 12, 14], [9, 11, 13, 15]]
all_tensor_and_expert_group_ranks [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
all_data_parallel_group_ranks_with_cp [[0, 2, 4, 6], [1, 3, 5, 7], [8, 10, 12, 14], [9, 11, 13, 15]]
...
training ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (4802.77, 4813.21)
    train/valid/test-data-iterators-setup ..........: (1616.17, 1791.04)
[before the start of training step] datetime: 2026-01-20 22:07:15 
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
Number of parameters in transformer layers in billions:  6.95
Number of parameters in embedding layers in billions: 1.24
Total number of parameters in billions: 8.19
Number of parameters in most loaded shard in billions: 2.0478
Number of parameters in other shards in billions: 1.7366
Theoretical memory footprints: weight and optimizer=17576.03 MB
...
 [2026-01-20 14:18:04] iteration      388/     390 | consumed samples:        49664 | elapsed time per iteration (ms): 1633.5 | learning rate: 1.307692E-07 | global batch size:   128 | lm loss: 1.011770E-02 | loss scale: 1.0 | grad norm: 1.443 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2026-01-20 14:18:06] iteration      389/     390 | consumed samples:        49792 | elapsed time per iteration (ms): 1647.3 | learning rate: 1.278846E-07 | global batch size:   128 | lm loss: 5.900864E-03 | loss scale: 1.0 | grad norm: 1.321 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2026-01-20 14:18:07] iteration      390/     390 | consumed samples:        49920 | elapsed time per iteration (ms): 1664.2 | learning rate: 1.250000E-07 | global batch size:   128 | lm loss: 5.781100E-03 | loss scale: 1.0 | grad norm: 1.647 | number of skipped iterations:   0 | number of nan iterations:   0 |
saving checkpoint at iteration     390 to /data/private/qwen3/8b/save_weights_full/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/ in torch format
  successfully saved checkpoint from iteration     390 to /data/private/qwen3/8b/save_weights_full/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/
(min, max) time across ranks (ms):
    save-checkpoint ................................: (51745.24, 51745.50)
[after training is done] datetime: 2026-01-20 22:18:59 
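To inspect the loss outside TensorBoard, the per-step values can be pulled straight out of the training log. A minimal sketch with sed, assuming the log line format shown above; extract_loss is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Illustrative helper: print "<iteration> <lm loss>" for every training-step
# line of a MindSpeed-LLM log read from stdin.
extract_loss() {
    sed -n 's#.*iteration *\([0-9][0-9]*\)/.*lm loss: \([0-9.E+-]*\).*#\1 \2#p'
}

# Example with one line in the format shown above.
echo ' [2026-01-20 14:18:07] iteration      390/     390 | consumed samples:        49920 | elapsed time per iteration (ms): 1664.2 | learning rate: 1.250000E-07 | global batch size:   128 | lm loss: 5.781100E-03 | loss scale: 1.0 |' | extract_loss
```

In practice the function would read the file written by tee, for example `extract_loss < "$log_path"`.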

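After training, the training log and TensorBoard logs are generated according to the configuration.

TensorBoard is preinstalled in the image. With the training script settings above, the TensorBoard log directory is /data/private/qwen3/8b/tb/full/customed10k

Open a terminal in the experiment environment and start TensorBoard with: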

tensorboard --logdir=/data/private/qwen3/8b/tb/full/customed10k --port=3000

The loss curve is shown below:

[Figure: loss curve of full-parameter fine-tuning]

5.5 Weight Evaluation

The fine-tuned weights are evaluated on the customed evaluation dataset. The key code of the evaluation script eval_qwen3_8b_full_customed10k.sh is as follows:

OUTPUT_BASE_DIR=/data/private/qwen3
CHECKPOINT="${OUTPUT_BASE_DIR}/8b/save_weights_full/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/" # path where the fine-tuned weights were saved
TOKENIZER_PATH="/data/share/models/Qwen3-8B/" # path to the model tokenizer
eval_log_path="${OUTPUT_BASE_DIR}/8b/logs/eval_qwen3_8b_full_customed10k.log"
EVAL_DATA_PATH="/data/share/datasets/customed_test_1k.jsonl"

# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1 # number of nodes in the cluster; set according to your environment
NODE_RANK=0  # rank of the current node; must be unique per node: 0 for the master node, 1, 2, ... for the others
NPUS_PER_NODE=16
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))

TP=2
PP=2
SEQ_LENGTH=2048

DISTRIBUTED_ARGS="
    --nproc_per_node $NPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

torchrun $DISTRIBUTED_ARGS inference_4_midea_test.py \
        --use-mcore-models \
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} \
        --load ${CHECKPOINT} \
        --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
        --kv-channels 128 \
        --qk-layernorm \
        --num-layers 36 \
        --hidden-size 4096 \
        --ffn-hidden-size 12288 \
        --untie-embeddings-and-output-weights \
        --num-attention-heads 32 \
        --group-query-attention \
        --num-query-groups 8 \
        --max-position-embeddings 32768 \
        --seq-length ${SEQ_LENGTH} \
        --make-vocab-size-divisible-by 1 \
        --padded-vocab-size 151936 \
        --rotary-base 1000000 \
        --micro-batch-size 1 \
        --disable-bias-linear \
        --swiglu \
        --use-rotary-position-embeddings \
        --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path ${TOKENIZER_PATH} \
        --normalization RMSNorm \
        --position-embedding-type rope \
        --norm-epsilon 1e-6 \
        --hidden-dropout 0.0 \
        --attention-dropout 0.0 \
        --max-new-tokens 256 \
        --no-gradient-accumulation-fusion \
        --attention-softmax-in-fp32 \
        --tokenizer-not-use-fast \
        --exit-on-missing-checkpoint \
        --no-masked-softmax-fusion \
        --no-load-rng \
        --no-load-optim \
        --seed 42 \
        --bf16 \
        --eval-data-path ${EVAL_DATA_PATH} \
        --eval-data-size 1000 \
        --temperature 0.00001 \
        | tee $eval_log_path

Submit the job through the platform's managed training service and run the script:

cd scripts/qwen3-8b && bash eval_qwen3_8b_full_customed10k.sh

Part of the log:

...
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 2
...
correct / total = 10 / 10
=====>finish Batch=98, Accuracy: 1.0, index: 980 =======
==============  Batch 99 / 100 : Start (980 + 10) / 1000 =============
================ Do Sample =================
....
=====>finish Batch=100, Accuracy: 0.8, index: 1000 =======
eval data: 91.40000000000002 / 100
Average accuracy: 0.9140000000000001

5.6 Converting MG Weights to HF Format

After training with MindSpeed-LLM, the weights can be converted to HF format with the script ckpt_convert_qwen3_mcore2hf_full.sh. The key code is as follows:

export CUDA_DEVICE_MAX_CONNECTIONS=1

OUTPUT_BASE_DIR=/data/private/qwen3
log_path="${OUTPUT_BASE_DIR}/8b/logs/ckpt_convert_qwen3_mcore2hf_full.log"
mkdir -p ${OUTPUT_BASE_DIR}/8b/logs

python convert_ckpt.py \
    --use-mcore-models \
    --model-type GPT \
    --load-model-type mg \
    --save-model-type hf \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
    --load-dir ${OUTPUT_BASE_DIR}/8b/save_weights_full/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/ \
    --save-dir models/Qwen3-8B/ \
    --params-dtype bf16 \
    --model-type-hf qwen3 \
        | tee $log_path

# Copy the converted HF weights next to the fine-tuned checkpoint
cp -rf models/Qwen3-8B/mg2hf ${OUTPUT_BASE_DIR}/8b/save_weights_full/

Notes:

  • Set --save-model-type to hf.
  • Set --save-dir to the path of the original HF weights. After the script runs, an mg2hf directory holding the converted weights is created under that path.
  • Set both TP and PP to 1: --target-tensor-parallel-size 1, --target-pipeline-parallel-size 1

Submit the job through the platform's managed training service and run the script:

cd scripts/qwen3-8b && bash ckpt_convert_qwen3_mcore2hf_full.sh

Part of the log after running:

INFO:root:Starting saver...
INFO:root:Starting loader...
Setting num_layers to 36 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 12288 from checkpoint
Setting seq_length to 800 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting num_query_groups to 8 from checkpoint
Setting group_query_attention to True from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 32768 from checkpoint
Setting position_embedding_type to rope from checkpoint
Setting add_position_embedding to True from checkpoint
Setting use_rotary_position_embeddings to True from checkpoint
Setting rotary_percent to 1.0 from checkpoint
Setting rotary_interleaved to False from checkpoint
Setting add_bias_linear to False from checkpoint
Setting add_qkv_bias to False from checkpoint
Setting swiglu to True from checkpoint
Setting untie_embeddings_and_output_weights to True from checkpoint
Setting apply_layernorm_1p to False from checkpoint
Setting normalization to RMSNorm from checkpoint
Setting tokenizer_type to PretrainedFromHF from checkpoint
Setting padded_vocab_size to 151936 from checkpoint
Setting apply_query_key_layer_scaling to False from checkpoint
Setting tensor_model_parallel_size to 2 from checkpoint
Setting pipeline_model_parallel_size to 2 from checkpoint
Checkpoint did not provide arguments virtual_pipeline_model_parallel_size
Checkpoint did not provide arguments num_layers_per_virtual_pipeline_stage
Checkpoint did not provide arguments num_layer_list
...
INFO:root:Setting consumed_train_samples to 0 and consumed_valid_samples to 0
using world size: 1, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
...
INFO:root:received transformer layer 32
INFO:root:received transformer layer 33
INFO:root:received transformer layer 34
INFO:root:received transformer layer 35
INFO:root:received final norm
INFO:root:received output layer
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [12:29<00:00, 149.91s/it]
set layer states: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:02<00:00, 15.94it/s]
INFO:root:save weight to /data/models/Qwen3-8B/mg2hf
INFO:root:Done!

The conversion produces the following directory structure:

.
|-- [ 729]  config.json
|-- [ 214]  generation_config.json
|-- [4.6G]  model-00001-of-00004.safetensors
|-- [4.6G]  model-00002-of-00004.safetensors
|-- [4.6G]  model-00003-of-00004.safetensors
|-- [1.5G]  model-00004-of-00004.safetensors
`-- [ 32K]  model.safetensors.index.json

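VI. LoRA Fine-Tuning

6.1 LoRA Fine-Tuning Script

The key code of the LoRA fine-tuning launch script tune_qwen3_8b_4K_lora_customed10k_ptd.sh is as follows:


# Basic configuration
NPUS_PER_NODE=4  # NPUs used on this node; must be a multiple of TP*PP (2*2=4)
MASTER_ADDR=localhost # use this node as the master address
MASTER_PORT=6015 # master port on this node
NNODES=1  # single machine, i.e. one node; multi-machine means multiple nodes
NODE_RANK=0  # 0 for a single machine; 0..NNODES-1 for multi-machine, unique per node
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES)) # total number of NPUs used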

OUTPUT_BASE_DIR=/data/private/qwen3
mkdir -p ${OUTPUT_BASE_DIR}/8b/logs

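# Set the checkpoint save, checkpoint load, tokenizer, and dataset paths for your environment
CKPT_LOAD_DIR="${OUTPUT_BASE_DIR}/8b/mg_weights/qwen3_8b_mcore_tp2pp2/"  # checkpoint load path: the directory written by the weight conversion step
CKPT_SAVE_DIR="${OUTPUT_BASE_DIR}/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/"  # where the weights are saved after training
DATA_PATH="${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think/customed"  # dataset path: the output of data preprocessing, including the file-name prefix
TOKENIZER_PATH="/data/models/Qwen3-8B/" # tokenizer path: the directory of the downloaded open-source weights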
log_path="${OUTPUT_BASE_DIR}/8b/logs/tune_qwen3_8b_4K_lora_customed10k_ptd.log"
TP=2
PP=2
SEQ_LENGTH=2048
TRAIN_ITERS=800

DISTRIBUTED_ARGS="
    --nproc_per_node $NPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

OPTIMIZE_ARGS="
    --use-flash-attn \
    --use-fused-rotary-pos-emb \
    --use-rotary-position-embeddings \
    --use-fused-swiglu \
    --use-fused-rmsnorm \
    --no-masked-softmax-fusion \
    --use-distributed-optimizer
"

TRAIN_ARGS="
    --micro-batch-size 4 \
    --global-batch-size 64 \
    --lr 1.25e-5 \
    --lr-decay-style cosine \
    --min-lr 1.25e-7 \
    --weight-decay 1e-1 \
    --lr-warmup-fraction 0.1 \
    --attention-dropout 0.0 \
    --init-method-std 0.01 \
    --hidden-dropout 0.0 \
    --clip-grad 1.0 \
    --adam-beta1 0.9 \
    --adam-beta2 0.999 \
    --initial-loss-scale 4096 \
    --seed 42 \
    --bf16 \
    --train-iters ${TRAIN_ITERS} \
    --seq-length ${SEQ_LENGTH} \
    --no-shared-storage
"

MODEL_PARALLEL_ARGS="
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP}
"

GPT_ARGS="
    --use-mcore-models \
    --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
    --kv-channels 128 \
    --qk-layernorm \
    --tokenizer-name-or-path ${TOKENIZER_PATH} \
    --max-position-embeddings ${SEQ_LENGTH} \
    --num-layers 36 \
    --hidden-size 4096 \
    --ffn-hidden-size 12288 \
    --num-attention-heads 32 \
    --tokenizer-type PretrainedFromHF \
    --make-vocab-size-divisible-by 1 \
    --padded-vocab-size 151936 \
    --rotary-base 1000000 \
    --disable-bias-linear \
    --position-embedding-type rope \
    --normalization RMSNorm \
    --swiglu \
    --attention-softmax-in-fp32 \
    --no-gradient-accumulation-fusion \
    --group-query-attention \
    --num-query-groups 8
"

DATA_ARGS="
    --data-path $DATA_PATH \
    --split 100,0,0
"

OUTPUT_ARGS="
    --load ${CKPT_LOAD_DIR} \
    --save ${CKPT_SAVE_DIR} \
    --log-interval 1 \
    --save-interval ${TRAIN_ITERS} \
    --eval-interval ${TRAIN_ITERS} \
    --eval-iters 0 \
    --no-load-optim \
    --no-load-rng
"

TUNE_ARGS="
    --finetune \
    --stage sft \
    --is-instruction-dataset \
    --tokenizer-not-use-fast \
    --prompt-type qwen3 \
    --variable-seq-lengths \
    --lora-r 16 \
    --lora-alpha 32 \
    --lora-fusion \
    --lora-target-modules linear_qkv linear_proj linear_fc1 linear_fc2
"

torchrun $DISTRIBUTED_ARGS posttrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    $OPTIMIZE_ARGS \
    $TRAIN_ARGS \
    $TUNE_ARGS \
    $MODEL_PARALLEL_ARGS \
    --tensorboard-dir ${OUTPUT_BASE_DIR}/8b/tb/lora/customed10k \
    --distributed-backend nccl \
    | tee $log_path

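6.2 Training Parameter Comparison Table

The key training parameters for LoRA fine-tuning map between MindSpeed-LLM and LLaMAFactory as follows:

| MindSpeed-LLM | LLaMAFactory | Notes |
| --- | --- | --- |
| --lora-r 16 | lora_rank: 16 | Rank of the low-rank matrices. A lower rank updates fewer parameters during training, reducing compute and memory use, but a rank that is too low can limit the model's expressive capacity |
| --lora-alpha 32 | - | Controls how strongly the LoRA weights affect the original weights; larger values mean a larger effect. Usually kept at twice lora-r |
| --lora-target-modules linear_qkv linear_proj linear_fc1 linear_fc2 | lora_target: all | Modules to which LoRA is added |
| --lora-fusion | - | Whether to enable the CCLoRA algorithm, which improves performance by overlapping computation with communication |
| SEQ_LENGTH=2048 | cutoff_len=2048 | Training sequence length |
| TP=2 | - | Tensor (model) parallel split |
| PP=2 | - | Pipeline parallel split |
| --log-interval 1 | logging_steps=10 | Logging frequency |
| --micro-batch-size 4 | per_device_train_batch_size=4 | micro_batch_size; set it according to device memory |
| - | gradient_accumulation_steps=4 | MindSpeed-LLM derives gradient accumulation automatically from MBS, GBS, and DP. World_Size = TP * PP * DP, and GBS = MBS * MB * DP, where MB is the number of micro-batches accumulated per step |
| --global-batch-size 64 | - | Global batch size |
| TRAIN_ITERS=800 | num_train_epochs=5 | Number of training steps. Total steps = total samples * epochs / GBS. The number of steps affects the learning rate at each step |
| --lr 1.25e-5 | learning_rate=1.25e-5 | Learning rate |
| --lr-decay-style cosine | lr_scheduler_type=cosine | Cosine annealing learning-rate decay |
| --lr-warmup-fraction 0.1 | warmup_ratio=0.1 | Learning-rate warmup fraction |
| --bf16 | bf16=true | Train in bf16 |
| --tensorboard-dir ${OUTPUT_BASE_DIR}/8b/tb/lora/customed10k | report_to: tensorboard | Enable TensorBoard logging |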

6.3 Launching the LoRA Job

Select a single node with 4 cards, submit the job, and run the script:

cd scripts/qwen3-8b && bash tune_qwen3_8b_4K_lora_customed10k_ptd.sh

Part of the log after running:

...
all tp gourps [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
all ep groups [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]]
all dp groups [[0, 2, 4, 6], [1, 3, 5, 7], [8, 10, 12, 14], [9, 11, 13, 15]]
all_dp_modulo_exp_group_ranks [[0, 2, 4, 6], [1, 3, 5, 7], [8, 10, 12, 14], [9, 11, 13, 15]]
all_tensor_and_expert_group_ranks [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
all_data_parallel_group_ranks_with_cp [[0, 2, 4, 6], [1, 3, 5, 7], [8, 10, 12, 14], [9, 11, 13, 15]]
...
INFO:megatron.core.distributed.param_and_grad_buffer:Number of buckets for gradient all-reduce / reduce-scatter: 1
Params for bucket 1 (11501568 elements):
	module.base_model.model.decoder.layers.17.self_attention.linear_qkv.lora_B.default.weight
	module.base_model.model.decoder.layers.11.self_attention.linear_proj.lora_A.default.weight
	module.base_model.model.decoder.layers.8.self_attention.linear_qkv.lora_A.default.weight
	module.base_model.model.decoder.layers.5.mlp.linear_fc1.lora_B.default.weight
	module.base_model.model.decoder.layers.15.mlp.linear_fc2.lora_B.default.weight
	module.base_model.model.decoder.layers.12.self_attention.linear_proj.lora_B.default.weight
	module.base_model.model.decoder.layers.9.self_attention.linear_qkv.lora_B.default.weight
	module.base_model.model.decoder.layers.4.self_attention.linear_proj.lora_B.default.weight
	module.base_model.model.decoder.layers.1.mlp.linear_fc1.lora_B.default.weight
	module.base_model.model.decoder.layers.14.mlp.linear_fc2.lora_A.default.weight
	module.base_model.model.decoder.layers.13.mlp.linear_fc2.lora_B.default.weight
...
 Number of parameters in transformer layers in billions:  6.95
Number of parameters in embedding layers in billions: 1.24
Total number of parameters in billions: 8.19
Number of parameters in most loaded shard in billions: 2.0478
Number of parameters in other shards in billions: 1.7366
Theoretical memory footprints: weight and optimizer=17576.03 MB
 ...
 [2026-01-20 14:20:01] iteration      798/     800 | consumed samples:        51072 | elapsed time per iteration (ms): 638.6 | learning rate: 1.252356E-07 | global batch size:    64 | lm loss: 2.746646E-02 | loss scale: 1.0 | grad norm: 0.425 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2026-01-20 14:20:01] iteration      799/     800 | consumed samples:        51136 | elapsed time per iteration (ms): 665.2 | learning rate: 1.250589E-07 | global batch size:    64 | lm loss: 3.423290E-02 | loss scale: 1.0 | grad norm: 0.530 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2026-01-20 14:20:02] iteration      800/     800 | consumed samples:        51200 | elapsed time per iteration (ms): 652.2 | learning rate: 1.250000E-07 | global batch size:    64 | lm loss: 6.727248E-03 | loss scale: 1.0 | grad norm: 0.293 | number of skipped iterations:   0 | number of nan iterations:   0 |
saving checkpoint at iteration     800 to /data/private/qwen3/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/ in torch format
  successfully saved checkpoint from iteration     800 to /data/private/qwen3/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/
(min, max) time across ranks (ms):
    save-checkpoint ................................: (5011.91, 5012.08)
[after training is done] datetime: 2026-01-20 22:20:07 
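The bucket size reported in the log above (11501568 elements) can be reproduced from the LoRA settings and the Qwen3-8B dimensions used in section 5.1 (hidden size 4096, FFN size 12288). The sketch assumes that column-parallel layers shard lora_B and row-parallel layers shard lora_A across tensor-parallel ranks; that sharding rule is an assumption made for this calculation, not a statement about the MindSpeed-LLM implementation.

```shell
#!/usr/bin/env bash
# Model and LoRA settings used in this document (Qwen3-8B, r=16, tp2pp2).
HIDDEN=4096; FFN=12288; HEADS=32; KV_GROUPS=8; KV_CHANNELS=128
LAYERS=36; TP=2; PP=2; R=16

QKV_OUT=$(( (HEADS + 2 * KV_GROUPS) * KV_CHANNELS ))   # fused Q, K and V
PROJ_IN=$(( HEADS * KV_CHANNELS ))
FC1_OUT=$(( 2 * FFN ))                                  # SwiGLU gate + up

# Assumption: column-parallel layers (linear_qkv, linear_fc1) shard lora_B,
# row-parallel layers (linear_proj, linear_fc2) shard lora_A; the other
# factor is replicated on every tensor-parallel rank.
qkv=$((  R * HIDDEN         + (QKV_OUT / TP) * R ))
proj=$(( R * (PROJ_IN / TP) + HIDDEN * R ))
fc1=$((  R * HIDDEN         + (FC1_OUT / TP) * R ))
fc2=$((  R * (FFN / TP)     + HIDDEN * R ))

PER_RANK=$(( (qkv + proj + fc1 + fc2) * LAYERS / PP ))
echo "LoRA parameters per rank: ${PER_RANK}"
```

Under these assumptions each rank trains about 11.5 million LoRA parameters, against the roughly 2 billion base parameters it holds.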

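After training, the training log and TensorBoard logs are generated according to the configuration.

TensorBoard is preinstalled in the image. With the training script settings above, the TensorBoard log directory is /data/private/qwen3/8b/tb/lora/customed10k

Open a terminal in the experiment environment and start TensorBoard with:

tensorboard  --port=3000 --logdir=/data/private/qwen3/8b/tb/lora/customed10k

The loss curve is shown below:

[Figure: loss curve of LoRA fine-tuning]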

6.4 Weight Merging

Merge the LoRA weights into the original weights:

OUTPUT_BASE_DIR=/data/private/qwen3
log_path="${OUTPUT_BASE_DIR}/8b/logs/ckpt_convert_qwen3_lora_merge.log"
mkdir -p ${OUTPUT_BASE_DIR}/8b/logs

python convert_ckpt.py \
    --use-mcore-models \
    --model-type GPT \
    --load-model-type mg \
    --save-model-type mg \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 2 \
    --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
    --lora-r 16 \
    --lora-alpha 32 \
    --lora-target-modules linear_qkv linear_proj linear_fc1 linear_fc2 \
    --load-dir ${OUTPUT_BASE_DIR}/8b/mg_weights/qwen3_8b_mcore_tp2pp2/ \
    --lora-load ${OUTPUT_BASE_DIR}/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think/ \
    --save-dir ${OUTPUT_BASE_DIR}/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think_merge/ \
    --model-type-hf qwen3 \
        | tee $log_path

Partial log output:

...
INFO:root:received transformer layer 34
INFO:root:received transformer layer 35
INFO:root:received final norm
INFO:root:received output layer
saving checkpoint at iteration       1 to /data/private/qwen3/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think_merge/ in torch format
  successfully saved checkpoint from iteration       1 to /data/private/qwen3/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think_merge/
saving checkpoint at iteration       1 to /data/private/qwen3/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think_merge/ in torch format
  successfully saved checkpoint from iteration       1 to /data/private/qwen3/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think_merge/
INFO:root:Done!
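Conceptually, a LoRA merge folds the low-rank update into each targeted base weight: W' = W + (alpha / r) · B · A. The sketch below illustrates this standard LoRA formulation in plain Python on toy matrices. It is not the implementation used by convert_ckpt.py, and the helper names `matmul` and `merge_lora` are hypothetical. With --lora-r 16 and --lora-alpha 32 as in the command above, the scaling factor alpha / r is 2.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, r, alpha):
    """Return weight + (alpha / r) * (lora_b @ lora_a), the standard LoRA merge."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: the adapter matrices here have rank 1 only to keep the numbers
# readable; r and alpha are the hyperparameters from the merge command.
weight = [[0.0, 0.0], [0.0, 0.0]]
lora_b = [[1.0], [2.0]]   # shape (out_features, rank)
lora_a = [[3.0, 4.0]]     # shape (rank, in_features)

merged = merge_lora(weight, lora_a, lora_b, r=16, alpha=32)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the evaluation step can load the merged checkpoint like an ordinary Mcore checkpoint.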

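6.5 Weight evaluation

The fine-tuned weights are evaluated on the customed evaluation dataset. The key code of the evaluation script eval_qwen3_8b_lora_customed10k.sh is as follows: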

OUTPUT_BASE_DIR=/data/private/qwen3
CHECKPOINT="${OUTPUT_BASE_DIR}/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think_merge/" # path where the fine-tuned (merged) weights are saved
TOKENIZER_PATH="/data/share/models/Qwen3-8B/" # path to the model tokenizer
eval_log_path="${OUTPUT_BASE_DIR}/8b/logs/eval_qwen3_8b_lora_customed10k.log"
EVAL_DATA_PATH="/data/share/datasets/customed_test_1k.jsonl"

# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1 # number of nodes in the cluster; set according to the actual environment
NODE_RANK=0  # rank of the current node; must be unique per node: 0 for the master node, 1, 2, ... for the others
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))

TP=2
PP=2
SEQ_LENGTH=2048

DISTRIBUTED_ARGS="
    --nproc_per_node $NPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"


torchrun $DISTRIBUTED_ARGS inference_4_midea_test.py \
        --use-mcore-models \
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} \
        --load ${CHECKPOINT} \
        --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
        --kv-channels 128 \
        --qk-layernorm \
        --num-layers 36 \
        --hidden-size 4096 \
        --ffn-hidden-size 12288 \
        --untie-embeddings-and-output-weights \
        --num-attention-heads 32 \
        --group-query-attention \
        --num-query-groups 8 \
        --max-position-embeddings 32768 \
        --seq-length ${SEQ_LENGTH} \
        --make-vocab-size-divisible-by 1 \
        --padded-vocab-size 151936 \
        --rotary-base 1000000 \
        --micro-batch-size 1 \
        --disable-bias-linear \
        --swiglu \
        --use-rotary-position-embeddings \
        --tokenizer-type PretrainedFromHF \
        --tokenizer-name-or-path ${TOKENIZER_PATH} \
        --normalization RMSNorm \
        --position-embedding-type rope \
        --norm-epsilon 1e-6 \
        --hidden-dropout 0.0 \
        --attention-dropout 0.0 \
        --max-new-tokens 256 \
        --no-gradient-accumulation-fusion \
        --attention-softmax-in-fp32 \
        --tokenizer-not-use-fast \
        --exit-on-missing-checkpoint \
        --no-masked-softmax-fusion \
        --no-load-rng \
        --no-load-optim \
        --seed 42 \
        --bf16 \
        --eval-data-path ${EVAL_DATA_PATH} \
        --eval-data-size 1000 \
        --temperature 0.00001 \
            | tee $eval_log_path

Submit the job through the platform's managed training service and run the script:

cd scripts/qwen3-8b && bash eval_qwen3_8b_lora_customed10k.sh

Partial log output:

...
using world size: 16, data-parallel size: 4, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 2 
WARNING: overriding default arguments for no_load_rng:True                        with no_load_rng:True
WARNING: overriding default arguments for no_load_optim:True                        with no_load_optim:True
WARNING: Please specify --split when using --data-path. Using legacy default value of "969, 30, 1"
setting global batch size to 4
WARNING: Setting args.overlap_p2p_comm to False since non-interleaved schedule does not support overlapping p2p communication
accumulate and all-reduce gradients in fp32 for bfloat16 data type.
using torch.bfloat16 for parameters ...
...
Loading tokenizer...
Loading data...
==============  Batch 1 / 100 : Start (0 + 10) / 1000 =============
...
================ Do Sample =================
...
=====>finish Batch=100, Accuracy: 0.7, index: 1000 =======
eval data: 85.40000000000002 / 100
Average accuracy: 0.8540000000000002
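The first line of this log reports the parallel layout. In a Megatron-style layout the data-parallel size is the world size divided by the product of the tensor-, pipeline- and context-parallel sizes. The following small sketch checks that arithmetic; `data_parallel_size` is a hypothetical helper, not part of MindSpeed-LLM.

```python
def data_parallel_size(world_size, tp, pp, cp=1):
    """Data-parallel size implied by a Megatron-style parallel layout."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by tp*pp*cp = {model_parallel}"
        )
    return world_size // model_parallel


# Layout reported in the log above: world size 16, TP=2, PP=2, CP=1.
print(data_parallel_size(16, tp=2, pp=2))

# Layout implied by the script as shown: NNODES=1, NPUS_PER_NODE=8, TP=2, PP=2.
print(data_parallel_size(8, tp=2, pp=2))
```

The logged world size of 16 gives a data-parallel size of 4, matching the log. A single node with 8 NPUs, as configured in the script excerpt, would give 2.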

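6.6 Converting MG weights to HF format

After training with MindSpeed-LLM, the weights can be converted to HF format with the script ckpt_convert_qwen3_mcore2hf_lora.sh. Its key code is as follows: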

export CUDA_DEVICE_MAX_CONNECTIONS=1

OUTPUT_BASE_DIR=/data/private/qwen3
log_path="${OUTPUT_BASE_DIR}/8b/logs/ckpt_convert_qwen3_mcore2hf_lora.log"
mkdir -p ${OUTPUT_BASE_DIR}/8b/logs

python convert_ckpt.py \
    --use-mcore-models \
    --model-type GPT \
    --load-model-type mg \
    --save-model-type hf \
    --target-tensor-parallel-size 1 \
    --target-pipeline-parallel-size 1 \
    --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
    --load-dir ${OUTPUT_BASE_DIR}/8b/save_weights_lora/qwen3_8b_mcore_tp2pp2_customed10k_qwen3_think_merge/ \
    --save-dir models/Qwen3-8B_lora/ \
    --model-type-hf qwen3 \
        | tee $log_path

cp -rf models/Qwen3-8B_lora/mg2hf ${OUTPUT_BASE_DIR}/8b/save_weights_lora/

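Notes:

  • Set the --save-model-type parameter to hf.

  • Set the --save-dir parameter to the path of the original HF weights. After the script runs, an mg2hf directory is created under that path to hold the converted weights.

  • Set both TP and PP to 1: --target-tensor-parallel-size 1, --target-pipeline-parallel-size 1

  • Because LoRA training was done in fp16, to avoid affecting precision, change "torch_dtype": "bfloat16" to "torch_dtype": "float16" in the config.json of the original HF weights. Check the result after the change: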

    # grep -rn  "torch_dtype" models/Qwen3-8B_lora/config.json 
    25:  "torch_dtype": "float16",
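The torch_dtype change can also be scripted rather than edited by hand. The following is a minimal sketch using only the Python standard library; `set_torch_dtype` is a hypothetical helper. Note that it rewrites the whole file with 2-space indentation, so line numbers reported by grep may shift.

```python
import json
from pathlib import Path


def set_torch_dtype(config_path, dtype="float16"):
    """Set the torch_dtype field of an HF config.json and return the previous value."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("torch_dtype")
    config["torch_dtype"] = dtype
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
    return previous
```

For the layout used here, the call would be set_torch_dtype("models/Qwen3-8B_lora/config.json"), run before the conversion script.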

Submit the job through the platform's managed training service and run the script:

cd scripts/qwen3-8b && bash ckpt_convert_qwen3_mcore2hf_lora.sh

Partial log output after the run:

Setting num_layers to 36 from checkpoint
Setting hidden_size to 4096 from checkpoint
Setting ffn_hidden_size to 12288 from checkpoint
Setting seq_length to 40960 from checkpoint
Setting num_attention_heads to 32 from checkpoint
Setting num_query_groups to 8 from checkpoint
Setting group_query_attention to True from checkpoint
Setting kv_channels to 128 from checkpoint
Setting max_position_embeddings to 40960 from checkpoint
Setting position_embedding_type to rope from checkpoint
Setting add_position_embedding to True from checkpoint
Setting use_rotary_position_embeddings to True from checkpoint
Setting rotary_percent to 1.0 from checkpoint
Setting rotary_interleaved to False from checkpoint
Setting add_bias_linear to False from checkpoint
Setting add_qkv_bias to False from checkpoint
Setting swiglu to True from checkpoint
Setting untie_embeddings_and_output_weights to True from checkpoint
Setting apply_layernorm_1p to False from checkpoint
Setting normalization to RMSNorm from checkpoint
Setting tokenizer_type to Llama2Tokenizer from checkpoint
Setting padded_vocab_size to 151936 from checkpoint
Setting apply_query_key_layer_scaling to False from checkpoint
Setting tensor_model_parallel_size to 2 from checkpoint
Setting pipeline_model_parallel_size to 2 from checkpoint
...
Setting expert_model_parallel_size to 1 from checkpoint
using world size: 4, data-parallel size: 1, context-parallel size: 1 tensor-model-parallel size: 2, pipeline-model-parallel size: 2 
...
INFO:root:received transformer layer 33
INFO:root:received transformer layer 34
INFO:root:received transformer layer 35
INFO:root:received final norm
INFO:root:received output layer
Loading checkpoint shards: 100%|██████████| 5/5 [06:05<00:00, 73.08s/it]
set layer states: 100%|██████████| 36/36 [00:03<00:00,  9.47it/s]
INFO:root:save weight to /data/share/1950833868382666752/platform_test/models/Qwen3-8B/mg2hf
INFO:root:Done!

After conversion, the generated directory contains the following files:

|-- config.json
|-- generation_config.json
|-- model-00001-of-00004.safetensors
|-- model-00002-of-00004.safetensors
|-- model-00003-of-00004.safetensors
|-- model-00004-of-00004.safetensors
`-- model.safetensors.index.json
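A quick sanity check on the converted directory is to confirm that every shard referenced by model.safetensors.index.json is present. The sketch below assumes the usual Hugging Face sharded layout, in which the index file has a "weight_map" that maps tensor names to shard file names; `missing_shards` is a hypothetical helper.

```python
import json
from pathlib import Path


def missing_shards(hf_dir):
    """Return shard files named in model.safetensors.index.json that are absent from hf_dir."""
    root = Path(hf_dir)
    index_path = root / "model.safetensors.index.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))
    # Each tensor name maps to the shard file that stores it.
    expected = sorted(set(index["weight_map"].values()))
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means all referenced shards exist. It does not validate the tensor contents themselves.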