Qwen3 is the latest generation of large language models in the Qwen series, offering a complete suite of dense and Mixture-of-Experts (MoE) models. Built on extensive training, Qwen3 achieves breakthrough progress in reasoning, instruction following, agent capabilities, and multilingual support.
This walkthrough fine-tunes Qwen3-4B; the validated environment is as follows.
Hardware:
| Device model | NPU configuration |
|---|---|
| Atlas 800T A3 | 8*128G |
Software stack:
| Software | Version | Deployment |
|---|---|---|
| Driver | AscendHDK 25.0.RC1 | Host |
| Firmware | AscendHDK 25.0.RC1 | Host |
| Python | 3.10 | Container |
| CANN | CANN 8.1.RC1 | Container |
| Torch | 2.6.0 | Container |
| Torch_npu | release v7.0.0 | Container |
| MindSpeed | 2.0.0_core_r0.8.0 | Container |
| MindSpeed-LLM | 2.1.0 | Container |
| Megatron-LM | core_r0.8.0 | Container |
| Docker image OS | Ubuntu 20.04.6 | / |
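To confirm the container matches this stack, a quick import check can be run inside it; a minimal sketch (torch_npu is the Ascend PyTorch plugin listed above, which patches the torch.npu namespace on import):
# Verify that torch and torch_npu import cleanly and that NPUs are visible.
python -c "import torch, torch_npu; print(torch.__version__); print(torch.npu.device_count())"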
Image:
The image ships training and validation scripts for the Qwen3-4B model, covering data preprocessing, model format conversion, full-parameter fine-tuning, and LoRA fine-tuning, under the scripts/qwen3-4B path.
scripts/qwen3-4B#
├── ckpt_convert_qwen3_4B_hf2mcore.sh # Convert HF weights to Mcore format
├── ckpt_convert_qwen3_lora_merge.sh # Merge LoRA weights from LoRA fine-tuning
├── ckpt_convert_qwen3_mcore2hf_full.sh # Convert full-parameter-finetuned Mcore weights to HF format
├── ckpt_convert_qwen3_mcore2hf_lora.sh # Convert LoRA-finetuned Mcore weights to HF format
├── data_convert_qwen3_instruction_customed10k.sh # Conversion script for the qwen3 customed slow-thinking data
├── eval_qwen3_4B_full_customed10k.sh # Evaluation script for full-parameter fine-tuning
├── eval_qwen3_4B_lora_customed10k.sh # Evaluation script for LoRA fine-tuning
├── tune_qwen3_4B_4K_full_customed10k_ptd.sh # Full-parameter fine-tuning training script
└── tune_qwen3_4B_4K_lora_customed10k_ptd.sh # LoRA fine-tuning training script
Download the Qwen3-4B model weights from HuggingFace or ModelScope; they are already placed under the models/Qwen3-4B directory.
models/Qwen3-4B#
├── config.json
├── configuration.json
├── generation_config.json
├── merges.txt
├── model-00001-of-00008.safetensors
├── model-00002-of-00008.safetensors
├── model-00003-of-00008.safetensors
├── model.safetensors.index.json
├── README.md
├── tokenizer_config.json
├── tokenizer.json
└── vocab.json
Prepare the training and test datasets and place them under the datasets/ directory:
datasets#
├── customed_test_1k.jsonl # test set
└── customed_train_10k.jsonl # training set
Ascend MindSpeed-LLM requires model weights in Megatron format, so we first convert the original HuggingFace weights to the Megatron-Mcore format.
Use the conversion script to obtain mg weights with the desired parallel split; here the weights are split as tp1pp2. The key code of the conversion script ckpt_convert_qwen3_4B_hf2mcore.sh is as follows:
OUTPUT_BASE_DIR=/data/qwen3
log_path="${OUTPUT_BASE_DIR}/4B/logs/ckpt_convert_qwen3_4B_hf2mcore.log"
mkdir -p ${OUTPUT_BASE_DIR}/4B/mg_weights/qwen3_4B_mcore_tp1pp2/
mkdir -p ${OUTPUT_BASE_DIR}/4B/logs
# Set the required weight-conversion parameters
python convert_ckpt.py \
--use-mcore-models \
--model-type GPT \
--load-model-type hf \
--save-model-type mg \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--load-dir models/Qwen3-4B/ \
--save-dir ${OUTPUT_BASE_DIR}/4B/mg_weights/qwen3_4B_mcore_tp1pp2/ \
--tokenizer-model models/Qwen3-4B/tokenizer.json \
--params-dtype bf16 \
--model-type-hf qwen3 \
| tee $log_path
Parameter reference:
| Parameter | Description | Required |
|---|---|---|
| --model-type GPT | Specifies the model type as the GPT family | Yes |
| --use-mcore-models | Convert to the Megatron-Mcore format | Yes |
| --target-tensor-parallel-size | Tensor parallel degree | Yes |
| --target-pipeline-parallel-size | Pipeline parallel degree | Yes |
| --tokenizer-model | Path to the tokenizer | Yes |
| --load-model-type | Type of the weights to load (hf or mg) | Yes |
| --save-model-type | Type of the weights to save (hf or mg) | Yes |
| --load-dir | Path to load the weight files from | Yes |
| --save-dir | Path to save the weight files to | Yes |
| --model-type-hf | HuggingFace model type, defaults to llama2 | No |
| --params-dtype | Precision of the converted weights, defaults to fp16; if the source weights are bf16, this must be set to bf16 | Yes |
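If a different parallel split is needed later (for example, to match a different device count), the same invocation pattern applies; a minimal sketch, assuming a hypothetical tp2/pp1 target layout (only the target sizes and save directory change, all other flags come from the table above):
# Hypothetical variant: convert to a tp2/pp1 split instead of tp1/pp2.
python convert_ckpt.py \
    --use-mcore-models \
    --model-type GPT \
    --load-model-type hf \
    --save-model-type mg \
    --target-tensor-parallel-size 2 \
    --target-pipeline-parallel-size 1 \
    --spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
    --load-dir models/Qwen3-4B/ \
    --save-dir ${OUTPUT_BASE_DIR}/4B/mg_weights/qwen3_4B_mcore_tp2pp1/ \
    --tokenizer-model models/Qwen3-4B/tokenizer.json \
    --params-dtype bf16 \
    --model-type-hf qwen3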
Submit the job through the platform's managed training; a single node with one NPU is enough to run it, and more resources will run faster. Run the script:
cd scripts/qwen3-4B && bash ckpt_convert_qwen3_4B_hf2mcore.sh
Part of the resulting log:
...
building GPT model ...
WARNING:mindspeed_llm.core.models.common.language_module.language_module:Distributed processes aren't initialized, so the output layer is not initialized with weights from the word embeddings. If you are just manipulating a model this is fine, but this needs to be handled manually. If you are training something is definitely wrong.
building GPT model ...
Loading checkpoint shards: 33%|███████████████████████████████████████████▋ | 1/3 [02:25<04:50, 145.39s/it]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [05:37<00:00, 112.39s/it]
building GPT model ...
set layer states: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:01<00:00, 21.35it/s]
INFO:root:sending embeddings
INFO:root:sending transformer layer 0
INFO:root:sending transformer layer 1
...
INFO:root:received transformer layer 30
INFO:root:received transformer layer 31
INFO:root:received transformer layer 32
INFO:root:received transformer layer 33
INFO:root:received transformer layer 34
INFO:root:received transformer layer 35
INFO:root:received final norm
saving checkpoint at iteration 1 to /data/private/qwen3/4b/mg_weights/qwen3_4b_mcore_tp1pp2/ in torch format
successfully saved checkpoint from iteration 1 to /data/private/qwen3/4b/mg_weights/qwen3_4b_mcore_tp1pp2/
INFO:root:Done!
The conversion produces the following directory layout:
/data/qwen3/4B/mg_weights/qwen3_4B_mcore_tp1pp2/
├── iter_0000001
│   └── mp_rank_00_000
│       └── model_optim_rng.pt
└── latest_checkpointed_iteration.txt
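A quick sanity check that the conversion finished is to read the recorded iteration, which the log above saved as iteration 1:
# Should print 1, matching "successfully saved checkpoint from iteration 1" in the log.
cat /data/qwen3/4B/mg_weights/qwen3_4B_mcore_tp1pp2/latest_checkpointed_iteration.txt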
The dataset used here is the customed dataset; the raw files are:
datasets/customed_train_10k.jsonl
datasets/customed_test_1k.jsonl
A training sample looks like this:
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "请你判读用户的意图。...交互历史:用户:广州塔为什么也叫小蛮腰\n用户最后一句话所属的技能和意图是:"}, {"role": "assistant", "content": "闲聊百科_知识查询"}]}由于制作慢思考数据,需要使用MindSpeed-LLM的master分支代码,相关master分支代码已提前放置在镜像/src/train25.1.0/MindSpeed-LLM 目录。
The fine-tuning dataset conversion script data_convert_qwen3_instruction_customed10k_qwen3.sh looks roughly as follows:
OUTPUT_BASE_DIR=/data/qwen3
mkdir -p ${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think
log_path="${OUTPUT_BASE_DIR}/4B/logs/data_convert_qwen3_instruction_customed10k_qwen3.log"
mkdir -p ${OUTPUT_BASE_DIR}/4B/logs
python ./preprocess_data.py \
--input datasets/customed_train_10k.jsonl \
--tokenizer-name-or-path models/Qwen3-4B/ \
--output-prefix ${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think/customed \
--handler-name SharegptStyleInstructionHandler \
--tokenizer-type PretrainedFromHF \
--enable-thinking true \
--workers 4 \
--log-interval 1000 \
--map-keys '{"messages":"messages", "tags":{"role_tag": "role","content_tag": "content","user_tag": "user","assistant_tag": "assistant","system_tag": "system"}}' \
--prompt-type qwen3 \
| tee $log_path
Parameter notes:
| Parameter | Description | Required |
|---|---|---|
| --input | Path of the input data: a dataset directory or a specific file. If a directory is given, all files in it are processed. Supports the .parquet, .csv, .json, .jsonl, .txt, and .arrow formats; all files in one directory must share the same format | Yes |
| --tokenizer-name-or-path | Path of the pretrained model used for the tokenizer | Yes |
| --output-prefix | Prefix of the output files. The preprocessed data is saved as several files (e.g., with .bin and .idx suffixes); this parameter sets their common path prefix | Yes |
| --workers | Number of processes used for data preprocessing (multiprocessing). More processes speed up the conversion | Yes |
| --tokenizer-type | Type of the tokenizer | Yes |
| --log-interval | Number of steps between progress updates | Yes |
| --enable-thinking | Switch for the fast/slow-thinking template; one of [true, false, none], default none. For qwen3, when enabled, `<think>\n\n</think>\n\n` is added to the dataset labels and takes part in the loss computation, so all data is treated as slow-thinking data; when disabled, `<think>\n\n</think>\n\n` is instead appended to the prompt after the <\|im_start\|>assistant marker and excluded from the loss | No |
| --handler-name | Class name of the data handler. Common ones include AlpacaStyleInstructionHandler, SharegptStyleInstructionHandler, and AlpacaStylePairwiseHandler | Yes |
| --prompt-type | Chat template of the model, which lets a base model acquire better conversational ability after fine-tuning. The available options for prompt-type can be found in the templates file | Yes |
| --map-keys | Configures the field mapping for the dataset. The keys "messages" and "tags" are the mapped dataset attributes, fixed in the code; do not change them. In the values, "conversations" is the dataset column name, "from" the role tag, "human"/"gpt"/"system"/"observation"/"function_call" the role kinds, and "value" the content tag. OpenAI style: '{"messages":"messages", "tags":{"role_tag": "role","content_tag": "content","user_tag": "user","assistant_tag": "assistant","system_tag": "system"}}'. ShareGPT style: '{"messages":"conversations", "tags":{"role_tag": "from","content_tag": "value","user_tag": "human","assistant_tag": "gpt","system_tag": "system", "observation_tag":"observation", "function_tag":"function_call"}}' | Yes |
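For reference, converting a ShareGPT-style file would only change --input and --map-keys relative to the script above; a minimal sketch using the ShareGPT mapping from the table (the input path and output prefix are hypothetical):
# Hypothetical ShareGPT-style input; only --input, --output-prefix and --map-keys differ.
python ./preprocess_data.py \
    --input datasets/sharegpt_style_train.jsonl \
    --tokenizer-name-or-path models/Qwen3-4B/ \
    --output-prefix ${OUTPUT_BASE_DIR}/finetune_dataset_sharegpt/customed \
    --handler-name SharegptStyleInstructionHandler \
    --tokenizer-type PretrainedFromHF \
    --workers 4 \
    --log-interval 1000 \
    --map-keys '{"messages":"conversations", "tags":{"role_tag": "from","content_tag": "value","user_tag": "human","assistant_tag": "gpt","system_tag": "system", "observation_tag":"observation", "function_tag":"function_call"}}' \
    --prompt-type qwen3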
Submit the job through the platform's managed training; a single node with one NPU is enough, and more resources will run faster. Run the script:
cd scripts/qwen3-4B && bash data_convert_qwen3_instruction_customed10k_qwen3.sh
The conversion produces the following files:
.
├── customed_packed_attention_mask_document.bin
├── customed_packed_attention_mask_document.idx
├── customed_packed_input_ids_document.bin
├── customed_packed_input_ids_document.idx
├── customed_packed_labels_document.bin
├── customed_packed_labels_document.idx
├── customed_train_indexmap_49920ns_42s_shuffle_decoder_packed_idx.npy
└── customed_train_indexmap_51200ns_42s_shuffle_decoder_packed_idx.npy
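Before launching training, it is worth confirming that the .bin/.idx pairs listed above all exist under the output prefix; a minimal sketch:
# Each of input_ids / labels / attention_mask should have one .bin and one .idx file.
ls -lh ${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think/customed_packed_*_document.{bin,idx}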
Launch full-parameter fine-tuning; the key code of the training launch script tune_qwen3_4B_4K_full_customed10k_ptd.sh is as follows:
NPUS_PER_NODE=2
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
OUTPUT_BASE_DIR=/data/private/qwen3
mkdir -p ${OUTPUT_BASE_DIR}/4b/logs
# Configure the checkpoint save/load, vocabulary, and dataset paths for your setup
CKPT_LOAD_DIR="${OUTPUT_BASE_DIR}/4b/mg_weights/qwen3_4b_mcore_tp1pp2/" # Weight load path: the path saved by the weight conversion
CKPT_SAVE_DIR="${OUTPUT_BASE_DIR}/4b/save_weights_full/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think/" # Weight save path after training
DATA_PATH="${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think/customed" # Dataset path: the path saved by data preprocessing; note it must include the trailing name prefix
TOKENIZER_PATH="/models/Qwen3-4B/" # Tokenizer path: the downloaded open-source weights
log_path="${OUTPUT_BASE_DIR}/4b/logs/tune_qwen3_4b_4K_full_customed10k_ptd.log"
TP=1
PP=2
MBS=8
GBS=128
SEQ_LENGTH=800
TRAIN_ITERS=390 # Number of training steps
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--use-mcore-models \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--kv-channels 128 \
--qk-layernorm \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--num-layers 36 \
--hidden-size 2560 \
--sequence-parallel \
--use-distributed-optimizer \
--use-flash-attn \
--use-rotary-position-embeddings \
--num-attention-heads 32 \
--ffn-hidden-size 9728 \
--max-position-embeddings 32768 \
--seq-length ${SEQ_LENGTH} \
--make-vocab-size-divisible-by 1 \
--padded-vocab-size 151936 \
--rotary-base 1000000 \
--micro-batch-size ${MBS} \
--global-batch-size ${GBS} \
--disable-bias-linear \
--swiglu \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--normalization RMSNorm \
--position-embedding-type rope \
--norm-epsilon 1e-6 \
--hidden-dropout 0 \
--attention-dropout 0 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--no-masked-softmax-fusion \
--group-query-attention \
--num-query-groups 8 \
--min-lr 1.25e-7 \
--lr 1.25e-6 \
--weight-decay 1e-1 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-load-optim \
--no-load-rng \
--seed 42 \
--train-iters ${TRAIN_ITERS} \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval ${TRAIN_ITERS} \
--eval-interval ${TRAIN_ITERS} \
--eval-iters 0 \
"
TUNE_ARGS="
--finetune \
--stage sft \
--is-instruction-dataset \
--prompt-type qwen3 \
--variable-seq-lengths
"
torchrun $DISTRIBUTED_ARGS posttrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
$TUNE_ARGS \
--tensorboard-dir ${OUTPUT_BASE_DIR}/4b/tb/full/customed10k \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee $log_path
The key full-parameter training parameters of MindSpeed-LLM compared with LLaMAFactory:
| MindSpeed-LLM | LLaMAFactory | Notes |
|---|---|---|
| SEQ_LEN=800 | cutoff_len=800 | Training sequence length |
| TP=1 | - | Tensor (model) parallel split |
| PP=2 | - | Pipeline parallel split |
| --log-interval=1 | logging_steps=10 | Logging frequency |
| MBS=8 | per_device_train_batch_size=2 | micro_batch_size; configure according to device memory |
| - | gradient_accumulation_steps=8 | MindSpeed-LLM derives gradient accumulation automatically from MBS, GBS, and DP. World_Size = TP * PP * DP, and GBS = MBS * gradient_accumulation_steps * DP |
| GBS=128 | - | Global batch size |
| TRAIN_ITERS=390 | num_train_epochs=5 | Number of training steps; total steps = total samples * epochs / GBS (10000 * 5 / 128 ≈ 390). The step count affects the lr at each step |
| --lr 1.25e-6 | learning_rate=1.25e-6 | Learning rate |
| --bf16 | bf16=true | Train in bf16 |
| --tensorboard-dir ${OUTPUT_BASE_DIR}/4B/tb/full/customed10k | report_to: tensorboard | Enable tensorboard logging |
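The batch-size bookkeeping in the table can be checked with shell arithmetic, using the values from the training script above; a minimal sketch:
# DP = WORLD_SIZE / (TP * PP); gradient accumulation steps = GBS / (MBS * DP).
WORLD_SIZE=2; TP=1; PP=2; MBS=8; GBS=128
DP=$(( WORLD_SIZE / (TP * PP) ))
GAS=$(( GBS / (MBS * DP) ))
echo "DP=${DP}, gradient_accumulation_steps=${GAS}"  # prints DP=1, gradient_accumulation_steps=16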
On a single node (the script as written uses NPUS_PER_NODE=2, i.e. two NPUs), run the script:
cd scripts/qwen3-4B && bash tune_qwen3_4B_4K_full_customed10k_ptd.sh
Part of the resulting log:
INFO:megatron.core.datasets.indexed_dataset:> total number of sequences: 10000
INFO:megatron.core.datasets.indexed_dataset:> total number of documents: 10000
> loading shuffle-idx mapping from /data/private/qwen3/finetune_dataset_customed10k_qwen3_think/customed_train_indexmap_49920ns_42s_shuffle_decoder_packed_idx.npy
loaded indexed file in 0.004 seconds
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2026-01-06 22:11:35
done with setup ...
(min, max) time across ranks (ms):
model-and-optimizer-setup ......................: (3462.20, 3470.84)
train/valid/test-data-iterators-setup ..........: (1055.05, 1295.68)
training ...
[before the start of training step] datetime: 2026-01-06 22:11:35
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
Number of parameters in transformer layers in billions: 3.63
[2026-01-06 14:11:44] iteration 1/ 390 | consumed samples: 128 | elapsed time per iteration (ms): 9116.1 | learning rate: 1.247115E-06 | global batch size: 128 | lm loss: 3.889437E+00 | loss scale: 1.0 | grad norm: 191.792 | number of skipped iterations: 0 | number of nan iterations: 0 |
Number of parameters in embedding layers in billions: 0.39
Total number of parameters in billions: 4.02
Number of parameters in most loaded shard in billions: 2.2058
Number of parameters in other shards in billions: 1.8168
Theoretical memory footprints: weight and optimizer=37865.08 MB
[Rank 1] (after 1 iterations) memory (MB) | allocated: 37903.056640625 | max allocated: 39386.81396484375 | reserved: 40102.0 | max reserved: 40102.0
[Rank 0] (after 1 iterations) memory (MB) | allocated: 37865.50927734375 | max allocated: 41541.4873046875 | reserved: 43922.0 | max reserved: 43922.0
[2026-01-06 14:11:51] iteration 2/ 390 | consumed samples: 256 | elapsed time per iteration (ms): 6675.9 | learning rate: 1.244231E-06 | global batch size: 128 | lm loss: 3.908767E+00 | loss scale: 1.0 | grad norm: 187.320 | number of skipped iterations: 0 | number of nan iterations: 0 |
...
[2026-01-06 14:13:27] iteration 10/ 390 | consumed samples: 1280 | elapsed time per iteration (ms): 11919.0 | learning rate: 1.221154E-06 | global batch size: 128 | lm loss: 1.043371E+00 | loss scale: 1.0 | grad norm: 53.924 | number of skipped iterations: 0 | number of nan iterations: 0 |
...
[2026-01-07 09:48:13] iteration 390/ 390 | consumed samples: 49920 | elapsed time per iteration (ms): 1895.7 | learning rate: 1.250000E-07 | global batch size: 128 | lm loss: 2.260044E-03 | loss scale: 1.0 | grad norm: 0.657 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 390 to /data/private/qwen3/4b/save_weights_full/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think/ in torch format
After training, per the configuration, training logs and tensorboard logs are generated.
Tensorboard is preinstalled in the image; with the training script settings above, the tensorboard log directory is /data/qwen3/4B/tb/full/customed10k.
Open a terminal in the experiment environment and start tensorboard with:
tensorboard --logdir=/data/qwen3/4B/tb/full/customed10k --port=3000
The resulting loss curve:

The evaluation script eval_qwen3_4B_full_customed10k.sh evaluates the trained weights on the customed test set; its key code is as follows:
export CUDA_DEVICE_MAX_CONNECTIONS=1
OUTPUT_BASE_DIR=/data/qwen3
CHECKPOINT="${OUTPUT_BASE_DIR}/4B/save_weights_full/qwen3_4B_mcore_tp1pp2_customed10k_qwen3_think/" # Points at the saved fine-tuned weights
TOKENIZER_PATH="models/Qwen3-4B/" # Points at the model tokenizer
eval_log_path="${OUTPUT_BASE_DIR}/4B/logs/eval_qwen3_4B_full_customed10k.log"
EVAL_DATA_PATH="datasets/customed_test_1k.jsonl"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1 # Number of nodes in the cluster; fill in per your setup
NODE_RANK=0 # RANK of this node; unique per node; the master node is 0, others are 1, 2, ...
NPUS_PER_NODE=8
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
TP=1 # Tensor parallel degree
PP=2 # Pipeline parallel degree
SEQ_LENGTH=800 # Maximum sequence length
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS inference_4_midea_test.py \
--use-mcore-models \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--load ${CHECKPOINT} \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--kv-channels 128 \
--qk-layernorm \
--num-layers 36 \
--hidden-size 2560 \
--use-rotary-position-embeddings \
--num-attention-heads 32 \
--ffn-hidden-size 9728 \
--max-position-embeddings 32768 \
--seq-length ${SEQ_LENGTH} \
--make-vocab-size-divisible-by 1 \
--padded-vocab-size 151936 \
--rotary-base 1000000 \
--micro-batch-size 1 \
--disable-bias-linear \
--swiglu \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--normalization RMSNorm \
--position-embedding-type rope \
--norm-epsilon 1e-6 \
--hidden-dropout 0 \
--attention-dropout 0 \
--max-new-tokens 256 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--no-masked-softmax-fusion \
--group-query-attention \
--num-query-groups 8 \
--seed 42 \
--bf16 \
--eval-data-path ${EVAL_DATA_PATH} \
--eval-data-size 1000 \
--temperature 0.00001 \
| tee $eval_log_path
Submit the job through the platform's managed training and run:
cd scripts/qwen3-4B && bash eval_qwen3_4B_full_customed10k.sh
After training with MindSpeed-LLM, to convert the weights back to HF format, use the script ckpt_convert_qwen3_mcore2hf_full.sh; its key code is as follows:
export CUDA_DEVICE_MAX_CONNECTIONS=1
OUTPUT_BASE_DIR=/data/qwen3
log_path="${OUTPUT_BASE_DIR}/4B/logs/ckpt_convert_qwen3_mcore2hf_full.log"
mkdir -p ${OUTPUT_BASE_DIR}/4B/logs
python convert_ckpt.py \
--use-mcore-models \
--model-type GPT \
--load-model-type mg \
--save-model-type hf \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--load-dir ${OUTPUT_BASE_DIR}/4B/save_weights_full/qwen3_4B_mcore_tp1pp2_customed10k_qwen3_think/ \
--save-dir models/Qwen3-4B/ \
--params-dtype bf16 \
--model-type-hf qwen3 \
| tee $log_path
cp -rf models/Qwen3-4B/mg2hf ${OUTPUT_BASE_DIR}/4B/save_weights_full/
Notes:
- Set --save-model-type to hf.
- Set save-dir to the path of the original hf weights; after the script runs, an mg2hf directory containing the converted weights is created there.
- Set both tp and pp to 1: --target-tensor-parallel-size 1, --target-pipeline-parallel-size 1.
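Since --params-dtype bf16 is passed in this full-parameter conversion, the converted config should still declare bfloat16; a quick check (the mg2hf path follows from the save-dir above):
# Expect "torch_dtype": "bfloat16" here for the full-parameter conversion.
grep -n "torch_dtype" models/Qwen3-4B/mg2hf/config.json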
Submit the job through the platform's managed training and run:
cd scripts/qwen3-4B && bash ckpt_convert_qwen3_mcore2hf_full.sh
Part of the resulting log:
...
wandb_save_dir ..................................
weight_decay .................................... 0.01
weight_decay_incr_style ......................... constant
wgrad_deferral_limit ............................ 0
world_size ...................................... 1
yaml_cfg ........................................ None
-------------------- end of MindSpeed-LLM Arguments ---------------------
building GPT model ...
WARNING:mindspeed_llm.core.models.common.language_module.language_module:Distributed processes aren't initialized, so the output layer is not initialized with weights from the word embeddings. If you are just manipulating a model this is fine, but this needs to be handled manually. If you are training something is definitely wrong.
loading checkpoint from /data/private/qwen3/4b/save_weights_full/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think/ at iteration 390
could not find arguments in the checkpoint ...
checkpoint version 3.0
successfully loaded checkpoint from /data/private/qwen3/4b/save_weights_full/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think/ [ t 0, p 0 ] at iteration 0
INFO:root:sending embeddings
...
INFO:root:received transformer layer 33
INFO:root:received transformer layer 34
INFO:root:received transformer layer 35
INFO:root:received final norm
INFO:root:save weight to /data/share/models/Qwen3-4B/mg2hf
INFO:root:Done!
The conversion produces the following files:
-rw-r--r--. 1 root root 727 Jan 8 18:07 config.json
-rw-r--r--. 1 root root 214 Jan 8 18:07 generation_config.json
-rw-r--r--. 1 root root 4967215360 Jan 8 18:07 model-00001-of-00002.safetensors
-rw-r--r--. 1 root root 3077766632 Jan 8 18:07 model-00002-of-00002.safetensors
-rw-r--r--. 1 root root 32819 Jan 8 18:07 model.safetensors.index.json
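A lightweight way to verify the converted HF weights is to load their config and run the tokenizer with transformers; a minimal sketch (assumes transformers is installed in the container; the tokenizer is taken from the original weights directory, since the conversion only emits model files):
# Smoke-test the converted checkpoint: read its config and do a tokenizer round-trip.
python - <<'PY'
from transformers import AutoConfig, AutoTokenizer
cfg = AutoConfig.from_pretrained("models/Qwen3-4B/mg2hf")
tok = AutoTokenizer.from_pretrained("models/Qwen3-4B")
print(cfg.num_hidden_layers, cfg.hidden_size, cfg.torch_dtype)
print(tok.decode(tok("hello")["input_ids"]))
PY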
LoRA fine-tuning: the key code of the training launch script tune_qwen3_4B_4K_lora_customed10k_ptd.sh is as follows:
# Basic configuration
NPUS_PER_NODE=2 # Use 2 NPUs on a single node
MASTER_ADDR=localhost # Use this node's IP address as master_ip
MASTER_PORT=6015 # This node's port is 6015
NNODES=1 # Single machine means one node; multi-machine means multiple nodes
NODE_RANK=0 # RANK is 0 on a single machine; for multiple machines it ranges over (0, NNODES-1) and must be unique per node
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES)) # Total number of NPUs used
OUTPUT_BASE_DIR=/data/private/qwen3
mkdir -p ${OUTPUT_BASE_DIR}/4b/logs
# Configure the checkpoint save/load, vocabulary, and dataset paths for your setup
CKPT_LOAD_DIR="${OUTPUT_BASE_DIR}/4b/mg_weights/qwen3_4b_mcore_tp1pp2/" # Weight load path: the path saved by the weight conversion
CKPT_SAVE_DIR="${OUTPUT_BASE_DIR}/4b/save_weights_lora/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think/" # Weight save path after training
DATA_PATH="${OUTPUT_BASE_DIR}/finetune_dataset_customed10k_qwen3_think/customed" # Dataset path: the path saved by data preprocessing; note it must include the trailing name prefix
TOKENIZER_PATH="/data/models/Qwen3-4B/" # Tokenizer path: the downloaded open-source weights
log_path="${OUTPUT_BASE_DIR}/4b/logs/tune_qwen3_4b_4K_lora_customed10k_ptd.log"
TP=1
PP=2
SEQ_LENGTH=2048
TRAIN_ITERS=800
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
OPTIMIZE_ARGS="
--use-flash-attn \
--use-fused-rotary-pos-emb \
--use-rotary-position-embeddings \
--use-fused-swiglu \
--use-fused-rmsnorm \
--no-masked-softmax-fusion \
--use-distributed-optimizer
"
TRAIN_ARGS="
--micro-batch-size 4 \
--global-batch-size 64 \
--lr 1.25e-5 \
--lr-decay-style cosine \
--min-lr 1.25e-7 \
--weight-decay 1e-1 \
--lr-warmup-fraction 0.1 \
--attention-dropout 0.0 \
--init-method-std 0.01 \
--hidden-dropout 0.0 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.999 \
--initial-loss-scale 4096 \
--seed 42 \
--bf16 \
--train-iters ${TRAIN_ITERS} \
--seq-length ${SEQ_LENGTH} \
--no-shared-storage
"
MODEL_PARALLEL_ARGS="
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP}
"
GPT_ARGS="
--use-mcore-models \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--kv-channels 128 \
--qk-layernorm \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--max-position-embeddings ${SEQ_LENGTH} \
--num-layers 36 \
--hidden-size 2560 \
--ffn-hidden-size 9728 \
--num-attention-heads 32 \
--tokenizer-type PretrainedFromHF \
--make-vocab-size-divisible-by 1 \
--padded-vocab-size 151936 \
--rotary-base 1000000 \
--disable-bias-linear \
--position-embedding-type rope \
--normalization RMSNorm \
--swiglu \
--attention-softmax-in-fp32 \
--no-gradient-accumulation-fusion \
--group-query-attention \
--num-query-groups 8
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
--log-interval 1 \
--save-interval ${TRAIN_ITERS} \
--eval-interval ${TRAIN_ITERS} \
--eval-iters 0 \
--no-load-optim \
--no-load-rng
"
TUNE_ARGS="
--finetune \
--stage sft \
--is-instruction-dataset \
--tokenizer-not-use-fast \
--prompt-type qwen3 \
--variable-seq-lengths \
--lora-r 16 \
--lora-alpha 32 \
--lora-fusion \
--lora-target-modules linear_qkv linear_proj linear_fc1 linear_fc2
"
torchrun $DISTRIBUTED_ARGS posttrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
$OPTIMIZE_ARGS \
$TRAIN_ARGS \
$TUNE_ARGS \
$MODEL_PARALLEL_ARGS \
--tensorboard-dir ${OUTPUT_BASE_DIR}/4b/tb/lora/customed10k \
--distributed-backend nccl \
| tee $log_path
The key LoRA training parameters of MindSpeed-LLM compared with LLaMAFactory:
| MindSpeed-LLM | LLaMAFactory | Notes |
|---|---|---|
| --lora-r 16 | lora_rank: 16 | Rank of the low-rank matrices. A lower rank updates fewer parameters during training, reducing compute and memory use; too low a rank, however, may limit the model's expressiveness. |
| --lora-alpha 32 | - | Controls how strongly the LoRA weights affect the original weights; higher means more influence. Usually kept at 2x lora-r. |
| --lora-target-modules linear_qkv linear_proj linear_fc1 linear_fc2 | lora_target: all | Modules to which LoRA is added |
| --lora-fusion | - | Enables the CCLoRA algorithm, which improves performance by overlapping computation and communication. |
| SEQ_LEN=2048 | cutoff_len=2048 | Training sequence length |
| TP=1 | - | Tensor (model) parallel split |
| PP=2 | - | Pipeline parallel split |
| --log-interval 1 | logging_steps=10 | Logging frequency |
| MBS=4 | per_device_train_batch_size=4 | micro_batch_size; configure according to device memory |
| - | gradient_accumulation_steps=4 | MindSpeed-LLM derives gradient accumulation automatically from MBS, GBS, and DP. World_Size = TP * PP * DP, and GBS = MBS * gradient_accumulation_steps * DP |
| GBS=64 | - | Global batch size |
| TRAIN_ITERS=800 | num_train_epochs=5 | Number of training steps; total steps = total samples * epochs / GBS (10000 * 5 / 64 ≈ 800). The step count affects the lr at each step |
| --lr 1.25e-5 | learning_rate=1.25e-5 | Learning rate |
| --lr-decay-style cosine | lr_scheduler_type=cosine | Cosine learning-rate decay |
| --lr-warmup-fraction 0.1 | warmup_ratio=0.1 | Learning-rate warmup fraction |
| --bf16 | bf16=true | Train in bf16 |
| --tensorboard-dir ${OUTPUT_BASE_DIR}/4B/tb/lora/customed10k | report_to: tensorboard | Enable tensorboard logging |
Select a single node (the script as written uses NPUS_PER_NODE=2, i.e. two NPUs), submit the task, and run the script:
cd scripts/qwen3-4B && bash tune_qwen3_4B_4K_lora_customed10k_ptd.sh
Part of the resulting log:
Number of parameters in transformer layers in billions: 3.63
Number of parameters in embedding layers in billions: 0.39
Total number of parameters in billions: 4.02
Number of parameters in most loaded shard in billions: 2.2058
Number of parameters in other shards in billions: 1.8168
Theoretical memory footprints: weight and optimizer=37865.08 MB
[2026-01-07 09:44:07] iteration 1/ 800 | consumed samples: 64 | elapsed time per iteration (ms): 28850.9 | learning rate: 1.562500E-07 | global batch size: 64 | lm loss: 3.807633E+00 | loss scale: 1.0 | grad norm: 14.335 | number of skipped iterations: 0 | number of nan iterations: 0 |
[Rank 0] (after 1 iterations) memory (MB) | allocated: 4455.0498046875 | max allocated: 12654.9833984375 | reserved: 14510.0 | max reserved: 14510.0
[Rank 1] (after 1 iterations) memory (MB) | allocated: 4478.2373046875 | max allocated: 11350.0546875 | reserved: 12706.0 | max reserved: 12706.0
[2026-01-07 09:44:09] iteration 2/ 800 | consumed samples: 128 | elapsed time per iteration (ms): 2548.7 | learning rate: 3.125000E-07 | global batch size: 64 | lm loss: 3.997280E+00 | loss scale: 1.0 | grad norm: 14.771 | number of skipped iterations: 0 | number of nan iterations: 0 |
...
[2026-01-07 10:17:50] iteration 800/ 800 | consumed samples: 51200 | elapsed time per iteration (ms): 2512.2 | learning rate: 1.250000E-07 | global batch size: 64 | lm loss: 1.005985E-02 | loss scale: 1.0 | grad norm: 0.376 | number of skipped iterations: 0 | number of nan iterations: 0 |
saving checkpoint at iteration 800 to /data/private/qwen3/4b/save_weights_lora/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think/ in torch format
successfully saved checkpoint from iteration 800 to /data/private/qwen3/4b/save_weights_lora/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think/
After training, per the configuration, training logs and tensorboard logs are generated.
Tensorboard is preinstalled in the image; with the training script settings above, the tensorboard log directory is /data/qwen3/4B/tb/lora/customed10k.
Open a terminal in the experiment environment and start tensorboard with:
tensorboard --port=3000 --logdir=/data/qwen3/4B/tb/lora/customed10k
The resulting loss curve:

The evaluation script eval_qwen3_4B_lora_customed10k.sh evaluates the trained weights on the customed test set; its key code is as follows:
export CUDA_DEVICE_MAX_CONNECTIONS=1
OUTPUT_BASE_DIR=/data/private/qwen3
CHECKPOINT="${OUTPUT_BASE_DIR}/4b/save_weights_lora/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think_merge/" # Points at the saved LoRA fine-tuned weights (after merging)
TOKENIZER_PATH="/data/models/Qwen3-4B/" # Points at the model tokenizer
eval_log_path="${OUTPUT_BASE_DIR}/4b/logs/eval_qwen3_4b_lora_customed10k.log"
EVAL_DATA_PATH="/data/datasets/customed_test_1k.jsonl"
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1 # Number of nodes in the cluster; fill in per your setup
NODE_RANK=0 # RANK of this node; unique per node; the master node is 0, others are 1, 2, ...
NPUS_PER_NODE=2
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
TP=1
PP=2
SEQ_LENGTH=4096
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
torchrun $DISTRIBUTED_ARGS inference_4_midea_test.py \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--use-mcore-models \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--num-layers 36 \
--hidden-size 2560 \
--ffn-hidden-size 9728 \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--seq-length ${SEQ_LENGTH} \
--max-new-tokens 2 \
--max-position-embeddings 32768 \
--disable-bias-linear \
--swiglu \
--norm-epsilon 1e-6 \
--padded-vocab-size 151936 \
--make-vocab-size-divisible-by 1 \
--position-embedding-type rope \
--load ${CHECKPOINT} \
--kv-channels 128 \
--qk-layernorm \
--norm-topk-prob \
--rotary-base 1000000 \
--use-rotary-position-embeddings \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--normalization RMSNorm \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--no-masked-softmax-fusion \
--micro-batch-size 1 \
--no-load-rng \
--no-load-optim \
--seed 42 \
--bf16 \
--eval-data-path ${EVAL_DATA_PATH} \
--eval-data-size 1000 \
--temperature 0.00001 \
| tee $eval_log_path
Submit the job through the platform's managed training and run:
cd scripts/qwen3-4B && bash eval_qwen3_4B_lora_customed10k.sh
After training with MindSpeed-LLM, to convert the weights to HF format, use the script ckpt_convert_qwen3_mcore2hf_lora.sh; its key code is as follows:
export CUDA_DEVICE_MAX_CONNECTIONS=1
OUTPUT_BASE_DIR=/data/qwen3
log_path="${OUTPUT_BASE_DIR}/4B/logs/ckpt_convert_qwen3_mcore2hf_lora.log"
mkdir -p ${OUTPUT_BASE_DIR}/4B/logs
python convert_ckpt.py \
--use-mcore-models \
--model-type GPT \
--load-model-type mg \
--save-model-type hf \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--load-dir ${OUTPUT_BASE_DIR}/4B/save_weights_lora/qwen3_4B_mcore_tp1pp2_customed10k_qwen3_think_merge/ \
--save-dir models/Qwen3-4B_lora/ \
--model-type-hf qwen3 \
| tee $log_path
cp -rf models/Qwen3-4B_lora/mg2hf ${OUTPUT_BASE_DIR}/4B/save_weights_lora/
Notes:
- Set --save-model-type to hf.
- Set save-dir to the path of the original hf weights; after the script runs, an mg2hf directory containing the converted weights is created there.
- Set both tp and pp to 1: --target-tensor-parallel-size 1, --target-pipeline-parallel-size 1.
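Note that --load-dir above points at a ..._think_merge directory: the LoRA deltas must first be merged into the base weights. The repo's ckpt_convert_qwen3_lora_merge.sh (listed under scripts/qwen3-4B earlier) produces that directory; a minimal sketch of running it before this conversion:
# Merge the trained LoRA weights into the base Mcore weights (script shipped in the repo).
cd scripts/qwen3-4B && bash ckpt_convert_qwen3_lora_merge.sh
# The merged directory consumed by ckpt_convert_qwen3_mcore2hf_lora.sh:
ls ${OUTPUT_BASE_DIR}/4B/save_weights_lora/qwen3_4B_mcore_tp1pp2_customed10k_qwen3_think_merge/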
Because this LoRA conversion runs in fp16 (the script above does not set --params-dtype, which defaults to fp16 per the parameter table earlier), change "torch_dtype": "bfloat16" to "torch_dtype": "float16" in the config.json accompanying the converted weights so the declared precision matches. After the change, verify:
# grep -rn "torch_dtype" models/Qwen3-4B_lora/config.json
25: "torch_dtype": "float16",
Submit the job through the platform's managed training and run:
cd scripts/qwen3-4B && bash ckpt_convert_qwen3_mcore2hf_lora.sh
Part of the resulting log:
...
checkpoint version 3.0
successfully loaded checkpoint from /data/private/qwen3/4b/save_weights_lora/qwen3_4b_mcore_tp1pp2_customed10k_qwen3_think_merge/ [ t 1, p 0 ] at iteration 0
INFO:root:sending embeddings
INFO:root:sending transformer layer 0
INFO:root:sending transformer layer 1
...
INFO:root:received transformer layer 32
INFO:root:received transformer layer 33
INFO:root:received transformer layer 34
INFO:root:received transformer layer 35
INFO:root:received final norm
INFO:root:save weight to /data/share/1950833868382666752/platform_test/models/Qwen3-4B/mg2hf
INFO:root:Done!
The conversion produces the following files:
/data/qwen3/4B/save_weights_lora/mg2hf/
-rw-r--r--. 1 root root 727 Jan 8 18:17 config.json
-rw-r--r--. 1 root root 214 Jan 8 18:17 generation_config.json
-rw-r--r--. 1 root root 4967215360 Jan 8 18:17 model-00001-of-00002.safetensors
-rw-r--r--. 1 root root 3077766632 Jan 8 18:17 model-00002-of-00002.safetensors
-rw-r--r--. 1 root root 32819 Jan 8 18:17 model.safetensors.index.json