Qwen3-8B 是通义千问系列第三代 82 亿参数的稠密大语言模型,主打“小身材、大能量”,支持“思考/非思考”两种模式一键切换,被社区视为性价比最高的 8B 级模型之一。
enable_thinking=True/False 即可让模型“动脑子”或“秒回”,无需额外微调。模型继续采用“Decoder-only + 自回归”框架,但在工程层面做了系统级瘦身与加速:
| 模块 | 改进点 | 收益 |
|---|---|---|
| 注意力 | 分组查询注意力 GQA(Q=32,KV=8) | KV 缓存 ↓30%,长文本推理速度 ↑18% |
| 激活 | SwiGLU 替代 GeLU/ReLU | 非线性拟合能力↑,训练更稳 |
| 归一化 | RMSNorm 替代 LayerNorm | 每层减少约 5% 计算 |
| 位置编码 | RoPE + ALiBi 混合策略 | 无需额外参数即可扩展 32 K 上下文 |
| 训练数据 | 大量链式思维、代码日志、辩论式语料 | 推理行为接近“一步一步思考” |
公开榜单(MMLU、C-Eval、GSM8K、HumanEval)显示,Qwen3-8B 在中文理解与数理推理上普遍优于 Llama3-8B、Mixtral-8×7B,英文能力与 Llama3-8B 互有胜负。
本案例在 Atlas 800T A3 4 卡 上,基于 MindSpeed-LLM 框架,完成 Qwen3-8B 的 全参SFT微调训练 实践。
| 组件 | 版本 |
|---|---|
| CANN | 8.3.RC1.alpha003 |
| Python | 3.10 |
| torch | 2.7.1 |
| torch_npu | 2.7.1rc1 |
| MindSpeed | 2.2.0_core_r0.12.1 |
| Megatron-LM | core_v0.12.1 |
| MindSpeed-LLM | 2.3.0 |
# 基础参数
ARG USER="ma-user"
ARG U_ID="1000"
ARG GROUP="ma-group"
ARG GID="100"
ARG CANN_TOOLKIT="Ascend-cann-toolkit_8.3.RC1.alpha003_linux-aarch64.run"
ARG CANN_KERNELS="Atlas-A3-cann-kernels_8.3.RC1.alpha003_linux-aarch64.run"
ARG CANN_NNAL="Ascend-cann-kernels-910b_8.3.RC1.alpha003_linux-aarch64.run"
ARG TORCH_NPU="torch_npu-2.7.1rc1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl"
ARG APEX="apex-0.1+ascend-cp310-cp310-linux_aarch64.whl"
ARG PORT=3100
#------------------- 1. 系统基础镜像 -------------------
FROM ubuntu:22.04 AS modelbase
ARG USER U_ID GROUP GID
ARG CANN_TOOLKIT CANN_KERNELS CANN_NNAL
ARG TORCH_NPU APEX PORT
WORKDIR /root
# 1) 安装系统依赖
COPY Install_script/install_system_package.sh .
RUN bash install_system_package.sh && rm -f install_system_package.sh
ENV TZ=Asia/Shanghai \
LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common:$LD_LIBRARY_PATH
#------------------- 2. 创建用户 & 基础配置 -------------------
RUN apt-get update && apt-get install -y wget && \
wget -q "http://172.17.0.1:${PORT}/Install_script/create_user.sh" -O create_user.sh && \
bash create_user.sh ${USER} ${U_ID} ${GROUP} ${GID} && \
echo "${USER}:自定义password" | chpasswd && \
echo "root:自定义password" | chpasswd && \
rm -f ~/*.sh && \
chown -R ${USER}:${GROUP} /etc/apt && \
ln -sf /lib /lib64
USER ${USER}
WORKDIR /home/${USER}
ENV HOMEPATH=/home/${USER} \
PATH=$PATH:${HOMEPATH}/.local/bin \
PIP_INDEX_URL=https://repo.huaweicloud.com/repository/pypi/simple \
PIP_TRUSTED_HOST=repo.huaweicloud.com
#------------------- 3. 安装 PTA(torch_npu + apex) -------------------
RUN wget -q "http://172.17.0.1:${PORT}/Install_script/set_user.sh" -O set_user.sh && bash set_user.sh && \
wget -q "http://172.17.0.1:${PORT}/package/${TORCH_NPU}" -O ${TORCH_NPU} && \
wget -q "http://172.17.0.1:${PORT}/package/${APEX}" -O ${APEX} && \
wget -q "http://172.17.0.1:${PORT}/Install_script/install_pta.sh" -O install_pta.sh && \
bash install_pta.sh "${TORCH_NPU}" "${APEX}" && \
pip cache purge && \
rm -f ~/*.whl ~/*.sh
#------------------- 4. 安装 CANN -------------------
RUN wget -q "http://172.17.0.1:${PORT}/Install_script/install_toolkit.sh" -O install_toolkit.sh && \
wget -q "http://172.17.0.1:${PORT}/package/${CANN_TOOLKIT}" -O ${CANN_TOOLKIT} && \
bash install_toolkit.sh "${CANN_TOOLKIT}" && \
rm -f ~/*.run ~/*.sh
RUN wget -q "http://172.17.0.1:${PORT}/Install_script/install_kernel.sh" -O install_kernel.sh && \
wget -q "http://172.17.0.1:${PORT}/package/${CANN_KERNELS}" -O ${CANN_KERNELS} && \
bash install_kernel.sh "${CANN_KERNELS}" && \
rm -f ~/*.run ~/*.sh
RUN wget -q "http://172.17.0.1:${PORT}/Install_script/install_nnal.sh" -O install_nnal.sh && \
wget -q "http://172.17.0.1:${PORT}/package/${CANN_NNAL}" -O ${CANN_NNAL} && \
bash install_nnal.sh "${CANN_NNAL}" && \
rm -f ~/*.run ~/*.sh
#------------------- 5. 克隆 & 安装 MindSpeed + Megatron + MindSpeed-LLM -------------------
RUN git clone https://gitcode.com/ascend/MindSpeed.git && \
cd MindSpeed && \
git checkout 2.2.0_core_r0.12.1 && \
pip install -r requirements.txt && \
pip install -e . && \
cd ..
RUN git clone https://gitcode.com/ascend/MindSpeed-LLM.git && \
git clone https://github.com/NVIDIA/Megatron-LM.git && \
cd Megatron-LM && \
git checkout core_v0.12.1 && \
cp -r megatron ../MindSpeed-LLM/ && \
cd ../MindSpeed-LLM && \
git checkout 2.2.0 && \
pip install -r requirements.txt
#------------------- 6. 其他依赖 -------------------
RUN wget -q "http://172.17.0.1:${PORT}/Install_script/moxing_framework-2.2.7+d78e9bef-py2.py3-none-any.whl" -O moxing.whl && \
pip install moxing.whl && rm -f moxing.whl && \
pip install transformers==4.51.0
#------------------- 7. 环境变量 -------------------
ENV LD_LIBRARY_PATH=/home/${USER}/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:\
/home/${USER}/Ascend/ascend-toolkit/latest/lib64:\
/home/${USER}/Ascend/ascend-toolkit/latest/tools/aml/lib64:\
/usr/local/Ascend/driver/lib64/driver:\
/usr/local/Ascend/driver/lib64/common:$LD_LIBRARY_PATH \
ATB_HOME_PATH=/home/${USER}/Ascend/nnal/atb/latest/atb/cxx_abi_0 \
TOOLCHAIN_HOME=/home/${USER}/Ascend/ascend-toolkit/latest/toolkit \
ASCEND_TOOLKIT_HOME=/home/${USER}/Ascend/ascend-toolkit/latest \
PYTHONPATH=/home/${USER}/Ascend/ascend-toolkit/latest/python/site-packages:$PYTHONPATH \
ASCEND_OPP_PATH=/home/${USER}/Ascend/ascend-toolkit/latest/opp \
ASCEND_AICPU_PATH=/home/${USER}/Ascend/ascend-toolkit/latest \
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3 \
ATB_RUNNER_POOL_SIZE=64 \
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0 \
ATB_MATMUL_SHUFFLE_K_ENABLE=1 \
ATB_LAUNCH_KERNEL_WITH_TILING=1 \
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1 \
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128 \
ASDOPS_LOG_LEVEL=ERROR \
LCCL_DETERMINISTIC=0 \
ASDOPS_MATMUL_PP_FLAG=1 \
ASDOPS_LOG_TO_BOOST_TYPE=atb \
ASDOPS_LOG_TO_FILE_FLUSH=0 \
ATB_COMPARE_TILING_EVERY_KERNEL=0 \
ASCEND_HOME_PATH=/home/${USER}/Ascend/ascend-toolkit/latest \
ASDOPS_LOG_TO_STDOUT=0
#------------------- 8. 自启动脚本 -------------------
RUN echo "export GLOG_v=2" >> ~/.bashrc && \
echo "source /usr/local/Ascend/driver/bin/setenv.bash" >> ~/.bashrc && \
echo "source ~/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc && \
echo "source ~/Ascend/nnal/atb/set_env.sh" >> ~/.bashrc
#------------------- 9. 清理 & 确认 -------------------
RUN cd ~ && rm -rf *.whl *.run *.tar.gz install_*.sh && ls -ldocker run \
--privileged \
--cap-add=SYS_PTRACE \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
--net=host \
--shm-size=500g \
--name 容器名字 \
-v /挂载宿主机文件夹路径:/容器内路径 \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /var/log/npu/:/usr/slog \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /sys/fs/cgroup:/sys/fs/cgroup:ro \
-itd \
--entrypoint /bin/bash \
镜像名称社区下载模型文件到指定路径,以魔搭社区为例。
pip install modelscope modelscope download --model Qwen/Qwen3-8B --local_dir /xxx/xxxx训练前需将模型权重由hf格式转换为megatron的mcore格式。转换脚本可使用 MindSpeed-LLM/examples/mcore/qwen3/ckpt_convert_qwen3_hf2mcore.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
python convert_ckpt.py \
--use-mcore-models \
--model-type GPT \
--load-model-type hf \
--save-model-type mg \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 2 \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--load-dir ./原始模型hf权重保存路径/ \
--save-dir ./转换后模型mcore权重保存路径/ \
--tokenizer-model ./model_from_hf/qwen3_hf/tokenizer.json \
--params-dtype bf16 \
--model-type-hf qwen3
可自行准备Instruct数据集或使用开源数据。这里以Alpaca数据集为例。
cd dataset/
wget https://huggingface.co/datasets/tatsu-lab/alpaca/resolve/main/data/train-00000-of-00001-a09b74b3ef9c3b56.parquet
cd ..数据预处理脚本可参考MindSpeed-LLM/examples/mcore/qwen3_next/ data_convert_qwen3_next_instruction.sh
python ./preprocess_data.py \
--input ./dataset/train-00000-of-00001-a09b74b3ef9c3b56.parquet \ #原始数据保存位置
--tokenizer-name-or-path ./model_from_hf/qwen3_8b/ \ #模型tokenizer保存位置
--output-prefix ./finetune_dataset/alpaca \ #处理后数据保存位置
--handler-name AlpacaStyleInstructionHandler \
--tokenizer-type PretrainedFromHF \
--workers 4 \
--log-interval 1000 \
--prompt-type qwen3注意 1)workers为处理时使用的cpu核数,可增加数量加速处理。处理后数据保存位置后需规定处理后数据的头名称,样例中为alpaca,处理后的文件形式如下所示:
finetune_dataset
├── alpaca_packed_attention_mask_document.bin
├── alpaca_packed_attention_mask_document.idx
├── alpaca_packed_input_ids_document.bin
├── alpaca_packed_input_ids_document.idx
├── alpaca_packed_labels_document.bin
├── alpaca_packed_labels_document.idx2)--enable-thinking:可在上述数据处理脚本中添加快慢思考模板开关,可设定为[true,false,none],默认值是none。开启后,会在数据集的模型回复中添加
训练启动脚本可参考MindSpeed-LLM/examples/mcore/qwen3/ tune_qwen3_8b_4K_full_ptd.sh
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
NPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($NPUS_PER_NODE*$NNODES))
# please fill these path configurations
CKPT_LOAD_DIR="your model ckpt path"
CKPT_SAVE_DIR="your model save ckpt path"
DATA_PATH="your data path"
TOKENIZER_PATH="your tokenizer path"
TP=1
PP=2
MBS=1
GBS=16
DISTRIBUTED_ARGS="
--nproc_per_node $NPUS_PER_NODE \
--nnodes $NNODES \
--node_rank $NODE_RANK \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT
"
GPT_ARGS="
--use-mcore-models \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--kv-channels 128 \
--qk-layernorm \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--sequence-parallel \
--use-distributed-optimizer \
--use-flash-attn \
--num-layers 36 \
--hidden-size 4096 \
--use-rotary-position-embeddings \
--num-attention-heads 32 \
--ffn-hidden-size 12288 \
--max-position-embeddings 32768 \
--seq-length 4096 \
--make-vocab-size-divisible-by 1 \
--padded-vocab-size 151936 \
--rotary-base 1000000 \
--micro-batch-size ${MBS} \
--global-batch-size ${GBS} \
--disable-bias-linear \
--train-iters 2000 \
--swiglu \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path ${TOKENIZER_PATH} \
--normalization RMSNorm \
--position-embedding-type rope \
--norm-epsilon 1e-6 \
--hidden-dropout 0 \
--attention-dropout 0 \
--no-gradient-accumulation-fusion \
--attention-softmax-in-fp32 \
--exit-on-missing-checkpoint \
--no-masked-softmax-fusion \
--group-query-attention \
--untie-embeddings-and-output-weights \
--num-query-groups 8 \
--min-lr 1.25e-7 \
--lr 1.25e-6 \
--weight-decay 1e-1 \
--clip-grad 1.0 \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--initial-loss-scale 4096 \
--no-load-optim \
--no-load-rng \
--seed 42 \
--bf16
"
DATA_ARGS="
--data-path $DATA_PATH \
--split 100,0,0
"
OUTPUT_ARGS="
--log-interval 1 \
--save-interval 1000 \
--eval-interval 1000 \
--eval-iters 0 \
"
TUNE_ARGS="
--finetune \
--stage sft \
--is-instruction-dataset \
--prompt-type qwen3 \
--variable-seq-lengths
"
torchrun $DISTRIBUTED_ARGS posttrain_gpt.py \
$GPT_ARGS \
$DATA_ARGS \
$OUTPUT_ARGS \
$TUNE_ARGS \
--distributed-backend nccl \
--load ${CKPT_LOAD_DIR} \
--save ${CKPT_SAVE_DIR} \
| tee logs/tune_qwen3_8b_full.log注意:
--num-layers-per-virtual-pipeline-stage N # N表示每个虚拟流水线阶段的层数
--overlap-grad-reduce \ #在反向传播(backward) 过程中,提前启动梯度的 All-Reduce 通信
--overlap-param-gather \ #在前向传播(forward)开始前,提前异步拉取(gather)分布式优化器中分片的参数,并与前向计算重叠。
训练结束后,需将ckpt转换回hf格式进行后续推理或评测。转换脚本可参考examples/mcore/qwen3/ckpt_convert_qwen3_mcore2hf.sh
export CUDA_DEVICE_MAX_CONNECTIONS=1
python convert_ckpt.py \
--use-mcore-models \
--model-type GPT \
--load-model-type mg \
--save-model-type hf \
--target-tensor-parallel-size 1 \
--target-pipeline-parallel-size 1 \
--spec mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec \
--load-dir ./model_weights/qwen3_mcore/ \ #ckpt权重保存位置
--save-dir ./model_from_hf/qwen3_hf/ \ #转换后hf权重保存位置
--params-dtype bf16 \
--model-type-hf qwen3注意:
1) 模型初始权重读取路径写错时MindSpeed-LLM框架不会自动报错,会自动随机生成模型权重。需注意打印日志中是否有successfully loading checkpoint,如出现will not load any checkpoints and will start from random则需检查路径是否填写错误。 2) 如果在数据读取时报错找不到input_ids,大概率为数据读取路径写错。