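Qwen3 is the latest generation of large language models in the Qwen series, offering a range of dense and mixture-of-experts (MoE) models. Built on extensive training, Qwen3 delivers major advances in reasoning, instruction following, agent capabilities, and multilingual support.

Weights download link: https://www.modelscope.cn/models/Qwen/Qwen3-30B-A3B

| Item | Value |
|---|---|
| Hardware | 910B (6 cards) |

| Component | Version |
|---|---|
| sglang | main branch |
| HDK | Ascend HDK 25.2.1 |
| CANN | 8.3.RC1 |
| Model | Qwen3-30B-A3B-Instruct |
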
ARG CANN_VERSION=8.3.rc1
ARG DEVICE_TYPE=a3
ARG OS=ubuntu22.04
ARG PYTHON_VERSION=py3.11
FROM quay.io/ascend/cann:$CANN_VERSION-$DEVICE_TYPE-$OS-$PYTHON_VERSION
# Update pip & apt sources
ARG PIP_INDEX_URL="https://pypi.org/simple/"
ARG APTMIRROR=""
ARG PYTORCH_VERSION="2.8.0"
ARG TORCHVISION_VERSION="0.23.0"
ARG PTA_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/torch_npu/torch_npu-2.8.0.post2.dev20251113-cp311-cp311-manylinux_2_28_aarch64.whl"
ARG TRITON_ASCEND_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/triton_ascend/triton_ascend-3.2.0.dev2025112116-cp311-cp311-manylinux_2_27_aarch64.manylinux_2_28_aarch64.whl"
ARG BISHENG_NAME="Ascend-BiSheng-toolkit_aarch64_20251121.run"
ARG BISHENG_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/triton_ascend/${BISHENG_NAME}"
ARG SGLANG_TAG=main
ARG ASCEND_CANN_PATH=/usr/local/Ascend/ascend-toolkit
ARG SGLANG_KERNEL_NPU_TAG=main
ARG PIP_INSTALL="python3 -m pip install --no-cache-dir"
ARG DEVICE_TYPE
WORKDIR /workspace
# Define environments
ENV DEBIAN_FRONTEND=noninteractive
RUN pip config set global.index-url $PIP_INDEX_URL
RUN if [ -n "$APTMIRROR" ];then sed -i "s|.*.ubuntu.com|$APTMIRROR|g" /etc/apt/sources.list ;fi
# Install development tools and utilities
RUN apt-get update -y && apt upgrade -y && apt-get install -y \
build-essential \
cmake \
vim \
wget \
curl \
net-tools \
zlib1g-dev \
lld \
clang \
locales \
ccache \
openssl \
libssl-dev \
pkg-config \
ca-certificates \
&& rm -rf /var/cache/apt/* \
&& rm -rf /var/lib/apt/lists/* \
&& update-ca-certificates \
&& locale-gen en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US:en
ENV LC_ALL=en_US.UTF-8
### Install MemFabric
RUN ${PIP_INSTALL} mf-adapter==1.0.0
### Install SGLang Model Gateway
RUN ${PIP_INSTALL} sglang-router
### Install PyTorch and PTA
RUN (${PIP_INSTALL} torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION} --index-url https://download.pytorch.org/whl/cpu) \
&& (${PIP_INSTALL} ${PTA_URL})
# TODO: install from pypi released triton-ascend
RUN (${PIP_INSTALL} pybind11) \
&& (${PIP_INSTALL} ${TRITON_ASCEND_URL})
# Install SGLang
RUN git clone https://github.com/sgl-project/sglang --branch $SGLANG_TAG && \
(cd sglang/python && rm -rf pyproject.toml && mv pyproject_other.toml pyproject.toml && ${PIP_INSTALL} -v .[srt_npu]) && \
rm -rf sglang
# Install Deep-ep
# pin wheel to 0.45.1 ref: https://github.com/pypa/wheel/issues/662
RUN ${PIP_INSTALL} wheel==0.45.1 && git clone --branch $SGLANG_KERNEL_NPU_TAG https://github.com/sgl-project/sgl-kernel-npu.git \
&& export LD_LIBRARY_PATH=${ASCEND_CANN_PATH}/latest/runtime/lib64/stub:$LD_LIBRARY_PATH && \
source ${ASCEND_CANN_PATH}/set_env.sh && \
cd sgl-kernel-npu && \
bash build.sh \
&& ${PIP_INSTALL} output/deep_ep*.whl output/sgl_kernel_npu*.whl \
&& cd .. && rm -rf sgl-kernel-npu \
&& cd "$(python3 -m pip show deep-ep | awk '/^Location:/ {print $2}')" && ln -s deep_ep/deep_ep_cpp*.so
# Install CustomOps
RUN wget https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/ops/CANN-custom_ops-8.2.0.0-$DEVICE_TYPE-linux.aarch64.run && \
chmod a+x ./CANN-custom_ops-8.2.0.0-$DEVICE_TYPE-linux.aarch64.run && \
./CANN-custom_ops-8.2.0.0-$DEVICE_TYPE-linux.aarch64.run --quiet --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp && \
wget https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/ops/custom_ops-1.0.$DEVICE_TYPE-cp311-cp311-linux_aarch64.whl && \
${PIP_INSTALL} ./custom_ops-1.0.$DEVICE_TYPE-cp311-cp311-linux_aarch64.whl
# Install Bisheng
RUN wget -O "${BISHENG_NAME}" "${BISHENG_URL}" && chmod a+x "${BISHENG_NAME}" && "./${BISHENG_NAME}" --install && rm "${BISHENG_NAME}"
CMD ["/bin/bash"]
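
The `FROM` line in the Dockerfile above assembles the base image reference from four build arguments. As a minimal sketch, the helper below reproduces that composition and prints (without running) a build command that overrides `DEVICE_TYPE`. The function names, the image tag `sglang-npu:local`, and the Dockerfile path are hypothetical, and whether a base image exists for a given device type is an assumption you should verify against the registry.

```shell
#!/bin/sh
# Compose the base image reference the same way the FROM line does.
base_image() {
    cann_version=$1
    device_type=$2
    os=$3
    python_version=$4
    echo "quay.io/ascend/cann:${cann_version}-${device_type}-${os}-${python_version}"
}

# Defaults taken from the ARG lines of the Dockerfile.
base_image 8.3.rc1 a3 ubuntu22.04 py3.11

# Print (do not run) a build command that overrides DEVICE_TYPE.
# The tag "sglang-npu:local" and the Dockerfile path are hypothetical.
build_command() {
    device_type=$1
    echo "docker build --build-arg DEVICE_TYPE=${device_type} -t sglang-npu:local -f Dockerfile ."
}

build_command 910b
```

Because `DEVICE_TYPE` is also interpolated into the custom-ops download file names, overriding it changes both the base image and the packages that are fetched.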
Note: two packages referenced in the Dockerfile need to be updated to the following versions:
https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com:443/ops/CANN-custom_ops-8.3.0.1-910b-linux.aarch64.run?AccessKeyId=HPUAXT4YM0U8JNTERLST&Expires=1795868352&Signature=fpkhjfHGDNvJviVY3ezAJeavx%2BU%3D
https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com:443/ops/custom_ops-2.0.910b-cp311-cp311-linux_aarch64.whl?AccessKeyId=HPUAXT4YM0U8JNTERLST&Expires=1795868372&Signature=tYX1oA3J0NigpLYkLOkRSTBC9lY%3D

docker run -itd --privileged --name=sglang-test --net=host \
--shm-size 500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home:/home \
-v /disk1:/disk1 \
-v /disk2:/disk2 \
-v /disk3:/disk3 \
-v /opt:/opt \
--entrypoint /bin/bash sg-langxxx

pkill -9 python
pkill -9 sglang
export SGLANG_SET_CPU_AFFINITY=1
export ENABLE_PROFILING=0
export HCCL_OP_EXPANSION_MODE="AIV"
unset ASCEND_LAUNCH_BLOCKING
export ASCEND_USE_FIA=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=enp67s0f5
export GLOO_SOCKET_IFNAME=enp67s0f5
source /usr/local/Ascend/8.5.0/bisheng_toolkit/set_env.sh
python3 -m sglang_router.launch_server \
--model-path /disk1/liuchenbing/qwen3-30b-a3b \
--trust-remote-code \
--attention-backend ascend \
--mem-fraction-static 0.92 \
--disable-radix-cache \
--chunked-prefill-size 32768 \
--cuda-graph-bs 1 4 8 16 20 24 28 32 36 40 64 72 80 96 120 160 180 200 256 \
--max-running-requests 200 \
--warmup 10 \
--tp-size 2 \
--dp-size 3 \
--host 0.0.0.0 \
--port 8080 \
--log-level debug \
--router-health-success-threshold 2 \
--router-health-check-timeout-secs 6000 \
--router-health-check-interval-secs 60 \
--router-model-path /disk1/liuchenbing/qwen3-30b-a3b \
--router-policy round_robin
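
The launch command above combines `--tp-size 2` with `--dp-size 3`, which together occupy 2 x 3 = 6 NPUs and so match the 6-card environment. As a minimal sketch, assuming each data-parallel replica uses its own tensor-parallel group, a pre-launch check could look like the following. The function names are hypothetical and not part of SGLang.

```shell
#!/bin/sh
# Number of NPUs needed when every data-parallel replica
# runs its own tensor-parallel group (assumption stated above).
required_npus() {
    tp_size=$1
    dp_size=$2
    echo $((tp_size * dp_size))
}

# Return non-zero when the requested layout does not fit the machine.
check_parallel_config() {
    tp_size=$1
    dp_size=$2
    available=$3
    needed=$(required_npus "$tp_size" "$dp_size")
    if [ "$needed" -gt "$available" ]; then
        echo "need $needed NPUs but only $available are available" >&2
        return 1
    fi
    echo "ok: $needed of $available NPUs used"
}

# Values from the launch command and the hardware table.
check_parallel_config 2 3 6
```
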
curl -X POST http://xxx/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "xxx",
"messages": [
{
"role": "user",
"content": "你是谁?"
}
],
"max_tokens": 100,
"ignore_eos": false,
"stream": false
}'

Solution: update the following two packages:
https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com:443/ops/CANN-custom_ops-8.3.0.1-910b-linux.aarch64.run?AccessKeyId=HPUAXT4YM0U8JNTERLST&Expires=1795868352&Signature=fpkhjfHGDNvJviVY3ezAJeavx%2BU%3D
https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com:443/ops/custom_ops-2.0.910b-cp311-cp311-linux_aarch64.whl?AccessKeyId=HPUAXT4YM0U8JNTERLST&Expires=1795868372&Signature=tYX1oA3J0NigpLYkLOkRSTBC9lY%3D

Qwen3 is the latest generation of large language models in the Qwen series, offering a full suite of dense and mixture-of-experts (MoE) models. Built on extensive training, Qwen3 achieves major advances in reasoning, instruction following, agent capabilities, and multilingual support.

| Item | Value |
|---|---|
| Hardware | 910B (1 card) |

| Component | Version |
|---|---|
| vllm-ascend | 0.11.0.rc2 |
| CANN | 8.3.RC1 |
| Model | Qwen3-30B-A3B |

docker run -itd --privileged --name=v0.11.0rc3-a3-test --net=host \
--shm-size 500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /home:/home \
-v /disk1:/disk1 \
-v /disk2:/disk2 \
-v /disk3:/disk3 \
-v /opt:/opt \
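--entrypoint /bin/bash quay.io/ascend/vllm-ascend:v0.11.0rc3-a3

Version 0.11.0 runs in graph mode by default. To print the shape of every operator execution in profiling, add --enforce-eager to the launch command. This enables single-operator (eager) mode, which performs worse but makes performance analysis easier.
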
pkill -9 python
pkill -9 VLLM
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export VLLM_USE_V1=1
export VLLM_VERSION=0.11.0
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m vllm.entrypoints.openai.api_server \
--model=XXX/Qwen3-30B-A3B \
--served-model-name Qwen3-30B-A3B \
--trust-remote-code \
--max-model-len 32768 \
--max-num-batched-tokens 122880 \
-tp 8 \
--port 9998 \
--block-size 128 \
--no-enable-prefix-caching \
--enforce-eager \
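--gpu-memory-utilization 0.95

Single-operator mode has to dispatch operators frequently, which creates a host-side bottleneck. To mitigate this, use ACL Graph mode: the graph is captured once and replayed many times, which reduces CPU and framework scheduling overhead and improves throughput.
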
pkill -9 python
pkill -9 VLLM
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export VLLM_USE_V1=1
export VLLM_VERSION=0.11.0
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m vllm.entrypoints.openai.api_server \
--model=XXX/Qwen3-30B-A3B \
--served-model-name Qwen3-30B-A3B \
--trust-remote-code \
--max-model-len 32768 \
--max-num-batched-tokens 122880 \
-tp 8 \
--port 9998 \
--block-size 128 \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.95

The log shows that the default graph capture sizes are [512, 448, 384, 312, 248, 184, 112, 48, 1] and that the default graph mode is PIECEWISE (piecewise graph mode). Both can be tuned:

Adjust the cudagraph_capture_sizes parameter in --compilation-config to change which sizes are captured, so that the captured graphs cover the shapes that matter for the workload. Tuning this parameter appropriately raises throughput. In this tuning case the workload runs at a concurrency of 16, so 16 needs to be added to the cudagraph_capture_sizes array. In our measurements, this improved throughput by 8% over the default capture sizes.

pkill -9 python
pkill -9 VLLM
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export VLLM_USE_V1=1
export VLLM_VERSION=0.11.0
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m vllm.entrypoints.openai.api_server \
--model=XXX/Qwen3-30B-A3B \
--served-model-name Qwen3-30B-A3B \
--trust-remote-code \
--max-model-len 32768 \
--max-num-batched-tokens 122880 \
-tp 8 \
--port 9998 \
--block-size 128 \
--no-enable-prefix-caching \
--compilation-config '{"cudagraph_capture_sizes":[1,2,4,8,16,24,48]}' \
--gpu-memory-utilization 0.95
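
To illustrate why a concurrency of 16 benefits from an explicit entry in the capture list, here is a minimal sketch. It assumes that graph replay pads a batch up to the smallest captured size that is greater than or equal to the batch size. That padding rule is an assumption about the runtime, not something stated in this guide, and `pick_capture_size` is a hypothetical helper.

```shell
#!/bin/sh
# pick_capture_size BATCH SIZE...
# Print the smallest captured size that can hold BATCH,
# or "none" when no captured graph is large enough.
pick_capture_size() {
    batch=$1
    shift
    best=""
    for size in "$@"; do
        if [ "$size" -ge "$batch" ]; then
            if [ -z "$best" ] || [ "$size" -lt "$best" ]; then
                best=$size
            fi
        fi
    done
    echo "${best:-none}"
}

# Default capture sizes reported in the log: batch 16 maps to 48.
pick_capture_size 16 512 448 384 312 248 184 112 48 1

# Tuned capture sizes: batch 16 maps to 16, so no padding is needed.
pick_capture_size 16 1 2 4 8 16 24 48
```

Under that assumption, the default list would run a 16-request batch in a graph captured for 48, while the tuned list gives it an exact match.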
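The default graph mode is PIECEWISE. Full Graph support in v0.11.0 includes the FULL_DECODE_ONLY mode, so setting --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,24,48]}' switches to FULL_DECODE_ONLY graph mode and further improves inference performance. In our measurements, throughput improved by 25% over PIECEWISE mode.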
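
The JSON value passed to `--compilation-config` is easy to mistype by hand. As a minimal sketch, the hypothetical helper below assembles that string from a graph mode and a list of capture sizes. It only formats text and does not validate the mode name against what the runtime supports.

```shell
#!/bin/sh
# build_compilation_config MODE SIZE...
# Print a JSON object suitable for --compilation-config.
build_compilation_config() {
    mode=$1
    shift
    sizes=""
    for s in "$@"; do
        # Join sizes with commas, without a leading comma.
        sizes="${sizes:+$sizes,}$s"
    done
    printf '{"cudagraph_mode": "%s", "cudagraph_capture_sizes":[%s]}\n' "$mode" "$sizes"
}

build_compilation_config FULL_DECODE_ONLY 1 2 4 8 16 24 48
```

The output can be passed on the command line as `--compilation-config "$(build_compilation_config FULL_DECODE_ONLY 1 2 4 8 16 24 48)"`.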
The log shows that FULL_DECODE_ONLY graph mode is enabled.

3.6 Asynchronous scheduling

The asynchronous scheduling feature (--async-scheduling) reduces the idle bubbles between consecutive tokens during inference and improves overall inference performance. To enable it, add the --async-scheduling option when starting the inference service.

pkill -9 python
pkill -9 VLLM
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export VLLM_USE_V1=1
export VLLM_VERSION=0.11.0
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m vllm.entrypoints.openai.api_server \
--model=XXX/Qwen3-30B-A3B \
--served-model-name Qwen3-30B-A3B \
--trust-remote-code \
--max-model-len 32768 \
--max-num-batched-tokens 122880 \
-tp 8 \
--port 9998 \
--block-size 128 \
--no-enable-prefix-caching \
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,2,4,8,16,24,48]}' \
--gpu-memory-utilization 0.95
Performance effect: with asynchronous scheduling enabled, TPOT (time per output token) is 0.7 ms lower than without it.

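The TASK_QUEUE_ENABLE environment variable controls whether the task_queue operator dispatch queue is enabled and at which optimization level.

- "0": disables the task_queue operator dispatch queue optimization.
- "1" or unset: enables Level 1 optimization. Level 1 splits operator dispatch into two stages and moves part of the work (mainly aclnn operator calls) onto a newly added second-stage pipeline. The two stages pass tasks through an operator queue and run in parallel, so part of the dispatch time is hidden. This reduces total dispatch time and improves end-to-end performance.
- "2": enables Level 2 optimization. Level 2 includes Level 1 and further balances the load between the two stages, mainly by moving workspace-related tasks to the second stage. This hides more of the dispatch time and yields a larger gain. This setting takes effect only in binary scenarios; Level 2 is the recommended value.

Details: https://www.hiascend.com/document/detail/zh/Pytorch/600/apiref/Envvariables/Envir_019.html

To enable:

export TASK_QUEUE_ENABLE=1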
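
The three settings described above can be summarized in a small lookup. This is a minimal sketch: `describe_task_queue_level` is a hypothetical helper that mirrors the description in this section (including the default of Level 1 when the variable is unset) and does not query the runtime.

```shell
#!/bin/sh
# describe_task_queue_level [VALUE]
# Map a TASK_QUEUE_ENABLE value to the behavior described in this guide.
# An empty or missing value is treated like "1", matching the documented default.
describe_task_queue_level() {
    case "${1:-1}" in
        0) echo "disabled" ;;
        1) echo "level 1: two-stage dispatch pipeline" ;;
        2) echo "level 2: level 1 plus workspace tasks on the second stage" ;;
        *)
            echo "invalid TASK_QUEUE_ENABLE value: $1" >&2
            return 1
            ;;
    esac
}

# Report the level implied by the current environment.
describe_task_queue_level "${TASK_QUEUE_ENABLE:-}"
```
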
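
jemalloc is a memory allocator. Compared with traditional allocators such as the one in glibc, its main advantages are lower memory fragmentation and more efficient allocation under multithreaded, highly concurrent workloads, which lets it take full advantage of multiple cores. During memory allocation, locks force threads to wait and hurt performance significantly. jemalloc avoids lock contention by using thread-local variables: each thread has its own memory manager, and allocation completes within that thread without competing with other threads for locks.

Details: https://www.hiascend.com/document/detail/zh/canncommercial/83RC1/softwareinst/instg/instg_0099.html?Mode=PmIns&InstallType=local&OS=openEuler&Software=cannToolKit

To enable:

export LD_PRELOAD=/usr/local/Ascend/ascend-toolkit/latest/lib64/libjemalloc.so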
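
If the library path differs on your system, every process started afterwards will report a preload error. As a minimal sketch, the hypothetical helper below sets `LD_PRELOAD` only when the file exists and keeps any value that was already set.

```shell
#!/bin/sh
# enable_jemalloc PATH
# Prepend PATH to LD_PRELOAD only when the library file exists.
enable_jemalloc() {
    lib=$1
    if [ -f "$lib" ]; then
        # Keep any previously configured preload entries after the new one.
        export LD_PRELOAD="$lib${LD_PRELOAD:+:$LD_PRELOAD}"
        echo "jemalloc preloaded from $lib"
    else
        echo "jemalloc not found at $lib; LD_PRELOAD left unchanged" >&2
        return 1
    fi
}
```

A typical call, using the path from this guide, would be `enable_jemalloc /usr/local/Ascend/ascend-toolkit/latest/lib64/libjemalloc.so`, run in the same shell that later starts the inference service.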
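
Enabling HCCL AIV mode means that the orchestration and expansion of the communication algorithm takes place on the device-side Vector Core, where it is also executed.

Details: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/850alpha001/maintenref/envvar/envref_07_0096.html

Performance effect: with AIV mode enabled, TPOT is 8 ms lower and throughput is 36% higher than without it.

To enable:

export HCCL_OP_EXPANSION_MODE="AIV"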

Note: the performance figures above are for reference only; rely on your own measurements.