Run the Docker container:
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:v0.12.0rc1
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash
Set the environment variables:
# Load model from ModelScope to speed up download
export VLLM_USE_MODELSCOPE=True
# Set `max_split_size_mb` to reduce memory fragmentation and avoid out of memory
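export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
Note: max_split_size_mb prevents the native allocator from splitting memory blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. You can find more details here.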
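A quick way to confirm the allocator setting before launching a long job is to read it back from the environment. The sketch below is illustrative only: `parse_alloc_conf` is a hypothetical helper, not part of torch_npu, and it assumes the variable uses the comma-separated `key:value` form shown in the export above.

```python
import os


def parse_alloc_conf(value):
    """Parse a comma-separated 'key:value' string into a dict.

    Hypothetical helper for inspection only; torch_npu does its own parsing.
    """
    options = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, val = item.partition(":")
        if not sep:
            raise ValueError(f"expected 'key:value', got {item!r}")
        options[key.strip()] = val.strip()
    return options


# Read the variable exactly as the shell exported it.
conf = parse_alloc_conf(os.environ.get("PYTORCH_NPU_ALLOC_CONF", ""))
split_mb = conf.get("max_split_size_mb")
if split_mb is None:
    print("max_split_size_mb is not set")
else:
    print(f"max_split_size_mb = {split_mb} MB")
```

Running this in the same shell session that exported the variable shows whether the value reached the process environment.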
Run the following script to perform offline inference on the NPU:
import gc
import torch
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)

def clean_up():
    # Tear down the parallel groups, then release cached NPU memory.
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=40)
# Shard the model across 4 NPUs using the multiprocessing executor.
llm = LLM(model="/model/Qwen3-0.6B",
          tensor_parallel_size=4,
          trust_remote_code=True,
          distributed_executor_backend="mp",
          max_model_len=5500,
          max_num_batched_tokens=5500,
          compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"})
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
# Drop the engine before cleaning up so its memory can be reclaimed.
del llm
clean_up()
Run the following in the Docker container on the NPU to start the vLLM server:
# Set the IDs of the NPU devices visible to the process
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
# Set the operator dispatch pipeline level to 1 and disable manual memory control in ACLGraph
export TASK_QUEUE_ENABLE=1
# [Optional] jemalloc
# jemalloc can improve performance. If `libjemalloc.so` is installed on your machine, you can enable it.
# If the OS is Ubuntu:
# export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
# If the OS is openEuler:
# export LD_PRELOAD=/usr/lib64/libjemalloc.so.2:$LD_PRELOAD
# Enable the AIVector core to directly schedule RoCE communication
export HCCL_OP_EXPANSION_MODE="AIV"
# Enable dense model and general optimizations for better performance.
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
# Enable FlashComm_v1 optimization when tensor parallel is enabled.
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1
vllm serve /model/Qwen3-0.6B \
--served-model-name qwen3 \
--trust-remote-code \
--async-scheduling \
--distributed-executor-backend mp \
--tensor-parallel-size 4 \
--max-model-len 5500 \
--max-num-batched-tokens 40960 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--port 8113 \
--block-size 128 \
--gpu-memory-utilization 0.9
Note: For maximum performance, you can set the cudagraph_capture_sizes parameter. The following is an example for a batch size of 72:
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY", "cudagraph_capture_sizes":[1,8,24,48,60,64,72,76]}'
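Hand-writing JSON inside shell quotes is easy to get wrong, so it can help to generate the `--compilation-config` argument instead. The following is a minimal sketch: `build_compilation_config` is a hypothetical helper, and the capture sizes are simply the ones from the note above, not a tuned recommendation.

```python
import json
import shlex


def build_compilation_config(capture_sizes):
    """Return the JSON string for --compilation-config with explicit capture sizes."""
    # Deduplicate and sort so the list is stable regardless of input order.
    sizes = sorted(set(int(s) for s in capture_sizes))
    if not sizes or sizes[0] < 1:
        raise ValueError("capture sizes must be positive integers")
    config = {
        "cudagraph_mode": "FULL_DECODE_ONLY",
        "cudagraph_capture_sizes": sizes,
    }
    return json.dumps(config)


# Sizes copied from the batch-size-72 example in the note above.
config_json = build_compilation_config([1, 8, 24, 48, 60, 64, 72, 76])
# shlex.quote adds the shell quoting needed to pass the JSON as one argument.
print("--compilation-config " + shlex.quote(config_json))
```

The printed text can be pasted into the `vllm serve` command in place of the existing `--compilation-config` line.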
If the service starts successfully, you will see output like the following:
INFO: Started server process [2736]
INFO: Waiting for application startup.
INFO: Application startup complete.
After the server starts, you can query the model with an input prompt:
curl http://localhost:8113/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "qwen3",
"messages": [
{"role": "user", "content": "Give me a short introduction to large language models."}
],
"temperature": 0.6,
"top_p": 0.95,
"top_k": 20,
"max_tokens": 4096
}'
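The same request can be sent from Python using only the standard library, which is convenient when scripting checks against the server. This is a sketch under stated assumptions: the helper names are hypothetical, the readiness check polls the OpenAI-compatible `/v1/models` route, and the client must run where port 8113 is reachable (for example inside the container, since the `docker run` command above publishes only port 8000).

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8113"  # port taken from the `vllm serve` command


def build_chat_request(prompt, model="qwen3"):
    """Build the same JSON body as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "max_tokens": 4096,
    }


def wait_for_server(base_url, timeout_s=300.0, interval_s=2.0):
    """Poll GET /v1/models until it answers 200; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url + "/v1/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short sleep
        time.sleep(interval_s)
    return False


def chat(base_url, payload):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Show the request body; no network traffic happens until chat() is called.
payload = build_chat_request("Give me a short introduction to large language models.")
print(json.dumps(payload, indent=2))
```

Once `wait_for_server(BASE_URL)` returns `True`, calling `chat(BASE_URL, payload)` sends the request and returns the generated text.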