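GPT-OSS-120B is an open-source language model released by OpenAI with 120 billion parameters. This project provides guidance for running inference with the model on Ascend Atlas A2 inference servers using vllm-ascend.
For a detailed introduction to the model, see the model's open-source community page: https://modelscope.cn/models/unsloth/gpt-oss-120b-BF16

| NPU model | Number of cards | Model |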
|---|---|---|
| 910B3 | 8 | gpt-oss-120b |

| Software | Version |
|---|---|
| CANN | 8.3.RC1 |
| Python | 3.11.13 |
| transformers | 4.57.1 |
| torch | 2.7.1 |
| torch-npu | 2.7.1 |
| vllm | v0.11.2 |
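The versions in the table can be compared against what is installed in the container. The following is a minimal sketch, not part of the original guide: the `EXPECTED` dictionary and the `check_versions` helper are illustrative names, and the pip distribution names used as keys are assumptions. vllm is left out on purpose, because this guide installs it from a development commit rather than from a release.

```python
from importlib import metadata

# Expected versions copied from the table above. The distribution names used
# as keys are assumptions about how each package is registered with pip.
EXPECTED = {
    "transformers": "4.57.1",
    "torch": "2.7.1",
    "torch-npu": "2.7.1",
}


def check_versions(expected, lookup=metadata.version):
    """Return {name: (expected, installed_or_None, ok)} for each package."""
    report = {}
    for name, want in expected.items():
        try:
            have = lookup(name)
        except metadata.PackageNotFoundError:
            have = None
        # Ignore local build tags such as "+cpu" when comparing versions.
        ok = have is not None and have.split("+")[0] == want
        report[name] = (want, have, ok)
    return report


if __name__ == "__main__":
    for name, (want, have, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {want}, installed {have} [{status}]")
```

The `lookup` parameter exists only so the comparison logic can be exercised without the packages being installed.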

Check disk space: first make sure the system has enough storage. The model has 120B parameters and needs roughly 200 GB or more of free space.
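df -h /home

Create the model download directory:
mkdir -p /home/models/GPT-OSS

Because Ascend does not yet support FP4, download the dequantized (BF16) version. ModelScope is the preferred source because it is faster: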
pip install modelscope
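modelscope download --model=unsloth/gpt-oss-120b-BF16 --local_dir=/home/models/GPT-OSS

If the model is not available on ModelScope, it can be downloaded from Hugging Face as follows. Set a mirror inside China for acceleration (recommended), then download the model: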
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download unsloth/gpt-oss-120b-BF16 --local-dir /home/models/GPT-OSS

If a network timeout (ReadTimeoutError) interrupts the download, rerun the command above to resume it, or use the huggingface_hub Python library to download the model:
# Install huggingface_hub
pip install huggingface_hub
# Download the model with Python
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='unsloth/gpt-oss-120b-BF16', local_dir='/home/models/GPT-OSS')
"

Pull the CANN image:
docker pull m.daocloud.io/quay.io/ascend/cann:8.3.rc1-910b-ubuntu22.04-py3.11

Start the development container:
export IMAGE=m.daocloud.io/quay.io/ascend/cann:8.3.rc1-910b-ubuntu22.04-py3.11
docker run -itd -u 0 --ipc=host --privileged \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
--name vllm-gpt-oss-120b-v1 \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/models:/model \
-p 3050:8000 \
-it $IMAGE bash

Enter the development container:
docker exec -it vllm-gpt-oss-120b-v1 bash

Uninstall vllm:
pip uninstall vllm -y

Install vllm from GitHub:
git clone https://github.com/vllm-project/vllm.git

Change into the vllm directory and install the pinned vllm version:
cd vllm
git checkout 275de34170654274616082721348b7edd9741d32
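pip install -v -e .

Note: this vllm commit was the latest development version when the adaptation was done and predates the v0.11.2 release, so the v0.11.2 release cannot be used directly.

Install vllm from GitCode
In environments that cannot reach GitHub, clone from the GitCode mirror instead:
git clone https://gitcode.com/Joiin0392/vllm.git

However, the vllm installation also downloads other dependencies from GitHub, so the dependency URLs in the relevant files must be changed as well: replace https://github.com with https://githubfast.com. The files involved are: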
vllm/CMakeLists.txt
vllm/cmake/cpu_extension.cmake
vllm/requirements/test.in
vllm/requirements/test.txt
vllm/setup.py

Change into the vllm directory and install the pinned vllm version:
cd vllm
git checkout 275de34170654274616082721348b7edd9741d32
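pip install -v -e .

To support GPT-OSS, vllm-ascend has been customized, so the code must be obtained from the customized repository. For the detailed changes, see [vllm-ascend adaptation changes](#vllm-ascend 修改适配).
Uninstall vllm-ascend: leave the vllm directory, return to the parent directory, and uninstall vllm-ascend: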
cd ..
pip uninstall vllm-ascend -y

Install vllm-ascend
Clone the vllm-ascend version from GitCode:
git clone https://gitcode.com/Joiin0392/vllm-ascend.git

Change into the vllm-ascend directory, check out the pinned commit, and install it:
cd vllm-ascend
git checkout 37d8fecc443786b7fd5c73be553013136a6bfeba
pip install -v -e .

vllm depends on torch 2.8.0 by default, which is not yet compatible with this setup, so roll back to 2.7.1:
cd ..
pip uninstall torch -y
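pip install torch==2.7.1

In addition, on the host, copy the tokenizer into the model directory so that it is mapped from outside the container into the model directory inside the container: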
pip install openai-harmony
wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
mv o200k_base.tiktoken fb374d419588a4632f3f557e76b4b70aebbca790
TIKTOKEN_RS_CACHE_DIR=$(pwd) python -c 'from openai_harmony import load_harmony_encoding; load_harmony_encoding("HarmonyGptOss")'

pip install evalscope
pip install evalscope[perf] -U

pip show vllm vllm-ascend torch torch-npu transformers

The output is:
[root:vllm]$ pip show vllm vllm-ascend torch torch-npu transformers
Name: vllm
Version: 0.11.3.dev0+g275de3417.d20251210
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email:
License-Expression: Apache-2.0
Location: /usr/local/Ascend/vllm
Requires: aiohttp, anthropic, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, datasets, depyf, diskcache, einops, fastapi, filelock, gguf, intel-openmp, intel_extension_for_pytorch, lark, llguidance, lm-format-enforcer, mistral_common, model-hosting-container-standards, msgspec, ninja, numba, numpy, openai, openai-harmony, opencv-python-headless, outlines_core, packaging, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, regex, requests, scipy, sentencepiece, setproctitle, setuptools, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, triton, typing_extensions, watchfiles, xgrammar
Required-by:
---
Name: vllm_ascend
Version: 0.11.0rc1.dev449+gc9b64052e.d20251209
Summary: vLLM Ascend backend plugin
Home-page: https://github.com/vllm-project/vllm-ascend
Author: vLLM-Ascend team
Author-email:
License: Apache 2.0
Location: /usr/local/python3.11.13/lib/python3.11/site-packages
Editable project location: /usr/local/Ascend/vllm-ascend
Requires: cmake, compressed_tensors, decorator, einops, msgpack, numba, numpy, opencv-python-headless, packaging, pandas, pandas-stubs, pip, pybind11, pyyaml, quart, scipy, setuptools, setuptools-scm, torch, torch-npu, torchvision, transformers, wheel
Required-by:
---
Name: torch
Version: 2.7.1+cpu
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /usr/local/python3.11.13/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, compressed-tensors, evalscope, flashinfer-python, torch_npu, torchaudio, torchvision, vllm, vllm_ascend, xformers, xgrammar
---
Name: torch_npu
Version: 2.7.1
Summary: NPU bridge for PyTorch
Home-page: https://gitcode.com/ascend/pytorch
Author:
Author-email:
License: BSD License
Location: /usr/local/python3.11.13/lib/python3.11/site-packages
Requires: torch
Required-by: vllm_ascend
---
Name: transformers
Version: 4.57.1
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/python3.11.13/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: compressed-tensors, evalscope, vllm, vllm_ascend, xgrammar

export HCCL_INTRA_ROCE_ENABLE=1
export IGNORE_GRAPH_UPDATE_EXCEPTIONS=0
TIKTOKEN_RS_CACHE_DIR=/model/tiktoken_cache vllm serve /model/GPT-OSS \
--served-model-name gpt-oss-120b-bf16 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 8 \
--max-model-len 16384 \
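--compilation_config '{"cudagraph_mode": "PIECEWISE", "cudagraph_capture_sizes": [1024,512,256,128,64,16,4,2,1]}'

Note: adjust the command to your environment. The details are as follows:

| No. | Parameter | Description | Required | Needs modification |
|---|---|---|---|---|
| 1 | HCCL_INTRA_ROCE_ENABLE | Environment variable for multi-card communication | Required | No |
| 2 | IGNORE_GRAPH_UPDATE_EXCEPTIONS | Guards against exceptions in graph mode | Required for graph mode | No |
| 3 | TIKTOKEN_RS_CACHE_DIR | Path to the model's vocabulary | Required | Yes |
| 4 | vllm serve | Launch command, followed by the model path | Required | Yes |
| 5 | --served-model-name | Name of the served model | Required | Adjust as needed |
| 6 | --gpu-memory-utilization | Device memory utilization setting | Optional | Adjust as needed |
| 7 | --tensor-parallel-size | Tensor parallelism (TP) strategy | Required | No |
| 8 | --max-model-len | Maximum context length (prompt plus generated tokens) | Optional | Adjust as needed |
| 9 | --compilation_config | Graph mode configuration | Required for graph mode | Adjust as needed; the key setting is cudagraph_capture_sizes, which should cover most of the input sizes that will occur |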
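Because `--compilation_config` takes a JSON string, quoting mistakes are easy to make when editing it by hand. The sketch below shows one way to generate the argument from Python. The helper `build_compilation_config` is a hypothetical name introduced here for illustration; only the two keys `cudagraph_mode` and `cudagraph_capture_sizes` come from the command above.

```python
import json
import shlex


def build_compilation_config(capture_sizes, mode="PIECEWISE"):
    """Build the JSON string passed to --compilation_config (illustrative helper)."""
    # Deduplicate and sort in descending order, matching the layout used above.
    sizes = sorted({int(s) for s in capture_sizes}, reverse=True)
    if not sizes or sizes[-1] < 1:
        raise ValueError("capture sizes must be positive integers")
    return json.dumps({"cudagraph_mode": mode, "cudagraph_capture_sizes": sizes})


config = build_compilation_config([1, 2, 4, 16, 64, 128, 256, 512, 1024])
# shlex.quote makes the JSON safe to paste into a shell command line.
print("--compilation_config " + shlex.quote(config))
```

Generating the string this way keeps the capture sizes in one Python list, so extending them only requires editing that list.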

Use the following command to verify that the service has started correctly:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt-oss-120b-bf16", "messages":[{"role":"user","content":"who are you"}]}'

Result:
{"id":"chatcmpl-a1caa47ab6734ef3b5fa9b28ece54945","object":"chat.completion","created":1765360849,"model":"gpt-oss-120b-bf16","choices":[{"index":0,"message":{"role":"assistant","content":"I’m ChatGPT, an AI language model created by OpenAI. I can help answer questions, brainstorm ideas, explain concepts, and chat about a wide range of topics. Feel free to ask me anything!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":"The user asks \"who are you\". We should respond with a brief introduction. The system says we are ChatGPT, large language model trained by OpenAI, knowledge cutoff 2024-06. Keep it short and friendly.","reasoning_content":"The user asks \"who are you\". We should respond with a brief introduction. The system says we are ChatGPT, large language model trained by OpenAI, knowledge cutoff 2024-06. Keep it short and friendly."},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":72,"total_tokens":170,"completion_tokens":98,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Accuracy validation script: create a Python script as follows (file name: evalscop.py)
from evalscope import TaskConfig, run_task
API_URL = "http://localhost:8000/v1"
MODEL_NAME = "gpt-oss-120b-bf16"
task_cfg = TaskConfig(
model= f"{MODEL_NAME}",
api_url= f"{API_URL}",
eval_type='server',
datasets=['mmlu'],
eval_batch_size=256,
timeout=1200,
)
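run_task(task_cfg=task_cfg)

Run the script from the directory that contains it:
python evalscop.py

The accuracy results reported in the gpt-oss paper are shown below:

![Accuracy results reported in the paper](./images/精度-论文.png)

The accuracy results measured on this deployment are shown below:

![Accuracy results measured on this deployment](./images/精度-实测.png)

Overall: the average mean_acc across all MMLU subsets is 87.55%, compared with the 88% reported in the paper, which is within a reasonable margin of error.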
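The size of the gap between the measured and published scores can be made explicit with a few lines of arithmetic. The sketch below uses only the two figures quoted above; the variable names are illustrative.

```python
measured = 87.55   # average mean_acc on MMLU measured above, in percent
reference = 88.0   # figure quoted from the paper above, in percent

abs_gap = reference - measured        # gap in percentage points
rel_gap = abs_gap / reference * 100   # gap as a percentage of the reference score

print(f"absolute gap: {abs_gap:.2f} points, relative gap: {rel_gap:.2f}%")
```

The measured score is therefore 0.45 percentage points below the published figure, or about 0.51% in relative terms.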
The performance validation script is as follows:
API_URL="http://localhost:8000/v1"
MODEL_NAME="gpt-oss-120b-bf16"
evalscope perf \
--parallel 417 833 \
--number 1000 2000 \
--model "${MODEL_NAME}" \
--url "${API_URL}"/completions \
--api openai \
--dataset random \
--max-tokens 1024 \
--min-tokens 1024 \
--prefix-length 0 \
--min-prompt-length 1024 \
--max-prompt-length 1024 \
--log-every-n-query 20 \
--tokenizer-path /inspire/sj-ssd/project/pretrain-test/public/workspace/models/gpt-oss-120b-BF16 \
--extra-args '{"ignore_eos": true}'

Performance test results after the adaptation and optimization, using fused operators and graph mode:

![Performance test results](./images/性能-测试结果.png)
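The customer required GPT-OSS to be deployed on Ascend with vllm, but the mainline vllm-ascend version has gaps and has not been adapted for gpt-oss. Compared with mainstream open-source models, gpt-oss differs in the following ways:
Average inference latency decreased by 94.99%, TPS increased 67-fold, TTFT decreased by 36.58%, and TPOT decreased by 99.04%. The average mean_acc across all MMLU subsets is 87.55%, compared with the 88% reported in the paper; the accuracy gap is within an acceptable range.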
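The figures above mix percentage reductions with a multiplicative TPS gain, which makes them hard to compare directly. As a rough aid, a reduction of X% can be converted into an "N times lower" factor with 1 / (1 - X/100). The helper below is an illustrative sketch, not part of the original test tooling.

```python
def speedup_from_reduction(reduction_percent):
    """Convert 'metric dropped by X%' into an 'N times lower' factor."""
    remaining = 1.0 - reduction_percent / 100.0
    if remaining <= 0:
        raise ValueError("reduction must be below 100%")
    return 1.0 / remaining


# Reduction percentages quoted in the paragraph above.
for name, pct in [("average latency", 94.99), ("TTFT", 36.58), ("TPOT", 99.04)]:
    print(f"{name}: -{pct}% is about {speedup_from_reduction(pct):.1f}x lower")
```

Under this conversion, the 94.99% latency reduction corresponds to roughly a 20-fold improvement and the 99.04% TPOT reduction to roughly a 104-fold improvement.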
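1. Key changes for fused-operator replacement, MoE gap filling, and graph-mode gap filling in single-threaded mode: link
2. Graph-mode support under concurrency: link
3. When the 120B model is split with TP=8, tensor shapes are padded to conform to the fractal_nz format: link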