Ascend-SACT/GPT-OSS-120B-BF16

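Model Overview

GPT-OSS-120B is an open-source language model released by OpenAI with roughly 120 billion parameters. This project provides a guide to running inference with it on Ascend Atlas A2 inference servers using vllm-ascend.

For a detailed description of the model, see the open-source community page: https://modelscope.cn/models/unsloth/gpt-oss-120b-BF16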

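Runtime Environment

Hardware Environment

| NPU model | Number of cards | Model |
| --- | --- | --- |
| 910B3 | 8 | gpt-oss-120b |

Software Environment

| Software | Version |
| --- | --- |
| CANN | 8.3.RC1 |
| Python | 3.11.13 |
| transformers | 4.57.1 |
| torch | 2.7.1 |
| torch-npu | 2.7.1 |
| vllm | v0.11.2 |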

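Inference Deployment

Model Download (from Hugging Face or ModelScope)

Check disk space: first make sure the system has enough storage. The model has 120B parameters and needs roughly 200 GB or more of free space.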

df -h /home
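The same check can be scripted. The sketch below is a rough estimate only: it assumes the BF16 weights take 2 bytes per parameter and ignores tokenizer and metadata files; the helper names are illustrative and not part of any tool used in this guide.

```python
import shutil

BYTES_PER_BF16_PARAM = 2  # BF16 stores each parameter in 2 bytes


def estimated_weight_bytes(num_params: int) -> int:
    """Rough size of the BF16 weights alone (no tokenizer or metadata)."""
    return num_params * BYTES_PER_BF16_PARAM


def has_enough_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has enough free bytes."""
    return shutil.disk_usage(path).free >= required_bytes


# Example (adjust the path to your download location):
#   needed = estimated_weight_bytes(120_000_000_000)
#   print(has_enough_space("/home", needed))
```

Under this assumption 120 billion parameters come to about 240 GB, which is consistent with the "200 GB or more" guidance above.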

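Create the model download directory:

mkdir -p /home/models/GPT-OSS

Ascend does not yet support FP4, so download the dequantized (BF16) version. ModelScope is the preferred source because downloads are faster: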

pip install modelscope
modelscope download --model=unsloth/gpt-oss-120b-BF16 --local_dir=/home/models/GPT-OSS

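If the model is not available on ModelScope, it can be downloaded from Hugging Face instead. Set a mirror endpoint first (recommended for users in mainland China), then download the model: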

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download --resume-download unsloth/gpt-oss-120b-BF16 --local-dir /home/models/GPT-OSS

If the download is interrupted by a network timeout (ReadTimeoutError), rerun the command above to resume it, or download the model with the huggingface_hub Python library instead:

# Install huggingface_hub
pip install huggingface_hub
# Download the model with Python
python -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='unsloth/gpt-oss-120b-BF16', local_dir='/home/models/GPT-OSS')
"

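Rerunning the download by hand after each timeout can be automated. The following is a minimal sketch, assuming that calling `snapshot_download` again picks up where the previous attempt stopped; the `retry` helper, its defaults, and the target directory are illustrative choices rather than part of huggingface_hub.

```python
import time


def retry(action, attempts=5, delay_s=10.0):
    """Call `action` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as err:  # network timeouts surface as exceptions
            last_error = err
            print(f"attempt {attempt}/{attempts} failed: {err}")
            if attempt < attempts:
                time.sleep(delay_s)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def download_model():
    """Download the BF16 checkpoint; needs `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download  # imported lazily

    return snapshot_download(
        repo_id="unsloth/gpt-oss-120b-BF16",
        local_dir="/home/models/GPT-OSS",
    )


# Example (not executed here): retry(download_model, attempts=5, delay_s=30)
```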
Install the Image

Pull the CANN image

docker pull m.daocloud.io/quay.io/ascend/cann:8.3.rc1-910b-ubuntu22.04-py3.11

Start the development container

export IMAGE=m.daocloud.io/quay.io/ascend/cann:8.3.rc1-910b-ubuntu22.04-py3.11
docker run -itd -u 0  --ipc=host  --privileged \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
--name vllm-gpt-oss-120b-v1 \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home/models:/model \
-p 3050:8000 \
-it $IMAGE bash
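The eight `--device=/dev/davinciN` flags and the `ASCEND_RT_VISIBLE_DEVICES` list in the command above follow one pattern. For a server with a different card count, a small sketch like this could generate both; the helper names are hypothetical.

```python
def device_flags(num_npus: int) -> list[str]:
    """One --device flag per NPU, numbered from 0."""
    return [f"--device=/dev/davinci{i}" for i in range(num_npus)]


def visible_devices(num_npus: int) -> str:
    """Comma-separated device indices for ASCEND_RT_VISIBLE_DEVICES."""
    return ",".join(str(i) for i in range(num_npus))


# Example: print(" ".join(device_flags(8)))
# Example: print(visible_devices(8))
```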

Enter the development container:

docker exec -it  vllm-gpt-oss-120b-v1 bash

Update vllm

Uninstall vllm

pip uninstall vllm -y

Install vllm from GitHub

git clone https://github.com/vllm-project/vllm.git

Change into the vllm directory and install the pinned vllm commit:

cd vllm
git checkout 275de34170654274616082721348b7edd9741d32
pip install -v -e .

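Note: this vllm commit was the latest development version when the author did the adaptation. It predates the v0.11.2 release, so the v0.11.2 release cannot be used directly.

Install vllm from GitCode

Environments that cannot reach GitHub can clone from the GitCode mirror instead:

git clone https://gitcode.com/Joiin0392/vllm.git

The vllm installation also downloads other dependencies from GitHub, so the dependency URLs in the following files must be changed from https://github.com to https://githubfast.com: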

vllm/CMakeLists.txt
vllm/cmake/cpu_extension.cmake
vllm/requirements/test.in
vllm/requirements/test.txt
vllm/setup.py
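The substitution can be done by hand or with a short script. The sketch below assumes it is run from the directory that contains the `vllm` clone and that a plain string replacement is sufficient for these files; the function name is illustrative.

```python
from pathlib import Path

# Files listed above, relative to the directory that contains the clone.
FILES = [
    "vllm/CMakeLists.txt",
    "vllm/cmake/cpu_extension.cmake",
    "vllm/requirements/test.in",
    "vllm/requirements/test.txt",
    "vllm/setup.py",
]


def rewrite_github_urls(root, files=FILES,
                        old="https://github.com",
                        new="https://githubfast.com"):
    """Replace `old` with `new` in each file; return replacements per file."""
    counts = {}
    for rel in files:
        path = Path(root) / rel
        if not path.is_file():
            counts[rel] = 0  # file missing: nothing to rewrite
            continue
        text = path.read_text(encoding="utf-8")
        counts[rel] = text.count(old)
        if counts[rel]:
            path.write_text(text.replace(old, new), encoding="utf-8")
    return counts


# Example: print(rewrite_github_urls("."))
```

Running it a second time reports zero replacements, because the new URL does not contain the old one.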

Change into the vllm directory and install the pinned vllm commit:

cd vllm
git checkout 275de34170654274616082721348b7edd9741d32
pip install -v -e .

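Update vllm-ascend

vllm-ascend was customized to support GPT-OSS, so the code must be obtained from the customized repository. For details of the changes, see the appendix on vllm-ascend adaptation at the end of this document.

Uninstall vllm-ascend

Leave the vllm directory, return to the parent directory, and uninstall vllm-ascend: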

cd .. 
pip uninstall vllm-ascend -y

Install vllm-ascend

Download vllm-ascend from GitCode:

git clone https://gitcode.com/Joiin0392/vllm-ascend.git

Change into the vllm-ascend directory, check out the pinned commit, and install it:

cd vllm-ascend
git checkout 37d8fecc443786b7fd5c73be553013136a6bfeba
pip install -v -e .

Update torch

vllm depends on torch 2.8.0 by default, which is not yet compatible with this setup, so roll back to 2.7.1:

cd ..
pip uninstall torch -y
pip install torch==2.7.1

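o200k-harmony Tokenizer Setup

The tokenizer file also has to be copied into the model directory on the host, so that it is mapped into the model directory inside the container through the volume mount: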

pip install openai-harmony
wget https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
mv o200k_base.tiktoken fb374d419588a4632f3f557e76b4b70aebbca790
TIKTOKEN_RS_CACHE_DIR=$(pwd) python -c 'from openai_harmony import load_harmony_encoding; load_harmony_encoding("HarmonyGptOss")'
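The 40-character file name used in the `mv` step looks like a SHA-1 digest. The Python tiktoken library names its cached files by the SHA-1 of the source URL, and this sketch assumes the cache read through `TIKTOKEN_RS_CACHE_DIR` follows the same convention; that is an assumption, not something the steps above state. Compare the printed value with the file name above before relying on it.

```python
import hashlib

# URL taken from the wget command above.
O200K_URL = "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"


def cache_file_name(url: str) -> str:
    """Hex SHA-1 of the URL, the assumed cache key for the downloaded file."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


# Example: print(cache_file_name(O200K_URL))
```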

Install Accuracy and Performance Evaluation Tools

pip install evalscope
pip install 'evalscope[perf]' -U

Check Version Compatibility

pip show vllm vllm-ascend torch torch-npu transformers

Output:

[root:vllm]$ pip show vllm vllm-ascend torch torch-npu transformers
Name: vllm
Version: 0.11.3.dev0+g275de3417.d20251210
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email: 
License-Expression: Apache-2.0
Location: /usr/local/Ascend/vllm
Requires: aiohttp, anthropic, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, datasets, depyf, diskcache, einops, fastapi, filelock, gguf, intel-openmp, intel_extension_for_pytorch, lark, llguidance, lm-format-enforcer, mistral_common, model-hosting-container-standards, msgspec, ninja, numba, numpy, openai, openai-harmony, opencv-python-headless, outlines_core, packaging, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, regex, requests, scipy, sentencepiece, setproctitle, setuptools, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, triton, typing_extensions, watchfiles, xgrammar
Required-by: 
---
Name: vllm_ascend
Version: 0.11.0rc1.dev449+gc9b64052e.d20251209
Summary: vLLM Ascend backend plugin
Home-page: https://github.com/vllm-project/vllm-ascend
Author: vLLM-Ascend team
Author-email: 
License: Apache 2.0
Location: /usr/local/python3.11.13/lib/python3.11/site-packages
Editable project location: /usr/local/Ascend/vllm-ascend
Requires: cmake, compressed_tensors, decorator, einops, msgpack, numba, numpy, opencv-python-headless, packaging, pandas, pandas-stubs, pip, pybind11, pyyaml, quart, scipy, setuptools, setuptools-scm, torch, torch-npu, torchvision, transformers, wheel
Required-by: 
---
Name: torch
Version: 2.7.1+cpu
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /usr/local/python3.11.13/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, compressed-tensors, evalscope, flashinfer-python, torch_npu, torchaudio, torchvision, vllm, vllm_ascend, xformers, xgrammar
---
Name: torch_npu
Version: 2.7.1
Summary: NPU bridge for PyTorch
Home-page: https://gitcode.com/ascend/pytorch
Author: 
Author-email: 
License: BSD License
Location: /usr/local/python3.11.13/lib/python3.11/site-packages
Requires: torch
Required-by: vllm_ascend
---
Name: transformers
Version: 4.57.1
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /usr/local/python3.11.13/lib/python3.11/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: compressed-tensors, evalscope, vllm, vllm_ascend, xgrammar
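Reading the `pip show` output by eye is error-prone, so the comparison can also be scripted. This is a minimal sketch using the standard library's `importlib.metadata`; the expected versions come from the software environment table, vllm and vllm-ascend are left out because they are installed from development commits, and the helper names are illustrative.

```python
from importlib import metadata

# Expected versions, taken from the software environment table.
EXPECTED = {
    "transformers": "4.57.1",
    "torch": "2.7.1",
    "torch-npu": "2.7.1",
}


def base_version(version: str) -> str:
    """Drop a local build suffix such as '+cpu' from a version string."""
    return version.split("+", 1)[0]


def installed_version(name: str):
    """Installed version of `name`, or None when it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        if found is None or base_version(found) != want:
            mismatches[name] = (want, found)
    return mismatches


# Example: print(find_mismatches(EXPECTED) or "all versions match")
```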

Start the Model Service

export HCCL_INTRA_ROCE_ENABLE=1
export IGNORE_GRAPH_UPDATE_EXCEPTIONS=0
TIKTOKEN_RS_CACHE_DIR=/model/tiktoken_cache vllm serve /model/GPT-OSS \
--served-model-name gpt-oss-120b-bf16 \
--gpu-memory-utilization 0.95 \
--tensor-parallel-size 8 \
--max-model-len 16384 \
--compilation_config '{"cudagraph_mode": "PIECEWISE", "cudagraph_capture_sizes": [1024,512,256,128,64,16,4,2,1]}'

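Note: adjust the command for your environment as described below:

| No. | Parameter | Description | Required | Change needed |
| --- | --- | --- | --- | --- |
| 1 | HCCL_INTRA_ROCE_ENABLE | Environment variable for multi-card communication | Required | No |
| 2 | IGNORE_GRAPH_UPDATE_EXCEPTIONS | Guards against exceptions in graph mode | Required in graph mode | No |
| 3 | TIKTOKEN_RS_CACHE_DIR | Path to the model's tokenizer vocabulary | Required | Yes |
| 4 | vllm serve | Launch command, followed by the model path | Required | Yes |
| 5 | --served-model-name | Name of the served model | Required | Adjust as needed |
| 6 | --gpu-memory-utilization | Memory utilization setting | Optional | Adjust as needed |
| 7 | --tensor-parallel-size | Tensor parallelism (TP) degree | Required | No |
| 8 | --max-model-len | Maximum context length (prompt plus generated tokens) | Optional | Adjust as needed |
| 9 | --compilation_config | Graph mode configuration | Required in graph mode | Adjust as needed; the key setting is cudagraph_capture_sizes, which should cover most of the expected input sizes |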
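When tuning `cudagraph_capture_sizes`, it can help to generate the JSON string instead of editing it by hand and to check which captured size would cover a given batch. The sketch below assumes a batch is served by the smallest captured size that is at least as large as the batch, and that larger batches fall outside graph mode; the helper names are hypothetical.

```python
import json


def build_compilation_config(capture_sizes):
    """JSON string for --compilation_config, sizes sorted largest first."""
    sizes = sorted(set(capture_sizes), reverse=True)
    return json.dumps(
        {"cudagraph_mode": "PIECEWISE", "cudagraph_capture_sizes": sizes}
    )


def covering_size(batch_size, capture_sizes):
    """Smallest captured size that can hold the batch, or None if none can."""
    candidates = [size for size in capture_sizes if size >= batch_size]
    return min(candidates) if candidates else None


# Example: print(build_compilation_config([1, 2, 4, 16, 64, 128, 256, 512, 1024]))
```

With the sizes used in the launch command, a batch of 17 would be covered by the size-64 capture, so adding intermediate sizes may reduce padding for workloads that cluster between two captured sizes.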

Verify the Inference Service

Use the following command to check that the service has started correctly:

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt-oss-120b-bf16", "messages":[{"role":"user","content":"who are you"}]}'

Result:

{"id":"chatcmpl-a1caa47ab6734ef3b5fa9b28ece54945","object":"chat.completion","created":1765360849,"model":"gpt-oss-120b-bf16","choices":[{"index":0,"message":{"role":"assistant","content":"I’m ChatGPT, an AI language model created by OpenAI. I can help answer questions, brainstorm ideas, explain concepts, and chat about a wide range of topics. Feel free to ask me anything!","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":"The user asks \"who are you\". We should respond with a brief introduction. The system says we are ChatGPT, large language model trained by OpenAI, knowledge cutoff 2024-06. Keep it short and friendly.","reasoning_content":"The user asks \"who are you\". We should respond with a brief introduction. The system says we are ChatGPT, large language model trained by OpenAI, knowledge cutoff 2024-06. Keep it short and friendly."},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":72,"total_tokens":170,"completion_tokens":98,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
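The same request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above; the field names `content` and `reasoning_content` are read from the sample response, and the helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Request body matching the curl example above."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_answer(response: dict):
    """Return (answer, reasoning) from a chat.completion response dict."""
    message = response["choices"][0]["message"]
    return message.get("content"), message.get("reasoning_content")


def ask(base_url: str, model: str, prompt: str, timeout_s: float = 60.0) -> dict:
    """POST a chat completion request and return the parsed JSON reply."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as reply:
        return json.load(reply)


# Example (needs the running service):
#   reply = ask("http://localhost:8000/v1", "gpt-oss-120b-bf16", "who are you")
#   print(extract_answer(reply)[0])
```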

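Verify Accuracy

Start the Evaluation

Accuracy evaluation script: create the following Python script (file name: eval_mmlu.py; avoid naming it evalscope.py, because a script with that name shadows the evalscope package and the import fails):

from evalscope import TaskConfig, run_task

API_URL = "http://localhost:8000/v1"
MODEL_NAME = "gpt-oss-120b-bf16"

task_cfg = TaskConfig(
    model=MODEL_NAME,
    api_url=API_URL,
    eval_type='server',
    datasets=['mmlu'],
    eval_batch_size=256,
    timeout=1200,
)

run_task(task_cfg=task_cfg)

Run the script from its directory:

python eval_mmlu.py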

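Accuracy Results

The accuracy results reported in the gpt-oss paper are shown in the figure below: [Figure: official accuracy results]

The accuracy results measured against this service are shown in the figure below: [Figure: measured accuracy results]

Overall: the average mean_acc across all MMLU subsets is 87.55%, compared with the 88% reported in the paper, which is within a reasonable margin of error.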

Verify Performance

Start the Benchmark

The performance benchmark script:

API_URL="http://localhost:8000/v1"
MODEL_NAME="gpt-oss-120b-bf16"

evalscope perf \
  --parallel 417 833 \
  --number 1000 2000 \
  --model "${MODEL_NAME}" \
  --url "${API_URL}/completions" \
  --api openai \
  --dataset random \
  --max-tokens 1024 \
  --min-tokens 1024 \
  --prefix-length 0 \
  --min-prompt-length 1024 \
  --max-prompt-length 1024 \
  --log-every-n-query 20 \
  --tokenizer-path /inspire/sj-ssd/project/pretrain-test/public/workspace/models/gpt-oss-120b-BF16 \
  --extra-args '{"ignore_eos": true}'
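With `--min-tokens` and `--max-tokens` both set to 1024 and `ignore_eos` enabled, each request should generate exactly 1024 tokens, so output throughput can be cross-checked from the request count and the wall-clock time. The sketch below shows that arithmetic; the elapsed time in the example comment is a hypothetical input, not a measured result.

```python
def total_output_tokens(num_requests: int, tokens_per_request: int = 1024) -> int:
    """Tokens generated when every request emits a fixed number of tokens."""
    return num_requests * tokens_per_request


def output_tokens_per_second(num_requests: int, elapsed_s: float,
                             tokens_per_request: int = 1024) -> float:
    """Aggregate output throughput over the whole run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return total_output_tokens(num_requests, tokens_per_request) / elapsed_s


# Example with a hypothetical 512-second run of 1000 requests:
#   output_tokens_per_second(1000, 512.0)
```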

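Performance Results

Performance results after the adaptation and optimization, using fused operators and graph mode:

[Figure: performance results]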

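Appendix: vllm-ascend Adaptation Details

Problem Description

The customer required GPT-OSS to be deployed on Ascend with vllm, but the vllm-ascend mainline had gaps and had not been adapted for the gpt-oss model. Compared with mainstream open-source models, gpt-oss differs in the following ways:

  1. Attention alternates between GQA and SWA (sliding-window attention) from layer to layer and introduces a sink bias;
  2. The MoE blocks use a heavily modified SwiGLU;
  3. The linear layers use bias terms.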
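To make the attention differences concrete, here is a pure-Python sketch of the two ideas as I understand them from the publicly released gpt-oss reference code: a softmax whose denominator includes one extra sink logit, and a causal sliding-window mask. The exact window convention (the window counts the current token) and the function names are assumptions for illustration, not the vllm-ascend implementation.

```python
import math


def softmax_with_sink(scores, sink_logit):
    """Softmax over `scores` with one extra sink logit in the denominator.

    The sink's own weight is discarded, so the returned weights sum to
    less than 1 whenever the sink logit is finite.
    """
    peak = max(max(scores), sink_logit)  # subtract the max for stability
    exps = [math.exp(score - peak) for score in scores]
    denom = sum(exps) + math.exp(sink_logit - peak)
    return [value / denom for value in exps]


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and limited to the last `window` positions,
    counting position i itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Example: print(softmax_with_sink([0.0, 0.0], 0.0))
# Example: print(sliding_window_mask(4, 2))
```

The sink gives each head a way to put probability mass on "nothing", which is why a fused attention kernel needs explicit support for it rather than a plain softmax.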

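Solution

  1. Fused operator replacement, adapting the sink bias feature: upgrade CANN to 8.3.RC1 and, based on the torch_npu.npu_fused_infer_attention_score_v2 interface, modify the attention computation code in community vllm-ascend to add the SWA and sink bias support that gpt-oss needs;
  2. Filling the MoE gap, adapting swigluoai and bias: modify the fused_moe code in community vllm-ascend to add support for swigluoai and the bias structure;
  3. Filling the graph mode gap, completing PIECEWISE ACL graph mode: modify the graph mode code in vllm-ascend to capture once and replay many times, which relieves the host-side dispatch bottleneck, improves NPU execution efficiency, lowers the share of NPU idle time, and improves inference performance.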
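For readers unfamiliar with the swigluoai activation named in step 2, the scalar sketch below shows how it differs from a standard SwiGLU, based on my reading of the public gpt-oss reference code: both inputs are clamped, the gate uses a scaled sigmoid, and the linear branch is shifted by 1. The constants `alpha=1.702` and `limit=7.0` are assumptions taken from that reading and should be checked against the model configuration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swiglu_oai(gate: float, linear: float,
               alpha: float = 1.702, limit: float = 7.0) -> float:
    """Clamped SwiGLU variant: gate * sigmoid(alpha * gate) * (linear + 1)."""
    gate = min(gate, limit)                    # gate is clamped from above only
    linear = max(-limit, min(linear, limit))   # linear is clamped on both sides
    return gate * sigmoid(alpha * gate) * (linear + 1.0)


# Example: print(swiglu_oai(1.0, 0.5))
```

Because of the clamps and the `+ 1` shift, a fused MoE kernel written for the standard SwiGLU cannot be reused unchanged, which is the gap step 2 describes.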

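Business Impact

Average inference latency dropped by 94.99%, TPS increased 67-fold, TTFT dropped by 36.58%, and TPOT dropped by 99.04%. The average mean_acc across all MMLU subsets is 87.55%, compared with the 88% reported in the paper, so the accuracy gap is within an acceptable range.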
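Percentage reductions and speedup factors are easy to confuse, so a small conversion helps when reading these figures. The sketch below only applies the identity speedup = 100 / (100 - reduction); it makes no claim about how the original measurements were taken.

```python
def speedup_from_reduction(reduction_percent: float) -> float:
    """Convert a percentage reduction in time into a speedup factor."""
    if not 0 <= reduction_percent < 100:
        raise ValueError("reduction must be in [0, 100)")
    return 100.0 / (100.0 - reduction_percent)


# Example: print(round(speedup_from_reduction(94.99), 2))
```

By this identity, a 94.99% latency reduction corresponds to roughly a 20x speedup per request, and a 99.04% TPOT reduction to roughly 104x.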

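Changes

1. Key changes for fused operator replacement, MoE gap filling, and graph mode gap filling in single-threaded mode: link

2. Graph mode support under concurrency: link

3. Tensor shape padding to conform to the fractal_nz format when the 120B model is split with TP=8: link

4. Exception reduction in graph mode: link 1, link 2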