GPT-OSS-20B Ascend NPU Deployment Guide

Target version: vLLM-ascend 0.18.0rc1
Hardware: Atlas 800 A2 (Ascend 910B)
CANN version: 8.5.1
Model source: hf_mirrors/openai/gpt-oss-20b / openai/gpt-oss-20b

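Model Architecture Summary

Parameter	Value
Parameter count	~21B total (~3.6B active per token)
Architecture	Mixture-of-Experts (MoE)
Attention	GQA (64 query heads / 8 KV heads)
Positional encoding	RoPE with YaRN extension
Context length	131,072 tokens
Attention layers	Alternating full-context and sliding-window (128 tokens)
Default quantization	MXFP4 (official weights)
Recommended precision on Ascend	BF16 (MXFP4 is not supported on Ascend)
vLLM architecture ID	`GptOssForCausalLM`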

Environment Requirements

Component	Minimum version	Notes
CANN Toolkit	8.5.1	Ascend driver and runtime
PyTorch	2.9.0	The CPU build is sufficient
torch_npu	2.9.0	Ascend PyTorch backend
vLLM	Bundled	`/vllm-workspace/vllm`
vllm-ascend	0.18.0rc1	Must be >= 0.18.0rc1
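Release candidates sort before the final release of the same version, so a check written against 0.18.0 would reject 0.18.0rc1. The sketch below illustrates that ordering with a deliberately small parser. The helper names `parse_version` and `meets_minimum` are hypothetical, and only the `X.Y.Z` and `X.Y.ZrcN` forms are handled.

```python
import re


def parse_version(text):
    """Parse 'X.Y.Z' or 'X.Y.ZrcN' into a tuple that sorts release candidates first."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?", text)
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    major, minor, patch, rc = match.groups()
    # A final release ranks above every release candidate of the same X.Y.Z.
    rc_rank = (0, int(rc)) if rc is not None else (1, 0)
    return (int(major), int(minor), int(patch)) + rc_rank


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("0.18.0rc1", "0.18.0"))
print(meets_minimum("0.18.0rc1", "0.18.0rc1"))
```

For anything beyond these two forms, a full version parser is the safer choice.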

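Key Considerations

⚠️ Quantization format

The official GPT-OSS-20B weights ship in MXFP4 quantization by default, but Ascend NPUs do not currently support MXFP4.

Workaround: pass --dtype bfloat16 to force loading and running in BF16. vLLM will automatically dequantize the weights to BF16.

# Correct: use BF16
vllm serve /models/gpt-oss-20b --dtype bfloat16

# Wrong: do not request MXFP4
# vllm serve /models/gpt-oss-20b --quantization mxfp4  # ❌ not supported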

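Memory Estimates

Configuration	Estimated device memory
BF16, single card	~40-48 GB
BF16, TP=2	~24-28 GB per card
BF16, TP=8	~8-10 GB per card

Each 910B card in the Atlas 800 A2 has 64 GB of device memory, so BF16 inference fits on a single card.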
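The weights-only part of these estimates can be reproduced with simple arithmetic: parameter count times bytes per parameter, divided by the tensor-parallel size. The sketch below assumes roughly 21B parameters, 2 bytes per BF16 parameter, and an even split across cards. It ignores KV cache, activations, and runtime overhead, which is why the figures in the table are higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2, tp_size=1):
    """Weights-only footprint per card in GB (1 GB = 1e9 bytes), assuming an even TP split."""
    return num_params * bytes_per_param / tp_size / 1e9


NUM_PARAMS = 21e9      # ~21B parameters (approximate)
CARD_MEMORY_GB = 64    # device memory per 910B card

for tp in (1, 2, 8):
    weights = weight_memory_gb(NUM_PARAMS, tp_size=tp)
    headroom = CARD_MEMORY_GB - weights
    print(f"TP={tp}: weights ~{weights:.2f} GB/card, ~{headroom:.2f} GB left for KV cache and activations")
```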

Deployment Steps

1. Download the model weights

export HF_ENDPOINT=https://hf-mirrors.com
huggingface-cli download openai/gpt-oss-20b --local-dir /models/gpt-oss-20b

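2. Single-card deployment

# --served-model-name sets the model ID that API requests must use
vllm serve /models/gpt-oss-20b \
  --served-model-name gpt-oss-20b \
  --dtype bfloat16 \
  --max-model-len 131072 \
  --max-num-seqs 16 \
  --port 8000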

3. Multi-card Tensor Parallel deployment

vllm serve /models/gpt-oss-20b \
  --served-model-name gpt-oss-20b \
  --dtype bfloat16 \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --max-num-seqs 16 \
  --port 8000
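Tensor parallelism shards attention heads across cards, so a TP size is usually chosen to divide the head counts evenly. The sketch below applies that rule of thumb to the head counts from the architecture summary. The divisibility rule is a simplifying assumption for illustration, not a description of vLLM's exact validation logic.

```python
NUM_QUERY_HEADS = 64  # from the architecture summary
NUM_KV_HEADS = 8      # from the architecture summary


def tp_size_divides_heads(tp_size):
    """True when both query and KV heads split evenly across tp_size cards."""
    return NUM_QUERY_HEADS % tp_size == 0 and NUM_KV_HEADS % tp_size == 0


candidates = [tp for tp in (1, 2, 4, 8) if tp_size_divides_heads(tp)]
print(candidates)
```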

4. Functional verification

# Readiness check
curl -sf http://127.0.0.1:8000/v1/models

# Text inference
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "say hi"}],
    "temperature": 0,
    "max_tokens": 16
  }'
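The same check can be scripted. The sketch below builds the equivalent request with the Python standard library; the base URL and model name mirror the curl example, and the helper names `build_chat_request` and `chat` are hypothetical. Nothing is sent until `chat` is called, so the block can be imported without a running server.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=16):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60):
    """Send the request and return the assistant's reply text (needs a live server)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("http://127.0.0.1:8000", "gpt-oss-20b", "say hi")
print(request.full_url)
```

With the server running, `chat("http://127.0.0.1:8000", "gpt-oss-20b", "say hi")` returns the reply text.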

Sample Inference Output

The request and response below show the expected format for single-card BF16 inference on Atlas 800 A2 (Ascend 910B).

Request

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "gpt-oss-20b",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0,
    "max_tokens": 32
  }'

NPU output

{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "created": 1715731200,
  "model": "gpt-oss-20b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris.",
        "reasoning_content": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 8,
    "total_tokens": 22
  }
}

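Note: the output above is a format example. Replace it with real inference results after deployment. temperature=0 selects greedy decoding, which keeps outputs largely deterministic and easy to reproduce.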
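When real results are collected, a small parser helps confirm that each response has the expected shape. The sketch below extracts the reply and checks that the usage counters add up, using an abbreviated copy of the example response. The function name `summarize_completion` is hypothetical.

```python
import json


def summarize_completion(raw_json):
    """Extract the reply text and check that the token usage counters are consistent."""
    body = json.loads(raw_json)
    choice = body["choices"][0]
    usage = body["usage"]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "usage_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
        ),
    }


# Abbreviated version of the example response shown above.
example = json.dumps({
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "The capital of France is Paris."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 14, "completion_tokens": 8, "total_tokens": 22},
})
print(summarize_completion(example))
```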

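Accuracy Comparison

Comparison method

Collect outputs from the following devices under identical inputs and identical sampling parameters (temperature=0, top_p=1, max_tokens=256):

Device	Configuration	Notes
GPU (baseline)	CUDA + BF16	Official vLLM on an NVIDIA GPU
NPU (under test)	Ascend 910B + BF16	vLLM-ascend
CPU (reference)	CPU + FP32	Native transformers inference

Evaluation metrics

Metric	How it is computed	Pass threshold
Token Match Rate	Fraction of outputs whose NPU and GPU token sequences match exactly	>= 95%
Cosine Similarity	Cosine similarity of the last-layer hidden states	>= 0.99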
Perplexity Diff	NPU PPL - GPU PPL	(not specified)
ROUGE-L	ROUGE-L of the generated text against the GPU output	>= 0.98
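The metrics in the table can be sketched in plain Python as follows. This is not the implementation in verify_precision.py, whose source is not shown here. The function names are hypothetical, and ROUGE-L is taken to be the F1 score over the longest common subsequence, which is one common definition.

```python
import math


def token_match_rate(npu_outputs, gpu_outputs):
    """Fraction of prompts whose NPU and GPU token sequences are identical."""
    pairs = list(zip(npu_outputs, gpu_outputs))
    return sum(npu == gpu for npu, gpu in pairs) / len(pairs)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate_tokens, reference_tokens):
    """ROUGE-L as the F1 of LCS-based precision and recall (assumed definition)."""
    lcs = lcs_length(candidate_tokens, reference_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate_tokens)
    recall = lcs / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_match_rate([[1, 2], [3]], [[1, 2], [4]]))
print(rouge_l_f1("the capital is Paris".split(), "the capital of France is Paris".split()))
```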

Accuracy verification script

Use verify_precision.py from the repository:

python verify_precision.py \
  --npu-url http://127.0.0.1:8000/v1/chat/completions \
  --gpu-url http://gpu-host:8000/v1/chat/completions \
  --test-prompts prompts.json \
  --output precision_report.json

The script automatically computes token match rate, cosine similarity, and perplexity diff, and writes a JSON report.
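Perplexity diff depends on per-token log-probabilities. The sketch below assumes natural-log probabilities for the same token sequence are available from both backends, for example by requesting logprobs from the API. How verify_precision.py obtains them is not shown here. The function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_diff(npu_logprobs, gpu_logprobs):
    """Signed difference, NPU PPL minus GPU PPL."""
    return perplexity(npu_logprobs) - perplexity(gpu_logprobs)


# Toy values: every token has probability 0.5 on the GPU and 0.4 on the NPU.
gpu = [math.log(0.5)] * 4
npu = [math.log(0.4)] * 4
print(perplexity(gpu), perplexity(npu), perplexity_diff(npu, gpu))
```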

Measured results (to be filled in)

Metric	GPU baseline	NPU result	Difference	Status
Token Match Rate	-	-	-	⬜
Cosine Similarity	-	-	-	⬜
Perplexity Diff	-	-	-	⬜
ROUGE-L	-	-	-	⬜

Run verify_precision.py on real hardware, then fill in the table above.

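Long-Context Test (128K)

GPT-OSS-20B supports a 131,072-token context. The Needle in a Haystack method is used to verify long-context recall.

Test method

Insert a specific fact at different depths (0%, 25%, 50%, 75%, 100%) of a long text and check whether the model recalls it accurately.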

python test_long_context.py \
  --api-url http://127.0.0.1:8000/v1/chat/completions \
  --max-context-len 131072 \
  --needle "The secret code is 7842." \
  --question "What is the secret code?" \
  --output long_context_report.json

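Test parameters

Parameter	Value	Notes
Maximum context length	131,072	Theoretical model limit
Depth points	0%, 25%, 50%, 75%, 100%	Evenly spaced
Needle	"The secret code is 7842."	Fixed format for easy checking
Temperature	0	Greedy decoding for reproducible output
Expected answer	"7842"	An exact match is sufficient

Pass criteria

Recall is 100% at every depth point
No OOM or KV cache overflow
Time to first token < 5 s (even with a 128K input)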
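The needle placement can be sketched as follows. This is an illustration of the technique, not the code in test_long_context.py, whose source is not shown here. The helper names `build_haystack` and `recalled` are hypothetical, and the sketch counts a reply as a hit when it contains the expected string.

```python
def build_haystack(filler_sentences, needle, depth_percent):
    """Insert the needle at a relative depth (0 = start, 100 = end) of the filler text."""
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    position = round(len(filler_sentences) * depth_percent / 100)
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def recalled(answer, expected="7842"):
    """True when the model's answer contains the expected code."""
    return expected in answer


filler = [f"Filler sentence {i}." for i in range(8)]
needle = "The secret code is 7842."
for depth in (0, 25, 50, 75, 100):
    text = build_haystack(filler, needle, depth)
    print(depth, text.index(needle))
```

In a real run the filler would be long enough to reach the target token count, and the question would be appended after the haystack.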

Measured results (to be filled in)

Depth	Context length	Time to first token	Recall result	Status
0%	~1K	-	-	⬜
25%	~32K	-	-	⬜
50%	~64K	-	-	⬜
75%	~96K	-	-	⬜
100%	~128K	-	-	⬜

Performance Benchmark

Use benchmark.py to measure throughput and latency at different concurrency levels.

Test commands

# Single-concurrency, low-latency test
python benchmark.py \
  --api-url http://127.0.0.1:8000/v1/chat/completions \
  --num-requests 50 \
  --max-tokens 256 \
  --concurrency 1 \
  --output benchmark_single.json

# High-concurrency throughput test
python benchmark.py \
  --api-url http://127.0.0.1:8000/v1/chat/completions \
  --num-requests 200 \
  --max-tokens 512 \
  --concurrency 16 \
  --output benchmark_throughput.json
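The --concurrency setting caps how many requests are in flight at once. One way to get that behavior is a bounded thread pool, sketched below with an injected `send_request` callable so the driver can be tried without a server. This is an assumption about how such a driver could work, not the code in benchmark.py. The name `run_load` is hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send_request, num_requests, concurrency):
    """Call send_request(i) num_requests times with at most `concurrency` in flight."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map preserves input order in the returned results.
        results = list(pool.map(send_request, range(num_requests)))
    elapsed = time.perf_counter() - started
    return results, elapsed


def fake_request(index):
    """Stand-in for a real HTTP call; sleeps briefly and returns a token count."""
    time.sleep(0.01)
    return 256


results, elapsed = run_load(fake_request, num_requests=20, concurrency=4)
print(len(results), f"{elapsed:.3f}s")
```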

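Key metrics

Metric	Description	Target (Atlas 800 A2)
TTFT (Time To First Token)	Latency until the first token	< 100 ms (1K-token prompt)
TPOT (Time Per Output Token)	Per-token latency during generation	< 20 ms
Throughput	Total output tokens / total elapsed time	> 2000 tokens/s
QPS	Requests completed per second	> 5 req/s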
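Given per-request timestamps, the four metrics reduce to a few lines of arithmetic. The sketch below assumes each request records its start time, first-token time, end time, and output token count. The `RequestTiming` record and `summarize` function are hypothetical, and TPOT is computed over the decode phase only, excluding the first token.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    start: float          # request sent, in seconds
    first_token: float    # first output token received
    end: float            # last output token received
    output_tokens: int


def summarize(timings):
    """Aggregate TTFT, TPOT, throughput, and QPS over a batch of requests."""
    ttfts = [t.first_token - t.start for t in timings]
    # TPOT covers decoding only, so the first token is excluded from the count.
    tpots = [
        (t.end - t.first_token) / (t.output_tokens - 1)
        for t in timings
        if t.output_tokens > 1
    ]
    wall_time = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "ttft_ms": 1000 * sum(ttfts) / len(ttfts),
        "tpot_ms": 1000 * sum(tpots) / len(tpots) if tpots else 0.0,
        "throughput_tok_s": sum(t.output_tokens for t in timings) / wall_time,
        "qps": len(timings) / wall_time,
    }


sample = [
    RequestTiming(start=0.0, first_token=0.1, end=1.1, output_tokens=11),
    RequestTiming(start=0.0, first_token=0.1, end=2.1, output_tokens=21),
]
print(summarize(sample))
```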

Measured results (to be filled in)

Concurrency	TTFT (ms)	TPOT (ms)	Throughput (tokens/s)	QPS	Status
1	-	-	-	-	⬜
4	-	-	-	-	⬜
8	-	-	-	-	⬜
16	-	-	-	-	⬜

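Feature Status Matrix

Feature	Status	Notes
ACLGraph	✅	Supported
Tensor Parallel (TP)	✅	Verified with TP=2
Pipeline Parallel (PP)	✅	Supported
Expert Parallel (EP)	✅	Supported for MoE models
BF16 inference	✅	Recommended
MXFP4 quantization	❌	Not supported on Ascend
Tool Calling	✅	Natively supported by vLLM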
Reasoning/Thinking	✅	`--reasoning-parser openai_gptoss`
131K long context	✅	Supported in theory; needs on-hardware verification

Troubleshooting

Startup failures

RopeOperation / AtbRingMLA errors

RuntimeError: RopeOperation setup failed

Fix: add --enforce-eager to rule out graph capture as the cause.

Out of memory (OOM)

OutOfMemoryError: NPU out of memory

Fix:

# Reduce max-model-len
--max-model-len 32768

# Reduce the number of concurrent sequences
--max-num-seqs 8

# Lower the device memory utilization
--gpu-memory-utilization 0.85

# Enable expandable memory segments
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
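The flags above are fragments to be merged into the serve command. The sketch below assembles a full command line from them so that each mitigation can be toggled independently. The helper name `build_serve_command` is hypothetical. It only uses flags that already appear in this guide, and it leaves the environment variable to the shell.

```python
import shlex


def build_serve_command(model_path, max_model_len=131072, max_num_seqs=16,
                        memory_utilization=None, enforce_eager=False):
    """Assemble a `vllm serve` argument list from the mitigation settings."""
    command = [
        "vllm", "serve", model_path,
        "--dtype", "bfloat16",
        "--max-model-len", str(max_model_len),
        "--max-num-seqs", str(max_num_seqs),
    ]
    if memory_utilization is not None:
        command += ["--gpu-memory-utilization", str(memory_utilization)]
    if enforce_eager:
        command.append("--enforce-eager")
    return command


# All three OOM mitigations applied at once.
reduced = build_serve_command(
    "/models/gpt-oss-20b",
    max_model_len=32768,
    max_num_seqs=8,
    memory_utilization=0.85,
)
print(shlex.join(reduced))
```

Applying one mitigation at a time makes it easier to tell which setting resolved the failure.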

ACLGraph Capture Error (507903)

RuntimeError: EZ9999: [PID:xxx] 507903... ACL Graph capture error

Fix:

export HCCL_OP_EXPANSION_MODE="AIV"
# Or temporarily fall back to eager mode
--enforce-eager

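Fallback Ladder

When startup or inference fails, work through these steps in order:

1. Reproduce once to confirm the failure is deterministic
        ↓
2. Add --enforce-eager → separates graph capture failures from operator failures
        ↓
3. Apply a targeted code fix, then loop back to step 1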
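The first two rungs of the ladder amount to a small decision procedure, sketched below. The `launch` callable is a hypothetical stand-in that starts the server with or without --enforce-eager and reports success. The returned strings are illustrative labels, not messages from vLLM.

```python
def diagnose(launch):
    """Classify a failure; launch(enforce_eager) returns True when the run succeeds."""
    if launch(False):
        return "ok"
    # Step 1: run again to confirm the failure is deterministic.
    if launch(False):
        return "intermittent: the first failure did not reproduce"
    # Step 2: retry in eager mode to separate graph capture from operator failures.
    if launch(True):
        return "graph capture issue: eager mode works"
    return "operator or other failure: eager mode also fails"


# Simulated environment where only eager mode succeeds.
print(diagnose(lambda enforce_eager: enforce_eager))
```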

