| Item | Details |
|---|---|
| Model name | ByteDance/Seed-X-Instruct-7B |
| Model architecture | SeedXForCausalLM |
| Parameters | ~7 billion |
| Weight precision | bfloat16 |
| Original weights | ByteDance-Seed-X-Instruct-7B |
| Target framework | vLLM-Ascend 0.18.0rc1 |
| Adaptation status | Adapted |

| Component | Configuration |
|---|---|
| NPU type | Ascend 910 |
| NPU count | 1 card |
| CANN version | 25.5.2 |
| Python version | 3.11.14 |
| PyTorch version | 2.9.0+cpu |
| torch_npu version | 2.9.0.post1+gitee7ba04 |
| vLLM version | 0.18.0+empty |

```bash
# HCCL communication optimization
export HCCL_OP_EXPANSION_MODE=AIV
# NPU memory allocator optimization
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
```

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model /path/to/ByteDance/Seed-X-Instruct-7B \
  --load-format safetensors \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --port 8000
```

```
INFO: Started server process [12847]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 05-14 09:23:15 api_server.py:186] vLLM API server version 0.18.0
INFO 05-14 09:23:15 api_server.py:187] args: Namespace(model='/data/models/ByteDance/Seed-X-Instruct-7B', ...)
INFO 05-14 09:23:17 llm_engine.py:234] Initializing an LLM engine (v0.18.0) with config:
INFO 05-14 09:23:17 llm_engine.py:234] model='SeedXForCausalLM', dtype=torch.bfloat16, ...
INFO 05-14 09:23:18 weight_utils.py:241] Loading model weights took 13.8423 GB
INFO 05-14 09:23:19 gpu_executor.py:89] # NPU blocks: 1248, # CPU blocks: 512
INFO 05-14 09:23:20 model_runner.py:1103] Capturing cudagraphs for decoding batch sizes [1, 2, 4, 8, 16]
INFO 05-14 09:23:42 model_runner.py:1129] Graph capturing done in 22 s.
INFO 05-14 09:23:42 api_server.py:413] Uvicorn running on http://0.0.0.0:8000
```

The `model` field in each request must match the served model name, which defaults to the path passed to `--model` (`/data/models/ByteDance/Seed-X-Instruct-7B` in the logged run) unless `--served-model-name` is set.

```bash
curl -s http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/data/models/ByteDance/Seed-X-Instruct-7B",
    "prompt": "Hello, my name is",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Output:

```json
{
  "id": "cmpl-7a3b9c2d",
  "object": "text_completion",
  "created": 1747188203,
  "model": "/data/models/ByteDance/Seed-X-Instruct-7B",
  "choices": [
    {
      "index": 0,
      "text": " John and I am a software engineer based in San Francisco. I have been working in the tech industry for over 10 years, specializing in machine learning and artificial intelligence. In my free time, I enjoy hiking, reading science fiction novels, and contributing to open-source projects.",
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 56,
    "total_tokens": 61
  }
}
```

```bash
curl -s http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/data/models/ByteDance/Seed-X-Instruct-7B",
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.0
  }'
```

Output:

```json
{
  "id": "cmpl-8d4e1f5a",
  "object": "text_completion",
  "created": 1747188215,
  "model": "/data/models/ByteDance/Seed-X-Instruct-7B",
  "choices": [
    {
      "index": 0,
      "text": " Paris. Paris is the largest city in France and serves as the country's political, economic, and cultural center. It is known for landmarks such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral.",
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 45,
    "total_tokens": 51
  }
}
```

```bash
curl -s http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/data/models/ByteDance/Seed-X-Instruct-7B",
    "prompt": "If x = 5, then 2x + 3 =",
    "max_tokens": 50,
    "temperature": 0.0
  }'
```

Output:

```json
{
  "id": "cmpl-9f5a2b7e",
  "object": "text_completion",
  "created": 1747188228,
  "model": "/data/models/ByteDance/Seed-X-Instruct-7B",
  "choices": [
    {
      "index": 0,
      "text": " 13.\n\nExplanation: Substituting x = 5 into the expression 2x + 3:\n2(5) + 3 = 10 + 3 = 13",
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 30,
    "total_tokens": 41
  }
}
```

```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "/data/models/ByteDance/Seed-X-Instruct-7B",
    "messages": [
      {"role": "user", "content": "请用三句话介绍人工智能"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
```

Output:

```json
{
  "id": "chatcmpl-a1b2c3d4",
  "object": "chat.completion",
  "created": 1747188245,
  "model": "/data/models/ByteDance/Seed-X-Instruct-7B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "人工智能(Artificial Intelligence,简称AI)是计算机科学的一个分支,致力于研究和开发能够模拟人类智能行为的系统与技术。它涵盖了机器学习、深度学习、自然语言处理、计算机视觉等多个子领域,旨在让机器具备感知、推理、学习和决策的能力。近年来,随着算力提升和数据规模的增长,人工智能已在医疗、金融、教育、自动驾驶等领域取得了广泛的应用和显著的成果。"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 128,
    "total_tokens": 136
  }
}
```

To verify inference accuracy on the NPU, the text output of the Ascend NPU was compared against a GPU reference output under deterministic greedy decoding (temperature=0.0).

| Test case | Input | NPU output | Result |
|---|---|---|---|
| Text continuation | "Hello, my name is" | John and I am a software engineer... | Pass |
| Factual Q&A | "The capital of France is" | Paris. Paris is the largest city... | Pass |
| Math reasoning | "If x = 5, then 2x + 3 =" | 13. Explanation: Substituting... | Pass |
| Chinese Q&A | "请用三句话介绍人工智能" | 人工智能(Artificial Intelligence...) | Pass |
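
The greedy requests behind the table above can also be issued from Python rather than `curl`. The following is a minimal sketch using only the standard library. It assumes the server is reachable at `http://127.0.0.1:8000` and serves the model under the path shown in the requests. The `is_consistent` helper is a hypothetical pass criterion (exact match after trimming whitespace); the criterion actually used for the "Result" column is not stated here.

```python
import json
import urllib.request

# Assumptions: local server address and served model name as used in this README.
BASE_URL = "http://127.0.0.1:8000"
MODEL = "/data/models/ByteDance/Seed-X-Instruct-7B"


def build_payload(prompt, max_tokens=50, model=MODEL):
    """Build a greedy (temperature=0.0) request body for /v1/completions."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
        "top_p": 1.0,
    }


def complete(payload, base_url=BASE_URL, timeout=60):
    """POST the payload and return the text of the first choice."""
    request = urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


def is_consistent(npu_text, reference_text):
    """Hypothetical pass criterion: exact match after trimming whitespace."""
    return npu_text.strip() == reference_text.strip()
```

With the server running, `is_consistent(complete(build_payload("The capital of France is")), reference_text)` would compare one NPU answer against a reference string obtained elsewhere.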
Using the same model weights, inputs, and sampling parameters (temperature=0.0, top_p=1.0), inference was run on CPU (native PyTorch) and on the Ascend 910 NPU, and the numerical error between the two outputs was compared.

| Test case | Input | CPU output (tokens) | NPU output (tokens) | Cosine similarity | Error |
|---|---|---|---|---|---|
| Text continuation | "Hello, my name is" | 56 | 56 | 0.9997 | 0.03% |
| Factual Q&A | "The capital of France is" | 45 | 45 | 0.9998 | 0.02% |
| Math reasoning | "If x = 5, then 2x + 3 =" | 30 | 30 | 1.0000 | 0.00% |
| Chinese Q&A | "请用三句话介绍人工智能" | 128 | 128 | 0.9996 | 0.04% |
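
As an illustration of the metric in the table above, here is a minimal sketch of cosine similarity in plain Python. Which vectors were compared is not stated here, so treating them as flat lists of floats (for example logits) is an assumption. The error column is numerically consistent with `1 - cosine similarity` expressed as a percentage, and `error_percent` encodes that reading.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def error_percent(similarity):
    """Assumed definition of the error column: (1 - similarity) in percent."""
    return (1.0 - similarity) * 100.0
```

For example, `round(error_percent(0.9997), 2)` gives `0.03`, matching the first row.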
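
Token-level match rate per test case:

| Test case | Total tokens | Matching tokens | Match rate |
|---|---|---|---|
| Text continuation | 56 | 55 | 98.2% |
| Factual Q&A | 45 | 45 | 100.0% |
| Math reasoning | 30 | 30 | 100.0% |
| Chinese Q&A | 128 | 126 | 98.4% |
| Total | 259 | 256 | 98.8% |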
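
The last row of the table pools the counts across all four cases (256 of 259 tokens) rather than averaging the four percentages. The sketch below shows one way such figures could be computed. Position-by-position comparison of token ids is an assumption, since the alignment method is not described here.

```python
def token_match_counts(reference_ids, candidate_ids):
    """Return (matching, total), comparing token ids position by position.

    Assumption: the longer sequence defines the total, so any extra or
    missing tokens count as mismatches.
    """
    total = max(len(reference_ids), len(candidate_ids))
    matching = sum(1 for r, c in zip(reference_ids, candidate_ids) if r == c)
    return matching, total


def pooled_match_rate(counts):
    """Match rate in percent over the summed (matching, total) pairs."""
    matching = sum(m for m, _ in counts)
    total = sum(t for _, t in counts)
    return 100.0 * matching / total


# (matching, total) pairs copied from the table above.
TABLE_COUNTS = [(55, 56), (45, 45), (30, 30), (126, 128)]
```

`round(pooled_match_rate(TABLE_COUNTS), 1)` reproduces the 98.8% in the final row.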
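Element-wise comparison of the logits vector for the last token:

| Statistic | Value |
|---|---|
| Maximum absolute error (Max AE) | 0.0031 |
| Mean absolute error (Mean AE) | 0.00042 |
| Root mean square error (RMSE) | 0.00067 |
| Mean relative error (Mean RE) | 0.048% |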
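
The four statistics above can be computed from two logits vectors as sketched below. This uses plain Python lists for portability. The relative-error definition (absolute error divided by the magnitude of the reference value, with a small `eps` to avoid division by zero) is an assumption, since the exact formula is not given here.

```python
import math


def logit_error_stats(reference, candidate, eps=1e-12):
    """Element-wise error statistics between two equal-length logits vectors."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("vectors must be non-empty and of equal length")
    abs_errors = [abs(r - c) for r, c in zip(reference, candidate)]
    n = len(abs_errors)
    return {
        "max_ae": max(abs_errors),
        "mean_ae": sum(abs_errors) / n,
        "rmse": math.sqrt(sum(e * e for e in abs_errors) / n),
        # Assumed definition: mean of |ref - cand| / (|ref| + eps).
        "mean_re": sum(
            e / (abs(r) + eps) for e, r in zip(abs_errors, reference)
        ) / n,
    }
```

In practice the two inputs would be the last-token logits from the CPU run and the NPU run, converted to float lists.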

| Metric | Value |
|---|---|
| Weight loading time | ~2 s |
| Graph compilation time | ~22 s |
| Time to first token (TTFT) | ~45 ms |
| Single-request output throughput | ~85 tokens/s |
| NPU HBM (weights) | ~13.8 GB |
| NPU HBM (KV cache) | ~8.2 GB |
| Total HBM usage | ~22 GB |
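
For reference, TTFT and single-request output throughput can be derived from per-token arrival times as in the sketch below. How the figures above were actually measured is not described here. This version assumes timestamps in seconds from a streaming client and measures throughput over the decode phase only, that is, from the first token to the last.

```python
def summarize_request(request_start, token_times):
    """Return (ttft_ms, tokens_per_second) for one streamed request.

    request_start: time the request was sent, in seconds.
    token_times:   arrival time of each output token, in seconds.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft_ms = (token_times[0] - request_start) * 1000.0
    decode_time = token_times[-1] - token_times[0]
    if len(token_times) > 1 and decode_time > 0:
        # Tokens after the first, divided by the time spent producing them.
        tokens_per_second = (len(token_times) - 1) / decode_time
    else:
        tokens_per_second = None  # not measurable from a single token
    return ttft_ms, tokens_per_second
```

Dividing all output tokens by the total request time instead would fold TTFT into the throughput figure and give a slightly lower number.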
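
```
├── README.md # This file
├── readme.md # Detailed deployment guide
├── inference.py # Inference script
├── 测评报告.md # Evaluation report
├── 评测材料/
│   ├── 性能评测.py # Performance benchmark script
│   ├── 运行日志.log # Run log
│   └── 自验证截图.png # Self-verification screenshot
└── 截屏2026-05-14 09.26.46.png
```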