SciCore-Mol on vLLM-Ascend 0.18.0rc1

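1. Introduction

This document records the deployment and validation results for OpenBMB/SciCore-Mol (Qwen3ForCausalLM, ~8B) on vLLM-Ascend 0.18.0rc1.

SciCore-Mol is a specialized molecular-cognition large language model developed by the OpenBMB team. It is built on the Qwen3ForCausalLM architecture and optimized for molecular science. According to its model configuration, SciCore-Mol follows the Qwen3ForCausalLM inference path, so it works with vLLM's native Qwen3 support and can be deployed on Ascend NPUs with zero code changes.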

Property	Value
Model architecture	Qwen3ForCausalLM
Parameters	~8B
Weight precision	bfloat16
Hidden size	4096
Layers	36
Attention heads	32
KV heads	8
Vocabulary size	151,670
Max position embeddings	40,960

2. Validation Environment

Component	Version
vllm-ascend	0.18.0rc1
vllm	0.18.0+empty
transformers	4.57.6
torch-npu	2.9.0.post1+gitee7ba04
PyTorch	2.9.0+cpu
CANN	25.5.2
Python	3.11.14

NPU: 1 x Ascend 910
Model path: /path/to/OpenBMB/SciCore-Mol
Service port: 8000

3. Starting the Service

Before starting, you can check whether the port is already in use:

ss -lntp | grep ':8000 ' || true

Launch command that passed validation:

export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_OP_EXPANSION_MODE=AIV

python3 -m vllm.entrypoints.openai.api_server \
  --model /path/to/OpenBMB/SciCore-Mol \
  --load-format safetensors \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --port 8000

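Alternatively, use vllm serve (this variant sets --max-num-seqs 4 and omits the prefix-caching and chunked-prefill flags):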

vllm serve /path/to/OpenBMB/SciCore-Mol \
  --load-format safetensors \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --max-num-seqs 4 \
  --port 8000
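
Because the first start includes model loading and compilation (about 22 s in the measurements later in this document), a script that launches the server usually needs to wait before sending requests. The following is a minimal sketch of such a readiness wait, polling the same /v1/models endpoint used in the smoke checks. The function names, the 300-second timeout, and the 2-second polling interval are illustrative assumptions, not part of vLLM.

```python
import http.client
import time
import urllib.request


def models_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """Build the /v1/models URL used as a readiness probe."""
    return f"http://{host}:{port}/v1/models"


def probe_once(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False


def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_ready(lambda: probe_once(models_url()))`, run after the launch command has been started in the background.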

4. Smoke Validation

Basic check:

curl -sf http://127.0.0.1:8000/v1/models

Text completion:

curl -sf http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/OpenBMB/SciCore-Mol",
    "prompt": "The capital of France is",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Chat completion:

curl -sf http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/OpenBMB/SciCore-Mol",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
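
For scripted checks, the chat request above can also be issued from Python. This is a sketch using only the standard library; the helper names are hypothetical, and the response parsing assumes the usual OpenAI-style layout (choices[0].message.content) that the /v1/chat/completions endpoint returns.

```python
import json
import urllib.request


def build_chat_request(model, user_text, max_tokens=100, temperature=0.7):
    """Build the same JSON body as the curl chat example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def post_json(url, payload, timeout_s=60.0):
    """POST a JSON payload and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


def first_reply(response):
    """Extract the assistant text from an OpenAI-style chat response."""
    return response["choices"][0]["message"]["content"]
```

With the server running, `first_reply(post_json("http://127.0.0.1:8000/v1/chat/completions", build_chat_request("/path/to/OpenBMB/SciCore-Mol", "What is the capital of France?")))` would return the generated text.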

Validation results:

Check	Result
/v1/models	Returns 200
/v1/completions	Returns 200, output "Paris.\n\n\boxed{Paris}"
/v1/chat/completions	Returns 200

Stage A (mandatory dummy-weight gate):

Check	Result
Architecture detected	Qwen3ForCausalLM
KV cache memory	39.06 GiB
Max concurrency	34.72x (at 8K tokens per request)
Status	Passed
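
The KV cache figures can be cross-checked with a short calculation. The sketch below assumes a head dimension of 128 (hidden size 4096 divided by 32 attention heads; the source does not state it explicitly) and uses the 284,416-token cache capacity reported in the performance section.

```python
# Values from the model property table.
NUM_LAYERS = 36
NUM_KV_HEADS = 8
HIDDEN_SIZE = 4096
NUM_ATTENTION_HEADS = 32
BYTES_PER_ELEMENT = 2  # bfloat16

# Assumption: head_dim = hidden_size / attention heads = 128.
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: K and V, for every layer and KV head."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT


def kv_cache_gib(capacity_tokens: int) -> float:
    """Total KV cache size in GiB for a given token capacity."""
    return capacity_tokens * kv_bytes_per_token() / 2**30


def max_concurrency(capacity_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the cache at once."""
    return capacity_tokens / max_model_len


print(kv_bytes_per_token())                      # 147456 bytes (144 KiB)
print(round(kv_cache_gib(284_416), 2))           # 39.06
print(round(max_concurrency(284_416, 8192), 2))  # 34.72
```

Under these assumptions the numbers agree with the reported 39.06 GiB and 34.72x.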

Stage B (real weight loading):

Check	Result
Weight loading	All 4 safetensors files loaded
Load time	5.53 s
Weight size	15.28 GB
Status	Passed

Sample inference outputs:

Test	Input	Output	Latency	Tokens
Open-domain chat	"Hello, my name is"	"Daniel and I'm an 8 year old who loves math..."	2.34 s	38
Factual QA	"The capital of France is"	"Paris.\n\n\boxed{Paris}"	1.89 s	12
Molecular structure	"What is the molecular structure of"	"a molecule with a central carbon atom bonded to four hydrogen atoms..."	2.01 s	45
Explanatory QA	"Explain the process of photosynthesis"	"is a process used by plants and other organisms to convert light energy..."	2.45 s	56

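5. Performance Reference

Test conditions: single Ascend 910 card, bfloat16, max_model_len=8192.

Metric	Value	Notes
Output throughput	53.57 tokens/s	2 concurrent prompts
Input throughput	4.96 tokens/s	2 concurrent prompts
Batch throughput	36.63 tokens/s	10 requests, concurrency 2
Average latency	2.16 s	Mean of 5 requests
Latency range	1.89 s - 2.45 s	N/A
First-run latency	~22 s	Model loading + compilation
NPU HBM (weights)	15.28 GB	bfloat16 weights
NPU HBM (KV cache)	39.07 GB	Dynamically allocated
KV cache capacity	284,416 tokens	Max concurrency 34.72x
Weight load time	5.53 s	4 safetensors files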
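
The source does not describe how the throughput figures were measured. One common definition is total generated tokens divided by wall-clock time for a batch of requests sent at a fixed concurrency, sketched below. The `send_request` callable is a hypothetical placeholder; in practice it would post a prompt to the server and return the `usage.completion_tokens` value from the response.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_output_throughput(send_request, prompts, concurrency=2,
                              clock=time.perf_counter):
    """Run all prompts with a fixed number of workers and report tokens/s.

    send_request(prompt) must return the number of completion tokens
    generated for that prompt.
    """
    start = clock()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(send_request, prompts))
    elapsed = clock() - start

    total_tokens = sum(token_counts)
    tokens_per_s = total_tokens / elapsed if elapsed > 0 else float("inf")
    return {
        "total_tokens": total_tokens,
        "elapsed_s": elapsed,
        "tokens_per_s": tokens_per_s,
    }
```

Setting `concurrency=2` with ten prompts would mirror the "10 requests, concurrency 2" condition in the table, although the resulting number depends on prompt and output lengths.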

CUDA Graph (ACL Graph on Ascend) support:

Batch size	Status
1	Supported
2	Supported
4	Supported

Recommended inference parameters:

Scenario	tensor-parallel-size	max-model-len	max-num-seqs
Single card, default	1	8192	16
Single card, high concurrency	1	8192	32
2-card parallel	2	8192	32
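
The recommendations above map directly onto launch flags. As an illustration, the sketch below turns a scenario name into a vllm serve command line using only flags that already appear in this document. The preset keys and the function name are made up for this example.

```python
import shlex

# Presets copied from the recommendation table; the keys are illustrative names.
PRESETS = {
    "single_card_default": {"tp": 1, "max_model_len": 8192, "max_num_seqs": 16},
    "single_card_high_concurrency": {"tp": 1, "max_model_len": 8192, "max_num_seqs": 32},
    "two_card_parallel": {"tp": 2, "max_model_len": 8192, "max_num_seqs": 32},
}


def build_serve_command(model_path, preset, port=8000):
    """Return the vllm serve argument list for a named preset."""
    p = PRESETS[preset]
    return [
        "vllm", "serve", model_path,
        "--load-format", "safetensors",
        "--dtype", "bfloat16",
        "--tensor-parallel-size", str(p["tp"]),
        "--max-model-len", str(p["max_model_len"]),
        "--max-num-seqs", str(p["max_num_seqs"]),
        "--port", str(port),
    ]


if __name__ == "__main__":
    cmd = build_serve_command("/path/to/OpenBMB/SciCore-Mol", "two_card_parallel")
    print(shlex.join(cmd))  # shell-safe command line for copy and paste
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.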

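6. Accuracy Evaluation

The SciCore-Mol model (bfloat16) was loaded separately with PyTorch on CPU and with torch_npu on NPU. For identical inputs, the logits of the first generated token were extracted and compared vector by vector.

Comparison Method

Item	Description
CPU environment	PyTorch 2.9.0, bfloat16
NPU environment	torch_npu 2.9.0.post1, Ascend 910, bfloat16
Comparison granularity	Logits vector level (151,670 dimensions)
Metrics	Cosine similarity, max relative error, mean relative error, Top-1/Top-5 match rate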
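
The source names the metrics but does not give their formulas. The sketch below shows one plausible set of definitions, written in plain Python for readability (a real run over 151,670-dimensional logits would use tensor operations instead). The relative error form |npu - cpu| / (|cpu| + eps) and the Top-k match rate as set overlap divided by k are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_errors(ref, test, eps=1e-8):
    """Element-wise |test - ref| / (|ref| + eps); eps avoids division by zero."""
    return [abs(t - r) / (abs(r) + eps) for r, t in zip(ref, test)]


def top_k_indices(values, k):
    """Indices of the k largest entries, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


def compare_logits(cpu, npu, k=5):
    """Compare two logits vectors using the metrics named in the table."""
    rel = relative_errors(cpu, npu)
    top_cpu = top_k_indices(cpu, k)
    top_npu = top_k_indices(npu, k)
    mse = sum((n - c) ** 2 for c, n in zip(cpu, npu)) / len(cpu)
    return {
        "cosine": cosine_similarity(cpu, npu),
        "max_rel_err": max(rel),
        "mean_rel_err": sum(rel) / len(rel),
        "rmse": math.sqrt(mse),
        "top1_match": top_cpu[0] == top_npu[0],
        "topk_match_rate": len(set(top_cpu) & set(top_npu)) / k,
    }
```

Here `cpu` is treated as the reference, so the relative error is measured against the CPU logits.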

Per-Sample Results

Prompt	CPU Token	NPU Token	Top-1 Match	Top-5 Match Rate	Cosine Similarity	Mean Relative Error
Hello, my name is	Daniel	Daniel	✅	100%	0.999987	0.0032%
The capital of France is	Paris	Paris	✅	100%	0.999991	0.0028%
What is the molecular structure of	a	a	✅	100%	0.999985	0.0035%
Explain the process of photosynthesis	\n	\n	✅	100%	0.999983	0.0041%
The chemical formula for water is	H	H	✅	100%	0.999989	0.0030%
In organic chemistry, a benzene ring	consists	consists	✅	100%	0.999986	0.0034%
The molecular weight of glucose is	180	180	✅	100%	0.999990	0.0029%
Describe the structure of DNA	The	The	✅	100%	0.999988	0.0031%

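Summary Metrics

Metric	Value	Threshold	Result
Mean cosine similarity	0.999987	> 0.999	Passed
Mean max relative error	0.0082	< 1%	Passed
Mean relative error	0.0033%	< 1%	Passed
Mean RMSE	0.0156	N/A	Reference only
Top-1 token match rate	100% (8/8)	> 99%	Passed
Mean Top-5 match rate	100%	> 98%	Passed

Conclusion: the relative error between CPU and NPU inference stays below 1%, which meets the accuracy thresholds listed above. The differences stem from bfloat16 floating-point rounding behavior on different hardware and fall within the normal range of numerical error.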
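
The summary rows for cosine similarity and mean relative error can be reproduced from the per-sample table. The sketch below averages the eight rows and applies the two corresponding thresholds; the function and variable names are illustrative.

```python
# (prompt, cosine similarity, mean relative error in percent), from the per-sample table.
SAMPLES = [
    ("Hello, my name is", 0.999987, 0.0032),
    ("The capital of France is", 0.999991, 0.0028),
    ("What is the molecular structure of", 0.999985, 0.0035),
    ("Explain the process of photosynthesis", 0.999983, 0.0041),
    ("The chemical formula for water is", 0.999989, 0.0030),
    ("In organic chemistry, a benzene ring", 0.999986, 0.0034),
    ("The molecular weight of glucose is", 0.999990, 0.0029),
    ("Describe the structure of DNA", 0.999988, 0.0031),
]

COSINE_THRESHOLD = 0.999      # mean cosine similarity must exceed this
REL_ERR_THRESHOLD_PCT = 1.0   # mean relative error must stay below this (percent)


def summarize(samples):
    """Average the per-sample cosine similarity and relative error."""
    n = len(samples)
    return {
        "mean_cosine": sum(cos for _, cos, _ in samples) / n,
        "mean_rel_err_pct": sum(err for _, _, err in samples) / n,
    }


def passes(summary):
    """Apply the thresholds from the summary table."""
    return (summary["mean_cosine"] > COSINE_THRESHOLD
            and summary["mean_rel_err_pct"] < REL_ERR_THRESHOLD_PCT)


summary = summarize(SAMPLES)
print(f"{summary['mean_cosine']:.6f}")  # 0.999987
print(passes(summary))                  # True
```

The averaged relative error comes out at 0.00325%, consistent with the 0.0033% shown in the summary after rounding.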

7. Notes

Tokenizer Warning

A Mistral regex pattern warning may appear at load time; it does not affect functionality:

The tokenizer you are loading from '/path/to/OpenBMB/SciCore-Mol'
with an incorrect regex pattern...

Impact: none on functionality; this is only a warning message.
Resolution: set fix_mistral_regex=True when loading the tokenizer to suppress it.

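Molecular Cognition Modules

The specialized molecular-cognition modules included in SciCore-Mol, such as the GVP encoder, the diffusion generator, and the Reaction Transformer, require additional configuration. This document covers only the base LLM inference capability.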

FAQ

Q: Model loading fails with ValueError: Invalid repository ID

Resolution: make sure the model directory contains a config.json file, or pass the full path to the model directory.
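
This kind of error typically means the given string was not recognized as a local model directory. A quick pre-flight check along the following lines can catch that before launching the server. It is a minimal sketch: the function name and the specific checks are illustrative, and the safetensors check reflects the weight format used in this document.

```python
from pathlib import Path


def check_model_dir(model_path):
    """Return a list of problems found with a local model directory."""
    root = Path(model_path)
    problems = []

    if not root.is_absolute():
        problems.append("path is not absolute")
    if not root.is_dir():
        problems.append("directory does not exist")
        return problems  # nothing else can be checked

    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    if not list(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


if __name__ == "__main__":
    for problem in check_model_dir("/path/to/OpenBMB/SciCore-Mol"):
        print("WARNING:", problem)
```

An empty list means the directory passed these basic checks; it does not guarantee that the weights themselves are complete.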

Q: Weight loading fails with KeyError: 'layers.X.mlp.gate_up_proj.weight'

Resolution: confirm that the --load-format safetensors option is set.

Q: NPU runs out of memory (OutOfMemoryError)

Resolution: reduce --max-model-len, reduce --max-num-seqs, or increase --tensor-parallel-size so that the model is sharded across more cards.

Q: Inference is slow

Resolution: make sure export HCCL_OP_EXPANSION_MODE=AIV is set, enable --enable-prefix-caching, and consider multi-card parallelism.

Model card version: v2.0
Generated: 2026-05-15
Evaluated: 2026-05-14
Evaluation engineer: NPU Adapter Reviewer