TimeCapsuleLLM-v2mini-eval2-llama-200m NPU Adaptation and Validation

#+NPU

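1. Introduction

This repository documents the adaptation and validation results for the TimeCapsuleLLM-v2mini-eval2-llama-200m model on Ascend NPUs.

Model description: TimeCapsuleLLM-v2mini-eval2-llama-200m is a lightweight language model with about 200 million parameters, based on the Llama architecture.
Weights: ModelScope: haykgrigorian/TimeCapsuleLLM-v2mini-eval2-llama-200m
Reference documentation: vLLM-Ascend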

2. Validation Environment

Component	Version
Python	3.11.14
PyTorch	2.9.0+cpu
torch_npu	2.9.0
transformers	4.57.6
vllm-ascend	0.18.0rc1
CANN	8.5.1
NPU	Ascend 910B2
modelscope	1.36.3

3. Model Download

modelscope download --model haykgrigorian/TimeCapsuleLLM-v2mini-eval2-llama-200m

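4. Model Architecture

Parameter	Value
Architecture	LlamaForCausalLM
Hidden size	768
Number of layers	24
Attention heads	12 (key-value heads: 6)
Intermediate size	2048
Vocabulary size	32003
Total parameters	~200M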
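
The ~200M total can be cross-checked from the values in the table. The sketch below counts weights for a standard Llama layout, assuming no bias terms, a head dimension of hidden_size / num_heads, and an output projection that is not tied to the input embedding. Those three points are assumptions, not something this README states, and the function name llama_param_count is purely illustrative.

```python
def llama_param_count(hidden, layers, heads, kv_heads, intermediate, vocab,
                      tied_embeddings=False):
    """Count weights of a bias-free Llama-style decoder (assumed layout)."""
    head_dim = hidden // heads                 # assumed: hidden / heads
    kv_dim = kv_heads * head_dim               # grouped-query K/V width
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # q_proj, o_proj + k_proj, v_proj
    mlp = 3 * hidden * intermediate            # gate, up, down projections
    norms = 2 * hidden                         # two RMSNorm weights per layer
    per_layer = attn + mlp + norms
    embed = vocab * hidden                     # input embedding
    lm_head = 0 if tied_embeddings else vocab * hidden  # assumed untied by default
    return layers * per_layer + embed + lm_head + hidden  # + final RMSNorm


total = llama_param_count(768, 24, 12, 6, 2048, 32003)
fp16_gib = total * 2 / 2**30                   # two bytes per float16 weight
print(f"{total:,} parameters, {fp16_gib:.2f} GiB in float16")
```

Under these assumptions the count comes to roughly 205M weights, or about 0.38 GiB at two bytes each, which is in line with the ~200M and ~0.38GB figures quoted in this README. Passing tied_embeddings=True gives about 180M instead.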

5. Basic Inference Validation

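#!/usr/bin/env python3
# Minimal vLLM-Ascend inference check on a single NPU.
import torch_npu  # noqa: F401  (imported for its side effect: registers the NPU backend with PyTorch)
from vllm import LLM, SamplingParams

# Local path of the weights downloaded from ModelScope.
MODEL_PATH = "/home/openmind/volume/modelscope/hub/models/haykgrigorian/TimeCapsuleLLM-v2mini-eval2-llama-200m"

llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    dtype="float16",
    tensor_parallel_size=1,       # single card
    max_model_len=4096,
    gpu_memory_utilization=0.9,   # fraction of device memory vLLM may use
    enforce_eager=True,           # eager mode, no graph capture
)

# temperature=0 selects greedy decoding, so the output is deterministic.
sampling = SamplingParams(max_tokens=64, temperature=0)
outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)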

Smoke test results (4/4 passed)

Test	Input	Output
Commonsense reasoning	The capital of France is	the most important of the two great divisions of the kingdom of France...
Code generation	def fibonacci(n):	uding the great and mighty, and the great and mighty...
Translation	Translate to English: Bonjour le monde	'a, et le monde, et le monde, et le monde...
Arithmetic	2 + 3 * 4 =	t. ' ' ' ' ' ' ' ' ' '...

6. Accuracy Evaluation

ARC-Challenge 25-shot

Metric	NPU (vLLM-Ascend, tp=1)	Baseline (GPU)	Difference
Accuracy	13.65% (160/1172)	14.00%	-0.35 pp
Evaluation time	129 s	-	-
Throughput	9.05 samples/s	-	-
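
The accuracy row is simple arithmetic over the reported counts. The sketch below recomputes it and adds a binomial standard error as a rough yardstick for how much noise 1172 questions allow. Treating the questions as independent trials is an assumption made here, and since the README does not report the GPU baseline's correct-answer count, only its 14.00% figure is used.

```python
import math

correct, total_q = 160, 1172        # NPU result reported in the table
baseline_pct = 14.00                # GPU baseline reported in the table

npu_pct = 100 * correct / total_q
diff_pp = npu_pct - baseline_pct    # difference in percentage points

# Binomial standard error of the NPU accuracy (assumes independent questions).
p = correct / total_q
stderr_pp = 100 * math.sqrt(p * (1 - p) / total_q)

print(f"NPU accuracy: {npu_pct:.2f}%")
print(f"Difference vs. baseline: {diff_pp:+.2f} pp")
print(f"Standard error: {stderr_pp:.2f} pp")
```

With these inputs the standard error is about one percentage point, so the 0.35-point gap to the GPU baseline is well inside what sampling noise alone could produce.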

7. Performance Benchmark

Output throughput by input length (output_len=128, num_prompts=10):

Input length (tokens)	Output throughput (tokens/s)
32	260.71
128	271.00
512	261.06
1024	261.48
2048	263.84
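
To show how little the throughput varies with input length, the short sketch below summarises the five measurements. The numbers are copied from the table; the summary statistics are an addition here, not part of the original benchmark output.

```python
from statistics import mean

# Input length (tokens) -> output throughput (tokens/s), copied from the table.
throughput = {32: 260.71, 128: 271.00, 512: 261.06, 1024: 261.48, 2048: 263.84}

avg = mean(throughput.values())
lo, hi = min(throughput.values()), max(throughput.values())
spread_pct = 100 * (hi - lo) / avg   # max-min range relative to the mean

print(f"mean {avg:.2f} tokens/s, range {lo:.2f}-{hi:.2f}, spread {spread_pct:.1f}%")
```

The range is under 4% of the mean, so across 32 to 2048 input tokens the output throughput is essentially independent of input length in this run.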

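8. Notes

The model uses the Llama architecture and fits on a single card (~0.38GB).
enforce_eager=True ensures that eager mode is used on the NPU.
As a small model it reaches high throughput, about 260-271 output tokens/s.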
