ggg_0963/OLMo-2-0425-1B

Deploying OLMo-2-0425-1B on vLLM-Ascend + NPU

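1. Introduction

This document records the quick deployment and validation results for the allenai/OLMo-2-0425-1B model in a vLLM-Ascend environment.

OLMo-2-0425-1B is a 1-billion-parameter base language model developed by Allen AI. It is built on the OLMo2 architecture and is suited to text generation and natural language understanding tasks.

Download links:

  • Weights (ModelScope): https://modelscope.cn/models/allenai/OLMo-2-0425-1B
  • Weights (HuggingFace): https://huggingface.co/allenai/OLMo-2-0425-1B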

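2. Validation environment

| Component | Version |
| --- | --- |
| vllm-ascend | 0.18.0rc1 |
| vllm | 0.18.0 |
| transformers | 4.57.6 |
| torch-npu | 2.9.0 |
| CANN | 8.5.1 |
| SOC | ascend910_9391 |

  • NPU: 1 logical device (Ascend 910B2, 64 GB HBM)
  • Model path: /home/openmind/.cache/modelscope/hub/models/allenai/OLMo-2-0425-1B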
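
To compare another machine against the versions listed above, a small script along these lines can print what is installed. It is a sketch and not part of the original validation: the distribution names (`vllm-ascend`, `torch-npu`, and so on) are assumed to match the pip package names, and CANN is not a Python package, so it is not covered.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed pip distribution names for the Python components in the table.
PACKAGES = ["vllm-ascend", "vllm", "transformers", "torch-npu"]


def collect_versions(names):
    """Return {distribution name: installed version, or None if not installed}."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in collect_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```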

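3. Model configuration

| Parameter | Value |
| --- | --- |
| Architecture | Olmo2ForCausalLM |
| Parameters | 1B |
| Layers | 16 |
| Hidden size | 2048 |
| Attention heads | 16 |
| Vocabulary size | 100352 |
| Precision | Float16 (vLLM inference) |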
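
The values above can be cross-checked against the `config.json` shipped with the weights. The sketch below assumes the usual Hugging Face key names (`num_hidden_layers`, `hidden_size`, and so on). `summarize_config` and `load_config` are hypothetical helpers, not part of any library.

```python
import json
from pathlib import Path

# Standard Hugging Face config.json keys mapped to the labels used in the table.
FIELDS = {
    "Architecture": "architectures",
    "Layers": "num_hidden_layers",
    "Hidden size": "hidden_size",
    "Attention heads": "num_attention_heads",
    "Vocabulary size": "vocab_size",
}


def summarize_config(config):
    """Pick the table fields out of a parsed config.json dictionary."""
    return {label: config.get(key) for label, key in FIELDS.items()}


def load_config(model_dir):
    """Read config.json from a local model directory."""
    path = Path(model_dir) / "config.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

Calling `summarize_config(load_config(model_dir))` on the downloaded model directory should then list the same fields as the table. Fields missing from the file come back as `None`.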

4. Model download

# Download from ModelScope
modelscope download --model allenai/OLMo-2-0425-1B

5. Basic inference check

Load the model with vLLM-Ascend and run inference:

from vllm import LLM, SamplingParams

MODEL_PATH = "allenai/OLMo-2-0425-1B"

llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    dtype="float16",
    tensor_parallel_size=1,
    max_model_len=2048,
    gpu_memory_utilization=0.9,
)

sampling = SamplingParams(max_tokens=64, temperature=0)
outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)
# Observed output: Paris. The French language is spoken in France.
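
The environment section lists a local ModelScope cache path, while the snippet above passes the hub identifier. One way to prefer the local copy when it exists is sketched below. `resolve_model_path` is a hypothetical helper, and checking for `config.json` is only a heuristic for a finished download.

```python
from pathlib import Path


def resolve_model_path(local_dir, hub_id):
    """Return local_dir if it looks like a downloaded model, otherwise hub_id."""
    local = Path(local_dir)
    if (local / "config.json").is_file():
        return str(local)
    return hub_id
```

The result can be passed as `model=` to `LLM(...)` in place of the hard-coded `MODEL_PATH`.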

6. Smoke test

6.1 Smoke test script

Run the four smoke tests to verify that the model works:

python3 smoke_test.py

6.2 Results

=== OLMo-2-0425-1B vLLM-Ascend Smoke Test ===
[1/4] Testing basic generation...
  PASS: Paris. The French language is spoken in France. The French people are known as t
[2/4] Testing factual knowledge...
  PASS: The answer is 4. This is the final answer.
[3/4] Testing batch generation...
  [1] Jupiter, which is 318 times the mass of Earth. Jupiter is th...
  [2] The ocean is a large body of salt water.
  [3] The sky is blue because of the way light interacts with the ...
  PASS: 3 prompts processed
[4/4] Testing reasoning...
  PASS: 3
=== All 4/4 tests PASSED (47.8s) ===

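Test case descriptions:

| Test | Input | Expected | Result |
| --- | --- | --- | --- |
| Basic English generation | The capital of France is | Contains "Paris" | PASS |
| Factual knowledge | What is 2 + 2? | Contains "4" | PASS |
| Batch generation | 3 prompts | 3 non-empty outputs | PASS |
| Reasoning | Math reasoning question | Contains "3" | PASS |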
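
The source does not include `smoke_test.py`, so the following is only a sketch of how substring checks like those in the table might be organized. `contains`, `run_checks`, and `CASES` are hypothetical names. The reasoning prompt is not given in this document and is left out rather than guessed.

```python
def contains(expected):
    """Build a check that passes when the expected substring is in the output."""
    return lambda text: expected in text


def run_checks(generate, cases):
    """Run (name, prompt, check) cases through generate(prompt) -> str."""
    results = []
    for name, prompt, check in cases:
        text = generate(prompt)
        # An empty or whitespace-only completion fails regardless of the check.
        results.append((name, bool(text.strip()) and check(text)))
    return results


# Prompts and expectations taken from the table.
CASES = [
    ("basic generation", "The capital of France is", contains("Paris")),
    ("factual knowledge", "What is 2 + 2?", contains("4")),
]
```

With the `llm` and `sampling` objects from section 5, `generate` could be `lambda p: llm.generate([p], sampling)[0].outputs[0].text`.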

7. Accuracy evaluation

7.1 Evaluation configuration

  • Dataset: ARC-Challenge (25-shot)
  • Data source: ModelScope modelscope/ai2_arc
  • Evaluation framework: vLLM-Ascend offline inference
  • Device: NPU (npu:0)

python3 eval_arc_vllm.py --output logs/arc_challenge_vllm.json
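
Because this is a base model, a few-shot evaluation has to be phrased as text continuation. The sketch below shows one common way to build such a prompt from multiple-choice items. The prompt layout, the dictionary keys, and the helper names are assumptions for illustration and are not taken from `eval_arc_vllm.py`, which this document does not show.

```python
def format_item(question, labels, texts, answer=None):
    """Render one item; leave the answer blank for the item being scored."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in zip(labels, texts)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(shots, target):
    """Join k solved examples and one unsolved target into a continuation prompt."""
    blocks = [
        format_item(s["question"], s["labels"], s["texts"], s["answer"])
        for s in shots
    ]
    blocks.append(format_item(target["question"], target["labels"], target["texts"]))
    return "\n\n".join(blocks)
```

For a 25-shot run, `shots` would hold 25 solved examples. The model's continuation after the final `Answer:` is then compared with the gold label.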

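7.2 Evaluation results

| Metric | Baseline | NPU result | Difference |
| --- | --- | --- | --- |
| ARC-Challenge 25-shot | 50.00% | 49.15% (576/1172) | -0.85 pp |

ARC-Challenge accuracy on the NPU is 49.15%, which is 0.85 percentage points below the 50.00% baseline. The gap is under 1 percentage point, so the validation passes.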
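
The pass criterion is plain arithmetic on the reported counts, reproduced here as a short sketch. The 1-point tolerance comes from the statement above. The function names are illustrative.

```python
def accuracy_percent(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total


def within_tolerance(measured, baseline, tolerance=1.0):
    """True when the absolute gap, in percentage points, is below the tolerance."""
    return abs(measured - baseline) < tolerance


npu = accuracy_percent(576, 1172)  # counts from the evaluation result
baseline = 50.00                   # baseline value from the results table
print(f"accuracy={npu:.2f}%  diff={npu - baseline:+.2f} pp  "
      f"pass={within_tolerance(npu, baseline)}")
# prints: accuracy=49.15%  diff=-0.85 pp  pass=True
```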

7.3 Evaluation log

=== OLMo-2-0425-1B ARC-Challenge 25-shot vLLM-Ascend ===
Total test samples: 1172
Loading model with vLLM-Ascend...
Model loaded.
Generating (1172 samples)...
============================================================
OLMo-2-0425-1B ARC-Challenge 25-shot Results (vLLM-Ascend)
============================================================
  Accuracy: 49.15% (576/1172)
  Total time: 4s
  Throughput: 276.03 samples/s
============================================================
Results saved to logs/arc_challenge_vllm.json

8. Performance test

8.1 Test script

python3 benchmark.py

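8.2 Results

| Scenario | Output tokens | Throughput | TPOT |
| --- | --- | --- | --- |
| Single request | 128 | 66.60 tok/s | 15.1 ms/tok |
| Batch (8 concurrent) | 512 | 545.31 tok/s | n/a |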

8.3 Test log

=== OLMo-2-0425-1B vLLM-Ascend Benchmark ===
--- Single prompt (128 tokens) ---
  Run 1: 128 tokens, 2.05s, 62.50 tok/s, TPOT 16.0 ms/tok
  Run 2: 128 tokens, 1.78s, 71.88 tok/s, TPOT 13.9 ms/tok
  Run 3: 128 tokens, 1.86s, 68.88 tok/s, TPOT 14.5 ms/tok
  Run 4: 128 tokens, 1.96s, 65.33 tok/s, TPOT 15.3 ms/tok
  Run 5: 128 tokens, 1.99s, 64.39 tok/s, TPOT 15.5 ms/tok
  Avg: 66.60 tok/s, TPOT 15.1 ms/tok
--- Batch (8 prompts, 64 tokens each) ---
  Run 1: 512 tokens, 0.99s, 518.88 tok/s
  Run 2: 512 tokens, 0.92s, 555.49 tok/s
  Run 3: 512 tokens, 0.91s, 561.56 tok/s
  Avg: 545.31 tok/s
==================================================
  Single prompt: 66.60 tok/s, TPOT 15.1 ms/tok
  Batch (8):     545.31 tok/s
==================================================
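
The averages in the log can be reproduced from the per-run lines. The sketch below assumes that throughput is averaged over the per-run tok/s values and that TPOT is mean wall time divided by output tokens, which would include prefill time. Both are inferred from the logged numbers and are not a description of `benchmark.py`.

```python
from statistics import mean

# (output tokens, seconds, tokens/s) per run, copied from the benchmark log.
SINGLE = [(128, 2.05, 62.50), (128, 1.78, 71.88), (128, 1.86, 68.88),
          (128, 1.96, 65.33), (128, 1.99, 64.39)]
BATCH = [(512, 0.99, 518.88), (512, 0.92, 555.49), (512, 0.91, 561.56)]


def avg_throughput(runs):
    """Mean of the per-run tokens/s values."""
    return mean(tps for _, _, tps in runs)


def avg_tpot_ms(runs):
    """Mean wall time divided by output tokens, in milliseconds per token."""
    return 1000.0 * mean(sec for _, sec, _ in runs) / runs[0][0]


if __name__ == "__main__":
    print(f"single: {avg_throughput(SINGLE):.2f} tok/s, "
          f"TPOT {avg_tpot_ms(SINGLE):.1f} ms/tok")
    print(f"batch:  {avg_throughput(BATCH):.2f} tok/s")
```

With these figures, batching 8 prompts gives roughly 8.2 times the single-request throughput.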

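9. Notes

  1. Base model: this is a base model (not an Instruct variant) and does not follow instructions, so the evaluation generates answers by text continuation.

  2. Memory footprint: in float16 inference the model occupies about 2.2 GB of HBM, so a single card is enough.

  3. Inference speed: OLMo-2-0425-1B is a small model. On vLLM-Ascend, single-request throughput is 66.60 tok/s and batch throughput at 8 concurrent requests is 545.31 tok/s.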