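This document records the quick deployment procedure and validation results for the LLM-Research/OLMo-1B-hf model in a vLLM-Ascend environment.

OLMo-1B-hf is a 1-billion-parameter base language model developed by the Allen Institute for AI (Allen AI). It is built on the OLMo architecture and is suited to text generation and natural language understanding tasks.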

| Component | Version |
|---|---|
| vllm-ascend | 0.18.0rc1 |
| vllm | 0.18.0 |
| transformers | 4.57.6 |
| torch-npu | 2.9.0 |
| CANN | 8.5.1 |
| SOC | ascend910_9391 |

Hardware: 1 logical card (Ascend 910B2, 64 GB HBM)

Model path: `/home/openmind/.cache/modelscope/hub/models/LLM-Research/OLMo-1B-hf`

| Parameter | Value |
|---|---|
| Architecture | OLMoForCausalLM |
| Parameter count | 1B |
| Layers | 16 |
| Hidden size | 2048 |
| Attention heads | 16 |
| Vocabulary size | 50280 |
| Precision | float16 (vLLM inference) |
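
As a rough cross-check of the table, the parameter count can be estimated from the layer count, hidden size, and vocabulary size. The sketch below relies on details that are not stated in this document and should be treated as assumptions: an MLP intermediate size of 8192, a gated MLP with three weight matrices, tied input/output embeddings, and no bias or normalization weights. Under those assumptions the estimate comes out near 1.18B parameters, which is consistent with the nominal "1B" figure.

```python
# Rough parameter-count estimate from the table values.
# ASSUMPTIONS (not from this document): intermediate size 8192, gated MLP
# with three weight matrices, tied embeddings, no bias or norm parameters.

NUM_LAYERS = 16
HIDDEN_SIZE = 2048
VOCAB_SIZE = 50280
INTERMEDIATE_SIZE = 8192  # assumed, not listed in the table


def estimate_params(layers, hidden, vocab, intermediate):
    """Count weights in the attention, MLP, and embedding matrices."""
    attention = 4 * hidden * hidden      # Q, K, V and output projections
    mlp = 3 * hidden * intermediate      # gate, up and down projections
    embedding = vocab * hidden           # shared by input and output (assumed tied)
    return layers * (attention + mlp) + embedding


params = estimate_params(NUM_LAYERS, HIDDEN_SIZE, VOCAB_SIZE, INTERMEDIATE_SIZE)
fp16_gib = params * 2 / 2**30  # 2 bytes per float16 weight

print(f"estimated parameters: {params / 1e9:.2f}B")
print(f"float16 weights: {fp16_gib:.2f} GiB")
```

The float16 figure covers weights only; the KV cache and runtime buffers come on top of it.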

```shell
# Download from ModelScope
modelscope download --model LLM-Research/OLMo-1B-hf
```

Load the model with vLLM-Ascend and run inference:
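
```python
from vllm import LLM, SamplingParams

# This ID is resolved through ModelScope only when VLLM_USE_MODELSCOPE=True
# is set in the environment; otherwise pass the local model directory instead.
MODEL_PATH = "LLM-Research/OLMo-1B-hf"

llm = LLM(
    model=MODEL_PATH,
    trust_remote_code=True,
    dtype="float16",
    tensor_parallel_size=1,
    max_model_len=2048,
    gpu_memory_utilization=0.9,
)

# temperature=0 selects greedy decoding, so repeated runs give the same text
sampling = SamplingParams(max_tokens=64, temperature=0)
outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)
# Example output: the city of Paris.
```

Run the 4 smoke tests to verify that the model works: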

```shell
python3 smoke_test.py
```

```text
=== OLMo-1B-hf vLLM-Ascend Smoke Test ===
[1/4] Testing basic generation...
PASS: the city of Paris. The city is located in the north of France.
[2/4] Testing factual knowledge...
PASS: The boiling point of water is 100°C.
[3/4] Testing batch generation...
[1] The answer to 2 + 2 is 4.
[2] The largest planet in our solar system is Jupiter.
[3] The ocean is a vast body of saltwater.
PASS: 3 prompts processed
[4/4] Testing reasoning...
PASS: John has 5 apples and gives 2 to Mary.
=== All 4/4 tests PASSED (41.1s) ===
```

What each test checks:
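
| Test | Input | Expected | Result |
|---|---|---|---|
| Basic English generation | The capital of France is | Contains "Paris" | PASS |
| Factual knowledge | Water boils at what temperature? | Contains "100" | PASS |
| Batch generation | 3 prompts | 3 non-empty outputs | PASS |
| Reasoning | Math reasoning problem | Non-empty output | PASS |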
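
The pass criteria in the table above are simple string checks on the generated text. The source of `smoke_test.py` is not shown here, so the following is only a minimal sketch of what such checks might look like; the helper names `contains` and `all_non_empty` are hypothetical, and the sample strings are copied from the smoke test log rather than generated by a model.

```python
# Hypothetical pass/fail checks mirroring the table above; the real
# smoke_test.py is not shown, so these helper names are illustrative only.

def contains(text, needle):
    """Case-insensitive substring check on generated text."""
    return needle.lower() in text.lower()


def all_non_empty(texts, expected_count):
    """True when exactly expected_count outputs exist and none is blank."""
    return len(texts) == expected_count and all(t.strip() for t in texts)


# Sample outputs copied from the smoke test log.
basic = "the city of Paris. The city is located in the north of France."
factual = "The boiling point of water is 100°C."
batch = [
    "The answer to 2 + 2 is 4.",
    "The largest planet in our solar system is Jupiter.",
    "The ocean is a vast body of saltwater.",
]
reasoning = "John has 5 apples and gives 2 to Mary."

results = {
    "basic generation": contains(basic, "Paris"),
    "factual knowledge": contains(factual, "100"),
    "batch generation": all_non_empty(batch, 3),
    "reasoning": all_non_empty([reasoning], 1),
}
for name, ok in results.items():
    print(f"{name}: {'PASS' if ok else 'FAIL'}")
```

Note that the reasoning check only requires non-empty output, so it confirms that generation runs but says nothing about whether the answer is correct.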
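
Dataset: `modelscope/ai2_arc`

```shell
python3 eval_arc_vllm.py --output logs/arc_challenge_vllm.json
```

| Metric | Baseline | NPU result | Difference |
|---|---|---|---|
| ARC-Challenge 25-shot | 24.00% | 23.55% (276/1172) | -0.45 pp |

On the NPU, ARC-Challenge accuracy is 23.55%, which is 0.45 percentage points below the 24.00% baseline. The gap is under 1 percentage point, so validation passes.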
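
The pass criterion reduces to simple arithmetic on the reported counts. The sketch below recomputes the accuracy from 276 correct answers out of 1172 and compares it with the baseline; the 1-point tolerance is taken from the "< 1%" condition stated above, and the variable names are illustrative.

```python
# Recompute the reported accuracy and compare it with the baseline.
CORRECT = 276
TOTAL = 1172
BASELINE_PCT = 24.00
TOLERANCE_PP = 1.0  # pass threshold in percentage points, from the "< 1%" condition

accuracy_pct = 100 * CORRECT / TOTAL
delta_pp = accuracy_pct - BASELINE_PCT
passed = abs(delta_pp) < TOLERANCE_PP

print(f"accuracy: {accuracy_pct:.2f}%")
print(f"delta vs baseline: {delta_pp:+.2f} pp")
print(f"within tolerance: {passed}")
```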

```text
=== OLMo-1B-hf ARC-Challenge 25-shot vLLM-Ascend ===
Total test samples: 1172
Loading model with vLLM-Ascend...
Model loaded.
Generating (1172 samples)...
============================================================
OLMo-1B-hf ARC-Challenge 25-shot Results (vLLM-Ascend)
============================================================
Accuracy: 23.55% (276/1172)
Total time: 4s
Throughput: 263.09 samples/s
============================================================
Results saved to logs/arc_challenge_vllm.json
```

- Base model: this is the base model (not an Instruct variant), so it does not follow instructions. The evaluation generates answers by continuation.
- Memory usage: the model takes about 2.2 GB of HBM for float16 inference, so a single card is sufficient.
- Inference speed: OLMo-1B is a small model. In the ARC-Challenge run above (1172 samples), vLLM-Ascend reached about 263 samples/s.
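
Because the base model only continues text, a few-shot evaluation has to present solved examples and leave the final answer slot open. The source of `eval_arc_vllm.py` is not shown here, so the following is just a sketch of how such a prompt might be built and how an answer letter might be parsed from the continuation. The prompt template, the helper names, and the two toy questions are all illustrative assumptions, and the toy questions are not taken from ARC.

```python
import re

# Hypothetical sketch: the actual eval_arc_vllm.py is not shown, and the
# prompt template and helper names below are illustrative assumptions.


def format_example(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer blank for the test item."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {text}" for label, text in choices]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(shots, question, choices):
    """Concatenate solved examples, then the unsolved item for the model to continue."""
    parts = [format_example(q, c, a) for q, c, a in shots]
    parts.append(format_example(question, choices))
    return "\n\n".join(parts)


def parse_answer(continuation, labels=("A", "B", "C", "D")):
    """Return the first standalone choice label in the continuation, or None."""
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    match = re.search(pattern, continuation)
    return match.group(1) if match else None


# Toy items made up for illustration; a 25-shot run would pass 25 solved examples.
shots = [
    (
        "Which object gives off its own light?",
        [("A", "Moon"), ("B", "Sun"), ("C", "Mirror"), ("D", "Ice")],
        "B",
    ),
]
prompt = build_prompt(
    shots,
    "Which gas do plants take in for photosynthesis?",
    [("A", "Oxygen"), ("B", "Carbon dioxide"), ("C", "Helium"), ("D", "Neon")],
)
print(prompt)
print(parse_answer(" B"))
```

The prompt ends with an open `Answer:` slot, so a short greedy continuation is enough to read off the predicted label.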