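This document records the quick deployment and validation results for the stepfun-ai/Step3-VL-10B model on vLLM-Ascend.

Step3-VL-10B is a 10-billion-parameter multimodal large model developed by StepFun. It is built on the StepVL architecture: the text component uses Qwen3ForCausalLM (36 layers) and the vision component uses a 47-layer vision encoder. The model supports mixed text-and-image understanding and generation.

Related download locations: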
| Component | Version |
|---|---|
| vllm-ascend | 0.18.0rc1 |
| vllm | 0.18.0 |
| transformers | 4.57.6 |
| torch-npu | 2.9.0 |
| CANN | 8.5.1 |
| SOC | ascend910_9391 |
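Hardware: 1 logical card (Ascend 910B2, 64 GB HBM)

Local model path: `/home/openmind/.cache/modelscope/hub/models/stepfun-ai/Step3-VL-10B`

| Parameter | Value |
|---|---|
| Architecture | StepVLForConditionalGeneration |
| Parameter count | 10B |
| Text layers | 36 |
| Hidden size | 4096 |
| Attention heads | 32 |
| KV heads | 8 |
| Vocabulary size | 151936 |
| Vision encoder layers | 47 |
| Vision encoder width | 1536 |
| Precision | Float16 (vLLM inference) |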
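As a rough sizing aid, the values in the table allow a back-of-envelope estimate of KV-cache memory per token. The sketch below assumes the head dimension equals the hidden size divided by the number of attention heads (4096 / 32 = 128), which the table does not state, and it counts only the text decoder at float16. The function name is illustrative.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache that one token occupies across all decoder layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


HIDDEN_SIZE = 4096
NUM_HEADS = 32
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # assumption: not listed in the table

per_token = kv_cache_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=HEAD_DIM)
per_sequence = per_token * 4096  # 4096 = max_model_len used in this document

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence / 2**20:.0f} MiB per 4096-token sequence")
```

Under these assumptions a full 4096-token sequence needs well under 1 GB of KV cache, which is small relative to the 64 GB card.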
Download the model from ModelScope:

    modelscope download --model stepfun-ai/Step3-VL-10B

Load the model with vLLM-Ascend and run inference (the prompt must be formatted with the chat template):

    import torch_npu  # Ascend adapter for PyTorch; not referenced directly below
    from transformers import AutoTokenizer
    from vllm import LLM, SamplingParams

    MODEL_PATH = "stepfun-ai/Step3-VL-10B"

    # Format the prompt with the model's chat template before generation.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    msgs = [{"role": "user", "content": "The capital of France is"}]
    prompt = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)

    llm = LLM(
        model=MODEL_PATH,
        trust_remote_code=True,
        dtype="float16",
        tensor_parallel_size=1,
        max_model_len=4096,
        gpu_memory_utilization=0.9,
    )

    # Greedy decoding; 512 tokens leaves room for the model's thinking output.
    sampling = SamplingParams(max_tokens=512, temperature=0)
    outputs = llm.generate([prompt], sampling)
    print(outputs[0].outputs[0].text)
Sample output (abridged): `...thinking... Paris.`

Run the four smoke tests to verify model functionality:

    python3 smoke_test.py

Output:

    === Step3-VL-10B vLLM-Ascend Smoke Test ===
    [1/4] Testing basic generation...
    PASS: Okay, the user is asking for the capital of France and specifically requests a o
    [2/4] Testing factual knowledge...
    PASS: Okay, the user is asking "What is 2 + 2? Give just the number." This seems strai
    [3/4] Testing batch generation...
    [1] Okay, the user is asking for the largest planet in our solar...
    [2] Okay, the user asked for a short sentence about the ocean. L...
    [3] Okay, the user is asking about the color of the sky on a cle...
    PASS: 3 prompts processed
    [4/4] Testing reasoning...
    PASS: The user asks: "If John has 5 apples and gives 2 to Mary, how many does he have
    === All 4/4 tests PASSED (141.8s) ===

Test items:
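| Test | Input | Expected | Result |
|---|---|---|---|
| Basic English generation | What is the capital of France? | Contains "Paris" | PASS |
| Factual knowledge | What is 2 + 2? | Contains "4" | PASS |
| Batch generation | 3 prompts | 3 non-empty outputs | PASS |
| Reasoning | Math word problem | Contains "3" | PASS |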
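The checks in the table reduce to substring assertions on generated text. `smoke_test.py` itself is not reproduced in this document, so the following is only a sketch of that logic. `Check`, `run_checks`, and the stand-in `fake_generate` are hypothetical names; a real run would pass a function that wraps `llm.generate` instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    prompt: str
    expected: str  # substring that must appear in the output


def run_checks(generate: Callable[[str], str], checks: List[Check]) -> List[bool]:
    """Run each prompt through `generate` and test for the expected substring."""
    results = []
    for i, check in enumerate(checks, start=1):
        text = generate(check.prompt)
        # A check passes only if the output is non-empty and contains the target.
        passed = bool(text.strip()) and check.expected in text
        print(f"[{i}/{len(checks)}] {check.name}: {'PASS' if passed else 'FAIL'}")
        results.append(passed)
    return results


CHECKS = [
    Check("basic generation", "What is the capital of France?", "Paris"),
    Check("factual knowledge", "What is 2 + 2?", "4"),
]


def fake_generate(prompt: str) -> str:
    # Stand-in for the model so the sketch runs without an NPU.
    return "Paris" if "France" in prompt else "The answer is 4."


results = run_checks(fake_generate, CHECKS)
```

Because the model emits its reasoning before the answer, a substring check of this kind is more tolerant than an exact-match comparison.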
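Dataset: `modelscope/ai2_arc`

    python3 eval_arc_vllm.py --output logs/arc_challenge_vllm.json

| Metric | Baseline | NPU result | Difference |
|---|---|---|---|
| ARC-Challenge 25-shot | 49.80% | 50.00% (586/1172) | +0.20 pp |

On the NPU, ARC-Challenge accuracy is 50.00%, which is 0.20 percentage points above the 49.80% baseline (within the 1-point threshold), so validation passes.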
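The pass criterion can be re-checked with a few lines of arithmetic. This sketch uses only the numbers reported in this document; reading the "< 1%" threshold as percentage points is an assumption, and the variable names are illustrative.

```python
CORRECT = 586
TOTAL = 1172
BASELINE_PCT = 49.80
TOLERANCE_PP = 1.0  # assumption: "< 1%" read as percentage points

accuracy_pct = 100.0 * CORRECT / TOTAL
diff_pp = accuracy_pct - BASELINE_PCT
passed = abs(diff_pp) < TOLERANCE_PP

print(f"accuracy = {accuracy_pct:.2f}%, diff = {diff_pp:+.2f} pp, passed = {passed}")
```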
Evaluation log:

    === Step3-VL-10B ARC-Challenge 25-shot vLLM-Ascend ===
    Total test samples: 1172
    Loading model with vLLM-Ascend...
    Model loaded.
    Generating (1172 samples)...
    ============================================================
    Step3-VL-10B ARC-Challenge 25-shot Results (vLLM-Ascend)
    ============================================================
    Accuracy: 50.00% (586/1172)
    Total time: 206s
    Throughput: 5.70 samples/s
    ============================================================
    Results saved to logs/arc_challenge_vllm.json

Performance benchmark:

    python3 benchmark.py

| Scenario | Output tokens | Throughput | TPOT |
|---|---|---|---|
| Single request | 128 | 30.80 tok/s | 32.5 ms/tok |
| Batch (8 concurrent) | 512 | 238.31 tok/s | N/A |
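The reported figures appear to be related by simple arithmetic: TPOT matches the reciprocal of single-request throughput, and the batch figure matches the mean of the per-run throughputs. The sketch below redoes that arithmetic with the three batch runs from the benchmark log in this document. It is a consistency check under that reading, not part of `benchmark.py`.

```python
from statistics import mean

single_tok_per_s = 30.80
batch_runs_tok_per_s = [239.55, 246.84, 228.55]  # Run 1..3 from the benchmark log

tpot_ms = 1000.0 / single_tok_per_s       # time per output token, in milliseconds
batch_avg = mean(batch_runs_tok_per_s)    # reported as "Avg" in the log
speedup = batch_avg / single_tok_per_s    # batch throughput relative to one request

print(f"TPOT: {tpot_ms:.1f} ms/tok")
print(f"Batch avg: {batch_avg:.2f} tok/s ({speedup:.1f}x single request)")
```

The ratio of roughly 7.7x for 8 concurrent prompts suggests that batching scales close to linearly at this batch size.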
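Benchmark log:

    === Step3-VL-10B vLLM-Ascend Benchmark ===
    --- Batch (8 prompts, 64 tokens each) ---
    Run 1: 512 tokens, 2.14s, 239.55 tok/s
    Run 2: 512 tokens, 2.07s, 246.84 tok/s
    Run 3: 512 tokens, 2.24s, 228.55 tok/s
    Avg: 238.31 tok/s
    ==================================================
    Single prompt: 30.80 tok/s, TPOT 32.5 ms/tok
    Batch (8): 238.31 tok/s
    ==================================================

Chat template required: the model's text component is based on the Qwen3 architecture, so inference must use the chat template (`<|im_start|>user\n...\n<|im_end|>\n<|im_start|>assistant\n`). A raw prompt produces meaningless output.

Thinking behavior: the model has built-in "thinking" behavior and emits its reasoning before the answer, so set `max_tokens` to at least 512 to get a complete response.

Multimodal model: Step3-VL-10B accepts mixed text and image input, but this validation covers text generation only.

Memory usage: with float16 inference the model occupies about 19.0 GB of HBM, so a single card is sufficient.

Inference speed: 30.80 tok/s for a single request and 238.31 tok/s for a batch of 8 concurrent requests.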
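The memory figure is consistent with a simple estimate of 10 billion float16 parameters at 2 bytes each. The sketch below works this out and, assuming `gpu_memory_utilization=0.9` applies to the full 64 GB of HBM, estimates what remains for KV cache and activations. The exact parameter count and the way vLLM-Ascend accounts for memory are not given in this document, so the numbers are approximate.

```python
PARAMS = 10e9               # "10B" from the model table; exact count not stated
BYTES_PER_PARAM = 2         # float16
HBM_GB = 64                 # Ascend 910B2
UTILIZATION = 0.9           # gpu_memory_utilization from the inference example
REPORTED_WEIGHTS_GB = 19.0  # measured figure reported in this document

est_weights_gb = PARAMS * BYTES_PER_PARAM / 1e9     # decimal gigabytes
est_weights_gib = PARAMS * BYTES_PER_PARAM / 2**30  # binary gibibytes
budget_gb = HBM_GB * UTILIZATION
remaining_gb = budget_gb - REPORTED_WEIGHTS_GB

print(f"estimated weights: {est_weights_gb:.1f} GB ({est_weights_gib:.1f} GiB)")
print(f"memory budget: {budget_gb:.1f} GB, left after weights: {remaining_gb:.1f} GB")
```

The reported 19.0 GB falls between the decimal and binary estimates, which is plausible for a model whose true parameter count is close to, but not exactly, 10 billion.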