Base image: quay.io/ascend/vllm-ascend:v0.18.0rc1-a3

Run path: /vllm-workspace/vllm

Model directory: /models/Hy3-preview

| Item | Description |
|---|---|
| Base image | quay.io/ascend/vllm-ascend:v0.18.0rc1-a3 |
| vLLM path | /vllm-workspace/vllm |
| Model directory | /models/Hy3-preview |
| Patch file | hy3-delivery.patch |
| vllm-ascend changes | None |

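
Before patching, it can help to confirm that the two directories listed in the table exist inside the container. The following is a minimal sketch; `check_dir` is a hypothetical helper introduced only for illustration and is not part of the patch or of vLLM.

```shell
#!/usr/bin/env bash
# check_dir is a hypothetical helper: prints "ok: <path>" when the directory
# exists, otherwise prints "missing: <path>" and returns a non-zero status.
check_dir() {
  if [ -d "$1" ]; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Paths taken from the table above; "|| true" keeps the script going
# so that both paths are always reported.
check_dir /vllm-workspace/vllm || true
check_dir /models/Hy3-preview || true
```

A "missing" line usually means the container was started without the expected volume mount or from a different image.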
Run the following from the root of the current repository:
export PATCH_DIR=/path/to/this/patch
export VLLM_DIR=/vllm-workspace/vllm
export MODEL_DIR=/models/Hy3-preview
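git -C "$VLLM_DIR" apply "$PATCH_DIR/hy3-delivery.patch"

Notes:

- No changes to vllm-ascend code are needed; the patch applies to vLLM only.
- … transformers.
- Load the Ascend toolkit environment with `source /usr/local/Ascend/ascend-toolkit/set_env.sh`.
- The model has 295B parameters; a static estimate from BF16 weights comes to about 590 GB of device memory in total, including the weights of 1 MTP layer.

The recommended launch combines TP16 + EP + MTP + the tool/reasoning parsers: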
VLLM_ASCEND_ENABLE_FLASHCOMM1=1 \
HCCL_OP_EXPANSION_MODE=AIV \
vllm serve $MODEL_DIR \
--served-model-name hy3-preview \
--tensor-parallel-size 16 \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--enable-expert-parallel \
--enable-ep-weight-filter \
--tool-call-parser hy_v3 \
--reasoning-parser hy_v3 \
--enable-auto-tool-choice \
--max-model-len 32768 \
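--max-num-seqs 8

Notes:

- `--enable-ep-weight-filter` is optional. During EP loading it skips expert weights that do not belong to the current rank, which reduces disk and memory pressure for very large MoE checkpoints. Keeping it is recommended.
- 32k + bs8 (`--max-model-len 32768`, `--max-num-seqs 8`) better matches everyday context-length expectations. The combination is derived from existing KV capacity logs: the flashcomm1 path provides 425,408 tokens, while 32,768 * 8 = 262,144 tokens, which leaves ample headroom.

curl -sf http://127.0.0.1:8000/v1/models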
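
The sizing figures quoted in the notes can be rechecked with plain shell arithmetic. This is only a sketch of the calculation: the variable names are illustrative, the KV capacity is the logged value quoted in the notes, and 2 bytes per parameter is the standard size of a BF16 weight.

```shell
#!/usr/bin/env bash
# KV cache headroom for the 32k + bs8 setting.
kv_capacity=425408        # tokens, flashcomm1 path (value from the KV capacity log)
max_model_len=32768       # --max-model-len
max_num_seqs=8            # --max-num-seqs

worst_case=$((max_model_len * max_num_seqs))   # every sequence at full length
headroom=$((kv_capacity - worst_case))
echo "worst case: ${worst_case} tokens, headroom: ${headroom} tokens"

# Static weight estimate: parameters (in billions) * bytes per BF16 parameter.
params_b=295
bytes_per_param=2
weights_gb=$((params_b * bytes_per_param))
echo "BF16 weights: about ${weights_gb} GB"
```

If `--max-model-len` or `--max-num-seqs` is raised, rerunning the first calculation shows whether the worst case still fits under the logged capacity.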
curl -sf http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"hy3-preview","messages":[{"role":"user","content":"Say hi in one word."}],"max_tokens":16,"temperature":0,"top_p":1,"chat_template_kwargs":{"reasoning_effort":"no_think"}}'

curl -sf http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"hy3-preview",
"messages":[{"role":"user","content":"北京明天天气怎么样?如果需要请调用工具。"}],
"tool_choice":"auto",
"tools":[
{
"type":"function",
"function":{
"name":"get_weather",
"description":"Get weather by city",
"parameters":{
"type":"object",
"properties":{"city":{"type":"string"}},
"required":["city"]
}
}
}
],
"temperature":0.9,
"top_p":1,
"max_tokens":256,
"chat_template_kwargs":{"reasoning_effort":"no_think"}
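}'

- … pass the parameter via `extra_body={"chat_template_kwargs":{"reasoning_effort":"high"}}`.
- `no_think` is treated as direct-answer mode; pass `high` explicitly when a chain of thought is needed.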
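
With curl, `chat_template_kwargs` sits at the top level of the JSON body, as in the requests above, so switching modes only means changing the `reasoning_effort` value. The sketch below assumes the same endpoint and model name as the earlier requests; `build_payload` and the `SEND` variable are hypothetical helpers added for illustration, and `max_tokens` of 512 is an arbitrary choice to leave room for reasoning output.

```shell
#!/usr/bin/env bash
# build_payload is a hypothetical helper: prints a chat request body with the
# given reasoning_effort ("no_think" and "high" are the values used in this document).
build_payload() {
  local effort="$1"
  printf '{"model":"hy3-preview","messages":[{"role":"user","content":"Say hi in one word."}],"max_tokens":512,"chat_template_kwargs":{"reasoning_effort":"%s"}}' "$effort"
}

# With SEND=1 the request goes to the local server; otherwise the payload is
# only printed, so it can be inspected without a running server.
if [ "${SEND:-0}" = "1" ]; then
  build_payload high | curl -sf http://127.0.0.1:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d @-
else
  build_payload high
fi
```

Running it as `SEND=1 bash script.sh` sends the `high` request; replacing `high` with `no_think` reproduces the direct-answer request shown earlier.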