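This directory provides adaptation patches for MiMo-V2.5:
- 0001-vllm-MiMo-V2.5-VL-and-runtime.patch: targets /vllm-workspace/vllm
- 0002-vllm-ascend-MiMo-V2.5-runtime-support.patch: targets /vllm-workspace/vllm-ascend

The patches support a 1M-token context in every deployment configuration. Compared with the previous release, this version fixes an output-quality issue in VL mode for W8A8 quantized deployments of V2.5.
After the patches are applied, the following configurations are supported: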
| Model | Configuration | Architecture | checkpoint_tp_size |
|---|---|---|---|
| MiMo-V2.5 | Text, BF16/FP8/W8A8 | MiMoV2ForCausalLM | 4 |
| MiMo-V2.5 | Single-image VL, BF16/FP8/W8A8 | MiMoV2ForConditionalGeneration | 4 |
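For all configurations, enabling MTP speculative decoding, expert parallelism (EP), and FULL_DECODE_ONLY (FDO) graph mode is recommended.
Notes:
Base image:
Apply the patches: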
PATCH_DIR=/path/to/this/deliverable
VLLM_REPO=/vllm-workspace/vllm
VLLM_ASCEND_REPO=/vllm-workspace/vllm-ascend
cd "$VLLM_REPO"
git apply --check "$PATCH_DIR/0001-vllm-MiMo-V2.5-VL-and-runtime.patch"
git apply "$PATCH_DIR/0001-vllm-MiMo-V2.5-VL-and-runtime.patch"
cd "$VLLM_ASCEND_REPO"
git apply --check "$PATCH_DIR/0002-vllm-ascend-MiMo-V2.5-runtime-support.patch"
git apply "$PATCH_DIR/0002-vllm-ascend-MiMo-V2.5-runtime-support.patch"

Patch baseline commit:
b1388b1da421afa (v0.19.1rc1)

Notes:
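| Format | Description |
|---|---|
| Native FP8 | Can be deployed directly, but online dequantization makes startup slow; converting to BF16 offline before deployment is recommended |
| BF16 | Produced by offline conversion or supplied by the user; contains no quantization_config |
| W4A8/W8A8 | Produced by msmodelslim quantization; includes quant_model_description.json |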
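
The markers in the table above can be used to guess which format a weight directory holds before choosing serve flags. The following is a minimal sketch, not part of the patches: `detect_weight_format` is a hypothetical helper, and it assumes that a native FP8 checkpoint declares `quantization_config` in its `config.json`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a weight directory by the markers from the table.
# Assumption: native FP8 checkpoints carry "quantization_config" in config.json.
detect_weight_format() {
    local dir="$1"
    if [ -f "$dir/quant_model_description.json" ]; then
        # msmodelslim output (W4A8/W8A8); W8A8 is served with --quantization ascend
        echo "msmodelslim"
    elif grep -q '"quantization_config"' "$dir/config.json" 2>/dev/null; then
        # Native FP8: deployable as-is, offline BF16 conversion recommended
        echo "fp8"
    elif [ -f "$dir/config.json" ]; then
        # config.json present without quantization_config: treat as BF16
        echo "bf16"
    else
        echo "unknown"
        return 1
    fi
}
```

Calling `detect_weight_format /path/to/weights` prints one of `msmodelslim`, `fp8`, `bf16`, or `unknown`; the last case also returns a non-zero status so it can be used in a guard.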
| Format | URL |
|---|---|
| Official FP8 | https://www.modelscope.cn/models/XiaomiMiMo/MiMo-V2.5 |
| BF16 weights | https://www.modelscope.cn/models/solinliu/MiMo-V2.5-BF16 |
| W8A8 quantized | https://www.modelscope.cn/models/solinliu/MiMo-V2.5-W8A8 |
python3 dequant/dequant_fp8_to_bf16_streaming.py \
--input-dir /path/to/FP8-checkpoint \
--output-dir /path/to/BF16-output \
--tp-size 4 \
--block-size 128 128 \
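--max-shard-size 5

Parameters:
| Parameter | Description |
|---|---|
| --input-dir | Source FP8 model directory |
| --output-dir | Target BF16 model directory |
| --tp-size | Tensor-parallel size used by the original quantization layout (8 for V2.5-Pro, 4 for V2.5) |
| --block-size | Block size of the FP8 block quantization |
| --max-shard-size | Approximate size of each output safetensors shard, in GB |

The script supports resuming: if it is interrupted, rerun the same command and it continues where it stopped. See dequant/README.md for details.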
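
Because rerunning the same command resumes the conversion, a long conversion can be wrapped in a small retry loop. The sketch below is only an illustration: `retry_until_success` is a hypothetical helper, not something shipped in `dequant/`, and the attempt limit is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rerun a command until it succeeds or the attempt
# budget is exhausted. Each rerun relies on the script's own resume support.
retry_until_success() {
    local max_attempts="$1"; shift
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "command failed; rerunning (attempt $attempt of $max_attempts)" >&2
    done
}

# Usage: pass the conversion command with the same arguments as above, e.g.
# retry_until_success 3 python3 dequant/dequant_fp8_to_bf16_streaming.py <same arguments>
```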
vllm serve <weight-dir> \
--served-model-name mimo-v2.5 \
--trust-remote-code \
--tensor-parallel-size 16 \
--max-model-len 1048576 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.9 \
--dtype bfloat16 \
--block-size 128 \
--reasoning-parser mimo_v2 \
--enable-auto-tool-choice \
--tool-call-parser mimo_v2 \
--hf-overrides '{"checkpoint_tp_size":4}' \
--speculative-config '{"method":"mimo_v2_mtp","num_speculative_tokens":1}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--enable-expert-parallel \
--port 8000
# For W8A8, additionally pass: --quantization ascend
# Adjust --max-num-seqs to match your workload

vllm serve <weight-dir> \
--host 0.0.0.0 \
--port 8002 \
--served-model-name mimo-v2.5 \
--trust-remote-code \
--tensor-parallel-size 8 \
--max-model-len 1048576 \
--max-num-seqs 16 \
--gpu-memory-utilization 0.95 \
--dtype bfloat16 \
--block-size 128 \
--quantization ascend \
--reasoning-parser mimo_v2 \
--enable-auto-tool-choice \
--tool-call-parser mimo_v2 \
--hf-overrides '{"architectures":["MiMoV2ForConditionalGeneration"],"checkpoint_tp_size":4}' \
--limit-mm-per-prompt '{"image":1,"video":0}' \
--mm-processor-kwargs '{"max_pixels":12845056}' \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

vllm serve <weight-dir> \
--served-model-name mimo-v2.5 \
--trust-remote-code \
--tensor-parallel-size 8 \
--data-parallel-size 2 \
--max-model-len 1048576 \
--max-num-seqs 4 \
--gpu-memory-utilization 0.9 \
--dtype bfloat16 \
--block-size 128 \
--reasoning-parser mimo_v2 \
--enable-auto-tool-choice \
--tool-call-parser mimo_v2 \
--hf-overrides '{"checkpoint_tp_size":4}' \
--speculative-config '{"method":"mimo_v2_mtp","num_speculative_tokens":1}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}' \
--enable-expert-parallel \
--port 8000
# For W8A8, additionally pass: --quantization ascend
# W8A8 can be deployed on 8 cards and still supports a 1M context
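
Loading the weights can take a while, so it helps to wait until the server answers before sending requests. The following is a rough sketch under stated assumptions: `wait_for_server` is a hypothetical helper, the 1800-second default timeout and 5-second poll interval are arbitrary, and it relies on the `GET /health` endpoint of vLLM's OpenAI-compatible server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: poll GET /health until the server responds with success,
# or give up once the timeout (seconds) has elapsed.
# Arguments: base URL, timeout in seconds, poll interval in seconds.
wait_for_server() {
    local base_url="${1:-http://127.0.0.1:8000}"
    local timeout="${2:-1800}"
    local interval="${3:-5}"
    local waited=0
    # -f makes curl return non-zero on HTTP errors; -o /dev/null discards the body
    until curl -sf -o /dev/null "$base_url/health"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "server at $base_url not ready after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Usage: wait_for_server http://127.0.0.1:8000 && echo "ready"
```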
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mimo-v2.5",
"messages": [{"role": "user", "content": "1+1=?"}],
"max_tokens": 256, "temperature": 0
}'

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mimo-v2.5",
"messages": [{"role": "user", "content": "What is 123+456?"}],
"max_tokens": 500, "temperature": 0,
"chat_template_kwargs": {"enable_thinking": false}
}'

curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mimo-v2.5",
"messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
"tools": [{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],
"max_tokens": 200, "temperature": 0
}'

Image requests use the OpenAI-compatible image_url format; an image can be converted to data:image/png;base64,... and sent inline:
{
"model": "mimo-v2.5",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "Recognize only the Chinese text in the image and output it line by line."},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64>"}}
]
}],
"temperature": 0,
"max_tokens": 512,
"chat_template_kwargs": {"enable_thinking": false}
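}

- checkpoint_tp_size=4: must be set explicitly; otherwise QKV de-interleaving falls back to the default value of 8 and the output is garbled.
- num_speculative_tokens: the recommended value is 2, which gave the best measured throughput (88.3 tok/s), 77% higher than without MTP.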
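
The image request body shown earlier can be generated from a local PNG file. This is a minimal sketch rather than a supported tool: `build_image_request` is a hypothetical helper, its model name defaults to the `--served-model-name` used in the serve commands, and the prompt is inserted without JSON escaping, so it must not contain double quotes or backslashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a chat-completions request body for one PNG image.
# Arguments: image path, prompt text, optional model name.
# The prompt is inserted verbatim (no JSON escaping is performed).
build_image_request() {
    local image_path="$1" prompt="$2" model="${3:-mimo-v2.5}"
    local b64
    if [ ! -r "$image_path" ]; then
        echo "cannot read image: $image_path" >&2
        return 1
    fi
    # Base64 text is JSON-safe once the line breaks are removed
    b64=$(base64 < "$image_path" | tr -d '\n')
    printf '{"model":"%s","messages":[{"role":"user","content":[' "$model"
    printf '{"type":"text","text":"%s"},' "$prompt"
    printf '{"type":"image_url","image_url":{"url":"data:image/png;base64,%s"}}' "$b64"
    printf ']}],"temperature":0,"max_tokens":512,'
    printf '"chat_template_kwargs":{"enable_thinking":false}}\n'
}

# Usage (the port must match the one passed to vllm serve):
# build_image_request /path/to/image.png "Describe the image." \
#   | curl -s http://127.0.0.1:8002/v1/chat/completions \
#       -H "Content-Type: application/json" -d @-
```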
├── README.md
├── 0001-vllm-MiMo-V2.5-VL-and-runtime.patch
├── 0002-vllm-ascend-MiMo-V2.5-runtime-support.patch
└── dequant/
    ├── README.md
    └── dequant_fp8_to_bf16_streaming.py