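# Tencent HunYuan HY-MT1.5 Multilingual Translation Model on Huawei Ascend NPU (vLLM-Ascend)

This repository provides a complete adaptation of Tencent HunYuan HY-MT1.5-1.8B, a multilingual neural machine translation model, for Huawei Ascend NPUs (Atlas 800 A2, Ascend 910B) using the vLLM-Ascend inference engine.
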
| Model | Parameters | Type | Architecture | Status |
|---|---|---|---|---|
| HY-MT1.5-1.8B | 1.79B | Dense | HunYuanDenseV1ForCausalLM | ✅ |
| HY-MT1.5-7B | 7B | MoE | HunYuanMoEV1ForCausalLM | ⚠️ Download incomplete |
| HY-MT1.5-1.8B-FP8 | 1.79B | Dense FP8 | W8A16 not yet supported | ❌ |
| HY-MT1.5-1.8B-GPTQ-Int4 | 1.79B | Quantized | Not yet supported | ❌ |
| HY-MT1.5-1.8B-GGUF | 1.79B | GGUF | Not yet supported | ❌ |

Benchmark results on Ascend 910B NPU (2 devices):

| Metric | Value |
|---|---|
| Peak throughput (batch=4) | 146.4 tok/s |
| Single-batch throughput | 71.1 tok/s |
| Time to first token (TTFT) | 83.1 ms |
| Time per output token (TPOT) | 11.4 ms |
| Tokens per second | 87.7 tok/s |
| Batch 1 latency (p50) | 0.25 s |
| Batch 4 latency (p50) | 1.97 s |

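The latency and rate figures in the table are related by simple arithmetic. The sketch below is an illustration only: it assumes the "Tokens per second" row is the reciprocal of TPOT (the numbers are consistent with that reading) and that request latency is roughly TTFT plus one TPOT per additional token. The function names are hypothetical and are not part of `inference.py`.

```python
# Figures copied from the benchmark table above.
TTFT_MS = 83.1  # time to first token, in milliseconds
TPOT_MS = 11.4  # time per output token, in milliseconds


def decode_rate_tok_per_s(tpot_ms: float) -> float:
    """Steady-state decode rate implied by a per-token latency."""
    return 1000.0 / tpot_ms


def estimated_latency_s(num_output_tokens: int,
                        ttft_ms: float = TTFT_MS,
                        tpot_ms: float = TPOT_MS) -> float:
    """Assumed linear model: the first token costs TTFT, each later one TPOT."""
    if num_output_tokens < 1:
        return 0.0
    return (ttft_ms + (num_output_tokens - 1) * tpot_ms) / 1000.0


print(round(decode_rate_tok_per_s(TPOT_MS), 1))  # 87.7
print(round(estimated_latency_s(16), 3))         # 0.254
```

Under this assumed model, a 16-token translation takes about a quarter of a second, which is in the same range as the measured batch 1 latency.
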
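A comparison of CPU and NPU outputs shows an error of **< 1%** across all test cases:

The output generated on the Ascend 910B NPU is nearly identical to the CPU reference output (cosine similarity > 0.9998, with a 23.4x speedup).
For visual verification, see `evaluation/screenshots/`.

Interactive translation on the Ascend 910B NPU: 5 language pairs in under 1 second total.
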
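For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two metrics mentioned above. The source does not say which tensors were compared, so the assumption here is that each side yields a flat vector of floats (for example, logits for one decoding step). The function names and sample values are made up for illustration and do not come from `eval_accuracy.py`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def max_relative_error(reference: Sequence[float],
                       candidate: Sequence[float],
                       eps: float = 1e-12) -> float:
    """Largest element-wise |candidate - reference| / |reference|."""
    return max(abs(c - r) / max(abs(r), eps)
               for r, c in zip(reference, candidate))


# Illustrative values only, not measured data.
cpu_ref = [0.12, -1.50, 3.20, 0.70]
npu_out = [0.12, -1.49, 3.21, 0.70]
print(cosine_similarity(cpu_ref, npu_out) > 0.9998)
print(max_relative_error(cpu_ref, npu_out) < 0.01)
```
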
## Installation

```bash
# Install dependencies
pip install vllm vllm-ascend transformers torch-npu

# Clone model from HuggingFace / ModelScope
pip install modelscope
python3 -c "
from modelscope.hub.snapshot_download import snapshot_download
snapshot_download('Tencent-Hunyuan/HY-MT1.5-1.8B',
                  cache_dir='/path/to/models/HY-MT1.5-1.8B')
"
```

## Usage

```bash
# Interactive mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode interactive

# Batch mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode batch \
    --input test_inputs.jsonl --output results.json

# Benchmark mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode benchmark \
    --output benchmark.json

# Evaluation with accuracy comparison
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode eval \
    --output eval_results.json
```

Example interactive session:

```text
>>> Hello world
[en->zh] (0.25s): 你好,世界
>>> zh:en:你好世界
[zh->en] (0.28s): Hello world
>>> en:ja:Good morning
[en->ja] (0.31s): おはようございます
```

## Project Structure

```text
hy-mt-ascend/
├── inference.py                      # Main inference script
├── readme.md                         # This file
├── evaluation/
│   ├── eval_accuracy.py              # CPU vs NPU accuracy comparison
│   ├── eval_performance.py           # Throughput & latency benchmark
│   ├── run_eval.sh                   # Full evaluation runner
│   ├── logs/
│   │   ├── accuracy.json             # Accuracy evaluation results
│   │   ├── performance.json          # Performance benchmark results
│   │   └── *.log                     # Run logs
│   └── screenshots/
│       ├── accuracy_comparison.png   # Accuracy comparison screenshot
│       ├── inference_demo.png        # Interactive inference demo
│       └── performance_benchmark.png # Performance results screenshot
```

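The interactive session shown earlier accepts either bare text (translated en->zh) or a `src:tgt:text` prefix. How `inference.py` actually parses this is not shown, so the following is only a sketch of one way such input could be handled. `parse_request`, `Request`, and the `KNOWN_LANGS` set are hypothetical names, and the language codes are limited to the three that appear in the session.

```python
from typing import NamedTuple

# Hypothetical code set: only en, zh, and ja appear in the example session.
KNOWN_LANGS = {"en", "zh", "ja"}
# Assumed defaults, matching the un-prefixed "Hello world" example.
DEFAULT_SRC, DEFAULT_TGT = "en", "zh"


class Request(NamedTuple):
    src: str
    tgt: str
    text: str


def parse_request(line: str) -> Request:
    """Split 'src:tgt:text'; fall back to the defaults for bare text."""
    parts = line.strip().split(":", 2)
    if len(parts) == 3 and parts[0] in KNOWN_LANGS and parts[1] in KNOWN_LANGS:
        return Request(parts[0], parts[1], parts[2].strip())
    # Not a recognized prefix, so treat the whole line as text to translate.
    return Request(DEFAULT_SRC, DEFAULT_TGT, line.strip())


print(parse_request("zh:en:你好世界"))
print(parse_request("Hello world"))
```

Checking the prefix against a known set of codes, rather than splitting blindly, keeps ordinary sentences that happen to contain colons from being misread as language tags.
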
## Model Details
HY-MT1.5 is a multilingual translation model supporting:
- **Languages**: Chinese, English, Japanese, Korean, French, German, Spanish, Portuguese, Russian, Arabic
- **Context Window**: 8192 tokens
- **Architecture**: Transformer Decoder (HunYuanDenseV1 variant)
- **Training**: Sequence-level knowledge distillation + contrastive learning
## Known Limitations
1. **CPU Inference**: Full model requires ~7GB RAM in BF16; CPU-only inference may fail on low-memory systems
2. **FP8 Models**: vLLM-Ascend does not natively support `compressed-tensors` format FP8 quantization; use the conversion script in `npu_ref/`
3. **GPTQ/GGUF**: Not supported by vLLM-Ascend
4. **Output Quality**: With temperature=0.0 (greedy), the model may occasionally repeat tokens; use temperature=0.7 with top_k=20 for best results
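
The sampling recommendation in item 4 can be expressed as keyword arguments for vLLM's `SamplingParams`. This is a sketch under assumptions: the `max_tokens` cap of 256 and the helper names are illustrative, and the repository's own `inference.py` may configure sampling differently. The vLLM import is deferred so the settings can be inspected on a machine without vLLM installed.

```python
# Values taken from limitation 4 above.
RECOMMENDED_SAMPLING = {"temperature": 0.7, "top_k": 20}


def sampling_kwargs(greedy: bool = False, max_tokens: int = 256) -> dict:
    """Build keyword arguments for vLLM's SamplingParams.

    max_tokens=256 is an assumed output cap, not a value from this README.
    """
    kwargs = {"max_tokens": max_tokens}
    if greedy:
        kwargs["temperature"] = 0.0  # may repeat tokens, per limitation 4
    else:
        kwargs.update(RECOMMENDED_SAMPLING)
    return kwargs


def build_sampling_params(greedy: bool = False, max_tokens: int = 256):
    """Create a SamplingParams object; requires vLLM to be installed."""
    from vllm import SamplingParams  # deferred import
    return SamplingParams(**sampling_kwargs(greedy, max_tokens))


print(sampling_kwargs())
```
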
## License
This project is licensed under the Apache License 2.0.
The underlying model (Tencent HunYuan HY-MT1.5) is subject to Tencent's original license terms.
---
*Adapted for Huawei Ascend NPU by @m0_74196153*