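# HY-MT1.5 Ascend NPU

Tencent HunYuan HY-MT1.5 multilingual translation model adapted for Huawei Ascend NPU (based on vLLM-Ascend).

## Overview

This repository provides a complete adaptation of Tencent HunYuan HY-MT1.5-1.8B, a multilingual neural machine translation model, for Huawei Ascend NPU (Atlas 800 A2, Ascend 910B) using the vLLM-Ascend inference engine.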

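| Model | Parameters | Type | Architecture | Status |
| --- | --- | --- | --- | --- |
| HY-MT1.5-1.8B | 1.79B | Dense | HunYuanDenseV1ForCausalLM | ✅ |
| HY-MT1.5-7B | 7B | MoE | HunYuanMoEV1ForCausalLM | ⚠️ Download pending |
| HY-MT1.5-1.8B-FP8 | 1.79B | Dense, FP8 | W8A16 not yet supported | ❌ |
| HY-MT1.5-1.8B-GPTQ-Int4 | 1.79B | Quantized | Not yet supported | ❌ |
| HY-MT1.5-1.8B-GGUF | 1.79B | GGUF | Not yet supported | ❌ |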

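## Performance

Benchmark results on Ascend 910B NPU (2 devices):

| Metric | Value |
| --- | --- |
| Peak throughput (batch=4) | 146.4 tok/s |
| Single-batch throughput | 71.1 tok/s |
| Time to first token (TTFT) | 83.1 ms |
| Time per output token (TPOT) | 11.4 ms |
| Tokens processed per second | 87.7 tok/s |
| Batch-1 latency (p50) | 0.25 s |
| Batch-4 latency (p50) | 1.97 s |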
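
The 87.7 tok/s figure appears to be the reciprocal of the TPOT value, and the two latency metrics can be combined into a rough per-request estimate. The sketch below works through that arithmetic using only the numbers from the table; the latency formula (TTFT plus one TPOT per additional token) is a common approximation and an assumption here, not something the benchmark script is known to use.

```python
# Figures copied from the performance table above.
TTFT_MS = 83.1   # time to first token
TPOT_MS = 11.4   # time per output token

# Steady-state decode rate implied by TPOT.
decode_rate = 1000.0 / TPOT_MS
print(f"decode rate: {decode_rate:.1f} tok/s")


def estimated_latency_s(n_output_tokens: int) -> float:
    """Approximate request latency: first token, then one TPOT per extra token."""
    if n_output_tokens < 1:
        raise ValueError("n_output_tokens must be at least 1")
    return (TTFT_MS + (n_output_tokens - 1) * TPOT_MS) / 1000.0


print(f"16-token reply: {estimated_latency_s(16):.3f} s")
```

Under this approximation a reply of about 16 tokens takes roughly 0.25 s, which is in line with the batch-1 p50 latency, although the output lengths used in the benchmark are not stated.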

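## Accuracy

A comparison of CPU and NPU outputs shows an error of **< 1%** on every test case:

Figure: CPU vs NPU accuracy comparison

Output generated on the Ascend 910B NPU is nearly identical to the CPU reference output (cosine similarity > 0.9998), and the NPU run is 23.4x faster.

- Logit-level cosine similarity: 0.9998+
- Output text match rate: 99.7%
- Translation quality matches the CPU reference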
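
For readers who want to reproduce this kind of comparison, the two metrics can be sketched as below. This is a minimal illustration, not the code in `evaluation/eval_accuracy.py`: the function names are hypothetical, and it assumes the match rate is computed per output string after trimming whitespace, which the source does not specify.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length logit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def exact_match_rate(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """Fraction of outputs that are identical after trimming whitespace."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(r.strip() == c.strip() for r, c in zip(reference, candidate))
    return matches / len(reference)
```

A typical use would be to pass the CPU logits and NPU logits for the same decoding step to `cosine_similarity`, and the two lists of decoded translations to `exact_match_rate`.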

For visual verification, see `evaluation/screenshots/`.

## Demo

Figure: HY-MT1.5-1.8B inference on Ascend 910B

Interactive translation on the Ascend 910B NPU: 5 language pairs complete in under 1 second in total.

## Quick Start

### Requirements

- Python 3.10+
- Huawei Ascend NPU (Atlas 800 A2 / Ascend 910B)
- vLLM-Ascend (`pip install vllm-ascend`)
- PyTorch with NPU support (`torch-npu`)

### Installation

# Install dependencies
pip install vllm vllm-ascend transformers torch-npu

# Download the model from ModelScope
pip install modelscope
python3 -c "
from modelscope.hub.snapshot_download import snapshot_download
# snapshot_download returns the local model directory; pass that path to --model
path = snapshot_download('Tencent-Hunyuan/HY-MT1.5-1.8B',
                         cache_dir='/path/to/models/HY-MT1.5-1.8B')
print(path)
"

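Before running inference it can help to confirm that the packages installed above are importable. The following is a small sketch using only the standard library; the import names (`torch_npu`, `vllm_ascend`) are assumed to correspond to the `torch-npu` and `vllm-ascend` pip packages, and the check says nothing about whether an NPU device is actually visible.

```python
import importlib.util

# Import names assumed to match the pip packages installed above.
REQUIRED_MODULES = ("torch", "torch_npu", "transformers", "vllm", "vllm_ascend")


def check_modules(names=REQUIRED_MODULES) -> dict:
    """Map each top-level module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name:12s} {'ok' if found else 'MISSING'}")
```
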
### Inference

# Interactive mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode interactive

# Batch mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode batch \
    --input test_inputs.jsonl --output results.json

# Benchmark mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode benchmark \
    --output benchmark.json

# Evaluation with accuracy comparison
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode eval \
    --output eval_results.json

### Interactive usage

>>> Hello world
  [en->zh] (0.25s): 你好，世界

>>> zh:en:你好世界
  [zh->en] (0.28s): Hello world

>>> en:ja:Good morning
  [en->ja] (0.31s): おはようございます
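
The session above suggests two input forms: plain text, and `src:tgt:text` with explicit language codes. One way such input could be parsed is sketched below. This is an illustration rather than the logic in `inference.py`: the function name, the set of language codes, and the `en->zh` fallback for unprefixed text are assumptions inferred from the examples.

```python
# Hypothetical two-letter codes for the languages listed under Model Details.
SUPPORTED_LANGS = {"zh", "en", "ja", "ko", "fr", "de", "es", "pt", "ru", "ar"}
DEFAULT_PAIR = ("en", "zh")  # assumed fallback, matching the first example


def parse_request(line: str, default_pair=DEFAULT_PAIR):
    """Split 'src:tgt:text' into (src, tgt, text); otherwise use the default pair."""
    parts = line.split(":", 2)  # at most two splits, so colons in the text survive
    if len(parts) == 3:
        src, tgt, text = parts[0].strip().lower(), parts[1].strip().lower(), parts[2].strip()
        if src in SUPPORTED_LANGS and tgt in SUPPORTED_LANGS and text:
            return src, tgt, text
    # No valid language prefix: treat the whole line as text to translate.
    return default_pair[0], default_pair[1], line.strip()
```

Validating the prefix against a known set of codes keeps ordinary sentences that contain colons, such as "Note: meet at 10:30", from being misread as a language pair.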

## Project Structure

hy-mt-ascend/
├── inference.py                 # Main inference script
├── readme.md                    # This file
├── evaluation/
│   ├── eval_accuracy.py         # CPU vs NPU accuracy comparison
│   ├── eval_performance.py      # Throughput & latency benchmark
│   ├── run_eval.sh              # Full evaluation runner
│   ├── logs/
│   │   ├── accuracy.json        # Accuracy evaluation results
│   │   ├── performance.json     # Performance benchmark results
│   │   └── *.log                # Run logs
│   └── screenshots/
│       ├── accuracy_comparison.png     # Accuracy comparison screenshot
│       ├── inference_demo.png          # Interactive inference demo
│       └── performance_benchmark.png   # Performance results screenshot

## Model Details

HY-MT1.5 is a multilingual translation model with the following characteristics:

- **Languages**: Chinese, English, Japanese, Korean, French, German, Spanish, Portuguese, Russian, Arabic
- **Context Window**: 8192 tokens
- **Architecture**: Transformer Decoder (HunYuanDenseV1 variant)
- **Training**: Sequence-level knowledge distillation + contrastive learning

## Known Limitations

1. **CPU Inference**: Full model requires ~7GB RAM in BF16; CPU-only inference may fail on low-memory systems
2. **FP8 Models**: vLLM-Ascend does not natively support `compressed-tensors` format FP8 quantization; use the conversion script in `npu_ref/`
3. **GPTQ/GGUF**: Not supported by vLLM-Ascend
4. **Output Quality**: With `temperature=0.0` (greedy decoding), the model may occasionally repeat tokens; sampling with `temperature=0.7` and `top_k=20` is recommended to reduce repetition
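
The sampling settings from the last item can be passed to vLLM's offline API roughly as follows. This is a sketch under assumptions: the helper names and the `max_tokens` cap are illustrative, the prompts are expected to be already formatted for the model, and `inference.py` may configure generation differently.

```python
def recommended_sampling_kwargs(max_tokens: int = 256) -> dict:
    """Settings from limitation 4; max_tokens is an assumed cap, not from the source."""
    return {"temperature": 0.7, "top_k": 20, "max_tokens": max_tokens}


def translate(model_path: str, prompts: list[str]) -> list[str]:
    """Generate one completion per prompt with the recommended sampling settings."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(**recommended_sampling_kwargs())
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

Because sampling is stochastic, repeated runs with these settings can produce different translations for the same input; greedy decoding remains the better choice when outputs must be reproducible, for example in the CPU vs NPU comparison.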

## License

This project is licensed under the Apache License 2.0.

The underlying model (Tencent HunYuan HY-MT1.5) is subject to Tencent's original license terms.

---

*Adapted for Huawei Ascend NPU by @m0_74196153*