m0_74196153/hy-mt-1.5-ascend

# HY-MT1.5 Ascend NPU

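Tencent HunYuan HY-MT1.5 multilingual translation model adapted for Huawei Ascend NPUs (based on vLLM-Ascend).

## Overview

This repository provides a complete adaptation of Tencent HunYuan HY-MT1.5-1.8B, a multilingual neural machine translation model, for Huawei Ascend NPUs (Atlas 800 A2, Ascend 910B) using the vLLM-Ascend inference engine.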

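| Model | Parameters | Type | Architecture | Status |
|---|---|---|---|---|
| HY-MT1.5-1.8B | 1.79B | Dense | HunYuanDenseV1ForCausalLM | ✅ |
| HY-MT1.5-7B | 7B | MoE | HunYuanMoEV1ForCausalLM | ⚠️ Download incomplete |
| HY-MT1.5-1.8B-FP8 | 1.79B | Dense FP8 | W8A16, not yet supported | ❌ |
| HY-MT1.5-1.8B-GPTQ-Int4 | 1.79B | Quantized | Not yet supported | ❌ |
| HY-MT1.5-1.8B-GGUF | 1.79B | GGUF | Not yet supported | ❌ |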
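
Whether a local checkpoint falls into the supported row can be checked before launching vLLM-Ascend by reading its `config.json`. The following is a minimal sketch, assuming the checkpoint follows the usual Hugging Face layout (an `architectures` list and, for quantized weights, a `quantization_config.quant_method` entry). `check_checkpoint` and `SUPPORTED_ARCHITECTURES` are hypothetical names and are not part of this repository.

```python
import json
from pathlib import Path

# Only the dense 1.8B architecture is marked as working in the table above.
SUPPORTED_ARCHITECTURES = {"HunYuanDenseV1ForCausalLM"}


def check_checkpoint(model_dir):
    """Return (supported, reason) for a local checkpoint directory."""
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        # GGUF exports, for example, usually ship without a config.json.
        return False, "config.json not found"

    config = json.loads(config_path.read_text(encoding="utf-8"))

    # Assumption: quantized checkpoints record their method under this key.
    quant_method = (config.get("quantization_config") or {}).get("quant_method")
    if quant_method is not None:
        return False, f"quantized checkpoint ({quant_method}) is not supported"

    architectures = config.get("architectures") or []
    if not any(name in SUPPORTED_ARCHITECTURES for name in architectures):
        return False, f"unsupported architecture: {architectures}"

    return True, "ok"


if __name__ == "__main__":
    print(check_checkpoint("/path/to/HY-MT1.5-1.8B"))
```

Checking up front avoids waiting for the engine to load weights only to fail on an unsupported format.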

## Performance

Benchmark results on Ascend 910B NPUs (2 devices):

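| Metric | Value |
|---|---|
| Peak throughput (batch=4) | 146.4 tok/s |
| Single-batch throughput | 71.1 tok/s |
| Time to first token (TTFT) | 83.1 ms |
| Time per output token (TPOT) | 11.4 ms |
| Tokens per second | 87.7 tok/s |
| Latency, batch=1 (p50) | 0.25 s |
| Latency, batch=4 (p50) | 1.97 s |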
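
Several of these figures are related by simple arithmetic: the tokens-per-second value matches the reciprocal of TPOT (1000 / 11.4 ≈ 87.7), and a request's latency can be approximated as TTFT plus one TPOT for each token after the first. The sketch below works through that arithmetic with the numbers from the table. The helper names are hypothetical, and the latency formula is a common approximation, not necessarily how `eval_performance.py` measures it.

```python
# Figures copied from the benchmark table above.
TTFT_MS = 83.1        # time to first token
TPOT_MS = 11.4        # time per output token
PEAK_TOK_S = 146.4    # throughput at batch=4
SINGLE_TOK_S = 71.1   # single-batch throughput


def decode_rate(tpot_ms):
    """Steady-state decode speed (tokens per second) implied by TPOT."""
    return 1000.0 / tpot_ms


def estimated_latency_ms(n_tokens, ttft_ms=TTFT_MS, tpot_ms=TPOT_MS):
    """Approximate latency: the first token costs TTFT, each later one costs TPOT."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    return ttft_ms + (n_tokens - 1) * tpot_ms


def batch_scaling(peak_tok_s=PEAK_TOK_S, single_tok_s=SINGLE_TOK_S):
    """Throughput gain from batching, as a ratio of the two table entries."""
    return peak_tok_s / single_tok_s


if __name__ == "__main__":
    print(f"decode rate: {decode_rate(TPOT_MS):.1f} tok/s")
    print(f"latency for 16 tokens: {estimated_latency_ms(16):.1f} ms")
    print(f"batch=4 scaling: {batch_scaling():.2f}x")
```

Under this approximation a reply of about 16 tokens takes roughly 254 ms, which is in line with the 0.25 s batch=1 latency reported above.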

## Accuracy

Comparing CPU and NPU outputs shows an error of **< 1%** across all test cases:

CPU vs NPU accuracy comparison

Output generated on the Ascend 910B NPU is nearly identical to the CPU reference output (cosine similarity > 0.9998, 23.4x speedup).

- Logit-level cosine similarity: 0.9998+
- Output text match rate: 99.7%
- Translation quality is preserved
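
The two numeric metrics above can be computed along the following lines. This is an illustrative sketch in plain Python, not the contents of `evaluation/eval_accuracy.py`: it assumes cosine similarity is taken over a pair of logit vectors and that the match rate counts outputs that are identical after trimming whitespace, which is only one plausible definition. Both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def exact_match_rate(cpu_texts, npu_texts):
    """Fraction of outputs that are identical after trimming whitespace."""
    if len(cpu_texts) != len(npu_texts) or not cpu_texts:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(c.strip() == n.strip() for c, n in zip(cpu_texts, npu_texts))
    return matches / len(cpu_texts)


if __name__ == "__main__":
    # Toy data for illustration only; these are not measurements.
    cpu_logits = [2.10, -0.30, 0.75, 4.20]
    npu_logits = [2.09, -0.31, 0.76, 4.18]
    print(cosine_similarity(cpu_logits, npu_logits))
    print(exact_match_rate(["Hello world", "Good morning"],
                           ["Hello world", "Good morning "]))
```

Cosine similarity is insensitive to a uniform scaling of the logits, so it is usually paired with a text-level check such as the match rate.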

For visual verification, see `evaluation/screenshots/`.

## Demo

HY-MT1.5-1.8B inference on Ascend 910B

Interactive translation on an Ascend 910B NPU: 5 language pairs in under 1 second total.

## Quick Start

### Requirements

- Python 3.10+
- Huawei Ascend NPU (Atlas 800 A2 / Ascend 910B)
- vLLM-Ascend (`pip install vllm-ascend`)
- PyTorch with NPU support
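
A quick way to confirm that an environment meets these requirements is to check that the relevant Python modules can be found. The sketch below is a hypothetical helper, not part of this repository. It assumes the import names `torch_npu` and `vllm_ascend` for the `torch-npu` and `vllm-ascend` packages, and that `torch.npu.is_available()` becomes usable once `torch_npu` has been imported.

```python
import importlib.util
import sys

# Assumed import names for the packages this README depends on.
REQUIRED_MODULES = ("torch", "torch_npu", "transformers", "vllm", "vllm_ascend")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names of modules that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def npu_available():
    """Return True if PyTorch reports a usable Ascend NPU (needs torch_npu)."""
    import torch
    import torch_npu  # noqa: F401  (importing registers the torch.npu backend)

    return torch.npu.is_available()


if __name__ == "__main__":
    if sys.version_info < (3, 10):
        print("Python 3.10+ is required")
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("NPU available:", npu_available())
```

`find_spec` only locates the modules without importing them, so the check stays fast and does not fail on machines without Ascend drivers.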

### Installation

# Install dependencies
pip install vllm vllm-ascend transformers torch-npu

# Download the model from ModelScope
pip install modelscope
python3 -c "
from modelscope.hub.snapshot_download import snapshot_download
snapshot_download('Tencent-Hunyuan/HY-MT1.5-1.8B', 
                  cache_dir='/path/to/models/HY-MT1.5-1.8B')
"

### Inference

# Interactive mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode interactive

# Batch mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode batch \
    --input test_inputs.jsonl --output results.json

# Benchmark mode
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode benchmark \
    --output benchmark.json

# Evaluation with accuracy comparison
python3 inference.py --model /path/to/HY-MT1.5-1.8B --mode eval \
    --output eval_results.json

### Interactive Usage

>>> Hello world
  [en->zh] (0.25s): 你好,世界

>>> zh:en:你好世界
  [zh->en] (0.28s): Hello world

>>> en:ja:Good morning
  [en->ja] (0.31s): おはようございます
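
The session above suggests an input convention of `src:tgt:text`, with a bare sentence falling back to English-to-Chinese. A parser for that convention might look like the sketch below. It is hypothetical and not taken from `inference.py`: the function name, the set of two-letter codes, and the default pair are assumptions inferred from the three examples.

```python
# Assumed two-letter codes for the languages this README lists as supported.
LANG_CODES = {"zh", "en", "ja", "ko", "fr", "de", "es", "pt", "ru", "ar"}

# Assumption: a bare sentence is translated en -> zh, as in the first example.
DEFAULT_PAIR = ("en", "zh")


def parse_request(line):
    """Split 'src:tgt:text' into (src, tgt, text), else use the default pair."""
    stripped = line.strip()
    parts = stripped.split(":", 2)
    if len(parts) == 3:
        src = parts[0].strip().lower()
        tgt = parts[1].strip().lower()
        text = parts[2].strip()
        # Only treat the prefix as a language pair if both codes are known,
        # so ordinary sentences containing colons are left untouched.
        if src in LANG_CODES and tgt in LANG_CODES and text:
            return src, tgt, text
    return DEFAULT_PAIR[0], DEFAULT_PAIR[1], stripped


if __name__ == "__main__":
    for sample in ("Hello world", "zh:en:你好世界", "en:ja:Good morning"):
        print(parse_request(sample))
```

Validating both codes against a known set keeps inputs such as `Time: 10:30 now` from being misread as a language pair.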

## Project Structure

hy-mt-ascend/
├── inference.py                 # Main inference script
├── readme.md                    # This file
├── evaluation/
│   ├── eval_accuracy.py         # CPU vs NPU accuracy comparison
│   ├── eval_performance.py      # Throughput & latency benchmark
│   ├── run_eval.sh             # Full evaluation runner
│   ├── logs/
│   │   ├── accuracy.json        # Accuracy evaluation results
│   │   ├── performance.json     # Performance benchmark results
│   │   └── *.log               # Run logs
│   └── screenshots/
│       ├── accuracy_comparison.png     # Accuracy comparison screenshot
│       ├── inference_demo.png          # Interactive inference demo
│       └── performance_benchmark.png   # Performance results screenshot

## Model Details

HY-MT1.5 is a multilingual translation model supporting:

- **Languages**: Chinese, English, Japanese, Korean, French, German, Spanish, Portuguese, Russian, Arabic
- **Context Window**: 8192 tokens
- **Architecture**: Transformer Decoder (HunYuanDenseV1 variant)
- **Training**: Sequence-level knowledge distillation + contrastive learning

## Known Limitations

1. **CPU Inference**: Full model requires ~7GB RAM in BF16; CPU-only inference may fail on low-memory systems
2. **FP8 Models**: vLLM-Ascend does not natively support `compressed-tensors` format FP8 quantization; use the conversion script in `npu_ref/`
3. **GPTQ/GGUF**: Not supported by vLLM-Ascend
4. **Output Quality**: With temperature=0.0 (greedy), the model may occasionally repeat tokens; use temperature=0.7 with top_k=20 for best results
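
The sampling settings recommended in item 4 can be passed through vLLM's offline API roughly as follows. This is a sketch and not the code in `inference.py`: the `translate` helper and the `max_tokens` default are assumptions, and prompts must already be formatted the way the model expects. vLLM is imported inside the function so the module can be loaded on machines without it.

```python
# Values recommended in the "Output Quality" limitation above.
RECOMMENDED_SAMPLING = {"temperature": 0.7, "top_k": 20}


def translate(model_path, prompts, max_tokens=256):
    """Generate one completion per prompt using the recommended sampling.

    max_tokens=256 is an illustrative default, not a value from this README.
    """
    # Imported lazily: requires vllm (and vllm-ascend on Ascend hardware).
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(max_tokens=max_tokens, **RECOMMENDED_SAMPLING)
    outputs = llm.generate(prompts, params)
    # Each result holds one or more candidates; keep the first one's text.
    return [result.outputs[0].text for result in outputs]


if __name__ == "__main__":
    texts = translate("/path/to/HY-MT1.5-1.8B", ["<formatted translation prompt>"])
    for text in texts:
        print(text)
```

Because sampling with a non-zero temperature is stochastic, repeated runs may produce slightly different translations; fix a seed in `SamplingParams` if reproducibility matters.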

## License

This project is licensed under the Apache License 2.0.

The underlying model (Tencent HunYuan HY-MT1.5) is subject to Tencent's original license terms.

---

*Adapted for Huawei Ascend NPU by @m0_74196153*