SeamlessM4T Inference Deployment on Ascend

🎯 Overview

Deploy Meta's SeamlessM4T multilingual translation model on an Atlas 800T A2 (Ascend 910B).

Component	Version
Hardware	Atlas 800T A2 (Ascend 910B × 8)
CANN	8.0.RC2
PyTorch	2.3.0
torch_npu	1.26.0.post2
Model	facebook/seamless-m4t-large

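📁 Project Structure

seamless_m4t_ascend/
├── Dockerfile                   # Container image build
├── docker-compose.yml           # Service orchestration (single-card / 8-card / download / benchmark)
├── requirements.txt             # Python dependencies
├── .env                         # Environment variable configuration
├── README.md                    # This file
├── src/
│   ├── __init__.py
│   ├── config.py                # Configuration management
│   ├── model.py                 # Model loading / unloading / lifecycle
│   ├── inference.py             # Inference engine (T2TT / S2TT / T2ST)
│   ├── api_server.py            # FastAPI RESTful service
│   └── utils.py                 # Utility functions
├── scripts/
│   ├── check_env.sh             # Environment check script
│   ├── deploy.sh                # One-command deployment script
│   └── benchmark.py             # Performance benchmark
└── tests/
    └── test_inference.py        # Unit tests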

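🚀 Quick Start

Option 1: Docker deployment (recommended)

# 1. Pre-download the model
docker compose --profile download up

# 2. Start the single-card inference service
docker compose up seamless-m4t -d

# 3. Start the 8-card inference service
docker compose up seamless-m4t-8card -d

# 4. Run the benchmark
docker compose --profile bench up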

Option 2: Bare-metal deployment

# 1. Check the environment
bash scripts/check_env.sh

# 2. Download the model
bash scripts/deploy.sh download

# 3. Start the service (on NPU 0)
bash scripts/deploy.sh serve 0

# 4. Run the benchmark
bash scripts/deploy.sh bench

Option 3: Start the API directly

cd src
# Use the first NPU (device 0)
export NPU_VISIBLE_DEVICES=0
python api_server.py

📡 API Documentation

Once the service is running, open http://localhost:8000/docs to view the Swagger documentation.

Health check

curl http://localhost:8000/health

Text translation (T2TT)

curl -X POST http://localhost:8000/translate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "欢迎来到昇腾AI平台。",
    "src_lang": "cmn",
    "tgt_lang": "eng"
  }'
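
The same T2TT call can be made from Python. The following is a minimal sketch using only the standard library; it assumes the service is reachable at the default address shown above. The helper names `build_translate_request` and `translate` are illustrative and not part of this project, and because the response schema is not described in this README, the JSON body is returned unmodified.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # default address used throughout this README


def build_translate_request(text, src_lang, tgt_lang, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /translate endpoint."""
    payload = {"text": text, "src_lang": src_lang, "tgt_lang": tgt_lang}
    return urllib.request.Request(
        url=f"{base_url}/translate",
        # ensure_ascii=False keeps non-ASCII text readable in the request body
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, src_lang, tgt_lang, base_url=BASE_URL, timeout=30.0):
    """Send the request and return the decoded JSON response as-is."""
    request = build_translate_request(text, src_lang, tgt_lang, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `translate("欢迎来到昇腾AI平台。", "cmn", "eng")` sends the same payload as the curl command above.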

Batch translation

curl -X POST http://localhost:8000/translate/batch \
  -H "Content-Type: application/json" \
  -d '{
    "items": [
      {"text": "你好。", "src_lang": "cmn", "tgt_lang": "eng"},
      {"text": "Hello.", "src_lang": "eng", "tgt_lang": "cmn"}
    ]
  }'
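
When a client has more items than the server's batch limit, it can split them before calling `/translate/batch`. The sketch below assumes the limit equals the `API_MAX_BATCH_SIZE` default of 8 and that the server expects at most that many entries in `items`; how the server actually treats larger batches is not stated in this README. `chunk_items` is a hypothetical helper.

```python
MAX_BATCH_SIZE = 8  # assumed to mirror the API_MAX_BATCH_SIZE default


def chunk_items(items, max_batch_size=MAX_BATCH_SIZE):
    """Yield request bodies for /translate/batch, each with a bounded item count."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield {"items": items[start:start + max_batch_size]}
```

Each yielded dictionary has the same shape as the JSON body in the curl example above and can be posted as one request.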

Speech translation (S2TT)

curl -X POST http://localhost:8000/translate/speech \
  -F "file=@speech.wav" \
  -F "src_lang=eng" \
  -F "tgt_lang=cmn"

Text-to-speech translation (T2ST)

curl -X POST http://localhost:8000/tts \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hello from Ascend AI platform.",
    "src_lang": "eng",
    "tgt_lang": "eng",
    "speaker_id": 0
  }' \
  --output speech_output.wav

List supported languages

curl http://localhost:8000/languages

⚙️ Configuration

Key environment variables

Variable	Default	Description
`NPU_VISIBLE_DEVICES`	`0`	NPU device ID(s) to use
`MODEL_ID`	`facebook/seamless-m4t-large`	Model ID
`TORCH_DTYPE`	`float16`	Inference precision
`API_PORT`	`8000`	Service port
`API_MAX_BATCH_SIZE`	`8`	Maximum batch size
`ACL_PRECISION_MODE`	`allow_fp32_to_fp16`	CANN precision mode
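CANN tuning variables (specific to Atlas 800T A2)

Variable	Description
`COMBINED_ENABLE=1`	Enable operator fusion
`ASCEND_LAUNCH_BLOCKING=0`	Asynchronous execution mode
`TASK_HISTORY=1`	Memory optimization
`HCCL_CONNECT_TIMEOUT=300`	Multi-card communication timeout (seconds)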

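🔧 Adaptation Notes

Text-output tasks (T2TT / S2TT)

✅ Runs entirely on the NPU; no special handling is required.

Speech-output tasks (T2ST / S2ST)

⚠️ Hybrid mode: the vocoder operators in fairseq2 are not yet supported on the NPU, so the following strategy is used:

Encoder + decoder → NPU inference
Vocoder → CPU execution
The model is moved between the two devices automatically (model.to("npu:0") / model.to("cpu"))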
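
The device switching described above can be sketched as a small context manager. This is an illustration of the pattern only, not the code in `src/model.py`: `temporarily_on` is a hypothetical helper, and it relies solely on the object exposing a `.to(device)` method, as `torch.nn.Module` does. Any intermediate tensors passed to the vocoder would need to be moved to the CPU as well.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_on(module, device, restore_to):
    """Move `module` to `device` for the duration of the block, then move it back.

    The restore step runs even if the block raises, so the module is not
    left stranded on the fallback device after a failed request.
    """
    module.to(device)
    try:
        yield module
    finally:
        module.to(restore_to)
```

In a speech-output request, the encoder and decoder would run while the model sits on `"npu:0"`, and the vocoder step would run inside `with temporarily_on(model, "cpu", restore_to="npu:0"):`.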

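Multi-card parallelism

Independent inference on all 8 cards is supported: each card loads a full copy of the model, and requests are processed in parallel by multiple processes.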
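
One way to realize "one process per card" is to start a separate server process for each NPU, each with its own environment. The sketch below is an assumption about how this could be laid out, not a description of what the `seamless-m4t-8card` service or `deploy.sh` does: the port-per-card scheme (`base_port + card index`) and both function names are hypothetical, while `NPU_VISIBLE_DEVICES`, `API_PORT`, and `api_server.py` are taken from this README.

```python
import os
import subprocess
import sys


def worker_environments(num_cards=8, base_port=8000, base_env=None):
    """Return one environment mapping per card, pinning each worker to a single NPU."""
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for card in range(num_cards):
        env = dict(base)
        env["NPU_VISIBLE_DEVICES"] = str(card)    # each process sees exactly one card
        env["API_PORT"] = str(base_port + card)   # hypothetical: one port per worker
        envs.append(env)
    return envs


def launch_workers(envs, src_dir="src"):
    """Start one api_server.py process per environment and return the process handles."""
    return [
        subprocess.Popen([sys.executable, "api_server.py"], cwd=src_dir, env=env)
        for env in envs
    ]
```

A load balancer or a simple round-robin client would then spread requests across the resulting ports.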

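📊 Expected Performance

Metric	Value (estimated)
Model parameters	~2.3B
Device memory usage (FP16)	~13 GB
Available memory per card	~64 GB (Ascend 910B)
Single-request translation latency (T2TT)	~200-500 ms
8-card throughput	Approximately linear scaling (each card serves an independent model replica)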
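
As a rough sanity check on the memory figures above, the FP16 weights alone can be estimated from the parameter count. This is a back-of-the-envelope sketch: the split of the remaining footprint (presumably activations, decoding caches, and runtime workspace) is an assumption, not a measurement.

```python
BYTES_PER_FP16_PARAM = 2


def fp16_weight_gb(num_params):
    """Approximate size of the FP16 weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_FP16_PARAM / 1e9


weights_gb = fp16_weight_gb(2.3e9)       # about 4.6 GB of weights
reported_footprint_gb = 13               # estimate from the table above
card_memory_gb = 64                      # estimate from the table above

non_weight_gb = reported_footprint_gb - weights_gb    # about 8.4 GB beyond the weights
headroom_gb = card_memory_gb - reported_footprint_gb  # about 51 GB left per card
```

Under these estimates a single model replica uses roughly a fifth of one card's memory, which is consistent with loading a full copy on every card.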

🐛 FAQ

Q: NPU devices are not visible

# Check the driver
npu-smi info
# Check the Docker runtime
docker info | grep ascend

Q: Out of device memory

# Use FP16
export TORCH_DTYPE=float16
# Clear the NPU cache
python -c "import torch, torch_npu; torch.npu.empty_cache()"

Q: fairseq2 operator errors

# Disable the CUDA kernels and fall back to CPU
export FAIRSEQ2_DISABLE_CUDA_KERNELS=1

📝 License

This project is provided for learning and reference only. The SeamlessM4T model is subject to Meta's license.