chronos-2 Ascend NPU Adaptation Skill

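Model Information

Model name: amazon/chronos-t5-small (chronos-2 series)
Model type: Time series forecasting (T5-based encoder-decoder)
Model URL: https://ai.gitcode.com/hf_mirrors/amazon/chronos-2
Supported tasks: Univariate/multivariate time series forecasting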

Hardware Requirements

Item	Minimum	Recommended
NPU	Ascend 310P	Ascend 910B4
Device memory	2 GB	4 GB+
CANN	>= 8.0.RC2	>= 8.5.1

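Environment Setup

# Install dependencies
# Quote version specifiers so the shell does not treat ">" as a redirect.
pip install torch==2.9.0+cpu --index-url https://download.pytorch.org/whl/cpu
pip install torch_npu==2.9.0.post1
pip install "transformers>=4.37.0"
pip install "chronos-forecasting>=0.3.0"  # PyPI package that provides the chronos module

# Set the Hugging Face mirror (for networks in mainland China)
export HF_ENDPOINT=https://hf-mirror.com

# CANN tuning environment variables
export ASCEND_SLOG_PRINT_TO_STDOUT=0
export ASCEND_GLOBAL_LOG_LEVEL=3
export ACL_OP_COMPILER_CACHE_MODE=enable
export ACL_OP_DEBUG_LEVEL=0
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True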
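
When the model is launched from a Python entry point rather than a prepared shell, the same variables can be set in-process. The following is a minimal sketch, assuming the variables are read when the libraries are imported or initialized, so it should run before importing torch, torch_npu, or chronos. The helper name apply_npu_env is hypothetical and not part of any library.

```python
import os

# Values copied from the shell exports above.
NPU_ENV = {
    "HF_ENDPOINT": "https://hf-mirror.com",
    "ASCEND_SLOG_PRINT_TO_STDOUT": "0",
    "ASCEND_GLOBAL_LOG_LEVEL": "3",
    "ACL_OP_COMPILER_CACHE_MODE": "enable",
    "ACL_OP_DEBUG_LEVEL": "0",
    "PYTORCH_NPU_ALLOC_CONF": "expandable_segments:True",
}


def apply_npu_env(environ=os.environ):
    """Set each variable unless it is already defined (hypothetical helper).

    Using setdefault keeps any value exported by the shell, so operators
    can still override a setting without editing the script.
    """
    for name, value in NPU_ENV.items():
        environ.setdefault(name, value)
    return environ


apply_npu_env()
# Import torch / torch_npu / chronos only after this point.
```

Passing a plain dict instead of os.environ makes the helper easy to check without touching the real process environment.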

Quick Start

1. Load the model

import torch
import torch_npu
from chronos import ChronosPipeline

# Initialize the NPU
device = torch.device("npu:0")
torch.npu.set_device(device)

# Load the model (device_map places the weights on the NPU)
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="npu:0",
    torch_dtype=torch.float32,
)

2. Single-series prediction

import numpy as np

# Prepare the time series data (batch=1, seq_len=512); keep the tensor on the CPU
context = np.cumsum(np.random.randn(1, 512), axis=1)
context_tensor = torch.tensor(context, dtype=torch.float32)

# Forecast the next 64 steps
forecast = pipeline.predict(context_tensor, prediction_length=64)
# forecast shape: (batch, num_samples, prediction_length)
print(forecast.shape)  # e.g., (1, 20, 64)
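
Because the forecast holds num_samples sampled paths per series, it is usually reduced to a point forecast and an interval before use. The sketch below shows that reduction with the standard library only, on toy data; the function name summarize_samples and the 10%/50%/90% levels are illustrative assumptions, not part of the Chronos API.

```python
from statistics import quantiles


def summarize_samples(sample_paths):
    """Reduce sample paths (num_samples x prediction_length) to per-step bands.

    Returns (low, median, high): the 10%, 50%, and 90% quantiles of the
    sampled values at each forecast step.
    """
    low, median, high = [], [], []
    for step_values in zip(*sample_paths):  # all samples for one step
        deciles = quantiles(step_values, n=10, method="inclusive")
        low.append(deciles[0])     # 10th percentile
        median.append(deciles[4])  # 50th percentile
        high.append(deciles[8])    # 90th percentile
    return low, median, high


# Toy input: 5 sample paths, 2 forecast steps each.
paths = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
low, median, high = summarize_samples(paths)
```

With a real forecast tensor, forecast[0].tolist() would supply the sample paths for the first series in the batch.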

3. Batch prediction (recommended)

# Increasing the batch size is the most effective way to raise throughput
context = np.cumsum(np.random.randn(32, 512), axis=1)
context_tensor = torch.tensor(context, dtype=torch.float32)

forecast = pipeline.predict(context_tensor, prediction_length=32)
# Throughput reaches 38+ samples/s
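
When there are more series than fit in one call, they can be fed through in fixed-size chunks. The following is a minimal sketch of that loop; predict_fn is a hypothetical stand-in for a wrapper around pipeline.predict, and both helper names are assumptions introduced for illustration.

```python
def iter_batches(series, batch_size=32):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(series), batch_size):
        yield series[start:start + batch_size]


def predict_in_batches(series, predict_fn, batch_size=32):
    """Run predict_fn on each batch and concatenate results in input order.

    predict_fn is a hypothetical callable that takes a list of series and
    returns one result per series, for example a wrapper that builds a CPU
    tensor and calls pipeline.predict.
    """
    results = []
    for batch in iter_batches(series, batch_size):
        results.extend(predict_fn(batch))
    return results


# Toy check with a stand-in predictor that doubles each value.
doubled = predict_in_batches(list(range(70)), lambda b: [x * 2 for x in b])
```

The last batch may be smaller than batch_size; the sketch passes it through as-is rather than padding it.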

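Performance Tuning Guide

High-throughput configuration

# Suited to offline batch forecasting
forecast = pipeline.predict(
    context_tensor,
    prediction_length=32,   # shorter prediction length
    num_samples=20,         # keep enough samples for uncertainty estimates
)
# Recommended batch_size=32
# Measured throughput: 38.06 samples/s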

Low-latency configuration

# Suited to real-time inference
forecast = pipeline.predict(
    context_tensor,
    prediction_length=16,   # shortest prediction length
    num_samples=5,          # fewer samples
)
# Single-series latency: ~280ms
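
Latency figures like the one above are sensitive to warm-up effects such as operator compilation on the first calls. The following is a minimal timing sketch, assuming the measured callable blocks until its result is ready; if it returns before device work finishes, a synchronization call would be needed inside the callable. The function name measure_latency is hypothetical.

```python
import statistics
import time


def measure_latency(fn, warmup=3, runs=10):
    """Time fn() after discarding warm-up calls; report milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


# Toy usage with a cheap stand-in workload.
stats = measure_latency(lambda: sum(range(1000)), warmup=2, runs=5)
```

For the pipeline, fn could be a lambda that calls pipeline.predict with the low-latency settings shown above. The median is reported because it is less affected by occasional slow runs than the mean.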

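Optimizations to Avoid

Optimization	Status	Reason
torch.compile	Not supported	Triton broadcast error / NPUGraph overwrite issue
FP16	Slower	Measured 13% slower than FP32
Passing an NPU tensor directly	Raises an error	The tokenizer's bucketize call requires a CPU tensor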

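Accuracy Validation

Validation method

Chronos-2 is a generative model whose outputs are stochastic, so validation should test statistical consistency between the output distributions rather than compare individual values point by point.

import numpy as np
from scipy import stats

def validate_npu_accuracy(npu_pipeline, cpu_pipeline, test_data, runs=30):
    """Compare the mean forecast of repeated NPU and CPU runs with a t-test."""
    npu_means = []
    cpu_means = []

    for _ in range(runs):
        npu_out = npu_pipeline.predict(test_data, 64)
        cpu_out = cpu_pipeline.predict(test_data, 64)
        # Reduce each run to one scalar: the mean over all samples and steps
        npu_means.append(npu_out.cpu().mean().item())
        cpu_means.append(cpu_out.mean().item())

    t_stat, p_value = stats.ttest_ind(cpu_means, npu_means)
    return p_value > 0.05  # p > 0.05: no significant difference between the means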

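Validation criteria

Pass criterion: t-test p-value > 0.05
Measured result: p = 0.2252
Conclusion: no significant difference between the NPU and CPU output distributions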
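
The t-test assumes roughly normal per-run means. As a cross-check that avoids this assumption, a permutation test on the same two lists of per-run means can be used. Note that this is a different technique from the t-test used for the result above, sketched here with the standard library only; the function name and defaults are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Shuffles the pooled values, splits them into groups of the original
    sizes, and counts how often the shuffled difference is at least as
    large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Toy data: identical groups versus clearly shifted groups.
p_same = permutation_p_value([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
p_shifted = permutation_p_value(
    [float(i) for i in range(10)], [100.0 + i for i in range(10)]
)
```

Applied to the validation above, a and b would be the cpu_means and npu_means lists, with the same p > 0.05 pass criterion.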

Performance Baseline

Configuration	Latency	Throughput	Use case
BS=1, P=64	1084ms	0.92 samples/s	Baseline
BS=1, P=16	281ms	3.56 samples/s	Low latency
BS=16, P=64	1111ms	14.40 samples/s	Balanced
BS=32, P=32	841ms	38.06 samples/s	High throughput

BS = batch_size, P = prediction_length
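
The throughput column follows from the latency column as batch_size divided by latency in seconds. A short sketch of that arithmetic, using the numbers from the table; the recomputed values agree with the table up to rounding of the reported latencies.

```python
def throughput(batch_size, latency_ms):
    """Series forecast per second for one batch of the given latency."""
    return batch_size / (latency_ms / 1000.0)


# (batch_size, prediction_length, latency in ms) from the baseline table.
baseline = [(1, 64, 1084), (1, 16, 281), (16, 64, 1111), (32, 32, 841)]

for bs, pred_len, latency_ms in baseline:
    print(f"BS={bs}, P={pred_len}: {throughput(bs, latency_ms):.2f} samples/s")
```

The comparison between BS=1 and BS=16 at P=64 shows why batching pays off: latency rises only from 1084ms to 1111ms while the number of series per call grows 16 times.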

Known Issues and Fixes

Issue 1: tokenizer cross-device error

Symptom: RuntimeError: Expected all tensors to be on the same device, but got boundaries is on cpu

Fix: pass a CPU tensor to predict() and let ChronosPipeline handle the device transfer internally

# Correct
ts_cpu = torch.tensor(data, dtype=torch.float32)
pipeline.predict(ts_cpu, 64)

# Incorrect
ts_npu = torch.tensor(data, dtype=torch.float32).to("npu:0")
pipeline.predict(ts_npu, 64)  # raises the error above

Issue 2: torch.compile fails

Symptom: Triton raises ValueError: Cannot broadcast, rank mismatch, or an NPUGraph overwrite error occurs

Fix: do not enable torch.compile on the current CANN version; wait for support in a later release

Issue 3: FP16 performance regression

Symptom: FP16 inference is about 13% slower than FP32

Fix: keep FP32 inference; this model runs more efficiently in FP32 on the 910B4

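Model Variants

The chronos-2 series includes several sizes. This skill was validated with chronos-t5-small; the other variants should in principle work the same way but have not been tested here:

Model	Parameters	Use case
chronos-t5-tiny	~8M	Edge devices
chronos-t5-mini	~20M	Low latency
chronos-t5-small	~46M	General purpose (this skill)
chronos-t5-base	~200M	High accuracy
chronos-t5-large	~710M	Large-scale forecasting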

References