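1. Overview

Evo 2 is a state-of-the-art DNA language model built for long-context modeling and design. It uses the StripedHyena 2 architecture to model DNA sequences at single-nucleotide resolution, with a context length of up to 1 million base pairs. Evo 2 is pretrained with Savanna, autoregressively, on the OpenGenome2 dataset, which contains 8.8 trillion tokens from all domains of life.

The paper "Genome modeling and design across all domains of life with Evo 2" describes Evo 2 in detail.

Reference implementation:

url=https://github.com/ArcInstitute/evo2.git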

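2. Prepare the inference environment

2.1 Install the Ascend environment

Set up the Ascend environment by following the Ascend community guide "Pytorch框架训练环境准备" (PyTorch framework training environment preparation). This repository supports the software versions listed in Table 1.

Table 1 Supported Ascend software versions

Software	Supported version
CANN	8.2.RC1
torch-npu	2.6.0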

2.2 Prepare the environment

The PyTorch version and the known third-party dependencies supported by the model are listed in Table 2.

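Table 2 Supported versions

Third-party library	Supported version
Python	3.11.10
PyTorch	2.6.0
TorchVision	0.21.0
numpy	1.26.0
megatron-core	0.12.1
mindspeed	0.12.1
triton-ascend	3.2.0rc4
vtx	1.0.7

The currently supported hardware is listed in Table 3.

Table 3 Supported hardware

Device model	NPU configuration	OS
Atlas 900 A3	Single card	ARM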

2.3 Prepare the container environment

Download the base image

Base image	Image address
PyTorch 2.1	swr.cn-southwest-2.myhuaweicloud.com/atelier/pytorch_2_1_ascend:pytorch_2.1.0-cann_8.2.rc1-py_3.11-hce_2.0.2503-aarch64-snt9b23-20250729103313-3a25129

Create the container

  #!/bin/bash
  cname=openmodel_evo2_7b
  dm=swr.cn-southwest-2.myhuaweicloud.com/atelier/pytorch_2_1_ascend:pytorch_2.1.0-cann_8.2.rc1-py_3.11-hce_2.0.2503-aarch64-snt9b23-20250729103313-3a25129
  echo "Creating new container, container name is ${cname}"
  docker run -itd -u root --net=host \
          --privileged=true --name ${cname} \
          --shm-size=256g \
          --device=/dev/davinci0 \
          --device=/dev/davinci1 \
          --device=/dev/davinci2 \
          --device=/dev/davinci3 \
          --device=/dev/davinci4 \
          --device=/dev/davinci5 \
          --device=/dev/davinci6 \
          --device=/dev/davinci7 \
          --device=/dev/davinci8 \
          --device=/dev/davinci9 \
          --device=/dev/davinci10 \
          --device=/dev/davinci11 \
          --device=/dev/davinci12 \
          --device=/dev/davinci13 \
          --device=/dev/davinci14 \
          --device=/dev/davinci15 \
          --device=/dev/davinci_manager \
          --device=/dev/devmm_svm \
          --device=/dev/hisi_hdc \
          --env "PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256" \
          -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
          -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
          -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
          -v /etc/ascend_install.info:/etc/ascend_install.info \
          -v /mnt:/mnt \
          $dm /bin/bash -i

2.4 Install dependencies

Enter the container, then create and change into a directory for the model:
```
docker exec -itu root openmodel_evo2_7b bash
```

Clone the original repository that contains the Evo 2 model and install it:

git clone https://github.com/ArcInstitute/evo2.git
cd evo2
pip install -e .

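Uninstall the vtx package, download the vortex source, apply the patch, and install it:

# Uninstall the vtx package
pip uninstall -y vtx
# Clone the vortex source (main branch)
git clone https://github.com/Zymrael/vortex.git
cd vortex
# Check out the pinned commit
git checkout 3e2511427794d02f46e464bc34a8895c9b911e76
# Apply the patch (see the gitcode directory)
git apply ../vortex_ascend.patch
# Install vortex
pip install -e .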

Install MindSpeed from source:

git clone https://gitcode.com/Ascend/MindSpeed.git
cd MindSpeed
git checkout v2.3.0_core_r0.12.1
pip install -e .

Install Megatron-LM from source:

git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
git checkout core_v0.12.1
pip install -e .

Upgrade the third-party libraries to the required versions:

# Evo 2 requires PyTorch 2.6.x or 2.7.x
pip install torch==2.6.0
pip install torch-npu==2.6.0
pip install torchvision==0.21.0
pip install numpy==1.26.0
pip install triton-ascend==3.2.0rc4

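2.5 Download the weights

Downloading from the overseas address requires a VPN that you configure yourself; the domestic mirror does not.

# Overseas address
git clone https://huggingface.co/arcinstitute/evo2_7b
# Domestic mirror
git clone https://hf-mirror.com/arcinstitute/evo2_7b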

2.6 Migration and adaptation

Add the automatic migration code:

# evo2/test/test_evo2.py:8
+ import torch_npu
+ from torch_npu.contrib import transfer_to_npu

Change the weight path:

# evo2/test/test_evo2.py:100
# Initialize the model
- model = Evo2(args.model_name)
+ weight_path = "/path/to/weights/evo2_7b/evo2_7b.pt"
+ model = Evo2(args.model_name, weight_path)

Add the Megatron-LM adaptation:

# evo2/test/test_evo2.py:11
+ import mindspeed.megatron_adaptor

Disable online (JIT) compilation and internal tensor formats:

# evo2/test/test_evo2.py:13
+ torch.npu.config.allow_internal_format = False
+ torch.npu.set_compile_mode(jit_compile=False)
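
Taken together, the import and configuration lines in this section could be grouped as in the sketch below. This is only an illustration: in the source the lines are added directly at the top of test_evo2.py, and the `setup_npu` wrapper and its availability guard are hypothetical additions so that the snippet also runs on a machine without torch_npu.

```python
import importlib.util


def setup_npu():
    """Apply the NPU adaptations from this section; return False if torch_npu is absent."""
    if importlib.util.find_spec("torch_npu") is None:
        return False  # not an Ascend environment, nothing to configure

    import torch
    import torch_npu  # registers the NPU backend with PyTorch
    from torch_npu.contrib import transfer_to_npu  # automatic CUDA-to-NPU migration
    import mindspeed.megatron_adaptor  # Megatron-LM adaptation for Ascend

    # Disable internal tensor formats and online (JIT) compilation.
    torch.npu.config.allow_internal_format = False
    torch.npu.set_compile_mode(jit_compile=False)
    return True


print("NPU configured:", setup_npu())
```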

3. Dataset

Inference uses the dataset bundled with the project by default: evo2/test/data/prompts.csv

4. Run inference

Change into the source directory.
```
cd evo2/test/
```

Run the script.

# The default configuration uses the `Evo2_7b` model
python test_evo2.py

5. Example results

Sequence Results:
Sequence 1: Loss = 0.182, Accuracy = 93.64%
Sequence 2: Loss = 0.352, Accuracy = 86.25%
Sequence 3: Loss = 0.500, Accuracy = 80.29%
Sequence 4: Loss = 0.355, Accuracy = 85.28%

Mean Loss: 0.347
Mean Accuracy: 86.364%

Test Passed! Loss matches expected 0.348

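6. Accuracy comparison

Table 4 GPU vs. NPU accuracy

Sequence	Length	GPU loss	GPU accuracy	NPU loss	NPU accuracy
1	6538	0.182	93.61%	0.182	93.64%
2	7056	0.352	86.22%	0.352	86.25%
3	6160	0.500	80.13%	0.500	80.29%
4	7616	0.355	85.36%	0.355	85.28%
Mean		0.347	86.33%	0.347	86.36%

The loss values match to three decimal places.

The mean accuracy differs by about 0.0003 (0.03 percentage points).

Conclusion: GPU and NPU accuracy are aligned.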
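
The averages can be recomputed from the per-sequence rows of the comparison table. The sketch below uses only the rounded values printed in the table, so its results may differ in the last digit from averages taken over unrounded values; the mean loss it prints should agree with the Mean Loss line in the result example of section 5.

```python
from statistics import mean

# Per-sequence values copied from the GPU/NPU comparison table:
# (length, GPU loss, GPU accuracy %, NPU loss, NPU accuracy %)
ROWS = [
    (6538, 0.182, 93.61, 0.182, 93.64),
    (7056, 0.352, 86.22, 0.352, 86.25),
    (6160, 0.500, 80.13, 0.500, 80.29),
    (7616, 0.355, 85.36, 0.355, 85.28),
]

gpu_loss = mean(r[1] for r in ROWS)
npu_loss = mean(r[3] for r in ROWS)
gpu_acc = mean(r[2] for r in ROWS)
npu_acc = mean(r[4] for r in ROWS)

# Signed accuracy difference (NPU minus GPU), in percentage points.
acc_diff = mean(r[4] - r[2] for r in ROWS)

print(f"mean loss: GPU {gpu_loss:.3f}, NPU {npu_loss:.3f}")
print(f"mean accuracy: GPU {gpu_acc:.2f}%, NPU {npu_acc:.2f}%")
print(f"mean accuracy difference: {acc_diff:.3f} percentage points")
```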

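7. Performance data

7.1 Performance overview

Measurement method: the forward time of one inference pass (four sequences per pass) is recorded over 5 passes, with torch.npu.synchronize() called before and after inference, and the 5 measurements are averaged.

Note: the sequence lengths are 6538, 7056, 6160, and 7616.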
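
The measurement procedure described above can be sketched as follows. This is an illustration under assumptions rather than the script that produced the reported numbers: `mean_forward_time` and its parameters are hypothetical names, and the synchronization call is passed in as an argument so that the snippet also runs on a machine without an NPU.

```python
import time


def mean_forward_time(forward, sync=lambda: None, repeats=5):
    """Average wall-clock time of `forward` over `repeats` runs.

    `sync` is called before and after each run so that asynchronous device
    work has finished whenever the clock is read.
    """
    elapsed = []
    for _ in range(repeats):
        sync()
        start = time.perf_counter()
        forward()
        sync()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)


# CPU-only demonstration with a stand-in workload.
demo = mean_forward_time(lambda: sum(range(10_000)))
print(f"mean forward time: {demo:.6f} s")
```

On an Ascend device, `sync` would be `torch.npu.synchronize` and `forward` would run the model over the four test sequences.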

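Table 5 Performance data

Date	GPU (H20, single card) mean forward time (s)	NPU (A3, single die) mean forward time (s)	Performance ratio (GPU time / NPU time)
2026.03.24 (out of the box)	1.018	3.620	28.12%
2026.03.30	-	1.521	66.93%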

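7.2 Performance optimization

Environment variable tuning

Setting the following environment variables reduces the forward time from 3.620 s to 3.041 s, raising the performance ratio to 33.48%, a gain of 5.36 percentage points.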

export COMBINED_ENABLE=1
export CPU_AFFINITY_CONF=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True

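Eliminating the RealDiv operator

Profiling showed that the RealDiv operator accounted for the largest share of time. After locating and eliminating it, the forward time dropped from 3.041 s to 2.127 s, raising the ratio to 47.86%, a gain of 14.38 percentage points.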

vim vortex/vortex/model/engine.py
             elif long_fir_threshold is None:
                 scale = torch.tensor(1.0 / fft_size, dtype=torch.float32, device='npu')
                 compiled_rfft = torch.compile(
                     lambda h, n, s: torch.fft.rfft(h, n=n) * s,
                     backend="aipu",
                     mode="max-autotune"
                 )
                 H = compiled_rfft(h.to(torch.float32), fft_size, scale)
                 #H_origin = torch.fft.rfft(h.to(dtype=torch.float32), n=fft_size) / fft_size

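Eliminating the Slice operator

Profiling showed that the Slice operator had the second-largest share of time. After locating and eliminating it, the forward time dropped from 2.127 s to 1.567 s, raising the ratio to 64.96%, a gain of 17.10 percentage points.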

vim vortex/vortex/model/engine.py
             elif long_fir_threshold is None:
                 scale = torch.tensor(1.0 / fft_size, dtype=torch.float32, device='npu')
                 #compiled_rfft = torch.compile(
                 #    lambda h, n, s: torch.fft.rfft(h, n=n) * s,
                 #    backend="aipu",
                 #    mode="max-autotune"
                 #)
                 #H = compiled_rfft(h.to(torch.float32), fft_size, scale)
                 H = torch.fft.rfft(h.to(dtype=torch.float32), n=fft_size)
                 H_real_view = torch.view_as_real(H)
                 H_scaled_real = H_real_view.contiguous() * scale
                 H = torch.view_as_complex(H_scaled_real)
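
The two rewrites above rely on a simple arithmetic identity: dividing complex coefficients by fft_size gives the same result as multiplying their real and imaginary parts by a precomputed 1 / fft_size. The plain-Python sketch below illustrates only that identity, with made-up sample values; it does not use torch and says nothing about which operators the NPU actually runs.

```python
import math

fft_size = 8            # illustrative value; a power of two
scale = 1.0 / fft_size  # computed once, replaces a per-element division

# Stand-in for a few complex frequency-domain coefficients.
H = [complex(1.5, -2.0), complex(0.25, 4.0), complex(-3.0, 0.5)]

# Variant 1: element-wise division, as in the original line.
divided = [h / fft_size for h in H]

# Variant 2: view each complex value as a (real, imag) pair, scale the pair,
# then rebuild the complex value. This mirrors view_as_real / view_as_complex.
pairs = [(h.real, h.imag) for h in H]
scaled_pairs = [(re * scale, im * scale) for re, im in pairs]
rebuilt = [complex(re, im) for re, im in scaled_pairs]

for a, b in zip(divided, rebuilt):
    assert math.isclose(a.real, b.real) and math.isclose(a.imag, b.imag)
```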

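Replacing RmsNorm with a fused operator

Profiling located the RmsNorm computation and confirmed that it can be replaced with a fused operator. The forward time dropped from 1.563 s after a change from 1.567 s, that is, from 1.567 s to 1.563 s, raising the ratio to 65.13%, a gain of 0.17 percentage points.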

vim vortex/vortex/model/layers.py
    def forward(self, x):
        if self.use_flash_rmsnorm:
            return self.rmsnorm_func(x, self.scale, self.eps)
        else:
            #y = x / (x.norm(2, dim=-1, keepdim=True) * self.hidden_size ** (-1.0 / 2) + self.eps)
            
            #return self.scale * y
            import torch_npu
            return torch_npu.npu_rms_norm(x, self.scale, self.eps)[0]
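
For reference, the commented-out formulation can be compared numerically with the common RMSNorm definition. The sketch below is plain Python with made-up inputs, and it assumes that the fused operator follows the common definition, where eps is added under the square root; that assumption is not confirmed by the source. Both function names are hypothetical.

```python
import math


def rmsnorm_reference(x, scale, eps):
    """Commented-out formulation: eps is added to the RMS itself."""
    l2 = math.sqrt(sum(v * v for v in x))
    rms = l2 * len(x) ** -0.5
    return [s * v / (rms + eps) for v, s in zip(x, scale)]


def rmsnorm_common(x, scale, eps):
    """Common RMSNorm definition: eps is added under the square root."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [s * v / math.sqrt(mean_sq + eps) for v, s in zip(x, scale)]


x = [1.0, 2.0, 3.0, 4.0]
scale = [1.0, 1.0, 1.0, 1.0]
ref = rmsnorm_reference(x, scale, eps=1e-6)
common = rmsnorm_common(x, scale, eps=1e-6)

# The two placements of eps differ only slightly for a small eps.
for a, b in zip(ref, common):
    assert math.isclose(a, b, rel_tol=1e-5)
```

Because the two forms are not bit-identical, a swap like this is worth checking against the accuracy comparison in section 6.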

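Removing redundant cast operations

Profiling was used to find the cast operations. Accuracy was compared with and without the casts and was unaffected, so the casts can be removed. The forward time dropped from 1.563 s to 1.521 s, raising the ratio to 66.93%, a gain of 1.80 percentage points.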

            z = fir_fn(
                #u.to(torch.float32),
                #weight.to(torch.float32),
                u,
                weight,
                bias=None,
                stride=1,
                padding=fir_length - 1,
                groups=u.shape[1],  # always set to D, regardless of filter grouping
            )[..., :L]

            if self.print_activations:
                activations_logger.info(f"post filter: {z}, {z.min()}, {z.max()}")

            #z = z.to(u.dtype)

            if gated_bias is False:
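
The ratios and gains quoted for each step above can be reproduced from the reported forward times. In this sketch the ratio is taken as GPU time divided by NPU time, and each gain is the difference between consecutive ratios after rounding to two decimals, which is how the quoted figures appear to have been derived. The step labels are descriptive and not taken from any script.

```python
GPU_TIME = 1.018  # seconds, H20 single card

# NPU mean forward time in seconds after each step described above.
NPU_STEPS = [
    ("out of the box", 3.620),
    ("environment variables", 3.041),
    ("RealDiv removed", 2.127),
    ("Slice removed", 1.567),
    ("fused RmsNorm", 1.563),
    ("casts removed", 1.521),
]

# Performance ratio in percent: GPU time divided by NPU time.
ratios = [round(GPU_TIME / t * 100, 2) for _, t in NPU_STEPS]
# Gain of each step over the previous one, in percentage points.
gains = [round(b - a, 2) for a, b in zip(ratios, ratios[1:])]

for (name, t), ratio in zip(NPU_STEPS, ratios):
    print(f"{name:<22} {t:.3f} s  {ratio:.2f}%")
```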

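8. Release notes

Changes

2026.3.13: Initial release.

2026.3.30: Completed accuracy alignment and performance optimization. Compared with the out-of-the-box state, single-die performance improved from 28.12% to 66.93% of a single H20.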