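ProteinBERT PyTorch Port

This directory is a PyTorch port of the original ProteinBERT repository. The directory layout stays as close to the original repository as possible, but the core model implementation has been moved from Keras/TensorFlow to PyTorch, and inference and fine-tuning support for Ascend NPU (torch_npu) has been added.

This version currently supports:

- Loading the original TensorFlow pkl pretrained weights
- Converting them to a PyTorch state_dict
- Extracting embeddings
- Running inference on CPU or Ascend NPU
- Fine-tuning and evaluating on the ProteinBERT downstream benchmarks

This version does not include a PyTorch reimplementation of the original repository's full UniRef pretraining pipeline.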

Directory layout

protein_bert_pytorch/
├── README.md
├── setup.py
├── bin/
│   ├── convert_tf_to_pytorch
│   ├── env_npu.sh
│   ├── inference_proteinbert_npu
│   └── finetune_proteinbert_npu
├── scripts/
│   ├── demo_scripts/
│   ├── deploy_toolkit/
│   ├── tools/
│   ├── protein_benchmarks -> ../protein_benchmarks
│   └── proteinbert_pytorch -> ../proteinbert
├── protein_benchmarks/
└── proteinbert/
    ├── __init__.py
    ├── model.py
    ├── convert_weights.py
    ├── inference.py
    └── finetune.py

The demo, tools, and deploy scripts that used to be scattered across other directories have been consolidated under scripts/ here.

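Script locations

Test, debugging, and deployment scripts are all collected under:

./scripts/

They fall into three groups:

scripts/demo_scripts/
- End-to-end use cases in the style of the original notebooks
- Includes:
  - demo1_signalP_npu.py
  - demo2_all_benchmarks_npu.py
  - demo3_attention_npu.py
scripts/tools/
- Embedding extraction, layer-by-layer debugging, and CPU/NPU result comparison
- Includes:
  - get_embeddings_npu.py
  - debug_layerwise_npu.py
  - compare_embeddings.py
scripts/deploy_toolkit/
- Standalone scripts geared toward deployment and delivery
- Includes:
  - convert_weights.py
  - inference_npu.py
  - finetune_npu.py
  - env_npu.sh

Compatibility notes:

- scripts/protein_benchmarks is a compatibility link to the top-level protein_benchmarks/
- scripts/proteinbert_pytorch is a compatibility link to the top-level proteinbert/

This lets old script paths and the new repository layout coexist.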
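A quick way to confirm that the two compatibility links resolve as described is sketched below. The helper name `check_compat_links` is hypothetical and not part of the repository; the link names and targets come from the compatibility notes, and the check assumes a filesystem that supports symbolic links.

```python
from pathlib import Path

# Link path (relative to the repository root) -> expected target directory,
# as listed in the compatibility notes.
COMPAT_LINKS = {
    "scripts/protein_benchmarks": "protein_benchmarks",
    "scripts/proteinbert_pytorch": "proteinbert",
}


def check_compat_links(repo_root):
    """Return {link: bool}; True when the link is a symlink resolving to its expected target."""
    root = Path(repo_root).resolve()
    status = {}
    for link, target in COMPAT_LINKS.items():
        link_path = root / link
        status[link] = (
            link_path.is_symlink()
            and link_path.resolve() == (root / target).resolve()
        )
    return status
```

Calling `check_compat_links(".")` from the repository root returns one boolean per link, which makes a broken or missing link easy to spot after copying the tree between machines.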

Environment setup

Clone the repository

git clone https://atomgit.com/AI4Science/proteinbert_pytorch.git
cd proteinbert_pytorch
mkdir ./proteinbert_models

Conda environment

We recommend creating a dedicated PyTorch / NPU environment rather than sharing one with the original TensorFlow version of ProteinBERT.

Recommended commands:

conda create -n proteinbert_npu python=3.10 -y
conda activate proteinbert_npu

Install the dependencies:

pip install torch==2.5.1 torch_npu==2.5.1 numpy==1.26.4 pyyaml \
    pandas scikit-learn h5py scipy
pip install decorator attrs psutil absl-py cloudpickle ml-dtypes tornado

If you only run CPU inference and do not use an Ascend NPU, torch_npu is not required.

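Ascend NPU environment

We recommend using the script bundled with the repository (adjust it to match the CANN version installed in your environment):

source ./bin/env_npu.sh

Alternatively, use the script of the same name under scripts/deploy_toolkit/:

source ./scripts/deploy_toolkit/env_npu.sh

The script automatically:

- Sources /usr/local/Ascend/ascend-toolkit/set_env.sh
- Adds the Ascend driver / CANN runtime library paths
- Sets ASCEND_RT_VISIBLE_DEVICES=0 by default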
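Because torch_npu is optional for CPU-only use, a script may want to fall back to the CPU when the package is absent. The following is a minimal sketch under that assumption; `pick_device` is a hypothetical helper, not a function of this repository, and it only checks whether the torch_npu package is importable, not whether an NPU is actually present or healthy.

```python
import importlib.util
import os


def pick_device(requested="npu:0"):
    """Fall back to "cpu" when an NPU device is requested but torch_npu is not installed."""
    if requested.startswith("npu") and importlib.util.find_spec("torch_npu") is None:
        return "cpu"
    return requested


if __name__ == "__main__":
    # ASCEND_RT_VISIBLE_DEVICES is the variable env_npu.sh sets to 0 by default.
    print("visible NPUs:", os.environ.get("ASCEND_RT_VISIBLE_DEVICES", "<unset>"))
    print("device:", pick_device())
```

The returned string can be passed wherever a device name such as cpu or npu:0 is expected.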

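Weights and dataset locations

Benchmark dataset location

The downstream benchmark data is stored in the original repository's format under:

./protein_benchmarks/

It contains:

- signalP_binary
- fluorescence
- remote_homology
- stability
- scop
- secondary_structure
- disorder_secondary_structure
- ProFET_NP_SP_Cleaved
- PhosphositePTM
- The other benchmark CSVs shipped with the original repository

In other words, this PyTorch directory needs no additional benchmark download; use the ./protein_benchmarks/*.csv files in the repository directly.

scripts/demo_scripts/demo2_all_benchmarks_npu.py supports all 9 benchmarks from the original notebook, covering global classification/regression tasks as well as sequence-output (per-residue) classification and binary classification tasks.

Note, however, that the data directory currently lacks PhosphositePTM.train.csv. The code path for PhosphositePTM is supported, but the benchmark cannot be trained and evaluated with the data as shipped.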
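To see up front which benchmarks can actually run, one option is to scan the data directory for the expected CSV files. The sketch below assumes the `<benchmark>.<split>.csv` naming used by the files mentioned in this README and assumes that a run needs at least the train and test splits; `missing_splits` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# The nine benchmark names listed above.
BENCHMARKS = [
    "signalP_binary", "fluorescence", "remote_homology", "stability", "scop",
    "secondary_structure", "disorder_secondary_structure",
    "ProFET_NP_SP_Cleaved", "PhosphositePTM",
]


def missing_splits(data_dir, benchmarks=BENCHMARKS, splits=("train", "test")):
    """Map each benchmark to the list of splits whose CSV file is absent."""
    data_dir = Path(data_dir)
    missing = {}
    for name in benchmarks:
        absent = [s for s in splits if not (data_dir / f"{name}.{s}.csv").is_file()]
        if absent:
            missing[name] = absent
    return missing
```

Running `missing_splits("./protein_benchmarks")` returns an empty dictionary when every listed benchmark has both files; any benchmark that appears in the result is missing the listed splits.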

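Installation

From inside the repository directory, run:

python setup.py install

For one-off use you can skip installation and make the proteinbert package importable by prefixing the command with PYTHONPATH:

PYTHONPATH=. python ...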

Usage

Convert the original TensorFlow weights

convert_tf_to_pytorch \
    --input ./proteinbert_models/epoch_92400_sample_23500000.pkl \
    --output ./proteinbert_models/proteinbert_pytorch.pt

CPU inference

inference_proteinbert_npu \
    --weights ./proteinbert_models/proteinbert_pytorch.pt \
    --seqs MKTVRQERLKSIVRILERSKEPVSGAQ ACDEFGHIKLMNPQRSTUVWXY \
    --device cpu

NPU inference

inference_proteinbert_npu \
    --weights ./proteinbert_models/proteinbert_pytorch.pt \
    --seqs MKTVRQERLKSIVRILERSKEPVSGAQ ACDEFGHIKLMNPQRSTUVWXY \
    --device npu:0

Benchmark fine-tuning

For example, to fine-tune on signalP_binary:

finetune_proteinbert_npu \
    --weights ./proteinbert_models/proteinbert_pytorch.pt \
    --train-csv ./protein_benchmarks/signalP_binary.train.csv \
    --test-csv ./protein_benchmarks/signalP_binary.test.csv \
    --task binary \
    --device npu:0
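When the same fine-tuning command has to be launched from Python, for instance from a job scheduler, the argument list can be assembled programmatically. This is a sketch that uses only the flags shown in the command above; `build_finetune_cmd` is a hypothetical helper, and the `<benchmark>.train.csv` / `<benchmark>.test.csv` naming is assumed from the signalP_binary example.

```python
import shlex


def build_finetune_cmd(benchmark, task,
                       weights="./proteinbert_models/proteinbert_pytorch.pt",
                       data_dir="./protein_benchmarks", device="npu:0"):
    """Assemble the finetune_proteinbert_npu argument list for one benchmark."""
    return [
        "finetune_proteinbert_npu",
        "--weights", weights,
        "--train-csv", f"{data_dir}/{benchmark}.train.csv",
        "--test-csv", f"{data_dir}/{benchmark}.test.csv",
        "--task", task,
        "--device", device,
    ]


if __name__ == "__main__":
    # "binary" is the only --task value shown in this README;
    # values for other benchmarks are not documented here.
    print(shlex.join(build_finetune_cmd("signalP_binary", "binary")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the package is installed and the command is on the PATH.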

Running the notebook-style benchmark script

Run the full benchmark set (eight of the nine benchmarks can currently complete; PhosphositePTM lacks its training file):

HOME=$PWD TORCH_DEVICE_BACKEND_AUTOLOAD=0 python \
    ./scripts/demo_scripts/demo2_all_benchmarks_npu.py \
    --device npu:0

To run only a subset of the benchmarks:

HOME=$PWD TORCH_DEVICE_BACKEND_AUTOLOAD=0 python \
    ./scripts/demo_scripts/demo2_all_benchmarks_npu.py \
    --device npu:0 \
    --benchmarks signalP_binary fluorescence scop

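Quick code example

import torch

from proteinbert import convert_tf_to_pytorch, tokenize_seqs

seqs = [
    "MKTVRQERLKSIVRILERSKEPVSGAQ",
    "ACDEFGHIKLMNPQRSTUVWXY",
]
seq_len = 512  # sequence length passed to tokenize_seqs

# Build the PyTorch model directly from the original TensorFlow pkl dump.
model, n_annotations = convert_tf_to_pytorch(
    "./proteinbert_models/epoch_92400_sample_23500000.pkl"
)
model = model.to("cpu").eval()

# Token IDs as int64, plus an all-zero annotation vector for each sequence.
tokens = torch.from_numpy(tokenize_seqs(seqs, seq_len)).long()
input_annotations = torch.zeros(len(seqs), n_annotations)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    local_outputs, global_outputs = model(tokens, input_annotations)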
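Before tokenizing, it can help to check that every sequence fits into `seq_len`. The sketch below rests on an assumption carried over from the original ProteinBERT tokenizer, which adds a start and an end token to each sequence before padding; whether this port's `tokenize_seqs` behaves the same way should be verified against proteinbert/ before relying on it. `too_long` is a hypothetical helper.

```python
# Assumption: as in the original ProteinBERT tokenizer, two extra tokens
# (start and end) are added to every sequence before padding to seq_len.
ADDED_TOKENS_PER_SEQ = 2


def too_long(seqs, seq_len):
    """Return the sequences that would not fit into seq_len tokens under the assumption above."""
    return [s for s in seqs if len(s) + ADDED_TOKENS_PER_SEQ > seq_len]
```

With `seq_len = 512`, both example sequences above pass the check, since each is far shorter than 510 residues.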

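Accuracy and validation results

Verified results

The results below were obtained in the current workspace on an Ascend NPU, using the PyTorch model converted from the original TensorFlow weights:

| Item | Result |
| --- | --- |
| Embedding extraction (`get_embeddings_npu.py`) | `local=(3, 512, 1562)`, `global=(3, 15599)` |
| CPU/NPU `seq_probs` difference | `max_abs_diff=0.01190776`, `mean_abs_diff=0.00008779` |
| CPU/NPU `annotations` difference | `max_abs_diff=0.00603402`, `mean_abs_diff=0.00000099` |
| NPU inference performance | `batch=4`, `seq_len=512`, average `0.0048 s`, about `826 seq/s` |
| `demo1_signalP_npu.py` | `AUC=0.995660`, `Accuracy=0.983141` |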
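The `max_abs_diff` and `mean_abs_diff` figures are read here as the largest and the average element-wise absolute difference between the CPU and NPU outputs. The sketch below illustrates that reading on flat sequences of floats; it is not the implementation in compare_embeddings.py, and `abs_diff_stats` is a hypothetical name.

```python
def abs_diff_stats(cpu_values, npu_values):
    """Return (max_abs_diff, mean_abs_diff) for two equal-length flat sequences of floats."""
    if len(cpu_values) != len(npu_values):
        raise ValueError("inputs must have the same length")
    if not cpu_values:
        raise ValueError("inputs must not be empty")
    diffs = [abs(a - b) for a, b in zip(cpu_values, npu_values)]
    return max(diffs), sum(diffs) / len(diffs)
```

For NumPy arrays, flattening both outputs with `.ravel().tolist()` first makes them usable with this helper.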

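Test-set metrics for the eight benchmarks completed so far:

| Benchmark | Metric | Measured NPU result |
| --- | --- | --- |
| `signalP_binary` | AUC | `0.995857` |
| `fluorescence` | Spearman | `0.662879` |
| `remote_homology` | Accuracy | `0.213092` |
| `stability` | Spearman | `0.765755` |
| `scop` | Accuracy | `0.885998` |
| `secondary_structure` | Accuracy | `0.740793` |
| `disorder_secondary_structure` | AUC | `0.872198` |
| `ProFET_NP_SP_Cleaved` | AUC | `0.982882` |

PhosphositePTM has not been completed. This is not caused by a code error; the original benchmark data package itself lacks:

./protein_benchmarks/PhosphositePTM.train.csv

The repository currently contains only:

PhosphositePTM.valid.csv
PhosphositePTM.test.csv

Until the training file is supplied, PhosphositePTM cannot be fine-tuned and evaluated following the original notebook workflow.

The benchmark table above holds the latest results actually measured in the current environment and should be treated as the most direct measured metrics for this directory.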

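Migration reference results

The migration notes also give a set of reference benchmark results for comparison against the TensorFlow/GPU baseline:

| Benchmark | Metric | GPU (TF) | NPU (PyTorch) | Difference (percentage points) |
| --- | --- | --- | --- | --- |
| `signalP_binary` | AUC | 0.9961 | 0.9965 | +0.04% |
| `fluorescence` | Spearman | 0.6475 | 0.6597 | +1.22% |
| `remote_homology` | Accuracy | 22.42% | 21.17% | -1.25% |
| `stability` | Spearman | 0.7068 | 0.7851 | +7.83% |
| `ProFET_NP_SP_Cleaved` | AUC | 0.9855 | 0.9852 | -0.03% |

The classification tasks are close to the TensorFlow baseline overall. The regression tasks differ more, mainly because of convergence differences between the TensorFlow and PyTorch Adam optimizers.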
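The difference values in the comparison table are consistent with an absolute difference (NPU minus GPU) expressed in percentage points, not a relative change. The sketch below reproduces them under that reading; `diff_in_points` is a hypothetical helper used only for this illustration.

```python
def diff_in_points(gpu_tf, npu_pt, already_percent=False):
    """NPU minus GPU in percentage points, rounded to two decimals.

    Set already_percent=True when the inputs are given as percentages
    (as for remote_homology) instead of fractions.
    """
    scale = 1.0 if already_percent else 100.0
    return round((npu_pt - gpu_tf) * scale, 2)


if __name__ == "__main__":
    print(diff_in_points(0.9961, 0.9965))       # signalP_binary
    print(diff_in_points(0.7068, 0.7851))       # stability
    print(diff_in_points(22.42, 21.17, True))   # remote_homology
```

For stability, a relative change would be about 11%, so the +7.83% in the table is best read as 7.83 points of Spearman correlation.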

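Differences from the original TensorFlow repository

- This is a PyTorch version, not a Keras/TensorFlow version
- The pretrained weights still come from the original TensorFlow pkl dump
- Training and inference for the downstream benchmarks have moved to PyTorch / torch_npu
- The original repository's UniRef pretraining pipeline is not fully reimplemented
- Using a separate Conda environment from the original TensorFlow version is recommended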

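License and citation

This PyTorch version is derived from the original ProteinBERT project. When using it, please also follow the license and citation requirements of the original repository.

If you use ProteinBERT, please cite the original paper: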

Brandes, N., Ofer, D., Peleg, Y., Rappoport, N. & Linial, M.
ProteinBERT: A universal deep-learning model of protein sequence and function.
Bioinformatics (2022). https://doi.org/10.1093/bioinformatics/btac020

