vit-face-expression on Ascend NPU

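1. Introduction

This document records the migration, accuracy evaluation, and performance validation of the trpakov/vit-face-expression facial expression recognition model on Ascend NPU (Ascend 910B3).

The model is a ViT (Vision Transformer) fine-tuned on the FER2013 dataset. It supports 7 expressions: angry, disgust, fear, happy, neutral, sad, surprise. The input is a 224×224 face image; the output is a 7-dimensional softmax probability vector.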

2. Validation environment

Component	Version
`torch`	`2.8.0`
`torch_npu`	`2.8.0.post4`
`transformers`	`5.8.1`
`CANN`	`8.5.1`

NPU: 8 × Ascend 910B3
Accuracy baseline: CPU (x86, PyTorch 2.8.0)

3. Deployment and usage

3.1 Environment setup

conda create -n vit-face-expression python=3.11 -y
conda activate vit-face-expression
pip install torch==2.8.0 torch_npu==2.8.0.post4 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install transformers torchvision pillow numpy -i https://pypi.tuna.tsinghua.edu.cn/simple

3.2 Downloading the model weights

HF_ENDPOINT=https://hf-mirror.com huggingface-cli download trpakov/vit-face-expression --local-dir ./vit-face-expression

3.3 Inference

Command line:

python inference.py --image face.jpg --device npu

Python API:

from inference import AgeClassifier
clf = AgeClassifier(model_path="./vit-face-expression", device="npu")
results = clf.predict(["face.jpg"])
# results[0] → {'happy': 0.92, 'surprise': 0.05, ...}

4. Smoke test

python inference.py --image face.jpg --device npu

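Expected output: the 7 expressions with their confidence scores, sorted by descending probability, with no errors.

5. Performance reference

Test conditions: 10 synthetic 224×224 images (fixed seed), batch_size=8.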

Metric	Value
CPU throughput	`3.9` img/s
NPU throughput	`242.9` img/s
NPU speedup over CPU	`62.3`×
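The throughput figures above can be reproduced in spirit with a small timing helper. This is a minimal sketch, not the script used for the measurements: `measure_throughput` and `split_batches` are hypothetical names, and `run_batch` stands for any callable that runs inference on one batch and blocks until the results are available (for example by copying them back to the CPU).

```python
import time


def split_batches(items, batch_size):
    """Split `items` into consecutive batches of at most `batch_size`."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def measure_throughput(run_batch, batches, warmup=1):
    """Return images per second for `run_batch` over `batches`.

    `warmup` untimed passes over the first batch absorb one-time costs
    such as operator compilation on the first NPU call.
    """
    for _ in range(warmup):
        run_batch(batches[0])
    start = time.perf_counter()
    n_images = 0
    for batch in batches:
        run_batch(batch)
        n_images += len(batch)
    elapsed = time.perf_counter() - start
    return n_images / elapsed


# 10 images with batch_size=8 give one batch of 8 and one batch of 2.
batches = split_batches(list(range(10)), 8)

# Speedup from the table: NPU throughput divided by CPU throughput.
speedup = 242.9 / 3.9
print(f"{speedup:.1f}x")  # prints "62.3x"
```

With only 10 images the timed region is short, so repeated runs are likely to vary; treating the numbers as a rough reference rather than a benchmark seems appropriate.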

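6. Accuracy evaluation

6.1 Evaluation method

Run inference on 10 synthetic images on both CPU and NPU, then compare the softmax probability vectors by cosine similarity, MAE, and Top-1 agreement.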
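The three metrics can be sketched in plain Python as below. This is an illustrative version under assumptions, not the evaluation script itself: the function names are hypothetical, each metric is averaged over images, and the probability values in the usage lines are made up for the example.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def mean_absolute_error(p, q):
    """Mean element-wise absolute difference between two vectors."""
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)


def argmax(p):
    """Index of the largest entry (the Top-1 class)."""
    return max(range(len(p)), key=p.__getitem__)


def compare_outputs(cpu_probs, npu_probs):
    """Average the three metrics over paired per-image probability vectors."""
    pairs = list(zip(cpu_probs, npu_probs))
    n = len(pairs)
    return {
        "mean_cosine": sum(cosine_similarity(p, q) for p, q in pairs) / n,
        "mae": sum(mean_absolute_error(p, q) for p, q in pairs) / n,
        "top1_agreement": sum(argmax(p) == argmax(q) for p, q in pairs) / n,
    }


# Made-up 3-class example: two images, near-identical outputs on both devices.
cpu = [[0.70, 0.20, 0.10], [0.05, 0.15, 0.80]]
npu = [[0.69, 0.21, 0.10], [0.05, 0.16, 0.79]]
report = compare_outputs(cpu, npu)
```

Top-1 agreement here only checks that both devices pick the same class; it says nothing about whether that class is the correct label.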

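6.2 Results

Metric	Value
Mean cosine similarity	`0.999978`
MAE	`0.000744`
Accuracy error rate	`0.0022%`
Top-1 agreement (NPU vs. CPU)	`100.0%`

Conclusion: the accuracy error rate is 0.0022%, far below the 1% requirement. PASS.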
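The document does not state how the accuracy error rate is computed. The reported value is consistent with one minus the mean cosine similarity, expressed as a percentage; the check below assumes that definition, which should be confirmed against the evaluation script.

```python
mean_cosine = 0.999978  # reported mean cosine similarity
threshold_pct = 1.0     # pass threshold, in percent

# Assumption: error rate = (1 - mean cosine similarity), in percent.
error_rate_pct = (1.0 - mean_cosine) * 100.0
verdict = "PASS" if error_rate_pct < threshold_pct else "FAIL"
print(f"{error_rate_pct:.4f}% {verdict}")  # prints "0.0022% PASS"
```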

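7. Migration notes

Backbone: ViT-base-patch16-224 (12 layers, 768-dim hidden size); head: 768→7
The model is loaded with AutoModelForImageClassification and moved with model.to("npu:0")
AutoImageProcessor preprocesses on the CPU; the resulting tensors are then moved to the NPU
Output logits pass through softmax and are returned to the CPU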
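The steps above can be sketched as follows. This is a minimal illustration assuming the weights sit in ./vit-face-expression; it is not the contents of inference.py, and `rank_expressions` and `classify` are hypothetical names. The heavy imports live inside `classify` so that the sorting helper can be used on its own.

```python
def rank_expressions(probs, id2label):
    """Pair each probability with its label, sorted by descending probability."""
    labelled = [(id2label[i], float(p)) for i, p in enumerate(probs)]
    return sorted(labelled, key=lambda item: item[1], reverse=True)


def classify(image_paths, model_dir="./vit-face-expression", device="npu:0"):
    """Load the model, preprocess on CPU, infer on `device`, return ranked labels."""
    import torch
    if device.startswith("npu"):
        import torch_npu  # noqa: F401  (importing registers the "npu" device)
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(model_dir)
    model = AutoModelForImageClassification.from_pretrained(model_dir)
    model = model.to(device).eval()

    # Preprocessing happens on the CPU; only the tensors move to the device.
    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}

    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax on the device, then bring the probabilities back to the CPU.
    probs = torch.softmax(logits, dim=-1).cpu()

    return [rank_expressions(row.tolist(), model.config.id2label) for row in probs]
```

Passing device="cpu" to the same function gives the baseline side of the comparison without touching the rest of the code.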

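8. Notes

Frontal face photos work best as input; profile or occluded faces reduce accuracy
The first NPU inference needs a warm-up (operator compilation takes about 3-5 s)
The model directory includes ONNX weights, which can be used for cross-platform deployment
The original test accuracy on FER2013 is 71.16%; this evaluation only verifies that NPU and CPU outputs are consistent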