Facial Emotion Detection (ViT) on Ascend NPU

1. Introduction

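This document records the adaptation and verification results for dima806/facial_emotions_image_detection on an Ascend 910B3 NPU. The model is based on the Vision Transformer (ViT) architecture and performs facial expression recognition across 7 emotion classes: sad, disgust, angry, neutral, fear, surprise, happy.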

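This repository provides:

inference.py: NPU inference script for emotion recognition on a single image or a batch of images
eval.py: accuracy and performance evaluation script that compares NPU and CPU outputs
log.txt: log of the evaluation run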

2. Verification Environment

Component	Version
`torch`	`2.8.0`
`torch_npu`	`2.8.0.post4`
`transformers`	`4.57.6`

NPU: Ascend 910B3, 1 logical device
Model architecture: ViTForImageClassification (12 layers, hidden size 768, 12 attention heads)
Input size: 224x224 RGB image
Number of classes: 7 emotions
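
The architecture figures above determine the tensor shapes the NPU works on. The following is a minimal sketch of that arithmetic. The patch size of 16 is an assumption (it is common for ViT-Base checkpoints but is not stated in this document), and `vit_shapes` is a hypothetical helper, not part of the repository.

```python
# Shape arithmetic for the ViT configuration listed above.
IMAGE_SIZE = 224   # from the input size listed above
HIDDEN_SIZE = 768  # from the model architecture listed above
NUM_HEADS = 12     # from the model architecture listed above
PATCH_SIZE = 16    # ASSUMPTION: not stated in this document


def vit_shapes(image_size, patch_size, hidden_size, num_heads):
    """Return the token count and per-head width for a ViT encoder."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return {
        "num_patches": num_patches,
        "seq_len": num_patches + 1,          # +1 for the [CLS] token
        "head_dim": hidden_size // num_heads,
    }


shapes = vit_shapes(IMAGE_SIZE, PATCH_SIZE, HIDDEN_SIZE, NUM_HEADS)
print(shapes)
```

Under that assumption, each image becomes a fixed-length token sequence, so every layer runs the same matrix multiplications regardless of image content.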

3. Running Inference

Environment setup

pip install torch torch_npu transformers Pillow
export ASCEND_RT_VISIBLE_DEVICES=0
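
After the environment is set up, it can be useful to confirm that PyTorch actually sees the NPU before running inference. The sketch below is one way to do that. `select_device` is a hypothetical helper name; it relies on `torch_npu` registering the `torch.npu` namespace on import and falls back to CPU when the package or device is unavailable.

```python
def select_device():
    """Return "npu:0" if an Ascend NPU is usable, otherwise "cpu"."""
    try:
        import torch
        import torch_npu  # noqa: F401  (importing registers the "npu" backend)

        if torch.npu.is_available():
            return "npu:0"
    except (ImportError, RuntimeError, OSError):
        # torch / torch_npu missing, or the Ascend runtime failed to load.
        pass
    return "cpu"


device = select_device()
print(f"Using device: {device}")
```

With `ASCEND_RT_VISIBLE_DEVICES=0` exported as above, the visible card is expected to appear as `npu:0`, which matches the `device` field in the output format shown later.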

Single-image inference

python inference.py --image path/to/face.jpg

Batch inference

python inference.py --image-dir path/to/images/
python inference.py --image img1.jpg --image img2.jpg --image img3.jpg
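
The commands above pass `--image` one or more times, or a directory via `--image-dir`. A command line of this shape could be parsed as sketched below. This is an illustration only, not the actual code of inference.py; `build_parser`, `collect_image_paths`, and the suffix list are assumptions.

```python
import argparse
from pathlib import Path

# ASSUMPTION: suffixes accepted when scanning a directory.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def build_parser():
    parser = argparse.ArgumentParser(description="Facial emotion inference (sketch)")
    # action="append" lets --image be repeated: --image a.jpg --image b.jpg
    parser.add_argument("--image", action="append", default=[],
                        help="path to an image; may be given multiple times")
    parser.add_argument("--image-dir", default=None,
                        help="directory whose images are all processed")
    return parser


def collect_image_paths(args):
    """Merge explicitly listed images with those found in --image-dir."""
    paths = [Path(p) for p in args.image]
    if args.image_dir:
        paths += sorted(
            p for p in Path(args.image_dir).iterdir()
            if p.suffix.lower() in IMAGE_SUFFIXES
        )
    return paths


example_args = build_parser().parse_args(["--image", "img1.jpg", "--image", "img2.jpg"])
print([str(p) for p in collect_image_paths(example_args)])
```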

Output format

{
  "model": "facial_emotions_image_detection",
  "device": "npu:0",
  "num_images": 1,
  "results": [
    {
      "image": "face.jpg",
      "prediction": "happy",
      "confidence": 0.9821,
      "probabilities": {
        "sad": 0.0001, "disgust": 0.0, "angry": 0.0,
        "neutral": 0.0012, "fear": 0.0, "surprise": 0.0166,
        "happy": 0.9821
      }
    }
  ]
}
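
To show how the `prediction`, `confidence`, and `probabilities` fields relate, here is a minimal sketch that turns a vector of classifier logits into one entry of `results`. The label order, the logit values, and the helper names are assumptions made for illustration; the real label mapping comes from the model configuration, and inference.py may build its output differently.

```python
import json
import math

# ASSUMPTION: label order taken from the list in the introduction;
# the authoritative mapping is the model's id2label configuration.
LABELS = ["sad", "disgust", "angry", "neutral", "fear", "surprise", "happy"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_result(image_name, logits, labels=LABELS):
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "image": image_name,
        "prediction": labels[best],
        "confidence": round(probs[best], 4),  # probability of the top class
        "probabilities": {l: round(p, 4) for l, p in zip(labels, probs)},
    }


# Made-up logits, for illustration only.
example_logits = [-3.0, -6.0, -6.0, -1.0, -6.0, 1.5, 5.5]
result = build_result("face.jpg", example_logits)
print(json.dumps({"num_images": 1, "results": [result]}, indent=2))
```

In this sketch `confidence` is simply the softmax probability of the predicted class, which is consistent with the sample output above, where `confidence` equals the `happy` probability.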

4. Smoke Test

python inference.py --image test_images/happy.jpg

5. Performance Reference

Metric	CPU	NPU
Average inference time	`3.3798 s`	`0.0541 s`
Throughput	`2.07 img/s`	`129.34 img/s`
Speedup	-	`62.47x`
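
The figures in the table are related by simple arithmetic, sketched below. Note that the reported throughput matches the reported times only if each timed pass covers all 7 test images (for example, 7 / 3.3798 s ≈ 2.07 img/s). That reading is inferred from the numbers and is not confirmed by eval.py. The helper names are hypothetical.

```python
NUM_IMAGES = 7        # number of test images reported in the accuracy section
CPU_TIME_S = 3.3798   # average inference time on CPU, from the table
NPU_TIME_S = 0.0541   # average inference time on NPU, from the table


def throughput(num_images, seconds):
    """Images processed per second."""
    return num_images / seconds


def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is."""
    return baseline_seconds / accelerated_seconds


cpu_tp = throughput(NUM_IMAGES, CPU_TIME_S)
npu_tp = throughput(NUM_IMAGES, NPU_TIME_S)
ratio = speedup(CPU_TIME_S, NPU_TIME_S)
print(f"CPU {cpu_tp:.2f} img/s, NPU {npu_tp:.2f} img/s, speedup {ratio:.2f}x")
```

Small differences from the table in the last digits are expected, because the times shown there are already rounded.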

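6. Accuracy Evaluation

Accuracy is evaluated with two metrics: the error in output probabilities and the agreement of predicted labels between NPU and CPU.

Metric	Value
Number of test images	`7`
Maximum absolute probability error	`4.9697e-03`
Mean absolute probability error	`1.0136e-03`
Prediction agreement	`7/7 (100.00%)`
Error rate	`0.00%`
Accuracy requirement (error rate < 1%)	Passed

Conclusion: the NPU and CPU predictions agree on all 7 test images (100%), the maximum absolute probability error is 4.9697e-03, and the accuracy check passes.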
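
The metrics in the table can be computed from two sets of per-image probability vectors, as in the sketch below. This is an illustration, not the code of eval.py: `compare_outputs` is a hypothetical helper, the sample probabilities are made up, and averaging the error over every class of every image is an assumption about how the mean is defined.

```python
def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def compare_outputs(cpu_probs, npu_probs):
    """Compare per-image probability vectors from the CPU and NPU runs."""
    # Element-wise absolute differences across all images and classes.
    abs_errors = [
        abs(c - n)
        for cpu_row, npu_row in zip(cpu_probs, npu_probs)
        for c, n in zip(cpu_row, npu_row)
    ]
    # An image "agrees" when both devices pick the same top class.
    matches = sum(
        argmax(cpu_row) == argmax(npu_row)
        for cpu_row, npu_row in zip(cpu_probs, npu_probs)
    )
    total = len(cpu_probs)
    return {
        "max_abs_error": max(abs_errors),
        "mean_abs_error": sum(abs_errors) / len(abs_errors),
        "agreement": matches / total,
        "error_rate": 1.0 - matches / total,
    }


# Made-up probabilities for two images and three classes.
cpu_example = [[0.70, 0.20, 0.10], [0.10, 0.10, 0.80]]
npu_example = [[0.69, 0.21, 0.10], [0.10, 0.10, 0.80]]
report = compare_outputs(cpu_example, npu_example)
print(report)
print("pass" if report["error_rate"] < 0.01 else "fail")
```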

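7. Notes

Image formats: common formats such as JPG, PNG, and BMP are supported; images are converted to RGB internally.
Preprocessing: ViTImageProcessor performs Resize, Rescale, and Normalize automatically.
Synthetic test images: the images under test_images/ in this repository are programmatically generated and are used only for the NPU/CPU accuracy comparison; they do not reflect the model's real classification ability.
ViT models: compared with CNN models, ViT is expected to benefit more from NPU acceleration because its patch-level computation is highly parallel; the results above do not include a CNN baseline.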
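
The Rescale and Normalize steps mentioned in the preprocessing note can be sketched for a single RGB pixel as follows. The mean and standard deviation of 0.5 per channel are an assumption based on common ViTImageProcessor defaults; the model's own preprocessor configuration is authoritative. The Resize step is omitted because it needs an image library, and `rescale_and_normalize` is a hypothetical helper.

```python
RESCALE_FACTOR = 1 / 255          # maps 0..255 integers to 0..1 floats
IMAGE_MEAN = (0.5, 0.5, 0.5)      # ASSUMPTION: check the preprocessor config
IMAGE_STD = (0.5, 0.5, 0.5)       # ASSUMPTION: check the preprocessor config


def rescale_and_normalize(pixel_rgb):
    """Apply Rescale then Normalize to one (R, G, B) pixel."""
    return tuple(
        (value * RESCALE_FACTOR - mean) / std
        for value, mean, std in zip(pixel_rgb, IMAGE_MEAN, IMAGE_STD)
    )


normalized = rescale_and_normalize((0, 128, 255))
print(normalized)
```

With these assumed statistics, pixel values end up in roughly the range -1 to 1, so a black channel maps near -1 and a fully saturated channel maps near 1.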