zkx_/facial_emotions_image_detection-ascend

facial_emotions_image_detection on Ascend NPU

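1. Introduction

This document records the migration, accuracy evaluation, and performance validation of the facial_emotions_image_detection facial expression recognition model on the Ascend NPU (Ascend 910B3).

The model is based on ViT (Vision Transformer, 12 layers, hidden size 768) and is fine-tuned on a facial expression dataset. It classifies 7 expressions: sad, disgust, angry, neutral, fear, surprise, and happy. The input is a 224×224 RGB face image preprocessed by AutoImageProcessor, and the output is a 7-dimensional softmax probability distribution.

Note: this model has the same architecture (ViT-base-patch16-224) and label space (7 expressions) as trpakov/vit-face-expression, and its weights come from the same source.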

2. Validation environment

Component	Version
`torch`	`2.8.0`
`torch_npu`	`2.8.0.post4`
`transformers`	`5.8.1`
`CANN`	`8.5.1`

NPU: 8 × Ascend 910B3
Accuracy baseline: CPU (x86, PyTorch 2.8.0)

3. Deployment and usage

3.1 Environment setup

conda create -n facial_emotions_image_detection python=3.11 -y
conda activate facial_emotions_image_detection

pip install torch==2.8.0 torch_npu==2.8.0.post4 \
    -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install transformers torchvision pillow numpy \
    -i https://pypi.tuna.tsinghua.edu.cn/simple

3.2 Using the inference script

python inference.py --image face.jpg --device npu
python inference.py --image_dir ./faces/ --device npu

Programmatic interface:

from inference import AgeClassifier
clf = AgeClassifier(model_path="./facial_emotions_image_detection", device="npu")
results = clf.predict(["face.jpg"])
# results[0] → {'happy': 0.92, 'surprise': 0.05, ...}

4. Smoke test

python inference.py --image face.jpg --device npu

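Expected output: the 7 expressions listed in descending order of probability, with the highest-probability expression as the prediction and no runtime errors.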
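
The expected ordering can be sketched as follows. This is a minimal illustration rather than the code of inference.py: the function name `rank_emotions` and the sample probabilities are hypothetical.

```python
def rank_emotions(probs: dict[str, float]) -> list[tuple[str, float]]:
    """Return (label, probability) pairs sorted by descending probability."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical probabilities, for illustration only.
sample = {
    "sad": 0.01, "disgust": 0.0, "angry": 0.01, "neutral": 0.01,
    "fear": 0.0, "surprise": 0.05, "happy": 0.92,
}
ranked = rank_emotions(sample)
top_label, top_prob = ranked[0]  # the first entry is the predicted expression
for label, prob in ranked:
    print(f"{label:10s} {prob:.4f}")
```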

5. Performance reference

Test conditions: 10 synthetic 224×224 images (fixed random seed), batch_size=8, one NPU warm-up pass.

Metric	Value
CPU throughput	`3.7` img/s
NPU throughput	`232.2` img/s
NPU speedup over CPU	`62.8` ×
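
A throughput measurement of this kind can be sketched as below. The helper names `batches` and `measure_throughput` are hypothetical, and `run_batch` stands in for whatever callable runs one batch through the model; it is assumed to block until the results are ready (for example by copying the output back to the CPU), otherwise asynchronous device execution would make the timing too optimistic. The last line reproduces the reported speedup from the two throughput figures.

```python
import time


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def measure_throughput(run_batch, items, batch_size=8, warmup_rounds=1):
    """Return images per second for run_batch over items, after warm-up."""
    for _ in range(warmup_rounds):
        for batch in batches(items, batch_size):
            run_batch(batch)  # untimed warm-up pass
    start = time.perf_counter()
    for batch in batches(items, batch_size):
        run_batch(batch)
    elapsed = time.perf_counter() - start
    return len(items) / elapsed


# Speedup derived from the table values: 232.2 / 3.7
speedup = round(232.2 / 3.7, 1)
```

With 10 images and batch_size=8, the run consists of one batch of 8 and one batch of 2.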

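6. Accuracy evaluation

6.1 Evaluation method

The same 10 synthetic images are run through the model on the CPU and on the NPU, and the two sets of softmax probability vectors are compared by cosine similarity, mean absolute error (MAE), and Top-1 classification agreement.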
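
The three metrics can be sketched in plain Python as follows. This is an illustration under assumptions, not the evaluation script itself: the function names are hypothetical, and MAE is taken here as the mean over all probability entries of all images.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two probability vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_absolute_error(a, b):
    """Mean absolute difference between corresponding entries."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def top1(probs):
    """Index of the highest-probability class."""
    return max(range(len(probs)), key=probs.__getitem__)


def compare(cpu_probs, npu_probs):
    """Aggregate the three metrics over paired CPU/NPU outputs."""
    n = len(cpu_probs)
    pairs = list(zip(cpu_probs, npu_probs))
    return {
        "mean_cosine": sum(cosine_similarity(c, d) for c, d in pairs) / n,
        "mae": sum(mean_absolute_error(c, d) for c, d in pairs) / n,
        "top1_agreement": sum(top1(c) == top1(d) for c, d in pairs) / n,
    }
```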

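6.2 Evaluation results

Metric	Value
Mean cosine similarity	`0.999983`
MAE	`0.000444`
Accuracy error rate	`0.0017%`
Top-1 agreement (CPU vs. NPU)	`100.0%`

Conclusion: the accuracy error rate of 0.0017% (consistent with 1 − mean cosine similarity) is far below the 1% requirement, so the evaluation passes.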

7. Migration notes

7.1 Model structure

Backbone: ViT-base-patch16-224 (12 Transformer layers, hidden size 768)
Head: linear layer (768 → 7), 7-class softmax
Input: 224×224 RGB, 16×16 patches
Parameters: 85.8M
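
The 85.8M figure can be cross-checked with a short calculation. The sketch below assumes standard ViT-base hyperparameters that are not stated in this document (MLP width 3072, biases on all linear layers, a class token, learned position embeddings, a final LayerNorm, and no pooler layer); the function name is hypothetical.

```python
def vit_classifier_params(hidden=768, layers=12, mlp_dim=3072,
                          image=224, patch=16, channels=3, num_labels=7):
    """Count parameters of a ViT encoder with a linear classification head."""
    tokens = (image // patch) ** 2 + 1            # 196 patches + 1 class token
    patch_embed = patch * patch * channels * hidden + hidden
    embeddings = patch_embed + hidden + tokens * hidden  # + class token + positions
    attention = 4 * (hidden * hidden + hidden)    # query, key, value, output projections
    mlp = (hidden * mlp_dim + mlp_dim) + (mlp_dim * hidden + hidden)
    layer_norms = 2 * 2 * hidden                  # two LayerNorms per layer (weight + bias)
    encoder = layers * (attention + mlp + layer_norms)
    final_norm = 2 * hidden
    head = hidden * num_labels + num_labels       # 768 -> 7 linear layer
    return embeddings + encoder + final_norm + head


total = vit_classifier_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the count rounds to the 85.8M listed above.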

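7.2 Adaptation points

Load the model with AutoModelForImageClassification.from_pretrained() and move it with model.to("npu:0").
AutoImageProcessor performs preprocessing (resize + normalize) on the CPU.
Input tensors are moved with .to("npu:0"); after softmax, the output is returned via .cpu().numpy().
The architecture is the same as vit-face-expression, so the adaptation code can be shared directly (apart from the label order noted in section 8).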
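
The points above can be put together roughly as follows. This is a sketch under assumptions, not the contents of inference.py: the names `pick_device` and `predict` are hypothetical, and the fallback to CPU when no NPU is present is a convenience added here. The heavy imports are kept inside `predict` so the snippet can be imported on a machine without torch or torch_npu.

```python
def pick_device(requested: str, npu_available: bool) -> str:
    """Fall back to CPU when an NPU is requested but not available."""
    if requested.startswith("npu") and not npu_available:
        return "cpu"
    return requested


def predict(image_paths, model_path="./facial_emotions_image_detection",
            device="npu:0"):
    """Return one {label: probability} dict per input image."""
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    try:
        import torch_npu  # noqa: F401  (registers the "npu" device with PyTorch)
        npu_available = torch.npu.is_available()
    except ImportError:
        npu_available = False
    device = pick_device(device, npu_available)

    processor = AutoImageProcessor.from_pretrained(model_path)
    model = AutoModelForImageClassification.from_pretrained(model_path)
    model = model.to(device).eval()

    # Preprocessing (resize + normalize) runs on the CPU; tensors move afterwards.
    images = [Image.open(path).convert("RGB") for path in image_paths]
    inputs = processor(images=images, return_tensors="pt").to(device)

    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).cpu().numpy()

    id2label = model.config.id2label
    return [{id2label[i]: float(p) for i, p in enumerate(row)} for row in probs]
```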

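8. Notes

Input format: images should be RGB frontal face photos; profile views, occlusion, and low resolution reduce recognition accuracy.
NPU warm-up: operator compilation on the first inference takes about 3-5 seconds, so run a warm-up pass before use.
Label order: this model uses [sad, disgust, angry, neutral, fear, surprise, happy], which differs from the [angry, disgust, fear, happy, neutral, sad, surprise] order of vit-face-expression.
ViT architecture: 85.8M parameters; keep batch_size at 8 or below to avoid NPU out-of-memory errors.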
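
Because the two models order their labels differently, code that consumes raw probability vectors from both needs a remapping step. A minimal sketch, using the two label orders listed above; the helper name `reorder` and the sample vector are hypothetical.

```python
THIS_MODEL = ["sad", "disgust", "angry", "neutral", "fear", "surprise", "happy"]
VIT_FACE_EXPRESSION = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def reorder(probs, source=THIS_MODEL, target=VIT_FACE_EXPRESSION):
    """Rearrange a probability vector from the source label order to the target order."""
    by_label = dict(zip(source, probs))
    return [by_label[label] for label in target]


# Hypothetical output in this model's label order.
probs = [0.01, 0.02, 0.03, 0.04, 0.05, 0.15, 0.70]
reordered = reorder(probs)
```

Looking labels up by name, for instance through the model's id2label mapping, avoids the problem entirely; the remapping is only needed when index positions are compared directly.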