
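meta-llama/Llama-3.2-1B Ascend NPU Adaptation Report

Model Overview

Llama 3.2 is a new generation of open large language models released by Meta, optimized for edge and mobile devices. The 1B variant is the lightest model in the family, with roughly 1.24 billion parameters, and is suited to fast inference and deployment in resource-constrained environments.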

Attribute	Description
Model name	meta-llama/Llama-3.2-1B
Publisher	Meta AI
Parameters	1.24B
Context length	128K tokens
Model type	Decoder-only Transformer
Vocabulary size	128K (tiktoken-based)
Supported precisions	fp32 / bf16 / fp16
Official license	Llama 3.2 License

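Note: meta-llama/Llama-3.2-1B is a gated repository on Hugging Face and can only be accessed after an access request has been approved. This report was validated with unsloth/Llama-3.2-1B, an ungated mirror with identical weights.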
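The choice between the gated repository and the mirror can be made in one place. The sketch below is a minimal illustration, not part of the report's scripts: `resolve_model_id` is a hypothetical helper, and it assumes that a configured `HF_TOKEN` environment variable means access to the gated repository has already been granted, which a token alone does not guarantee.

```python
import os

GATED_REPO = "meta-llama/Llama-3.2-1B"
UNGATED_MIRROR = "unsloth/Llama-3.2-1B"


def resolve_model_id(env=None):
    """Return the gated repo when a Hugging Face token is set, else the mirror.

    Assumption: a non-empty HF_TOKEN implies the access request was approved.
    """
    env = os.environ if env is None else env
    return GATED_REPO if env.get("HF_TOKEN") else UNGATED_MIRROR


print(resolve_model_id())
```

If the token exists but access was never approved, loading still fails with the `403 Client Error`, and the mirror remains the fallback.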

Hardware and Software Environment

Test Environment

Item	Configuration
Server model	Atlas 800T A2
NPU model	Ascend910B4
NPU count	1 card
HBM capacity	32 GB
CPU	ARM Kunpeng 920
Operating system	EulerOS 2.0 (aarch64)
CANN version	8.5.1
Python	3.11.14
PyTorch	2.9.0+cpu
torch_npu	2.9.0.post1+gitee7ba04
transformers	4.51.0
accelerate	1.6.0

Environment Setup

# 1. Load the CANN environment (required)
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# 2. Configure the Huawei Cloud PyPI mirror and the Hugging Face mirror endpoint
export PIP_INDEX_URL=https://repo.huaweicloud.com/repository/pypi/simple/
export HF_ENDPOINT=https://hf-mirror.com

# 3. Install dependencies
pip install transformers accelerate sentencepiece protobuf numpy==1.26.4

# 4. Verify that the NPU is available
python3 -c "import torch; import torch_npu; a = torch.randn(3,4).npu(); print(a + a)"
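Before running the NPU smoke test, it can help to confirm that the packages and environment variables are in place. The following is a minimal sketch using only the standard library. `check_environment` is a hypothetical helper, and treating `ASCEND_TOOLKIT_HOME` as the sign that `set_env.sh` was sourced is an assumption about what that script exports.

```python
import importlib.util
import os

REQUIRED_MODULES = ("torch", "torch_npu", "transformers", "accelerate")


def check_environment(env=None):
    """Report which required packages are importable and whether CANN looks loaded."""
    env = os.environ if env is None else env
    report = {
        name: importlib.util.find_spec(name) is not None
        for name in REQUIRED_MODULES
    }
    # Assumption: set_env.sh exports ASCEND_TOOLKIT_HOME when it is sourced.
    report["cann_env_loaded"] = bool(env.get("ASCEND_TOOLKIT_HOME"))
    report["hf_endpoint"] = env.get("HF_ENDPOINT", "(default)")
    return report


for key, value in check_environment().items():
    print(f"{key}: {value}")
```

A `False` entry for `torch_npu` or `cann_env_loaded` points back to steps 1 and 3 above.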

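Model Adaptation

Adaptation Approach

The model is loaded through the transformers framework, and no changes to the model source code are required. The core adaptation is handled by torch_npu.contrib.transfer_to_npu, which automatically maps CUDA API calls to their NPU equivalents.

Key Adaptation Steps

1. Inject automatic migration (at the very top of the script)

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu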

transfer_to_npu automatically performs the following mappings:

torch.cuda.is_available() → returns True (when an NPU is available)
torch.device('cuda') → torch.device('npu')
Tensor.cuda() / Module.cuda() → .npu()
torch.cuda.amp.* → torch.npu.amp.*
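The mappings above follow a general pattern: wrap an existing entry point so that CUDA-style arguments are rewritten before the original function runs. The toy sketch below illustrates that pattern in pure Python. It is not torch_npu's implementation, and `FakeBackend`, `to_npu_device`, and `patch_backend` are hypothetical names used only for illustration.

```python
def to_npu_device(device: str) -> str:
    """Rewrite 'cuda' or 'cuda:N' to 'npu' or 'npu:N'; leave other devices alone."""
    kind, sep, index = device.partition(":")
    return f"npu{sep}{index}" if kind == "cuda" else device


class FakeBackend:
    """Stand-in for a framework module exposing a device constructor."""

    def make_device(self, device: str) -> str:
        return f"device({device})"


def patch_backend(backend: FakeBackend) -> None:
    """Replace make_device with a wrapper that redirects CUDA requests to the NPU."""
    original = backend.make_device

    def patched(device: str) -> str:
        return original(to_npu_device(device))

    backend.make_device = patched


backend = FakeBackend()
patch_backend(backend)
print(backend.make_device("cuda:0"))
print(backend.make_device("cpu"))
```

Because the wrapper is installed once at import time, code written against the CUDA names keeps working unchanged, which is why the injection has to happen at the top of the entry script.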

2. Device detection

# Explicitly initialize the NPU (more reliable than torch.cuda.is_available())
try:
    torch.npu.init()
    npu_available = True
except Exception:
    npu_available = False

device = torch.device("cuda" if npu_available else "cpu")
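For scripts that must also run on machines without torch_npu installed, the detection can be guarded at import time. This is a minimal sketch under that assumption. `pick_device_name` is a hypothetical helper that returns a plain device string and does not rely on `transfer_to_npu`.

```python
def pick_device_name() -> str:
    """Return 'npu:0' when torch_npu is importable and an NPU is usable, else 'cpu'."""
    try:
        import torch
        import torch_npu  # noqa: F401  (registers the torch.npu namespace)
    except ImportError:
        return "cpu"
    try:
        return "npu:0" if torch.npu.is_available() else "cpu"
    except Exception:
        # Driver or runtime problems surface here; fall back to the CPU.
        return "cpu"


print(pick_device_name())
```

The returned string can be passed to `torch.device(...)` or `model.to(...)` directly.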

3. Model loading

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "unsloth/Llama-3.2-1B"  # or meta-llama/Llama-3.2-1B (requires access approval)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Note: older transformers releases expect torch_dtype= instead of dtype=
    dtype=torch.bfloat16,  # Ascend910 supports bf16 natively
    trust_remote_code=True,
)
model = model.to(device)
model.eval()

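4. Inference (mixed precision)

prompt = "The future of artificial intelligence is"  # example prompt from the samples in this report
torch.manual_seed(42)
inputs = tokenizer(prompt, return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    # Pass device_type="cuda"; transfer_to_npu maps it to npu internally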
    with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=0.8,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
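To make the sampling parameters in the call above concrete, here is a toy, pure-Python sketch of one decoding step that applies temperature, top-k, and top-p in that order. It illustrates the general technique rather than the transformers implementation. `sample_next_token` is a hypothetical helper, and the tie-breaking and ordering details are assumptions.

```python
import math
import random


def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, rng=random):
    """Sample one token id from raw logits using temperature, top-k and top-p."""
    # 1. Temperature: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # 2. Top-k: keep the k highest-scoring token ids, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept_ids, kept_probs, cumulative = [], [], 0.0
    for token_id, p in zip(order, probs):
        kept_ids.append(token_id)
        kept_probs.append(p)
        cumulative += p
        if cumulative >= top_p:
            break
    # 5. Draw from the renormalized remainder (choices normalizes the weights).
    return rng.choices(kept_ids, weights=kept_probs, k=1)[0]


print(sample_next_token([2.0, 1.0, 0.5, -1.0]))
```

With `do_sample=False` none of these steps apply and the highest-scoring token is always chosen, which is the setting used for the precision check.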

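Adaptation Checklist

Check item	Status	Notes
transfer_to_npu injection	Pass	Injected at the top of the entry script
Device detection	Pass	torch.npu.init() + torch.device("cuda")
get_device_properties.major	No action needed	The model does not branch on CUDA compute capability
torch.cuda.amp / autocast	Pass	Mapped automatically by transfer_to_npu
CUDA kernel / .cu files	No action needed	transformers implements the model in pure Python
DP/DDP changes	No action needed	Single-card inference
Third-party libraries such as flash_attn	No action needed	Uses the native transformers attention implementation
Precision verification	Pass	Greedy fp32 and bf16 token sequences match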

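Precision Verification

Verification Method

The check uses a greedy-decoding comparison: with the same prompt, the same random seed, and sampling disabled (do_sample=False), the model is loaded once in torch.float32 and once in torch.bfloat16, and the generated token sequences are compared for an exact match.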

python3 llama32_npu_greedy_verify.py
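The comparison step itself is small. The sketch below shows one way to compare two token ID sequences and report where they first diverge. The verification script's actual code is not shown in this report, so `first_mismatch` is a hypothetical helper, and the sample IDs are only the leading values quoted in the results.

```python
def first_mismatch(reference_ids, candidate_ids):
    """Return the index of the first differing position, or None if identical."""
    for position, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return position
    if len(reference_ids) != len(candidate_ids):
        # One sequence is a strict prefix of the other.
        return min(len(reference_ids), len(candidate_ids))
    return None


fp32_ids = [128000, 791, 3938]
bf16_ids = [128000, 791, 3938]
mismatch = first_mismatch(fp32_ids, bf16_ids)
print("identical" if mismatch is None else f"diverges at position {mismatch}")
```

Reporting the first divergent position, rather than a plain boolean, makes it easier to tell an early numerical drift from a late one.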

Verification Results

Item	fp32	bf16	Match
Prompt	The future of artificial intelligence is	-	-
max_new_tokens	20	20	-
Generated text	The future of artificial intelligence is here...	The future of artificial intelligence is here...	Match
Token IDs	[128000, 791, 3938, ...]	[128000, 791, 3938, ...]	Identical at every position

Conclusion: On the NPU, bf16 inference produced exactly the same 20 tokens as the fp32 baseline under greedy decoding, so no precision loss was observed in this test.

Sampling Inference Examples

[Prompt] The future of artificial intelligence is
[Output] The future of artificial intelligence is here: The new artificial intelligence system uses the power of music to predict how you will respond to information...

[Prompt] Once upon a time in a distant galaxy
[Output] Once upon a time in a distant galaxy, there was a great warrior. He was known for his courage, strength, and skill...

[Prompt] The key to success is
[Output] The key to success is to choose the right kind of marketing campaign that will effectively promote your products...

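Performance Testing

Test Method

End-to-end generation latency is measured with time.perf_counter(), and torch.npu.synchronize() is called before the timer is read so that all NPU work has finished. Each configuration is run 3 times and the results are averaged.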

python3 llama32_npu_perf_test.py
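The timing method described above can be sketched as a small harness. This is an illustration under stated assumptions, not the benchmark script itself. `measure_latency` is a hypothetical helper, the `synchronize` argument stands in for `torch.npu.synchronize`, and the warm-up run is an addition that the report does not mention.

```python
import time


def measure_latency(generate, synchronize=lambda: None, runs=3, warmup=1):
    """Average wall-clock seconds per call, synchronizing the device around each run."""
    for _ in range(warmup):
        generate()
        synchronize()
    timings = []
    for _ in range(runs):
        synchronize()  # make sure no earlier work is still pending
        start = time.perf_counter()
        generate()
        synchronize()  # wait for the device before stopping the clock
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


# Stand-in workload; on the NPU this would wrap model.generate(...).
average = measure_latency(lambda: time.sleep(0.01))
print(f"{average:.4f} s")
```

Without the synchronization call, asynchronous kernel launches could return before the device has finished, and the measured latency would be too low.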

Performance Data (bf16, greedy decoding)

Batch Size	Prompt Length	Generated Tokens	Average Latency	Throughput	Per-Token Latency
1	32	128	4.063s	31.50 tokens/s	31.74 ms
1	128	128	4.120s	31.07 tokens/s	32.19 ms
1	512	128	4.153s	30.82 tokens/s	32.44 ms
1	1024	128	4.067s	31.48 tokens/s	31.77 ms
4	32	128	4.136s	123.79 tokens/s	32.31 ms
4	128	128	4.134s	123.84 tokens/s	32.30 ms
8	32	128	4.391s	233.19 tokens/s	34.31 ms
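The derived columns follow from the measured latency: throughput is batch size times generated tokens divided by latency, and per-token latency is latency divided by generated tokens. The sketch below recomputes them for two rows. The function names are illustrative, and values recomputed from the rounded latencies can differ from the table in the last digit.

```python
def throughput_tokens_per_s(batch_size, generated_tokens, latency_s):
    """Tokens generated per second across the whole batch."""
    return batch_size * generated_tokens / latency_s


def per_token_latency_ms(generated_tokens, latency_s):
    """Average time per decoding step, in milliseconds."""
    return latency_s / generated_tokens * 1000.0


# (batch size, generated tokens, average latency in seconds) from the table
rows = [(1, 128, 4.063), (4, 128, 4.136)]
for batch, tokens, latency in rows:
    tps = throughput_tokens_per_s(batch, tokens, latency)
    ms = per_token_latency_ms(tokens, latency)
    print(f"batch={batch}: {tps:.2f} tokens/s, {ms:.2f} ms/token")
```

Per-token latency here includes the prefill cost spread over the generated tokens, since the measurement is end to end.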

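Key Findings

Single-card throughput: about 31.5 tokens/s at batch=1, rising to 233 tokens/s at batch=8
Latency stability: across prompt lengths from 32 to 1024, per-token latency at batch=1 stays within roughly 31 to 33 ms
Memory footprint: single-card bf16 inference uses about 2.3 GB of HBM, far below the 32 GB available on the Ascend910B4
Memory headroom: plenty of HBM remains at batch=8, so batch size or sequence length can be increased further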
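A back-of-the-envelope estimate helps explain the memory figures. Weights in bf16 take 2 bytes per parameter, and the KV cache grows linearly with batch size and sequence length. In the sketch below the parameter count comes from this report, while the layer count, KV head count, and head dimension are assumed values that should be confirmed against the model's config.json. Activations and runtime overhead are ignored.

```python
GIB = 1024 ** 3


def weight_bytes(num_params, bytes_per_param=2):
    """Memory for the weights alone (bf16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param


def kv_cache_bytes(batch_size, seq_len, num_layers=16, num_kv_heads=8,
                   head_dim=64, bytes_per_value=2):
    """KV cache size; the architecture defaults are assumptions, not from the report."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * batch_size * seq_len


weights = weight_bytes(1.24e9)
cache = kv_cache_bytes(batch_size=8, seq_len=32 + 128)
print(f"weights: {weights / GIB:.2f} GiB")
print(f"KV cache at batch=8, 160 tokens: {cache / GIB:.3f} GiB")
```

Under these assumptions the weights dominate at the tested sizes, which is consistent with the roughly 2.3 GB footprint reported above.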

Quick Start

Run inference

source /usr/local/Ascend/ascend-toolkit/set_env.sh
export HF_ENDPOINT=https://hf-mirror.com
python3 llama32_npu_inference.py

Run precision verification

python3 llama32_npu_greedy_verify.py

Run the performance benchmark

python3 llama32_npu_perf_test.py

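Files

File	Description
`llama32_npu_inference.py`	Main NPU inference script, including sampling-based generation examples
`llama32_npu_greedy_verify.py`	Precision verification script (fp32 vs bf16 greedy comparison)
`llama32_npu_perf_test.py`	Performance benchmark script
`llama32_cpu_baseline.py`	CPU baseline inference script
`llama32_npu_perf_results.json`	Raw performance test data (JSON)
`README.md`	This report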

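FAQ

Issue	Cause	Solution
`403 Client Error`	Unauthorized access to a gated repo	Use `unsloth/Llama-3.2-1B`, or request access on Hugging Face
`Invalid device ID`	`ASCEND_RT_VISIBLE_DEVICES` is set incorrectly	Export it before the Python process starts, or leave it unset
`torch_dtype is deprecated`	transformers version warning	Change `torch_dtype=` to `dtype=`
Double-precision downgrade warning	Ascend910 does not support fp64	No action needed; torch_npu downgrades to fp32 automatically
`transfer_to_npu` disables JIT	torch_npu does not currently support `torch.jit.script`	If JIT is required, drop `transfer_to_npu` and call `.npu()` manually
`ModuleNotFoundError: decorator`	Missing torch_npu runtime dependencies	`pip install decorator attrs psutil`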
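For the `Invalid device ID` case, the variable has to be in the environment before the Python process that uses the NPU starts. One way to guarantee that from Python is to launch the workload as a child process with the variable already set. This is a minimal sketch: `run_with_visible_devices` is a hypothetical helper, and the inline `-c` command is a placeholder for one of the report's scripts.

```python
import os
import subprocess
import sys


def run_with_visible_devices(script_args, devices="0"):
    """Run a Python child process with ASCEND_RT_VISIBLE_DEVICES set in advance."""
    env = dict(os.environ, ASCEND_RT_VISIBLE_DEVICES=devices)
    return subprocess.run(
        [sys.executable, *script_args],
        env=env,
        capture_output=True,
        text=True,
        check=False,
    )


# Placeholder child command; replace with e.g. ["llama32_npu_inference.py"].
result = run_with_visible_devices(
    ["-c", "import os; print(os.environ['ASCEND_RT_VISIBLE_DEVICES'])"],
    devices="0",
)
print(result.stdout.strip())
```

Setting the variable inside an already running process, after the NPU runtime has been initialized, is the situation the table entry warns about.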

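License

This report and the related adaptation materials follow the same license as the model itself. The meta-llama/Llama-3.2-1B model is distributed under the Llama 3.2 License.

The adaptation script code (llama32_npu_inference.py and the other scripts) is open-sourced under the MIT License.

Contributing and Feedback

Questions and suggestions are welcome through the following channels:

GitCode Issue: Ascend-SACT/Llama-3.2-1B
Huawei Ascend community: Ascend Forum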