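Monad is a lightweight large language model based on the Llama architecture, with about 56M parameters. It supports text generation and dialogue tasks and is suited to low-resource deployment.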

Directory layout:
Monad-ascend/
├── inference.py # Inference test script
├── log.txt # Test log
├── README.md # This document
├── test_prompt.txt # Test prompt
├── inference_result.json # Inference results
└── precision_result.json # Precision test results

Enter the container:
docker exec -it test-modelagent bash

Load the Ascend toolkit environment:
source /usr/local/Ascend/ascend-toolkit/set_env.sh

The model files are located in /data/ysws/agentsp/5-16/Monad/PleIAs/Monad/.

Install the dependencies:
pip install transformers torch_npu -i https://pypi.huaweicloud.com/repository/pypi/simple/

Run the inference script for text generation:
cd /data/ysws/agentsp/5-16/Monad-ascend/
python3 inference.py
python3 inference.py --mode inference

Run the precision comparison test:
cd /data/ysws/agentsp/5-16/Monad-ascend/
python3 inference.py --mode precision_test

| Parameter | Description | Default |
|---|---|---|
| --mode | Test mode: all, inference, or precision_test | all |

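The `--mode` option in the table could be handled with `argparse` along the following lines. This is a minimal sketch, not the actual contents of inference.py: the helper names `build_parser` and `selected_steps` are hypothetical, and the assumption that `all` runs the inference test followed by the precision test is inferred from the order of the sections in the test log.

```python
import argparse


def build_parser():
    """Build a parser exposing the single --mode option described above."""
    parser = argparse.ArgumentParser(description="Monad NPU test (sketch)")
    parser.add_argument(
        "--mode",
        choices=["all", "inference", "precision_test"],
        default="all",
        help="which test to run (default: all)",
    )
    return parser


def selected_steps(mode):
    """Map a --mode value to the test steps it would run (assumed order)."""
    if mode == "all":
        return ["inference", "precision_test"]
    return [mode]


# Parse an explicit argument list instead of sys.argv to keep the example reproducible.
args = build_parser().parse_args(["--mode", "precision_test"])
print(selected_steps(args.mode))
```

Restricting the option with `choices` means an unsupported mode is rejected by the parser before any model is loaded.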

| Metric | Measured | Threshold | Status |
|---|---|---|---|
| Cosine similarity | 0.999928 | > 0.99 | PASS |
| 1 - cosine similarity | 0.0072% | < 1.00% | PASS |
| CPU inference time | 1.011s | - | - |
| NPU inference time | 0.605s | - | - |
| Speedup | 1.67x | > 1x | PASS |
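
The metrics in this table can be derived from two output vectors and two timings. The following is a minimal sketch in plain Python, assuming the CPU and NPU outputs have already been flattened into equal-length lists of floats. The function names are illustrative and are not taken from inference.py; only the 0.99 threshold and the timings come from the table.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def precision_report(cpu_out, npu_out, cpu_time, npu_time, min_cosine=0.99):
    """Summarize agreement and speedup between a CPU run and an NPU run."""
    cos = cosine_similarity(cpu_out, npu_out)
    return {
        "cosine": cos,
        "one_minus_cosine_pct": (1.0 - cos) * 100.0,
        "speedup": cpu_time / npu_time,
        "status": "PASS" if cos > min_cosine else "FAIL",
    }


# Toy vectors stand in for real logits; the timings are the ones reported above.
report = precision_report([1.0, 2.0, 3.0], [1.0, 2.0, 3.01], 1.011, 0.605)
print(f"{report['speedup']:.2f}x", report["status"])
```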
Input prompt: "Hello, I am a language model and"
Generated text:
Hello, I am a language model and a language model, and I am a language model, and I am a language model, and I

Test log:
Monad NPU Test
Model: PleIAs/Monad (Llama-based LLM)
Output: /data/ysws/agentsp/5-16/Monad-ascend
============================================================
Inference Test (NPU)
============================================================
Device: npu:0
Loading model and tokenizer...
Model loaded successfully
Prompt: Hello, I am a language model and
Input tokens: 11
Inference time: 2.634s
Generated text: Hello, I am a language model and a language model, and I am a language model, and I am a language model, and I
============================================================
Precision Test (CPU vs NPU)
============================================================
NPU Device: npu:0
Loading model...
Input tokens: 11
Running on CPU...
Running on NPU...
CPU inference time: 1.011s
NPU inference time: 0.605s
Speedup: 1.67x
Cosine similarity: 0.999928
1 - cosine similarity: 0.0072% (threshold: 1.0%)
Status: PASS
============================================================
Precision Test Result: PASS (NOTE: bfloat16)
============================================================
============================================================
Test Complete!
============================================================

Minimal inference example:
import torch
import torch_npu  # registers the "npu" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "/data/ysws/agentsp/5-16/Monad/PleIAs/Monad"

# Load the tokenizer and the bfloat16 model directly onto the NPU
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.bfloat16,
    device_map="npu:0",
)
model.eval()

# Tokenize the prompt and move the tensors to the same device as the model
prompt = "Hello, I am a language model and"
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to("npu:0") for k, v in inputs.items()}

# Greedy decoding of up to 20 new tokens, with gradients disabled
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

| Component | Description |
|---|---|
| model.embed_tokens | Token embedding layer |
| model.layers | 64 Transformer decoder layers |
| model.norm | RMS normalization |
| lm_head | Language model head |
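
As a rough size check, the components in the table above can be turned into a parameter estimate. The sketch below is a back-of-the-envelope count, not a reading of the actual checkpoint. It uses the config.json values (hidden_size 256, intermediate_size 768, 64 layers, vocab_size 8192) and assumes a standard Llama block without biases, as many key/value heads as attention heads, and an lm_head tied to the embedding. None of these assumptions is confirmed by the source, and the function name is hypothetical.

```python
def estimate_llama_params(hidden_size, intermediate_size, num_layers,
                          vocab_size, tie_embeddings=True):
    """Rough parameter count for a Llama-style decoder under the stated assumptions."""
    attention = 4 * hidden_size * hidden_size   # q, k, v, o projections
    mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
    norms = 2 * hidden_size                     # two RMSNorm weights per layer
    per_layer = attention + mlp + norms

    embedding = vocab_size * hidden_size        # model.embed_tokens
    final_norm = hidden_size                    # model.norm
    # With tied embeddings, lm_head reuses the embedding matrix.
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size

    return num_layers * per_layer + embedding + final_norm + lm_head


total = estimate_llama_params(256, 768, 64, 8192)
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the estimate comes to roughly 56.7M parameters, nearly all of them in the 64 decoder layers.
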
Key parameters extracted from config.json:
{
"hidden_size": 256,
"intermediate_size": 768,
"num_attention_heads": 4,
"num_hidden_layers": 64,
"vocab_size": 8192,
"torch_dtype": "bfloat16"
}

A: In the precision test, the NPU ran about 1.67x faster than the CPU (0.605s vs. 1.011s). Using bfloat16 also contributes to faster computation.
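A: Try adjusting generation parameters such as max_new_tokens, temperature, and top_p to improve output quality. Note that temperature and top_p only take effect when sampling is enabled (do_sample=True).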
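
One way to organize these parameters is a small helper that builds the keyword arguments for `model.generate`. This is a sketch only: the helper name `build_generation_kwargs` is hypothetical, and the sampling values (temperature 0.7, top_p 0.9, 64 new tokens) are illustrative starting points rather than settings validated on Monad.

```python
def build_generation_kwargs(sample=True, max_new_tokens=64,
                            temperature=0.7, top_p=0.9):
    """Return keyword arguments for model.generate().

    temperature and top_p are included only when sampling is enabled,
    because greedy decoding ignores them.
    """
    kwargs = {"max_new_tokens": max_new_tokens, "do_sample": sample}
    if sample:
        kwargs["temperature"] = temperature
        kwargs["top_p"] = top_p
    return kwargs


# Greedy settings matching the inference example in this README.
greedy = build_generation_kwargs(sample=False, max_new_tokens=20)
# Sampling settings with the illustrative defaults.
sampled = build_generation_kwargs()
print(greedy)
print(sampled)

# Usage with the model and inputs from the inference example:
# outputs = model.generate(**inputs, **sampled)
```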

This project is licensed under the Apache-2.0 license.