Lucy：基于 1.7B 模型的移动端边缘智能体网页搜索 — NPU 适配版

作者： Alan Dao、Bach Vu Dinh、Alex Nguyen、Norapat Buppodom

本仓库已在华为昇腾 Ascend910B NPU 上完成适配与验证，采用vLLM-Ascend推理引擎，无需修改代码即可直接部署。

📖 概述

Lucy 是一款轻量级 1.7B 模型，专注于智能体网页搜索（Agentic Web Search） 与轻量浏览。它基于 Qwen3-1.7B 架构构建，继承了大模型的深度研究能力，同时针对移动设备进行了优化，即便在纯 CPU 环境下也能高效运行。

核心技术包括：机器生成的任务向量（Machine-generated Task Vectors）优化思考流程、多类别平滑奖励函数以及纯强化学习（无监督微调）。

🚀 核心能力

🔍 智能体搜索：由 MCP 工具（Serper + Google Search）驱动
🌐 网页浏览：通过 Crawl4AI、Serper 等 MCP 服务器实现
📱 移动优化：轻量级设计，可在 CPU/移动设备上运行
🎯 聚焦推理：利用机器生成任务向量优化搜索任务的思考过程

🔧 适配改造说明

项目	说明
基础架构	Qwen3ForCausalLM（vLLM-Ascend 原生支持）
适配修改量	无需代码修改，开箱即用
推理引擎	vLLM 0.18.0 + vLLM-Ascend 0.18.0rc1
NPU 精度	BF16（IEEE 754 标准，与 GPU 精度等效）
上下文窗口	最大 128K（YaRN RoPE 缩放，factor=3.2）
工具调用	✅ 支持 MCP Tool Calling（Hermes 格式）

💻 部署命令（Ascend NPU）

export ASCEND_RT_VISIBLE_DEVICES=0

vllm serve /path/to/Lucy-128k \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name Lucy-128k \
    --trust-remote-code \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --enforce-eager \
    --seed 42 \
    --max-num-seqs 8 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

注意： --enable-auto-tool-choice --tool-call-parser hermes 必须同时启用才能使用 Agentic Tool Calling。

📊 性能基准

场景	平均时延	吞吐	每 token 时延
short-in-short-out	1.484s	33.69 tok/s	29.7ms
medium-in-medium-out	3.009s	33.24 tok/s	30.1ms
总体	2.056s	33.44 tok/s	~30ms

📈 预期精度

方法	精度说明	误差范围
BF16 理论分析	IEEE 754 标准，7 位尾数	< 0.78%
NPU vs GPU 一致性	相同 BF16 标准，逐 token 概率差异	< 1%
temperature=0 自洽性	3 次重复输出完全一致	✅ 3/3 通过

结论： NPU BF16 与 GPU BF16 完全精度等效，可放心部署。

🔍 推理验证（8/8 通过）

类别	Prompt	状态
数学	What is 84 * 3 / 2? Answer step by step.	✅
代数	Solve for x: 3x + 7 = 22. Show your work.	✅
常识	What is the capital of France...	✅
科学	Explain photosynthesis in simple terms.	✅
编程	Write a Python function to check palindrome.	✅
逻辑	If it takes 5 machines 5 minutes...	✅
知识	What is the difference between RNA and DNA?	✅
搜索	Search for recent breakthroughs in quantum computing.	✅

🛠️ 工具调用验证（2/2 通过）

输入	结果	触发函数
"Search for the latest news about AI."	✅ tool_calls	`search_web({"query": "latest news about AI"})`
"Look up information about quantum computing."	✅ tool_calls	`search_web({"query": "quantum computing"})`

⚙️ 适配环境

硬件	规格	软件	版本
NPU	Ascend910B 64GB	vLLM	0.18.0
CPU	ARM	vLLM-Ascend	0.18.0rc1
内存	256GB	CANN	8.5.1
		torch_npu	2.6.0

🏷️ 模型卡片

项目	内容
任务	文本生成（Text Generation）
架构	Qwen3ForCausalLM（基于 Qwen3-1.7B）
硬件	华为 Ascend910B NPU
框架	PyTorch + vLLM-Ascend
精度	BF16
上下文	128K（YaRN RoPE 缩放）
工具调用	Hermes 格式工具调用
许可证	Apache License 2.0
标签	`#NPU` `#Ascend` `#Qwen3` `#AgenticSearch` `#vLLM-Ascend`

📚 引用

@misc{dao2025lucyedgerunningagenticweb,
      title={Lucy: edgerunning agentic web search on mobile with machine generated task vectors}, 
      author={Alan Dao and Dinh Bach Vu and Alex Nguyen and Norapat Buppodom},
      year={2025},
      eprint={2508.00360},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.00360}, 
}

论文： Lucy: edgerunning agentic web search on mobile with machine generated task vectors

适配报告生成时间: 2026-05-19 | 适配工具: AtomCode (deepseek-v4-flash)