Qwen3-Reranker-0.6B

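Use this project to quickly deploy and validate the Qwen3-Reranker-0.6B model in an NPU environment. It supports text generation, vector embedding, reranking (Rerank), and scoring (Score), and has been adapted to and validated on Ascend NPUs, including an accuracy evaluation.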

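Model Description

Item	Details
Model name	Qwen3-Reranker-0.6B
Model source	ModelScope: Qwen/Qwen3-Reranker-0.6B
Architecture	`Qwen3ForCausalLM`
Parameters	0.6B
Weight size	1.11 GB (model.safetensors)
Supported precision	bfloat16, float32
Embedding dimension	1024
Maximum sequence length	4096 tokens
Adaptation result	✅ Fully adapted, no code changes required

Qwen3-Reranker-0.6B is fine-tuned from Qwen3-0.6B-Base, uses the standard Qwen3 decoder architecture, and is designed for text ranking and retrieval tasks. The model supports the following run modes:

Run mode	Launch arguments	Status
📝 Text generation	`--task generate` (default)	✅
📐 Vector embedding / reranking	`--runner pooling --convert embed`	✅
📊 Classification	`--runner pooling --convert classify`	❌ Model has no score head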

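Key Features

Zero-code adaptation: the Qwen3ForCausalLM architecture is natively supported in vLLM and fully compatible with vLLM-Ascend
Multi-mode inference: supports text generation, embedding, and reranking
Lightweight: 0.6B parameters; the weights occupy only 1.14 GB of device memory
Ascend-native: all operators (GQA, RoPE, SwiGLU) are accelerated natively through Ascend CANN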

Hardware and Software Environment

Hardware

Item	Configuration
Server model	Atlas 800 A2
NPU model	Ascend 910B (64GB)
CPU cores	64
Memory	512 GB

Software

Software	Version
Operating system	Ubuntu 22.04.5 LTS / EulerOS 2.10
Python	3.11.14
Ascend toolkit	CANN 8.5.1
torch	2.6.0
torch_npu	2.6.0
vLLM	0.18.0
vLLM-Ascend	0.18.0
transformers	4.51.3
modelscope	1.23.0

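Quick Start

1. Install dependencies

# Install vLLM and vLLM-Ascend
pip install vllm==0.18.0
pip install vllm-ascend==0.18.0

# Install transformers and modelscope
# (the version specifier is quoted so the shell does not treat ">" as a redirect)
pip install "transformers>=4.51.0" modelscope

2. Start the inference service

Basic deployment (recommended for reranking)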

python3 -m vllm.entrypoints.openai.api_server \
    --model /path/to/Qwen3-Reranker-0.6B \
    --runner pooling \
    --convert embed \
    --enforce-eager \
    --trust-remote-code \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.3 \
    --port 8000

Text generation mode

python3 -m vllm.entrypoints.openai.api_server \
    --model /path/to/Qwen3-Reranker-0.6B \
    --enforce-eager \
    --trust-remote-code \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.5 \
    --port 8000

3. API calls

Vector embedding

curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/Qwen3-Reranker-0.6B",
    "input": ["your text"]
  }'
# Returns: an embedding array (1024 dimensions)

Reranking (Rerank)

curl http://localhost:8000/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/Qwen3-Reranker-0.6B",
    "query": "query text",
    "documents": ["document 1", "document 2", "document 3"]
  }'
# Returns: the documents in ranked order with their scores
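The same rerank request can be issued from Python. The sketch below uses only the standard library and mirrors the curl payload above. The response shape it parses (a `results` list whose items carry `index` and `relevance_score`) is an assumption about the server's reply and should be checked against an actual response; the function names are illustrative.

```python
import json
import urllib.request


def build_rerank_payload(model, query, documents):
    """Build the same JSON body as the curl example."""
    return {"model": model, "query": query, "documents": list(documents)}


def rank_documents(response, documents):
    """Return (document, score) pairs sorted by descending score.

    ASSUMPTION: each item in response["results"] has an "index" into the
    submitted documents and a "relevance_score". Adjust if the server differs.
    """
    results = sorted(
        response["results"], key=lambda item: item["relevance_score"], reverse=True
    )
    return [(documents[item["index"]], item["relevance_score"]) for item in results]


def rerank(base_url, model, query, documents, timeout=30.0):
    """POST to {base_url}/rerank and return the ranked (document, score) pairs."""
    body = json.dumps(build_rerank_payload(model, query, documents)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return rank_documents(json.load(reply), documents)
```

With the service from step 2 running, `rerank("http://localhost:8000", "/path/to/Qwen3-Reranker-0.6B", "query text", ["document 1", "document 2"])` would return the documents ordered by score.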

Scoring (Score)

curl http://localhost:8000/score \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/Qwen3-Reranker-0.6B",
    "text_1": ["query"],
    "text_2": ["document"]
  }'
# Returns: a similarity score

Text generation

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/path/to/Qwen3-Reranker-0.6B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50
  }'

Model Download

Download from ModelScope

export MODELSCOPE_CACHE=/path/to/cache

python3 -c "
from modelscope import snapshot_download
model_dir = snapshot_download('Qwen/Qwen3-Reranker-0.6B')
print(f'Model path: {model_dir}')
"

Model file structure

Qwen3-Reranker-0.6B/
├── config.json              # Model configuration
├── generation_config.json   # Generation configuration
├── model.safetensors        # Model weights (1.11 GB)
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json    # Tokenizer configuration
└── tokenizer.model          # Tokenizer model file

Model configuration overview

Setting	Value
hidden_size	1024
num_hidden_layers	28
num_attention_heads	16
num_key_value_heads	8 (GQA)
head_dim	128
intermediate_size	3072
max_position_embeddings	40960
rope_theta	1,000,000
vocab_size	151,669
tie_word_embeddings	true
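As a rough cross-check, the configuration values above can be turned into a parameter count. The sketch below assumes a standard Qwen3 decoder layout: bias-free q/k/v/o projections, a three-matrix SwiGLU MLP, two RMSNorm layers per block plus per-head q/k norms, and an output head that shares the embedding matrix (`tie_word_embeddings` is true). That layout is an assumption, not something stated in this document.

```python
# Values copied from the configuration table.
HIDDEN = 1024
LAYERS = 28
HEADS = 16
KV_HEADS = 8
HEAD_DIM = 128
INTERMEDIATE = 3072
VOCAB = 151_669


def count_parameters():
    """Estimate the parameter count under the ASSUMED layer layout."""
    embedding = VOCAB * HIDDEN  # shared with the output head (tied embeddings)
    q_proj = HIDDEN * HEADS * HEAD_DIM
    k_proj = HIDDEN * KV_HEADS * HEAD_DIM
    v_proj = HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = HEADS * HEAD_DIM * HIDDEN
    mlp = 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections
    norms = 2 * HIDDEN + 2 * HEAD_DIM  # two block norms plus q/k norms
    per_layer = q_proj + k_proj + v_proj + o_proj + mlp + norms
    return embedding + LAYERS * per_layer + HIDDEN  # final norm adds HIDDEN


total = count_parameters()
size_gib = total * 2 / 2**30  # 2 bytes per parameter in bfloat16
print(f"{total:,} parameters, about {size_gib:.2f} GiB in bfloat16")
```

Under these assumptions the estimate comes to roughly 0.6B parameters and about 1.11 GiB in bfloat16, in line with the parameter count and weight size listed in the model description.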

Model Deployment

Deployment flowchart (figure not reproduced here)

Python API

from vllm import LLM

# Reranking / embedding mode
llm = LLM(
    model="/path/to/Qwen3-Reranker-0.6B",
    runner="pooling",
    convert="embed",
    enforce_eager=True,
    max_model_len=4096,
    gpu_memory_utilization=0.3,
)

# Vector embedding
embeddings = llm.embed(["Hello world"])
print(f"Embedding dimension: {len(embeddings[0].outputs.embedding)}")

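Resource Usage

Metric	Value
Device memory for model weights	1.14 GB
KV cache capacity (gpu_memory_utilization=0.3)	~158,848 tokens
Maximum concurrent requests (4K context)	~38
Single-request embedding latency	<100 ms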
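The concurrency figure follows from dividing the KV cache capacity by the context length, and the capacity itself can be sanity-checked against the model configuration. The sketch below is a back-of-the-envelope estimate; it assumes a bfloat16 KV cache (2 bytes per value), which this document does not state, and it ignores allocator overhead.

```python
# From the model configuration.
LAYERS = 28
KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # ASSUMPTION: bfloat16 KV cache

# From the resource table.
KV_CACHE_TOKENS = 158_848
CONTEXT_LEN = 4096


def kv_bytes_per_token():
    """Bytes of KV cache per token: one key and one value vector per KV head per layer."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def max_full_context_requests(cache_tokens, context_len):
    """Number of requests that fit if every request uses the full context."""
    return cache_tokens // context_len


per_token = kv_bytes_per_token()
cache_gib = KV_CACHE_TOKENS * per_token / 2**30
concurrent = max_full_context_requests(KV_CACHE_TOKENS, CONTEXT_LEN)
print(f"{per_token} bytes/token, {cache_gib:.1f} GiB cache, {concurrent} requests")
```

Under these assumptions the cache works out to about 17 GiB, which is plausible for 30% of a 64 GB device after weights and activations are set aside, and 158,848 tokens hold 38 full 4K-context requests.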

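Accuracy Evaluation

Test method

The model was run with the transformers framework on CPU (baseline) and on Ascend NPU (under test), using the same random seed so that results are reproducible. Test procedure:

Load the Qwen3-Reranker-0.6B weights
Run inference on 8 test inputs with different semantics
Extract the last-layer hidden state (Last Hidden State) and the logits
Compute per-sample accuracy metrics

Metric definitions

Metric	Definition	Pass criterion
Cosine similarity	Agreement in vector direction	>0.99
Mean absolute error (MAE)	Mean of the element-wise absolute differences	-
Root mean square error (RMSE)	Square root of the mean of the squared element-wise differences	-
Mean relative error (Mean Rel%)	Mean of the relative error percentages	<1%
Maximum relative error (Max Rel%)	Maximum of the relative error percentages	-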
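The metrics in the table can be computed as in the following minimal sketch, written in plain Python so that it runs without extra dependencies. It is not the evaluation script used for the results in this document; the function names and the small `eps` guard against division by zero are illustrative choices.

```python
import math


def accuracy_metrics(reference, candidate, eps=1e-12):
    """Compare two equal-length vectors, e.g. CPU vs NPU hidden states."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("vectors must be non-empty and of equal length")
    diffs = [c - r for r, c in zip(reference, candidate)]
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm_ref = math.sqrt(sum(r * r for r in reference))
    norm_cand = math.sqrt(sum(c * c for c in candidate))
    # eps keeps the divisions finite; it is an illustrative choice.
    rel_pct = [abs(d) / (abs(r) + eps) * 100.0 for d, r in zip(diffs, reference)]
    return {
        "cosine": dot / (norm_ref * norm_cand + eps),
        "mae": sum(abs(d) for d in diffs) / len(diffs),
        "rmse": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "mean_rel_pct": sum(rel_pct) / len(rel_pct),
        "max_rel_pct": max(rel_pct),
    }


def passes(metrics):
    """Apply the two pass criteria from the table."""
    return metrics["cosine"] > 0.99 and metrics["mean_rel_pct"] < 1.0
```

In practice the two inputs would be the flattened hidden states (or logits) from the CPU and NPU runs for the same input.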

Test inputs

#	Input text	Length (tokens)
1	What is the capital of France?	7
2	The capital of France is Paris.	7
3	Paris is a beautiful city.	6
4	Explain the theory of relativity.	6
5	Machine learning is a subset of artificial intelligence.	9
6	The Earth orbits around the Sun.	7
7	Quantization reduces model size and improves inference speed.	10
8	The mitochondria is the powerhouse of the cell.	9

Accuracy comparison results

Hidden states comparison

Input #	Cosine similarity	MAE	RMSE	Mean relative error	Max relative error
1	1.000000	5.45e-06	7.51e-06	0.0076%	5.92%
2	1.000000	5.42e-06	7.51e-06	0.0027%	1.34%
3	1.000000	4.88e-06	6.64e-06	0.0041%	1.20%
4	1.000000	4.25e-06	6.51e-06	0.0014%	0.15%
5	1.000000	5.46e-06	7.57e-06	0.0014%	0.09%
6	1.000000	4.43e-06	8.22e-06	0.0027%	1.43%
7	1.000000	4.80e-06	6.90e-06	0.0015%	0.14%
8	1.000000	4.21e-06	6.61e-06	0.0022%	0.60%
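Average	1.000000	4.86e-06	7.18e-06	0.0029%	1.36%

Logits comparison (first 100 token dimensions)

Metric	Value
Mean cosine similarity	1.000000
Mean absolute error	8.21e-06

Accuracy analysis

Metric	Measured	Pass criterion	Verdict
Hidden-state mean cosine similarity	1.000000	>0.99	✅ Pass
Hidden-state mean relative error	0.0029%	<1%	✅ Pass
Logits mean cosine similarity	1.000000	>0.99	✅ Pass
Hidden-state mean absolute error	4.86e-06	-	Negligible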

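Error analysis

Overall accuracy: the cosine similarity between NPU and CPU outputs is 1.000000 to six decimal places, and the mean relative error is only 0.0029%, far below the 1% pass criterion.
Maximum relative error: the maximum relative error in individual dimensions reaches 5.92% (input #1), but this is because the hidden-state value in that dimension is itself close to zero (on the order of 1e-8 to 1e-10), so a tiny absolute error (~1e-08) is amplified into a large relative error. This combination of a large relative error with a small absolute error is common in floating-point arithmetic and does not affect actual inference quality.
Comparison with non-zero dimensions: on active hidden-state dimensions (values > 1e-3) the relative error stays below 0.01%, in sharp contrast to the errors near zero, which supports the analysis above.
Logits consistency: the cosine similarity of the logits is also 1.000000, indicating that the output layer is numerically aligned as well.
Floating-point differences: the small differences observed (MAE ~5e-06) come from minor differences between CPU and NPU in float32 accumulation order and instruction sets (fused multiply-add, FMA, versus separate multiply and add), and are within the normal range for differences between hardware platforms.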
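The amplification effect described above is easy to reproduce: a fixed absolute error produces a larger relative error as the reference value shrinks. The values in the sketch below are illustrative only and are not taken from the measured hidden states.

```python
def relative_error_pct(reference, candidate):
    """Relative error of candidate against a non-zero reference, in percent."""
    return abs(candidate - reference) / abs(reference) * 100.0


ABS_ERROR = 5e-6  # same order of magnitude as the MAE reported in the results

# Illustrative reference values, from a typical activation down toward zero.
for reference in (1.0, 1e-3, 1e-4):
    pct = relative_error_pct(reference, reference + ABS_ERROR)
    print(f"reference={reference:g}  relative error={pct:.4f}%")
```

The same absolute error of 5e-6 corresponds to roughly 0.0005% at a reference of 1.0 but about 5% at 1e-4, which is why the maximum relative error is a poor pass criterion on its own.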

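End-to-end ranking validation

An end-to-end ranking test using the /rerank API:

Query	Document	Score	Rank
What is the capital of France?	The capital of France is Paris.	0.866	1
	The capital of Germany is Berlin.	0.862	2
	Paris is a beautiful city.	0.818	3

The ranking is correct: the document that directly answers the query receives the highest score.

Accuracy conclusion

✅ Ascend NPU output is fully aligned with CPU output. The cosine similarity is 1.000000 and the mean relative error is 0.0029%, far below the required <1% threshold. NPU inference accuracy meets the requirements for production deployment.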

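FAQ

Q1: Startup fails with `ValueError: Following weights were not initialized from checkpoint: {'score.weight'}`

Cause: the --convert classify argument was used, but this model has no separate classification/score head.

Fix: start the server with --convert embed instead.

Q2: The `/rerank` endpoint returns 404

Cause: the server was not started with --runner pooling, so by default it only exposes the /v1/chat/completions endpoint.

Fix: restart the server with --runner pooling --convert embed.

Q3: An architecture recognition error appears when loading the model

Cause: the transformers version is too old and does not recognize the Qwen3 architecture.

Fix: upgrade transformers to 4.51.0 or later.

Q4: Out of NPU memory

Cause: --gpu-memory-utilization is set too high.

Fix: for the 0.6B model, a value between 0.2 and 0.3 is recommended:

--gpu-memory-utilization 0.25

Q5: Ranking scores are not well separated

Cause: the 0.6B model has limited discriminative power when used as a bi-encoder.

Suggestion: use a larger model (Qwen3-Reranker-4B or 8B), or switch to the original cross-encoder inference approach.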
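For context on the cross-encoder option: the upstream Qwen3-Reranker usage, as far as it is documented by the model authors, feeds the query and document together in one prompt and derives the relevance score from the next-token logits of "yes" and "no". The sketch below covers only that final scoring step and assumes the two logits have already been extracted; the prompt template and token lookup are out of scope, and the function name is illustrative.

```python
import math


def yes_no_score(logit_yes, logit_no):
    """Probability of "yes" under a two-way softmax over the yes/no logits."""
    # softmax([no, yes])[1] simplifies to a sigmoid of the logit difference;
    # the two branches keep math.exp from overflowing for large differences.
    diff = logit_yes - logit_no
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

Because the score depends on the query and document jointly rather than on two independent embeddings, this approach typically separates relevant from irrelevant documents more clearly, at the cost of one forward pass per query-document pair.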

References

Original weights repository: hf_mirrors/Qwen/Qwen3-Reranker-0.6B
This project: 2502_90647073/Reranker
ModelScope: Qwen3-Reranker-0.6B
Qwen3-Embedding GitHub
vLLM-Ascend
vLLM Official
