
Qwen3-Reranker-0.6B

Highlights

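The Qwen3 Embedding model series is the latest proprietary model family from Qwen, designed specifically for text embedding and ranking tasks. Built on the dense foundation models of the Qwen3 series, it provides a full range of text embedding and reranking models in several sizes (0.6B, 4B, and 8B). The series inherits the multilingual capabilities, long-text understanding, and reasoning skills of its foundation models. It delivers significant advances across multiple text embedding and ranking tasks, including text retrieval, code retrieval, text classification, text clustering, and bitext mining.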

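Exceptional versatility: The embedding model achieves state-of-the-art performance across a wide range of downstream application evaluations. The 8B embedding model ranks No. 1 on the MTEB multilingual leaderboard (as of June 5, 2025, with a score of 70.58), while the reranking models perform strongly across a variety of text retrieval scenarios.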

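Comprehensive flexibility: The Qwen3 Embedding series offers a full spectrum of sizes (from 0.6B to 8B) for both embedding and reranking models, covering use cases that prioritize either efficiency or effectiveness. Developers can seamlessly combine the two modules. In addition, the embedding models support flexible output vector dimensions, and both embedding and reranking models accept user-defined instructions to improve performance for specific tasks, languages, or scenarios.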

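Multilingual capability: Thanks to the multilingual capabilities of the Qwen3 models, the Qwen3 Embedding series supports over 100 languages. This includes a variety of programming languages and provides robust multilingual, cross-lingual, and code retrieval capabilities.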

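Model Overview

Qwen3-Reranker-0.6B has the following features:

  • Model type: Text reranking
  • Supported languages: 100+ languages
  • Number of parameters: 0.6B
  • Context length: 32k

For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our blog and GitHub.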

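Qwen3 Embedding Series Model List

| Model Type | Model | Size | Layers | Sequence Length | Embedding Dimension | MRL Support | Instruction Aware |
|---|---|---|---|---|---|---|---|
| Text Embedding | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | Yes | Yes |
| Text Embedding | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | Yes | Yes |
| Text Reranking | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | Yes |
| Text Reranking | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | Yes |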

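Note:

  • MRL Support indicates whether the embedding model supports custom dimensions for the final embedding.
  • Instruction Aware indicates whether the embedding or reranking model supports customizing the input instruction for different tasks.
  • Our evaluation indicates that, for most downstream tasks, using instructions (instruct) typically yields an improvement of 1% to 5% over not using them. We therefore recommend that developers create instructions tailored to their specific tasks and scenarios. In multilingual contexts, we also advise writing instructions in English, because most instructions used during model training were originally written in English.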
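As an illustration of what "custom dimensions" can mean for an MRL-capable embedding model, the sketch below keeps the leading components of a vector and rescales the result to unit length. This is the convention commonly used with MRL-style embeddings and is an assumption here, since this model card does not spell out the procedure. `truncate_embedding` is a hypothetical helper, and the input vector is made-up data.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm.

    Assumption: the embedding was trained so that leading components
    remain useful on their own (the usual MRL convention).
    """
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [x / norm for x in head]

# Made-up 4-dimensional vector, shortened to 2 dimensions
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_embedding(full, 2)
print(short)
```

This applies to the embedding models only; the rerankers output a relevance score rather than a vector.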

Usage

With Transformers versions earlier than 4.51.0, you may encounter the following error:

KeyError: 'qwen3'
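To catch this before loading the model, one option is to compare the installed version string against the 4.51.0 minimum. The sketch below is a hand-rolled comparison that only looks at the first three numeric components; `parse_version` and `supports_qwen3` are hypothetical helper names, not part of Transformers.

```python
def parse_version(text):
    """Turn '4.51.0' into (4, 51, 0); suffixes such as '.dev0' or 'rc1' are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)

def supports_qwen3(installed, minimum="4.51.0"):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)

# Typical use (requires transformers to be installed):
#   import transformers
#   if not supports_qwen3(transformers.__version__):
#       raise RuntimeError("Please upgrade: pip install -U 'transformers>=4.51.0'")
print(supports_qwen3("4.50.3"), supports_qwen3("4.51.0"))
```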

Transformers Usage

# Requires transformers>=4.51.0
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def format_instruction(instruction, query, doc):
    if instruction is None:
        instruction = 'Given a web search query, retrieve relevant passages that answer the query'
    output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(instruction=instruction,query=query, doc=doc)
    return output

def process_inputs(pairs):
    inputs = tokenizer(
        pairs, padding=False, truncation='longest_first',
        return_attention_mask=False, max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
    )
    for i, ele in enumerate(inputs['input_ids']):
        inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
    inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
    for key in inputs:
        inputs[key] = inputs[key].to(model.device)
    return inputs

@torch.no_grad()
def compute_logits(inputs, **kwargs):
    batch_scores = model(**inputs).logits[:, -1, :]
    true_vector = batch_scores[:, token_true_id]
    false_vector = batch_scores[:, token_false_id]
    batch_scores = torch.stack([false_vector, true_vector], dim=1)
    batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
    scores = batch_scores[:, 1].exp().tolist()
    return scores

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B", torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()
token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192

prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
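# The empty <think> block makes the model answer directly, without a reasoning trace.
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"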
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
        
task = 'Given a web search query, retrieve relevant passages that answer the query'

queries = ["What is the capital of China?",
    "Explain gravity",
]

documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]

# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)

print("scores: ", scores)
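The score returned by `compute_logits` above is a two-way softmax over the "yes" and "no" logits at the last position, which is equivalent to a sigmoid of their difference. The standalone sketch below shows only that arithmetic in pure Python. The function names `yes_probability` and `sigmoid` are illustrative, and the logit values are made up.

```python
import math

def yes_probability(yes_logit, no_logit):
    """Softmax over the two logits, returning the probability of 'yes'."""
    m = max(yes_logit, no_logit)          # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Made-up logits: the model prefers "yes" by a margin of 3.0
score = yes_probability(2.0, -1.0)
print(score, sigmoid(2.0 - (-1.0)))  # the two values agree up to rounding
```

Because only the difference between the two logits matters, the score always lies in (0, 1) and can be used directly to sort documents.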

vLLM Usage

# Requires vllm>=0.8.5
import math

import torch
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel
from vllm.inputs.data import TokensPrompt


        
def format_instruction(instruction, query, doc):
    text = [
        {"role": "system", "content": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."},
        {"role": "user", "content": f"<Instruct>: {instruction}\n\n<Query>: {query}\n\n<Document>: {doc}"}
    ]
    return text

def process_inputs(pairs, instruction, max_length, suffix_tokens):
    messages = [format_instruction(instruction, query, doc) for query, doc in pairs]
    messages =  tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False, enable_thinking=False
    )
    messages = [ele[:max_length] + suffix_tokens for ele in messages]
    messages = [TokensPrompt(prompt_token_ids=ele) for ele in messages]
    return messages

def compute_logits(model, messages, sampling_params, true_token, false_token):
    outputs = model.generate(messages, sampling_params, use_tqdm=False)
    scores = []
    for i in range(len(outputs)):
        # Log-probabilities reported for the single generated position
        final_logits = outputs[i].outputs[0].logprobs[-1]
        if true_token not in final_logits:
            true_logit = -10
        else:
            true_logit = final_logits[true_token].logprob
        if false_token not in final_logits:
            false_logit = -10
        else:
            false_logit = final_logits[false_token].logprob
        true_score = math.exp(true_logit)
        false_score = math.exp(false_logit)
        score = true_score / (true_score + false_score)
        scores.append(score)
    return scores

number_of_gpu = torch.cuda.device_count()
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Reranker-0.6B')
model = LLM(model='Qwen/Qwen3-Reranker-0.6B', tensor_parallel_size=number_of_gpu, max_model_len=10000, enable_prefix_caching=True, gpu_memory_utilization=0.8)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
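# The empty <think> block makes the model answer directly, without a reasoning trace.
suffix = "<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n"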
max_length=8192
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
true_token = tokenizer("yes", add_special_tokens=False).input_ids[0]
false_token = tokenizer("no", add_special_tokens=False).input_ids[0]
sampling_params = SamplingParams(temperature=0, 
    max_tokens=1,
    logprobs=20, 
    allowed_token_ids=[true_token, false_token],
)

        
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?",
    "Explain gravity",
]
documents = [
    "The capital of China is Beijing.",
    "Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]

pairs = list(zip(queries, documents))
inputs = process_inputs(pairs, task, max_length-len(suffix_tokens), suffix_tokens)
scores = compute_logits(model, inputs, sampling_params, true_token, false_token)
print('scores', scores)

destroy_model_parallel()
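The normalization step in `compute_logits` above can be read in isolation: each candidate token's log-probability is exponentiated, a missing token falls back to -10, and the "yes" share of the two probabilities becomes the score. The sketch below reproduces that step with plain floats instead of vLLM's log-probability objects, which is a simplification. The token ids 1 and 2 and the name `score_from_logprobs` are hypothetical.

```python
import math

MISSING_LOGPROB = -10.0  # same fallback value as in the snippet above

def score_from_logprobs(logprobs, true_token, false_token):
    """logprobs: dict mapping token id -> log-probability at the generated position."""
    true_lp = logprobs.get(true_token, MISSING_LOGPROB)
    false_lp = logprobs.get(false_token, MISSING_LOGPROB)
    p_true = math.exp(true_lp)
    p_false = math.exp(false_lp)
    return p_true / (p_true + p_false)

YES_ID, NO_ID = 1, 2  # hypothetical token ids for illustration

# Both tokens reported: the score is the renormalized "yes" probability
both = score_from_logprobs({YES_ID: math.log(0.9), NO_ID: math.log(0.1)}, YES_ID, NO_ID)

# "no" missing from the returned log-probabilities: the fallback keeps the score finite
only_yes = score_from_logprobs({YES_ID: 0.0}, YES_ID, NO_ID)
print(both, only_yes)
```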

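📌 Tip: We recommend that developers customize the instruct for their specific scenario, task, and language. Our tests show that in most retrieval scenarios, omitting the instruct on the query side can reduce retrieval performance by approximately 1% to 5%.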
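A minimal sketch of customizing the instruct, reusing the prompt layout of `format_instruction` from the Transformers example. The task string below is a hypothetical example written in English, in line with the recommendation above. The query and document are made-up data.

```python
DEFAULT_TASK = "Given a web search query, retrieve relevant passages that answer the query"

def format_instruction(instruction, query, doc):
    """Build the reranker input text; falls back to the default web-search task."""
    if instruction is None:
        instruction = DEFAULT_TASK
    return f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"

# Hypothetical task-specific instruction, kept in English even for a non-English query
custom_task = "Given a programming question, retrieve code snippets that solve it"
pair = format_instruction(custom_task, "如何反转列表?", "reversed_items = items[::-1]")
print(pair)
```

The resulting string can be passed to `process_inputs` in the Transformers example in place of the default pairs.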

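Evaluation

| Model | Param | MTEB-R | CMTEB-R | MMTEB-R | MLDR | MTEB-Code | FollowIR |
|---|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 0.6B | 61.82 | 71.02 | 64.64 | 50.26 | 75.41 | 5.09 |
| Jina-multilingual-reranker-v2-base | 0.3B | 58.22 | 63.37 | 63.73 | 39.66 | 58.98 | -0.68 |
| gte-multilingual-reranker-base | 0.3B | 59.51 | 74.08 | 59.44 | 66.33 | 54.18 | -1.64 |
| BGE-reranker-v2-m3 | 0.6B | 57.03 | 72.16 | 58.36 | 59.51 | 41.38 | -0.01 |
| Qwen3-Reranker-0.6B | 0.6B | 65.80 | 71.31 | 66.36 | 67.28 | 73.42 | 5.41 |
| Qwen3-Reranker-4B | 4B | 69.76 | 75.94 | 72.74 | 69.97 | 81.20 | 14.84 |
| Qwen3-Reranker-8B | 8B | 69.02 | 77.45 | 72.94 | 70.19 | 81.22 | 8.05 |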

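Note:

  • The results above are for reranking models. We use the retrieval subsets of MTEB (English, v2), MTEB (Chinese, v1), MMTEB, and MTEB (Code), referred to as MTEB-R, CMTEB-R, MMTEB-R, and MTEB-Code, respectively.
  • All scores are based on the top-100 candidates retrieved by the dense embedding model Qwen3-Embedding-0.6B.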
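The evaluation setup above follows a retrieve-then-rerank pattern: a first-stage retriever proposes candidates, and the reranker reorders them. The sketch below shows only the reordering step. `rerank` is a hypothetical helper, and `toy_score` is a word-overlap placeholder standing in for the reranker's relevance score; it is not how Qwen3-Reranker scores documents.

```python
def rerank(query, candidates, score_fn, top_k=None):
    """Score each candidate against the query and return (score, doc) pairs, best first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored if top_k is None else scored[:top_k]

def toy_score(query, doc):
    """Placeholder scorer: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0

# Made-up candidates, as if returned by a first-stage retriever
candidates = [
    "gravity pulls objects together",
    "the capital of china is beijing",
    "cats sleep a lot",
]
best = rerank("capital of china", candidates, toy_score, top_k=1)
print(best)
```

In a real pipeline, `score_fn` would wrap one of the scoring examples from the usage sections, and `candidates` would be the top results from the embedding model.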

Citation

If you find our work helpful, feel free to cite it.

@article{qwen3embedding,
  title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
  author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
  journal={arXiv preprint arXiv:2506.05176},
  year={2025}
}