Qwen3 Embedding模型系列是Qwen家族的最新专有模型,专为文本嵌入和排序任务设计。该系列基于Qwen3系列的稠密基础模型构建,提供了多种尺寸(0.6B、4B和8B)的全面文本嵌入和重排序模型。此系列继承了其基础模型卓越的多语言能力、长文本理解能力和推理技能。Qwen3 Embedding系列在多个文本嵌入和排序任务中取得了显著进步,包括文本检索、代码检索、文本分类、文本聚类和双语语料挖掘。
卓越的多功能性:该嵌入模型在各种下游应用评估中均实现了最先进的性能。8B尺寸的嵌入模型在MTEB多语言排行榜中排名第一(截至2025年6月5日,得分70.58),而重排序模型在各种文本检索场景中表现出色。
全面的灵活性:Qwen3 Embedding系列为嵌入和重排序模型提供了全谱系的尺寸选择(从0.6B到8B),以满足不同场景下对效率和效果的优先需求。开发者可以无缝组合这两个模块。此外,嵌入模型允许在所有维度上灵活定义向量,并且嵌入和重排序模型均支持用户自定义指令,以增强特定任务、语言或场景的性能。
多语言能力:得益于Qwen3模型的多语言能力,Qwen3 Embedding系列支持超过100种语言。这包括各种编程语言,并提供强大的多语言、跨语言和代码检索能力。
Qwen3-Reranker-0.6B具有以下特点:
有关更多详细信息,包括基准测试评估、硬件要求和推理性能,请参阅我们的博客、GitHub。
| 模型类型 | 模型名称 | 规模 | 层数 | 序列长度 | 嵌入维度 | MRL 支持 | 指令感知 |
|---|---|---|---|---|---|---|---|
| 文本嵌入 | Qwen3-Embedding-0.6B | 0.6B | 28 | 32K | 1024 | 是 | 是 |
| 文本嵌入 | Qwen3-Embedding-4B | 4B | 36 | 32K | 2560 | 是 | 是 |
| 文本嵌入 | Qwen3-Embedding-8B | 8B | 36 | 32K | 4096 | 是 | 是 |
| 文本重排序 | Qwen3-Reranker-0.6B | 0.6B | 28 | 32K | - | - | 是 |
| 文本重排序 | Qwen3-Reranker-4B | 4B | 36 | 32K | - | - | 是 |
| 文本重排序 | Qwen3-Reranker-8B | 8B | 36 | 32K | - | - | 是 |
注意:
MRL 支持表示嵌入模型是否支持自定义最终嵌入的维度。指令感知表示嵌入或重排序模型是否支持根据不同任务自定义输入指令。- 我们的评估表明,对于大多数下游任务,使用指令(instruct)通常比不使用指令能带来 1% 到 5% 的性能提升。因此,我们建议开发者针对其特定任务和场景创建定制化指令。在多语言环境下,我们也建议用户使用英文编写指令,因为模型训练过程中使用的大多数指令均为英文原版。
若使用 4.51.0 之前版本的 Transformers,可能会遇到以下错误:
KeyError: 'qwen3'# Requires transformers>=4.51.0
import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
def format_instruction(instruction, query, doc):
if instruction is None:
instruction = 'Given a web search query, retrieve relevant passages that answer the query'
output = "<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}".format(instruction=instruction,query=query, doc=doc)
return output
def process_inputs(pairs):
inputs = tokenizer(
pairs, padding=False, truncation='longest_first',
return_attention_mask=False, max_length=max_length - len(prefix_tokens) - len(suffix_tokens)
)
for i, ele in enumerate(inputs['input_ids']):
inputs['input_ids'][i] = prefix_tokens + ele + suffix_tokens
inputs = tokenizer.pad(inputs, padding=True, return_tensors="pt", max_length=max_length)
for key in inputs:
inputs[key] = inputs[key].to(model.device)
return inputs
@torch.no_grad()
def compute_logits(inputs, **kwargs):
batch_scores = model(**inputs).logits[:, -1, :]
true_vector = batch_scores[:, token_true_id]
false_vector = batch_scores[:, token_false_id]
batch_scores = torch.stack([false_vector, true_vector], dim=1)
batch_scores = torch.nn.functional.log_softmax(batch_scores, dim=1)
scores = batch_scores[:, 1].exp().tolist()
return scores
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Reranker-0.6B", padding_side='left')
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B").eval()
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-Reranker-0.6B", torch_dtype=torch.float16, attn_implementation="flash_attention_2").cuda().eval()
token_false_id = tokenizer.convert_tokens_to_ids("no")
token_true_id = tokenizer.convert_tokens_to_ids("yes")
max_length = 8192
prefix = "<|im_start|>system\nJudge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\".<|im_end|>\n<|im_start|>user\n"
suffix = "<|im_end|>\n<|im_start|>assistant\n\n\n"
prefix_tokens = tokenizer.encode(prefix, add_special_tokens=False)
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?",
"Explain gravity",
]
documents = [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
pairs = [format_instruction(task, query, doc) for query, doc in zip(queries, documents)]
# Tokenize the input texts
inputs = process_inputs(pairs)
scores = compute_logits(inputs)
print("scores: ", scores)# Requires vllm>=0.8.5
import logging
from typing import Dict, Optional, List
import json
import logging
import torch
from transformers import AutoTokenizer, is_torch_npu_available
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import destroy_model_parallel
import gc
import math
from vllm.inputs.data import TokensPrompt
def format_instruction(instruction, query, doc):
text = [
{"role": "system", "content": "Judge whether the Document meets the requirements based on the Query and the Instruct provided. Note that the answer can only be \"yes\" or \"no\"."},
{"role": "user", "content": f"<Instruct>: {instruction}\n\n<Query>: {query}\n\n<Document>: {doc}"}
]
return text
def process_inputs(pairs, instruction, max_length, suffix_tokens):
messages = [format_instruction(instruction, query, doc) for query, doc in pairs]
messages = tokenizer.apply_chat_template(
messages, tokenize=True, add_generation_prompt=False, enable_thinking=False
)
messages = [ele[:max_length] + suffix_tokens for ele in messages]
messages = [TokensPrompt(prompt_token_ids=ele) for ele in messages]
return messages
def compute_logits(model, messages, sampling_params, true_token, false_token):
outputs = model.generate(messages, sampling_params, use_tqdm=False)
scores = []
for i in range(len(outputs)):
final_logits = outputs[i].outputs[0].logprobs[-1]
token_count = len(outputs[i].outputs[0].token_ids)
if true_token not in final_logits:
true_logit = -10
else:
true_logit = final_logits[true_token].logprob
if false_token not in final_logits:
false_logit = -10
else:
false_logit = final_logits[false_token].logprob
true_score = math.exp(true_logit)
false_score = math.exp(false_logit)
score = true_score / (true_score + false_score)
scores.append(score)
return scores
number_of_gpu = torch.cuda.device_count()
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen3-Reranker-0.6B')
model = LLM(model='Qwen/Qwen3-Reranker-0.6B', tensor_parallel_size=number_of_gpu, max_model_len=10000, enable_prefix_caching=True, gpu_memory_utilization=0.8)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
suffix = "<|im_end|>\n<|im_start|>assistant\n\n\n"
max_length=8192
suffix_tokens = tokenizer.encode(suffix, add_special_tokens=False)
true_token = tokenizer("yes", add_special_tokens=False).input_ids[0]
false_token = tokenizer("no", add_special_tokens=False).input_ids[0]
sampling_params = SamplingParams(temperature=0,
max_tokens=1,
logprobs=20,
allowed_token_ids=[true_token, false_token],
)
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = ["What is the capital of China?",
"Explain gravity",
]
documents = [
"The capital of China is Beijing.",
"Gravity is a force that attracts two bodies towards each other. It gives weight to physical objects and is responsible for the movement of planets around the sun.",
]
pairs = list(zip(queries, documents))
inputs = process_inputs(pairs, task, max_length-len(suffix_tokens), suffix_tokens)
scores = compute_logits(model, inputs, sampling_params, true_token, false_token)
print('scores', scores)
destroy_model_parallel()📌 提示:我们建议开发者根据具体场景、任务和语言对 instruct 进行自定义。测试表明,在大多数检索场景中,若查询端不使用 instruct,检索性能可能会下降约 1% 至 5%。
| 模型 | 参数规模 | MTEB-R | CMTEB-R | MMTEB-R | MLDR | MTEB-Code | FollowIR |
|---|---|---|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 0.6B | 61.82 | 71.02 | 64.64 | 50.26 | 75.41 | 5.09 |
| Jina-multilingual-reranker-v2-base | 0.3B | 58.22 | 63.37 | 63.73 | 39.66 | 58.98 | -0.68 |
| gte-multilingual-reranker-base | 0.3B | 59.51 | 74.08 | 59.44 | 66.33 | 54.18 | -1.64 |
| BGE-reranker-v2-m3 | 0.6B | 57.03 | 72.16 | 58.36 | 59.51 | 41.38 | -0.01 |
| Qwen3-Reranker-0.6B | 0.6B | 65.80 | 71.31 | 66.36 | 67.28 | 73.42 | 5.41 |
| Qwen3-Reranker-4B | 4B | 69.76 | 75.94 | 72.74 | 69.97 | 81.20 | 14.84 |
| Qwen3-Reranker-8B | 8B | 69.02 | 77.45 | 72.94 | 70.19 | 81.22 | 8.05 |
注意:
- 上述为重排序模型的评估结果。我们使用 MTEB(英文,v2)、MTEB(中文,v1)、MMTEB 和 MTEB(代码)的检索子集,即 MTEB-R、CMTEB-R、MMTEB-R 和 MTEB-Code。
- 所有分数均基于稠密嵌入模型 Qwen3-Embedding-0.6B 检索出的前 100 个候选结果得出。
如果我们的工作对您有所帮助,欢迎引用。
@article{qwen3embedding,
title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
journal={arXiv preprint arXiv:2506.05176},
year={2025}
}