LaBSE for English and Russian

This is a truncated version of sentence-transformers/LaBSE, which is in turn a port of Google's LaBSE.

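Only English and Russian tokens are kept in this model's vocabulary. As a result, the vocabulary is 10% of the original size and the whole model has 27% of the original parameter count, with no loss in the quality of the English and Russian embeddings.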
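The two percentages differ because the token-embedding matrix shrinks with the vocabulary while the rest of the network stays the same size. The following back-of-the-envelope check is only a sketch: the figures in it are assumptions that are not stated in this card (a vocabulary of roughly 501,000 tokens and a hidden size of 768 for the original LaBSE, and about 86 million parameters outside the token-embedding matrix, as in a BERT-base encoder), and `param_fraction` is a hypothetical helper.

```python
def param_fraction(vocab_size, hidden_size, other_params, kept_fraction):
    """Estimate the share of parameters left after trimming the vocabulary.

    Only the token-embedding matrix (vocab_size x hidden_size) shrinks;
    other_params (encoder layers, pooler, etc.) is unchanged.
    """
    kept_vocab = round(vocab_size * kept_fraction)
    original = vocab_size * hidden_size + other_params
    trimmed = kept_vocab * hidden_size + other_params
    return trimmed / original


# Assumed, approximate figures (not taken from this model card).
ASSUMED_VOCAB_SIZE = 501_153
ASSUMED_HIDDEN_SIZE = 768
ASSUMED_OTHER_PARAMS = 86_000_000

fraction = param_fraction(
    ASSUMED_VOCAB_SIZE, ASSUMED_HIDDEN_SIZE, ASSUMED_OTHER_PARAMS, kept_fraction=0.10
)
print(f"{fraction:.1%} of the original parameters remain")
```

Under these assumed figures the estimate lands in the same range as the 27% quoted above, which suggests that most of the original model's parameters sit in the token embeddings.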

To get sentence embeddings, you can use the following code:

import torch
from transformers import AutoTokenizer, AutoModel
# Load the tokenizer and the model weights
tokenizer = AutoTokenizer.from_pretrained("cointegrated/LaBSE-en-ru")
model = AutoModel.from_pretrained("cointegrated/LaBSE-en-ru")
sentences = ["Hello World", "Привет Мир"]
# Tokenize the batch, padding to the longest sentence and truncating at 64 tokens
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=64, return_tensors='pt')
# Run inference without tracking gradients
with torch.no_grad():
    model_output = model(**encoded_input)
# Use the pooled output as the sentence embedding
embeddings = model_output.pooler_output
# L2-normalize each embedding (row)
embeddings = torch.nn.functional.normalize(embeddings)
print(embeddings)
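Because the embeddings above are L2-normalized, the cosine similarity between two sentences reduces to a dot product. The sketch below shows this with plain Python lists and made-up 2-dimensional vectors, so it runs without downloading the model. `l2_normalize` and `similarity_matrix` are hypothetical helper names, and the toy vectors are not real model output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def similarity_matrix(vectors):
    """Pairwise cosine similarity: dot products of the unit-length vectors."""
    unit = [l2_normalize(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


# Toy stand-ins for sentence embeddings (illustrative values only).
toy_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
sims = similarity_matrix(toy_embeddings)
for row in sims:
    print([round(value, 3) for value in row])
```

With the real tensor from the snippet above, the same matrix should be obtainable as `embeddings @ embeddings.T`. A translation pair such as the two example sentences would be expected to score higher than unrelated sentences.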

The model was truncated in this notebook. You can adapt it for other languages (for example, EIStakovskii/LaBSE-fr-de), models, or datasets.
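The general idea of vocabulary truncation can be sketched as follows. This is not the notebook's actual code, only a minimal illustration under stated assumptions: the tokens to keep are chosen by counting their occurrences in a tokenized corpus of the target languages, and the embedding matrix is represented as a plain list of rows. The names `select_tokens` and `trim_embeddings` are hypothetical.

```python
from collections import Counter


def select_tokens(tokenized_corpus, special_tokens, min_count=1):
    """Keep the special tokens plus every token seen at least min_count times."""
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    kept = list(special_tokens)
    seen = set(kept)
    for tok, count in counts.most_common():
        if count >= min_count and tok not in seen:
            kept.append(tok)
            seen.add(tok)
    return kept


def trim_embeddings(old_vocab, old_matrix, kept_tokens):
    """Build a new token-to-id mapping and copy over the matching embedding rows."""
    new_vocab, new_matrix = {}, []
    for tok in kept_tokens:
        if tok in old_vocab:
            new_vocab[tok] = len(new_matrix)
            new_matrix.append(old_matrix[old_vocab[tok]])
    return new_vocab, new_matrix


# Toy vocabulary and embedding matrix (illustrative values only).
old_vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "bonjour": 3, "привет": 4}
old_matrix = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
corpus = [["hello", "привет"], ["привет"]]

kept = select_tokens(corpus, special_tokens=["[PAD]", "[UNK]"])
new_vocab, new_matrix = trim_embeddings(old_vocab, old_matrix, kept)
print(new_vocab)
print(new_matrix)
```

In a real model, the tokenizer's vocabulary file and the model's input embedding layer would both need to be rebuilt consistently, so that each new token id points to the row copied from the original matrix.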

Reference:

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, Wei Wang. Language-agnostic BERT Sentence Embedding. July 2020.

License: https://tfhub.dev/google/LaBSE/1
