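We introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences.
Emu3 outperforms several well-established task-specific models on both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6, and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.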
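To make the single-sequence idea concrete, here is a minimal sketch of how text tokens and discrete vision codes could share one vocabulary and be framed as next-token prediction. The vocabulary sizes, the `BOI`/`EOI` marker ids, and the helper names are hypothetical and chosen for illustration only; they are not Emu3's actual values or API.

```python
# Toy illustration only: all sizes and special-token ids below are assumptions,
# not the values used by Emu3.
TEXT_VOCAB_SIZE = 1000        # hypothetical text vocabulary size
VISION_CODEBOOK_SIZE = 512    # hypothetical visual codebook size
BOI = TEXT_VOCAB_SIZE + VISION_CODEBOOK_SIZE  # hypothetical "begin of image" id
EOI = BOI + 1                                 # hypothetical "end of image" id


def vision_to_shared(code: int) -> int:
    """Map a visual codebook index into the shared vocabulary by offsetting it."""
    if not 0 <= code < VISION_CODEBOOK_SIZE:
        raise ValueError(f"vision code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def build_sequence(text_ids, vision_codes):
    """Concatenate text tokens and image tokens into one flat token sequence."""
    image_ids = [vision_to_shared(c) for c in vision_codes]
    return list(text_ids) + [BOI] + image_ids + [EOI]


def next_token_pairs(sequence):
    """Return (context, target) pairs: each token is predicted from its prefix."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


sequence = build_sequence(text_ids=[5, 42, 7], vision_codes=[0, 511])
pairs = next_token_pairs(sequence)
```

Because every modality ends up as integer ids in one sequence, the same training objective applies to text and image positions alike, which is the property the paragraph above describes.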
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM
from transformers.generation.configuration_utils import GenerationConfig
import torch
import sys
# Placeholder: replace with the local path to the downloaded BAAI/Emu3-Chat model files
sys.path.append("PATH_TO_BAAI_Emu3-Chat_MODEL")
from processing_emu3 import Emu3Processor
# model path
EMU_HUB = "BAAI/Emu3-Chat"
VQ_HUB = "BAAI/Emu3-VisionTokenizer"
# prepare model and processor
model = AutoModelForCausalLM.from_pretrained(
    EMU_HUB,
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True, padding_side="left")
image_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)
image_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map="cuda:0", trust_remote_code=True).eval()
processor = Emu3Processor(image_processor, image_tokenizer, tokenizer)
# prepare input
text = "Please describe the image"
image = Image.open("assets/demo.png")
inputs = processor(
    text=text,
    image=image,
    mode='U',
    return_tensors="pt",
    padding="longest",
)
# prepare generation hyperparameters
GENERATION_CONFIG = GenerationConfig(
    pad_token_id=tokenizer.pad_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=1024,
)
# generate
outputs = model.generate(
    inputs.input_ids.to("cuda:0"),
    GENERATION_CONFIG,
    attention_mask=inputs.attention_mask.to("cuda:0"),
)
# keep only the newly generated tokens by dropping the prompt prefix
outputs = outputs[:, inputs.input_ids.shape[-1]:]
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
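
The slice `outputs[:, inputs.input_ids.shape[-1]:]` in the snippet above drops the prompt so that only newly generated tokens are decoded. With `padding_side="left"`, every prompt in a batch ends at the same column, so a single slice works for all rows. The following list-based sketch illustrates the idea without loading a model; the pad id, token values, and helper names are made up for the example.

```python
PAD = 0  # hypothetical pad token id, for illustration only


def left_pad(batch, pad_id=PAD):
    """Pad every sequence on the left so all rows share the same length."""
    width = max(len(seq) for seq in batch)
    return [[pad_id] * (width - len(seq)) + list(seq) for seq in batch]


def strip_prompt(generated, prompt_width):
    """Drop the first `prompt_width` columns, keeping only newly generated tokens."""
    return [row[prompt_width:] for row in generated]


prompts = [[11, 12, 13], [21]]
padded = left_pad(prompts)
prompt_width = len(padded[0])

# Stand-in for generation: pretend two new tokens were appended to each row.
generated = [padded[0] + [101, 102], padded[1] + [201, 202]]
new_tokens = strip_prompt(generated, prompt_width)
```

With right padding, the pad tokens would sit between the shorter prompt and its continuation, and a single column offset would no longer separate prompt from output for every row.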