meituan-longcat/LongCat-Next

LongCat-Next


Tech Report 📄

Model Introduction

[Figure: evaluation]

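We develop LongCat-Next, a native multimodal model that handles text, vision, and audio under a single autoregressive objective, introducing only minimal inductive bias beyond the language paradigm. As an industrial-grade foundation model at the A3B scale, it performs strongly in visual perception, content creation, and speech interaction, and achieves strong results across a wide range of multimodal benchmarks. In particular, by using semantically complete discrete representations, it breaks through the long-standing performance bottleneck of discrete visual modeling on understanding tasks and offers a unified solution for visual understanding and generation. This result suggests that discrete tokens can serve as a general representation of multimodal signals and can be deeply internalized within a single discrete embedding space. We further analyze this unified discrete training paradigm through extensive experiments and report several interesting findings.

As a meaningful step toward native multimodality, we open-source LongCat-Next and its tokenizers, hoping to encourage further research and development in the community.

Key Features

This work addresses the fundamental obstacles to native multimodality through a simplicity-first design that treats vision and audio as intrinsic extensions of language. To that end, we propose LongCat-Next, a discrete native multimodal model that reaches industrial-grade performance within a discrete framework while remaining highly competitive across a broad range of specialized domains. Built on the LongCat-Flash-Lite MoE backbone (A3B), it acts as a multi-task learner that unifies language, vision, and audio in a single discrete framework. Our main contributions are as follows: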

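🌟 Discrete Native Autoregressive Paradigm (DiNA)

We introduce DiNA, a unified paradigm that extends next-token prediction from language to native multimodality by internalizing multiple modalities into a shared token space. It simplifies multimodal modeling by building a modality-aware tokenizer-detokenizer pair for each modality and reusing the mature training infrastructure of large language models.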
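As a rough illustration of what a shared token space can look like, the sketch below maps modality-local codebook indices into one contiguous ID range by adding a per-modality offset. The vocabulary sizes and the names `MODALITY_SIZES`, `to_shared_id`, and `from_shared_id` are made up for this example; the actual LongCat-Next vocabulary layout is described in the technical report and may differ.

```python
from itertools import accumulate

# Hypothetical per-modality vocabulary sizes (illustrative only,
# not the real LongCat-Next vocabulary).
MODALITY_SIZES = {"text": 1000, "vision": 512, "audio": 256}

# Starting offset of each modality inside the shared ID range.
_starts = [0, *accumulate(MODALITY_SIZES.values())]
OFFSETS = dict(zip(MODALITY_SIZES, _starts))
SHARED_VOCAB_SIZE = _starts[-1]


def to_shared_id(modality: str, local_id: int) -> int:
    """Map a modality-local index to an ID in the shared token space."""
    size = MODALITY_SIZES[modality]
    if not 0 <= local_id < size:
        raise ValueError(f"{local_id} is out of range for {modality!r}")
    return OFFSETS[modality] + local_id


def from_shared_id(shared_id: int) -> tuple[str, int]:
    """Recover (modality, local index) from a shared ID."""
    for modality, size in MODALITY_SIZES.items():
        start = OFFSETS[modality]
        if start <= shared_id < start + size:
            return modality, shared_id - start
    raise ValueError(f"{shared_id} is outside the shared vocabulary")
```

With a layout like this, a single output layer of size `SHARED_VOCAB_SIZE` can score text, vision, and audio tokens alike, which is what lets one next-token objective cover every modality.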

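🌟 Semantic Completeness of Discrete Visual Representations

We improve discrete visual modeling by combining a Semantic-Aligned Encoder (SAE) with Residual Vector Quantization (RVQ). This combination yields hierarchical discrete tokens that retain both semantic abstraction and fine-grained visual detail, overcoming the limitations of conventional representations.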
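For readers unfamiliar with residual vector quantization, here is a minimal, generic sketch of the idea: each stage quantizes what the previous stages left over, so early codes carry the coarse content and later codes add detail. This is textbook RVQ on toy two-dimensional vectors with hand-written codebooks, not the SAE + RVQ tokenizer from the technical report; in practice the codebooks are learned and operate on encoder features.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codebook entry closest to x (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the previous residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks: a coarse stage followed by a finer stage (made-up values).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes = rvq_encode([1.25, 1.0], CODEBOOKS)
reconstruction = rvq_decode(codes, CODEBOOKS)
```

The list `codes` holds one index per stage, which is the sense in which the resulting tokens are hierarchical.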

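🌟 Discrete Native-Resolution Vision Transformer (dNaViT)

By analogy with a language tokenizer, we propose dNaViT as a highly flexible, unified discrete interface for vision. It extracts semantic features as "visual words" and builds a hierarchical representation space that supports dynamic tokenization and detokenization. dNaViT integrates seamlessly with large language models without degrading performance.

🌟 Strong Visual Perception, Content Creation, and Speech Interaction in a Unified Model

Within the DiNA framework, visual understanding and generation are reformulated as two manifestations of the same prediction process, without sacrificing performance. This formulation bridges a long-standing architectural gap, introduces minimal interference between these traditionally competing objectives, and preserves core language capabilities. Notably, LongCat-Next is competitive with specialized understanding models while maintaining strong generation quality (especially for text rendering) even at a 28x compression ratio, and it also performs well in advanced speech understanding, low-latency voice conversation, and customizable voice cloning.

Please refer to our technical report for details!

Evaluation Results

[Figure: evaluation]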

Quick Start

To use LongCat-Next with transformers, you need at least 3 GPUs with 80 GB of memory each (for example, H100 or A100 80GB). We recommend the following environment:

  • python >= 3.10
  • torch >= 2.6
  • transformers >= 4.57.6
  • accelerate >= 1.10.0
# (Install python=3.10, ffmpeg<7, soundfile==0.13.1)
conda env create -f environment.yml -v

# (Install torch and other pip dependencies)
pip install -r requirements.txt && pip install -r requirements-post.txt --no-build-isolation
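After installation, a quick check that the installed packages meet the minimum versions listed above can save a failed model load. The sketch below uses only the standard library; the helper names `version_tuple`, `meets_minimum`, and `check_environment` are illustrative and not part of the repository.

```python
import re
from importlib import metadata

# Minimum versions taken from the list above.
MINIMUMS = {"torch": "2.6", "transformers": "4.57.6", "accelerate": "1.10.0"}


def version_tuple(version: str) -> tuple[int, ...]:
    """Turn '2.6.0+cu124' into (2, 6, 0); any non-numeric suffix is ignored."""
    numeric = re.match(r"\d+(?:\.\d+)*", version)
    if numeric is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in numeric.group().split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict[str, str]:
    """Return a short status string for each required package."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else f"needs >= {minimum}"
        report[name] = f"{installed} ({status})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

Comparing versions as integer tuples rather than strings matters here: as strings, "2.10" would sort before "2.6".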

Basic usage example:

  • Remember to update WEIGHT_PATH_TO_LONGCAT_NEXT in ./config.json, because the decoders are lazily loaded.
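If you prefer to script this edit, the sketch below replaces the placeholder wherever it occurs inside a string value of a JSON file. It assumes that `WEIGHT_PATH_TO_LONGCAT_NEXT` appears literally inside string values of `./config.json`; check the file first, since its exact layout is not shown here. The helper names are illustrative.

```python
import json
from pathlib import Path

PLACEHOLDER = "WEIGHT_PATH_TO_LONGCAT_NEXT"


def fill_placeholder(node, weight_path: str):
    """Recursively replace the placeholder in every string value."""
    if isinstance(node, str):
        return node.replace(PLACEHOLDER, weight_path)
    if isinstance(node, list):
        return [fill_placeholder(item, weight_path) for item in node]
    if isinstance(node, dict):
        return {key: fill_placeholder(value, weight_path) for key, value in node.items()}
    return node


def patch_config(config_file: str, weight_path: str) -> int:
    """Rewrite config_file in place; return how many placeholders were found."""
    path = Path(config_file)
    text = path.read_text(encoding="utf-8")
    count = text.count(PLACEHOLDER)
    if count:
        patched = fill_placeholder(json.loads(text), weight_path)
        path.write_text(
            json.dumps(patched, indent=2, ensure_ascii=False) + "\n",
            encoding="utf-8",
        )
    return count
```

A call such as `patch_config("./config.json", "/path/to/LongCat-Next")` (the path is a stand-in for your local weight directory) returns 0 when nothing needed replacing, so it is safe to run twice.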
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor

# Load model
model_name = "meituan-longcat/LongCat-Next"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, fix_mistral_regex=True)
model.text_tokenizer = tokenizer # Dynamic binding
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Set messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What book is this?<longcat_img_start>./assets/book.png<longcat_img_end>"}
]

# Apply chat-template
text_input = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(f"{text_input=}")

# Preprocessing
text_inputs, visual_inputs, audio_inputs = processor(text=text_input, return_tensors="pt")
text_inputs = text_inputs.to(model.device)
if visual_inputs is not None:
    visual_inputs = visual_inputs.to(model.device)
if audio_inputs is not None:
    audio_inputs = audio_inputs.to(model.device)

# Autoregressive (AR) generation
with torch.no_grad():
    outputs = model.generate(
        input_ids=text_inputs["input_ids"],
        visual_inputs=visual_inputs,
        audio_inputs=audio_inputs,
        return_dict_in_generate=True,
    )

# Text decoding
output_input_ids = outputs.sequences
text_output = tokenizer.decode(output_input_ids[0][len(text_inputs["input_ids"][0]):], skip_special_tokens=True)
print(f"{text_output=}")

# Images decoding
output_visual_ids = outputs.visual_ids
if output_visual_ids.size(0) > 0:
    image_path_list = model.model.decode_visual_ids_and_save(
        output_visual_ids,
        save_prefix="./output_image",
        **model.generation_config.visual_generation_config["custom_params"],
    )
    print(f"{image_path_list=}")

# Audio decoding
output_audio_text_ids = outputs.audio_text_ids
output_audio_ids = outputs.audio_ids
if output_audio_text_ids.size(-1) > 0:
    audio_text = tokenizer.decode(output_audio_text_ids[0], skip_special_tokens=True)
    print(f"{audio_text=}")
if output_audio_ids.size(0) > 0:
    audio_path_list = model.model.decode_audio_ids_and_save(
        output_audio_ids,
        save_prefix="./output_audio",
        **model.generation_config.audio_generation_config["custom_params"],
    )
    print(f"{audio_path_list=}")
Text - Tool Calling Example
from parse_model_response import parse_model_response

tools = [
    {
        "type": "function",
        "function": {
            "name": "func_add",
            "description": "Calculate the sum of two numbers",
            "parameters": {
                "type": "object",
                "properties": {
                    "x1": {"type": "number", "description": "The first addend"},
                    "x2": {"type": "number", "description": "The second addend"}
                },
                "required": ["x1", "x2"]
            }
        }
    }
]
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please tell me what is $$125679 + 234519$$?"},
    {
        "role": "assistant",
        "content": "I'll calculate the sum of 125679 and 234519 for you.",
        "tool_calls": [{"type": "function", "function": {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}}]
    },
    {"role": "tool", "name": "func_add", "content": '{"ans": 360198}'}
]

text_input = tokenizer.apply_chat_template(
    messages,
    tools=tools, # add tools here
    tokenize=False,
    add_generation_prompt=True,
)
print(f"{text_input=}")


# Preprocessing - AR - Text decoding
...

# Results parsing
parsed_message = parse_model_response(text_output.strip("\n"), tools)
print(f"{parsed_message=}")

See parse_model_response.py for the detailed implementation and examples.
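Once a response has been parsed, the requested tool still has to be executed and its result sent back as a `tool` message. The sketch below shows one way to do that, assuming the tool calls have the same shape as the `tool_calls` entry in the assistant message above; the actual return format of `parse_model_response` should be checked in parse_model_response.py. `TOOL_REGISTRY` and `run_tool_calls` are illustrative names.

```python
import json


def func_add(x1: float, x2: float) -> dict:
    """Local implementation of the `func_add` tool declared in `tools`."""
    return {"ans": x1 + x2}


# Registry mapping tool names to Python callables.
TOOL_REGISTRY = {"func_add": func_add}


def run_tool_calls(tool_calls: list) -> list:
    """Execute each call and return one `tool` message per result."""
    tool_messages = []
    for call in tool_calls:
        function = call["function"]
        name = function["name"]
        arguments = function["arguments"]
        if isinstance(arguments, str):  # tolerate arguments given as a JSON string
            arguments = json.loads(arguments)
        if name not in TOOL_REGISTRY:
            raise KeyError(f"model requested an unknown tool: {name}")
        result = TOOL_REGISTRY[name](**arguments)
        tool_messages.append(
            {"role": "tool", "name": name, "content": json.dumps(result)}
        )
    return tool_messages


# Same call as in the assistant message of the example above.
tool_calls = [
    {"type": "function",
     "function": {"name": "func_add", "arguments": {"x1": 125679, "x2": 234519}}}
]
tool_messages = run_tool_calls(tool_calls)
```

Appending `tool_messages` to `messages` and applying the chat template again gives the model the tool result for its final answer.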

Image - Understanding Example
# Simply replace the messages in the main example with the messages below.
messages = [
    {"role": "user", "content": "What book is this?<longcat_img_start>./assets/book.png<longcat_img_end>"}
]
Image - Generation Example
# Simply replace the messages in the main example with the messages below.
# Suffix user content with '<longcat_img_start>' to force image generation.
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "A small kitten sitting naturally on a moss-covered forest floor, centered in the frame, holding a rectangular wooden sign gently with its front paws resting over the top edge. The kitten has soft, fluffy fur, a natural relaxed posture, and a calm, curious expression with a slightly open mouth (not exaggerated), looking directly at the camera.\n\nThe sign is positioned firmly in front of the kitten\'s chest, supported by its paws, with realistic contact and no floating effect. The board reads \"LongCat-Next: Lexicalizing Modalities as Discrete Tokens\" in clean, sharp black text, perfectly legible.\n\nThe environment is a lush forest with tall trees, ferns, and soft green foliage. The ground is covered with moss and small plants. Background softly blurred with natural depth of field. Lighting is soft, diffused sunlight filtering through the trees, creating gentle highlights and shadows. Realistic photography style, natural colors, high detail, no cartoonish exaggeration.<longcat_img_start>"}
]
Audio - Audio-to-Text Example
# Simply replace the messages in the main example with the messages below.
messages = [
    {"role": "user", "content": "<longcat_audio_start>./assets/math1.wav<longcat_audio_end>"}
]
Audio - Audio-to-Audio Example
# Simply replace the messages in the main example with the messages below.
# Suffix user content with '<longcat_audiogen_start>' to force audio generation.
messages = [
    {"role": "system", "content": "Replicate the voice in the audio clip to formulate an answer:<longcat_audio_start>./assets/system_audio.wav<longcat_audio_end>"},
    {"role": "user", "content": "<longcat_audio_start>./assets/math1.wav<longcat_audio_end><longcat_audiogen_start>"}
]
Audio - Speech Synthesis Example
# Simply replace the messages in the main example with the messages below.
# Suffix user content with '<longcat_audiogen_start>' to force audio generation.
messages = [
    {"role": "system", "content": "Replicate the voice in the audio clip to formulate an answer:<longcat_audio_start>./assets/vc_zh3.wav<longcat_audio_end>"},
    {"role": "user", "content": "用这个声音合成以下内容:明天的meeting在三楼的Conference Room举行。<longcat_audiogen_start>"}
]
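The message strings in the examples above all follow the same pattern: optional text, then file paths wrapped in start/end tags, then an optional suffix that forces image or audio generation. A small helper can assemble them and avoid tag typos. The tag strings are taken from the examples; the function `build_content` and its parameters are illustrative and not part of the repository.

```python
from typing import Optional, Sequence

IMG_START, IMG_END = "<longcat_img_start>", "<longcat_img_end>"
AUDIO_START, AUDIO_END = "<longcat_audio_start>", "<longcat_audio_end>"
AUDIOGEN_START = "<longcat_audiogen_start>"


def build_content(
    text: str = "",
    images: Sequence[str] = (),
    audios: Sequence[str] = (),
    generate: Optional[str] = None,
) -> str:
    """Compose a message string: text, then media tags, then a generation suffix."""
    parts = [text]
    parts += [f"{IMG_START}{path}{IMG_END}" for path in images]
    parts += [f"{AUDIO_START}{path}{AUDIO_END}" for path in audios]
    if generate == "image":
        parts.append(IMG_START)       # suffix that forces image generation
    elif generate == "audio":
        parts.append(AUDIOGEN_START)  # suffix that forces audio generation
    elif generate is not None:
        raise ValueError("generate must be 'image', 'audio', or None")
    return "".join(parts)


# Rebuilds the user message of the audio-to-audio example.
messages = [
    {"role": "user",
     "content": build_content(audios=["./assets/math1.wav"], generate="audio")}
]
```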

[!Tip] We recommend the following sampling parameters for generation:

  • Text: {"max_new_tokens":2048,"do_sample":false}
  • Image - Understanding: {"max_new_tokens":1024,"do_sample":true,"temperature":0.4,"top_k":40,"top_p":0.85,"repetition_penalty":1.1}
  • Image - Generation: {"max_new_tokens":2048,"do_sample":false,"visual_generation_config":{"do_sample":true,"temperature":0.5,"top_p":0.75,"top_k":1024,"custom_params":{"cfg_scale":3,"token_h":37,"token_w":37,"anyres_prefix":"<longcat_img_token_size>{h} {w}</longcat_img_token_size>"}}}
  • Audio - Audio-to-Text: {"max_new_tokens":1024,"do_sample":true,"temperature":0.2,"top_k":20,"top_p":0.85,"repetition_penalty":1.1}
  • Audio - Audio-to-Audio / Speech Synthesis: {"max_new_tokens":2048,"do_sample":true,"temperature":0.2,"top_k":20,"top_p":0.85,"repetition_penalty":1.1,"audio_generation_config":{"audio_parallel_decoding":false,"do_sample":true,"temperature":0.5,"top_k":5,"top_p":0.85,"repetition_penalty":1.3,"custom_params":{"sampling_rate":24000,"wave_concat_overlap":1200}}}

Note that support for sampling parameters varies across inference frameworks. For transformers, the inference parameters are configured in ./generation_config.json.
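Because the transformers settings live in ./generation_config.json, switching tasks means editing that file. The sketch below merges one of the recommended presets into an existing JSON config while leaving unrelated keys untouched. It assumes the preset keys correspond directly to fields in generation_config.json, which should be verified against the file in the repository; `deep_merge` and `apply_preset` are illustrative names.

```python
import copy
import json

# Recommended image-generation settings, copied from the list above.
IMAGE_GENERATION = json.loads(
    '{"max_new_tokens":2048,"do_sample":false,"visual_generation_config":'
    '{"do_sample":true,"temperature":0.5,"top_p":0.75,"top_k":1024,'
    '"custom_params":{"cfg_scale":3,"token_h":37,"token_w":37,'
    '"anyres_prefix":"<longcat_img_token_size>{h} {w}</longcat_img_token_size>"}}}'
)


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_preset(config_file: str, preset: dict) -> dict:
    """Merge preset into the JSON file at config_file and write it back."""
    with open(config_file, encoding="utf-8") as handle:
        current = json.load(handle)
    merged = deep_merge(current, preset)
    with open(config_file, "w", encoding="utf-8") as handle:
        json.dump(merged, handle, indent=2, ensure_ascii=False)
        handle.write("\n")
    return merged
```

For example, `apply_preset("./generation_config.json", IMAGE_GENERATION)` would update the file before loading the model for image generation. A key-by-key merge is used so that nested blocks such as `custom_params` keep any fields the preset does not mention.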

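Deployment

We have completed a basic adaptation in SGLang to support LongCat-Next deployment. For more information, see this repository: meituan-longcat/LongCat-Next-inference

License

This repository, including the model weights and source code, is released under the MIT License.

Unless otherwise stated, any contribution to this repository is licensed under the MIT License. This license does not grant any rights to use Meituan trademarks or patents.

See the LICENSE file for details.

Usage Considerations

This model has not been specifically designed or comprehensively evaluated for every possible downstream application.

Developers should take into account the known limitations of large language models, including performance variations across languages, and carefully assess accuracy, safety, and fairness before deploying the model in sensitive or high-risk scenarios. Developers and downstream users are responsible for understanding and complying with all applicable laws and regulations relevant to their use case, including but not limited to data protection, privacy, and content safety requirements.

Nothing in this model card should be interpreted as altering or restricting the terms of the MIT License under which the model is released.

Citation

If you find our work useful, we kindly ask that you cite this project in your work.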

@misc{meituanlongcatteam2026longcatnextlexicalizingmodalitiesdiscrete,
      title={LongCat-Next: Lexicalizing Modalities as Discrete Tokens}, 
      author={Meituan LongCat Team},
      year={2026},
      eprint={2603.27538},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.27538}, 
}

Contact

If you have any questions, please contact us at longcat-team@meituan.com or open an issue.