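HyperCLOVAX-SEED-Vision-Instruct-3B: a model for visual question answering, chart interpretation, video understanding, and similar tasks. It is the first open-source vision-language model from Korea, uses a lightweight architecture optimized for computational efficiency, supports Korean input, and outperforms open-source models of the same size on the relevant benchmarks.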


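Overview

HyperCLOVAX-SEED-Vision-Instruct-3B is a model developed by NAVER. It is built on NAVER's proprietary backbone model and fine-tuned through post-training. It understands text and images and generates text.

The model is designed primarily around a lightweight architecture that optimizes computational efficiency. For visual understanding, it can handle visual question answering (VQA), chart and diagram interpretation, and even content comprehension. HyperCLOVAX-SEED-Vision-Instruct-3B aims for a Pareto-optimal balance tuned specifically for Korean, and at inference time it delivers competitive performance while using fewer visual tokens than other models of similar size.

In particular, the model has a relative advantage on Korean-language input and outperforms similarly sized open-source models on the relevant benchmarks. As the first open-source vision-language model in Korea with visual understanding capability, it is expected to contribute significantly to strengthening Korea's sovereign AI capabilities.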

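Basic Information

Model architecture: LLaVA-based vision-language model
- LLM module: Transformer-based architecture (dense model)
- Vision encoder: SigLIP-based architecture with an input resolution of 378x378 pixels per grid.
- Vision-language connector: C-Abstractor-based architecture with the AnyRes mechanism, supporting up to 1.29M total pixels across 9 grids.
Parameter count: 3.2B (LLM module) + 0.43B (vision module)
Input/output format: Text + Image + Video / Text
Context length: 16k
Knowledge cutoff date: the model was trained on data collected before August 2024.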
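
The pixel budget and the total parameter count follow directly from the figures listed above. The short check below is only an arithmetic sketch; the constant names are illustrative and do not come from the model's code.

```python
# Figures taken from the specification list above.
GRID_SIDE_PX = 378        # each grid is 378x378 pixels
MAX_GRIDS = 9             # AnyRes covers up to 9 grids
LLM_PARAMS = 3.2e9        # LLM module
VISION_PARAMS = 0.43e9    # vision module

pixels_per_grid = GRID_SIDE_PX * GRID_SIDE_PX
max_pixels = pixels_per_grid * MAX_GRIDS
total_params = LLM_PARAMS + VISION_PARAMS

print(f"pixels per grid: {pixels_per_grid:,}")
print(f"max pixels over {MAX_GRIDS} grids: {max_pixels:,} (~{max_pixels / 1e6:.2f}M)")
print(f"total parameters: ~{total_params / 1e9:.2f}B")
```

Nine grids of 378x378 pixels give 1,285,956 pixels, which rounds to the 1.29M quoted above, and the two modules together come to roughly 3.63B parameters.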

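Training

Text

Securing high-quality data remains essential in the post-training stage, but creating or revising large-scale datasets by hand is severely limited by cost and resources. Tasks that require domain expertise are also hard to handle and carry a high risk of human error. To overcome these challenges, we used an automated validation system powered by HyperCLOVA X, which improved data quality, streamlined the training process, and ultimately raised overall model performance. As a result, the model showed marked improvements in domains with definitive answers, such as mathematics and coding.

Reducing data collection costs matters, but finding an efficient training strategy is equally critical. HyperCLOVAX-SEED-Vision-Instruct-3B was developed starting from HyperCLOVAX-SEED-Text-Base-3B, applying both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) based on GRPO, an online reinforcement learning algorithm.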
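
The card names GRPO but does not describe the reward design or the training code. As background only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: several responses are sampled for the same prompt, and each reward is standardized against the group. The function name, the example rewards, the `eps` term, and the use of the population standard deviation are all assumptions made for illustration, not details of NAVER's implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses sampled for the same prompt.

    Hypothetical helper: eps guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example group: two responses judged correct (1.0) and two judged incorrect (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group mean get a positive advantage and those below get a negative one, so no separate value model is needed to provide a baseline.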

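Vision

Vision understanding, in which the model takes an image and a question as input and generates a text answer, was not part of the initial design of HyperCLOVA X. The model architecture was therefore carefully designed to add the ability to handle vision tasks such as image-based question answering (VQA) and chart and diagram interpretation without compromising the existing performance of the HCX LLM. Particular attention was paid to handling auxiliary information in the input, especially with the context length in mind.

Although HyperCLOVAX-SEED-Vision-Instruct-3B is a lightweight model, it can perform basic image VQA tasks and even supports OCR-free processing. A key focus for this 3B-parameter model was the efficiency of video input tokens. Because input token length directly affects computational cost, the number of tokens extracted per frame was carefully tuned so that video understanding works with as few tokens as possible. In addition, vision-specific V-RLHF data was used during the RLHF training stage to strengthen the model's learning, just as in the text domain.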

Benchmarks

Text

Model	KMMLU (5-shot, acc)	HAE-RAE (5-shot, acc)	CLiCK (5-shot, acc)	KoBEST (5-shot, acc)
HyperCLOVAX-SEED-Text-Base-3B	0.4847	0.7635	0.6386	0.7792
HyperCLOVAX-SEED-Vision-Instruct-3B	0.4422	0.6499	0.5599	0.7180
Qwen2.5-3B-instruct	0.4451	0.6031	0.5649	0.7053
gemma-3-4b-it	0.3895	0.6059	0.5303	0.7262
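
One way to read the table above is to compute per-benchmark score differences between two rows. The sketch below copies the scores verbatim from the table; the `deltas` helper is hypothetical and exists only for this illustration.

```python
benchmarks = ["KMMLU", "HAE-RAE", "CLiCK", "KoBEST"]

# Scores copied from the text benchmark table (5-shot accuracy).
scores = {
    "HyperCLOVAX-SEED-Text-Base-3B": [0.4847, 0.7635, 0.6386, 0.7792],
    "HyperCLOVAX-SEED-Vision-Instruct-3B": [0.4422, 0.6499, 0.5599, 0.7180],
    "Qwen2.5-3B-instruct": [0.4451, 0.6031, 0.5649, 0.7053],
    "gemma-3-4b-it": [0.3895, 0.6059, 0.5303, 0.7262],
}

def deltas(model, baseline):
    """Return model score minus baseline score for each benchmark."""
    return {
        name: round(m - b, 4)
        for name, m, b in zip(benchmarks, scores[model], scores[baseline])
    }

vs_qwen = deltas("HyperCLOVAX-SEED-Vision-Instruct-3B", "Qwen2.5-3B-instruct")
vs_base = deltas("HyperCLOVAX-SEED-Vision-Instruct-3B", "HyperCLOVAX-SEED-Text-Base-3B")
print("vs Qwen2.5-3B-instruct:", vs_qwen)
print("vs Text-Base-3B:", vs_base)
```

On these numbers the vision-instruct model leads Qwen2.5-3B-instruct on HAE-RAE and KoBEST and trails it slightly on KMMLU and CLiCK, while scoring below its own text-only base model on all four benchmarks.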

Vision

Model Name	Max Token Count per Video	VideoMME (Ko)	NAVER-TV-CLIP (Ko)	VideoChatGPT (Ko)	PerceptionTest (En)	ActivityNet-QA (En)	KoNet (Ko)	MMBench-Val (En)	TextVQA-Val (En)	Korean VisIT-Bench (Ko)	Image (4 benchmarks)	Video (5 benchmarks)	All (9 benchmarks)
HyperCLOVAX-SEED-Vision-Instruct-3B	1856 tokens, 108 frames	48.2	61.0	53.6	55.2	50.6	69.2	81.8	79.2	37.0	46.68	53.70	59.54
HyperCLOVAX-SEED-Vision-Instruct-3B (without OCR)	1856 tokens, 108 frames	48.2	61.0	53.6	55.2	50.6	36.6	80.7	76.0	43.5	56.74	53.70	55.05
Qwen-2.5-VL-3B	24576 tokens, 768 frames	55.1	48.3	45.6	66.9	55.7	58.3	84.3	79.6	81.5	59.35	54.31	56.55
Qwen-2.5-VL-3B (2000 tokens)	2000 tokens, 128 frames	50.3	43.9	44.3	58.3	54.2	58.5	84.3	79.3	15.7	59.50	50.18	54.33
Qwen-2.5-VL-7B	24576 tokens, 768 frames	60.6	66.7	51.8	70.5	56.6	68.4	88.3	84.9	85.6	69.34	61.23	64.84
Gemma-3-4B	4096 tokens, 16 frames	45.4	36.8	57.1	50.6	46.3	25.0	79.2	58.9	32.3	48.91	47.24	47.98
GPT4V (gpt-4-turbo-2024-04-09)	Unknown, original image, 8 frames	49.1	75.0	55.5	57.4	45.7	38.7	84.2	60.4	52.0	58.88	51.59	54.83
GPT4o (gpt-4o-2024-08-06)	Unknown, 512 resize, 128 frames	61.6	66.6	61.8	50.2	41.7	60.6	84.2	73.2	50.5	67.15	56.42	61.19
InternV-2-2B	4096 tokens, 16 frames	28.9	21.1	40.2	50.5	50.3	3.3	79.3	75.1	51.1	39.74	38.19	38.88
InternV-2-4B	4096 tokens, 16 frames	33.8	36.0	22.8	54.2	52.0	22.7	83.0	76.9	51.6	46.11	39.75	42.58
InternV-2-8B	4096 tokens, 16 frames	43.7	41.2	32.4	58.5	53.2	28.5	86.6	79.0	97.0	50.32	45.79	47.81
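
The "Max Token Count per Video" column gives a rough sense of how many visual tokens each model spends per frame. The sketch below simply divides the token budget by the frame count for the rows where both are known. This is an upper-bound estimate under the assumption that the whole budget is spread evenly over the frames; the actual per-frame token counts are not stated in this card.

```python
# (max tokens per video, frames) copied from the vision benchmark table.
budgets = {
    "HyperCLOVAX-SEED-Vision-Instruct-3B": (1856, 108),
    "Qwen-2.5-VL-3B": (24576, 768),
    "Qwen-2.5-VL-3B (2000 tokens)": (2000, 128),
    "Gemma-3-4B": (4096, 16),
}

# Average budget per frame, assuming the budget is split evenly across frames.
tokens_per_frame = {
    name: tokens / frames for name, (tokens, frames) in budgets.items()
}

for name, tpf in tokens_per_frame.items():
    print(f"{name}: {tpf:.1f} tokens/frame")
```

Under that assumption, HyperCLOVAX-SEED-Vision-Instruct-3B averages about 17 tokens per frame, compared with 32 for the default Qwen-2.5-VL-3B setting and 256 for Gemma-3-4B.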

Dependencies

Example


from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device="cuda")
preprocessor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LLM Example
# It is recommended to use the chat template with HyperCLOVAX models.
# Using the chat template allows you to easily format your input in ChatML style.
chat = [
        {"role": "system", "content": "you are helpful assistant!"},
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
        {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
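# add_generation_prompt=True appends the assistant header so the model answers the last user turn.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt", tokenize=True, add_generation_prompt=True)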
input_ids = input_ids.to(device="cuda")

# Please adjust parameters like top_p appropriately for your use case.
output_ids = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=True,
        top_p=0.6,
        temperature=0.5,
        repetition_penalty=1.0,
)
print("=" * 80)
print("LLM EXAMPLE")
print(tokenizer.batch_decode(output_ids)[0])
print("=" * 80)

# VLM Example
# For image and video inputs, you can use url, local_path, base64, or bytes.
vlm_chat = [
        {"role": "system", "content": {"type": "text", "text": "System Prompt"}},
        {"role": "user", "content": {"type": "text", "text": "User Text 1"}},
        {
                "role": "user",
                "content": {
                        "type": "image",
                        "filename": "tradeoff_sota.png",
                        "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
                        "ocr": "List the words in the image in raster order. Even if the word order feels unnatural for reading, the model will handle it as long as it follows raster order.",
                        "lens_keywords": "Gucci Ophidia, cross bag, Ophidia small, GG, Supreme shoulder bag",
                        "lens_local_keywords": "[0.07, 0.21, 0.92, 0.90] Gucci Ophidia",
                }
        },
        {
                "role": "user",
                "content": {
                        "type": "image",
                        "filename": "tradeoff.png",
                        "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
                }
        },
        {"role": "assistant", "content": {"type": "text", "text": "Assistant Text 1"}},
        {"role": "user", "content": {"type": "text", "text": "User Text 2"}},
        {
                "role": "user",
                "content": {
                        "type": "video",
                        "filename": "rolling-mist-clouds.mp4",
                        "video": "freenaturestock-rolling-mist-clouds.mp4",
                }
        },
        {"role": "user", "content": {"type": "text", "text": "User Text 3"}},
]

new_vlm_chat, all_images, is_video_list = preprocessor.load_images_videos(vlm_chat)
preprocessed = preprocessor(all_images, is_video_list=is_video_list)
input_ids = tokenizer.apply_chat_template(
        new_vlm_chat, return_tensors="pt", tokenize=True, add_generation_prompt=True,
)

output_ids = model.generate(
        input_ids=input_ids.to(device="cuda"),
        max_new_tokens=8192,
        do_sample=True,
        top_p=0.6,
        temperature=0.5,
        repetition_penalty=1.0,
        **preprocessed,
)
print(tokenizer.batch_decode(output_ids)[0])

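For the best image understanding performance, it is recommended to include additional information such as optical character recognition (OCR) results and entity recognition (Lens) results. The usage examples provided assume that OCR and Lens results are already available. Supplying input in this format can significantly improve output quality.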
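
When OCR or Lens results are available only for some images, it can help to build each image turn with a small helper that attaches the optional fields only when they exist. The sketch below is a minimal illustration: `make_image_turn` and `format_local_keyword` are hypothetical helpers, not part of the model's processor, and the field names (`ocr`, `lens_keywords`, `lens_local_keywords`) are taken from the example above. The four numbers in `lens_local_keywords` are assumed to be a bounding box; the card does not specify the coordinate convention.

```python
def make_image_turn(image, filename, ocr=None, lens_keywords=None, lens_local_keywords=None):
    """Build one user turn for an image, adding OCR/Lens hints only when provided."""
    content = {"type": "image", "filename": filename, "image": image}
    optional = {
        "ocr": ocr,
        "lens_keywords": lens_keywords,
        "lens_local_keywords": lens_local_keywords,
    }
    content.update({key: value for key, value in optional.items() if value is not None})
    return {"role": "user", "content": content}

def format_local_keyword(box, label):
    """Format four box values and a label in the same style as the example string."""
    coords = ", ".join(f"{value:.2f}" for value in box)
    return f"[{coords}] {label}"

# An image with hints, and one without.
with_hints = make_image_turn(
    "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
    "tradeoff_sota.png",
    lens_keywords="Gucci Ophidia, cross bag",
    lens_local_keywords=format_local_keyword((0.07, 0.21, 0.92, 0.90), "Gucci Ophidia"),
)
plain = make_image_turn(
    "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
    "tradeoff.png",
)
print(with_hints["content"]["lens_local_keywords"])
print(sorted(plain["content"]))
```

The resulting dictionaries have the same shape as the image entries in `vlm_chat` and can be placed in that list alongside the text turns.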
