
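HyperCLOVAX-SEED-Vision-Instruct-3B is a model developed by NAVER, built on its proprietary backbone model and fine-tuned through post-training. It understands both text and images and generates text.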
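The model is designed primarily around a lightweight architecture that optimizes computational efficiency. For visual understanding, it handles visual question answering (VQA), chart interpretation, and even general content comprehension. HyperCLOVAX-SEED-Vision-Instruct-3B aims for a Pareto-optimal balance tuned for Korean, and at inference time it delivers competitive performance with fewer visual tokens than other models of similar size.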
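In particular, the model is relatively strong on Korean-language input and outperforms similarly sized open-source models on related benchmarks. As the first open-source vision-language model in Korea with visual understanding capabilities, it is expected to contribute significantly to strengthening Korea's sovereign AI capabilities.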
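High-quality data remains essential even during post-training, but having humans create or revise large-scale datasets is severely limited by cost and resources. Tasks that require domain expertise are also hard to handle, and the risk of human error is high. To overcome these challenges, we used an automated validation system powered by HyperCLOVA X, which improved data quality, streamlined the training process, and ultimately raised overall model performance. As a result, the model improved significantly in areas with definitive answers, such as mathematics and coding.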
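The source does not describe how the validation system works internally, so the following is only a minimal sketch of the general idea: run every candidate training sample through a verifier and keep the ones that pass. The names `filter_verified` and `exact_answer_match` are hypothetical, and the string comparison is a stand-in for the HyperCLOVA X-powered check, which is presumably far more capable.

```python
def filter_verified(samples, verifier):
    """Keep only the samples that the verifier accepts."""
    return [sample for sample in samples if verifier(sample)]


def exact_answer_match(sample):
    """Stand-in verifier (an assumption, not the real system):
    accept a sample when the last line of the response equals the reference answer."""
    lines = sample["response"].strip().splitlines()
    final_line = lines[-1].strip() if lines else ""
    return final_line == sample["reference"].strip()


# Hypothetical math samples with a definitive reference answer.
samples = [
    {"prompt": "2+2?", "response": "Adding the numbers gives\n4", "reference": "4"},
    {"prompt": "3*3?", "response": "The result is\n6", "reference": "9"},
]
kept = filter_verified(samples, exact_answer_match)
print([s["prompt"] for s in kept])
```

Because the verifier is passed in as a callable, the string check could be swapped for a model-based judge without changing the filtering loop.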
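While reducing the cost of data collection is important, finding efficient training strategies is just as critical. HyperCLOVAX-SEED-Vision-Instruct-3B was developed starting from HyperCLOVAX-SEED-Text-Base-3B, applying both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) based on an online reinforcement learning algorithm called GRPO.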
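For readers unfamiliar with GRPO, its core idea is to score several sampled responses to the same prompt and normalize each reward against the group's mean and standard deviation, rather than training a separate value model. The sketch below shows only that normalization step; it is not NAVER's training code, and the function name, the `eps` term, and the use of the population standard deviation are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    eps avoids division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by some reward function
# (for example, 1.0 if a math answer is correct and 0.0 otherwise).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average receive a positive advantage and the rest a negative one, which fits tasks with definitive answers where a reward can be computed automatically.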
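The vision understanding feature, in which the model receives images and a question as input and generates a text answer, was not part of HyperCLOVA X's initial design. The model architecture was therefore carefully designed to add vision capabilities such as image-based question answering (VQA) and chart interpretation without compromising the existing performance of the HCX LLM. Special attention was given to handling auxiliary information in the input, particularly with context length in mind.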
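Although HyperCLOVAX-SEED-Vision-Instruct-3B is a lightweight model, it performs basic image VQA tasks and even supports OCR-free processing. A key focus for this 3B-parameter model was the efficiency of video input tokens. Because input token length directly affects computational cost, the number of tokens extracted per frame was carefully tuned to enable efficient video understanding with as few tokens as possible. In addition, the RLHF training stage used vision-specific V-RLHF data to strengthen the model's learning, just as in the text domain.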
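One common way to keep video token counts low is to cap the number of frames passed to the model. The source does not say how frames are selected, so the sketch below assumes plain uniform sampling. The function name is hypothetical, and the default of 108 is taken from the frame cap listed in the evaluation table.

```python
def sample_frame_indices(total_frames, max_frames=108):
    """Pick at most max_frames evenly spaced frame indices (uniform sampling is an assumption)."""
    if total_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clips keep every frame.
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


# A 3000-frame clip is reduced to 108 evenly spaced frames.
indices = sample_frame_indices(3000)
print(len(indices), indices[:3])
```

With a fixed frame cap, the total video token count is bounded by the cap times the tokens extracted per frame, which is the quantity the paragraph above describes tuning.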
| Model | KMMLU (5-shot, accuracy) | HAE-RAE (5-shot, accuracy) | CLiCK (5-shot, accuracy) | KoBEST (5-shot, accuracy) |
|---|---|---|---|---|
| HyperCLOVAX-SEED-Text-Base-3B | 0.4847 | 0.7635 | 0.6386 | 0.7792 |
| HyperCLOVAX-SEED-Vision-Instruct-3B | 0.4422 | 0.6499 | 0.5599 | 0.7180 |
| Qwen2.5-3B-instruct | 0.4451 | 0.6031 | 0.5649 | 0.7053 |
| gemma-3-4b-it | 0.3895 | 0.6059 | 0.5303 | 0.7262 |

| Model Name | Max Token Count per Video | VideoMME (Korean) | NAVER-TV-CLIP (Korean) | VideoChatGPT (Korean) | PerceptionTest (English) | ActivityNet-QA (English) | KoNet (Korean) | MMBench-Val (English) | TextVQA-Val (English) | Korean VisIT-Bench (Korean) | Image (4 benchmarks) | Video (5 benchmarks) | All (9 benchmarks) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HyperCLOVAX-SEED-Vision-Instruct-3B | 1856 tokens, 108 frames | 48.2 | 61.0 | 53.6 | 55.2 | 50.6 | 69.2 | 81.8 | 79.2 | 37.0 | 46.68 | 53.70 | 59.54 |
| HyperCLOVAX-SEED-Vision-Instruct-3B (without OCR) | 1856 tokens, 108 frames | 48.2 | 61.0 | 53.6 | 55.2 | 50.6 | 36.6 | 80.7 | 76.0 | 43.5 | 56.74 | 53.70 | 55.05 |
| Qwen-2.5-VL-3B | 24576 tokens, 768 frames | 55.1 | 48.3 | 45.6 | 66.9 | 55.7 | 58.3 | 84.3 | 79.6 | 81.5 | 59.35 | 54.31 | 56.55 |
| Qwen-2.5-VL-3B (2000 tokens) | 2000 tokens, 128 frames | 50.3 | 43.9 | 44.3 | 58.3 | 54.2 | 58.5 | 84.3 | 79.3 | 15.7 | 59.50 | 50.18 | 54.33 |
| Qwen-2.5-VL-7B | 24576 tokens, 768 frames | 60.6 | 66.7 | 51.8 | 70.5 | 56.6 | 68.4 | 88.3 | 84.9 | 85.6 | 69.34 | 61.23 | 64.84 |
| Gemma-3-4B | 4096 tokens, 16 frames | 45.4 | 36.8 | 57.1 | 50.6 | 46.3 | 25.0 | 79.2 | 58.9 | 32.3 | 48.91 | 47.24 | 47.98 |
| GPT4V (gpt-4-turbo-2024-04-09) | Unknown, original image, 8 frames | 49.1 | 75.0 | 55.5 | 57.4 | 45.7 | 38.7 | 84.2 | 60.4 | 52.0 | 58.88 | 51.59 | 54.83 |
| GPT4o (gpt-4o-2024-08-06) | Unknown, 512 resize, 128 frames | 61.6 | 66.6 | 61.8 | 50.2 | 41.7 | 60.6 | 84.2 | 73.2 | 50.5 | 67.15 | 56.42 | 61.19 |
| InternV-2-2B | 4096 tokens, 16 frames | 28.9 | 21.1 | 40.2 | 50.5 | 50.3 | 3.3 | 79.3 | 75.1 | 51.1 | 39.74 | 38.19 | 38.88 |
| InternV-2-4B | 4096 tokens, 16 frames | 33.8 | 36.0 | 22.8 | 54.2 | 52.0 | 22.7 | 83.0 | 76.9 | 51.6 | 46.11 | 39.75 | 42.58 |
| InternV-2-8B | 4096 tokens, 16 frames | 43.7 | 41.2 | 32.4 | 58.5 | 53.2 | 28.5 | 86.6 | 79.0 | 97.0 | 50.32 | 45.79 | 47.81 |
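To put the "Max Token Count per Video" column in perspective, the short calculation below divides each model's token cap by its frame cap. This is a rough upper-bound average only: it assumes both caps are reached together and that tokens are spread evenly across frames, neither of which the table states. The dictionary simply copies four rows from the table.

```python
# (max tokens per video, max frames) copied from the table above.
budgets = {
    "HyperCLOVAX-SEED-Vision-Instruct-3B": (1856, 108),
    "Qwen-2.5-VL-3B": (24576, 768),
    "Qwen-2.5-VL-3B (2000 tokens)": (2000, 128),
    "Gemma-3-4B": (4096, 16),
}


def tokens_per_frame(max_tokens, max_frames):
    """Average token budget per frame if both caps are hit at once (an assumption)."""
    return max_tokens / max_frames


per_frame = {
    name: round(tokens_per_frame(tokens, frames), 1)
    for name, (tokens, frames) in budgets.items()
}
for name, value in per_frame.items():
    print(f"{name}: about {value} tokens per frame")
```

Under these assumptions, HyperCLOVAX-SEED-Vision-Instruct-3B averages roughly 17 tokens per frame, compared with 32 for Qwen-2.5-VL-3B at its full budget and 256 for Gemma-3-4B.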
```python
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

model_name = "naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B"

# trust_remote_code=True is required because the model and preprocessor
# are implemented by custom code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device="cuda")
preprocessor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LLM Example
# It is recommended to use the chat template with HyperCLOVAX models.
# Using the chat template allows you to easily format your input in ChatML style.
chat = [
    {"role": "system", "content": "you are helpful assistant!"},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt", tokenize=True)
input_ids = input_ids.to(device="cuda")

# Please adjust parameters like top_p appropriately for your use case.
output_ids = model.generate(
    input_ids,
    max_new_tokens=64,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
)
print("=" * 80)
print("LLM EXAMPLE")
print(tokenizer.batch_decode(output_ids)[0])
print("=" * 80)

# VLM Example
# For image and video inputs, you can use url, local_path, base64, or bytes.
vlm_chat = [
    {"role": "system", "content": {"type": "text", "text": "System Prompt"}},
    {"role": "user", "content": {"type": "text", "text": "User Text 1"}},
    {
        "role": "user",
        "content": {
            "type": "image",
            "filename": "tradeoff_sota.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff_sota.png?raw=true",
            "ocr": "List the words in the image in raster order. Even if the word order feels unnatural for reading, the model will handle it as long as it follows raster order.",
            "lens_keywords": "Gucci Ophidia, cross bag, Ophidia small, GG, Supreme shoulder bag",
            "lens_local_keywords": "[0.07, 0.21, 0.92, 0.90] Gucci Ophidia",
        },
    },
    {
        "role": "user",
        "content": {
            "type": "image",
            "filename": "tradeoff.png",
            "image": "https://github.com/naver-ai/rdnet/blob/main/resources/images/tradeoff.png?raw=true",
        },
    },
    {"role": "assistant", "content": {"type": "text", "text": "Assistant Text 1"}},
    {"role": "user", "content": {"type": "text", "text": "User Text 2"}},
    {
        "role": "user",
        "content": {
            "type": "video",
            "filename": "rolling-mist-clouds.mp4",
            "video": "freenaturestock-rolling-mist-clouds.mp4",
        },
    },
    {"role": "user", "content": {"type": "text", "text": "User Text 3"}},
]

# Load the referenced media, then turn it into model-ready tensors.
new_vlm_chat, all_images, is_video_list = preprocessor.load_images_videos(vlm_chat)
preprocessed = preprocessor(all_images, is_video_list=is_video_list)
input_ids = tokenizer.apply_chat_template(
    new_vlm_chat, return_tensors="pt", tokenize=True, add_generation_prompt=True,
)

output_ids = model.generate(
    input_ids=input_ids.to(device="cuda"),
    max_new_tokens=8192,
    do_sample=True,
    top_p=0.6,
    temperature=0.5,
    repetition_penalty=1.0,
    **preprocessed,
)
print(tokenizer.batch_decode(output_ids)[0])
```
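The comments in the script mention that the chat template formats input in ChatML style. As a rough illustration of what that means, the sketch below renders a message list in the generic ChatML layout. The helper `render_chatml` is hypothetical, and the `<|im_start|>` and `<|im_end|>` markers are the usual ChatML delimiters; the template that actually ships with this tokenizer may differ in its special tokens or whitespace.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in the generic ChatML layout (illustrative, not the model's own template)."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model answers instead of continuing the user turn.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "system", "content": "you are helpful assistant!"},
    {"role": "user", "content": "Hello, how are you?"},
]
prompt = render_chatml(chat, add_generation_prompt=True)
print(prompt)
```

To inspect the real template output, `tokenizer.apply_chat_template(chat, tokenize=False)` returns the formatted string instead of token IDs.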