KORMo-VL is a vision-language model developed from scratch by the KAIST MLP Lab (https://sites.google.com/view/aailab) and built on top of KORMo-10B. The system consists of two components: (1) a vision-language model (VLM) and (2) an image generation model.

The VLM is designed for Document UI Agent tasks, enabling structured understanding and manipulation of document-based interfaces. It retains general VLM capabilities and is trained on large-scale Korean document images in a world-model style, which makes it strong at understanding and writing Korean and English documents. As part of our commitment to open research, the KORMo team releases the training data (or generation pipelines) and training recipes used for our models to help researchers easily build Korean vision-language models.

1. LLM: KORMo-10B-sft
2. Vision Encoder: siglip-so400m
3. Languages: Korean / English
4. Training Data: Synthetic data + public datasets (approx. 3.4M text–image pairs)
5. Training Method: Image–text pretraining → Visual instruction tuning → Reinforcement learning
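
Try KORMo-VL in your own agent system when you need Korean document understanding and control. Actions such as `agent.click()`, `agent.write()`, and `agent.rewrite()` let it control documents, images, and websites, and additional fine-tuning can extend it with new actions such as `scroll()` and `drag()`.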
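
As a rough illustration of how an agent loop might consume such actions, the sketch below parses a single action call from model output and routes it to a handler. The output format is an assumption: this README does not specify the action syntax or arguments, so the one-call-per-string format, the keyword names (`x`, `y`, `text`), and the helpers `AgentAction`, `parse_action`, and `dispatch` are all hypothetical.

```python
import ast
from dataclasses import dataclass, field
from typing import Any, Callable

# Action names taken from this README; everything else here is hypothetical.
SUPPORTED_ACTIONS = {"click", "write", "rewrite"}


@dataclass
class AgentAction:
    name: str
    kwargs: dict = field(default_factory=dict)


def parse_action(text: str) -> AgentAction:
    """Parse a string such as 'agent.click(x=120, y=48)' (assumed format)."""
    call = ast.parse(text.strip(), mode="eval").body
    is_agent_call = (
        isinstance(call, ast.Call)
        and isinstance(call.func, ast.Attribute)
        and isinstance(call.func.value, ast.Name)
        and call.func.value.id == "agent"
    )
    if not is_agent_call:
        raise ValueError(f"not an agent call: {text!r}")
    name = call.func.attr
    if name not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {name}")
    if call.args:
        raise ValueError("this sketch accepts keyword arguments only")
    # literal_eval only accepts plain literals, so model output is never executed.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return AgentAction(name, kwargs)


def dispatch(action: AgentAction, handlers: dict[str, Callable[..., Any]]) -> Any:
    """Route a parsed action to a caller-supplied handler."""
    return handlers[action.name](**action.kwargs)


if __name__ == "__main__":
    handlers = {"click": lambda x, y: f"clicked at ({x}, {y})"}
    print(dispatch(parse_action("agent.click(x=120, y=48)"), handlers))
```

Under these assumptions, supporting a fine-tuned action such as `scroll` would only require adding its name to `SUPPORTED_ACTIONS` and registering a handler.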

- 📄 Document-Centric Vision-Language Understanding: Trained on large-scale Korean document images to enable strong document comprehension and generation.
- 🌏 World-Model Style Training: Document layouts and visual-text interactions are modeled as part of a world representation.
- 🇰🇷🇺🇸 Bilingual Support: Optimized for Korean and English document understanding.
- 🤖 Agent-Friendly Design: Supports interaction-based tasks such as UI control and document editing.

| Component | Details |
|---|---|
| LLM | KORMo-10B-sft |
| Vision Encoder | siglip-so400m |
| Languages | Korean / English |
| Training Data | Synthetic data + public datasets (approx. 3.4M text–image pairs) |

KORMo-VL is designed to operate as a Document UI Agent.

Example capabilities:

- `agent.click()`
- `agent.write()`
- `agent.rewrite()`

These actions allow the model to interact with documents, images, and web interfaces. The model can be further extended via fine-tuning to support additional actions such as:

- `scroll`
- `drag`

KORMo-VL is particularly suitable for:

| Benchmark | KORMo-VL | KORMo-10B | smolLM3-3B | olmo2-7B | olmo2-13B | kanana1.5-8B | qwen3-8B | llama3.1-8B | gemma3-4B | gemma3-12B |
|---|---|---|---|---|---|---|---|---|---|---|
| 🇺🇸 English Benchmarks | | | | | | | | | | |
| arc_challenge | 63.74 | 58.96 | 55.55 | 59.13 | 61.01 | 56.48 | 63.82 | 54.61 | 53.58 | 63.82 |
| arc_easy | 87.25 | 85.48 | 83.21 | 85.06 | 86.57 | 82.74 | 87.50 | 84.01 | 82.83 | 87.37 |
| boolq | 86.18 | 83.46 | 82.17 | 84.50 | 86.48 | 84.53 | 87.71 | 81.87 | 80.70 | 86.61 |
| copa | 93.00 | 93.00 | 91.00 | 92.00 | 93.00 | 88.00 | 92.00 | 93.00 | 89.00 | 95.00 |
| gpqa_main | 33.04 | 30.13 | 26.79 | 26.34 | 29.24 | 29.24 | 30.13 | 23.44 | 30.13 | 35.71 |
| hellaswag | 79.25 | 60.25 | 56.78 | 61.52 | 65.02 | 59.93 | 59.54 | 60.96 | 57.56 | 63.67 |
| mmlu | 69.53 | 67.96 | 61.37 | 62.81 | 66.85 | 63.73 | 76.95 | 65.03 | 59.60 | 73.58 |
| mmlu_global | 64.97 | 63.44 | 57.52 | 59.88 | 63.99 | 60.21 | 75.05 | 61.30 | 57.23 | 70.23 |
| mmlu_pro | 47.02 | 40.18 | 34.94 | 27.29 | 32.50 | 34.93 | 56.58 | 36.23 | 27.79 | 37.07 |
| mmlu_redux | 70.88 | 69.00 | 62.95 | 63.53 | 68.37 | 65.88 | 78.19 | 65.86 | 60.86 | 75.25 |
| openbookqa | 48.40 | 39.00 | 36.40 | 39.00 | 39.60 | 36.80 | 39.20 | 39.00 | 37.00 | 40.20 |
| piqa | 82.15 | 81.12 | 78.45 | 80.79 | 82.64 | 80.30 | 79.05 | 80.90 | 79.49 | 82.59 |
| social_iqa | - | 52.81 | 50.72 | 55.89 | 57.57 | 57.01 | 56.96 | 53.12 | 51.84 | 56.45 |
| English Avg. | 68.78 | 63.45 | 59.83 | 61.36 | 64.06 | 61.52 | 67.90 | 61.49 | 59.05 | 66.73 |
| 🇰🇷 Korean Benchmarks | | | | | | | | | | |
| click | 56.74 | 55.29 | 46.97 | 37.79 | 41.80 | 62.76 | 60.70 | 49.22 | 49.62 | 62.21 |
| csatqa | - | 38.00 | 26.67 | 19.33 | 24.67 | 44.67 | 52.00 | 28.67 | 28.67 | 31.33 |
| haerae | 70.03 | 68.29 | 55.82 | 31.62 | 37.58 | 80.75 | 67.19 | 53.25 | 60.68 | 74.34 |
| k2_eval | 85.19 | 84.89 | 75.23 | 49.54 | 63.43 | 84.72 | 84.72 | 76.62 | 76.39 | 85.42 |
| kobest | - | 75.05 | 69.13 | 57.27 | 59.02 | 81.93 | 80.05 | 70.55 | 69.33 | 77.70 |
| kobalt | - | 22.86 | 15.86 | 11.43 | 13.14 | 26.29 | 26.57 | 17.43 | 15.57 | 23.86 |
| kmmlu | 48.12 | 46.48 | 38.52 | 33.05 | 31.24 | 48.86 | 56.93 | 40.75 | 39.84 | 51.60 |
| mmlu_global (ko) | 57.43 | 55.16 | 44.15 | 34.00 | 36.95 | 52.65 | 61.95 | 46.34 | 46.33 | 59.68 |
| kr_clinical_qa | 75.89 | 77.32 | 53.97 | 48.33 | 46.22 | 65.84 | 80.00 | 63.54 | 60.00 | 77.22 |
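| Korean Avg. | 65.57 | 58.15 | 47.37 | 35.82 | 39.34 | 60.94 | 63.35 | 49.60 | 49.60 | 60.37 |

Note: "-" marks a benchmark with no reported KORMo-VL score. The KORMo-VL averages are taken over the reported benchmarks only, so they are not directly comparable to the other columns, which average over every row.
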
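
The two KORMo-VL averages can be reproduced from the rows above by averaging only the benchmarks that have a score. The following is a small check, not part of the official evaluation code: the values are copied from the KORMo-VL column, and the `reported_average` helper is a name introduced here for illustration.

```python
from statistics import mean

# KORMo-VL column from the table above; None stands for a "-" entry.
ENGLISH = {
    "arc_challenge": 63.74, "arc_easy": 87.25, "boolq": 86.18, "copa": 93.00,
    "gpqa_main": 33.04, "hellaswag": 79.25, "mmlu": 69.53, "mmlu_global": 64.97,
    "mmlu_pro": 47.02, "mmlu_redux": 70.88, "openbookqa": 48.40, "piqa": 82.15,
    "social_iqa": None,
}
KOREAN = {
    "click": 56.74, "csatqa": None, "haerae": 70.03, "k2_eval": 85.19,
    "kobest": None, "kobalt": None, "kmmlu": 48.12, "mmlu_global (ko)": 57.43,
    "kr_clinical_qa": 75.89,
}


def reported_average(scores: dict) -> float:
    """Mean over benchmarks that have a score, rounded to two decimals."""
    reported = [s for s in scores.values() if s is not None]
    return round(mean(reported), 2)


if __name__ == "__main__":
    print(reported_average(ENGLISH))  # 68.78
    print(reported_average(KOREAN))   # 65.57
```

Because the missing benchmarks are skipped rather than counted as zero, the KORMo-VL averages cover 12 English and 6 Korean benchmarks, while the baseline columns cover 13 and 9.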
| Benchmark | KORMo-VL | VARCO-VISION-2.0-14B (Qwen-based) | Qwen3-VL-8B Instruct |
|---|---|---|---|
| ChartQA | 86.68 | 82.6 | 82.96 |
| DocVQA | 87.16 | 87.46 | 95.76 |
| MMBench_DEV_EN | 79.8 | 83.67 | 84.52 |
| SEEDBench_IMG | 73.13 | 77.45 | 77.53 |
| K-DTCBench (Korean) | 82.08 | 80.00 | 87.5 |

```bash
uv pip install transformers==4.57.1 pillow torchvision
```

```python
from transformers import AutoConfig, AutoProcessor, AutoModelForImageTextToText, PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation
import torch

########## KORMo Config ##########
class KORMoConfig(PretrainedConfig):
    model_type = "kormo"
    keys_to_ignore_at_inference = ["past_key_values"]
    # Tensor-parallel sharding plan for the attention and MLP projections.
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }

    def __init__(
        self,
        vocab_size=112576,
        hidden_size=6144,
        intermediate_size=21504,
        num_hidden_layers=48,
        num_attention_heads=40,
        num_key_value_heads=8,
        hidden_act="silu",
        max_position_embeddings=131072,
        initializer_range=0.02,
        rms_norm_eps=1e-05,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=0,
        eos_token_id=1,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=500000.0,
        attention_bias=False,
        attention_dropout=0.0,
        rope_scaling=None,
        mlp_bias=False,
        head_dim=128,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        # Fall back to standard multi-head attention when no KV head count is given.
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.mlp_bias = mlp_bias
        self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
        self.mask_type = None
        # Accept the legacy "type" key as an alias for "rope_type".
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
        rope_config_validation(self)
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

# Register the config so AutoConfig can resolve model_type "kormo".
AutoConfig.register("kormo", KORMoConfig)
##########

model = AutoModelForImageTextToText.from_pretrained(
    "KORMo-VL/KORMo-VL", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("KORMo-VL/KORMo-VL")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.svg?raw=true",
            },
            # Prompt (Korean): "Please describe this image."
            {"type": "text", "text": "이 이미지에 대해 설명해주세요."},
        ],
    }
]

# return_dict=True returns a dict of tensors, which is needed to unpack
# the result into generate() with ** below.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=4096)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True
)
print(output_text)
```
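
The sequences returned by `generate()` for decoder-style models normally begin with the prompt tokens, so the decoded text above may repeat the prompt. If only the answer is wanted, the prompt prefix can be sliced off before decoding. The sketch below shows the idea on plain Python lists with made-up token IDs; `strip_prompt_tokens` is a hypothetical helper, not part of Transformers or KORMo.

```python
from typing import Sequence


def strip_prompt_tokens(
    generated: Sequence[Sequence[int]], prompt_length: int
) -> list[list[int]]:
    """Drop the first `prompt_length` tokens from every generated sequence.

    Assumes every row shares the same (padded) prompt length, as is the
    case for one batch produced by a single processor call.
    """
    return [list(seq[prompt_length:]) for seq in generated]


if __name__ == "__main__":
    # Made-up IDs: three prompt tokens followed by two newly generated tokens.
    generated = [[101, 7, 8, 42, 43]]
    print(strip_prompt_tokens(generated, prompt_length=3))  # [[42, 43]]
```

With the tensors from the snippet above, the equivalent slice would be `generated_ids[:, inputs["input_ids"].shape[1]:]`, passed to `processor.batch_decode` in place of `generated_ids`.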

If you want to enable the thinking mode, simply set `enable_thinking=True`:

```python
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=True,
)
```

Contact: ktlim@kaist.ac.kr

```bibtex
@misc{KORMo,
  author = {Minjun Kim and Hyeonseok Lim and Hangyeol Yoo and Inho Won and Seungwoo Song and Minkyung Cho and Junghun Yuk and Changsu Choi and Dongjae Shin and Huije Lee and Hoyun Song and Alice Oh and KyungTae Lim},
  title = {KORMo: Korean Open Reasoning Model for Everyone},
  year = {2025},
  publisher = {GitHub},
  journal = {Technical Report},
  paperLink = {\url{https://arxiv.org/abs/2510.09426}},
}
```