KORMo-VL

🚀 Update News

  • 2026-03-02: Official release of KORMo-VL.
  • 2025-10-13: Official release of KORMo-10B-sft.

💡 About KORMo-VL

KORMo-VL is a vision-language model developed from scratch by the KAIST MLP Lab (https://sites.google.com/view/aailab), built on top of KORMo-10B. The system consists of two components:

  • Vision-Language Model (VLM)
  • Image Generation Model

The VLM is designed for Document UI Agent tasks, enabling structured understanding and manipulation of document-based interfaces. As part of our commitment to open research, the KORMo team releases the training data (or generation pipelines) and training recipes used for our models to help researchers easily build Korean vision-language models.


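KORMo-VL is a vision-language model developed from scratch by the KAIST MLP Lab.
Built on KORMo-10B, it consists of (1) a vision-language model and (2) an image generation model.

The vision-language model is specialized for Document UI Agent tasks. It retains general VLM capabilities
and is trained on large-scale Korean document images in a world-model style, which makes it strong at understanding and writing Korean and English documents.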

1. LLM: KORMo-10B-sft
2. Vision Encoder: siglip-so400m
3. Languages: Korean / English
4. Training Data: Synthetic data + public datasets (approximately 3.4M text–image pairs)
5. Training Method: Image–text pretraining → Visual instruction tuning → Reinforcement learning

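Try KORMo-VL, which is specialized for Korean document understanding and control, in your own agent system.
Actions such as agent.click(), agent.write(), and agent.rewrite() let it control documents, images, and websites,
and additional fine-tuning can extend it with new actions such as scroll() and drag().

As always, the KORMo team releases the training data it used (or created) and its training recipes, so that anyone can build Korean vision-language models easily.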

Key Features

  • 📄 Document-Centric Vision-Language Understanding: Trained on large-scale Korean document images to enable strong document comprehension and generation.

  • 🌏 World-Model Style Training: Document layouts and visual-text interactions are modeled as part of a world representation.

  • 🇰🇷🇺🇸 Bilingual Support: Optimized for Korean and English document understanding.

  • 🤖 Agent-Friendly Design: Supports interaction-based tasks such as UI control and document editing.


Model Architecture

| Component | Model |
|---|---|
| LLM | KORMo-10B-sft |
| Vision Encoder | siglip-so400m |
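
The KORMoConfig class in the inference example sets head_dim explicitly instead of deriving it from hidden_size and num_attention_heads. The sketch below works through the projection widths that those default values imply. It assumes Llama-style grouped-query attention, where q_proj maps to num_attention_heads * head_dim and k_proj/v_proj map to num_key_value_heads * head_dim; the modeling code is not shown here, so treat that as an assumption. The AttentionShape helper is hypothetical, and the released checkpoint's own config may override these defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical helper holding the KORMoConfig default values."""

    hidden_size: int = 6144
    num_attention_heads: int = 40
    num_key_value_heads: int = 8
    head_dim: int = 128

    @property
    def q_proj_out(self) -> int:
        # Assumed Llama-style layout: one head_dim-wide slice per query head.
        return self.num_attention_heads * self.head_dim

    @property
    def kv_proj_out(self) -> int:
        # Keys and values use the smaller number of KV heads.
        return self.num_key_value_heads * self.head_dim

    @property
    def queries_per_kv_head(self) -> int:
        # Number of query heads that share each key/value head.
        return self.num_attention_heads // self.num_key_value_heads


shape = AttentionShape()
print(shape.q_proj_out)           # 5120
print(shape.kv_proj_out)          # 1024
print(shape.queries_per_kv_head)  # 5
```

With these defaults the query projection width (5120) differs from hidden_size (6144), which is why head_dim cannot simply be computed as hidden_size // num_attention_heads.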

Training

Languages

  • Korean/English

Training Data

  • Synthetic datasets
  • Public datasets
  • ~3.4M text-image pairs

Agent Capabilities

KORMo-VL is designed to operate as a Document UI Agent.

Example capabilities:

agent.click()
agent.write()
agent.rewrite()

These actions allow the model to interact with documents, images, and web interfaces.

The model can be further extended via fine-tuning to support additional actions such as:

scroll
drag
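
One way to wire such actions into an agent loop is a small dispatcher that parses an action string and calls a registered handler. The sketch below is illustrative only: it assumes the model emits actions as Python-style calls such as agent.click(x=120, y=48), which this document does not specify, and the ActionDispatcher class, argument names, and handlers are all hypothetical.

```python
import ast
from typing import Any, Callable, Dict, List, Tuple


class ActionDispatcher:
    """Hypothetical router from 'agent.<name>(...)' strings to handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        self._handlers[name] = handler

    def dispatch(self, action_text: str) -> Any:
        # Parse the text as one expression; nothing is executed here.
        node = ast.parse(action_text.strip(), mode="eval").body
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "agent"
        ):
            raise ValueError(f"not an agent action: {action_text!r}")
        name = node.func.attr
        if name not in self._handlers:
            raise KeyError(f"no handler registered for action {name!r}")
        if any(kw.arg is None for kw in node.keywords):
            raise ValueError("**kwargs expansion is not supported")
        # literal_eval accepts only literals, so arbitrary code is rejected.
        args = [ast.literal_eval(a) for a in node.args]
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        return self._handlers[name](*args, **kwargs)


# Handlers here only record what they were asked to do.
log: List[Tuple[Any, ...]] = []
dispatcher = ActionDispatcher()
dispatcher.register("click", lambda x, y: log.append(("click", x, y)))
dispatcher.register("write", lambda text: log.append(("write", text)))
# A new action (for example after fine-tuning) is just another handler.
dispatcher.register("scroll", lambda dy: log.append(("scroll", dy)))

dispatcher.dispatch("agent.click(x=120, y=48)")
dispatcher.dispatch('agent.write("안녕하세요")')
dispatcher.dispatch("agent.scroll(dy=-300)")
print(log)
```

Keeping the handlers in a registry means that supporting an extra action such as scroll or drag only requires registering one more callable, without touching the parsing logic.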

Intended Use

KORMo-VL is particularly suitable for:

  • Document understanding
  • Document editing agents
  • GUI / UI agents
  • Web automation
  • Document-grounded reasoning
  • Korean document AI systems

📈 Benchmark Performance

📊 Text Evaluation

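| Benchmark | KORMo-VL | KORMo-10B | smolLM3-3B | olmo2-7B | olmo2-13B | kanana1.5-8B | qwen3-8B | llama3.1-8B | gemma3-4B | gemma3-12B |
|---|---|---|---|---|---|---|---|---|---|---|
| 🇺🇸 English Benchmarks | | | | | | | | | | |
| arc_challenge | 63.74 | 58.96 | 55.55 | 59.13 | 61.01 | 56.48 | 63.82 | 54.61 | 53.58 | 63.82 |
| arc_easy | 87.25 | 85.48 | 83.21 | 85.06 | 86.57 | 82.74 | 87.50 | 84.01 | 82.83 | 87.37 |
| boolq | 86.18 | 83.46 | 82.17 | 84.50 | 86.48 | 84.53 | 87.71 | 81.87 | 80.70 | 86.61 |
| copa | 93.00 | 93.00 | 91.00 | 92.00 | 93.00 | 88.00 | 92.00 | 93.00 | 89.00 | 95.00 |
| gpqa_main | 33.04 | 30.13 | 26.79 | 26.34 | 29.24 | 29.24 | 30.13 | 23.44 | 30.13 | 35.71 |
| hellaswag | 79.25 | 60.25 | 56.78 | 61.52 | 65.02 | 59.93 | 59.54 | 60.96 | 57.56 | 63.67 |
| mmlu | 69.53 | 67.96 | 61.37 | 62.81 | 66.85 | 63.73 | 76.95 | 65.03 | 59.60 | 73.58 |
| mmlu_global | 64.97 | 63.44 | 57.52 | 59.88 | 63.99 | 60.21 | 75.05 | 61.30 | 57.23 | 70.23 |
| mmlu_pro | 47.02 | 40.18 | 34.94 | 27.29 | 32.50 | 34.93 | 56.58 | 36.23 | 27.79 | 37.07 |
| mmlu_redux | 70.88 | 69.00 | 62.95 | 63.53 | 68.37 | 65.88 | 78.19 | 65.86 | 60.86 | 75.25 |
| openbookqa | 48.40 | 39.00 | 36.40 | 39.00 | 39.60 | 36.80 | 39.20 | 39.00 | 37.00 | 40.20 |
| piqa | 82.15 | 81.12 | 78.45 | 80.79 | 82.64 | 80.30 | 79.05 | 80.90 | 79.49 | 82.59 |
| social_iqa | - | 52.81 | 50.72 | 55.89 | 57.57 | 57.01 | 56.96 | 53.12 | 51.84 | 56.45 |
| English Avg. | 68.78 | 63.45 | 59.83 | 61.36 | 64.06 | 61.52 | 67.90 | 61.49 | 59.05 | 66.73 |
| 🇰🇷 Korean Benchmarks | | | | | | | | | | |
| click | 56.74 | 55.29 | 46.97 | 37.79 | 41.80 | 62.76 | 60.70 | 49.22 | 49.62 | 62.21 |
| csatqa | - | 38.00 | 26.67 | 19.33 | 24.67 | 44.67 | 52.00 | 28.67 | 28.67 | 31.33 |
| haerae | 70.03 | 68.29 | 55.82 | 31.62 | 37.58 | 80.75 | 67.19 | 53.25 | 60.68 | 74.34 |
| k2_eval | 85.19 | 84.89 | 75.23 | 49.54 | 63.43 | 84.72 | 84.72 | 76.62 | 76.39 | 85.42 |
| kobest | - | 75.05 | 69.13 | 57.27 | 59.02 | 81.93 | 80.05 | 70.55 | 69.33 | 77.70 |
| kobalt | - | 22.86 | 15.86 | 11.43 | 13.14 | 26.29 | 26.57 | 17.43 | 15.57 | 23.86 |
| kmmlu | 48.12 | 46.48 | 38.52 | 33.05 | 31.24 | 48.86 | 56.93 | 40.75 | 39.84 | 51.60 |
| mmlu_global (ko) | 57.43 | 55.16 | 44.15 | 34.00 | 36.95 | 52.65 | 61.95 | 46.34 | 46.33 | 59.68 |
| kr_clinical_qa | 75.89 | 77.32 | 53.97 | 48.33 | 46.22 | 65.84 | 80.00 | 63.54 | 60.00 | 77.22 |
| Korean Avg. | 65.57 | 58.15 | 47.37 | 35.82 | 39.34 | 60.94 | 63.35 | 49.60 | 49.60 | 60.37 |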
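
The reported averages appear to be plain means over the benchmarks that have a score, with "-" entries skipped. The sketch below checks this for the KORMo-VL column; the average_available helper is hypothetical, and the exact averaging procedure used by the authors is an assumption inferred from the numbers.

```python
from statistics import mean
from typing import Optional, Sequence


def average_available(scores: Sequence[Optional[float]]) -> float:
    """Mean over reported scores, ignoring missing entries (None)."""
    reported = [s for s in scores if s is not None]
    return round(mean(reported), 2)


# KORMo-VL, English block. None marks the "-" entry (social_iqa).
kormo_vl_english = [63.74, 87.25, 86.18, 93.00, 33.04, 79.25, 69.53,
                    64.97, 47.02, 70.88, 48.40, 82.15, None]

# KORMo-VL, Korean block. csatqa, kobest, and kobalt are "-".
kormo_vl_korean = [56.74, None, 70.03, 85.19, None, None,
                   48.12, 57.43, 75.89]

print(average_available(kormo_vl_english))  # 68.78
print(average_available(kormo_vl_korean))   # 65.57
```

If that reading is right, the KORMo-VL averages cover 12 of 13 English benchmarks and 6 of 9 Korean benchmarks, so they are not computed over the same set as the other columns.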

📊 Vision Evaluation

| Benchmark | KORMo-VL | VARCO-VISION-2.0-14B (Qwen-based) | Qwen3-VL-8B Instruct |
|---|---|---|---|
| ChartQA | 86.68 | 82.6 | 82.96 |
| DocVQA | 87.16 | 87.46 | 95.76 |
| MMBench_DEV_EN | 79.8 | 83.67 | 84.52 |
| SEEDBench_IMG | 73.13 | 77.45 | 77.53 |
| K-DTCBench (Korean) | 82.08 | 80.00 | 87.5 |

📦 Installation

uv pip install transformers==4.57.1 pillow torchvision

🚀 Inference Example

from transformers import AutoConfig, AutoProcessor, AutoModelForImageTextToText, PretrainedConfig
from transformers.modeling_rope_utils import rope_config_validation
import torch

########## KORMo Config ##########

class KORMoConfig(PretrainedConfig):
    model_type = "kormo"
    keys_to_ignore_at_inference = ["past_key_values"]
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }

    def __init__(
        self,
        vocab_size=112576,
        hidden_size=6144,
        intermediate_size=21504,
        num_hidden_layers=48,
        num_attention_heads=40,
        num_key_value_heads=8,
        hidden_act="silu",
        max_position_embeddings=131072,
        initializer_range=0.02,
        rms_norm_eps=1e-05,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=0,
        eos_token_id=1,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_theta=500000.0,
        attention_bias=False,
        attention_dropout=0.0,
        rope_scaling=None,
        mlp_bias=False,
        head_dim=128,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.mlp_bias = mlp_bias
        self.head_dim = head_dim if head_dim is not None else self.hidden_size // self.num_attention_heads
        self.mask_type = None
        
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
        rope_config_validation(self)

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )


AutoConfig.register("kormo", KORMoConfig)

##########

model = AutoModelForImageTextToText.from_pretrained(
    "KORMo-VL/KORMo-VL", dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("KORMo-VL/KORMo-VL")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image", "image": "https://github.com/MLP-Lab/KORMo-tutorial/blob/main/tutorial/attachment/kormo_logo.svg?raw=true",
            },
            {"type": "text", "text": "이 이미지에 대해 설명해주세요."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return a dict-like batch so it can be unpacked with **inputs
    return_tensors="pt"
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=4096)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True
)
print(output_text)
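
For decoder-only generation, the ids returned by generate typically begin with the prompt tokens, so the decoded text repeats the prompt. A minimal sketch of trimming them before decoding follows; strip_prompt is a hypothetical helper, and the commented usage assumes inputs is the dict-like batch that was passed to model.generate.

```python
from typing import List, Sequence


def strip_prompt(generated: Sequence, prompt_len: int) -> List:
    """Drop the first prompt_len token ids from every generated sequence.

    Works on nested lists and on anything else that supports slicing
    per row, such as the rows of a 2-D tensor.
    """
    return [seq[prompt_len:] for seq in generated]


# Stand-alone demonstration with plain lists: three prompt ids, two new ids.
demo = strip_prompt([[11, 12, 13, 101, 102]], prompt_len=3)
print(demo)  # [[101, 102]]

# Usage with the objects from the example above (not executed here):
# prompt_len = inputs["input_ids"].shape[1]
# new_ids = strip_prompt(generated_ids, prompt_len)
# print(processor.batch_decode(new_ids, skip_special_tokens=True))
```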

🧠 Enabling Thinking Mode

To enable thinking mode, set enable_thinking=True:

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=True,
)
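
When thinking mode is on, downstream code usually needs the final answer separated from the reasoning trace. The sketch below assumes the reasoning is wrapped in <think>...</think> tags, which this document does not confirm; adjust the pattern to the delimiters the model actually emits. If those delimiters are special tokens, decoding with skip_special_tokens=True may already remove them. The split_thinking helper is hypothetical.

```python
import re
from typing import Tuple

# Assumed delimiter format; not confirmed for KORMo-VL.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no tags are found."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # The answer is everything outside the first think block.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking(
    "<think>로고를 확인한다.</think>\nKORMo 로고입니다."
)
print(reasoning)  # 로고를 확인한다.
print(answer)     # KORMo 로고입니다.
```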

Contact

  • KyungTae Lim, Professor at KAIST. ktlim@kaist.ac.kr

Contributors

  • Hangyeol Yoo (hgyoo@seoultech.ac.kr)
  • Dongjae Shin (djshin1998@tutoruslabs.com)
  • Yerin Nam (nyl0522@naver.com)
  • Minkyu Kim (mingu2912@gmail.com)
  • Wonjun Oh (wjoh@kaist.ac.kr)
  • KyungTae Lim (ktlim@kaist.ac.kr)

Citation

@misc{KORMo,
  author = {Minjun Kim and Hyeonseok Lim and Hangyeol Yoo and Inho Won and Seungwoo Song and Minkyung Cho and Junghun Yuk and Changsu Choi and Dongjae Shin and Huije Lee and Hoyun Song and Alice Oh and KyungTae Lim},
  title = {KORMo: Korean Open Reasoning Model for Everyone},
  year = {2025},
  publisher = {GitHub},
  journal = {Technical Report},
  paperLink = {\url{https://arxiv.org/abs/2510.09426}},
}