Teaser

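EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

This repository contains the official EndoCoT model checkpoints. The work is described in the paper EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models.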

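📝 To-Do

- Open-source the training code
- Open-source the training data
- Open-source the main-task model checkpoint
- Open-source the editing model checkpoint
- Refactor the codebase for better usability and maintainability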

📰 News

🚀 [March 12, 2026] We have released the EndoCoT code repository and model weights.

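🌟 Highlights

main

EndoCoT is a reasoning paradigm for diffusion models that supports step-by-step reasoning. On Qwen-Image-Edit-2511, it outperforms conventional training methods.

exp

It also provides transparent intermediate reasoning trajectories.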

case

⚡ Quick Start

Environment Setup

git clone https://github.com/InternLM/EndoCoT
cd EndoCoT
conda create -n EndoCoT python=3.10
conda activate EndoCoT
# Please install the version of torch compatible with your machine.
pip install -r requirements.txt
# Please install the version of vLLM compatible with your machine.
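
The two comments above leave the choice of torch and vLLM versions to you. As a quick sanity check after installation, a minimal sketch like the following reports which versions ended up in the active environment. The helper name `installed_versions` is hypothetical and not part of this repository.

```python
from importlib import metadata


def installed_versions(packages=("torch", "vllm")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

A `None` entry means the package still needs to be installed by hand, since it is up to you to pick a build that matches your machine.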

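Inference

Download the model checkpoint:
- Our pretrained weights are available at: EndoCoT
Following the Diffthinker configuration, we provide a customized checkpoint for Qwen-Image-Edit. It has been merged from the original safetensors to ensure compatibility with DiffSynth-Studio training. To ensure correct loading and inference, use the checkpoint provided in this repository rather than the official release.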
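
Before running inference, it can help to confirm that the directory you plan to pass as `--model_root` actually contains weight files. The sketch below assumes only that the merged weights are stored as `.safetensors` files somewhere under that directory. The function name and the file suffix are illustrative assumptions, not something this repository defines.

```python
from pathlib import Path


def list_weight_files(model_root, suffix=".safetensors"):
    """Return (relative path, size in bytes) for every weight file under model_root."""
    root = Path(model_root)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    return sorted(
        (p.relative_to(root).as_posix(), p.stat().st_size)
        for p in root.rglob(f"*{suffix}")
        if p.is_file()
    )
```

An empty list usually means the path points at the wrong directory or the download did not finish.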

Test a Single Case

cd test
python test.py \
    --task Maze \
    --model_root /path/to/merged_ckpts \
    --lora_path /path/to/your_lora_weight.safetensors \
    --input_image ./data/sudoku_sample.png \
    --output_dir ./outputs/sudoku_results

Evaluate Our Model Checkpoint

We use exactly the same settings as Diffthinker:
```
cd Maze
bash eval/gen_and_parse.sh
bash eval/eval_path.sh
```

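Training

Download the dataset and metadata.csv
- Our training data is available at: EndoCoT dataset
Because the metadata uses relative paths, make sure the dataset files are placed in the same directory as metadata.csv.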
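
Because the paths are relative, a misplaced metadata.csv tends to surface only later as missing files during training. The sketch below checks this up front by resolving each listed path against the directory that holds the CSV. The column names are not documented here, so `path_columns` is a hypothetical parameter that you would set to match the actual header.

```python
import csv
from pathlib import Path


def find_missing_files(metadata_csv, path_columns):
    """Return relative paths listed in metadata_csv that do not exist next to it."""
    metadata_csv = Path(metadata_csv)
    base_dir = metadata_csv.parent  # relative paths are resolved against this directory
    missing = []
    with metadata_csv.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for column in path_columns:
                rel_path = (row.get(column) or "").strip()
                if rel_path and not (base_dir / rel_path).exists():
                    missing.append(rel_path)
    return missing
```

An empty result means every referenced file was found relative to metadata.csv.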

Train Your Model

cd DiffSynth-Studio
bash add/Maze/stage1.sh
python change_ckpt_prefix.py --src /path/to/the/Maze/save/dir/Maze_stage1
bash add/Maze/stage2.sh
python change_ckpt_prefix.py --src /path/to/the/Maze/save/dir/Maze_stage2
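
The commands above are meant to run in the order shown, from inside DiffSynth-Studio. If you prefer to drive them from Python, a minimal sketch could look like the following. The function names are hypothetical, and generalizing the `Maze` paths to other tasks through the `task` parameter is an assumption that this README does not confirm.

```python
import subprocess


def build_training_commands(save_dir, task="Maze"):
    """Return the four commands from the steps above, in execution order."""
    commands = []
    for stage in ("stage1", "stage2"):
        commands.append(["bash", f"add/{task}/{stage}.sh"])
        commands.append(
            ["python", "change_ckpt_prefix.py", "--src", f"{save_dir}/{task}_{stage}"]
        )
    return commands


def run_training(save_dir, task="Maze", cwd="DiffSynth-Studio"):
    """Run the commands one by one; check=True stops at the first failing step."""
    for command in build_training_commands(save_dir, task):
        subprocess.run(command, cwd=cwd, check=True)
```

Stopping at the first failure avoids starting stage 2 when stage 1 or its prefix conversion did not complete.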

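How do I change the number of latent reasoning steps?

Customization note: The current implementation is fairly direct, so the number of latent reasoning steps can only be adjusted by hand in DiffSynth-Studio/diffsynth/pipelines/qwen_image.py:

Line 442: modify infer_steps.

Line 471: modify training_steps.

We plan to improve this in a future release.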

def encode_prompt_edit(self, pipe: QwenImagePipeline, prompt, edit_image, is_final, gt_prompt=None, idx=None):

        drop_idx = 64
        if isinstance(prompt[0], str):
            template = (
                "<|im_start|>system\n"
                "Describe the key features of the input image (color, shape, size, texture, objects, background), then explain how the user's text instruction should alter or modify the image. Generate a new image that meets the user's requirements while maintaining consistency with the original input where appropriate.<|im_end|>\n"
                "<|im_start|>user\n"
                "<|vision_start|><|image_pad|><|vision_end|>{}<|im_end|>\n"
                "<|im_start|>assistant\n"
            )
            txt = template.format(prompt[0])
            model_inputs = pipe.processor(text=txt, images=edit_image, padding=True, return_tensors="pt").to(pipe.device)
            embedding_layers = pipe.text_encoder.model.language_model.get_input_embeddings()
            with torch.no_grad():
                inputs_embeds = embedding_layers(model_inputs.input_ids)
            self.attention_mask = model_inputs.attention_mask
            self.pixel_values = model_inputs.pixel_values
            self.image_grid_thw = model_inputs.image_grid_thw
        else:
            inputs_embeds = prompt[0]
            
        # dxl: test use
        if is_final is None or idx is not None:
            print("Running inference or stage-2 training")
            if idx is not None:
                iter_times = idx - 2
            else:
                # infer step
                iter_times = 50
                
            with torch.no_grad():
                inputs_embeds = self.manual_generate_eval(
                    pipe, 
                    inputs_embeds=inputs_embeds,
                    max_new_tokens=iter_times,
                ).detach()
            
            # dxl: only update the last 2 tokens
            if idx is not None:
                inputs_embeds = self.manual_generate_eval(
                    pipe,
                    inputs_embeds=inputs_embeds,
                    max_new_tokens=2,
                )

            generated_embeds = inputs_embeds

        # ... ...

        # dxl: training
        if is_final is not None and idx is None:
            try:
                generated_embeds, _ = self.manual_generate(
                    pipe,
                    inputs_embeds=inputs_embeds,
                    is_final=is_final,
                    # training steps
                    max_new_tokens=2,
                )
            except Exception as e:
                print(f"Error!: {type(e).__name__} - {e}")
                print(inputs_embeds.shape)
                raise

        try:
            return split_hidden_states, generated_embeds, eos_loss
        except NameError:
            # Raised when one of the names above was never assigned on this code path.
            print("[WARNING] Prompt was not updated correctly for inference.")
            return split_hidden_states
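
The excerpt above hard-codes the step counts inside its branches. To make the branching easier to follow, the sketch below restates it as a small pure function and lifts the two constants into a config object. This is not the repository's code: `LatentStepConfig` and `plan_latent_steps` are hypothetical names, and the defaults (50 and 2) are simply the values that appear in the excerpt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentStepConfig:
    infer_steps: int = 50    # value next to the "# infer step" comment in the excerpt
    training_steps: int = 2  # value next to the "# training steps" comment


def plan_latent_steps(is_final, idx, config=LatentStepConfig()):
    """Mirror the excerpt's branching.

    Returns (steps inside torch.no_grad, steps outside torch.no_grad).
    """
    if is_final is None or idx is not None:
        if idx is not None:
            # idx - 2 steps run under no_grad, then the last 2 are generated outside it.
            return idx - 2, 2
        # Plain inference: every step runs under no_grad.
        return config.infer_steps, 0
    # Training branch: is_final is given and idx is None.
    return 0, config.training_steps
```

For example, `plan_latent_steps(None, 10)` yields `(8, 2)`, matching the `idx-2` prefix followed by the two-token update in the excerpt. Passing `LatentStepConfig(infer_steps=20)` changes the inference step count without editing the pipeline file.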

📖 Citation

@article{dai2026endocot,
  title={EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models},
  author={Dai, Xuanlang and Zhou, Yujie and Xing, Long and Bu, Jiazi and Wei, Xilin and Liu, Yuhong and Zhang, Beichen and Chen, Kai and Zang, Yuhang},
  journal={arXiv preprint arXiv:2603.12252},
  year={2026}
}

⚖️ License

Code License | Data License
