SenseNova/SenseNova-U1-8B-MoT-Infographic

SenseNova-U1: Unifying Multimodal Understanding and Generation with the NEO-unify Architecture


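📣 News

  • [2026.05.15] Released SenseNova-U1-8B-MoT-Infographic 📊, which improves infographic generation. Model details are in U1 Infographic Model, and 100 generated examples are in the ✨ Infographic Gallery.
✨ Past updates
  • [2026.05.10] Released the 🔥SenseNova-U1 technical report🔥 and open-sourced the SenseNova-U1-A3B-MoT-SFT and SenseNova-U1-A3B-MoT model weights.

  • [2026.05.08] Added support for GGUF quantized weights and a layer-wise loading VRAM mode, which make single-GPU inference practical when VRAM is limited; see Low-VRAM Inference (GGUF + VRAM Mode). GGUF weights for SenseNova-U1-8B-MoT-Merger are available at 🤗 smthem/SenseNova-U1-8B-MoT-Merger-gguf. Special thanks to @smthem for contributing the quantized weights to the community.

  • [2026.05.06] Released SenseNova-U1-8B-MoT-LoRA-8step-V1.0. See the inference example script.

  • [2026.04.30] Released SenseNova-U1-8B-MoT-8step-preview, a preview of the 8-step inference model. In most cases its image quality is very close to that of the base model (see the comparison and known issues). To test it, follow the inference script but override these arguments: --cfg_scale 1.0 --num_steps 8.

  • [2026.04.27] Initial release of the SenseNova-U1-8B-MoT-SFT and SenseNova-U1-8B-MoT model weights.

  • [2026.04.27] Initial release of the SenseNova-U1 inference code.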

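🌟 Overview

🚀 SenseNova U1 is a new series of native multimodal models that unifies multimodal understanding, reasoning, and generation within a single architecture. It marks a fundamental paradigm shift in multimodal AI: from modality integration to true unification. Rather than relying on adapters to translate between modalities, SenseNova U1 thinks and acts natively across language and vision.

✨ Architecture details

Unifying visual understanding and generation opens up vast possibilities. SenseNova U1 builds on the data-driven learning stage (e.g., ChatGPT) and points toward the next stage, agentic learning (e.g., OpenClaw), in which models learn, think, and act in a natively multimodal way.

[Figure: radar plot]

#### 🏗️ *Key Pillars:*

At the core of SenseNova U1 is NEO-unify, a new architecture designed for multimodal AI from first principles. It removes both the visual encoder (VE) and the variational autoencoder (VAE), on the premise that pixel and text information are deeply correlated by nature. Its key features:

  • 🔗 Models language and visual information end-to-end as a unified whole.
  • 🖼️ Preserves semantic richness while maintaining pixel-level visual fidelity.
  • 🧠 Reasons across modalities through native MoT, with high efficiency and few conflicts.

Built on this core architecture, SenseNova U1-8B-MoT-Infographic (an infographic-enhanced version of SenseNova U1-8B-MoT) shows excellent efficiency and advanced performance in high-density visual information generation: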


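Figure: Generation latency versus average performance on the infographic benchmarks (BizGenEval (Easy, Hard) and IGenBench).

Figure: Generation latency versus average performance on OneIG (EN, ZH), LongText (EN, ZH), CVTG, BizGenEval (Easy, Hard), and IGenBench.
  • Performance: Compared with the base SenseNova-U1-8B-MoT model, BizGenEval hard/easy improves from 39.8 / 61.1 to 46.6 / 65.4 (+6.8 / +4.3 points), and IGenBench Q-ACC/I-ACC improves from 51.3 / 4.2 to 69.5 / 17.0 (+18.2 / +12.8 points), while visual understanding remains robust with no noticeable degradation.
  • Generation quality: The model generates complex infographics across 100+ styles and layouts, with better visual aesthetics and text rendering. It can even render dense small text such as arXiv-style pages.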
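
The point gains quoted above can be re-derived directly from the before/after scores. The following is a minimal sketch for checking that arithmetic; the helper name `point_gains` is hypothetical and not part of any SenseNova tooling.

```python
# Scores copied from the performance bullet above: (base model, Infographic variant).
scores = {
    "BizGenEval hard": (39.8, 46.6),
    "BizGenEval easy": (61.1, 65.4),
    "IGenBench Q-ACC": (51.3, 69.5),
    "IGenBench I-ACC": (4.2, 17.0),
}


def point_gains(pairs):
    """Return the absolute gain in points per metric, rounded to one decimal."""
    return {name: round(new - base, 1) for name, (base, new) in pairs.items()}


gains = point_gains(scores)
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

Rounding to one decimal matches the precision of the reported scores and hides floating-point noise in the subtraction.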
✨ Benchmark Highlight
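| Model | BizGenEval Avg. (hard / easy) ↑ | IGenBench Q-ACC ↑ | IGenBench I-ACC ↑ | OneIG (EN) ↑ | OneIG (ZH) ↑ |
|---|---|---|---|---|---|
| **Commercial Models** | | | | | |
| Nano-Banana-Pro | 76.7 / 93.7 | 90.6 | 48.8 | 58.1 | 56.8 |
| Nano-Banana-2.0 | 68.5 / 92.5 | 85.6 | 34.4 | 54.0 | 54.9 |
| GPT-Image-1.5 | 35.9 / 81.6 | 55.0 | 12.0 | - | - |
| Qwen-Image-2.0 | 45.5 / 65.8 | 50.0 | 3.0 | 54.1 | 50.9 |
| Seedream-4.5 | 30.1 / 66.2 | 61.0 | 6.0 | 56.4 | 55.0 |
| **Open-source Models** | | | | | |
| SenseNova-U1-8B-MoT-Infographic | 46.6 / 65.4 | 69.5 | 17.0 | 55.6 | 53.3 |
| SenseNova-U1-8B-MoT | 39.8 / 61.1 | 51.3 | 4.2 | 54.5 | 53.8 |
| Z-Image | 8.2 / 43.8 | 30.0 | 1.0 | 54.6 | 53.5 |
| Qwen-Image-2512 | 6.3 / 41.0 | 32.2 | 1.0 | 53.0 | 51.5 |
| Qwen-Image | 2.8 / 23.8 | 36.0 | 0.0 | 53.9 | 54.8 |
| Bagel | 2.0 / 3.7 | 4.9 | 0.0 | 36.1 | 37.0 |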

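IGenBench scores are reported on a 100-point scale. Within the commercial and open-source groups, models are ordered by the arithmetic mean of four metrics: BizGenEval hard, BizGenEval easy, IGenBench Q-ACC, and IGenBench I-ACC. OneIG serves as a reference for general generation ability. Full per-item results are best placed in the Hugging Face model card.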
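
The ordering rule described above (arithmetic mean of the four infographic metrics) can be reproduced as follows. This is an illustrative sketch using the open-source rows of the table; the function name `rank_by_mean` is an assumption made for this example.

```python
from statistics import mean

# (BizGenEval hard, BizGenEval easy, IGenBench Q-ACC, IGenBench I-ACC),
# copied from the open-source rows of the benchmark table.
open_source = {
    "SenseNova-U1-8B-MoT-Infographic": (46.6, 65.4, 69.5, 17.0),
    "SenseNova-U1-8B-MoT": (39.8, 61.1, 51.3, 4.2),
    "Z-Image": (8.2, 43.8, 30.0, 1.0),
    "Qwen-Image-2512": (6.3, 41.0, 32.2, 1.0),
    "Qwen-Image": (2.8, 23.8, 36.0, 0.0),
    "Bagel": (2.0, 3.7, 4.9, 0.0),
}


def rank_by_mean(group):
    """Sort model names by the arithmetic mean of their four scores, best first."""
    return sorted(group, key=lambda model: mean(group[model]), reverse=True)


ranking = rank_by_mean(open_source)
for model in ranking:
    print(f"{model}: {mean(open_source[model]):.2f}")
```

Under this rule the computed ranking matches the row order of the open-source group in the table.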

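  • 📰 High-density information presentation (specially optimized): This dedicated model shows strong capability in expressing dense visual information. It generates richly structured content with complex layouts, suited to information-heavy scenarios such as knowledge illustrations, posters, slides (PPT), comics, and résumés.

  • 🏆 Open-source SoTA: SenseNova U1 sets a new standard for unified multimodal understanding and generation, and reaches state-of-the-art results among open-source models on all infographic-related benchmarks.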

🎨 Infographic Showcase

📸 Full examples and prompts: see the ✨ Infographic Gallery.

✨ Infographic showcase
[Image gallery: infographic samples 004, 005, 006, 066, 036, 023, 057, 072, 078, 064, 037, 032, 042, 085, 098, 056, 019, 048, 070, 021, 034, 089, 060, 083, 045]

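Generation Quality Comparison

We qualitatively compare SenseNova-U1-8B-MoT and SenseNova-U1-8B-MoT-Infographic along five key dimensions: background stability, chart accuracy, text rendering accuracy and size appropriateness, paper rendering quality, and overall layout and content understanding. The full comparison is in ✨ Comparation Infographic Cases.md.

✨ Generation quality comparison

Background Stability

[Image comparison columns: U1-8B-MoT, 8B-MoT-Infographic, U1-8B-MoT, 8B-MoT-Infographic]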
Prompt
该信息图题为“版权视觉概览”,整体采用横向分栏布局,分为上下两个主要部分。上半部分为视觉化概览区,由四个彩色矩形区块并列组成,每个区块通过图标和简短标题传达一个核心概念;下半部分为“【版权基础常识】”详细解释区,包含四个编号条目,对应上半部分的四个主题,提供更详尽的文字说明。

**上半部分:版权视觉概览**

此区域由四个水平排列的彩色方块构成,从左至右依次为浅蓝色、浅黄色、浅绿色和浅紫色,每个方块内含一组图标和下方的中文标题。

1. **第一块(浅蓝色):创作即产生**
* **图标**:左侧是一个发光的灯泡,中间是一个带有笔的文档图标,右侧是一个锁头图标,三者之间用箭头连接,表示“创意 → 创作 → 保护”的流程。
* **文字**:
* 图标下方有小字“自动保护”。
* 方块底部有大字标题“创作即产生”。

2. **第二块(浅黄色):核心权利**
* **图标**:中心是一只手掌向上托举,上方有多个元素围绕:一个带©符号的圆圈、一个喇叭、一堆金币和美元符号、以及多个指向不同方向的箭头,象征权利的多种表现形式和收益。
* **文字**:
* 图标下方无额外小字。
* 方块底部有大字标题“核心权利”。

3. **第三块(浅绿色):特定条件平衡**
* **图标**:一个天平,左侧托盘上有打开的书本和标有“NEWS”的麦克风,代表“合理使用”;右侧托盘上有一个带锁的文件夹,代表“受控作品”。天平向右侧倾斜。
* **文字**:
* 左侧托盘下方标注“合理使用”。
* 右侧托盘下方标注“受控作品”。
* 方块底部有大字标题“特定条件平衡”。

4. **第四块(浅紫色):保护期限**
* **图标**:左侧是一个沙漏,中间是一个向右的粗箭头,右侧是一个墓碑(顶部有十字架)。沙漏下方还有一个时钟图标。
* **文字**:
* 墓碑旁标注“作者有生之年 + X年”。
* 方块底部有大字标题“保护期限”。

**下半部分:【版权基础常识】**

此区域位于上半部分下方,背景为白色,包含四个独立的文本框,每个文本框都有一个彩色标题栏和下方的详细说明文字,颜色与上半部分对应。

1. **1. 自动获得保护**
* **标题栏**:蓝色背景,白色文字“1. 自动获得保护”。
* **正文**:“作品创作完成之时起,即自动享有版权,无需登记(登记主要是举证)。”

2. **2. 核心权利**
* **标题栏**:橙黄色背景,白色文字“2. 核心权利”。
* **正文**:“包括人身权(如署名权、修改权)和财产权(如复制权、发行权、信息网络传播权,可许可或转让获利)。”

3. **3. 合理使用**
* **标题栏**:绿色背景,白色文字“3. 合理使用”。
* **正文**:“在特定条件下(如教学、新闻报道、个人学习等),可以不经许可、不支付报酬使用,但需指明作者和出处,且不得侵犯其他权利。”

4. **4. 保护期限**
* **标题栏**:紫色背景,白色文字“4. 保护期限”。
* **正文**:“一般为作者有生之年加死后50年(中国大陆等多数地区),期限届满后进入公有领域。”

**整体风格与数据编码**:
该信息图采用扁平化设计风格,色彩鲜明且分区清晰。通过颜色编码(蓝、黄、绿、紫)将四个主题进行视觉区分,并在上下两部分保持一致。图标作为主要的数据可视化手段,直观地表达了抽象概念。所有文字均为简体中文,内容结构严谨,逻辑清晰,旨在以图文结合的方式普及版权基础知识。
Prompt
该信息图以中文为主要语言,采用横向四格布局,清晰呈现一个品牌从衰落到复兴的四个关键阶段。整体风格为手绘卡通插画,色彩柔和,线条简洁,具有亲和力和叙事性。每个阶段由上方的标题、中间的插图和下方的文字说明三部分构成,通过虚线分隔,结构分明。

第一阶段标题为“1. 曾经的辉煌与没落”,插图描绘了一座破败的城堡,城堡上挂着悲伤的表情,周围散落着皇冠,象征昔日荣耀的消逝;旁边立有标牌“OLD BRAND”,背景中可见大本钟,暗示传统或历史品牌。下方文字说明:“曾经是市场领导者,但未能跟上时代步伐,逐渐被遗忘,面临生存危机。”

第二阶段标题为“2. 创新与重塑”,插图展示四人团队围坐讨论,其中一人指向白板上的绿色叶子标志设计,周围环绕齿轮、灯泡(代表创意)和标牌“NEW IDEAS”。下方文字说明:“进行深度市场调研,重新定位品牌,引入创新设计和数字化策略,重塑核心价值。”

第三阶段标题为“3. 成功翻盘”,插图包含一只浴火重生的凤凰,象征涅槃;右侧是上升趋势的柱状图,下方是一个带有爱心的包裹,代表产品交付;一群欢呼的人群表达喜悦。下方文字说明:“凭借新产品和新形象重获消费者信任,业绩逆势上扬,重新赢得市场份额。”

第四阶段标题为“4. 未来展望”,插图描绘一枚火箭从地球轨道发射升空,周围有星星、云朵和一片绿叶,象征可持续发展;下方横幅写着“FUTURE READY”。下方文字说明:“持续创新,关注可持续发展和用户连接,立志成为更具影响力的未来品牌。”

整个信息图通过视觉隐喻(如城堡、凤凰、火箭)和数据图表(柱状图)结合,生动讲述了一个品牌从危机到复兴的完整故事,强调创新、用户信任和可持续发展的重要性。所有文本均为简体中文,无英文以外的其他语言。
Prompt
The infographic titled "College Entrance Pathway Reforce Comparison" presents a structured comparison of key aspects for prospective students in Guangdong, China, aiming to enter college through a specialized entrance examination. The layout is organized as a multi-column table with four main columns: "Content Item / Evaluation Criteria", "Statistics", "Quotes", and "Key Terms". Each row corresponds to a distinct evaluation criterion or step in the preparation process, with visual icons, text, and data points enhancing clarity.

The infographic uses a clean, minimalist design with black line art icons on a light beige background. Text is primarily in bold sans-serif font, with headings emphasized for readability. Data is encoded using icons (e.g., graduation cap, calendar, books, target, rocket) to visually represent concepts, while numerical values are explicitly labeled for precision.

The first row addresses **Eligibility Criteria**:
- In the "Statistics" column, it features an icon of a person checking a map of Guangdong with the text: "Official Eligibility Requirements Confirm if you qualify to register".
- The "Quotes" column lists three eligible groups with corresponding icons: "Final-Year Guangdong Junior College Student", "Guangdong Resident <2 Years Post Graduation", and "Eligible Retired Military Personnel".
- The "Key Terms" column shows a magnifying glass over a document with the label: "Eligibility Verification".

The second row covers **Exam Structure & Scoring Breakdown**:
- "Statistics" displays icons representing different test types and scores: 100 pts (graduation cap), 200 pts (person at desk), 1000 pts (document with pen), 150 pts (document with pen). Below: "Total 500 points across 4 test papers".
- "Quotes" lists four subject components in document-shaped boxes: "Political Theory (100 pts)", "Major-Aligned Public Subject (100 pts)", "Professional Subject 1 (150 pts)", "Professional Subject 2 (150 pts)".
- "Key Terms" includes a balance scale icon with "Score Distribution".

The third row details the **Official Annual Exam Timeline**:
- "Statistics" contains a horizontal timeline with icons of a calendar and clock, labeled "Annual Key Timeline".
- "Quotes" provides a detailed timeline: Jan: Registration Open → Jan: Admission Open → Mid-Mar: Exam Date → Mid-Apr: Score Release → May-Jun: Admission Offers.
- "Key Terms" shows a calendar and clock with "Critical Dates".

The next three rows outline a three-step preparation strategy:

**Step 1 - Confirm Target Major & Institution**:
- "Statistics": Icon of a person holding a map with a target, text: "Confirm your target 6 months in advance".
- "Quotes": Two bullet points: "Download official exam syllabi and past professional subject papers from the target institution’s admission portal" and "Cross-verify that your junior college major meets the target major’s prerequisite requirements".
- "Key Terms": Clock and books with "Target Selection".

**Step 2 - Public Subject Foundation Building**:
- "Statistics": Icon of a person studying with books and a coffee cup, text: "Complete 3 months of structured public subject study".
- "Quotes": Two bullet points: "Complete 5+ years of past public subject exam papers to identify recurring test points" and "Political Theory allocates 30% of total score to current affairs from the past calendar year".
- "Key Terms": Box with lightbulb and "Core Knowledge".

**Step 3 - Professional Subject Sprint Revision**:
- "Statistics": Icon of a running person with a book and clock, text: "Focus on high-weight professional subjects in the final 2 months".
- "Quotes": Two bullet points: "Practice past professional subject papers from your target institution and review core major textbooks" and "60% of professional subject questions are repeated or adapted from past 3 years of papers for most institutions".
- "Key Terms": Trophy and gears with "Intensive Review".

Red horizontal lines separate the first three criteria from the three-step strategy, while a blue line separates Step 1 from Steps 2 and 3, visually grouping related content. All textual information is preserved exactly as presented, including spelling variations like "Oficial" (likely intended as "Official"). The infographic serves as a strategic roadmap combining official requirements, scoring details, timelines, and actionable preparation steps for candidates.
Prompt
The infographic titled "12-Month Market Performance: US vs. Asia" presents a structured, puzzle-piece-based visual analysis comparing the performance of US and Asian equity markets over a 12-month period. The layout is organized into three main steps, arranged in a central vertical flow with interconnected puzzle pieces, emphasizing a modular, analytical approach to market comparison. The design uses clean black-and-white line art with light blue accents for key sections, icons for visual representation, and clear typography for readability.

**Step 1** (top center) introduces the scope of the analysis. It features an illustration of four people examining charts, symbolizing data analysis. To the right, it defines the market indices being compared:
- **US Markets**: S&P 500, NASDAQ
- **Asian Markets**: Nikkei 225, Hang Seng, KOSPI, CSI 300

It also lists the types of data analyzed:
- Trailing Return (represented by a rising bar chart icon)
- Average Daily Volume (represented by a stacked bar chart icon)
- Top Sector Return (represented by a pie chart icon)

**Step 2** (left side, labeled "Metrics that account for 72% of short-term S&P 500 volatility") focuses on US Market Core Driving Indicators. This section contains icons representing industry (factory), finance (bank building), money (hand holding dollar sign), and labor (worker in hard hat). Below these icons, a light blue banner reads "US Market Core Driving Indicators". Specific metrics are listed with red warning triangle icons:
- CPI YoY: 3.2%
- Federal Funds Rate: 5.25–5.5%
- Non-farm Payrolls: +187k July 2024

**Step 3** (right side, labeled "Metrics that predict 68% of MSCI Asia Ex-Japan 3-month forward returns") focuses on Asian Market Core Leading Indicators. This section includes icons for shipping (container), manufacturing (gears), and calculation (calculator). A light blue banner below reads "Asian Market Core Leading Indicators". Specific metrics are listed:
- Manufacturing PMI: 51.2 (with red warning triangle)
- Q2 Export Growth: +6.8% YoY (with red warning triangle)
- Avg Policy Rate: 3.1% (with information circle icon)

At the bottom center, a large puzzle piece titled "Policy Shifts & Market Volatility Correlation" displays a line graph with two fluctuating lines:
- **US VIX (navy line)** — representing US market volatility
- **Asian Avg Volatility (green line)** — representing average Asian market volatility

Arrows connect the two lines, indicating correlation. Below the graph, key insights are provided with red warning triangles:
- Rate hike impact: +27% US VIX
- Trade policy impact: +34% Asian VIX
- Cross-regional sell-off correlation: 0.68

The overall structure visually represents how US and Asian market performances are driven by distinct but interrelated economic indicators, with a central focus on their volatility dynamics and policy impacts. The use of puzzle pieces metaphorically suggests that these components fit together to form a complete picture of global market trends. The infographic employs consistent iconography, color-coding (red for warnings, blue for core sections), and clear textual labeling to convey complex financial data in an accessible format.

Chart Accuracy

[Image comparison columns: U1-8B-MoT, 8B-MoT-Infographic, U1-8B-MoT, 8B-MoT-Infographic]
Prompt
Create an infographic that features a title and a subtitle centered at the top, reading 'Fastest Cuisines to Prepare' and 'Average Ghost Kitchen Handover Time by Item Type (Minutes)' respectively. The main visual is a horizontal grouped bar chart combining a Fast-food neon visual style with checkerboard borders along the edges, featuring a centered legend above the chart area for 'QuickEats' (cyan neon border) and 'DashNow' (orange neon border). To the bottom right of the bar chart, there is a simple illustration of two mopeds waiting for orders. The chart's vertical axis lists four categories, each preceded by a simple icon, while the horizontal axis represents handover time in minutes with numerical labels at 0, 5, 10, 15, and 20, supplemented by dotted vertical gridlines. Each category features a pair of black bars representing the two platforms, with exact values displayed directly inside the right end of each bar. For 'Classic Tacos', QuickEats takes 10.0 minutes while DashNow takes 11.5 minutes. 'Supreme Burritos' require the longest preparation, with 17.5 minutes for QuickEats and 19.0 minutes for DashNow. 'Spicy Nachos' take 9.5 minutes on QuickEats and 10.0 minutes on DashNow. Finally, 'Mini Quesadillas' are the fastest, taking 8.0 minutes for QuickEats and 8.5 minutes for DashNow. 
The given data is : [{"category": "Classic Tacos", "platform": "QuickEats", "unit": "Minutes", "value": 10.0}, {"category": "Classic Tacos", "platform": "DashNow", "unit": "Minutes", "value": 11.5}, {"category": "Supreme Burritos", "platform": "QuickEats", "unit": "Minutes", "value": 17.5}, {"category": "Supreme Burritos", "platform": "DashNow", "unit": "Minutes", "value": 19.0}, {"category": "Spicy Nachos", "platform": "QuickEats", "unit": "Minutes", "value": 9.5}, {"category": "Spicy Nachos", "platform": "DashNow", "unit": "Minutes", "value": 10.0}, {"category": "Mini Quesadillas", "platform": "QuickEats", "unit": "Minutes", "value": 8.0}, {"category": "Mini Quesadillas", "platform": "DashNow", "unit": "Minutes", "value": 8.5}]
Prompt
Create an infographic that presents a centered title at the top, stating "Übertaktet vs. Standard-Takt", with the subtitle "Temperaturanstieg bei langen Gaming-Sessions" directly below it. The main visual is a line chart spanning the width of the infographic on a dark background, embodying a Gamer Aesthetic with vibrant RGB neon accents. This chart has a vertical axis on the left labeled with numerical values in increments of 10 from 30°C to 100°C, and a horizontal axis at the bottom with time labels: '0m', '15m', '30m', '45m', '60m', '75m', '90m', '105m', and '120m'. Horizontal grid lines mark each 10°C increment. A horizontal legend is positioned under the subtitle, containing a cyan circular marker and line for "Standard-Takt" and a magenta circular marker and line for "Übertaktet (+150MHz)". Two data series are plotted as glowing neon lines with hollow circular markers at each data point, accompanied by gradient shading below each line. The cyan "Standard-Takt" line shows a steep rise from 38°C at 0m to 68°C at 15m, followed by a flat plateau reaching 73.5°C at 120m. The magenta "Übertaktet" line displays a similar initial spike from 42°C to 75°C, but continues with a gradual linear creep up to 93°C at 120m. Spike annotations (callout boxes) point to the final data points on the right, highlighting the peak temperatures: a magenta box reads "Peak: 93°C" and a cyan box reads "Peak: 73.5°C". A stylized thermometer line-art icon is subtly placed in the center of the chart's background. 
The given data is : [{"profile": "Standard-Takt", "temperature": 38, "time": "0m"}, {"profile": "Übertaktet", "temperature": 42, "time": "0m"}, {"profile": "Standard-Takt", "temperature": 68, "time": "15m"}, {"profile": "Übertaktet", "temperature": 75, "time": "15m"}, {"profile": "Standard-Takt", "temperature": 71, "time": "30m"}, {"profile": "Übertaktet", "temperature": 79, "time": "30m"}, {"profile": "Standard-Takt", "temperature": 72, "time": "45m"}, {"profile": "Übertaktet", "temperature": 82, "time": "45m"}, {"profile": "Standard-Takt", "temperature": 72.5, "time": "60m"}, {"profile": "Übertaktet", "temperature": 85, "time": "60m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "75m"}, {"profile": "Übertaktet", "temperature": 87, "time": "75m"}, {"profile": "Standard-Takt", "temperature": 73, "time": "90m"}, {"profile": "Übertaktet", "temperature": 89, "time": "90m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "105m"}, {"profile": "Übertaktet", "temperature": 91, "time": "105m"}, {"profile": "Standard-Takt", "temperature": 73.5, "time": "120m"}, {"profile": "Übertaktet", "temperature": 93, "time": "120m"}]
Prompt
Create an infographic that displays data in a vertical diverging bar chart format. At the top left of the visualization, there is a title: 'Anomalie de l'Atlantique Sud : Dérive magnétique', and a subtitle: 'Vecteurs de dérive vers l'est et l'ouest en kilomètres par rapport à la ligne de base historique'. In the upper left area below the text, an icon of a compass rose is placed within a magnetic field line curve. The main chart features a horizontal zero-axis line, labeled with a '0' on the far left, representing the historical coordinate baseline. The x-axis at the bottom displays the decades '1980', '1990', '2000', '2010', and '2020', each marked with a small vertical tick. For each decade, a vertical bar extends from the zero-axis, with its corresponding data label positioned directly at the end of the bar. The data shows westward drift represented by blue bars extending below the axis for '1980' with a value of '-15 km' and '1990' with a value of '-32 km'. Eastward drift is represented by red bars extending above the axis for '2000' with a value of '+10 km', '2010' with a value of '+45 km', and '2020' with a value of '+68 km'. The overall visual style mimics a geophysical science journal, utilizing compass red and blue color tones. The given data is : [{"decade": "1980", "drift_km": -15}, {"decade": "1990", "drift_km": -32}, {"decade": "2000", "drift_km": 10}, {"decade": "2010", "drift_km": 45}, {"decade": "2020", "drift_km": 68}]
Prompt
Create an infographic in a corporate report minimalism style with muted corporate grays and blues, featuring a large title, 'Seasonal Fluctuations in 15-Year Mortgages', at the top. Directly below it is a subtitle, 'Historical prepayment velocities showing seasonal housing market trends'. Underneath the subtitle, a horizontal legend identifies two categories with small square icons: 'Spring/Summer Originations' in lighter gray-blue and 'Fall/Winter Originations' in darker gray-blue. The main visual is a multi-line chart in a wide landscape orientation. The vertical axis has numeric labels at 0.0, 5.0, 10.0, 15.0, and 20.0, with horizontal grid lines extending across the plot. The horizontal axis features labels: 'Jan 2018', 'Apr', 'Jul', 'Oct', 'Jan 2019', 'Apr', and 'Jul'. An icon depicting a sleek house silhouette is positioned in the upper left corner of the chart's plotting area. Two distinct lines represent the categories, characterized by cyclical seasonal bumps in the summer months. Both lines have square markers at each data point, with numerical values displayed near them. The lighter line for 'Spring/Summer Originations' plots a value of 8.0 in Jan 2018, rising to 12.5 in Apr, peaking at 16.0 in Jul, dipping to 11.0 in Oct, dropping further to 7.5 in Jan 2019, climbing to 13.0 in Apr, and reaching 17.5 in Jul. The darker line for 'Fall/Winter Originations' mirrors this pattern, starting at 6.5 in Jan 2018, increasing to 9.0 in Apr, hitting 14.5 in Jul, falling to 10.0 in Oct, bottoming out at 6.0 in Jan 2019, rising to 10.5 in Apr, and ending at 15.0 in Jul. 
The given data is : [{"category": "Spring/Summer Originations", "date": "2018-01", "value": 8.0}, {"category": "Fall/Winter Originations", "date": "2018-01", "value": 6.5}, {"category": "Spring/Summer Originations", "date": "2018-04", "value": 12.5}, {"category": "Fall/Winter Originations", "date": "2018-04", "value": 9.0}, {"category": "Spring/Summer Originations", "date": "2018-07", "value": 16.0}, {"category": "Fall/Winter Originations", "date": "2018-07", "value": 14.5}, {"category": "Spring/Summer Originations", "date": "2018-10", "value": 11.0}, {"category": "Fall/Winter Originations", "date": "2018-10", "value": 10.0}, {"category": "Spring/Summer Originations", "date": "2019-01", "value": 7.5}, {"category": "Fall/Winter Originations", "date": "2019-01", "value": 6.0}, {"category": "Spring/Summer Originations", "date": "2019-04", "value": 13.0}, {"category": "Fall/Winter Originations", "date": "2019-04", "value": 10.5}, {"category": "Spring/Summer Originations", "date": "2019-07", "value": 17.5}, {"category": "Fall/Winter Originations", "date": "2019-07", "value": 15.0}]
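
The chart-accuracy prompts above share one pattern: a natural-language chart description followed by the marker `The given data is : ` and a JSON array of records. The sketch below shows one way to assemble and re-parse a prompt in that style. The helper names `build_chart_prompt` and `extract_records` are hypothetical, and the description string is shortened for illustration; only the marker text and the records are taken from the prompts above.

```python
import json

# Separator text used verbatim in the prompts above.
DATA_MARKER = "The given data is : "


def build_chart_prompt(description, records):
    """Append the records as a JSON array after the chart description."""
    payload = json.dumps(records, ensure_ascii=False)
    return f"{description.strip()} {DATA_MARKER}{payload}"


def extract_records(prompt):
    """Recover the JSON records from a prompt built in this style."""
    _, _, payload = prompt.partition(DATA_MARKER)
    return json.loads(payload)


# Records copied from the magnetic-drift prompt above.
drift = [
    {"decade": "1980", "drift_km": -15},
    {"decade": "1990", "drift_km": -32},
    {"decade": "2000", "drift_km": 10},
    {"decade": "2010", "drift_km": 45},
    {"decade": "2020", "drift_km": 68},
]

prompt = build_chart_prompt(
    "Create an infographic that displays data in a vertical diverging bar chart format.",
    drift,
)
recovered = extract_records(prompt)
```

Keeping the data as structured JSON alongside the prose description makes it possible to compare the values rendered in a generated chart against the same records afterwards.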

Text Rendering Accuracy and Size Appropriateness

[Image comparison columns: U1-8B-MoT, 8B-MoT-Infographic, U1-8B-MoT, 8B-MoT-Infographic]
Prompt
该信息图以手绘笔记本风格呈现,标题为“吉伊卡哇带你游:加泰罗尼亚国家艺术博物馆(MNAC)三天两夜不绕路攻略”,副标题为“行程路线与时间安排(中文清晰版)”。整体采用暖黄色调背景,搭配棕色边框和螺旋装订线设计,营造出温馨可爱的旅行手册氛围。内容分为三个主要垂直区块,分别对应DAY 1、DAY 2、DAY 3,每个区块顶部有圆形时钟图标和“DAY X”标签,结构清晰。

每个日期区块内均以时间轴形式列出具体行程,使用圆点连接时间点与活动描述,右侧配有吉伊卡哇系列的可爱卡通形象插画(如白熊、蓝猫、兔子等),增强趣味性。所有文字均为简体中文,字体清晰易读,视觉层次分明。

---

**DAY 1:抵达与初探**
- **10:00** 抵达巴塞罗那,酒店办理入住 (Poble Sec区) —— 配有白熊拖着行李箱的插画。
- **12:00** 午餐:西班牙Tapas —— 插画未显示。
- **14:00** 前往西班牙广场 (Plaza de España),远眺MNAC全景 —— 配有西班牙广场建筑插画及地图箭头。
- **16:00** 参观MNAC外部建筑与周围花园 —— 配有蓝猫在花丛中跳跃的插画。
- **19:00** 欣赏魔幻喷泉表演 (Magic Fountain) —— 配有带闪光效果的白熊插画。
- **20:30** 晚餐:附近餐厅 —— 插画未显示。

---

**DAY 2:MNAC深度艺术之旅**
- **09:30** 早餐,步行至MNAC入口 —— 配有白熊吃面包的插画。
- **10:00** 进入MNAC (建议提前购票),参观罗马式艺术馆藏 —— 配有古典油画插画。
- **12:30** 馆内简餐或附近午休 —— 插画未显示。
- **14:00** 参观哥特式、文艺复兴及巴洛克艺术馆藏 —— 配有蒙娜丽莎风格肖像画插画及蓝猫形象。
- **16:30** 探索现代艺术馆藏 (加泰罗尼亚现代主义) —— 配有抽象艺术风格插画。
- **18:30** 前往MNAC屋顶观景台,俯瞰城市日落 —— 配有兔子举手机拍照的插画。
- **20:00** 晚餐:Arenas商场附近 —— 插画未显示。

---

**DAY 3:蒙特惠奇山周边与返程**
- **09:00** 早餐,退房寄存行李 —— 插画未显示。
- **10:00** 乘坐缆车前往蒙特惠奇城堡 (Montjuïc Castle) —— 配有缆车插画,内含三只卡通动物。
- **12:00** 参观米罗基金会 (Joan Miró Foundation) —— 配有米罗风格抽象雕塑插画。
- **13:30** 午餐:奥林匹克港附近海鲜饭 —— 插画未显示。
- **15:00** 漫步奥林匹克公园 —— 插画未显示。
- **16:30** 提取行李,前往机场/车站返程 —— 配有开心挥手的白熊插画。

---

**底部交通贴士栏**:
配有公交车、地铁、步行鞋图标,文字为:“交通贴士:善用T-casual交通卡,步行探索更佳!”

---

整体图表类型为时间序列流程图,通过垂直分栏与水平时间轴结合的方式组织信息。数据编码方式包括时间点(精确到分钟)、地点名称、活动描述及配套插画,所有信息均按逻辑顺序排列,便于用户快速理解并执行三天行程计划。视觉元素丰富,兼具实用性和趣味性,适合旅游攻略类内容传播。
Prompt
The infographic presents a comprehensive architectural and structural analysis of the Temple of Kom Ombo, an ancient Egyptian temple located on the west bank of the Nile River. The title "TEMPLE OF KOM OMBO" is prominently displayed in a hand-drawn, white-bordered box in the lower-right corner of the image, set against a brown background that mimics sandstone or earth tones. The overall layout is divided into multiple sections: a central photographic image of the temple ruins under a clear blue sky, surrounded by illustrative technical diagrams, annotated floor plans, and textual data blocks, all rendered in white line art and text for high contrast.

The central photograph shows the main hypostyle hall and surrounding structures of the temple, with visitors walking among the columns and courtyards, providing a sense of scale. In the background, the Nile River and palm trees are visible, situating the temple in its natural environment. The ruins are constructed from light-colored sandstone blocks, consistent with the material noted in the text.

In the upper-left quadrant, a 3D axonometric diagram illustrates the overall dimensions of the temple complex: approximately 62 meters by 51 meters, labeled along the axes. Adjacent to this, a list of key structural facts is presented in bullet points:
- TEMPLE AXIS: DOUBLE SANCTUARY FOR SOBEK & HORUS
- OVERALL DIMENSIONS (APPROX. 62M x 51M)
- CONSTRUCTION MATERIAL: SANDSTONE BLOCKS
- COLUMN HEIGHTS: UP TO 12 METERS

Above the central photo, two schematic diagrams illustrate architectural details:
- A top-down view of the hypostyle hall showing 30 columns arranged in a grid, labeled “HYPOSTYLE HALL (30 COLUMNS)” and pointing to “TWO SANCTUARIES.”
- A cross-section labeled “PYLON AND HYPOSTYLE SECTION,” which includes a detailed vertical cutaway showing the roofing system supported by columns, with arrows indicating load paths down to foundations.

To the right of the central image, text notes “TWO ENTRANCES SYMBOLIZING DUALITY,” emphasizing the temple’s unique dual dedication. This concept is reinforced in the lower section of the infographic, where a detailed floor plan is overlaid on the brown ground area.

The floor plan, drawn in white lines, is annotated with various features:
- INNER TEMPLE (FOR SOBEK) — marked with a rectangular inner sanctum.
- INNER TEMPLE (FOR HAROERIS) — another distinct inner sanctum, indicating the dual religious function.
- NILOMETER — a structure used to measure the Nile’s water level.
- BIRTH HOUSE (MAMMISHI) — a smaller chamber associated with fertility rituals.
- MUMMIFIED CROCODILE MUSEUM SITE — indicating a location within the temple complex for sacred crocodile mummies.
- TWO ENTRANCES SYMBOLIZING DUALITY — shown as two separate entryways on the plan.

Surrounding the floor plan are inset images of relief carvings, each labeled:
- MEDICAL INSTRUMENT RELIEFS — depicting figures with tools.
- TWO ENTRANCES RELIEFS — showing doorways flanked by deities.
- CALENDAR RELIEFS — illustrating scenes related to timekeeping or agricultural cycles.

Additional annotations point to structural aspects:
- “STRUCTURAL LOAD PATHS FROM COLUMNS TO FOUNDATIONS” — illustrated with curved arrows tracing the force transfer from columns through the walls to the ground.
- The pylon and hypostyle section diagram also labels “ROOFING SYSTEM” and shows how the roof beams rest on column capitals.

All textual content is in English, using a clean, sans-serif font that enhances readability. The visual style blends real photography with technical illustrations and hand-drawn elements, creating an educational and engaging format suitable for tourists, students, or archaeologists. The infographic effectively communicates both the physical characteristics and symbolic significance of the Temple of Kom Ombo, highlighting its duality, engineering, and cultural importance.
Prompt
该信息图以黑板风格设计,标题为“地方特色&活动微信公众号推广全指南”,整体采用手绘粉笔字效果,配以彩色图标和箭头,视觉上模拟真实黑板书写场景。内容结构清晰,分为三个主要部分,通过灰色弧形箭头连接,形成逻辑递进关系:从推广内容核心方向 → 高转化活动推广玩法 → 微信公众号生态适配推广技巧。

第一部分:“推广内容核心方向:深挖本地特色记忆点”,强调通过三类高流量本地内容吸引用户共鸣并吸引外地游客打卡:
- **本土美食**(黄色椭圆标签):包含老字号小吃、季节性特色食俗、社区隐藏小店探店内容,配有热汤碗与筷子图标。
- **人文风物**(棕色椭圆标签):涵盖非遗技艺传承故事、老街老巷历史、本地名人旧居探访内容,配有传统建筑与布鞋图标。
- **便民福利**(粉色椭圆标签):包括本地专属消费券、景区免票政策、节庆活动预告等内容,配有优惠券与礼盒图标。

第二部分:“高转化活动推广3种实用玩法”,旨在拉满参与转化率:
- **节庆市集玩法**(橙色椭圆标签):公众号预热发早鸟票+留言抽免费参与名额+现场打卡返现,配有灯笼与摊位图标。
- **非遗体验玩法**(绿色椭圆标签):开放公众号专属报名通道+提前发布体验官预告内容+活动后用户投稿返现,配有陶艺与织布机图标。
- **消费促进玩法**(紫色椭圆标签):联合本地商家推出公众号专属消费券包+到店核销送定制周边,配有购物袋与银行卡图标。

第三部分:“微信公众号生态适配推广技巧”,聚焦降低推广成本:
- **内容呈现技巧**(蓝色椭圆标签):封面图用本地标志性建筑/美食做视觉符号,首图放置活动倒计时海报,文末加一键报名跳转链接,配有手机图标。
- **渠道联动技巧**(黄色椭圆标签):视频号发布活动花絮挂载公众号链接,朋友圈广告定向推送给本地18-60岁人群,本地社群转发带专属抽奖码,配有三人社交网络图标。
- **私域留存技巧**(绿色椭圆标签):活动参与者引导添加企业微信,拉入本地福利群后续持续推送活动信息,配有微信对话气泡图标。

整个信息图布局呈垂直流线型,各模块之间以曲线箭头连接,右侧点缀有简笔小人和感叹号等装饰元素,增强趣味性和可读性。文字排版层次分明,主标题白色粗体,副标题与核心概念使用黄色或彩色突出,细节说明则为白色常规字体。所有文本均为中文,无英文或其他语言内容。
Prompt
该信息图题为《儿童营养补充全指南:科学建议+产品选购要点》,采用漫画风格设计,色彩鲜明,以红、黄、蓝为主色调,布局清晰分为左右两大板块,每个板块又细分为多个模块,图文并茂地呈现了儿童营养补充的科学指导与实用建议。

整体结构分为“科学参考指引”和“实操应用指南”两大核心部分,通过卡通插图、图标、爆炸式对话框、标签等视觉元素增强可读性与吸引力。

---

**第一部分:科学参考指引**

1. **分龄营养补充重点清单**
- 标题:“分龄营养补充重点清单”,副标题:“分龄补营养,精准更高效;对应年龄段按需补充,避免过度摄入”
- 内容按年龄分三个阶段:
- **0-6月龄**:每日常规补充维生素D 400IU,纯母乳喂养宝宝需额外补充维生素K。配图:婴儿头像、Vit D注射器、Vit K胶囊。
- **7月龄-3岁**:重点补充铁(Fe)、锌(Zn)、DHA,每日维生素D补充量维持在400-600IU。配图:幼儿头像、放大镜观察胶囊、Fe和Zn符号。
- **4-12岁**:重点补充钙(Ca)、维生素A、B族维生素(B_B),保证每日蛋白质摄入量达标。配图:男孩头像、Ca气泡、B_B气泡、鸡蛋、牛奶瓶、眼睛图标。

2. **营养补充原则&常见避坑指南**
- 标题:“营养补充原则&常见避坑指南”,副标题:“科学补营养,这些坑要避开”
- 包含两个核心原则:
- **优先膳食摄入**(绿色对勾):核心原则1:日常均衡饮食是营养摄入的首要来源,不可用补充剂代替正常三餐。配图:孩子用餐场景,盘中有蔬菜、水果、肉类。
- **按需适量补充**(红色STOP标志):核心原则2:营养素补充并非越多越好,过量摄入维生素A、钙等可能引发中毒或代谢负担。配图:多瓶补剂被红色叉号覆盖。
- **避坑指南**(黄色标签):
- ① 不做体检评估盲目跟风补 ❌
- ② 把网红补剂当零食给孩子吃 ❌
- ③ 用成人补充剂减量给儿童服用 ❌
- 配图:红色“避坑”爆炸框,带有闪电效果。

---

**第二部分:实操应用指南**

1. **儿童营养补充产品3步选购法**
- 标题:“儿童营养补充产品3步选购法”,副标题:“儿童补剂选购3步判断法”
- 三步法分别由放大镜图标引导:
- **看合规标识**:优先选择带蓝帽标识的保健食品,或有婴幼儿/儿童专用备案标识的正规产品,拒绝三无产品。配图:放大镜聚焦“蓝帽”标志。
- **看配料成分**:优先选择无额外添加蔗糖、香精、人工色素、防腐剂的产品,致敏原标注清晰明确。配图:文件上贴有“无添加”印章,绿色对勾。
- **看适配年龄**:选择标注对应适用年龄段的儿童专用产品,不要自行将成人补充剂减量给孩子服用。配图:药瓶标签上“年龄”被红圈突出。

2. **常见儿童补剂适用场景对照表**
- 标题:“常见儿童补剂适用场景对照表”
- 表格形式,两列:左侧“补剂类型”,右侧“适用场景”,背景色交替为红、蓝。
- 具体内容:
- **维生素D滴剂** → 全年龄段儿童日常常规补充,预防佝偻病、促进钙吸收。配图:滴管瓶、骨头图标。
- **铁剂** → 体检确诊缺铁性贫血,或日常红肉、动物肝脏摄入不足的儿童。配图:滴管瓶、儿童头像。
- **DHA藻油** → 日常深海鱼摄入不足的儿童,辅助促进视网膜和大脑发育。配图:鱼形胶囊、大脑与眼睛图标。
- **钙剂** → 日常奶量不足、身高增长偏缓,经体检确认缺钙的儿童。配图:白色药片、儿童测量身高图。

---

**视觉与排版特征:**
- 整体采用网格化布局,四个主要模块分布在2x2的象限中。
- 使用大量漫画元素:如爆炸框、对话气泡、箭头、感叹号、禁止符号等。
- 图标系统丰富:Vit D、Fe、Zn、Ca、B_B、蓝帽、无添加、年龄、STOP等均有专属图形标识。
- 字体加粗、阴影、边框强调关键信息,如标题、数字、警示语。
- 色彩编码明确:黄色用于提示重点,蓝色用于说明步骤,红色用于警示或禁止。

该信息图内容全面,逻辑清晰,兼具科学性和实用性,适合家长快速掌握儿童营养补充的核心知识与选购技巧。

Paper Rendering Quality

[Image comparison columns: U1-8B-MoT, 8B-MoT-Infographic, U1-8B-MoT, 8B-MoT-Infographic]
Prompt
[typesetting]

The page is laid out with two tables at the top, followed by a two-column text layout. The tables span the full width of the text area. The text includes a section heading.

[paragraphs]

the TOPIC MODELER, the GENDER SEGMENTER, and an OTHER module (transcript length and duration). We test for a linear relationship between each pair of variables: $H_O : r = 0$, $H_A : r \neq 0$, where $H_O$ is the origi-nal hypothesis, $H_A$ is the alternate hypothesis, and $r$ is the Pearson’s correlation coefficient. We follow Reddy et al. (2021) and Yang et al. (2019) and apply a Bonferroni cor-rection to our $\alpha$ value of $0.05$, setting $\alpha = 0.05/z$, where $z = \binom{124}{2} = 7,626$ for LDA, representing the number of feature relationships we consider. Hence, we reject $H_O$ in favor of $H_A$ if $p \leq \alpha$. Given the largeness of $z$, our $\alpha$ value becomes small, making our criteria for significance strict and thus suitable for investigating our research ques-tions. Furthermore, we filter our correlations $r$, such that $\Vert r\Vert > 0.1$ for our LDA experiments, and $\Vert r\Vert > 0.05$ for our BERTopic experiments (due to the smaller sample size of 10,000 podcasts, and fewer samples may have higher vari-ance). Our results focus on a selection of these significant correlations; the full results are available on the project web-site: https://www.gendered-discourse.net/extended-results.

### RQ0: How Are Women and Men’s Discourse Different?

Using GDCF, our Gendered Discourse Correlation Frame-work shown in Figure 2, we then analyze significant corre-lations between between the gender features from the GEN-DER SEGMENTER module (Doukhan et al. 2018a), and the topic features from the TOPIC MODELER module (Blei, Ng, and Jordan 2003). We use the discourse topics to automati-cally form gendered discourse word lists via their significant correlations.

Starting with the first row of Table 1, we see that Topic 3’s word list returned by LDA with Non-Contextual Embed-dings (Bag-Of-Words) (via the TOPIC MODELER module) contains the words women, woman, men, baby, pregnant, girls, men, doctor, health, birth (in descending weighted or-der). Based on this word list, we manually interpret this topic as being a content topic, specifically about pregnancy, as noted in the column “Topic N Categories.” Then, we look to the gender correlations in the columns “Gender” and “$r$,” and see that $r(\text{Topic 3, Women}) = +0.15$ and $r(\text{Topic 3, Men}) = -0.14$. This indicates that the topic of pregnancy positively correlates with women (identified via the GENDER SEGMENTER module), and negatively corre-lates with men. Therefore, we associate Topic 3 (Content - Pregnancy) with Women, as noted in the “Topic N Gender” column. Similarly, we make these associations in the “Topic N Gender” column for Topics 10, 49, and 71.

Next, we focus on the Topic 54 row. This topic is inter-preted using the word list get, like, know, right, people, go-ing, podcast, make, want, one. This word list does not refer to any content, hence, we manually interpret this topic as being a discourse topic. Moving to the gender correlations, we see that $r(\text{Topic 54, Women}) = \emptyset$ and $r(\text{Topic 3, Men}) = +0.12$. The reason for $r(\text{Topic 54, Women}) = \emptyset$ is because the correlation between the features Topic 54 and Women did not come back as significant. However, due to the positive correlation of $0.12$ for Topic 3 and Men, we manually as-sociate Topic 3 with Men in the “Topic N Gender” column.

[tables]

Table 1: LDA with Non-Contextual Embeddings (Bag-Of-Words): The complete set of significant correlations between gender features and topic features – both content topics and discourse topics. Based on $r$, the Topic N Gender forms the gendered (discourse) word lists via Topics 54 and 60 (the masculine word lists) and Topic 62 (the feminine word list).

| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |
|---|---|---|---|---|---|
| Topic 3 | Women<br>Men | 0.15<br>-0.14 | women, woman, men, baby, pregnant, girls, men, doctor, health, birth | Content - Pregnancy | Women |
| Topic 10 | Women<br>Men | 0.10<br>-0.12 | energy, body, feel, mind, space, yoga, love, beautiful, feeling, meditation | Content - Yoga | Women |
| Topic 49 | Women<br>Men | -0.21<br>0.17 | game, know, think, team, going, mean, play, year, one, good | Content - Sports | Men |
| Topic 71 | Women<br>Men | 0.14<br>-0.14 | christmas, sex, girl, hair, love, get, date, girls, let, wear | Content - Dating | Women |
| Topic 54 | Women<br>Men | –<br>0.12 | get, like, know, right, people, going, podcast, make, want, one | Discourse | Men |
| Topic 60 | Women<br>Men | -0.27<br>0.20 | going, know, think, get, got, one, really, good, well, yeah | Discourse | Men |
| Topic 62 | Women<br>Men | 0.33<br>-0.28 | like, know, really, going, people, want, think, get, things, life | Discourse | Women |

Table 2: BERTopic with Contextual Embeddings (BERT, ChatGPT, Llama): The complete set of significant correlations between gender features and topic features for discourse topics only (content topics are omitted).

| Topic N | Gender | $r$ | Topic N Word List | Topic N Categories | Topic N Gender |
|---|---|---|---|---|---|
| Topic 0 | Women<br>Men | -0.08<br>0.10 | like, yeah, know, oh, right, podcast, got, going, think, really | Discourse | Men |
| Topic 2 | Women<br>Men | 0.08<br>-0.08 | life, know, things, really, people, feel, like, want, love, going | Discourse | Women |
| Topic 5 | Women<br>Men | 0.08<br>– | like, know, think, yeah, episode, really, going, anchor, kind, right | Discourse | Women |
Prompt
[typesetting]

The page is a standard academic paper layout with a single column. The text is justified and divided into sections and subsections, indicated by numbered headings. Important terms at the beginning of some paragraphs are bolded. A horizontal rule separates the header from the main content, and another rule separates the main content from the footnote at the bottom.

[paragraphs]

Preprint Version.

**Figure–Table Integration.** In addition to textual refinement, we extend the refinement process to include multimodal elements, to further enhance readability. For each section, the model first generates visualization requirements, such as tables with structured comparisons or figures with explanatory diagrams, together with natural language descriptions. Based on these descriptions, candidate figures and tables are synthesized. The compiled outputs are then fed back to an LLM for quality assessment, enabling automatic detection of issues such as oversized layouts or unreadable text. The LLM provides corrective suggestions, which are applied to improve the final visualizations. Finally, the text is refined again to ensure that all generated figures and tables are properly referenced within the survey.

# 4 EXPERIMENTS

## 4.1 EXPERIMENTAL SETTINGS

**Implementation Details.** Following Wang et al. (2024b), we adopt **GPT-4o-mini** as our genera-tion model for its balance of responsiveness and cost. Our retrieval database contains 680K computer science papers from arXiv, with PDFs converted into structured Markdown using MinerU (Wang et al., 2024a) for consistent formatting. The details of the retrieval process are provided in App. A.1. In outline generation, the system consults 1000–1200 papers, with a maximum of 8 sections. For section drafting, each subsection retrieves up to 60 additional relevant papers, combined with those linked during outline generation. Finally, we apply two iterations of the review-and-refine loop to enhance coherence across sections and improve overall readability. Illustrative outputs compared with AutoSurvey are provided in App. A.8.

**Baselines.** We compare IterSurvey with a set of baselines, ranging from simple retrieval-augmented generation (Naive RAG), which directly drafts from retrieved documents, to more ad-vanced state-of-the-art systems. Specifically, we evaluate against AutoSurvey (Wang et al., 2024b), the first systematic framework for this task; SurveyForge (Yan et al., 2025), which combines heuris-tic outline generation based on the logical structures of human-written surveys with a memory-driven scholar navigation agent for high-quality retrieval; and SurveyGo (Wang et al., 2025), which em-ploys the LLM×MapReduce-V2 algorithm to address the long-context challenge. We also compare with SurveyX (Liang et al., 2025), which introduces an Attribute Tree-based outlining mechanism; however, due to access restrictions, we include SurveyX only in arena experiments. All methods are evaluated on the same retrieval database with generation hyperparameters aligned to their original settings for fairness.

## 4.2 AUTOMATIC EVALUATION RESULTS

**Evaluation Setup.** We employ multiple complementary protocols to evaluate the quality of gen-erated surveys. On the 20-topic suite from Wang et al. (2024b), we adopt multi-dimensional scoring with LLM-as-a-judge. Content quality is assessed along three dimensions: coverage, structure, and relevance followed from Wang et al. (2024b). Besides, citation quality is evaluated using the NLI-based protocol of Gao et al. (2023), reporting both recall and precision: _Citation Recall_ measures whether all statements in the generated text are fully supported by the cited passages, while _Citation Precision_ identifies irrelevant citations to ensure that references are pertinent and directly support the claims. To improve scoring stability and reliability, prompts are standardized and judges must pro-vide a rationale before assigning scores. For additional robustness, we aggregate outputs from three judge models: GPT-4o, Claude-3.5-Haiku, and GLM-4.5V.1 Full prompts are provided in App. A.7.

**Results.** The results on the 20 topics from Wang et al. (2024b) are reported in Tab. 1. Statistical significance was confirmed via paired t-tests, indicating that IterSurvey consistently outperforms baseline models ($p < 0.05$). We summarize the main observations below.

- **Overall superiority.** IterSurvey consistently outperforms all baselines across both content and citation quality, achieving the highest overall average score (4.75). This demonstrates that the proposed framework is effective and robust across multiple evaluation dimensions.

[page_number]

6

[footnotes]

1Specifically, we use `chatgpt-4o-latest`, `claude-3-5-haiku-20241022`, and `glm-4.5v`.
Prompt
[typesetting]

This is a single-column page containing mostly text, structured with section headings and bold inline subheadings. URLs are formatted in a monospaced font and hyperlinked.

[paragraphs]

# A Image generation models

This section details the two diffusion image generation models used in this work, namely Stable Diffusion 1.4 and 1.5.

**Stable Diffusion 1.4** The Stable Diffusion model is a text-conditioned image generator model that combines an autoencoder with a diffusion model to create a latent diffusion model. The autoencoder encodes images into latent representations with a reduced dimensionality when compared to the input image, reducing the computational needs during the training phase. Text prompts, on the other hand, are encoded using a text encoder and are then cross-attended by the UNet backbone of the latent diffusion model. Finally, the loss is computed using a reconstruction objective between the noise added to the latent representation and the prediction made by the UNet.
Stable Diffusion 1.4 (https://huggingface.co/CompVis/stable-diffusion-v1-4) had several rounds of training on the LAION dataset (https://laion.ai/), with each round changing the input image dimension, aesthetic score, and the probability of dropping the text-conditioning to improve classifier-free guidance.

**Stable Diffusion 1.5** SD 1.5, in turn, has the same architecture and even the same starting point as 1.4, with the difference being how long the model was fine-tuned on top of SD 1.2. The 1.4 version is fine-tuned for 225 thousand steps at resolution 512x512 on “laion-aesthetics v2 5+” with a 10% probability of dropping the text-conditioning, and version 1.5 for 595 thousand steps.
As demonstrated in Section D Stable Diffusion 1.4 has better performance than 1.5 in our approach, therefore, we will adopt SD 1.4 for most of the experiments in this paper.

# B Large language models

Here we give additional details on the large language models that we used in our experiments.

**Gemma** (Mesnard et al., 2024), trained on a diverse 6 Trillion token dataset comprising web documents, code and mathematical texts. We resorted to the 7 Billion parameter instruction-tuned decoder-only model, named _gemma-7b-it_ (https://huggingface.co/google/gemma-7b-it). This model uses a chat template, which we employ during inference.

**Llama 2** (Touvron et al., 2023), of which we used the 7 Billion parameter, pre-trained-only model, _Llama-2-7b_ (https://huggingface.co/meta-llama/Llama-2-7b-hf). This model was trained with a mix of publicly available data totalling 2 Trillion tokens. While its chat versions employ supervised fine-tuning and reinforcement learning with human feedback for alignment with human preferences in helpfulness and safety, the pre-trained-only model does not. This results in a less constrained model, but it may also cause it to disperse from the task at hand. Since this model is a pre-trained-only no chat template is needed.

**Mistral** (Jiang et al., 2023) fine-tuned on various HuggingFace instruction datasets. We resorted to the 7 Billion _Mistral-7B-Instruct-v0.2_ model (https://huggingface.co/mistralai/ Mistral-7B-Instruct-v0.2) and used the respective chat template during inference.

**Phi-2** (Gunasekar et al., 2023) is a compact 2.7 Billion model (https://huggingface.co/microsoft/ phi-2). Despite its size, it offers a competitive performance with respect to models several times its size. It was trained on 250 Billion tokens, obtained through a combination of NLP synthetic data created by GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by GPT-4. This model was not fine-tuned through reinforcement learning from human feedback and does not have guardrails.

**Model ranking**
A ranking of these models in terms of their performance can be found in the HuggingFace leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) which assesses several LLMs that are trained under the same criteria and tested on the same benchmarks, including reasoning
Prompt
[typesetting]

The page is a standard academic paper layout, likely from a preprint server like arXiv. It features a title, author list with affiliations, an abstract, and the beginning of the "Introduction" section. A preprint notification ("Preprint. Under review.") is present at the bottom left. The text on the left margin ("arXiv:2502.01522v2 [cs.CV] 30 May 2025") is a vertical stamp typical of arXiv submissions.

[paragraphs]

arXiv:2502.01522v2 [cs.CV] 30 May 2025

# Unpaired Deblurring via Decoupled Diffusion Model

**Junhao Cheng**$^1$, **Wei-Ting Chen**$^2$, **Xi Lu**$^1$, **Ming-Hsuan Yang**$^3$
$^1$Sun Yat-sen University $^2$ Microsoft $^3$ University of California, Merced
https://github.com/donahowe/UID-Diff

**Abstract**

Generative diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. In favor of their ability to supplement missing details and generate aesthetically pleasing contents, recent works have applied them to image deblurring via training an adapter on blurry-sharp image pairs to provide structural conditions for restoration. However, acquiring substantial amounts of realistic paired data is challenging and costly in real-world scenarios. On the other hand, relying solely on synthetic data often results in overfitting, leading to unsatisfactory performance when confronted with unseen blur patterns. To tackle this issue, we propose UID-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains by decoupling structural features and blur patterns through joint training on three specially designed tasks. We employ two Q-Formers as structural features and blur patterns extractors separately. The features extracted by them will be used for the supervised deblurring task on synthetic data and the unsupervised blur-transfer task by leveraging unpaired blurred images from the target domain simultaneously. We further introduce a reconstruction task to make the structural features and blur patterns complementary. This blur-decoupled learning process enhances the generalization capabilities of UID-Diff when encountering unknown blur patterns. Experiments on real-world datasets demonstrate that UID-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation in various challenging scenarios.

# 1 Introduction

Dynamic blur occurs when the camera and subject move relative to each other during the exposure time, resulting in a smeared and blurred image. Deblurring, the process of removing the blur pattern while preserving the underlying structure of degraded images, is essential for restoring high-quality images for human perception and low-level computer vision applications.

With the rapid advancement of photographic technology, a wide range of imaging devices are now employed to capture images in real-world scenarios. Due to their diverse lenses and structural designs, these devices may produce distinct blur patterns [1, 2, 3]. This diversity makes it challenging to develop an all-in-one method for deblurring images from arbitrary and varied sources. Consequently, focusing on deblurring algorithms tailored to specific domains has become increasingly significant.

As deep learning has advanced in recent years, existing deblurring models predominantly build on data-driven approaches that employ neural networks trained via supervised learning on synthetic paired data. Existing works have made efforts to develop deblurring models upon CNN [4, 5], Transformer [6, 7], and GAN [8, 9]. Recently, a new wave of research [10, 11, 12] has begun to investigate the integration of pre-trained generative diffusion models [13], such as Stable Diffusion (SD) [14], with an adapter designed to provide structural guidance for deblurring. These approaches aim to harness the generative capabilities of diffusion models to supplement missing details and generate aesthetically pleasing outputs. However, since paired blurry-sharp training data is limited in

[footnotes]

Preprint. Under review.

Overall Aesthetics (overall layout and content understanding)

Side-by-side comparison (images omitted): U1-8B-MoT | 8B-MoT-Infographic
Prompt
该信息图以“曲尼司特”为标题,整体采用浅蓝白色调,布局清晰,分为多个模块化区域,围绕中央的透明胶囊图像展开。右上角展示曲尼司特的化学结构式及其分子式 C₁₈H₁₇NO₄。

**1. 活性成分数据(左上)**
- 以环形图形式展示成分比例:
- 曲尼司特 >98%
- 辅料 <2%
- 图下方标注:“纯度高,临床级标准”

**2. 适应症(右上)**
- 通过三个图标及文字说明:
- 鼻部图标:过敏性疾病
- 皮肤纹理图标:纤维化
- 疤痕图标:瘢痕疙瘩

**3. 剂量矩阵(中左)**
- 表格形式,包含两列:“口服”和“频率”
- 成人:100mg / 次;频率:1-3 次 / 天
- 儿童:咨询医生;频率:遵医嘱

**4. 药代动力学时间轴(中右)**
- 折线图,横轴为时间(0h 至 24h),纵轴为浓度(无刻度)
- 0h:吸收开始(水滴图标)
- 1-2h:峰值浓度(山峰图标)
- 4-6h:分布/代谢(循环箭头图标)
- 24h:排泄(垃圾桶图标)
- 图中标注半衰期 ≈ 5-8h

**5. 警告网格(左下)**
- 分为四个象限,每个配有图标和文字:
- 相互作用:CYP酶抑制剂/诱导剂(齿轮图标)
- 副作用:胃肠道不适,皮疹(胃图标)
- 肝功能:定期监测(肝脏图标)
- 肾功能:慎用(肾脏图标)

**6. 患者适用性(中下)**
- 两个图标组合:
- 成人:人物图标 + 对勾,标注“成人 适用”
- 儿童:人物图标 + 问号 + 医生图标,标注“儿童 咨询医生”

**7. 储存指南(右下)**
- 三个图标并列:
- 温度计图标:2-25℃ 室温
- 密封瓶图标:密闭
- 遮光图标(太阳加斜线):避光

整体设计风格现代、专业,使用大量图标辅助理解,数据可视化清晰,适合医疗或药品宣传场景。所有文本均为中文,语言准确,无冗余描述。
Prompt
The infographic presents an augmented reality (AR) shopping experience overlaid on a real-world retail environment. The scene is set in a brightly lit cosmetics aisle of a store, with shelves stocked with beauty products visible in the background. In the foreground, a pair of hands holds a black rectangular compact labeled "ANASTASIA BEVERLY HILLS BROW POWDER DUO" with "EBONY" and "NET WT. 2.5 OZ." printed below. A gold ring is visible on the left hand’s ring finger, and a black wristband is partially seen on the left wrist.

Overlaid on the image are several semi-transparent, rounded-corner UI elements resembling AR pop-ups or digital cards, providing contextual information about the product and the user’s shopping list.

On the left side, a vertical panel titled "SHOPPING LIST" lists four items:
1. Face Wash — marked with an “X” (completed)
2. Shampoo — marked with an “X” (completed)
2. Eye Cream — marked with an empty checkbox (not completed; duplicated item number)
3. Eye Cream — marked with an empty checkbox (not completed)

This suggests a possible error or duplication in the list, with two entries for "Eye Cream".

In the center-right, a speech-bubble-shaped label displays the price: "$23.00".

To the right of the product, a larger panel titled "PRODUCT DETAILS:" provides information about the "ABH Brow Powder Duo". It features two color swatches:
- Left swatch: labeled "DEEP BROWN"
- Right swatch: labeled "BLACK"

Below the swatches, a star rating system shows four and a half filled stars, accompanied by the text "4.5 out of 5 stars".

Underneath the rating, a section titled "COMMON USES:" states: "DEFINES & FILLS BROWS".

Further down, a smaller rectangular box labeled "KEY INGREDIENTS" lists:
- Vitamin E
- Finely Milled Pigments

At the bottom right, another box titled "APPLICATION TIPS" includes a video icon (a rectangle with a play triangle) and the word "Video", indicating a multimedia tutorial is available.

The overall layout mimics an immersive AR interface, likely from a smart glasses or smartphone application, designed to enhance in-store shopping by providing instant, interactive product data directly within the user’s field of view. The visual style uses dark gray, translucent backgrounds with white text for high contrast and readability against the busy store backdrop. The design emphasizes usability, with clear categorization of information into distinct panels and intuitive icons. All textual content is in English, and no other languages are present.
Prompt
该信息图以深蓝色科技感背景为主,配以紫色和青色的电路板图案边框,营造出未来数字设备的视觉氛围。标题“谷歌最新血氧仪机型参数对比(社媒版)”位于顶部中央,使用发光白色字体,突出主题。整体布局为横向三栏式对比结构,左侧为参数类别标签列,中间及右侧分别为三款智能穿戴设备的参数详情。

左侧参数类别列以图标+文字形式垂直排列,包括:
- 芯片(图标为芯片符号)
- 电池(图标为电池符号)
- 功能(图标为心电波形符号)
- 重量(图标为秤盘符号)
- 价格(图标为价格标签符号)
- 发售时间(图标为日历符号)

中间三栏分别对应三款产品:
1. **高亮推荐机型:Google Pixel Pulse(最新推荐)**
- 标题上方有金色星形徽章“★ 高亮推荐机型”,并用金色边框高亮显示。
- 芯片:Tensor G4定制芯片
- 电池:7天续航,快充
- 功能:连续血氧监测,睡眠/压力追踪,AI健康指导
- 重量:28克(轻盈)
- 价格:¥1999
- 发售时间:2024年10月

2. **竞品A(例如:Apple Watch S9)**
- 芯片:S9 SiP芯片
- 电池:18小时(正常使用)
- 功能:按需血氧,心电图APP,摔倒检测
- 重量:32克
- 价格:¥3199
- 发售时间:2023年9月

3. **竞品B(例如:Garmin Venu 3)**
- 芯片:Elevated V5传感器
- 电池:14天(智能模式)
- 功能:全天候血氧,身体电量,GPS运动
- 重量:35克
- 价格:¥2499
- 发售时间:2023年8月

所有数据均采用清晰的横向分隔线组织,每项参数内容居中对齐,字体为简洁现代的无衬线体,颜色为浅蓝或白色,确保可读性。高亮推荐机型使用金色边框和更明亮的文字,形成视觉焦点。

底部有一行注释文字:“注:以上参数仅供参考,具体以官方发布为准。#科技 #健康 #谷歌新品 #血氧仪对比”,字体较小,颜色较暗,作为补充说明。

整体设计风格现代、科技感强,通过色彩对比、边框高亮和图标辅助,有效传达了各机型在关键性能指标上的差异,尤其突出了Google Pixel Pulse在续航、价格和功能集成方面的优势。
Prompt
该信息图以复古手绘风格呈现,整体布局如一本打开的泛黄书页,背景为米黄色仿旧纸张质感,边缘带有不规则撕裂效果。标题“博物馆游览扩展内容与要点”位于顶部中央,字体为深棕色艺术字,两侧饰有卷曲花纹装饰,视觉上突出主题。

全图采用六点式结构化布局,围绕中心分布六个核心模块,每个模块均配有独立插画、编号标题和说明文字,通过装饰性边框、花环、丝带等元素进行区分与美化。整体设计风格温馨、文艺,融合了音乐符号、星星、薰衣草、云朵等点缀元素,营造出轻松愉悦的文化探索氛围。

各模块内容如下:

1. **沉浸式体验**
- 标题:“1. 沉浸式体验”
- 说明文字:“参与互动展览,感受历史场景还原,身临其境。”
- 视觉元素:左侧描绘一位金发男孩手持放大镜观察一个微缩历史街景模型(包含房屋、摊位和人物),上方有齿轮与灯泡组成的思考气泡,象征探索与发现。右侧配有一个系着粉色蝴蝶结的礼物盒,标签写有“SURPRISE”。

2. **主题讲座与工作坊**
- 标题:“2. 主题讲座与工作坊”
- 说明文字:“聆听专家深度解读,亲手制作手工艺品,学习新知。”
- 视觉元素:右侧展示一张木桌,桌上摆放陶壶、陶罐、刻刀等手工工具,旁边堆叠书籍与卷轴;周围环绕橄榄枝花环,上方悬挂一串风铃(含月亮、星星与铃铛),背景点缀云朵与星光。

3. **馆藏珍品探索**
- 标题:“3. 馆藏珍品探索”
- 说明文字:“寻找镇馆之宝,了解背后的故事与文化价值,深度挖掘。”
- 视觉元素:左侧是一个打开的木质宝箱,内有青铜鼎状文物与发光卷轴;旁有绿色玉璧吊坠、散落铜钱,以及一支点燃的白色蜡烛,烛台装饰有薰衣草与小花束。

4. **特色导览路线**
- 标题:“4. 特色导览路线”(置于米色丝带横幅中)
- 说明文字:“跟随定制路线,发现隐秘角落与独特视角,别样精彩。”
- 视觉元素:下方是一张展开的复古地图,标有拱门、凉亭、佛像、雕塑等景点,以红色虚线连接,并配有指南针图标,体现路径规划概念。

5. **数字化互动**
- 标题:“5. 数字化互动”(置于圆形波点边框内)
- 说明文字:“利用AR/VR技术,打破时空限制,体验虚拟现实。”
- 视觉元素:右侧描绘一位戴VR眼镜的人正在触控空中悬浮的陶罐图像,周围有Wi-Fi信号、数据图表、声波图等科技元素,体现数字交互场景。

6. **文化衍生品**
- 标题:“6. 文化衍生品”
- 说明文字:“选购独特纪念品,将博物馆记忆带回家,延续美好。”
- 视觉元素:左下角陈列多种文创商品,包括印有博物馆建筑图案的帆布袋(标有“MUSEUM”)、笔记本、明信片、徽章;右下角则是一盘精致三明治(面包上烙有五角星图案),配蓝莓与卷饼,旁有一只戴派对帽、系蓝色蝴蝶结的白鹅,口中喷出音符,充满童趣。

整张信息图通过图文结合的方式,系统介绍了博物馆参观的六大延伸活动,既传达实用信息,又兼具美学感染力,适合用于宣传册、教育海报或线上推广材料。所有文本均为中文,无英文或数字编码,语言风格亲切自然,符合大众传播需求。

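🛠️ Quick Start

🌐 Use SenseNova-Studio

The easiest way to try SenseNova-U1 is SenseNova-Studio, a 🆓 free online playground that runs directly in the browser, with no installation and no GPU required.

Note: To serve more users, U1-Fast has undergone step distillation and CFG distillation and is dedicated to infographic generation.

🦞 Use SenseNova-Skills (OpenClaw)

The simplest way to integrate SenseNova-U1 into your own agent or application is the companion repository SenseNova-Skills (OpenClaw) 🦞, which wraps SenseNova-U1 as ready-to-use skills and exposes a unified tool-calling interface.

For installation and usage details, see the SenseNova-Skills README.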

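🤗 Run with transformers

Environment setup: Follow the installation guide to clone the repository and install dependencies with uv.

🌟 Generate High-Quality Infographics

To generate complex, high-quality infographics, we strongly recommend the following parameters: --cfg_scale 4.0, --timestep_shift 3.0, and --num_steps 50.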

python examples/t2i/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT-Infographic \
  --prompt "这张信息图的标题是“SenseNova-U1”,采用现代极简科技矩阵风格。整体布局为水平三列网格结构,背景是带有极浅银灰色细密点阵的哑光纯白高级纸张纹理,画面长宽比为16:9。\n\n排版采用严谨的视觉层级:主标题使用粗体无衬线黑体字,正文使用清晰的现代等宽字体。配色方案极其克制,以纯白色为底,深炭黑为主视觉文字和边框,浅石板灰用于背景色块和次要信息区分,图标采用精致的银灰色线框绘制。\n\n在画面正上方居中位置,使用醒目的深炭黑粗体字排布着大标题“SenseNova-U1”。标题正下方是浅石板灰色的等宽字体副标题“新一代端到端统一多模态大模型家族”。\n\n画面主体分为左、中、右三个相等的垂直信息区块,区块之间通过充足的负空间进行物理隔离。\n\n左侧区块的主题是概述。顶部有一个银灰色线框绘制的、由放大镜和齿轮交织的图标,旁边是粗体小标题“Overview”。该区块内从上到下垂直排列着三个要点:第一个要点旁边是一个代表文档与照片重叠的极简图标,紧跟着文字“多模态模型家族,统一文本/图像理解和生成”。向下是由两个相连的同心圆组成的架构图标,配有文字“基于NEO-Unify架构(端到端统一理解和生成)”。最下方是一个带有斜线划掉的眼睛和漏斗形状的图标,明确指示文本“无需视觉编码器(VE)和变分自编码器(VAE)”。\n\n中间区块展示模型矩阵。顶部是一个包含两个分支节点的树状网络图标,旁边是粗体小标题“两个模型规格”。区块内分为上下两个包裹在浅石板灰色极细边框内的卡片。上方的卡片内画着一个代表高密度的实心几何立方体图标,大字标注“SenseNova-U1-8B-MoT”,下方是等宽字体说明“8B MoT 密集主干模型”。下方的卡片内画着一个带有闪电符号的网状发光大脑图标,大字标注“SenseNova-U1-A3B-MoT”,下方是等宽字体说明“A3B MoT 混合专家(MoE)主干模型”。在这两个独立卡片的正下方,左侧放置一个笑脸轮廓图标搭配文字“将在HF等平台公开”,右侧放置一个带有折角的书面报告图标搭配文字“将发布技术报告”。\n\n右侧区块呈现核心优势。顶部是一个代表巅峰的上升阶梯折线图图标,旁边是粗体小标题“Highlights”。该区块内部垂直分布着四个带有浅石板灰底色的长方形色块,每个色块内部左侧对应一个具体的图标,右侧为文字。第一个色块内是一个无缝相连的莫比乌斯环图标,配文“原生统一架构,无VE和VAE”。第二个色块内是一个顶端带有星星的奖杯图标,配文“单一统一模型在理解和生成任务上均达到SOTA性能”。第三个色块内是代表文本行与拍立得照片交替穿插的图标,配文“强大的原生交错推理能力(模型原生生成图像进行推理)”。最后一个色块内是一个被切分出一小块的硬币与详细饼状图结合的图标,配文“能生成复杂信息图表,性价比出色”。" \
  --width 2720 --height 1536 \
  --cfg_scale 4.0 --cfg_norm none --timestep_shift 3.0 --num_steps 50 \
  --output output.png --profile
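
When the command above has to be issued from a script, for example to render a batch of prompts, it can help to assemble the argument list in Python rather than in a shell string, so that long multi-line prompts need no shell escaping. The following is a minimal sketch: `build_t2i_command` and `RECOMMENDED` are hypothetical names introduced here, and only flags that appear in the command above are used (`--profile` is left out for brevity).

```python
import shlex
import sys

# Recommended infographic settings, copied from the command above.
RECOMMENDED = {
    "cfg_scale": 4.0,
    "cfg_norm": "none",
    "timestep_shift": 3.0,
    "num_steps": 50,
}


def build_t2i_command(prompt, output="output.png", width=2720, height=1536,
                      model_path="sensenova/SenseNova-U1-8B-MoT-Infographic"):
    """Hypothetical helper: return the argv list for examples/t2i/inference.py."""
    argv = [
        sys.executable, "examples/t2i/inference.py",
        "--model_path", model_path,
        "--prompt", prompt,
        "--width", str(width), "--height", str(height),
    ]
    for flag, value in RECOMMENDED.items():
        argv += [f"--{flag}", str(value)]
    argv += ["--output", output]
    return argv


if __name__ == "__main__":
    cmd = build_t2i_command("...")  # put the full infographic prompt here
    print(shlex.join(cmd))
    # To actually run it from the repository root:
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` without `shell=True` hands the prompt to the script as a single argument, whatever quotes or newlines it contains.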

The default resolution is 2048×2048 (1:1). For other aspect ratios, see the supported resolution buckets.
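
As a quick sanity check on the two sizes quoted in this section, the sketch below compares the default 2048×2048 with the 2720×1536 used in the example command. It only does arithmetic on those two sizes and does not enumerate the supported buckets, which are documented separately; `describe` is a hypothetical helper name.

```python
from fractions import Fraction


def describe(width, height):
    """Return pixel count and aspect ratio (reduced and as a float) for a size."""
    ratio = Fraction(width, height)
    return {
        "pixels": width * height,
        "reduced": (ratio.numerator, ratio.denominator),
        "ratio": float(ratio),
    }


default = describe(2048, 2048)
wide = describe(2720, 1536)

print(default)
print(wide)
# Relative pixel budget of the wide size versus the default size.
print(wide["pixels"] / default["pixels"])
# Deviation of the wide size from an exact 16:9 ratio.
print(abs(wide["ratio"] - 16 / 9) / (16 / 9))
```

The wide size reduces to 85:48, which is within half a percent of 16:9, and its pixel count is within half a percent of the default square size.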

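For infographic generation, we recommend applying prompt enhancement first for the best results.

💾 Low-VRAM Inference (GGUF + VRAM Mode)

For deployment on a single consumer GPU, the transformers path offers two low-VRAM features that can be enabled independently or combined.

--vram_mode: single-GPU layer-wise offloading

--vram_mode keeps the language-model layers resident in CPU pinned memory and streams each layer to the GPU on demand only during the forward pass. This significantly reduces the VRAM taken by weights, while activations remain on the GPU.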

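| Mode | Behavior | When to use |
|---|---|---|
| full (default) | No offloading; the whole model stays on the GPU | Ample VRAM, fastest speed |
| low | Synchronous layer-by-layer CPU↔GPU swapping | VRAM is tightest |
| balanced | Asynchronous prefetch that overlaps H2D copies with compute | VRAM is tight but you want to recover some speed |
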
python examples/t2i/inference.py \
  --model_path sensenova/SenseNova-U1-8B-MoT-Infographic \
  --vram_mode balanced \
  --prompt "..." --output output.png

--gguf_checkpoint and --vram_mode can be combined: on consumer GPUs with roughly 10-12 GB of VRAM, we recommend Q4 GGUF together with balanced.
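
To see why balanced can recover speed relative to low, a toy timeline model is enough. The sketch below is not the repository's implementation: it assumes one-layer lookahead with a single copy stream, and the per-layer copy and compute times are made-up illustrative numbers, not measurements.

```python
def low_mode_time(copy_ms, compute_ms):
    """Synchronous swapping: copy a layer, compute it, then move to the next."""
    return sum(c + k for c, k in zip(copy_ms, compute_ms))


def balanced_mode_time(copy_ms, compute_ms):
    """Asynchronous prefetch: the copy of layer i+1 overlaps the compute of layer i."""
    total = copy_ms[0]  # the first copy cannot hide behind any compute
    for i, k in enumerate(compute_ms):
        next_copy = copy_ms[i + 1] if i + 1 < len(copy_ms) else 0
        # The next layer starts once both this compute and its own copy are done.
        total += max(k, next_copy)
    return total


# Illustrative numbers only: 4 layers, 3 ms per H2D copy, 2 ms per layer compute.
copy_ms = [3, 3, 3, 3]
compute_ms = [2, 2, 2, 2]

print(low_mode_time(copy_ms, compute_ms))       # 20
print(balanced_mode_time(copy_ms, compute_ms))  # 14
```

In this model the balanced schedule is bounded by whichever of copy or compute is slower per layer, so prefetching helps most when the two are comparable and cannot make a copy-bound workload faster than the transfer itself.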

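⚡ Run with LightLLM + LightX2V

For production deployment, we co-designed a dedicated inference stack on top of LightLLM (understanding) and LightX2V (generation). The two engines run decoupled, so each can use its own parallelism strategy and resource quota, and they are connected by a low-overhead transfer channel.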

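In a single-node TP2 + CFG2 configuration, the stack delivers about 0.15 s/step and about 9 s end to end for 2048×2048 images on H100 / H200. Compared with the Triton baseline, our FA3-based hybrid-mask attention speeds up prefill by about 2.4–3.2×. Full single-GPU performance numbers are in docs/inference_infra_CN.md.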
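
The two figures above can be related with simple arithmetic. The sketch below assumes 50 denoising steps, which is the --num_steps value recommended earlier for infographics; the step count behind the 9 s figure is not stated here, so treat the breakdown as illustrative only. `latency_breakdown` is a hypothetical helper name.

```python
SECONDS_PER_STEP = 0.15    # quoted above (approximate)
END_TO_END_SECONDS = 9.0   # quoted above (approximate)
NUM_STEPS = 50             # ASSUMPTION: the recommended --num_steps, not a stated benchmark setting


def latency_breakdown(seconds_per_step, num_steps, end_to_end):
    """Split end-to-end latency into denoising time and everything else."""
    denoise = seconds_per_step * num_steps
    return {
        "denoise_s": denoise,
        "other_s": end_to_end - denoise,
        "denoise_share": denoise / end_to_end,
    }


breakdown = latency_breakdown(SECONDS_PER_STEP, NUM_STEPS, END_TO_END_SECONDS)
print(breakdown)
```

Under that assumption, denoising accounts for about 7.5 s of the roughly 9 s total, leaving about 1.5 s for prefill, decoding, and transfer between the two engines.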

We provide an official Docker image, so deployment takes a single command:

docker pull lightx2v/lightllm_lightx2v:20260407

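⚙️ Deployment guide (Docker, launch arguments, modes, quantization, API testing): see docs/deployment_CN.md.

📖 Full architecture design and performance analysis: see docs/inference_infra_CN.md.

🌐 Join the Community!

Join our community to share feedback, get support, and be the first to hear about SenseNova-U1 updates. We look forward to talking with you!

Discord | WeChat group

⚖️ License

This project is released as open source under the Apache 2.0 License.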