SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

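陈雨轩¹, 吕昌伟², 肖云铎², 严宇坤², 曾振宁³*, 刘知远²

¹School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
²Tsinghua University, Beijing, China
³School of Intelligence Science and Technology, Nanjing University, Nanjing, China
*Corresponding author: zengzn@nju.edu.cn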

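📖 Introduction

Large language models (LLMs) are increasingly applied in specialized domains, but they face a fundamental cognitive mismatch when handling heterogeneous scientific data: LLMs are designed for discrete sequences of natural-language symbols, whereas scientific entities such as molecules are inherently topological and geometric. Forcing these structures into linear text inevitably loses information, and the resulting semantic noise interferes with the LLM's reasoning.

We propose SciCore-Mol, a new paradigm that augments an LLM with pluggable external cognition modules: a GVP encoder, a diffusion generator, and a numerically sensitive Transformer (the Reaction Transformer). The architecture preserves the LLM's general capabilities while giving it dedicated molecular perception. Through a two-stage alignment mechanism, the external modules are invoked via special tokens and fused at the hidden-state level, so the LLM can understand molecular information in depth without sacrificing its core reasoning process.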
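
The invoke-and-fuse idea can be pictured with a minimal sketch. It assumes, purely for illustration, that fusion means overwriting the hidden state at each `<mol>` position with a linearly projected molecule embedding. The function names, the toy weight matrix, and the overwrite rule are hypothetical and are not taken from the SciCore-Mol code.

```python
from typing import List, Sequence

MOL_TOKEN = "<mol>"  # special token used to invoke the external modules


def project(embedding: Sequence[float],
            weight: Sequence[Sequence[float]]) -> List[float]:
    """Linear map from encoder space to LLM hidden space (stand-in for an adapter)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weight]


def fuse_hidden_states(tokens: Sequence[str],
                       hidden_states: Sequence[Sequence[float]],
                       mol_embeddings: Sequence[Sequence[float]],
                       weight: Sequence[Sequence[float]]) -> List[List[float]]:
    """Overwrite the hidden state at each <mol> position with a projected embedding.

    Assumption: one molecule embedding per <mol> token, consumed in order.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == MOL_TOKEN]
    if len(positions) != len(mol_embeddings):
        raise ValueError("number of <mol> tokens and molecule embeddings differ")
    fused = [list(h) for h in hidden_states]  # copy so the input is not mutated
    for pos, emb in zip(positions, mol_embeddings):
        fused[pos] = project(emb, weight)
    return fused


# Toy example: 2-dimensional hidden states, 3-dimensional molecule embedding.
tokens = ["Predict", "the", "yield", "of", "<mol>"]
hidden = [[0.0, 0.0] for _ in tokens]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
fused = fuse_hidden_states(tokens, hidden, [[1.0, 2.0, 3.0]], weight)
print(fused[4])  # [1.0, 5.0]
```

Only the `<mol>` position changes; every other hidden state passes through untouched, which is one way to read the claim that the LLM's own reasoning path is left intact.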

⚙️ Setup

Prerequisites

Python 3.10
CUDA 12.1
8× A800/A100 80GB GPUs (recommended for full training)

Installation

git clone https://github.com/ChenYX24/SciCore-Mol.git
cd SciCore-Mol

# Option A: Install with uv (recommended)
pip install uv
uv sync
uv sync --extra graph      # GVP-GNN dependencies (torch-geometric, torch-scatter, torch-cluster)
uv sync --extra flashattn  # FlashAttention (requires CUDA)
uv sync --group train      # DeepSpeed for distributed training

# Option B: Install with pip
python -m venv .venv
source .venv/bin/activate
pip install -e .
pip install -e ".[graph]"       # optional: GVP-GNN
pip install -e ".[flashattn]"   # optional: FlashAttention
pip install deepspeed swanlab   # optional: distributed training

Environment variables

cp configs/env.example.sh configs/env.sh
# Edit configs/env.sh to set your paths, then:
source configs/env.sh

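Variable	Description
`SCICORE_ROOT`	Project root directory
`MODEL_DIR`	Base model directory (e.g., Qwen3-8B)
`CHECKPOINT_DIR`	Directory for trained checkpoints
`DATA_DIR`	Training and evaluation data
`GVP_CHECKPOINT`	Pretrained GVP-GNN weights
`OPENAI_API_KEY`	API key for GPT baseline evaluation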
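
Before launching a long job, it can help to confirm that these variables are actually set in the current shell. The following is a small sketch, not a script shipped with the repository; the split into required and optional variables is an assumption based on the descriptions in the table.

```python
import os
from typing import List, Mapping, Optional

# Assumption: everything except the GPT baseline key is needed for training and evaluation.
REQUIRED_VARS = [
    "SCICORE_ROOT",
    "MODEL_DIR",
    "CHECKPOINT_DIR",
    "DATA_DIR",
    "GVP_CHECKPOINT",
]
OPTIONAL_VARS = ["OPENAI_API_KEY"]  # only used for GPT baseline evaluation


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Unset variables: " + ", ".join(missing))
    else:
        print("All required variables are set.")
```

Run it after `source configs/env.sh`; an empty result means the paths are at least defined, though it does not check that they exist on disk.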

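🔧 Training

SciCore-Mol uses a three-stage training pipeline (see the figure above):

Stage 1: Component pretraining

Pretrain each component independently before joint training.

GVP encoder + MLP adapter: aligns GVP molecular embeddings with the LLM hidden space.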
```
bash scripts/run/gvp_mlp_pretrain_qwen.sh
```
Reaction Transformer (Layer2): trained on reaction data for yield prediction and embedding reconstruction.
```
python scripts/layer2/train_layer2.py \
    --config scripts/layer2/layer2_train_config.yaml
```

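Stage 2: Cross-modal alignment training

Connect all modules for joint SFT training. The LLM learns to invoke the external modules through the special <mol> token.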

# Configure training in configs/qwen3_sft_epoch2_1.yaml
# Uses DeepSpeed ZeRO-3 for multi-GPU training
torchrun --nproc_per_node=4 \
    cotrain_llm_diffusion/train_step1_llm.py \
    --config configs/qwen3_sft_epoch2_1.yaml

Key configuration fields (in configs/qwen3_sft_epoch2_*.yaml):

paths.llm_name_or_path: base LLM checkpoint
paths.gnn_state_dict_path: pretrained GVP weights
paths.deepspeed_config: DeepSpeed config (ZeRO-2 or ZeRO-3)
training.freeze_strategy: controls which modules are frozen or trainable
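
A quick sanity check on these fields can catch a misconfigured run before any GPU time is spent. The sketch below works on an already-loaded config dictionary (for example, the result of parsing the YAML file) and only uses the four field names listed above. The example values are placeholders, not real paths or valid strategy names.

```python
from collections.abc import Mapping
from typing import Any, List

REQUIRED_FIELDS = [
    "paths.llm_name_or_path",
    "paths.gnn_state_dict_path",
    "paths.deepspeed_config",
    "training.freeze_strategy",
]


def get_dotted(config: Mapping, dotted_key: str) -> Any:
    """Look up 'a.b.c' in nested mappings; return None if any level is missing."""
    node: Any = config
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node


def missing_fields(config: Mapping) -> List[str]:
    """Return the required dotted keys that are absent or empty."""
    return [key for key in REQUIRED_FIELDS if get_dotted(config, key) in (None, "")]


# Placeholder values for illustration only.
example_config = {
    "paths": {
        "llm_name_or_path": "/path/to/base-llm",
        "gnn_state_dict_path": "/path/to/gvp-weights",
        "deepspeed_config": "",
    },
    "training": {},
}
print(missing_fields(example_config))
# ['paths.deepspeed_config', 'training.freeze_strategy']
```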

Stage 3: Task-specific fine-tuning

Fine-tune Layer2 (the Reaction Transformer) on downstream tasks, with configurable module freezing:

python scripts/layer2/train_layer2.py \
    --config scripts/layer2/layer2_train_config_stage2_v7b.yaml

After training, split the checkpoint into the LLM and the extra components:

python scripts/ckpt/split_llm_extras.py \
    --checkpoint_path ${CHECKPOINT_DIR}/your-checkpoint/ \
    --output_dir ${CHECKPOINT_DIR}/your-checkpoint/
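
Conceptually, splitting a joint checkpoint amounts to partitioning its parameters by name. The sketch below illustrates that idea on a plain dictionary. It is not the logic of `split_llm_extras.py`, and the prefixes `gvp_encoder.` and `mol_adapter.` are hypothetical names chosen for the example.

```python
from typing import Any, Dict, Iterable, Tuple


def split_state_dict(state_dict: Dict[str, Any],
                     extra_prefixes: Iterable[str]
                     ) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Partition parameters into (llm, extras) by parameter-name prefix."""
    prefixes = tuple(extra_prefixes)
    llm: Dict[str, Any] = {}
    extras: Dict[str, Any] = {}
    for name, value in state_dict.items():
        # str.startswith accepts a tuple and matches any of its entries.
        target = extras if name.startswith(prefixes) else llm
        target[name] = value
    return llm, extras


# Toy checkpoint with hypothetical parameter names; values stand in for tensors.
checkpoint = {
    "model.layers.0.self_attn.q_proj.weight": 0,
    "model.embed_tokens.weight": 1,
    "gvp_encoder.layers.0.weight": 2,
    "mol_adapter.fc1.weight": 3,
}
llm_part, extra_part = split_state_dict(checkpoint, ["gvp_encoder.", "mol_adapter."])
print(sorted(extra_part))  # ['gvp_encoder.layers.0.weight', 'mol_adapter.fc1.weight']
```

Keeping the LLM weights separate from the extras makes it possible to load the language model with standard tooling and attach the molecular modules only when they are needed.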

📊 Evaluation

ChemBench4K (product generation / retrosynthesis / yield prediction / molecule captioning)

# Evaluate all 5 tasks with logprob scoring
bash scripts/run/run_chembench_all_tasks.sh

# Or run individual tasks:
python scripts/eval/eval_layer2_chembench.py \
    --checkpoint_dir ${CHECKPOINT_DIR}/your-checkpoint \
    --task product \
    --output_dir eval_results/chembench/
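
Logprob scoring, as mentioned in the comment above, is commonly done by comparing the log-probability the model assigns to each answer option and picking the highest. The sketch below shows that selection and the resulting accuracy. It assumes multiple-choice questions with per-option log-probabilities already computed, and it is not the repository's evaluation code. The numbers are made up for illustration.

```python
from typing import Dict, List, Sequence


def pick_by_logprob(option_logprobs: Dict[str, float]) -> str:
    """Return the option label with the highest log-probability."""
    if not option_logprobs:
        raise ValueError("no options to score")
    return max(option_logprobs, key=option_logprobs.get)


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers differ in length")
    if not answers:
        return 0.0
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Illustrative log-probabilities for two questions with options A to D.
scores: List[Dict[str, float]] = [
    {"A": -2.3, "B": -0.4, "C": -3.1, "D": -1.9},
    {"A": -0.2, "B": -1.5, "C": -2.8, "D": -3.0},
]
answers = ["B", "C"]
predictions = [pick_by_logprob(s) for s in scores]
print(predictions)                     # ['B', 'A']
print(accuracy(predictions, answers))  # 0.5
```

Scoring by log-probability avoids parsing free-form generations, at the cost of requiring access to the model's token-level probabilities.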

MMLU chemistry subsets (5 subjects)

python scripts/eval/eval_mmlu_interns1mini_5subsets.py \
    --model_path ${MODEL_DIR}/your-model \
    --output_dir eval_results/mmlu/

ORD reaction prediction (full pipeline)

# Run Layer2-LLM integrated pipeline
bash scripts/layer2_llm/run_full_pipeline.sh

# Score predictions
python scripts/postprocess/score_only.py \
    --pred_dir eval_results/ord/

SMolInstruct (7 molecular tasks)

# Automated multi-task evaluation with GPU scheduling
bash scripts/run/eval_smol_task_list.sh

Drug optimization (ADMET scoring)

# LLM-based drug optimization
python eval/drug_optim/eval_admet.py \
    --config eval/drug_optim/config/llm_cpt_sft.yaml

# Diffusion-based drug optimization
python eval/drug_optim/eval_diffusion.py \
    --config eval/drug_optim/config/diffusion_sft.yaml

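📄 Acknowledgements

GVP-GNN: geometric vector perceptrons for molecular structure encoding
LDMol: latent diffusion model for molecule generation
SMolInstruct: instruction-tuning benchmark for molecules
ChemBench: chemistry benchmark suite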

🥰 Citation

@article{chen2026scicoremol,
  title={SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules},
  author={},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

📧 Contact

If you have any questions, suggestions, or bug reports, please open an issue or send an email to:

chenyuxuan225@gmail.com

📜 License

This project is dual-licensed under the MIT and Apache 2.0 licenses.