Mero Mero

Gemma4 31B

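01 Overview

A model fine-tuned from Gemma 4 31B, designed for creative tasks.

Another Google model that is hard to work with but performs very well.

Compared with the original, this model shows slightly more narrative variety and a less flowery, less verbose writing style. Its reasoning traces tend to run a little longer, though. Intelligence appears to be on par with the original.

Supports both thinking and non-thinking modes.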

02 SillyTavern Settings

Recommended Roleplay Format

Actions: plain text

Dialogue: "in quotes"

Thoughts: *in asterisks*

Recommended Samplers

Temperature: 0.8 - 1.0

MinP: 0.05
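For readers unfamiliar with these two samplers, the sketch below shows what they do to a token distribution. It is a minimal illustration, not the code of SillyTavern or any particular backend: the function names and the toy logits are hypothetical, and applying temperature before min-p is an assumption, since sampler order differs between backends.

```python
import math
import random


def min_p_filter(logits, temperature=1.0, min_p=0.05):
    """Return renormalised probabilities after temperature scaling and min-p filtering."""
    # Assumption: temperature is applied before min-p; backends may order these differently.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Softmax, shifted by the largest logit for numerical stability.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Min-p: drop every token whose probability is below min_p * (top probability).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


def sample(logits, temperature=1.0, min_p=0.05, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = min_p_filter(logits, temperature, min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [5.0, 4.0, 1.0, -2.0]  # hypothetical 4-token vocabulary
probs = min_p_filter(toy_logits, temperature=0.9, min_p=0.05)
token = sample(toy_logits, temperature=0.9, min_p=0.05)
```

With these toy values the two unlikely tokens fall below the min-p threshold and are removed, so sampling only ever picks between the two plausible ones. Higher temperatures flatten the distribution and let more tokens survive the cut.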

Instruct Template

Gemma 4 - Think

Gemma 4 - NoThink

03 Quantizations

GGUF

iMatrix

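04 Creation Process

Creation process: SFT (supervised fine-tuning) > Merge

SFT on approximately 49 million tokens.

Despite the 49 million tokens, the dataset is fairly modest in size: only roughly 10-15 million of those tokens are trainable. All datasets are trained on the final turn only, to faithfully reflect Gemma 4's chat template.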
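The last-turn-only training described above can be pictured as a label mask: every token is fed to the model as context, but only the final reply contributes to the loss. The sketch below is a simplified illustration with made-up token IDs and a hypothetical helper name, not the preprocessing actually used for this dataset. It assumes the "final turn" is the last assistant reply and uses -100, the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_all_but_last_turn(turns):
    """turns: list of (role, token_ids) pairs. Returns (input_ids, labels).

    Assumption: the trainable turn is the final "assistant" turn in the conversation.
    """
    last_reply = max(i for i, (role, _) in enumerate(turns) if role == "assistant")
    input_ids, labels = [], []
    for i, (_, ids) in enumerate(turns):
        input_ids.extend(ids)
        if i == last_reply:
            labels.extend(ids)  # trained on
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # context only
    return input_ids, labels


# Hypothetical conversation with made-up token IDs.
turns = [
    ("user", [1, 2, 3]),
    ("assistant", [4, 5]),
    ("user", [6, 7, 8, 9]),
    ("assistant", [10, 11, 12]),
]
input_ids, labels = mask_all_but_last_turn(turns)
trainable = sum(1 for label in labels if label != IGNORE_INDEX)
fraction = trainable / len(input_ids)
```

In this toy conversation 3 of 12 tokens are trainable. The figures quoted above (10-15 million trainable out of about 49 million) work out to a similar share, roughly 20-31%.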

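The approach is very similar to the 26B A4B MeroMero. I trained the model hard on my own data for 2 epochs and, after testing several checkpoints, chose the 1-epoch checkpoint, which had the desired style with the fewest signs of overfitting.

I merged this checkpoint with the original instruct model, which cleaned up any remaining overfitting while preserving the changes introduced by fine-tuning.

Trained using Axolotl.

Mergekit Config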

models:
  - model: google/gemma-4-31B-it
  - model: ApocalypseParty/G4-31B-SFT-v3-1-1ep
merge_method: slerp
parameters:
  t: 0.5
base_model: google/gemma-4-31B-it
dtype: bfloat16
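As a rough picture of what a SLERP merge with t: 0.5 does, the sketch below interpolates between two small weight vectors along the arc between them rather than along a straight line. It is a simplified illustration and not Mergekit's implementation, which works tensor by tensor on the real checkpoints. The function name and toy vectors are hypothetical, and it assumes t = 0 returns the base model's weights and t = 1 the other model's.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight vectors given as lists."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 < eps or norm1 < eps:
        # A zero vector has no direction: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    # Angle between the two vectors, with the cosine clamped against rounding error.
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if abs(math.sin(omega)) < eps:
        # Nearly parallel vectors: linear interpolation is numerically safer.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


base = [1.0, 0.0]   # stands in for a tensor from the instruct model
tuned = [0.0, 1.0]  # stands in for the same tensor from the SFT checkpoint
merged = slerp(0.5, base, tuned)
```

At t = 0.5 the result sits halfway along the arc between the two inputs, so both models contribute equally.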

Axolotl Config

base_model: google/gemma-4-31B-it
 
plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
  - axolotl.integrations.liger.LigerPlugin
liger_layer_norm: true
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_rms_norm_gated: true
strict: false
cut_cross_entropy: true
 
datasets:
  - path: zerofata/pretok
val_set_size: 0.02
output_dir: ./G4-31B-SFT-v3-1
 
sequence_len: 10756
pad_to_sequence_len: true
sample_packing: true
 
load_in_4bit: false
adapter: lora
lora_r: 64
lora_alpha: 64
peft_use_rslora: true
lora_dropout: 0.0
freeze_mm_modules: true
 
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'
 
wandb_project: G4-31B-SFT
wandb_name: G4-31B-SFT-v3-1
 
gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 2
optimizer: adamw_torch_fused
lr_scheduler: constant_with_warmup
learning_rate: 1e-5
max_grad_norm: 1.0
 
bf16: auto
tf32: true
 
logging_steps: 1
 
# FA2 not supported
sdp_attention: true
#flex_attention: true
#torch_compile: true
flash_attention: false
 
warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 2
weight_decay: 0.05
special_tokens:
 
fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: true
  activation_checkpointing: true
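One setting in the config above is easy to misread: with peft_use_rslora enabled, the adapter update is scaled by lora_alpha / sqrt(lora_r) instead of the standard lora_alpha / lora_r. The sketch below just works through that arithmetic for the values in the config. The helper name is hypothetical; the two formulas are the standard LoRA and rank-stabilized LoRA scaling rules.

```python
import math


def lora_scale(alpha, r, use_rslora):
    """Scaling factor applied to the LoRA update (B @ A) before it is added to the weights."""
    if use_rslora:
        return alpha / math.sqrt(r)  # rank-stabilized LoRA
    return alpha / r                 # standard LoRA


standard = lora_scale(64, 64, use_rslora=False)  # 1.0
rslora = lora_scale(64, 64, use_rslora=True)     # 8.0
```

So at rank 64 with alpha 64, the rsLoRA setting applies the adapter update 8 times more strongly than the standard rule would, which is worth keeping in mind when reading the learning rate of 1e-5.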
