Needle

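We distilled Gemini 3.1 into a 26M-parameter "Simple Attention Network" that you can fine-tune locally on your own Mac or PC. In production, Needle runs on Cactus at 6000 tokens/s prefill and 1200 tokens/s decode. The weights and the dataset generation pipeline are fully open source; see Cactus-Compute/needle.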
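To put the quoted throughput figures in perspective, here is a rough back-of-envelope latency estimate. It assumes the two rates are sustained and ignores model load time, tokenization, and any other overhead; the request size is a hypothetical example, not a measured workload.

```python
# Throughput figures quoted for Needle running on Cactus.
PREFILL_TOK_PER_S = 6000
DECODE_TOK_PER_S = 1200


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill time plus decode time, ignoring all other overhead."""
    return prompt_tokens / PREFILL_TOK_PER_S + output_tokens / DECODE_TOK_PER_S


# Hypothetical request: 600 prompt tokens (query plus tool schemas)
# and 60 generated tokens for the tool call.
print(f"{estimate_latency_s(600, 60) * 1000:.0f} ms")  # 150 ms
```

Under these assumptions a typical tool call would complete in well under a second, with prefill and decode contributing comparable amounts.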

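Item	Details
Parameters	26M
Architecture	Encoder-decoder, attention only (no FFN)
Encoder	12 layers, GQA (8H/4KV), RoPE, gated residuals
Decoder	8 layers, self-attention + cross-attention, gated residuals
d_model	512
Vocabulary	8192 (SentencePiece BPE)
Normalization	ZCRMSNorm (zero-centered, init=0)
Precision	bfloat16 (with INT4 quantization-aware training)
Pretraining	200B tokens on 16 TPU v6e chips (27 hours)
Post-training	2B tokens of function-calling data (45 minutes)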
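As a sanity check, the 26M figure can be roughly reproduced from the other rows of the table. The sketch below makes several assumptions that the table does not confirm: a head dimension of d_model / 8 = 64, no bias terms, the same 8H/4KV layout for both decoder attention blocks, and a single embedding matrix shared by encoder, decoder, and the tied output projection. Norm scales and residual gate parameters are left out because they are small.

```python
D_MODEL = 512
VOCAB = 8192
N_HEADS, N_KV_HEADS = 8, 4
ENC_LAYERS, DEC_LAYERS = 12, 8

head_dim = D_MODEL // N_HEADS                 # 64 (assumed)
q_or_o = D_MODEL * N_HEADS * head_dim         # query or output projection
k_or_v = D_MODEL * N_KV_HEADS * head_dim      # key or value projection (GQA: fewer heads)
attn_block = 2 * q_or_o + 2 * k_or_v          # Q, K, V, O with no biases (assumed)

embedding = VOCAB * D_MODEL                   # counted once: shared and tied (assumed)
encoder = ENC_LAYERS * attn_block             # one self-attention block per layer
decoder = DEC_LAYERS * 2 * attn_block         # self-attention + cross-attention per layer

total = embedding + encoder + decoder
print(f"{total:,}")  # 26,214,400
```

The estimate lands at about 26.2M, consistent with the stated size, which suggests that nearly all parameters sit in the attention projections and the embedding.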

d=512, 8H/4KV, BPE=8192
                                  ┌──────────────┐
                                  │  Tool Call   │
                                  └──────┬───────┘
                                        ┌┴──────────┐
                                        │  Softmax  │
                                        └─────┬─────┘
                                        ┌─────┴─────┐
                                        │ Linear (T)│  <- tied
                                        └─────┬─────┘
                                        ┌─────┴─────┐
                                        │ ZCRMSNorm │
                                        └─────┬─────┘
                                     ┌────────┴────────┐
                                     │ Decoder x 8     │
                                     │┌───────────────┐│
                                     ││ ZCRMSNorm     ││
                                     ││ Masked Self   ││
                                     ││ Attn + RoPE   ││
                                     ││ Gated Residual││
                                     │├───────────────┤│
  ┌──────────────┐                   ││ ZCRMSNorm     ││
  │ Encoder x 12 │─────────────────────>Cross Attn    ││
  │              │                   ││ Gated Residual││
  │ ┌──────────┐ │                   │└───────────────┘│
  │ │ZCRMSNorm │ │                   └────────┬────────┘
  │ │Self Attn │ │                      ┌─────┴─────┐
  │ │ GQA+RoPE │ │                      │ Embedding │  <- shared
  │ │Gated Res │ │                      └─────┬─────┘
  │ │          │ │                    ┌───────┴────────┐
  │ │ (no FFN) │ │                    │[EOS]<tool_call>│
  │ └──────────┘ │                    │ + answer       │
  │              │                    └────────────────┘
  └──────┬───────┘
         │
    ┌────┴──────┐
    │ Embedding │
    └────┬──────┘
         │
    ┌────┴──────┐
    │   Text    │
    │  query    │
    └───────────┘
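The "8H/4KV" label in the diagram refers to grouped-query attention, where 8 query heads share 4 key/value heads. The following NumPy sketch illustrates only that sharing pattern. It is not Needle's implementation: it omits RoPE, causal masking, ZCRMSNorm, and the gated residual, and the function name and random weights are hypothetical.

```python
import numpy as np


def grouped_query_attention(x, wq, wk, wv, wo, n_heads=8, n_kv_heads=4):
    """Unmasked self-attention in which groups of query heads share one KV head."""
    seq, d_model = x.shape
    head_dim = d_model // n_heads
    group = n_heads // n_kv_heads  # query heads per KV head (2 here)

    q = (x @ wq).reshape(seq, n_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)

    # Repeat each KV head so query head h reads KV head h // group.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)

    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.einsum("hqk,khd->qhd", weights, v).reshape(seq, d_model)
    return out @ wo


rng = np.random.default_rng(0)
d_model, seq = 512, 5
x = rng.standard_normal((seq, d_model))
wq = rng.standard_normal((d_model, 512)) * 0.02  # 8 heads x 64
wk = rng.standard_normal((d_model, 256)) * 0.02  # 4 heads x 64
wv = rng.standard_normal((d_model, 256)) * 0.02  # 4 heads x 64
wo = rng.standard_normal((512, d_model)) * 0.02
y = grouped_query_attention(x, wq, wk, wv, wo)
print(y.shape)  # (5, 512)
```

The key and value projections are half the width of the query projection, which is where GQA saves parameters and KV-cache memory.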

Quickstart

git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground

This opens a Web UI at http://127.0.0.1:7860 where you can test the model and fine-tune it on your own tools. Weights are downloaded automatically.

Usage (Python)

from needle import load_checkpoint, generate, SimpleAttentionNetwork, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
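Once the model has produced a tool call, the application still has to execute it. The sketch below shows one possible way to do that, assuming `generate` returns a JSON string shaped like the sample output above (the printed comment suggests this but does not confirm it). `TOOL_REGISTRY`, `dispatch`, and the `get_weather` body are hypothetical names for illustration and are not part of the needle package.

```python
import json


def get_weather(location: str) -> str:
    # Hypothetical stand-in; a real tool would query a weather service.
    return f"Sunny in {location}"


# Maps tool names the model may emit to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def dispatch(raw_output: str) -> list:
    """Parse the model's JSON tool calls and run each one locally."""
    results = []
    for call in json.loads(raw_output):
        fn = TOOL_REGISTRY.get(call["name"])
        if fn is None:
            raise KeyError(f"Model requested unknown tool: {call['name']}")
        results.append(fn(**call["arguments"]))
    return results


# Sample output copied from the usage example; in practice pass `result`.
sample = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(dispatch(sample))  # ['Sunny in San Francisco']
```

Rejecting unknown tool names explicitly keeps a malformed or hallucinated call from silently doing nothing.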

Fine-tuning

Fine-tune on your own tools through the Web UI or the CLI:

# Web UI (generates data via Gemini, trains, evaluates, bundles result)
needle playground

# CLI (auto-downloads weights if not local)
needle finetune data.jsonl

Links

Needle - training, fine-tuning, and inference code
Cactus - on-device runtime (6000 tok/s prefill, 1200 tok/s decode)
Simple Attention Networks - architecture details

License

MIT

Citation

@misc{ndubuaku2026needle,
  title={Needle},
  author={Henry Ndubuaku and Jakub Mroz and Karen Mosoyan and Roman Shemet and Parkirat Sandhu and Satyajit Kumar and Noah Cylich and Justin H. Lee},
  year={2026},
  url={https://github.com/cactus-compute/needle}
}
