阶跃星辰StepFun/Step-3.7-Flash-GGUF
模型介绍文件和版本Pull Requests讨论分析
下载使用量0

[模型页面]:https://static.stepfun.com/blog/step-3.7-flash/

1. 简介

stepfun-ai/Step-3.7-Flash 的 GGUF 量化版本。

Step-3.7-Flash 是由 StepFun-ai 开发的 1980 亿参数稀疏混合专家(Mixture-of-Experts)视觉语言模型,每个 token 激活约 110 亿参数,吞吐量高达 400 t/s。它将 1960 亿参数的语言主干与 18 亿参数的视觉编码器相结合,实现原生图像理解,支持 256K 上下文窗口,并提供三种可选推理级别(低/中/高)以平衡速度、成本和深度。专为智能体工作负载构建——工具调用、多步推理、代码和数学——并原生支持多语言。

随语言模型量化文件一同提供的还有一个单独的 mmproj 投影器,用于多模态推理。凭借 128 GB 的统一内存(Mac Studio、DGX Spark、Ryzen AI Max+ 395 等),您可以私下部署 Step-3.7-Flash:Q4 及以下量化版本可在 256K 上下文下以高精度全速运行。

2. 文件

文件量化方式大小说明
Step-3.7-flash-BF16.ggufBF16394 GB全精度参考版本。
Step-3.7-flash-Q8_0.ggufQ8_0209 GB近乎无损。不使用 imatrix。
Step-3.7-flash-Q4_K_S.ggufQ4_K_S112 GBimatrix 校准。质量/大小平衡。
Step-3.7-flash-IQ4_XS.ggufIQ4_XS105 GBimatrix 校准。比 Q4_K_S 略小,质量相当。
Step-3.7-flash-Q3_K_L.ggufQ3_K_L103 GBimatrix 校准。大幅减小尺寸。
Step-3.7-flash-Q3_K_M.ggufQ3_K_M94 GBimatrix 校准。当需要在单个 64-96 GB 设备上运行时使用;低比特宽度下预计会有适度的质量损失。
mmproj-Step-3.7-flash-f16.ggufF164 GB视觉投影器。与上述任何语言模型量化版本搭配使用以处理图像输入。

3. 快速开始

构建 llama.cpp 并运行:

# 1. Clone and build
git clone https://github.com/stepfun-ai/llama.cpp.git
cd llama.cpp
git checkout -b step3.7 origin/step3.7
cmake -B build -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_SERVER=ON
cmake --build build --config Release -j$(nproc)

# 2. Test performance (benchmark)
./build/bin/llama-batched-bench \
  -m Step-3.7-flash-Q4_K_S.gguf \
  -c 32768 -b 2048 -ub 2048 \
  -npp 0,2048,8192,16384,32768 -ntg 128 -npl 1

# 3. Text-only inference
./build/bin/llama-cli \
  -m Step-3.7-flash-Q4_K_S.gguf \
  -c 32768 -ngl 99 -fa on \
  -p "Write a Python function to compute the n-th Fibonacci number."

# 4. With vision (image + text)
./build/bin/llama-mtmd-cli \
  -m Step-3.7-flash-Q4_K_S.gguf \
  --mmproj mmproj-Step-3.7-flash-f16.gguf \
  -c 32768 -ngl 99 -fa on \
  --image path/to/image.jpg \
  -p "Describe this image."

# 5. OpenAI-compatible server (text + vision)
./build/bin/llama-server \
  -m Step-3.7-flash-Q4_K_S.gguf \
  --mmproj mmproj-Step-3.7-flash-f16.gguf \
  -c 32768 -ngl 99 -fa on \
  --host 0.0.0.0 --port 8080

有关完整的 CLI/服务器选项,请参阅 llama.cpp README。

4. 性能

Apple Mac Studio(M4 max,128 GB 统一内存)

Step-3.7-flash-Q4_K_S

./llama-batched-bench -m Step-3.7-flash-Q4_K_S.gguf -c 262150 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGPLN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒总时间 秒总速度 令牌/秒
012811280.0000.002.50051.202.50051.20
2048128121764.873420.282.63948.517.512289.68
81921281832020.292403.702.75746.4323.049360.97
1638412811651242.854382.322.92443.7745.779360.69
3276812813289695.168344.323.22339.7298.391334.34
65536128165664233.885280.213.90932.74237.794276.14
1310721281131200635.499206.255.75922.23641.258204.60
26214412812622722362.488110.9613.1889.712375.677110.40

Step-3.7-flash-IQ4_XS

./llama-batched-bench -m Step-3.7-flash-IQ4_XS.gguf -c 262150 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGPLN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒T 秒S 令牌/秒
012811280.0000.002.58249.582.58249.58
2048128121764.835423.562.67947.787.514289.60
81921281832019.954410.552.80345.6622.757365.60
1638412811651242.142388.782.95743.2945.098366.13
3276812813289693.489350.503.28838.9396.777339.91
65536128165664227.088288.593.94532.44231.033284.22
1310721281131200635.047206.405.79122.10640.838204.73
26214412812622722170.271120.7913.0709.792183.342120.12

Step-3.7-flash-Q3_K_L

./llama-batched-bench -m Step-3.7-flash-Q3_K_L.gguf -c 262272 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGBN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒T 秒S 令牌/秒
012811280.0000.003.59035.663.59035.66
2048128121765.263389.153.70234.578.965242.72
81921281832021.789375.973.81733.5325.606324.92
1638412811651245.819357.583.97732.1849.796331.59
32768128132896100.827324.994.30829.71105.135312.89
65536128165664242.172270.624.97725.72247.149265.69
1310721281131200659.645198.706.76418.92666.409196.88
26214412812622722200.370119.1414.0089.142214.378118.44

NVIDIA DGX Spark(GB10,128 GB 统一内存)

Step-3.7-flash-Q4_K_S

./llama-batched-bench -m Step-3.7-flash-Q4_K_S.gguf -c 131300 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072 -ntg 128 -npl 1
PPTGBN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒T 秒S 令牌/秒
012811280.0000.005.15724.825.15724.82
2048128121768.021255.334.90726.0812.929168.31
81921281832010.866753.895.16924.7616.035518.86
1638412811651229.389557.496.21520.6035.603463.78
3276812813289652.501624.146.93118.4759.432553.50
65536128165664112.321583.477.76916.48120.090546.79
1310721281131200281.479465.669.83413.02291.313450.37

Step-3.7-flash-IQ4_XS

./llama-batched-bench -m Step-3.7-flash-IQ4_XS.gguf -c 262272 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGPLN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒T 秒S 令牌/秒
012811280.0000.005.36823.855.36823.85
2048128121764.250481.875.31124.109.561227.58
81921281832012.531653.735.81722.0118.348453.46
1638412811651224.474669.445.91521.6430.389543.35
3276812813289651.976630.446.53119.6058.508562.25
65536128165664116.305563.487.93416.13124.239528.53
1310721281131200298.746438.7410.26312.47309.009424.58
2621441281262272924.872283.4414.8628.61939.734279.09

Step-3.7-flash-Q3_K_L

./llama-batched-bench -m Step-3.7-flash-Q3_K_L.gguf -c 262272 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGPLN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒T 秒S 令牌/秒
012811280.0000.005.94721.525.94721.52
2048128121764.145494.085.62322.769.768222.77
81921281832014.889550.205.79922.0720.688402.17
1638412811651229.374557.786.14020.8535.513464.95
3276812813289654.957596.256.74418.9861.702533.15
65536128165664129.827504.798.34715.33138.174475.23
1310721281131200315.402415.5710.78011.87326.182402.23
2621441281262272910.215288.0015.5688.22925.783283.30

AMD Ryzen AI Max+ 395(Strix Halo,128 GB 统一内存)

Step-3.7-flash-Q4_K_S

llama-batched-bench.exe -m Step-3.7-flash-Q4_K_S.gguf -c 65664 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536 -ntg 128 -npl 1
PPTGBN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒T 秒S 令牌/秒
012811280.0000.004.87826.244.87826.24
2048128121769.367218.635.13424.9314.501150.06
81921281832043.540188.155.50823.2449.048169.63
16384128116512111.814146.535.94721.53117.761140.22
32768128132896357.81991.586.77918.88364.59890.23
655361281656641342.50148.828.49515.071350.99648.60

Step-3.7-flash-IQ4_XS

./llama-batched-bench -m Step-3.7-flash-IQ4_XS.gguf -c 65664 -b 2048 -ub 1024 -npp 0,2048,8192,16384,32768,65536 -ntg 128 -npl 1
PPTGBN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒T 秒S 令牌/秒
012811280.0000.005.93121.585.93121.58
2048128121768.143251.506.19420.6714.337151.78
81921281832039.899205.326.52119.6346.420179.23
16384128116512105.098155.896.89118.57111.989147.44
32768128132896338.64596.767.79316.42346.43994.95
655361281656641310.82050.009.48913.491320.30949.73

Step-3.7-flash-Q3_K_L

./llama-batched-bench -m Step-3.7-flash-Q3_K_L.gguf -c 262272 -b 2048 -ub 1024 -ctk q8_0 -ctv q8_0 -npp 0,2048,8192,16384,32768,65536,131072,262144 -ntg 128 -npl 1
PPTGBN_KVT_PP 秒S_PP 令牌/秒T_TG 秒S_TG 令牌/秒T 秒S 令牌/秒
012811280.0000.005.01525.535.01525.53
20481281217610.246199.885.07325.2315.319142.04
81921281832037.229220.055.34123.9642.570195.44
1638412811651279.234206.785.48923.3284.723194.89
32768128132896179.697182.355.81022.03185.507177.33
65536128165664436.593150.116.57719.46443.169148.17
13107212811312001262.377103.839.12414.031271.501103.19
26214412812622723487.92175.1611.39111.243499.31274.95

5. 致谢

本版本的发布离不开以下作者和社区的贡献:

  • bartowski — 提供了 calibration_datav5,这是社区标准的 imatrix 校准锚点,被无数 GGUF 版本所采用。仅用于校准目的;尚未验证此资源的许可证。
  • eaddario — 提供了 imatrix-calibration 数据集(MIT 许可证),其多语言/代码/数学拆分构成了本版本领域平衡的基础。
  • NousResearch — 提供了 hermes-function-calling-v1(Apache-2.0 许可证),用于代理/工具调用校准覆盖。
  • ggml-org / llama.cpp — 提供了完整的量化和推理工具链(MIT 许可证)。

6. 许可协议

本仓库中的 GGUF 量化文件是 stepfun-ai/Step-3.7-Flash 的衍生作品,并采用与原项目相同的 Apache 2.0 许可协议发布。

组件许可协议
基础模型权重 (stepfun-ai/Step-3.7-Flash)Apache-2.0
校准数据集 (eaddario/imatrix-calibration)MIT
校准数据集 (NousResearch/hermes-function-calling-v1)Apache-2.0
量化工具链 (llama.cpp)MIT

所有校准数据集均保留其原始许可协议,且仅严格用于量化校准目的。