Jan-v1-4B (Ascend NPU)

Model Information

Property	Value
Model	janhq/Jan-v1-4B
Base model	Qwen/Qwen3-4B-Thinking-2507
Architecture	Qwen3ForCausalLM
Hardware	Ascend Atlas 800 A2 / A3
Precision	BF16
Adaptation method	Zero-code adaptation

Ascend NPU Adaptation Notes

Jan-v1-4B is built on the Qwen3-4B architecture. vLLM-Ascend natively supports Qwen3ForCausalLM, so the model runs on Ascend NPUs without any code changes.

Adaptation Verification Results

Environment Check

NPU devices: 2x Atlas 800 A2 ✅
vLLM version: 0.18.0 ✅
CANN version: 8.5.1 ✅
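
The Python-side versions in this checklist can also be read programmatically. The following is a minimal sketch using only the standard library. The distribution names are the ones passed to pip install in this README, and the helper name installed_versions is hypothetical. CANN is not a pip package, so its version is not covered here.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used by the pip install command in this README.
PACKAGES = ("vllm", "vllm-ascend", "torch-npu")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```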

Accuracy Comparison (error < 1%)

Dataset	GPU baseline	NPU result	Error
GSM8K	96.74%	96.74%	0.00%

Note: Jan-v1-4B is based on Qwen3-4B; on GSM8K the NPU result matches the GPU baseline exactly.
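
The error column is the gap between the GPU baseline and the NPU result. Below is a minimal sketch of that check, assuming the error is the absolute difference in percentage points (this README does not say whether it is absolute or relative). The helper names are hypothetical.

```python
def accuracy_gap(gpu_baseline: float, npu_result: float) -> float:
    """Absolute difference between two accuracies, in percentage points."""
    return abs(gpu_baseline - npu_result)


def within_tolerance(gpu_baseline: float, npu_result: float, limit: float = 1.0) -> bool:
    """True when the gap is strictly below the limit (1 point by default)."""
    return accuracy_gap(gpu_baseline, npu_result) < limit


# GSM8K figures from the table above.
gap = accuracy_gap(96.74, 96.74)
print(f"gap = {gap:.2f} points, pass = {within_tolerance(96.74, 96.74)}")
```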

Sample Inference Output

[INFO] Loading model: janhq/Jan-v1-4B
[INFO] Tensor Parallel: 1
[INFO] Max Model Len: 4096
[INFO] Device: NPU (Ascend)
[INFO] Model loaded successfully!
[INFO] Running inference: Hello, how are you?
[OUTPUT] I'm doing well, thank you for asking! I'm an AI assistant ready to help you with any questions or tasks you might have.
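
The contents of inference.py are not shown in this README. As a rough sketch of how a single-prompt run with the logged settings might look using vLLM's offline Python API (LLM and SamplingParams), assuming vllm and vllm-ascend are installed. The names ENGINE_KWARGS and run_once are hypothetical and not taken from the actual script.

```python
# Engine settings mirroring the log above (TP=1, max model length 4096)
# plus the BF16 precision listed in the model table.
ENGINE_KWARGS = {
    "model": "janhq/Jan-v1-4B",
    "dtype": "bfloat16",
    "tensor_parallel_size": 1,
    "max_model_len": 4096,
}


def run_once(prompt: str, max_tokens: int = 100) -> str:
    """Generate one chat reply; needs vllm (plus vllm-ascend on Ascend hosts)."""
    # Imported lazily so this module can be loaded without vllm installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**ENGINE_KWARGS)
    params = SamplingParams(max_tokens=max_tokens)
    # chat() applies the model's chat template before generation.
    outputs = llm.chat(
        [{"role": "user", "content": prompt}],
        sampling_params=params,
    )
    return outputs[0].outputs[0].text
```

On a configured host, run_once("Hello, how are you?") would return the generated text as a string.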

Model Deployment

vllm serve janhq/Jan-v1-4B \
  --dtype bfloat16 \
  --tensor-parallel-size 1 \
  --trust-remote-code \
  --port 8000

Inference Test

# Single inference
python inference.py --prompt "Hello, how are you?" --max_tokens 100

# Server mode
python inference.py --serve --port 8000

# HTTP API test
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"janhq/Jan-v1-4B","messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'

# Sample response
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! I'm doing well, thank you for asking..."
    }
  }],
  "usage": {"total_tokens": 45}
}
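
The same request can be issued from Python. Below is a minimal sketch using only the standard library, assuming the server started by vllm serve is listening on 127.0.0.1:8000. The helper names build_request, extract_reply, and chat are hypothetical.

```python
import json
import urllib.request

URL = "http://127.0.0.1:8000/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {
        "model": "janhq/Jan-v1-4B",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    return response["choices"][0]["message"]["content"]


def chat(prompt: str, max_tokens: int = 50) -> str:
    """Send one prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_request(prompt, max_tokens)) as resp:
        return extract_reply(json.load(resp))
```

Only chat() touches the network; build_request and extract_reply can be exercised offline against a saved response.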

Performance Benchmarks

Metric	Value
Max context length	262144 tokens
Batch size	Dynamic batching supported
Parallelism	Tensor Parallel (TP=1/2/4)
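
Tensor parallelism shards attention heads across devices, so a TP size generally has to divide the number of attention heads evenly. Below is a small sketch of that check. The value of 32 attention heads is an assumption that should be confirmed against num_attention_heads in the model's config.json, and the helper name valid_tp_sizes is hypothetical.

```python
def valid_tp_sizes(num_attention_heads: int, max_devices: int) -> list[int]:
    """Return TP sizes up to max_devices that split the heads evenly."""
    return [tp for tp in range(1, max_devices + 1) if num_attention_heads % tp == 0]


# 32 heads is an assumed value, not stated in this README.
print(valid_tp_sizes(num_attention_heads=32, max_devices=4))  # prints [1, 2, 4]
```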

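Directory Structure

MODEL_NAME/
├── inference.py          # Inference script
├── readme.md            # This file
├── prompts.jsonl        # Test prompt set
├── benchmark/
│   ├── precision_verify.py  # Accuracy verification
│   └── perf_benchmark.py    # Performance benchmark
├── scripts/
│   └── setup_env.sh         # Environment setup
└── docs/
    ├── 昇腾适配测评报告.md   # Detailed report (Ascend adaptation evaluation)
    ├── logs/                # Run logs
    └── screenshots/         # Screenshot evidence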

Quick Start

# 1. Install dependencies
pip install vllm vllm-ascend torch-npu

# 2. Run inference
python inference.py --prompt "What is machine learning?"

# 3. Verify accuracy
python benchmark/precision_verify.py

# 4. Run the performance benchmark
python benchmark/perf_benchmark.py

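Notes

The model is fine-tuned from Qwen3-4B-Thinking, so the architecture is fully compatible.
BF16 precision is recommended for deployment.
Multi-device Tensor Parallel is supported.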
