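Qwen3.5-27B is a large multimodal language model developed by the Tongyi Qianwen (Qwen) team at Alibaba Cloud. It combines strong language understanding and generation with image understanding. The model is based on the Transformer architecture, is pre-trained on large-scale data, and supports multiple languages, including Chinese and English.

Qwen3.5-27B is a flagship version of the Qwen series with 27 billion parameters, and it performs well on language understanding, mathematical reasoning, and code generation. The model uses Grouped Query Attention (GQA) to improve inference efficiency, and it gains its zero-shot and few-shot capabilities through large-scale pre-training and instruction fine-tuning.

This version integrates visual understanding: it can process text and image inputs together and supports multimodal dialogue, image captioning, and visual question answering. The model supports a context length of up to 32768 tokens, which covers long-document analysis and complex multi-turn conversations.
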
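| Item | Description |
|---|---|
| Parameter count | 27 billion |
| Architecture | Decoder-only Transformer |
| Attention mechanism | Grouped Query Attention (GQA) |
| Maximum context length | 32768 tokens |
| Pre-training data | Large-scale, high-quality multilingual data |
| Supported languages | Chinese, English, and others |
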
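The model uses a decoder-only Transformer architecture, the mainstream choice for current large language models. The architecture stacks multiple layers of self-attention and feed-forward networks, refining the semantic representation of the input layer by layer.
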
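For reference, the two sub-layers named in this document (grouped query attention and a SwiGLU feed-forward network) have the following standard formulations. This is a generic sketch rather than the model's exact definition: the symbols ($H$ query heads, $G$ key/value groups, head dimension $d_h$, causal mask $M$, weight matrices $W_g$, $W_u$, $W_d$) are notation introduced here, and the model's actual dimensions, normalization, and positional encoding are not stated in this document.

```latex
% Grouped Query Attention: query head i (0-indexed) reads the keys/values of group g(i)
g(i) = \left\lfloor \frac{i}{H/G} \right\rfloor, \qquad
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_{g(i)}^{\top}}{\sqrt{d_h}} + M\right) V_{g(i)},
\quad i = 0, \dots, H-1

% SwiGLU feed-forward network
\mathrm{FFN}(x) = \bigl(\mathrm{SiLU}(x W_g) \odot (x W_u)\bigr)\, W_d, \qquad
\mathrm{SiLU}(z) = z \cdot \sigma(z)
```

With $G = H$ this reduces to ordinary multi-head attention and with $G = 1$ to multi-query attention. Because only $G$ key/value heads are stored, the KV cache is smaller by a factor of $H/G$, which is where the inference-efficiency gain comes from.
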
The model consists of the following core components:

┌─────────────────────────────────────────────┐
│              Input Text/Images              │
└─────────────────┬───────────────────────────┘
                  ▼
┌─────────────────────────────────────────────┐
│               Embedding Layer               │
│       (Token → Vector Representation)       │
└─────────────────┬───────────────────────────┘
                  ▼
┌─────────────────────────────────────────────┐
│           Transformer Blocks × N            │
│  ┌───────────────────────────────────────┐  │
│  │       Multi-Head Self-Attention       │  │
│  └───────────────────────────────────────┘  │
│  ┌───────────────────────────────────────┐  │
│  │     Feed-Forward Network (SwiGLU)     │  │
│  └───────────────────────────────────────┘  │
│                     ...                     │
└─────────────────┬───────────────────────────┘
                  ▼
┌─────────────────────────────────────────────┐
│                Output Layer                 │
│     (Vector → Probability Distribution)     │
└─────────────────────────────────────────────┘

| Component | Version |
|---|---|
| transformers | 2.9.0+cpu |
| torch | 2.9.0+cpu |
| torchvision | 0.24.02.0.dev0 |
| torch_npu | 2.9.0 |
| vllm | 0.16.0rc2.dev55+g65bb4942b.empty |
| vllm_ascend | 0.14.0rc2.dev119+g52aa9c006 |
| Driver and firmware | 25.2.0 |
| CANN | 8.5.0 |

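To compare a running environment against the version table above, the installed Python package versions can be printed with `pip show`. The following is a minimal sketch: `report_versions` is a hypothetical helper name, and it assumes `python3` with `pip` is available inside the container. The driver, firmware, and CANN versions are not Python packages and are not covered here.

```shell
# Print the installed version of each Python package given as an argument.
# Prints "not installed" when pip cannot find the package (or pip is missing).
report_versions() {
    for pkg in "$@"; do
        ver=$(python3 -m pip show "$pkg" 2>/dev/null | awk '/^Version:/ {print $2}')
        printf '%-14s %s\n' "$pkg" "${ver:-not installed}"
    done
}
```

For example, `report_versions transformers torch torchvision torch_npu vllm vllm_ascend` prints one line per package for comparison with the table.
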
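Download the image that matches your server from the image link.

Note that the repository currently holds many objects: the Qwen3.5-397B-A17B-w8a8-mtp weights plus A2/A3 images for Arm and x86. To download only the package you need, skip the LFS objects when cloning and then pull the single file:

GIT_LFS_SKIP_SMUDGE=1 git clone https://modelers.cn/Eco-Tech/Qwen3.5-397B-A17B-w8a8-mtp.git
cd Qwen3.5-397B-A17B-w8a8-mtp
git lfs pull --include="vllm-image/Vllm-ascend-Qwen3_5-A2-Ubuntu-v0.tar"

After the download, load the image archive with docker load, for example:

cd vllm-image
docker load -i Vllm-ascend-Qwen3_5-A2-Ubuntu-v0.tar # A2

Start the container with a script like the following, setting image_id to the ID of the loaded image:

#!/bin/sh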
image_id=""
docker run -itd --privileged --name=qwen3.5 --net=host \
--shm-size 100g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci4 \
--device=/dev/davinci5 \
--device=/dev/davinci6 \
--device=/dev/davinci7 \
--device=/dev/davinci_manager \
--device=/dev/hisi_hdc \
--device /dev/devmm_svm \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
-v /usr/local/sbin:/usr/local/sbin \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /data:/data \
$image_id \
bash

BF16 weights: https://modelers.cn/models/Qwen-AI/Qwen3.5-27B

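As a rough sizing check for the BF16 weights, the memory they occupy can be estimated from the parameter count. The sketch below uses the 27 billion parameters stated in this document, 2 bytes per BF16 parameter, and the tensor-parallel size of 4 from the launch command in this document. The helper names are hypothetical, sizes are in decimal GB, and an even split across devices is an assumption. Actual device memory use is higher because of the KV cache, activations, and runtime buffers.

```shell
# Estimate weight memory in decimal GB: parameters (in billions) x bytes per parameter.
weight_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b }'
}

# Per-device share, assuming the weights are split evenly across tensor-parallel ranks.
weight_gb_per_device() {
    awk -v p="$1" -v b="$2" -v tp="$3" 'BEGIN { printf "%.1f\n", p * b / tp }'
}

weight_gb 27 2               # whole model in BF16
weight_gb_per_device 27 2 4  # share per device with tensor-parallel size 4
```

This prints 54.0 and 13.5, that is, about 54 GB of weights in total and about 13.5 GB per device before any cache or activation memory.
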
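Set the environment variables and start the service, with weight_path pointing to the local weights directory:

export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"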
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_BUFFSIZE=1024
export OMP_NUM_THREADS=10
export OMP_PROC_BIND=false
weight_path=""
vllm serve $weight_path \
--served-model-name "qwen3.5" \
--host 0.0.0.0 \
--port 8010 \
--data-parallel-size 1 \
--tensor-parallel-size 4 \
--max-model-len 32768 \
--max-num-batched-tokens 16384 \
--max-num-seqs 128 \
--gpu-memory-utilization 0.94 \
--trust-remote-code \
--async-scheduling \
--allowed-local-media-path / \
--mm-processor-cache-gb 0 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"enable_cpu_binding":true, "multistream_overlap_shared_expert": true}'

After the service starts, use curl to check that it is available.

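Loading the weights can take some time, so it may help to wait until the server responds before sending test requests. The sketch below polls an HTTP endpoint until it returns success. `wait_for_service` is a hypothetical helper name, and the example URL assumes the server exposes vLLM's `/health` endpoint on port 8010 as configured in this document. The timeout counts one-second sleep intervals only, so the real wait can be somewhat longer.

```shell
# Poll a URL until it returns HTTP success or the timeout (in seconds) expires.
# Returns 0 once the service answers, 1 on timeout.
wait_for_service() {
    url=$1
    timeout=${2:-600}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # -f makes curl fail on HTTP errors; -s silences progress output.
        if curl -sf --max-time 5 -o /dev/null "$url"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

A possible call is `wait_for_service http://localhost:8010/health 900 && echo "service is ready"`.
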
curl http://localhost:8010/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5",
"prompt": "介绍一下你自己,用中文回答",
"max_tokens": 200,
"temperature": 0
}'

curl http://localhost:8010/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustrate?"}
]}
]
}'

Input: one 1024×1024-pixel image plus 1024 tokens; output: 2048 tokens; concurrency: 50.

MODEL_NAME="qwen3.5"
MODEL_PATH=""
INPUT_LEN=1024
OUTPUT_LEN=2048
vllm bench serve --model $MODEL_NAME --host 127.0.0.1 --port 8010 \
  --dataset-name random-mm \
  --random-input-len $INPUT_LEN --random-output-len $OUTPUT_LEN \
  --tokenizer $MODEL_PATH \
  --result-dir="./qwen35-x27b-n" \
  --backend openai-chat --endpoint /v1/chat/completions \
  --random-mm-bucket-config '{(1024, 1024, 1): 1.0}' \
  --ignore-eos \
  --random-mm-limit-mm-per-prompt '{"image": 1, "video": 0}' \
  --num-prompts 150 --max-concurrency 50

=============== Serving Benchmark Result ============
Successful requests: 150
Failed requests: 0
Maximum request concurrency: 50
Benchmark duration (s): 424.99
Total input tokens: 153600
Total generated tokens: 307200
Request throughput (req/s): 0.35
Output token throughput (tok/s): 722.84
Peak output token throughput (tok/s): 1100.00
Peak concurrent requests: 73.00
Total token throughput (tok/s): 1084.26
---------------Time to First Token----------------
Mean TTFT (ms): 26513.11
Median TTFT (ms): 18135.54
P99 TTFT (ms): 64009.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 56.21
Median TPOT (ms): 56.86
P99 TPOT (ms): 61.38
---------------Inter-token Latency----------------
Mean ITL (ms): 56.45
Median ITL (ms): 48.19
P99 ITL (ms): 91.13
==================================================

Input: 2048 tokens; output: 2048 tokens; concurrency: 50.

MODEL_NAME="qwen3.5"
MODEL_PATH=""
INPUT_LEN=2048
OUTPUT_LEN=2048
vllm bench serve --backend vllm --model $MODEL_NAME --host localhost --port 8010 \
  --dataset-name random \
  --random-input-len $INPUT_LEN --random-output-len $OUTPUT_LEN \
  --tokenizer $MODEL_PATH \
  --result-dir="./qwen3-27b-xn" \
  --num-prompts=150 --max-concurrency 50 \
  --ignore-eos

=============== Serving Benchmark Result ============
Successful requests: 150
Failed requests: 0
Maximum request concurrency: 50
Benchmark duration (s): 380.09
Total input tokens: 307200
Total generated tokens: 307200
Request throughput (req/s): 0.39
Output token throughput (tok/s): 808.22
Peak output token throughput (tok/s): 1250.00
Peak concurrent requests: 66.00
Total token throughput (tok/s): 1616.45
---------------Time to First Token----------------
Mean TTFT (ms): 14234.08
Median TTFT (ms): 13699.46
P99 TTFT (ms): 28759.76
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 54.92
Median TPOT (ms): 55.39
P99 TPOT (ms): 60.08
---------------Inter-token Latency----------------
Mean ITL (ms): 55.62
Median ITL (ms): 44.38
P99 ITL (ms): 89.15
==================================================
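
The throughput figures in the reports above follow directly from the token counts and the benchmark duration, which gives a quick way to sanity-check a result file. The sketch below recomputes the output and total throughput of the image-plus-text run; `tok_per_s` is a hypothetical helper name, and the inputs are the values reported above.

```shell
# Throughput in tokens per second: token count divided by benchmark duration in seconds.
tok_per_s() {
    awk -v t="$1" -v d="$2" 'BEGIN { printf "%.2f\n", t / d }'
}

# Image + text run: 153600 input tokens, 307200 generated tokens, 424.99 s.
tok_per_s 307200 424.99                 # output token throughput
tok_per_s $((153600 + 307200)) 424.99   # total token throughput
```

This prints 722.84 and 1084.26, matching the reported output and total token throughput for that run.
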