z-lab/Qwen3.5-9B-PARO

Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

ParoQuant 是目前最先进的大语言模型 INT4 量化方法。它在接近 AWQ 速度运行的同时，弥合了与 FP16 精度之间的差距。支持 NVIDIA GPU（vLLM、Transformers）和 Apple Silicon（MLX）。更多信息请参见 https://github.com/z-lab/paroquant。

z-lab/Qwen3.5-9B-PARO 是采用 ParoQuant 量化的 4 位版本 Qwen/Qwen3.5-9B。可在 Hugging Face 合集中查看其他 ParoQuant 模型。

快速开始

安装

# NVIDIA GPU (CUDA 12.9)
pip install "paroquant[vllm]"

# NVIDIA GPU (CUDA 13.0)
pip install "paroquant[vllm]" "vllm==0.19.1" \
  --extra-index-url https://wheels.vllm.ai/0.19.1/cu130 \
  --extra-index-url https://download.pytorch.org/whl/cu130

# Apple Silicon
pip install "paroquant[mlx]"

交互式聊天

python -m paroquant.cli.chat --model z-lab/Qwen3.5-9B-PARO

OpenAI 兼容 API 服务器

对于 vLLM，您可以直接使用 vllm serve 来部署 ParoQuant 模型：

vllm serve z-lab/Qwen3.5-9B-PARO --port 8000

对于其他框架：

python -m paroquant.cli.serve --model z-lab/Qwen3.5-9B-PARO --port 8000

对于 MLX，如果你希望加载 VLM 组件并使用模型的多模态功能，请添加 --vlm。对于 vLLM，VLM 组件默认会加载，可通过服务器参数 --language-model-only 跳过加载。

[!NOTE] 此检查点中的视觉组件以原始精度存储，仅语言组件量化为 4 位；因此，该模型大小大于完全量化的模型。如果不使用多模态功能，为获得最佳效率，请避免加载 VLM 组件。

Docker（NVIDIA GPU）

[!NOTE] 以下命令将本地缓存目录映射到容器，以便在多次运行中持久化内核缓存。如需禁用此行为，请移除 -v ...。

# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3.5-9B-PARO

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3.5-9B-PARO

引用

@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}