
Qwen3-Coder-Next

Introduction

Qwen3-Coder-Next is a highly sparse Mixture-of-Experts (MoE) model. Compared with the MoE architecture of Qwen3, it introduces key improvements such as a hybrid attention mechanism (mixing linear attention with traditional full attention), improving training and inference efficiency in long-context, large-parameter scenarios.
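As a purely illustrative sketch of the hybrid-attention idea (the actual layer ratio and ordering inside Qwen3-Coder-Next are not specified here and are an assumption), such stacks typically interleave many linear-attention layers with a smaller number of full-attention layers:

```python
def layer_attention_types(num_layers: int, full_every: int = 4) -> list:
    """Illustrative only: mark every `full_every`-th layer as full attention
    and the rest as linear attention. The real model's ratio may differ."""
    return [
        "full" if (i + 1) % full_every == 0 else "linear"
        for i in range(num_layers)
    ]

# An 8-layer toy stack with one full-attention layer per group of four.
print(layer_attention_types(8))
```

The linear-attention layers keep per-token cost roughly constant with sequence length, while the occasional full-attention layers preserve global token-to-token interactions.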

The Qwen3-Coder-Next model is supported in vllm-ascend:v0.14.0rc1.


Obtaining the Weights

Download the weights from AtomGit AI or Hugging Face.

Deployment

Start the Docker container

# Update the vllm-ascend image
# For Atlas A2 machines:
# export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
# For Atlas A3 machines:
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|-a3
docker run --rm \
--shm-size=1g \
--name qwen3-coder-next \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-it $IMAGE bash

Make sure Triton Ascend is installed in your environment so the model can run:

pip install triton-ascend==3.2.0

Inference

Offline Inference

Run the following offline script, which feeds the model four prompts:

import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]

    # Create a sampling params object.
    sampling_params = SamplingParams(max_tokens=100, temperature=0.0)
    # Create an LLM.
    llm = LLM(
        model="/path/to/model/Qwen3-Coder-Next/",
        tensor_parallel_size=4,
        trust_remote_code=True,
        max_model_len=10000,
        gpu_memory_utilization=0.8,
        max_num_seqs=4,
        max_num_batched_tokens=4096,
        compilation_config={"cudagraph_mode": "FULL_DECODE_ONLY"},
    )

    # Generate texts from the prompts.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    main()

Online Inference

Run the following command to start an online service:

vllm serve /path/to/model/Qwen3-Coder-Next/ --tensor-parallel-size 4 --max-model-len 32768 --gpu-memory-utilization 0.8 --max-num-batched-tokens 4096 --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Then send a request to the model:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/path/to/model/Qwen3-Coder-Next/",
        "prompt": "The future of AI is",
        "max_tokens": 100,
        "temperature": 0
        }'

After the request completes, you should see a response like:

Prompt: 'The future of AI is', Generated text: ' not just about building smarter machines, but about creating systems that can collaborate with humans in meaningful, ethical, and sustainable ways. As AI continues to evolve, it will increasingly shape how we live, work, and interact — and the decisions we make today will determine whether this future is one of shared prosperity or deepening inequality.\n\nThe rise of generative AI, for example, has already begun to transform creative industries, education, and scientific research. Tools like ChatGPT, Midjourney, and'
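As an alternative to curl, the same completion request can be built and sent from Python with only the standard library. This is a minimal sketch mirroring the example above (the endpoint URL and model path are the placeholders from this document; adjust them to your deployment):

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 100,
                             temperature: float = 0.0) -> dict:
    # Payload for the OpenAI-compatible /v1/completions endpoint served by vLLM.
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def send_completion(url: str, payload: dict) -> dict:
    # POST the JSON payload and decode the JSON response.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = build_completion_request("/path/to/model/Qwen3-Coder-Next/",
                                   "The future of AI is")
print(json.dumps(payload, indent=2))
# To actually send it (requires the server started by `vllm serve` above):
# result = send_completion("http://localhost:8000/v1/completions", payload)
# print(result["choices"][0]["text"])
```

The generated text of the first choice is available at `result["choices"][0]["text"]` in the decoded response.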

Disclaimer

1) This is currently a preview release; performance optimization is ongoing.
2) The datasets and models mentioned in this repository are examples only and are provided solely for non-commercial use. If you use them to run the examples, please take particular care to comply with the corresponding dataset and model licenses; Huawei assumes no liability for any infringement disputes arising from your use of these datasets or models.
3) If you encounter any problems while using this repository (including but not limited to functional or compliance issues), please file an issue in this repository, and we will review and respond promptly.