Load the Docker image
Start the container
docker run --privileged \
--name glm41v_int8 \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
--network host \
-v /dev/shm:/dev/shm \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64 \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /home:/home \
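-it glm4.1v:1017 /bin/bash
Following the pattern of `-v /home:/home \`, add any other host directories that need to be mounted.
Create the directory /home/glm41v_int8 and add it to the whitelist:
mkdir /home/glm41v_int8
export HUB_WHITE_LIST_PATHS=/home/glm41v_int8
Then run the following Python program to download the model weights to /home/glm41v_int8: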
from openmind_hub import snapshot_download
snapshot_download(
repo_id="MindSpore-Lab/GLM-4.1V-9B-Thinking-golden-stick-8bit",
local_dir="/home/glm41v_int8",
local_dir_use_symlinks=False
)
export VLLM_MS_MODEL_BACKEND=Native
export ASCEND_TOTAL_MEMORY_GB=40
export MS_ENABLE_LCCL=off
export MS_ENABLE_INTERNAL_BOOST=off
export ASCEND_RT_VISIBLE_DEVICES=6,7 # select the 300I cards to use
export MS_ALLOC_CONF=enable_vmm:true
export ASCEND_CUSTOM_OPP_PATH=/usr/local/python3.11.13/lib/python3.11/site-packages/ms_custom_ops/vendors/customize/
# You can change the port given by `--port` (8140 in this example) and the tensor-parallel degree given by `--tensor-parallel-size` (2 in this example)
vllm-mindspore serve /home/glm41v_int8/ --port 8140 --limit_mm_per_prompt='{"video":"0"}' --disable-mm-preprocessor-cache --disable-log-requests --disable-uvicorn-access-log --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --max-num-batched-tokens 32768 --block_size 128 --quantization smoothquant > log.txt 2>&1 &
Use tail -f log.txt to follow the startup progress. The service has started successfully once the following lines appear:
INFO: Waiting for application startup.
INFO: Application startup complete.
Taking the server 60.10.230.191 as an example, check that the service port is reachable:
curl http://60.10.230.191:8140/v1/models
If the service is healthy, it returns a response like the following:
{"object":"list","data":[{"id":"/home/glm41v_int8/","object":"model","created":1760782194,"owned_by":"vllm","root":"/home/glm41v_int8/","parent":null,"max_model_len":65536,"permission":[{"id":"modelperm-7552ce4e0f2d4f5888ee9775794f068a","object":"model_permission","created":1760782194,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
Then send a test request to the service:
curl http://60.10.230.191:8140/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/home/glm41v_int8/",
"prompt": "456*123等于多少?",
...
}'
The response ends as follows (its beginning is omitted):
...<answer>要计算 \$ 456 \\times 123 \$,可通过**竖式乘法**或**分步拆分计算**完成,以下是详细过程: \n\n\n### 方法一:竖式乘法 \n将 \$ 123 \$ 按位拆分为百位、十位、个位,分别与 \$ 456 \$ 相乘后累加: \n1. 先计算 \$ 123 \\times 6 \$(个位): \n \$ 6 \\times 6 = 36 \$(个位写 6,向十位进 3), \n \$ 6 \\times 5 = 30 + 3 = 33 \$(十位写 3,向百位进 3), \n \$ 6 \\times 4 = 24 + 3 = 27 \$(百位写 27), \n 所以 \$ 123 \\times 6 = 738 \$。 \n\n2. 再计算 \$ 123 \\times 50 \$(十位,注意乘后加 1 个 0): \n 先算 \$ 123 \\times 5 = 615 \$,再在末尾加 1 个 0,得 \$ 6150 \$。 \n\n3. 最后计算 \$ 123 \\times 400 \$(百位,注意乘后加 2 个 0): \n 先算 \$ 123 \\times 4 = 492 \$,再在末尾加 2 个 0,得 \$ 49200 \$。 \n\n4. 累加所有结果: \n \$ 738 + 6150 = 6888 \$, \n \$ 6888 + 49200 = 56088 \$。 \n\n\n### 方法二:分步拆分计算 \n将 \$ 123 \$ 拆分为 \$ 100 + 20 + 3 \$,分别与 \$ 456 \$ 相乘后累加: \n- \$ 456 \\times 100 = 45600 \$ \n- \$ 456 \\times 20 = 456 \\times 2 \\times 10 = 912 \\times 10 = 9120 \$ \n- \$ 456 \\times 3 = 1368 \$(计算过程:\$ 3 \\times 6 = 18 \$,\$ 3 \\times 50 = 150 \$,\$ 3 \\times 400 = 1200 \$,累加得 \$ 1200 + 150 + 18 = 1368 \$) \n\n将结果累加: \n\$ 45600 + 9120 = 54720 \$, \n\$ 54720 + 1368 = 56088 \$。 \n\n\n综上,\$ 456 \\times 123 = 56088 \$。 \n最终答案: \n<|begin_of_box|>56088<|end_of_box|></answer>","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":9,"total_tokens":1033,"completion_tokens":1024,"prompt_tokens_details":null},"kv_transfer_params":null}
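The same test request can also be sent from Python, and the final answer pulled out of the `<|begin_of_box|>…<|end_of_box|>` markers that appear in the sample response. The following is a minimal sketch using only the standard library, not part of the original procedure. The helper names (`extract_boxed_answer`, `build_completion_request`, `ask`) are hypothetical, the `max_tokens=1024` default is an assumption, and reading the generated text from `choices[0]["text"]` assumes the usual OpenAI-compatible completions layout, which the sample response does not show in full.

```python
import json
import re
import urllib.request

# Answer markers as they appear in the sample response.
BOX_PATTERN = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)


def extract_boxed_answer(text):
    """Return the content of the first boxed span in text, or None if absent."""
    match = BOX_PATTERN.search(text)
    return match.group(1).strip() if match else None


def build_completion_request(base_url, model, prompt, max_tokens=1024):
    """Build a POST request for the /v1/completions endpoint (nothing is sent)."""
    # max_tokens is an assumed value, not taken from the original curl command.
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, prompt, timeout=600):
    """Send the prompt to the server and return the boxed answer, or None."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the OpenAI-style layout: choices[0]["text"] holds the generated text.
    return extract_boxed_answer(body["choices"][0]["text"])
```

With the example server from this guide, `ask("http://60.10.230.191:8140", "/home/glm41v_int8/", "456*123等于多少?")` sends the same prompt as the curl command. The result can be compared with `str(456 * 123)`. It is `None` if generation stops before the boxed answer is produced.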