20250311-Running vLLM in Docker on Windows with a local 4 GB GPU

2025/3/12 17:06:21 Source: https://blog.csdn.net/weixin_46449024/article/details/146188403

Prerequisites:

Register a Hugging Face account and create an access token; it is passed to the container as HUGGING_FACE_HUB_TOKEN.

Run vLLM

docker run --name LocalvLLM_qwen1.5B_Int4 --runtime nvidia --gpus all -v D:/vLLM/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=changeme" --env "HUGGINGFACE_CO_URL_HOME=https://hf-mirror.com/" --env "_HF_DEFAULT_ENDPOINT=https://hf-mirror.com" --env "HF_ENDPOINT=https://hf-mirror.com" -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4 --gpu-memory-utilization=1 --max-model-len 4096
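
On the first start the container has to fetch the model, so the API may not answer right away. A minimal Python sketch (standard library only) that polls until the server responds, assuming the 8000:8000 port mapping from the command above and the OpenAI-style /v1/models endpoint of the vLLM server. The helper names model_ids and wait_until_ready and the timeout values are illustrative, not from the original post.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the -p 8000:8000 mapping above


def model_ids(payload: dict) -> list[str]:
    """Return the model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url: str = BASE_URL, timeout_s: float = 600.0,
                     interval_s: float = 5.0) -> list[str]:
    """Poll /v1/models until the server answers or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                return model_ids(json.load(resp))
        except OSError:
            # Connection refused, reset, or timed out: server not up yet.
            time.sleep(interval_s)
    raise TimeoutError(f"vLLM did not become ready within {timeout_s} s")
```

Calling wait_until_ready() should return a list containing the served model name once loading has finished; until then it keeps retrying every few seconds.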

Test:

curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4","prompt": "San Francisco is a","max_tokens": 7,"temperature": 0}'

{"id":"cmpl-e6c75e13fd784f08b764aee18f325f65","object":"text_completion","created":1741695843,"model":"Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4","choices":[{"index":0,"text":" city with a rich history and culture","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7,"prompt_tokens_details":null}}
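
The same request can be sent from Python. A minimal sketch using only the standard library, assuming the server from the docker command is listening on localhost:8000; the function names build_request, first_text and complete are hypothetical helpers, not part of vLLM.

```python
import json
import urllib.request

MODEL = "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4"


def build_request(prompt: str, max_tokens: int = 7, temperature: float = 0.0,
                  base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST request that mirrors the curl call above."""
    body = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_text(response: dict) -> str:
    """Extract the generated text of the first choice."""
    return response["choices"][0]["text"]


def complete(prompt: str, **kwargs) -> str:
    """Send the prompt to the server and return the completion text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=60) as resp:
        return first_text(json.load(resp))
```

With the server running, complete("San Francisco is a") should return the text of the first choice, which in the response shown above was " city with a rich history and culture".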

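*If GPU memory is insufficient, lower the maximum context length with a parameter (here --max-model-len 4096) and use a quantized version of the model (here GPTQ-Int4).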
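
To see roughly why a smaller --max-model-len helps, here is a back-of-the-envelope estimate of the KV cache needed for one sequence of full length. The architecture numbers (28 layers, 2 KV heads, head dimension 128, 16-bit cache values) and the 32768-token default context are assumptions about the Qwen2.5 1.5B model, not figures from the original post; check the model's config.json before relying on them. The function name kv_cache_bytes is illustrative.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int = 28, num_kv_heads: int = 2,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence of num_tokens tokens.

    Default values are assumed for Qwen2.5 1.5B; verify against config.json.
    """
    # Factor 2: one key and one value vector per layer, per KV head, per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


for max_len in (32768, 4096, 2048):
    print(f"{max_len:>6} tokens -> {kv_cache_bytes(max_len) / 2**20:.0f} MiB")
# Under the assumed values this prints:
#  32768 tokens -> 896 MiB
#   4096 tokens -> 112 MiB
#   2048 tokens -> 56 MiB
```

Under these assumptions the cache for a single full-length sequence shrinks from about 896 MiB to about 112 MiB when the context is capped at 4096 tokens, which matters on a 4 GB card where the weights and runtime overhead already take a large share.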

References:

Reducing vLLM GPU memory usage | vLLM small model, large GPU memory problem (gpu-memory-utilization), CSDN blog
