
vLLM is a high-performance, low-latency inference and serving framework for large language models (LLMs), developed by a team at UC Berkeley. Its core innovation is PagedAttention, which manages the attention key-value (KV) cache in pages, markedly improving GPU memory utilization and reducing fragmentation; the vLLM paper reports up to 24x the throughput of traditional serving stacks such as Hugging Face Transformers. The framework supports continuous batching, dynamic memory allocation, and multi-GPU parallel inference, handles 8k+ long-context requests efficiently, and exposes an OpenAI-compatible API, so developers can deploy Hugging Face models quickly. By integrating quantization techniques such as FP8 and AWQ, vLLM cuts resource consumption substantially while preserving inference accuracy, and it has become a go-to choice for enterprise AI deployments (e.g., distributed clusters serving the DeepSeek-R1 671B model).
Chinese documentation: https://vllm.hyper.ai/docs/
Create a new virtual environment:
conda create --name vllm python=3.9
Then activate it:
conda activate vllm
Install PyTorch:
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.1 -c pytorch -c nvidia
Nothing else should need to be installed at this stage. As for the PyTorch version, strict pinning is no longer really necessary; even without a system-wide CUDA installation, the conda package bundles the CUDA runtime it needs.
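One piece is still missing before the serving steps below: vLLM itself. Install it with pip, and optionally sanity-check that PyTorch sees the GPU:
pip install vllm
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"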
We download the model from the ModelScope community. Website: https://www.modelscope.cn/my/overview. Install modelscope with:
pip install modelscope
Create a new script, paste in the following, and run it.
# Download the model from ModelScope
from modelscope import snapshot_download
model_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B')
print(model_dir)  # local path where the weights were stored
Once the script finishes, the DeepSeek-R1-Distill-Qwen-1.5B model files are on disk, with model_dir holding the local path.
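Before standing up a server, you can smoke-test the downloaded weights with vLLM's offline inference API. This is a minimal sketch, assuming the weights resolve from the ModelScope cache as above; max_model_len is capped here only to keep memory use modest:
from modelscope import snapshot_download
from vllm import LLM, SamplingParams

# Resolves to the already-downloaded local path (cache hit, no re-download).
model_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B')
llm = LLM(model=model_dir, max_model_len=4096)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello!"], params)
print(outputs[0].outputs[0].text)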
There are two deployment methods. The first uses vllm serve; taking the 1.5B model as an example, run the command below (--tensor-parallel-size 1 uses a single GPU, --max-model-len 32768 caps the context length, and --enforce-eager disables CUDA graph capture):
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --tensor-parallel-size 1 --max-model-len 32768 --enforce-eager --port 11111 --api-key token-abc123
Test code:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11111/v1",
    api_key="token-abc123",
)
completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
print(completion.choices[0].message)
Test result:
ChatCompletionMessage(content='Alright, the user said "Hello!" and I should respond warmly.\n\nI\'ll greet them and offer my help.\n\nMake sure to keep it friendly and open-ended.\n</think>\n\nHello! How can I assist you today?', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[], reasoning_content=None)
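The endpoint also supports streaming. A small sketch with stream=True, against the same server and key as above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11111/v1", api_key="token-abc123")
# stream=True yields chunks as tokens are generated, instead of one final message.
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)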
The second method uses vllm.entrypoints.openai.api_server:
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --served-model-name "openchat"
Open another terminal and run the following (this server listens on the default port, 8000):
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openchat",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'
The response:
{"id":"cmpl-50e672d51948481bb573112f2b0562e5","object":"text_completion","created":1739353907,"model":"openchat","choices":[{"index":0,"text":" city in the United States, and","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7,"prompt_tokens_details":null}}

Sometimes certain GPUs are already in use and the deployment has to be pinned to specific cards:
export CUDA_VISIBLE_DEVICES=4,5 && vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dtype auto --api-key token-abc123 --tensor-parallel-size 2 --port 11112 --cpu-offload-gb 20
In another case, partially occupied GPUs still have some free memory; then use --gpu-memory-utilization to cap the fraction of each GPU's memory that vLLM claims:
export CUDA_VISIBLE_DEVICES=4,5 && vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dtype auto --api-key token-abc123 --tensor-parallel-size 2 --port 11112 --cpu-offload-gb 20 --gpu-memory-utilization 0.6
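Once the server is up, two quick checks are useful: ask it which model it is serving via the /v1/models endpoint, and watch memory usage on the pinned cards (the port, key, and GPU ids match the commands above):
curl http://localhost:11112/v1/models -H "Authorization: Bearer token-abc123"
watch -n 1 nvidia-smi -i 4,5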