vLLM is a framework designed to accelerate large language model inference. It achieves near-zero waste of KV cache memory, addressing the memory-management bottleneck in LLM serving.
More vLLM documentation and tutorials (in Chinese) are available at https://vllm.hyper.ai/
Source code: vllm-project/vllm
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)
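# cpu_offload_gb=10 keeps roughly 10 GiB of model weights per GPU in CPU
# memory instead of GPU memory. For a 13B model in fp16 (about 26 GiB of
# weights), this leaves roughly 16 GiB of weights resident on the GPU.
# Offloaded weights are streamed back to the GPU on every forward pass,
# so a fast CPU-GPU interconnect matters for throughput.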
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
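How much to offload depends on how much GPU memory is actually free. As a rough illustration, the hypothetical helper below (not part of vLLM's API) estimates a cpu_offload_gb value from the free memory reported by torch.cuda.mem_get_info, assuming the fp16 Llama-2-13B weights take about 26 GiB and leaving a few GiB of headroom for the KV cache:

import torch

def suggested_cpu_offload_gb(weight_gb: float = 26.0, headroom_gb: float = 4.0) -> float:
    # Offload whatever portion of the weights (plus KV-cache headroom)
    # does not fit into the GPU memory that is currently free.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / (1024 ** 3)
    return max(0.0, weight_gb + headroom_gb - free_gb)

# Example: pass the estimate to the constructor shown above.
# llm = LLM(model="meta-llama/Llama-2-13b-chat-hf",
#           cpu_offload_gb=suggested_cpu_offload_gb())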