vLLM Source Code (Part 1)

Here is what happens when you create a vLLM engine.

vllm.LLM()

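LLM provides an llm_engine: it passes an engine_args dataclass to the class method from_engine_args, which returns an LLMEngine instance.
from_engine_args does a lot: it builds the configs that control each stage, loads the model, and so on.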

LLMEngine.from_engine_args()

engine_config = engine_args.create_engine_config(usage_context)
executor_class = cls._get_executor_cls(engine_config)
# Create the LLM engine.
engine = cls(
    vllm_config=engine_config,
    executor_class=executor_class,
    log_stats=not engine_args.disable_log_stats,
    usage_context=usage_context,
    stat_loggers=stat_loggers,
)

Just three statements, and each one does heavy lifting.

Line 1

engine_config = engine_args.create_engine_config(usage_context)

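It returns a VllmConfig, a class that bundles the individual configs:
[image]
As you can see, there are configs for the model, the cache, parallelism, the device, and more.
The value of each config is listed below to give a concrete feel. This is a very plain model load; with other settings the configs will look somewhat different.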

model_config:
Basic information about the model.
{'model': '/mnt/public/open_source_model/Qwen2.5/Qwen2.5-7B-Instruct', 'tokenizer': '/mnt/public/open_source_model/Qwen2.5/Qwen2.5-7B-Instruct', 'tokenizer_mode': 'auto', 'trust_remote_code': True, 'allowed_local_media_path': '', 'seed': 0, 'revision': None, 'code_revision': None, 'rope_scaling': None, 'rope_theta': None, 'tokenizer_revision': None, 'quantization': None, 'quantization_param_path': None, 'enforce_eager': False, 'max_seq_len_to_capture': 8192, 'max_logprobs': 20, 'disable_sliding_window': False, 'skip_tokenizer_init': False, 'hf_config': Qwen2Config {"_name_or_path": "/mnt/public/open_source_model/Qwen2.5/Qwen2.5-7B-In..._window": false,"vocab_size": 152064
}
, 'hf_text_config': Qwen2Config {"_name_or_path": "/mnt/public/open_source_model/Qwen2.5/Qwen2.5-7B-In..._window": false,"vocab_size": 152064
}
, 'encoder_config': None, 'hf_image_processor_config': {}, 'dtype': torch.bfloat16, 'use_async_output_proc': True, 'mm_processor_kwargs': None, 'mm_cache_preprocessor': False, 'max_model_len': 32768, 'served_model_name': '/mnt/public/open_source_model/Qwen2.5/Qwen2.5-7B-Instruct', 'multimodal_config': None, 'is_attention_free': False, 'is_hybrid': False, 'has_inner_state': False, 'override_neuron_config': None, 'supported_tasks': {'score', 'embed', 'classify', 'generate', 'reward'}, 'task': 'generate', 'pooler_config': None, 'logits_processor_pattern': None}
cache_config:
Controls the KV cache.
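To get a feel for what block_size means in memory terms, here is a rough sketch of the per-block arithmetic, using the block_size and swap_space_bytes values from the dump just below. The layer count, KV-head count, head size and the 2-byte bfloat16 element size are my assumptions for Qwen2.5-7B-Instruct (check the model's config.json), and kv_block_bytes is a hypothetical helper, not a vLLM function.

```python
def kv_block_bytes(block_size, num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes needed to hold keys and values for one cache block across all layers."""
    per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes  # 2 = key + value
    return block_size * num_layers * per_token_per_layer


BLOCK_SIZE = 16                # cache_config['block_size']
SWAP_SPACE_BYTES = 4294967296  # cache_config['swap_space_bytes'], i.e. 4 GiB

# Assumed model shape (not taken from the dump).
block_bytes = kv_block_bytes(
    BLOCK_SIZE, num_layers=28, num_kv_heads=4, head_dim=128, dtype_bytes=2
)

# How many blocks of that size fit into the CPU swap space.
num_cpu_blocks = SWAP_SPACE_BYTES // block_bytes

print(block_bytes, num_cpu_blocks)
```

Under these assumptions one block of 16 tokens takes a bit under 1 MiB, so the 4 GiB swap space holds a few thousand blocks. num_gpu_blocks and num_cpu_blocks are still None in the dump, presumably because they are only filled in once the engine has measured the available memory.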
{'block_size': 16, 'gpu_memory_utilization': 0.5, 'swap_space_bytes': 4294967296, 'num_gpu_blocks_override': None, 'cache_dtype': 'auto', 'is_attention_free': False, 'sliding_window': None, 'enable_prefix_caching': False, 'cpu_offload_gb': 0, 'num_gpu_blocks': None, 'num_cpu_blocks': None}
parallel_config:
The pipeline-parallel (pp) and tensor-parallel (tp) sizes you set show up here.
{'pipeline_parallel_size': 1, 'tensor_parallel_size': 1, 'worker_use_ray': False, 'max_parallel_loading_workers': None, 'disable_custom_all_reduce': False, 'tokenizer_pool_config': None, 'ray_workers_use_nsight': False, 'placement_group': None, 'distributed_executor_backend': None, 'worker_cls': 'vllm.worker.worker.Worker', 'sd_worker_cls': 'auto', 'rank': 0, 'world_size': 1}
scheduler_config:
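The scheduler's config. The scheduler is the main actor that drives inference.
When GPU memory runs short during inference, vLLM has two ways to preempt, selected by preemption_mode. If memory runs out while a batch is being processed, requests are served first-come-first-served, and the later arrivals are handled in one of two ways. One is recomputation: everything already computed for the sequence is dropped and recomputed later. The other is swapping: the computed KV cache is offloaded to the CPU and loaded back later. The choice is a trade-off between transfer speed and compute speed. The default is recomputation because its overhead is lower. With beam search, however, a request holds multiple sequences and recomputation is not supported, so only swapping can be used.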
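The two preemption modes described above can be sketched with a toy model. This is only an illustration of the idea, not vLLM's scheduler: Seq and preempt_until_fits are hypothetical names, and real preemption tracks far more state.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    name: str
    blocks: int            # GPU blocks currently held
    location: str = "gpu"  # "gpu", "cpu" (swapped out) or "waiting" (to be recomputed)


def preempt_until_fits(running, free_blocks, needed, mode="recompute"):
    """Free GPU blocks by preempting the most recently arrived sequences first.

    `running` is ordered by arrival time (FCFS), so earlier requests keep their place.
    """
    preempted = []
    while free_blocks < needed and running:
        victim = running.pop()       # the latest arrival gives up its blocks
        free_blocks += victim.blocks
        if mode == "swap":
            victim.location = "cpu"  # KV blocks are moved to CPU and restored later
        else:
            victim.blocks = 0        # KV blocks are dropped and recomputed later
            victim.location = "waiting"
        preempted.append(victim)
    return free_blocks, preempted


running = [Seq("a", 4), Seq("b", 3), Seq("c", 5)]
free, out = preempt_until_fits(running, free_blocks=1, needed=6, mode="recompute")
print(free, [s.name for s in out], [s.name for s in running])
```

In the dump, preemption_mode is None, which I read as "let the scheduler pick", so recomputation would apply for an ordinary single-sequence request.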
{'runner_type': 'generate', 'max_num_batched_tokens': 32768, 'max_num_seqs': 256, 'max_model_len': 32768, 'num_lookahead_slots': 0, 'delay_factor': 0.0, 'enable_chunked_prefill': False, 'is_multimodal_model': False, 'preemption_mode': None, 'num_scheduler_steps': 1, 'multi_step_stream_outputs': True, 'send_delta_data': False, 'policy': 'fcfs', 'chunked_prefill_enabled': False}
device_config:
Omitted.
{'device_type': 'cuda', 'device': device(type='cuda')}
load_config:
The model gets loaded later. Before that, a model loader is chosen, and the loader then loads the model. load_format is the variable that decides which loader is used.
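The "pick a loader first, then let the loader load" idea can be sketched as a small dispatch function. Everything here is a toy: ToyLoadFormat, the two loader classes and get_toy_loader are hypothetical names, and the real enum has more members than shown.

```python
from enum import Enum


class ToyLoadFormat(str, Enum):
    AUTO = "auto"
    SAFETENSORS = "safetensors"
    DUMMY = "dummy"


class ToyDefaultLoader:
    """Stands in for a loader that reads real weights from disk."""

    def __init__(self, load_format):
        self.load_format = load_format

    def load_model(self):
        return f"weights read from disk ({self.load_format.value})"


class ToyDummyLoader:
    """Stands in for a loader that skips the weights entirely."""

    def load_model(self):
        return "randomly initialised weights"


def get_toy_loader(load_format):
    # The format only decides which loader object is built;
    # the loader is what actually produces the model.
    if load_format is ToyLoadFormat.DUMMY:
        return ToyDummyLoader()
    return ToyDefaultLoader(load_format)


loader = get_toy_loader(ToyLoadFormat.AUTO)
print(type(loader).__name__, "->", loader.load_model())
```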
{'load_format': <LoadFormat.AUTO: 'auto'>, 'download_dir': None, 'model_loader_extra_config': None, 'ignore_patterns': ['original/**/*']}
lora_config:
enable_lora was not set, so this is None.
None
speculative_config:
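Speculative decoding can also be configured, but if you don't provide a path to the draft model, this is None.
A static method of the class, SpeculativeConfig.maybe_create_spec_config, returns the instantiated SpeculativeConfig.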
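The "maybe create" pattern is easy to picture with a toy class. ToySpeculativeConfig and its fields are hypothetical, and the real method takes many more arguments; the point is only that a static factory can return None when the feature is not requested.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySpeculativeConfig:
    draft_model: str
    num_speculative_tokens: int

    @staticmethod
    def maybe_create(
        draft_model: Optional[str], num_speculative_tokens: int = 1
    ) -> Optional["ToySpeculativeConfig"]:
        # No draft model given: speculative decoding is off, so there is no config.
        if draft_model is None:
            return None
        return ToySpeculativeConfig(draft_model, num_speculative_tokens)


print(ToySpeculativeConfig.maybe_create(None))
print(ToySpeculativeConfig.maybe_create("path/to/draft", num_speculative_tokens=4))
```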
None
decoding_config:
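Guided decoding also has several backends. The default here is xgrammar, proposed by Tianqi Chen's team in 2024, which lets context-free grammars constrain text generation at close to zero extra cost. (I don't understand the details myself.)
{'guided_decoding_backend': 'xgrammar'}
observability_config: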
I'm not familiar with this one.
{'otlp_traces_endpoint': None, 'collect_model_forward_time': False, 'collect_model_execute_time': False}
prompt_adapter_config:
I'm not familiar with this one either.
None
quant_config:
Quantization settings.
None
compilation_config:
{'level': 0, 'debug_dump_path': '', 'cache_dir': '', 'backend': '', 'custom_ops': [], 'splitting_ops': ['vllm.unified_attention', 'vllm.unified_attention_with_output'], 'use_inductor': True, 'candidate_compile_sizes': [], 'inductor_compile_config': {}, 'inductor_passes': {}, 'use_cudagraph': False, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'pass_config': PassConfig(dump_graph_stages=[], dump_graph_dir=PosixPath('.'), enable_fusion=True, enable_reshape=True), 'compile_sizes': [], 'capture_sizes': [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, ...], 'max_capture_size': 256, 'bs_to_padded_graph_size': [0, 1, 2, 4, 4, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, 24, 24, ...], 'enabled_custom_ops': Counter(), 'disabled_custom_ops': Counter(), 'compilation_time': 0.0, 'inductor_hash_cache': <function PrivateAttr at 0x7f452ce88550>, 'static_forward_context': {}}
kv_transfer_config:
I only saw the member variable and found no place where it is set. It looks like a placeholder.
None
instance_id:
'6a1e7'

Finally, all the configs are packed back into VllmConfig, which changes nothing and simply serves as a dataclass that groups them.
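The "config of configs" idea, and one way dumps like the ones above could be produced, can be sketched with plain dataclasses. The Toy* classes and dump_config are hypothetical; the real VllmConfig has many more fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCacheConfig:
    block_size: int = 16
    gpu_memory_utilization: float = 0.5


@dataclass
class ToyParallelConfig:
    pipeline_parallel_size: int = 1
    tensor_parallel_size: int = 1


@dataclass
class ToyVllmConfig:
    """Holds the sub-configs and nothing else."""
    cache_config: ToyCacheConfig
    parallel_config: ToyParallelConfig
    lora_config: Optional[object] = None  # stays None when the feature is off


def dump_config(config):
    """Return {sub-config name: its attributes}, keeping None entries as None."""
    out = {}
    for name, sub in vars(config).items():
        out[name] = dict(vars(sub)) if hasattr(sub, "__dict__") else sub
    return out


toy = ToyVllmConfig(ToyCacheConfig(), ToyParallelConfig())
for name, values in dump_config(toy).items():
    print(f"{name}:", values)
```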

Line 2

executor_class = cls._get_executor_cls(engine_config)

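This line gets an executor. What comes back is a class: the base class is ExecutorBase, and the plainest subclass is GPUExecutor. Which executor is chosen depends on the device type and on parallel_config.distributed_executor_backend. Mine is None on CUDA, so GPUExecutor is returned. Others include TPUExecutor, CPUExecutor, HPUExecutor, RayXPUExecutor, OpenVINOExecutor, and so on.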
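A toy version of that selection makes the "returns a class, not an instance" point concrete. The Toy* classes and get_toy_executor_cls are hypothetical, and the branch order is my simplification of what the real method checks.

```python
class ToyExecutorBase:
    pass


class ToyGPUExecutor(ToyExecutorBase):
    pass


class ToyRayGPUExecutor(ToyExecutorBase):
    pass


class ToyCPUExecutor(ToyExecutorBase):
    pass


def get_toy_executor_cls(device_type, distributed_executor_backend):
    """Pick an executor class; the caller instantiates it later."""
    if device_type == "cpu":
        return ToyCPUExecutor
    if distributed_executor_backend == "ray":
        return ToyRayGPUExecutor
    if distributed_executor_backend is None:
        return ToyGPUExecutor
    raise ValueError(f"unsupported backend: {distributed_executor_backend!r}")


# Values from the dumps above: device_type 'cuda', backend None.
executor_class = get_toy_executor_cls("cuda", None)
print(executor_class.__name__)
```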

Line 3

engine = cls(
    vllm_config=engine_config,
    executor_class=executor_class,
    log_stats=not engine_args.disable_log_stats,
    usage_context=usage_context,
    stat_loggers=stat_loggers,
)

cls is the implicit first argument of a class method, which here is LLMEngine itself. So this statement instantiates an LLMEngine.
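The whole three-statement factory can be mimicked in a few lines. ToyEngineArgs, ToyEngine and ToySubEngine are hypothetical stand-ins; the sketch only shows how cls(...) inside a class method builds an instance of whichever class the method was called on.

```python
from dataclasses import dataclass


@dataclass
class ToyEngineArgs:
    model: str
    disable_log_stats: bool = False

    def create_engine_config(self):
        # Stand-in for building the full config bundle.
        return {"model": self.model}


class ToyEngine:
    def __init__(self, config, log_stats):
        self.config = config
        self.log_stats = log_stats

    @classmethod
    def from_engine_args(cls, engine_args):
        config = engine_args.create_engine_config()
        # cls is the class the method was called on, so this is a constructor call.
        return cls(config=config, log_stats=not engine_args.disable_log_stats)


class ToySubEngine(ToyEngine):
    pass


engine = ToyEngine.from_engine_args(ToyEngineArgs(model="some/model"))
sub_engine = ToySubEngine.from_engine_args(ToyEngineArgs(model="some/model"))
print(type(engine).__name__, type(sub_engine).__name__, engine.log_stats)
```

Using cls instead of hard-coding the class name is what lets a subclass reuse the same factory and still get an instance of itself.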
Let's look at LLMEngine's __init__().

LLMEngine's __init__()

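It starts by unpacking the vllm_config that was passed in.
[image]
Next it decides whether to load the tokenizer. Inference can be fed input_ids directly, so loading it is optional (see skip_tokenizer_init in model_config).
Then it creates a counter whose only job is to increment by 1, and pulls out some generation-related config, which has these values: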

{'do_sample': True, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8, 'repetition_penalty': 1.05, 'pad_token_id': 151643, 'bos_token_id': 151643, 'eos_token_id': [151645, 151643], 'transformers_version': '4.45.2'}
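The counter mentioned above really is that simple. Here is a minimal stand-in (ToyCounter is a hypothetical name); my guess is that its purpose is to hand out unique ids, but treat that as an assumption.

```python
class ToyCounter:
    """Returns the current value and then increments it."""

    def __init__(self, start=0):
        self.counter = start

    def __next__(self):
        value = self.counter
        self.counter += 1
        return value


seq_ids = ToyCounter()
ids = [next(seq_ids) for _ in range(3)]
print(ids)
```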

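Then the InputPreprocessor class wraps model_config and the tokenizer. It provides preprocessing functions, that is, the tokenization-related steps that run before actual inference. model_config only supplies some basic information for sanity checks; only model_config.hf_config and model_config.is_encoder_decoder_model are used, and their values can be found in the VllmConfig dump above.

[image]
The next part first sets up self.input_registry.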

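By default an already-instantiated InputRegistry is passed in. It appears to manage inputs; the class docstring describes it as a registry that dispatches data processing according to the target model. Honestly, I still don't fully understand it.

Worth noting: in its method create_input_processor, the code functools.partial(self.process_input, model_config) pre-binds the model_config argument to the method self.process_input.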
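A small self-contained example shows what that functools.partial call does. ToyRegistry and the dict-shaped model_config are hypothetical; only the partial call itself mirrors the line quoted above.

```python
import functools


class ToyRegistry:
    def process_input(self, model_config, inputs):
        # Both arguments are needed, but model_config never changes per model.
        return {"model": model_config["model"], "inputs": inputs}

    def create_input_processor(self, model_config):
        # Pre-bind model_config; the returned callable only needs `inputs`.
        return functools.partial(self.process_input, model_config)


registry = ToyRegistry()
processor = registry.create_input_processor({"model": "some/model"})
result = processor([1, 2, 3])
print(result)
```

The caller ends up with a one-argument function and no longer has to carry model_config around.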
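[image]

I found the explanation of this flow on the vLLM website [link]; step 3 there is where we are now. It looks like this belongs to preprocessing together with InputPreprocessor, but InputPreprocessor handles the regular processing, while InputRegistry can also do things like adding placeholders for multimodal embeddings. A quick look at the methods of the two classes makes the difference clear.
[image]
Next the GPUExecutor comes into play. It is in charge of model loading and of preparing the Worker and the ModelRunner. Steps 4 and 5 in the figure above are what happens when this module is actually used. Now let's see how GPUExecutor is initialized.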

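Here executor_class is the GPUExecutor that was passed in, and VllmConfig is handed to it. GPUExecutor is a concrete subclass of ExecutorBase and has to implement every method except __init__. The __init__ itself only unpacks vllm_config into the individual configs, stores them as member variables, and finally calls _init_executor(). Here is GPUExecutor's implementation of _init_executor():
[image]
Every line here does heavy lifting. The first line creates a Worker; self._create_worker() looks like this: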

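I was curious what local_rank and rank are. In the usual distributed sense, rank is a worker's global index across all processes and local_rank is its index on the local machine. No arguments are passed in here, so everything uses its default value. The arguments collected by self._get_worker_kwargs are: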

{'vllm_config': VllmConfig(model_config=<vllm.config.ModelConfig object at 0x7f9bf55bdf30>, cache_con...transfer_config=None, instance_id='8579c'), 'local_rank': 0, 'rank': 0, 'distributed_init_method': 'tcp://10.233.72.12:43169', 'is_driver_worker': True}
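The relationship between ExecutorBase.__init__, _init_executor() and the worker kwargs above can be sketched as a template-method pattern. All Toy* names are hypothetical, and is_driver_worker = (rank == 0) is my simplification, not the real rule.

```python
from abc import ABC, abstractmethod


class ToyExecutorBase(ABC):
    def __init__(self, config):
        # Unpack the bundle into member variables, then let the subclass set up.
        self.model_config = config["model_config"]
        self.cache_config = config["cache_config"]
        self._init_executor()

    @abstractmethod
    def _init_executor(self):
        ...


class ToyWorker:
    def __init__(self, local_rank, rank, is_driver_worker):
        self.local_rank = local_rank  # index on this machine
        self.rank = rank              # global index across all workers
        self.is_driver_worker = is_driver_worker


class ToyGPUExecutor(ToyExecutorBase):
    def _get_worker_kwargs(self, local_rank=0, rank=0):
        # Called without arguments, so the defaults are used.
        return dict(local_rank=local_rank, rank=rank, is_driver_worker=(rank == 0))

    def _init_executor(self):
        self.driver_worker = ToyWorker(**self._get_worker_kwargs())


executor = ToyGPUExecutor({"model_config": {"model": "some/model"}, "cache_config": {}})
print(vars(executor.driver_worker))
```

With a single GPU there is exactly one worker, so local_rank and rank are both 0 and that worker is the driver, which matches the kwargs shown above.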