Researchers have introduced PagedAttention, an attention algorithm inspired by the virtual memory and paging techniques of operating systems. A key challenge in serving large language models (LLMs) is managing key-value cache (KV cache) memory: when many requests are batched together, inefficiently managed KV cache memory becomes the bottleneck that limits throughput.
PagedAttention addresses this by partitioning each request's KV cache into fixed-size blocks that can be stored in non-contiguous memory, mirroring the way operating systems manage memory with paging.
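To make the idea concrete, here is a minimal sketch (not vLLM's actual code) of storing a request's keys and values in independently allocated blocks, so the cache as a whole need not be contiguous. The block size, head count, and head dimension are hypothetical values chosen for illustration.

```python
import torch

BLOCK_SIZE = 16   # tokens per KV block (illustrative)
NUM_HEADS = 12
HEAD_DIM = 64

class PagedKVCache:
    """Keeps one request's keys/values in separately allocated blocks."""

    def __init__(self):
        self.blocks = []       # each block holds up to BLOCK_SIZE tokens
        self.num_tokens = 0

    def append(self, key: torch.Tensor, value: torch.Tensor):
        """Append one token's (key, value) pair, allocating a new block
        whenever the current block is full."""
        slot = self.num_tokens % BLOCK_SIZE
        if slot == 0:
            # A fresh block is allocated on demand; different blocks live at
            # unrelated addresses, so the overall cache is non-contiguous.
            self.blocks.append(torch.empty(2, BLOCK_SIZE, NUM_HEADS, HEAD_DIM))
        block = self.blocks[-1]
        block[0, slot] = key
        block[1, slot] = value
        self.num_tokens += 1
```

Because memory is reserved one block at a time rather than for the maximum possible sequence length up front, waste is limited to at most the unused slots of the final block.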
Building on PagedAttention, the authors developed the vLLM serving system, which manages KV cache memory with very little waste and improves throughput. Their evaluations suggest that vLLM can increase LLM serving throughput by 2 to 4 times over state-of-the-art systems, with the largest gains on longer sequences and more complex decoding algorithms.
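From a user's perspective, the paging machinery is hidden behind a simple offline-inference API. The snippet below follows vLLM's documented quickstart pattern; the exact arguments may differ across versions, and the model name is just an example.

```python
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# vLLM handles batching and KV cache paging internally.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```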
vLLM's KV Cache Manager treats KV blocks much like an operating system treats pages of virtual memory, keeping per-request block tables that map logical blocks to physical blocks on the GPU. Its GPU kernels are optimized for the memory access patterns PagedAttention introduces, and the system supports multiple decoding algorithms.
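The paging analogy can be sketched as follows. This is an illustrative toy with hypothetical names, not vLLM's internals: a manager hands out fixed-size physical blocks from a shared pool, and each request keeps a block table mapping its logical block indices to physical block numbers, much like a per-process page table.

```python
BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockManager:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # request_id -> list of physical block numbers

    def slot_for_token(self, request_id: str, token_index: int):
        """Return (physical_block, offset) for a token, mapping a new
        physical block the first time a logical block is touched."""
        table = self.block_tables.setdefault(request_id, [])
        logical_block = token_index // BLOCK_SIZE
        if logical_block == len(table):          # first token of a new block
            if not self.free_blocks:
                raise MemoryError("KV cache full; request must wait or be preempted")
            table.append(self.free_blocks.pop())  # map logical -> physical
        return table[logical_block], token_index % BLOCK_SIZE

    def free(self, request_id: str):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
```

Allocating blocks only as tokens are generated, and recycling them as soon as a request finishes, is what keeps fragmentation and over-reservation low.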
In benchmarks against systems such as FasterTransformer and Orca, vLLM delivered higher throughput, with the advantage growing on datasets containing longer sequences. In chatbot-style workloads, vLLM sustained roughly twice the request rate of the Orca baselines.
In summary, the PagedAttention algorithm and the vLLM serving system offer an efficient approach to serving LLMs. By borrowing principles from operating system memory management, they provide a more streamlined way to handle the memory demands of LLM inference. See the paper for details.