LLM Serving

Serving Large Language Models (LLMs) is fundamentally different from serving traditional machine learning models or web APIs. The computational intensity, memory requirements, and sequential nature of token generation introduce a unique set of challenges that require specialized architectures.

In this chapter, we will explore the state-of-the-art techniques for high-throughput, low-latency LLM serving, focusing on Continuous Batching and PagedAttention.

1. The LLM Serving Challenge

When you serve a standard REST API, a request takes a few milliseconds. When you serve an LLM, a single request (generating 500 tokens) can lock up a GPU for seconds.

Key Metrics

  1. Time to First Token (TTFT): How long the user waits before seeing the first word. Critical for perceived latency.
  2. Tokens Per Second (TPS): The speed of generation for a single user.
  3. Throughput (Total TPS): The total number of tokens generated across all users per second. Critical for cost efficiency.
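These metrics are easy to measure from the client side. Below is a minimal sketch that times an arbitrary token stream; `fake_stream` is a hypothetical stand-in for a real server's streaming response (e.g., an SSE stream), not a vLLM API.

```python
import time

def measure_stream(token_stream):
    """Compute TTFT and single-user TPS for an iterable of tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # Time to First Token
        count += 1
    total = time.perf_counter() - start
    tps = count / total if total > 0 else 0.0   # Tokens Per Second (one user)
    return ttft, tps, count

# Simulated stream: ~0.2 s "prefill" delay, then 20 tokens at ~100 tok/s
def fake_stream():
    time.sleep(0.2)
    for i in range(20):
        time.sleep(0.01)
        yield f"tok{i}"

ttft, tps, n = measure_stream(fake_stream())
print(f"TTFT={ttft:.2f}s, TPS={tps:.0f}, tokens={n}")
```

Note how TTFT is dominated by the prefill delay, while TPS reflects the steady-state decode rate.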

The Memory Bottleneck

LLMs are memory-bound, not just compute-bound.

  • Model Weights: A 7B parameter model in FP16 takes ~14GB VRAM.
  • KV Cache: Stores attention keys/values for past tokens to avoid re-computation. This grows linearly with sequence length and batch size.
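To see why the KV cache dominates, here is a back-of-the-envelope calculation, assuming a Llama-2-7B-style architecture (32 layers, 32 KV heads, head dim 128, FP16); models that use grouped-query attention need proportionally less.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Factor of 2: we store both Keys AND Values for every layer/head
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-2-7B-style config in FP16
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(per_token)                  # 524288 bytes = 0.5 MiB per token
print(per_token * 2048 / 2**30)  # ~1 GiB for a single 2048-token sequence
```

At ~1 GiB per full-length sequence, a 7B model on a 24GB GPU has room for only a handful of concurrent requests after the ~14GB of weights, which is exactly why KV-cache memory management matters so much.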

[!WARNING] Naive serving with Python (e.g., Hugging Face generate() inside FastAPI) is inefficient because it processes requests sequentially or uses static batching, leaving the GPU underutilized while waiting for memory transfers or shorter sequences to finish.

2. Serving Architectures

1. Naive Python Server (FastAPI + PyTorch)

Simple to build, but difficult to scale beyond a handful of concurrent users.

  • Pros: Easy to debug, full control.
  • Cons: No continuous batching, poor memory management, Global Interpreter Lock (GIL) issues.

2. Specialized Serving Engines (vLLM, TGI, Triton)

Built in C++/Rust/CUDA to maximize GPU utilization.

  • vLLM: Pioneered PagedAttention and offers state-of-the-art throughput.
  • Text Generation Inference (TGI): Hugging Face’s production server (Rust).
  • TensorRT-LLM: NVIDIA’s highly optimized library, tuned for recent NVIDIA GPUs such as the H100.

3. Core Innovation: Continuous Batching

In traditional Static Batching, the batch composition is fixed until every request in it completes. If one request generates 10 tokens and another generates 100, the finished request’s slot sits idle until the 100-token request is done; only then can the next batch start. These idle slots are called “bubbles”.

Continuous Batching (aka iteration-level scheduling) re-forms the batch at every decoding step. As soon as one sequence finishes, a new request from the queue takes its slot on the very next iteration.
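The effect is easy to reproduce with a toy step-count simulation. This is a sketch only: real schedulers also account for prefill cost and KV-cache memory limits.

```python
def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request
    finishes, so shorter requests leave idle slots ("bubbles")."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Iteration-level scheduling: a freed slot is refilled from
    the queue on the very next decode step."""
    queue = list(lengths)
    active, steps = [], 0
    while queue and len(active) < batch_size:
        active.append(queue.pop(0))
    while active:
        steps += 1
        active = [n - 1 for n in active]          # one decode step each
        active = [n for n in active if n > 0]     # drop finished sequences
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))           # refill immediately
    return steps

lengths = [10, 100, 10, 100]  # tokens each request will generate
static_steps = static_batching_steps(lengths, batch_size=2)      # 200
cont_steps = continuous_batching_steps(lengths, batch_size=2)    # 120
print(static_steps, cont_steps)
```

With this mixed workload, continuous batching finishes the same four requests in 120 decode steps instead of 200, because the short requests never hold a slot hostage.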

Interactive: Continuous Batching Simulator

[Interactive simulator: compares static and continuous batching side by side, showing the GPU idle time (“bubbles”) under static batching and the resulting throughput gain from continuous batching.]

4. Core Innovation: PagedAttention

Standard attention implementations require contiguous memory for the KV cache. Because we don’t know in advance how long a generated sequence will be, we have to pre-allocate a large contiguous chunk of VRAM (e.g., for 2048 tokens). If the user only generates 50 tokens, the rest of that reservation is wasted (internal fragmentation), and the variable-sized reservations also leave unusable gaps between them (external fragmentation).

PagedAttention (introduced by vLLM) solves this by splitting the KV cache into blocks (pages), just like an Operating System manages virtual memory.

  • Blocks: KV cache is divided into fixed-size blocks (e.g., 16 tokens).
  • Non-Contiguous: Blocks can be stored anywhere in VRAM.
  • Block Table: Maps logical tokens to physical blocks.

This allows vLLM to achieve near-zero memory waste, enabling much larger batch sizes.
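The bookkeeping behind this can be sketched in a few lines. Names like `BlockAllocator` are illustrative only, not vLLM’s internal API, and this toy version tracks block ownership without touching real GPU memory.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    """Toy PagedAttention-style bookkeeping: physical blocks come from a
    shared free pool; each sequence's block table maps its logical blocks
    to physical blocks that need not be contiguous."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}  # seq_id -> [physical block ids]

    def grow(self, seq_id, seq_len):
        """Ensure seq_id has enough blocks to hold seq_len tokens."""
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) < math.ceil(seq_len / BLOCK_SIZE):
            table.append(self.free.pop())  # any free block will do
        return table

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))

alloc = BlockAllocator(num_physical_blocks=64)
table = alloc.grow("req-1", seq_len=50)      # request generated 50 tokens
paged_waste = len(table) * BLOCK_SIZE - 50   # 4 blocks -> 14 slots wasted
contiguous_waste = 2048 - 50                 # vs. 1998 with pre-allocation
print(len(table), paged_waste, contiguous_waste)
```

The 50-token request wastes at most one partially filled block (14 token slots here), instead of the 1998 slots a contiguous 2048-token reservation would strand.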

5. Serving with vLLM

vLLM is the current industry standard for high-performance serving. It supports most popular models (Llama 3, Mistral, Qwen).

Python Implementation

Here is how to run an OpenAI-compatible server using vLLM’s Python API.

# Install: pip install vllm
from vllm import LLM, SamplingParams

# 1. Initialize the engine
# trust_remote_code=True is often needed for newer architectures
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,  # Number of GPUs
    gpu_memory_utilization=0.90, # Use up to 90% of VRAM (weights + KV cache)
    max_model_len=4096
)

# 2. Define sampling parameters
# For max throughput, greedy decoding (temperature=0) or a low temperature is common
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100
)

# 3. Prepare prompts
prompts = [
    "Explain quantum computing to a 5-year old.",
    "Write a SQL query to find top 5 users by spend.",
]

# 4. Generate (This uses Continuous Batching automatically!)
outputs = llm.generate(prompts, sampling_params)

# 5. Process outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Deployment Architecture

For a robust production system, you shouldn’t just run a Python script.

[Architecture diagram: User → Load Balancer → vLLM Replica 1 and vLLM Replica 2 (each on an A100 80GB GPU), with models pulled from a Model Registry (S3 / Hugging Face).]
  1. Load Balancer (Nginx/K8s Ingress): Distributes requests.
  2. vLLM Container: Running the OpenAI-compatible API server (python -m vllm.entrypoints.openai.api_server).
  3. Model Loading: Models are pulled from S3 or Hugging Face Hub at startup.
  4. Autoscaling: Scale replicas based on queue depth or GPU utilization.
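A toy version of that scaling decision might look like the following. The thresholds and signal names are illustrative assumptions, not a Kubernetes or vLLM API; in practice you would wire real queue-depth and GPU-utilization metrics into your autoscaler.

```python
import math

def desired_replicas(queue_depth, gpu_util, current,
                     min_replicas=1, max_replicas=8,
                     queue_per_replica=32, util_target=0.85):
    """Pick whichever signal asks for more capacity, clamped to bounds.
    queue_per_replica and util_target are illustrative tuning knobs."""
    by_queue = math.ceil(queue_depth / queue_per_replica)
    by_util = current + 1 if gpu_util > util_target else current
    return max(min_replicas, min(max_replicas, max(by_queue, by_util)))

print(desired_replicas(queue_depth=100, gpu_util=0.50, current=2))  # 4
print(desired_replicas(queue_depth=0,   gpu_util=0.92, current=2))  # 3
```

Scaling on queue depth reacts to bursts before GPU utilization saturates, which matters because an overloaded vLLM replica degrades TTFT long before it runs out of compute.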

6. Summary

  • Don’t use vanilla Python: It’s too slow for production traffic.
  • Use vLLM: It handles continuous batching and memory management (PagedAttention) automatically.
  • Optimize VRAM: VRAM is your most expensive resource. Wasting it on fragmentation limits your batch size and kills throughput.

In the next chapter, we’ll dive into Quantization and Flash Attention to fit larger models into smaller GPUs.