LLM Inference Optimization: Popular Terms (1)

1. Introduction

Recently, I’ve been reviewing the different techniques I’ve come across in LLM inference, and I decided to organize them into a blog post.

2. Key Terms and Optimization Methods

LLM inference can be optimized at several levels — from smarter scheduling algorithms, to tweaks in model structure, all the way down to GPU kernels. Below is a glossary of popular techniques.

General Terms

Prefill phase – Processing the input prompt and generating the very first token.
Decode phase – Generating subsequent tokens one by one.
TTFT (Time to First Token) – How long it takes to produce the first output token.
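TTIT (Time to Intermediate Token) – The time between consecutive output tokens during the decode phase, usually reported as an average. The same quantity is often called TPOT (time per output token) or inter-token latency.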
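
To make the two latency metrics concrete, here is a minimal sketch that derives them from timestamps. The function name latency_metrics and the sample timestamps are made up for illustration; real serving stacks report these numbers through their own metrics.

```python
def latency_metrics(request_time, token_times):
    """Return (ttft, mean_ttit) in seconds.

    request_time: when the request arrived.
    token_times:  when each output token was emitted, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT covers queueing plus the prefill phase.
    ttft = token_times[0] - request_time
    # TTIT is the gap between consecutive tokens in the decode phase.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_ttit = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_ttit


# Hypothetical request: arrives at t=0.0, first token at 0.5 s, then one token every 0.25 s.
ttft, ttit = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(f"TTFT={ttft:.2f}s  TTIT={ttit:.2f}s")
```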

2.1 Algorithm-Level Techniques

These methods don’t change the model’s fundamental architecture; they optimize how the model is used at inference time.

2.1.1 Continuous Batching (In-flight Batching)

What it is: A scheduling strategy where incoming requests are batched on the fly so GPUs stay busy. Unlike traditional static batching, where a batch is processed as a whole and all requests must wait for the slowest one to finish, continuous batching allows the system to immediately start processing the next request as soon as any request in the current batch completes.

Benefit: Increases GPU utilization and throughput.
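
The difference from static batching can be seen in a toy simulation. This sketch assumes every request needs a known number of decode steps, each step produces one token per running request, and prefill cost is ignored; the function names are hypothetical, not taken from any serving framework.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its slowest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled from the queue before the next step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps of each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token.
        running = [left - 1 for left in running]
        running = [left for left in running if left > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for six requests, two slots per batch.
lengths = [2, 8, 3, 1, 4, 2]
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

In this toy setup the short requests no longer wait behind the 8-token one, so the same work finishes in fewer decode iterations.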

Reading Materials:

Blog from Baseten: link
Blog from Anyscale: link

2.1.2 Prefill-Decode Disaggregation

What it is: Instead of handling the prefill phase and the decode phase in the same service, PD-Disagg runs them on separate resources. One set of GPUs handles the prefill phase, while another set handles the decode phase, with the intermediate results (the KV cache) transferred quickly between them.

Benefit: Improves overall throughput, especially when a deployment has to meet both TTFT and TTIT constraints, because the two phases no longer compete for the same GPUs.
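
The hand-off between the two pools can be sketched with a toy example. Everything here is illustrative: toy_next_token stands in for a model forward pass, a plain list stands in for the KV cache, and a queue.Queue stands in for the fast transfer link between prefill GPUs and decode GPUs.

```python
import queue


def toy_next_token(kv_cache):
    # Stand-in for a forward pass; a real model attends over cached keys and values.
    return sum(kv_cache) % 100


def prefill_worker(prompt_ids, transfer):
    """Runs on the prefill pool: process the whole prompt and emit the first token."""
    kv_cache = list(prompt_ids)  # "compute" one cache entry per prompt token
    first_token = toy_next_token(kv_cache)
    transfer.put((first_token, kv_cache))  # ship the intermediate results


def decode_worker(transfer, max_new_tokens):
    """Runs on the decode pool: continue generation from the received cache."""
    first_token, kv_cache = transfer.get()
    output = [first_token]
    kv_cache.append(first_token)
    while len(output) < max_new_tokens:
        token = toy_next_token(kv_cache)
        output.append(token)
        kv_cache.append(token)
    return output


transfer = queue.Queue()
prefill_worker([3, 5, 7], transfer)
tokens = decode_worker(transfer, max_new_tokens=4)
print(tokens)
```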

Reading Materials:

DistServe: https://arxiv.org/abs/2401.09670

2.1.3 Parallel Decoding

What it is: Parallel Decoding breaks the usual one-token-at-a-time generation bottleneck by predicting multiple tokens per step. It uses a “guess and verify” approach: a faster drafter proposes several next tokens, and the original LLM then verifies them in parallel in a single forward pass, accepting the tokens that match and rejecting the rest.

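Benefit: Reduces decode latency when the drafter’s guesses are accepted often, since several tokens can be emitted per forward pass of the original model.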
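
Here is a minimal sketch of one “guess and verify” step. It uses the simplest greedy exact-match rule, whereas the speculative decoding papers use a probabilistic accept/reject rule. The two toy models and the function names are assumptions for illustration, and the verification loop stands in for what would be one batched forward pass of the original model.

```python
def speculative_step(context, draft_next, target_next, k):
    """Return the tokens accepted in one draft-then-verify round."""
    # 1) The drafter guesses k tokens one by one (cheap).
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2) The target model checks every drafted position.
    accepted = []
    for i in range(k):
        expected = target_next(context + draft[:i])
        if draft[i] != expected:
            # First mismatch: keep the target's token and drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(draft[i])

    # All k guesses matched, so the target contributes one extra token.
    accepted.append(target_next(context + draft))
    return accepted


def target_next(ctx):
    # Toy "large" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    # Toy drafter: agrees with the target except right after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft_next, target_next, k=4))
```

With this rule the output is identical to what the target model would produce on its own; the drafter only changes how many tokens each verification round yields.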

Reading Materials:

Speculative Decoding: https://arxiv.org/pdf/2211.17192
https://arxiv.org/pdf/2302.07863
EAGLE-3: https://arxiv.org/pdf/2503.01840

2.1.4 Chunked Prefill

What it is: Splits a long input prompt into smaller chunks that are processed over several iterations instead of in one large prefill pass. Each chunk can be batched together with decode-phase tokens from other requests, so a long prefill no longer stalls ongoing decodes and the GPU stays better utilized.

Benefit: Better throughput and more stable TTIT.
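
A rough sketch of how a scheduler might plan the chunks: each iteration has a fixed token budget, decode tokens are placed first, and the leftover budget goes to the next slice of the pending prompt. The function name, the budget value, and the assumption that the number of decoding requests stays constant are all illustrative.

```python
def schedule_chunked_prefill(prompt_len, num_decoding, token_budget):
    """Return a list of (decode_tokens, prefill_chunk_tokens), one entry per iteration."""
    if token_budget <= num_decoding:
        raise ValueError("token_budget must leave room for prefill tokens")
    plan = []
    remaining = prompt_len
    while remaining > 0:
        # Each running decode request contributes one token; the rest is prefill.
        chunk = min(remaining, token_budget - num_decoding)
        plan.append((num_decoding, chunk))
        remaining -= chunk
    return plan


# Hypothetical numbers: a 1000-token prompt arrives while 8 requests are decoding.
plan = schedule_chunked_prefill(prompt_len=1000, num_decoding=8, token_budget=264)
for decode_tokens, prefill_tokens in plan:
    print(decode_tokens, prefill_tokens)
```

Because every iteration processes a similar number of tokens, the running decodes see a steadier step time than they would if the full 1000-token prefill ran in a single pass.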

Reading Materials:

Blog from Nvidia: link
SARATHI: https://arxiv.org/pdf/2308.16369

2.1.5 Paged Attention

What it is: Partitions the model’s attention KV cache into smaller fixed-size blocks or “pages” instead of one contiguous large block.

Benefit: Reduces memory waste from fragmentation and over-reservation, which frees memory for longer prompts and larger batch sizes.
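
The bookkeeping behind the “pages” can be sketched as a block table that maps each request’s logical token positions to physical blocks. This is a simplified illustration with hypothetical names (PagedKVAllocator, append_token, locate); it only tracks block ids and stores no actual key/value tensors.

```python
class PagedKVAllocator:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve space for one more token, allocating a new block only when needed."""
        table = self.block_tables.setdefault(req_id, [])
        count = self.num_tokens.get(req_id, 0)
        if count == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[req_id] = count + 1

    def locate(self, req_id, token_idx):
        """Map a logical token index to (physical block id, offset inside the block)."""
        table = self.block_tables[req_id]
        return table[token_idx // self.block_size], token_idx % self.block_size

    def free(self, req_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("a")
print(alloc.block_tables["a"], alloc.locate("a", 17))
```

A 20-token request occupies two 16-token blocks here, so at most one partially filled block is wasted per request, and the blocks do not need to be adjacent in memory.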

Reading Materials:

PagedAttention: https://arxiv.org/abs/2309.06180
vLLM Paged Attention: link

2.1.6 Prefix Cache

What it is: A caching strategy that stores the KV cache computed for a prompt prefix and reuses it when later requests start with the same prefix, such as a shared system prompt.

Benefit: Reduces latency, mainly TTFT, because the cached part of the prompt skips prefill.
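
The lookup side of the idea can be sketched as follows. This is a simplified illustration, not how any particular engine implements it: the PrefixCache class is hypothetical, it matches at block granularity, and it stores only the prefix keys where a real system would keep the corresponding KV blocks (typically keyed by a hash).

```python
class PrefixCache:
    def __init__(self, block_size):
        self.block_size = block_size
        self.cached = set()  # each key is the full token prefix up to a block boundary

    def _block_keys(self, tokens):
        full = len(tokens) - len(tokens) % self.block_size  # ignore the partial last block
        return [tuple(tokens[:end])
                for end in range(self.block_size, full + 1, self.block_size)]

    def match(self, tokens):
        """Number of leading tokens whose KV entries are already cached."""
        hit = 0
        for key in self._block_keys(tokens):
            if key not in self.cached:
                break
            hit = len(key)
        return hit

    def insert(self, tokens):
        """Record the prefixes of a request that has just been prefilled."""
        self.cached.update(self._block_keys(tokens))


cache = PrefixCache(block_size=4)
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical shared token ids

first = system_prompt + [9, 10]
cold_hit = cache.match(first)   # nothing cached yet
cache.insert(first)

second = system_prompt + [11, 12, 13, 14]
warm_hit = cache.match(second)  # the shared system prompt is reused
print(cold_hit, warm_hit, len(second) - warm_hit)
```

The last printed number is how many tokens of the second request still need prefill.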

Reading Materials:

Blog from BentoML: link
SGLang: https://arxiv.org/pdf/2312.07104
Marconi: https://arxiv.org/pdf/2411.19379

2.1.7 Quantization

What it is: Reducing the precision used for model parameters (and, in some cases, activations or the KV cache)—for example, switching from 16-bit floating-point to 8-bit integer representations—to save resources while maintaining reasonable accuracy.

Benefit: Reduces memory requirements and improves both latency and throughput.

Trade-offs: May slightly reduce model accuracy.
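
As a small worked example of the precision trade-off, here is a sketch of symmetric per-tensor int8 quantization in plain Python. It is one of the simplest schemes and is shown only for illustration; production methods usually work per channel or per group and need calibration. The function names and sample weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] plus a single float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical 16-bit weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each value now fits in 8 bits instead of 16, and the rounding error per weight is bounded by half the scale. The small weight 0.003 rounds to zero, which is the kind of loss behind the accuracy trade-off.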

Reading Materials:

2.2 Model Structure Changes

I haven’t had time to dig into these yet, so for now I’ll just list some terms here. Once I’m more familiar with them, I plan to write a follow-up article.

Grouped Query Attention (GQA)
Mixture of Experts (MoE)
Distillation

2.3 Kernel-Level Optimizations

Same as above — just dropping the terms for now. I’ll revisit these in more detail later.

Flash Attention
Fused Kernels

3. Final Thoughts

Writing down and organizing these terms really helped me revisit concepts I often hear but hadn’t fully sorted out. Hopefully, this post also adds something useful to your own knowledge network :)

[Lifan]

2025/07/26
