LLM Inference Optimization: Popular Terms (1)

[Lifan]

2025/07/26

1. Introduction

Recently, I’ve been reviewing the different techniques I’ve come across in LLM inference, and I decided to organize them into a blog post.

2. Key Terms and Optimization Methods

LLM inference can be optimized at several levels — from smarter scheduling algorithms, to tweaks in model structure, all the way down to GPU kernels. Below is a glossary of popular techniques.

General Terms

2.1 Algorithm-Level Techniques

These methods don’t change the model’s fundamental architecture; they optimize how the model is used at inference time.

2.1.1 Continuous Batching (In-flight Batching)

What it is: A scheduling strategy where incoming requests are batched on the fly so GPUs stay busy. Unlike traditional static batching, where a batch is processed as a whole and all requests must wait for the slowest one to finish, continuous batching allows the system to immediately start processing the next request as soon as any request in the current batch completes.

Benefit: Increases GPU utilization and throughput.
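To make the difference concrete, here is a minimal sketch that only counts decode steps. It assumes each request is described by the number of tokens it still has to generate and that one step produces one token per active request. The function names are made up for illustration and do not come from any serving framework.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request generates one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps
```

With lengths `[8, 2, 2, 2]` and two slots, the static version needs 10 steps (the second short request idles behind the long one), while the continuous version needs 8.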

Reading Materials:

2.1.2 Prefill-Decode Disaggregation

What it is: Instead of handling the prefill phase and the decode phase in the same service, PD-Disagg runs them separately. One set of GPUs handles the prefill phase, while another set handles the decode phase, with a fast transfer of the intermediate results (the KV cache) between them.

Benefit: Improves overall throughput, especially for workloads with TTFT (time to first token) and TTIT constraints.
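The split can be sketched as two functions with an explicit handoff object between them. Everything here is a toy: `toy_next_token` is a stand-in for a real forward pass, the "KV cache" is just a list of token ids, and the names `Handoff`, `prefill_worker`, and `decode_worker` are hypothetical. The sketch assumes at least one new token is requested.

```python
from dataclasses import dataclass

VOCAB = 50  # toy vocabulary size (assumption)


def toy_next_token(kv_cache):
    # Stand-in for a forward pass: depends only on the cached context.
    return sum(kv_cache) % VOCAB


@dataclass
class Handoff:
    kv_cache: list    # state produced by the prefill phase
    first_token: int  # first generated token


def prefill_worker(prompt_tokens):
    # Processes the whole prompt in one pass and emits the first token.
    kv_cache = list(prompt_tokens)
    return Handoff(kv_cache, toy_next_token(kv_cache))


def decode_worker(handoff, max_new_tokens):
    # The copy models the KV cache transfer between the two GPU pools.
    kv_cache = list(handoff.kv_cache)
    output = [handoff.first_token]
    while len(output) < max_new_tokens:
        kv_cache.append(output[-1])
        output.append(toy_next_token(kv_cache))
    return output


def colocated_generate(prompt_tokens, max_new_tokens):
    # Reference: prefill and decode handled in a single place.
    kv_cache = list(prompt_tokens)
    output = []
    for _ in range(max_new_tokens):
        token = toy_next_token(kv_cache)
        output.append(token)
        kv_cache.append(token)
    return output
```

The point of the sketch is that the decode side only needs the handoff, not the prompt, so the two phases can be scaled and tuned independently while producing the same tokens as the colocated path.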

Reading Materials:

2.1.3 Parallel Decoding

What it is: Parallel decoding breaks the usual one-token-at-a-time generation bottleneck by predicting multiple tokens in parallel. It uses a “guess and verify” approach: a faster drafter generates several next tokens, which are then verified by the original LLM in parallel. Matching tokens are accepted up to the first mismatch, and the rest are rejected.

Benefit: Reduces latency.
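A minimal sketch of one guess-and-verify step, assuming greedy decoding and two toy "models" that are plain functions from a context to the next token. The names `speculative_step`, `toy_target`, and `toy_draft` are illustrative only. In a real system the verification would be a single batched forward pass of the target model; here it is written as a loop for readability.

```python
def speculative_step(context, draft_next, target_next, k):
    """Draft k tokens, then keep the prefix the target model agrees with."""
    # Draft phase: the cheap model guesses k tokens one after another.
    draft = []
    ctx = list(context)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # Verify phase: compare each guess with the target model's own choice.
    accepted = []
    ctx = list(context)
    for token in draft:
        expected = target_next(ctx)
        if token != expected:
            # First mismatch: keep the target's token and drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)
    # Every guess matched, so the target's next token comes for free.
    accepted.append(target_next(ctx))
    return accepted


def toy_target(ctx):
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    # Agrees with the target except right after a 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

Starting from `[0]` with `k=4`, the drafter guesses `[1, 2, 3, 0]`, the first three guesses are accepted, and the wrong fourth guess is replaced by the target's token, so the step returns `[1, 2, 3, 4]`. The output is always what the target alone would have produced; the drafter only changes how many tokens come out per step.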

Reading Materials:

2.1.4 Chunked Prefill

What it is: Splits the prefill of a long input prompt into smaller chunks that are processed over several steps. This prevents a long prefill from becoming a bottleneck, lets prefill chunks be batched together with decode-phase tokens, and increases GPU utilization.

Benefit: Better throughput and more stable TTIT.
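One way to picture this is a per-step token budget that decode tokens and prefill chunks share. The sketch below assumes decode tokens are scheduled first and the leftover budget goes to the next prefill chunk; `schedule_steps` and its parameters are hypothetical names, and real schedulers have more policies than this.

```python
def schedule_steps(prompt_len, num_decoding, token_budget):
    """Return (prefill_tokens, decode_tokens) for each step until the
    prompt has been fully prefilled."""
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Running decode requests each contribute one token per step.
        decode_tokens = min(num_decoding, token_budget)
        # Whatever budget is left goes to the next chunk of the prompt.
        chunk = min(remaining, token_budget - decode_tokens)
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill")
        steps.append((chunk, decode_tokens))
        remaining -= chunk
    return steps
```

For a 10-token prompt, 2 running decode requests, and a budget of 6 tokens per step, this gives `[(4, 2), (4, 2), (2, 2)]`: the decode requests keep producing a token every step instead of stalling behind one large prefill.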

Reading Materials:

2.1.5 Paged Attention

What it is: Partitions the model’s attention KV cache into smaller fixed-size blocks or “pages” instead of one contiguous large block.

Benefit: Reduces memory waste (fragmentation), freeing up memory for longer prompts and larger batch sizes.
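The bookkeeping behind this can be sketched as a block table that maps each sequence to a list of physical blocks, allocated only when needed. This is a simplified illustration: the class `PagedKVCache` and its methods are hypothetical, and it tracks only where each token's KV entry would live, not the tensors themselves.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block_id, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            # The last block is full (or there is none yet): grab a new one.
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are handed out one at a time, a sequence wastes at most one partially filled block, rather than a contiguous region reserved for its maximum possible length.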

Reading Materials:

2.1.6 Prefix Cache

What it is: A caching strategy that reuses computation for repeated prompt prefixes across multiple requests.

Benefit: Reduces latency, mainly TTFT, because the prefill for a cached prefix can be skipped.
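A minimal sketch of the lookup, assuming prefixes are cached at block granularity and keyed by the tokens themselves. The class `PrefixCache` is a hypothetical name, and it stores only keys; a real cache would map each key to the corresponding KV blocks and would need an eviction policy.

```python
class PrefixCache:
    def __init__(self, block_size):
        self.block_size = block_size
        # Each key is a token prefix that ends on a block boundary.
        self.cached = set()

    def insert(self, tokens):
        """Record every full-block prefix of a processed prompt."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.cached.add(tuple(tokens[:end]))

    def match(self, tokens):
        """Return how many leading tokens can skip prefill."""
        hit = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) not in self.cached:
                break
            hit = end
        return hit
```

With a block size of 2, inserting `[1, 2, 3, 4, 5]` caches the prefixes of length 2 and 4. A later request `[1, 2, 3, 4, 9, 9]` then matches 4 tokens, so only the last two need to be prefilled.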

Reading Materials:

2.1.7 Quantization

What it is: Reducing the precision used for model parameters (and, in some cases, activations or the KV cache)—for example, switching from 16-bit floating-point to 8-bit integer representations—to save resources while maintaining reasonable accuracy.

Benefit: Reduces memory requirements and improves both latency and throughput.

Trade-offs: May slightly reduce model accuracy.
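As a small worked example of the idea, here is a sketch of symmetric per-tensor int8 quantization in plain Python. This is only one of many schemes, chosen for simplicity; the function names are illustrative, and real implementations operate on tensors and often use per-channel or per-group scales.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to 127; guard against all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]
```

Each value is stored in 8 bits instead of 16, and the reconstruction error per value is at most about half the scale. That rounding error is where the accuracy trade-off comes from.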

Reading Materials:

2.2 Model Structure Changes

I don’t have the time to dive deep into these yet, so for now I’ll just leave some terms here. If I get more familiar with them, I plan to write a follow-up article.

2.3 Kernel-Level Optimizations

Same as above — just dropping the terms for now. I’ll revisit these in more detail later.

3. Final Thoughts

Writing down and organizing these terms really helped me revisit concepts I often hear but hadn’t fully sorted out. Hopefully, this post also adds something useful to your own knowledge network :)