Why is the Context Window limited in LLMs?
The Real Reasons Behind Context Limits
Let's understand this in simple words.
The context window is the maximum amount of text an LLM can read at once. For example, GPT-4 Turbo has a context window of 128K tokens, which is roughly the size of a 300-page book.
So, why can't we just make it infinite?
The first reason is the cost of attention.
In a Transformer, every token must look at every other token. This is called self-attention.
Let's say we have 1000 tokens. That means 1000 × 1000 = 1,000,000 comparisons.
Now, if we have 10,000 tokens, that means 10,000 × 10,000 = 100,000,000 comparisons.
When we make the context 10x longer, the compute grows 100x, not 10x. This is called quadratic cost.
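We can check this arithmetic with a few lines of Python. This is only a minimal sketch that counts comparisons as tokens × tokens, the same way as above. The function name `attention_comparisons` is made up for this example and does not come from any library.

```python
def attention_comparisons(num_tokens: int) -> int:
    """Count token-to-token comparisons when every token looks at every token."""
    return num_tokens * num_tokens


# Print the comparison count for a few context lengths.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_comparisons(n):>15,} comparisons")

# Going from 1,000 to 10,000 tokens is 10x more tokens but 100x more comparisons.
growth = attention_comparisons(10_000) // attention_comparisons(1_000)
print(f"10x more tokens -> {growth}x more comparisons")
```

Real models do more work than this count shows, but the growth pattern is the same.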
The second reason is memory.
For every token, the model keeps some data in GPU memory. This is called the KV Cache. The longer the context, the bigger the cache.
A very long context can need hundreds of GBs of GPU memory. A single GPU does not have that much memory. Even a high-end GPU usually has only tens of GBs.
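To get a feel for the numbers, here is a rough estimate of the KV Cache size. The model shape below (32 layers, 32 key/value heads, head size 128, 2 bytes per value) is a hypothetical example chosen for illustration, not the configuration of any specific model. The function name `kv_cache_bytes` is also made up for this sketch.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 32,      # hypothetical model depth
    num_kv_heads: int = 32,    # hypothetical number of key/value heads
    head_dim: int = 128,       # hypothetical size of each head
    bytes_per_value: int = 2,  # 16-bit numbers take 2 bytes each
) -> int:
    """Estimate KV Cache size in bytes for one sequence."""
    # The leading 2 is for one key vector and one value vector per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


GIB = 1024 ** 3
for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {kv_cache_bytes(n) / GIB:.1f} GiB")
```

With these assumed numbers, each token needs 0.5 MiB, so 131,072 tokens need 64 GiB for a single sequence. Serving many sequences at the same time multiplies this, and the model weights also need memory.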
The third reason is training data.
LLMs are trained on text of certain lengths. If we ask the model to handle a context much longer than what it saw during training, the quality drops. The model never learned how to use text that far back, so it gets confused.
So, the context window is limited because of three reasons:
1. Quadratic compute cost of attention
2. KV Cache memory on GPUs
3. Training data length limits
This is why researchers are working on long-context techniques like sparse attention, sliding windows, and KV Cache compression.
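To see why a sliding window helps, here is a small sketch. It assumes a simple form of sliding window where each token looks only at itself and the few tokens just before it. The function names and the window size are made up for this example. Real implementations differ in their details.

```python
def full_attention_comparisons(num_tokens: int) -> int:
    """Every token looks at every token."""
    return num_tokens * num_tokens


def sliding_window_comparisons(num_tokens: int, window: int) -> int:
    """Each token looks at itself and at most (window - 1) tokens before it."""
    # Token i (0-based) has i + 1 tokens available, capped at the window size.
    return sum(min(i + 1, window) for i in range(num_tokens))


n = 10_000
w = 1_000  # hypothetical window size
print(f"full attention : {full_attention_comparisons(n):,} comparisons")
print(f"sliding window : {sliding_window_comparisons(n, w):,} comparisons")
```

With a fixed window, the comparisons grow roughly as tokens × window instead of tokens × tokens. The trade-off is that a token can no longer look directly at text outside its window.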
Now, we have understood why context windows have a limit.
That’s it for now.
Thanks
Amit Shekhar
Founder, Outcome School


