#02 KV Cache 原理 · Agent Harness 工程师知识地图

02KV Cache 原理

L1P0五步章 · 技术章

Why · 为什么要学

你用 Sonnet 跑一个 coding agent,每次输入 30K token(system prompt + tool 定义 + 历史)。月跑 10 万次,账单 ~$9000。这些 token 99% 是重复的,推理服务器每次都从头算 K/V。理解 KV Cache 是后面所有"为什么 Prompt Caching 省 90%、为什么 batch size 受限、vLLM 怎么省内存"的物理基础——不懂这个,上层优化都只是抄公式。

1 ·核心要点

Transformer 推理时,每个 token 通过 attention 看历史所有 token。具体:每个 token 有自己的 Query (Q),attention 把 Q 和历史所有 token 的 Key (K) 做点积得分,加权它们的 Value (V)。

不缓存:每生成 1 个 token,前面 N 个 token 的 K/V 都重算。第 N+1 步是 O(N),N 步累加 O(N²)。

缓存 K/V:历史 K/V 算过一次就存住。每生成新 token,只算它自己的 K/V,attention 直接读 cache。每步 O(1) 额外算力,N 步累加 O(N)。

内存占用公式:

cache_size = layers × heads × seq_len × head_dim × 2(K+V) × dtype_bytes

例:Llama 70B(80 层、64 头、head_dim=128、fp16)
每 token cache ≈ 80 × 64 × 128 × 2 × 2 ≈ 2.6 MB
1 万 token context = 26 GB ← 接近 H100 一半显存

KV Cache 决定了三件事:

· 推理成本:历史 K/V 是大头,cache 命中即跳过

· Batch size 上限:cache 吃显存,显存装得下几个 sequence 就并发几个

· 长 context 退化:超长 context 的 cache 比模型权重还大,GPU 内存搬运成本主导推理速度

vLLM 的 PagedAttention 把 cache 按固定 page 管,解决内存碎片;Anthropic/OpenAI 的 Prompt Caching 在 inference server 把公共前缀 K/V 持久化跨请求复用。两者都是 KV Cache 概念的工程产物。

2 ·最小代码示例

# 概念演示(伪代码):不缓存 vs 缓存的成本结构
def generate_no_cache(prompt, max_new=100):
    generated = []
    for step in range(max_new):
        # 每步把 prompt + 已生成的 N token 全部跑一遍 attention
        all_tokens = prompt + generated  # 长度 N
        attn = attention_full(all_tokens)  # O(N²)
        next_tok = sample(attn)
        generated.append(next_tok)

def generate_with_cache(prompt, max_new=100):
    kv_cache = compute_kv(prompt)  # 算一次,O(prompt_len)
    generated = []
    for step in range(max_new):
        # 每步只算新 token 的 K/V,attention 读 cache
        new_kv = compute_kv_single(generated[-1] if generated else prompt[-1])
        kv_cache.extend(new_kv)
        attn = attention_with_cache(new_kv.Q, kv_cache)  # O(N)
        generated.append(sample(attn))

实际验证:用 Anthropic API 发同一 prompt 两次,看 response 的 usage.cache_creation_input_tokens(首次写入)和 cache_read_input_tokens(后续命中)——这是 inference server 层的"持久化版 KV Cache",直接观测到效果。

3 ·工程权衡

何时设计要利用 KV Cache(通过 Prompt Caching)

system prompt 大且稳定(≥ 1024 token,Anthropic 最小 cache 单元)
5 分钟内同一 prompt 会被反复请求(TTL 限制)
Agent 场景:tool 定义 + system prompt 占输入 60%+,这部分永远不变