EXPERIMENT REPORT

Experiment 4: Ultra-Long Context and the Physical Limits of the KV Cache

2025-12-27 10 MIN READ

1. Experiment 4: Foundation

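Note:

This experiment pushes GPU memory to its limit. For a side-by-side comparison, an RTX 4090 (24GB) or an MTT S4000 (48GB) is recommended.

Happy Path: let vLLM's PagedAttention manage GPU memory.

Sad Path: use the native HuggingFace implementation and watch for OOM (out of memory).

Time setting: it is 2026 and vLLM has become the industry default, but understanding how it works underneath is still a basic skill for an architect.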

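1.1 Phase 1: Environment Setup and Theory Alignment

Goal: prepare an environment that supports FlashAttention, and understand why Native PyTorch blows up GPU memory.

Theory background (architect's view):
- The O(N^2) curse: the compute cost of Attention grows with the square of the sequence length, and so does the score matrix if it is materialized in full.
- KV Cache grows linearly: bytes = 2 * Layers * KV_heads * Head_dim * S * B * Bytes_per_element, where S is the sequence length and B the batch size. For Qwen2.5-7B (28 layers, 4 KV heads, head dimension 128, BF16), a 32K context costs about 1.75 GiB per sequence, so a handful of concurrent 32K requests adds up to many GB.
- The fragmentation trap: the native PyTorch implementation needs each KV Cache tensor to be contiguous in GPU memory. Think of parking: the native implementation demands one unbroken strip with room for 100 cars, and even if the lot has 200 scattered free spaces the cars cannot park (OOM).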
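
The two growth laws above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming Qwen2.5-7B's published configuration (28 layers, 28 query heads, 4 KV heads through GQA, head dimension 128) and BF16 storage; the helper names are illustrative and not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV cache size: one K and one V tensor per layer, hence the factor 2."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def eager_attention_scores_bytes(num_heads, seq_len, bytes_per_elem=2):
    """One layer's full N x N score matrix, if it is materialized (no FlashAttention)."""
    return num_heads * seq_len * seq_len * bytes_per_elem


GIB = 1024 ** 3

# Assumed Qwen2.5-7B configuration values
QWEN25_7B = dict(num_layers=28, num_kv_heads=4, head_dim=128)

kv_32k = kv_cache_bytes(seq_len=32 * 1024, **QWEN25_7B)
scores_10k = eager_attention_scores_bytes(num_heads=28, seq_len=10_000)

print(f"KV cache, 32K tokens, batch 1: {kv_32k / GIB:.2f} GiB")
print(f"Score matrix, 10K tokens, one layer: {scores_10k / GIB:.2f} GiB")
```

Under these assumptions the KV cache of one 32K sequence is 1.75 GiB, while a single fully materialized score matrix for a 10K prompt is already several GiB. That is why the attention kernel matters for prefill and the KV cache matters for concurrency.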

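Environment installation:

# Install vLLM (production-grade inference), ModelScope (used by the download script) and monitoring tools
pip install vllm modelscope nvitop requests

# For a Moore Threads (MUSA) environment
# pip install vllm-musa musa-smi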

1.2 Phase 2: Model Preparation

Goal: download Qwen2.5-7B-Instruct, a representative long-context model of 2025-2026.

Download script:
```
touch download_qwen.py
```

Edit the code:

from modelscope import snapshot_download
# Download to the data disk so the system disk does not fill up
model_dir = snapshot_download('Qwen/Qwen2.5-7B-Instruct', cache_dir='/root/autodl-tmp')
print(f"Model Path: {model_dir}")

Run:
```
python download_qwen.py
```

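1.3 Phase 3: Build the Experiment Code (The Experiment)

Goal: write a deliberately self-destructive script that contrasts the fragility of the Native implementation with the robustness of vLLM.

Create the main experiment script: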
```
touch exp4_kv_limits.py
```

Write the code:

import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Replace with the actual path printed by download_qwen.py
MODEL_PATH = "/root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct"

def run_sad_path_native():
    print("\n=== [Sad Path] Running Native HuggingFace Implementation ===")
    print("Warning: this run is very likely to hit OOM or slow down badly")

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True
    )

    # Build a long prompt to simulate a long document (on the order of 10K tokens;
    # the exact count is printed below)
    long_prompt = "摩尔线程 " * 5000
    # Move the inputs to the device that holds the model's first layers
    inputs = tokenizer(long_prompt, return_tensors="pt").to(model.device)
    print(f"Input Token Length: {inputs.input_ids.shape[1]}")

    try:
        # Force 2000 new tokens and watch GPU memory climb
        t0 = time.time()
        output = model.generate(
            **inputs,
            max_new_tokens=2000,
            use_cache=True  # enable the KV Cache
        )
        print(f"Success! Time: {time.time() - t0:.2f}s")
    except torch.cuda.OutOfMemoryError:
        print(">>> Caught exception: CUDA Out Of Memory! Allocation failed (memory exhausted or fragmented).")
    except Exception as e:
        print(f">>> Error: {e}")

if __name__ == "__main__":
    # Only the Sad Path runs here; the Happy Path (vLLM) is best tested by starting the server from the command line
    run_sad_path_native()

1.4 Phase 4: Run and Observe

Start monitoring (split screen):
- NVIDIA users: nvitop (recommended) or watch -n 0.5 nvidia-smi
- Moore Threads users: watch -n 0.5 musa-smi

Run the Sad Path:
```
python exp4_kv_limits.py
```

Run the Happy Path (vLLM command line):

Start the vLLM OpenAI-compatible API server (the one that serves /v1/completions) and let PagedAttention take over GPU memory.

python -m vllm.entrypoints.openai.api_server \
    --model /root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code

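In another terminal, send a request (roughly 4000 tokens of context + 2000 tokens of output):

# JSON has no string-repetition operator, so build the repeated prompt in the shell first
PROMPT=$(printf 'Moore Threads %.0s' $(seq 1 2000))
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d "{
        \"model\": \"/root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct\",
        \"prompt\": \"${PROMPT}\",
        \"max_tokens\": 2000,
        \"temperature\": 0
    }"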
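
A request body with a repeated prompt has to be expanded before it is serialized, because JSON cannot express repetition. As an alternative to shell quoting, here is a minimal Python sketch that builds the body; `build_completion_payload` is a hypothetical helper name, and the field names follow the request shown above.

```python
import json


def build_completion_payload(model, unit, repeats, max_tokens):
    """Serialize an OpenAI-style /v1/completions body with a repeated prompt."""
    return json.dumps({
        "model": model,
        "prompt": unit * repeats,  # repetition happens in Python, not in JSON
        "max_tokens": max_tokens,
        "temperature": 0,
    })


body = build_completion_payload(
    "/root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct", "Moore Threads ", 2000, 2000
)
print(len(body), "characters of valid JSON")
```

The resulting string can be written to a file and sent with curl's `-d @file` form, which avoids shell quoting entirely.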

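Example results:
- Native (Sad Path): GPU memory climbs in a staircase pattern, and the run typically fails with OOM after a few hundred generated tokens.
- vLLM (Happy Path): memory jumps to 90% at startup (pre-allocation), does not grow during generation, and generation speed (TPS) stays stable.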

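1.5 Phase 5: Guide to Analyzing the Results

Key differences:

Why Native fails: PyTorch's Caching Allocator. Every time the KV Cache grows, a new and larger memory block has to be requested. With long inputs, the old blocks that get freed are too fragmented to be reused. Total free memory may be sufficient, but no single contiguous block is large enough, and OOM is triggered.
Why vLLM succeeds: PagedAttention splits the KV Cache into Blocks (for example, 16 tokens each). Physical memory does not need to be contiguous; logical contiguity is enough (similar to an operating system's page table).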
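
The difference between the two allocation styles can be illustrated with a toy model. This is a simplified sketch of the idea only, not vLLM's actual block manager; the function names and the free-block pattern are assumptions chosen for illustration.

```python
BLOCK_SIZE = 16  # tokens per block, as in the example above


def largest_contiguous_run(free_blocks):
    """Length of the longest run of consecutive physical block ids."""
    best = run = 0
    prev = None
    for b in sorted(free_blocks):
        run = run + 1 if prev is not None and b == prev + 1 else 1
        best = max(best, run)
        prev = b
    return best


def allocate_contiguous(free_blocks, num_tokens):
    """Native-style allocation: succeeds only if one unbroken run is long enough."""
    needed = -(-num_tokens // BLOCK_SIZE)  # ceiling division
    return largest_contiguous_run(free_blocks) >= needed


def allocate_paged(free_blocks, num_tokens):
    """Paged allocation: return a block table (logical index -> physical id), or None."""
    needed = -(-num_tokens // BLOCK_SIZE)
    if len(free_blocks) < needed:
        return None
    return sorted(free_blocks)[:needed]


# Every other physical block is free: 50 free blocks, none adjacent
free = set(range(0, 100, 2))
tokens = 40 * BLOCK_SIZE  # a sequence that needs 40 blocks

print("contiguous fits:", allocate_contiguous(free, tokens))
table = allocate_paged(free, tokens)
print("paged fits:", table is not None)
```

With 50 scattered free blocks, the contiguous request for 40 blocks fails while the paged request succeeds, which is the parking-lot situation in miniature.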

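2. Experiment 4: Advanced

Veteran's take: what really brings down a production service is often not a textbook OOM but "Swap thrashing" and "Beam Search blow-up".

2.1 Advanced 1: Swap Thrashing and the PCIe Bandwidth Bottleneck

What was missing: the foundation experiment only covered the case where GPU memory is sufficient. When memory truly runs out, vLLM preempts running requests, and on versions that support swap-based preemption it moves their KV Cache out to CPU memory.
Action: artificially cap GPU memory to force the system to Swap.
Observations:
- On both NVIDIA and Moore Threads cards: once Swap kicks in, generation speed (TPS) drops from 50+ to 1-2 TPS.
- Hardware bottleneck: performance is now bounded by PCIe bandwidth. If the card sits in a PCIe 3.0 x8 slot (many cheap servers do this to pack in more cards), you will see severe stalls.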
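
A rough estimate shows why the slot matters. The sketch below assumes theoretical one-direction PCIe rates (about 0.985 GB/s per lane for 3.0 and 1.969 GB/s for 4.0), an assumed 80% achievable efficiency, and about 56 KiB of KV Cache per token (Qwen2.5-7B in BF16); none of these are measured values.

```python
# Approximate theoretical one-direction bandwidth in GB/s (per-lane rate x lanes)
PCIE_GBPS = {
    "PCIe 3.0 x8": 0.985 * 8,
    "PCIe 3.0 x16": 0.985 * 16,
    "PCIe 4.0 x16": 1.969 * 16,
}


def swap_seconds(num_bytes, link_gbps, efficiency=0.8):
    """Time to move num_bytes across the link; efficiency is an assumed fraction of peak."""
    return num_bytes / (link_gbps * 1e9 * efficiency)


# 5 requests x 8K tokens x assumed 57,344 bytes of KV Cache per token
kv_bytes = 5 * 8000 * 57_344

for name, gbps in PCIE_GBPS.items():
    ms = swap_seconds(kv_bytes, gbps) * 1000
    print(f"{name}: {ms:.0f} ms to move the full KV Cache once")
```

A single transfer looks affordable. Thrashing means paying it again and again while the compute units wait, and the x8 slot pays about four times more per round trip than a PCIe 4.0 x16 slot.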

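Steps

Create the file exp4_swap_thrashing.sh

# Cap GPU memory at 40% to push vLLM into Swap on long inputs.
# The cap must still cover the roughly 15GB of BF16 weights: 0.4 suits a 48GB card, raise it on a 24GB card.
python -m vllm.entrypoints.openai.api_server \
    --model /root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.4 \
    --swap-space 4 \
    --trust-remote-code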

Stress test script exp4_stress.py

import time
from concurrent.futures import ThreadPoolExecutor

import requests

def send_request(idx):
    # Build a prompt of about 8K tokens; 5 concurrent requests are meant to exhaust the capped KV Cache
    prompt = "Test " * 8000
    data = {
        "model": "/root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct",
        "prompt": prompt,
        "max_tokens": 500,
        "temperature": 0
    }
    t0 = time.time()
    print(f"Req {idx} sent...")
    res = requests.post("http://localhost:8000/v1/completions", json=data, timeout=600)
    res.raise_for_status()  # surface HTTP errors instead of timing a failed request
    t1 = time.time()
    print(f"Req {idx} finished. Time: {t1-t0:.2f}s")

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(send_request, i) for i in range(5)]
    for f in futures:
        f.result()  # re-raise any exception from the worker thread

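Run it and watch the vLLM server log:
- Look for swap or preemption messages such as: Swapping out X blocks... (the exact wording varies by vLLM version)
- Watch how TPS changes.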
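
To put a number on the TPS change, the token count can be read from the `usage` field of an OpenAI-style completion response and divided by wall-clock time. This is a minimal sketch; `tokens_per_second` is a hypothetical helper, and the sample response is made up for illustration rather than measured.

```python
def tokens_per_second(response_json, elapsed_seconds):
    """End-to-end generation rate from a completion response and wall-clock time."""
    completion_tokens = response_json["usage"]["completion_tokens"]
    return completion_tokens / elapsed_seconds


# Hypothetical response fragment, for illustration only (not measured data)
sample = {"usage": {"prompt_tokens": 8000, "completion_tokens": 500, "total_tokens": 8500}}
print(f"{tokens_per_second(sample, 10.0):.1f} tokens/s")
```

Because the elapsed time includes queueing and prefill, this is an end-to-end figure. It is most useful for comparing the same workload with and without the memory cap.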

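2.2 Advanced 2: The Memory Multiplier of Beam Search

Principle: Beam Search keeps num_beams candidate sequences alive at once. In the worst case, KV Cache consumption is multiplied by num_beams.
Action: set "n": 4, "best_of": 4, "use_beam_search": true in the request.
Veteran's experience: many developers use only Greedy Search in the demo stage, then turn on Beam Search in production for quality, and concurrency drops by a factor of 4 or the service goes straight to OOM.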
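
How close the real cost comes to the num_beams multiple depends on how much of the sequence the beams can share. The sketch below is a simplified block-count model, assuming 16-token blocks and beams that share only full prompt blocks; it is not vLLM's actual accounting.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block (assumed, matching the earlier example)


def blocks_naive(prompt_len, gen_len, num_beams):
    """Each beam keeps a private copy of prompt + generated tokens."""
    return num_beams * math.ceil((prompt_len + gen_len) / BLOCK_SIZE)


def blocks_shared_prefix(prompt_len, gen_len, num_beams):
    """Beams share full prompt blocks; everything after that is per beam (worst case)."""
    shared = prompt_len // BLOCK_SIZE   # full prompt blocks shared by all beams
    tail = prompt_len % BLOCK_SIZE      # partial last prompt block, copied per beam
    per_beam = math.ceil((tail + gen_len) / BLOCK_SIZE)
    return shared + num_beams * per_beam


for prompt_len in (4, 4000):
    for beams in (1, 4):
        print(prompt_len, beams,
              blocks_naive(prompt_len, 500, beams),
              blocks_shared_prefix(prompt_len, 500, beams))
```

For a short prompt with a long generation the two models nearly coincide, so 4 beams cost close to 4 times the blocks. For a long prompt, prefix sharing absorbs most of the increase.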

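Steps (modify the curl request directly to test). "use_beam_search" is a vLLM extension to the OpenAI schema, and its handling, along with "best_of", has changed across vLLM releases, so check the response for validation errors. Temperature is set to 0 because some releases reject beam search with sampling enabled.

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/root/autodl-tmp/Qwen/Qwen2.5-7B-Instruct",
        "prompt": "Explain Quantum Physics",
        "max_tokens": 500,
        "temperature": 0,
        "n": 4,
        "best_of": 4,
        "use_beam_search": true
    }'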

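3. Summary and Key Takeaways

3.1 [Core Conclusion]

VRAM size determines how long a text you can run (Length), memory bandwidth determines how fast you run (Speed), and PCIe bandwidth determines whether you survive a Swap (Survival).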
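
The "bandwidth determines speed" part can be made concrete with a simple upper bound: each decode step has to read all weights plus the sequence's KV Cache from GPU memory at least once. The sketch assumes about 15.2 GB of BF16 weights, 57,344 bytes of KV Cache per token, and the RTX 4090's spec-sheet bandwidth of 1008 GB/s. It ignores compute and kernel overheads, so real numbers will be lower.

```python
def decode_tps_upper_bound(weight_bytes, kv_bytes_in_context, mem_bandwidth_bytes_per_s):
    """Bandwidth-bound ceiling on single-sequence decode speed, in tokens per second."""
    return mem_bandwidth_bytes_per_s / (weight_bytes + kv_bytes_in_context)


WEIGHTS = 15.2e9        # assumed: ~7.6B parameters x 2 bytes (BF16)
KV_PER_TOKEN = 57_344   # assumed: 2 x 28 layers x 4 KV heads x 128 dims x 2 bytes
BANDWIDTH = 1008e9      # assumed: RTX 4090 spec-sheet memory bandwidth, bytes/s

for ctx in (0, 8_000, 32_000):
    tps = decode_tps_upper_bound(WEIGHTS, ctx * KV_PER_TOKEN, BANDWIDTH)
    print(f"context {ctx:>6}: at most {tps:.0f} tokens/s (single sequence)")
```

The ceiling sits in the same range as the 50+ TPS quoted earlier and falls as the context grows, since the KV Cache has to be read on every step.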

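3.2 [Technical Anatomy: PagedAttention and Domestic GPUs]

Virtualization idea: PagedAttention essentially brings the OS approach to virtual memory management onto the GPU. It solves physical memory fragmentation and raises KV Cache memory utilization from under 40% to over 90%.
Domestic-GPU strategy: on the Moore Threads MUSA architecture, we fully adapted vLLM through the MUSIFY toolchain. We have to admit, though, that in PCIe 4.0/5.0 signal integrity and in Host-to-Device copy efficiency, domestic platforms still trail NVIDIA. Our strategy is therefore "trade large memory for headroom": the MTT S4000 ships with 48GB of memory precisely to make Swap less likely. If memory is large enough, Swap is never needed, and the PCIe weakness is sidestepped.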

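3.3 [Key Concepts (Knowledge Points)]

KV Cache formula: 2 * Layers * KV_heads * Head_dim * S * B * Bytes_per_element (S = sequence length, B = batch size). Without quantization (FP16/BF16), this number gets alarmingly large at 32K context once several sequences are in flight.
Memory fragmentation (Fragmentation): the main culprit behind Native PyTorch hitting OOM early.
Swap Thrashing: when GPU memory is exhausted, data shuttles frantically between CPU and GPU and the compute units sit idle.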
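
The KV Cache formula also gives a rough capacity budget per card. This is a back-of-the-envelope sketch, assuming about 15.2 GB of BF16 weights, 57,344 bytes of KV Cache per token (Qwen2.5-7B) and 90% usable memory. It ignores activations and runtime overhead, so real capacity is lower.

```python
GIB = 1024 ** 3


def max_cached_tokens(vram_gib, utilization, weight_bytes, kv_bytes_per_token):
    """Tokens of KV Cache that fit after the weights are loaded (0 if weights do not fit)."""
    budget = vram_gib * GIB * utilization - weight_bytes
    return max(0, int(budget // kv_bytes_per_token))


for vram in (24, 48):
    tokens = max_cached_tokens(vram, 0.9, weight_bytes=15.2e9, kv_bytes_per_token=57_344)
    print(f"{vram} GiB card: ~{tokens:,} cacheable tokens, "
          f"~{tokens // 32_768} concurrent 32K sequences")
```

Doubling the card's memory roughly quadruples the KV Cache budget here, because the weights are a fixed cost that is paid only once.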

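Veteran's one-line summary: don't stare only at compute. In the long-context era, buy cards mainly by memory size. A 48G card running 32K context is simply steadier than a 24G card. That has nothing to do with whether it is an NVIDIA card; it is physics.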