EXPERIMENT REPORT

Experiment 1: VRAM Anatomy and Precision Boundary Testing

2025-11-21 10 MIN READ

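1. Experiment 1: Basic Experiments

Note:

All machines used in this experiment are rented from the AutoDL platform (www.autodl.com). Register an account and complete the required verification before you start.

Happy Path: the normal case

Sad Path: the failure case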

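1.1 Phase 1: Renting an Instance and Preparing the Environment

Goal: get a machine with 24GB of VRAM (ideally an RTX 4090) and set up a PyTorch environment.

Rent the instance:
- GPU selection: in the AutoDL marketplace, choose an RTX 4090 (24GB). To compare OOM behavior you can also pick an RTX 3090 (24GB) or a smaller card (such as an A4000 16GB, which hits OOM more easily).
- Image selection: a base image is strongly recommended: Miniconda -> PyTorch -> 2.1.x or 2.3.x -> Python 3.10 -> CUDA 11.8 or 12.x.
- Startup: once the instance is running, click "JupyterLab" to open the console.

Open a terminal:

Click the "Terminal" icon at the bottom of the JupyterLab page.

Install the dependencies:

You need transformers, accelerate (device placement during inference), bitsandbytes (Int4 quantization) and modelscope (fast model downloads from within China).
```bash
# Upgrade pip
pip install --upgrade pip

# Install core dependencies
pip install transformers accelerate bitsandbytes modelscope protobuf sentencepiece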
```
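
Before moving on, it can help to confirm that the packages actually landed in the active environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `report_versions` is illustrative and not part of any of these libraries.

```python
from importlib import metadata

# Packages from the pip command above, plus torch from the base image.
PACKAGES = ["torch", "transformers", "accelerate", "bitsandbytes",
            "modelscope", "protobuf", "sentencepiece"]

def report_versions(packages=PACKAGES):
    """Return {package: version or None} without importing the packages."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

for name, version in report_versions().items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Any line that prints NOT INSTALLED points at a package to reinstall before continuing.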

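1.2 Phase 2: Downloading the Model (Working Around Slow Downloads in China)

Goal: download Qwen2.5-7B-Instruct quickly to local disk (the autodl-tmp data disk).

Create the download script:

In the terminal, run the following command to create an empty script:
```bash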
touch download_model.py
```

Edit and run the download code:

Double-click download_model.py in the file list on the left and paste in the following code:

```python
from modelscope import snapshot_download

# Download into autodl-tmp so the system disk does not fill up
model_dir = snapshot_download('qwen/Qwen2.5-7B-Instruct', cache_dir='/root/autodl-tmp')
print(f"Model downloaded to: {model_dir}")
```

Run the download:
```bash
python download_model.py
```
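Note the final model path printed in the terminal (for example /root/autodl-tmp/qwen/Qwen2-5-7B-Instruct); the next step needs it. The directory name on disk may not match the model ID exactly, so copy the printed path rather than retyping it.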
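
A quick sanity check on the downloaded directory can save a failed load later. The following is a minimal sketch using only the standard library; `check_model_dir` is a hypothetical helper, and the assumption that the weights are stored as `*.safetensors` shards next to a `config.json` should be confirmed against the actual directory listing.

```python
from pathlib import Path

def check_model_dir(model_dir):
    """Return a list of problems found in a downloaded model directory."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"not a directory: {root}"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("missing config.json")
    # Assumption: weights are stored as one or more *.safetensors shards.
    if not list(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files")
    return problems

# Usage: pass the path printed by download_model.py, for example
#   print(check_model_dir("/root/autodl-tmp/...") or "looks complete")
```

An empty list means the two basic ingredients are present; it does not verify file integrity.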

1.3 Phase 3: Building the Experiment Code (The Experiment)

Goal: turn the plan into a runnable Python script covering the Happy Path (BF16), the Sad Path (FP32) and an Int4 test.

Create the main experiment script:
```bash
touch vram_anatomy.py
```

Write the code (replace MODEL_PATH with the actual path from the previous step):

```python
import gc

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# ================= Configuration =================
# IMPORTANT: replace with the path printed by download_model.py
MODEL_PATH = "/root/autodl-tmp/qwen/Qwen2.5-7B-Instruct"
# =================================================

def print_memory_stats(step_name):
    """Print allocated, reserved and peak CUDA memory for this process."""
    if not torch.cuda.is_available():
        print("CUDA not available.")
        return

    # Wait for queued GPU work so the counters reflect finished operations
    torch.cuda.synchronize()

    # Values are GiB (1024**3 bytes); they are printed with a "GB" label.
    # "Reserved" is the pool held by PyTorch's caching allocator.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3

    print(f"\n[📊 VRAM snapshot - {step_name}]")
    print(f"  ├── Allocated (weights/KV):     {allocated:.2f} GB")
    print(f"  ├── Reserved  (allocator pool): {reserved:.2f} GB")
    print(f"  └── Peak      (historical):     {peak:.2f} GB")
    print("-" * 40)

def run_experiment(dtype_str, scenario_name):
    print(f"\n{'='*20} 🧪 Scenario: {scenario_name} ({dtype_str}) {'='*20}")

    # Pre-bind so the cleanup in `finally` works even if loading fails
    model = tokenizer = None

    # Clear leftovers so earlier scenarios do not affect this one
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    print_memory_stats("Init (Empty Cache)")

    try:
        # 1. Load the tokenizer
        tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

        # 2. Load the model (static VRAM)
        print(f"Loading model with {dtype_str}...")

        if dtype_str == "int4":
            # 4-bit load through bitsandbytes; the bare load_in_4bit=True
            # argument is deprecated in favor of a BitsAndBytesConfig
            model = AutoModelForCausalLM.from_pretrained(
                MODEL_PATH,
                device_map="cuda",
                quantization_config=BitsAndBytesConfig(load_in_4bit=True),
                trust_remote_code=True
            )
        else:
            # BF16 or FP32 load (recent transformers versions rename
            # torch_dtype to dtype and print a deprecation warning)
            torch_dtype = torch.float32 if dtype_str == "fp32" else torch.bfloat16
            model = AutoModelForCausalLM.from_pretrained(
                MODEL_PATH,
                device_map="cuda",
                torch_dtype=torch_dtype,
                trust_remote_code=True
            )

        print_memory_stats("Model Loaded (Static Weights)")

        # 3. Inference test (KV Cache pressure)
        # Repeat a sentence to build a longer prompt (Chinese text kept as in the original run)
        input_text = "显存管理是高性能计算的核心，" * 20
        inputs = tokenizer(input_text, return_tensors="pt").to("cuda")

        print("Starting Inference (Generating 200 tokens)...")
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=200)

        # The KV Cache is released when generate() returns, so read Peak here
        print_memory_stats("After Inference (Dynamic KV Cache)")

    except RuntimeError as e:
        if "out of memory" in str(e):
            print(f"\n🚨 OOM triggered (expected): {e}")
            print(">>> Memory summary at the time of the OOM <<<")
            # Print PyTorch's detailed allocator statistics
            print(torch.cuda.memory_summary(abbreviated=True))
        else:
            print(f"\n❌ Non-OOM error: {e}")
    except Exception as e:
        print(f"\n❌ Unknown error: {e}")
    finally:
        # Aggressive cleanup: drop the references, then return cached blocks
        del model, tokenizer
        gc.collect()
        torch.cuda.empty_cache()
        print(f"Scenario {scenario_name} finished, cleanup done.\n")

if __name__ == "__main__":
    # --- Experiment 1-1: Happy Path (BF16) ---
    # Expected: about 14-15 GB allocated, runs smoothly
    run_experiment("bf16", "Happy Path - BF16 comfort zone")

    # --- Experiment 1-2: Quantization Path (Int4) ---
    # Expected: very low usage (~5-6GB), showing the benefit of quantization
    run_experiment("int4", "Quantization - Int4 extreme compression")

    # --- Experiment 1-3: Sad Path (FP32) ---
    # Expected: OOM during loading, confirming the physical limit (28GB > 24GB)
    # Note: this runs last because an OOM can leave the CUDA context unstable
    run_experiment("fp32", "Sad Path - FP32 over the limit")
```

Sample output:

```
root@autodl-container:~/autodl-tmp/MTS-LLM-TestBench/exp-01# python vram_anatomy.py

==================== 🧪 Scenario: Happy Path - BF16 comfort zone (bf16) ====================

[📊 VRAM snapshot - Init (Empty Cache)]
  ├── Allocated (weights/KV):     0.00 GB
  ├── Reserved  (allocator pool): 0.00 GB
  └── Peak      (historical):     0.00 GB
----------------------------------------
Loading model with bf16...
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.05it/s]

[📊 VRAM snapshot - Model Loaded (Static Weights)]
  ├── Allocated (weights/KV):     14.19 GB
  ├── Reserved  (allocator pool): 14.21 GB
  └── Peak      (historical):     14.19 GB
----------------------------------------
Starting Inference (Generating 200 tokens)...

[📊 VRAM snapshot - After Inference (Dynamic KV Cache)]
  ├── Allocated (weights/KV):     14.20 GB
  ├── Reserved  (allocator pool): 14.26 GB
  └── Peak      (historical):     14.23 GB
----------------------------------------
Scenario Happy Path - BF16 comfort zone finished, cleanup done.


==================== 🧪 Scenario: Quantization - Int4 extreme compression (int4) ====================

[📊 VRAM snapshot - Init (Empty Cache)]
  ├── Allocated (weights/KV):     0.01 GB
  ├── Reserved  (allocator pool): 0.02 GB
  └── Peak      (historical):     0.01 GB
----------------------------------------
Loading model with int4...
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.13s/it]

[📊 VRAM snapshot - Model Loaded (Static Weights)]
  ├── Allocated (weights/KV):     5.46 GB
  ├── Reserved  (allocator pool): 6.71 GB
  └── Peak      (historical):     6.59 GB
----------------------------------------
Starting Inference (Generating 200 tokens)...

[📊 VRAM snapshot - After Inference (Dynamic KV Cache)]
  ├── Allocated (weights/KV):     5.46 GB
  ├── Reserved  (allocator pool): 6.72 GB
  └── Peak      (historical):     6.59 GB
----------------------------------------
Scenario Quantization - Int4 extreme compression finished, cleanup done.


==================== 🧪 Scenario: Sad Path - FP32 over the limit (fp32) ====================

[📊 VRAM snapshot - Init (Empty Cache)]
  ├── Allocated (weights/KV):     0.01 GB
  ├── Reserved  (allocator pool): 0.02 GB
  └── Peak      (historical):     0.01 GB
----------------------------------------
Loading model with fp32...
Loading checkpoint shards:  75%|████████████████████████████████████████████████████████████████████████████▌                         | 3/4 [00:05<00:01,  1.87s/it]

🚨 OOM triggered (expected): CUDA out of memory. Tried to allocate 2.03 GiB. GPU 0 has a total capacity of 23.52 GiB of which 1.21 GiB is free. Including non-PyTorch memory, this process has 22.30 GiB memory in use. Of the allocated memory 21.75 GiB is allocated by PyTorch, and 131.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
>>> Memory summary at the time of the OOM <<<
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 1         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  22276 MiB |  22392 MiB | 173115 MiB | 150839 MiB |
|---------------------------------------------------------------------------|
| Active memory         |  22276 MiB |  22392 MiB | 173115 MiB | 150839 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |  22276 MiB |  22392 MiB | 172539 MiB | 150263 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |  22408 MiB |  22408 MiB |  43854 MiB |  21446 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 135006 KiB |  20320 MiB | 147544 MiB | 147412 MiB |
|---------------------------------------------------------------------------|
| Allocations           |     277    |     277    |  658046    |  657769    |
|---------------------------------------------------------------------------|
| Active allocs         |     277    |     277    |  658046    |  657769    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       3    |       3    |      98    |      95    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       4    |       4    |  279988    |  279984    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

Scenario Sad Path - FP32 over the limit finished, cleanup done.
```

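1.4 Phase 4: Running and Observing

Watch the script output and the system monitor at the same time.

Start system-level monitoring (split screen): open a second terminal in JupyterLab and run the following command to watch GPU memory in real time:
```bash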
watch -n 0.5 nvidia-smi
```
Run the experiment: go back to the first terminal and run the Python script:
```bash
python vram_anatomy.py
```
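
If a log file is more convenient than a live `watch` screen, the same numbers can be polled from Python. This is a minimal sketch, assuming `nvidia-smi` is on the PATH; it relies on the `--query-gpu` and `--format=csv,noheader,nounits` options, and the helper names are illustrative.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_memory_line(line):
    """Parse one 'used, total' CSV line (values in MiB) into two ints."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total

def read_gpu_memory():
    """Return [(used_mib, total_mib), ...], one tuple per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]

# Usage on a machine with an NVIDIA driver:
#   for idx, (used, total) in enumerate(read_gpu_memory()):
#       print(f"GPU {idx}: {used} / {total} MiB")
```

Keep in mind that nvidia-smi reports memory held by the whole process, including the allocator's cached pool and the CUDA context, so it reads higher than Allocated in the script output.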

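1.5 Phase 5: Guide to Analyzing the Results

While the script runs, focus on the following numbers and check them against the theory.

BF16 stage (Happy Path):
- Static VRAM: look at Allocated at Model Loaded. A 7B model in BF16 should sit at about 14.2 GB (14.19 GB in the sample run).
- KV Cache: compare Peak after inference with Allocated at Model Loaded. generate() releases the KV Cache and activations when it returns, so Allocated falls back to the baseline and only Peak records the extra memory (14.23 GB versus 14.19 GB in the sample run).
- Reserved: at this point Reserved is usually slightly larger than Allocated, which is healthy.

Int4 stage (Quantization):
- Static VRAM: should drop sharply to about 5GB - 6GB. This is what makes deployment on consumer GPUs practical.

FP32 stage (Sad Path):
- The script tries to load the model.
- Symptom: loading slows down and the memory usage in nvidia-smi climbs quickly toward 24GB (24576MiB).
- Result: Python raises RuntimeError: CUDA out of memory.
- Expert tip: read the exception message together with the memory_summary the script prints at the end. The message states how much memory the failed request asked for (Tried to allocate 2.03 GiB in the sample run) and how little was free; the summary shows what PyTorch already held.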
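
The static numbers above can be estimated before renting any hardware. The sketch below is back-of-envelope arithmetic only; the parameter count of about 7.61 billion is an assumption taken from the commonly quoted figure for Qwen2.5-7B, and `weight_gib` is an illustrative helper.

```python
GIB = 1024 ** 3

# Assumption: ~7.61e9 parameters, the figure commonly quoted for Qwen2.5-7B.
N_PARAMS = 7.61e9
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int4": 0.5}

def weight_gib(n_params, dtype):
    """Nominal weight footprint in GiB, ignoring buffers and quantization metadata."""
    return n_params * BYTES_PER_PARAM[dtype] / GIB

for dtype in ("bf16", "int4", "fp32"):
    print(f"{dtype}: {weight_gib(N_PARAMS, dtype):.2f} GiB")
```

The BF16 estimate lands close to the 14.19 measured in the sample run, and the FP32 estimate exceeds a 24GB card before any activations are counted. The Int4 estimate comes out lower than the measured 5.46; a likely reason is that some layers (such as embeddings and the output head) stay in 16-bit and the quantized blocks carry scaling metadata, though this sketch does not model that.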

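2. Experiment 1: Advanced Experiments

The basic experiments focus on static weights and simple inference, and overlook several hidden killers that crash production systems.

2.1 Advanced 1: Linear Blow-up of the KV Cache (Sequence Length Stress)

What is missing: the basic experiment measured weights only, not long inputs. Current models support 32k/128k contexts, and KV Cache memory follows 2 * Layers * KV_Heads * Head_Dim * Seq_Len * Batch_Size * Bytes. KV_Heads * Head_Dim equals Hidden_Dim only under standard multi-head attention; models that use grouped-query attention keep fewer KV heads, so their cache is proportionally smaller.
Procedure: fix Batch Size = 1, start at Seq_Len = 1k and double the input length each round until OOM.
Observations:
- On NVIDIA hardware: KV Cache memory grows linearly with length. Attention score buffers would grow quadratically, which is what Flash Attention style kernels avoid.
- Challenge for domestic hardware: when our driver handles large contiguous allocations for very long sequences, does Page Table overhead add extra latency?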
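
To make the KV Cache formula concrete, here is a small worked example. A formula written with Hidden_Dim assumes every attention head stores its own keys and values; with grouped-query attention the stored width is KV_Heads * Head_Dim instead. The config values below (28 layers, 28 query heads, 4 KV heads, head dimension 128) are assumptions for Qwen2.5-7B-Instruct and should be confirmed against the model's config.json.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes held by keys and values across all layers; the leading 2 covers K and V."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Qwen2.5-7B-Instruct config values; confirm against config.json.
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 28, 28, 4, 128

for seq_len in (1024, 8192, 32768):
    gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len)
    mha = kv_cache_bytes(LAYERS, Q_HEADS, HEAD_DIM, seq_len)  # Hidden_Dim form
    print(f"{seq_len:6d} tokens: GQA {gqa / 2**20:8.1f} MiB | "
          f"MHA-style {mha / 2**20:8.1f} MiB")
```

Under these assumptions the cache costs 56 KiB per token in BF16, one seventh of what the Hidden_Dim form predicts, which is why a 24GB card can hold far more context for this model than the simpler formula suggests.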

Steps

Create the file exp1_kv_cache.py

Goal: test the linear blow-up of the KV Cache as the sequence length grows, until OOM is triggered.

```python
import gc

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ================= Configuration =================
MODEL_PATH = "/root/autodl-tmp/qwen/Qwen2.5-7B-Instruct"
# =================================================

def test_kv_explosion():
    print(f"\n{'='*10} Advanced Experiment 1: KV Cache Linear Blow-up Test {'='*10}")

    # 1. Clean up the environment
    gc.collect()
    torch.cuda.empty_cache()

    # 2. Load the model
    print("Loading model (BF16)...")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        device_map="cuda",
        torch_dtype=torch.bfloat16
    )
    model.eval()

    base_mem = torch.cuda.memory_allocated() / 1024**3
    print(f"Static model footprint: {base_mem:.2f} GB")

    seq_len = 1024
    try:
        while True:
            print(f"\n>>> Testing sequence length: {seq_len}")

            # Build a dummy input of random token ids
            input_ids = torch.randint(0, 1000, (1, seq_len)).to("cuda")

            torch.cuda.synchronize()
            t0_mem = torch.cuda.memory_allocated()

            # Forward pass. The returned object holds the logits as well as the
            # KV Cache, and `_` still references the previous round's output when
            # t0_mem is read, so the delta below is (this output) - (previous
            # output), not the KV Cache alone.
            with torch.no_grad():
                _ = model(input_ids)

            torch.cuda.synchronize()
            t1_mem = torch.cuda.memory_allocated()

            growth_mb = (t1_mem - t0_mem) / 1024**2
            total_gb = t1_mem / 1024**3

            print(f"  KV Cache growth: {growth_mb:.2f} MB")
            print(f"  Current total VRAM: {total_gb:.2f} GB")

            # Simple early warning
            if total_gb > 23.0:
                print("⚠️  Warning: VRAM is about to run out...")

            # Double the length
            seq_len *= 2

            # Nothing is freed explicitly here: the previous output is released
            # only when `_` is rebound by the next forward pass.

    except RuntimeError as e:
        if "out of memory" in str(e):
            print("\n💥 OOM triggered!")
            print(f"💀 Sequence length at crash: {seq_len}")
            print(f"💀 VRAM allocated at crash: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
        else:
            print(f"❌ Other error: {e}")

if __name__ == "__main__":
    test_kv_explosion()
```

Sample output:

```
root@autodl-container:~/autodl-tmp/MTS-LLM-TestBench/exp-01# python exp1_kv_cache.py

========== Advanced Experiment 1: KV Cache Linear Blow-up Test ==========
Loading model (BF16)...
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.15it/s]
Static model footprint: 14.19 GB

>>> Testing sequence length: 1024
  KV Cache growth: 362.12 MB
  Current total VRAM: 14.54 GB

>>> Testing sequence length: 2048
  KV Cache growth: 352.50 MB
  Current total VRAM: 14.89 GB

>>> Testing sequence length: 4096
  KV Cache growth: 705.50 MB
  Current total VRAM: 15.58 GB

>>> Testing sequence length: 8192
  KV Cache growth: 1412.00 MB
  Current total VRAM: 16.95 GB

>>> Testing sequence length: 16384

💥 OOM triggered!
💀 Sequence length at crash: 16384
💀 VRAM allocated at crash: 17.94 GB
```

Run the file:
```bash
python exp1_kv_cache.py
```
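
The increments in the recorded run deserve a second look, because the value returned by the forward pass contains the logits as well as the KV Cache, and each round is measured while the previous round's output is still referenced. The sketch below is one plausible reading of the numbers, not a definitive account: it sums the recorded increments to recover the size of the held output and compares the per-token cost with logits plus KV Cache. The vocabulary size and attention shape are assumed Qwen2.5-7B-Instruct values; confirm them against config.json.

```python
KIB, MIB = 1024, 1024 ** 2

# Increments printed in the recorded run (MiB), keyed by sequence length.
recorded_growth = {1024: 362.12, 2048: 352.50, 4096: 705.50, 8192: 1412.00}

# Assumed Qwen2.5-7B-Instruct values; confirm against config.json.
VOCAB, LAYERS, KV_HEADS, HEAD_DIM, BF16_BYTES = 152064, 28, 4, 128, 2

logits_per_token = VOCAB * BF16_BYTES                         # one row of logits
kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BF16_BYTES  # K and V, all layers
predicted_kib = (logits_per_token + kv_per_token) / KIB

held_mib = 0.0  # size of the output object still referenced by `_`
for seq_len, growth in recorded_growth.items():
    held_mib += growth  # each increment is (new output) - (previous output)
    observed_kib = held_mib * MIB / seq_len / KIB
    print(f"{seq_len:5d} tokens: observed {observed_kib:6.1f} KiB/token, "
          f"predicted {predicted_kib:6.1f} KiB/token")
```

If this reading holds, most of each increment is logits rather than cache. Holding only the cache from the returned object and deleting the rest before reading memory_allocated() would isolate the KV Cache itself.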

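2.2 Advanced 2: Memory Fragmentation and Reallocation (Fragmentation & Reallocation)

What the basic experiment missed: PyTorch's Caching Allocator is a double-edged sword. It holds on to freed memory instead of returning it to the driver, which speeds up later allocations but can leave the pool fragmented.
Procedure:
1. Allocate a series of Tensors of different sizes (such as 10MB, 20MB, 50MB...).
2. Free Tensors from the middle at random (to create holes).
3. Try to allocate one Tensor whose size equals the sum of the holes.
Expected: there is enough free memory in total, yet the request cannot be served from the holes, because a Tensor needs one contiguous address range (contiguous physical addresses, or contiguous virtual addresses backed by scattered physical pages, depending on the MMU implementation).
Technical dissection: this is where the driver's memory management unit (MMU) and its fit with PyTorch are put to the test. The MUSA architecture has put a great deal of effort into page table merging here.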
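
The three-step procedure can be illustrated without a GPU. The following is a toy model in plain Python, not PyTorch's allocator: ten equal slots stand in for ten blocks in one address range, and a request succeeds only if enough consecutive slots are free.

```python
def largest_free_run(slots):
    """Length of the longest run of consecutive free (None) slots."""
    best = run = 0
    for owner in slots:
        run = run + 1 if owner is None else 0
        best = max(best, run)
    return best

def can_allocate(slots, size):
    """A request succeeds only if `size` consecutive slots are free."""
    return largest_free_run(slots) >= size

# Ten slots stand in for ten equally sized blocks in one address range.
slots = [f"t{i}" for i in range(10)]
for i in range(1, 10, 2):       # free every odd-indexed block
    slots[i] = None

print("free slots:", slots.count(None))                    # 5
print("largest contiguous run:", largest_free_run(slots))  # 1
print("fits a 4-slot request:", can_allocate(slots, 4))    # False
```

Half of the toy address range is free, yet a four-slot request fails. A real allocator has one more option the toy lacks: it can ask the driver for a fresh segment, which is why a request like this can still succeed on a card with spare VRAM.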

Steps

Create the file exp2_fragmentation.py

Goal: simulate memory fragmentation. With plenty of total VRAM available, create a situation where there is free space but a large request does not fit into it.

```python
import gc

import torch

def test_fragmentation():
    print(f"\n{'='*10} Advanced Experiment 2: Memory Fragmentation Simulation {'='*10}")

    if not torch.cuda.is_available():
        print("No GPU available")
        return

    # 1. Thorough cleanup
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    print("Step 1: Allocating 10 consecutive 500MB Tensors (5GB total)...")
    blocks = []
    block_size = 500 * 1024 * 1024  # 500 MiB, counted in uint8 elements
    try:
        for _ in range(10):
            # Append directly so no loop variable keeps a block alive later
            blocks.append(torch.empty(block_size, dtype=torch.uint8, device="cuda"))
    except RuntimeError:
        print("Not enough VRAM to finish initialization, please check the environment.")
        return

    print(f"  Allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
    print(f"  Reserved:  {torch.cuda.memory_reserved()/1024**3:.2f} GB")

    print("\nStep 2: Freeing the odd-indexed blocks (creating holes)...")
    # Keep blocks 0, 2, 4, 6, 8 and drop blocks 1, 3, 5, 7, 9.
    # This frees 5 x 500 MiB, about 2.44 GiB. Rebinding `blocks` drops the last
    # references to the odd blocks, so the Caching Allocator marks them free.
    blocks = blocks[::2]

    # Force a GC pass to make sure the references are gone
    gc.collect()

    print(f"  Allocated after release: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
    print(f"  Current Reserved:        {torch.cuda.memory_reserved()/1024**3:.2f} GB")
    print("  (Note: Reserved usually does not shrink, because PyTorch still holds this memory)")

    print("\nStep 3: Trying to allocate one 2GB block...")
    print("  About 2.44GB is free in theory, but it is split into 5 separate 500MB pieces.")
    print("  Unless PyTorch obtains new memory from the driver, this should fail or force a retry.")

    try:
        # Request 2 GiB
        large_block = torch.empty(2 * 1024 * 1024 * 1024, dtype=torch.uint8, device="cuda")

        print("\n✅ Allocation succeeded!")
        print("  Expert reading: PyTorch served the request. Possible reasons:")
        print("  1. The card had plenty of VRAM left, so PyTorch obtained a new segment (Reserved grew).")
        print("  2. PyTorch freed cached blocks and retried cudaMalloc.")
        print(f"  Current Reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB (compare with Step 2)")

    except RuntimeError as e:
        print(f"\n❌ Allocation failed! Fragmentation OOM: {e}")
        print("  Expert reading: a typical fragmentation OOM. Enough free memory in total, but no contiguous block.")

if __name__ == "__main__":
    test_fragmentation()
```

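Sample output:

```
root@autodl-container:~/autodl-tmp/MTS-LLM-TestBench/exp-01# python exp2_fragmentation.py

========== Advanced Experiment 2: Memory Fragmentation Simulation ==========
Step 1: Allocating 10 consecutive 500MB Tensors (5GB total)...
  Allocated: 4.88 GB
  Reserved:  4.88 GB

Step 2: Freeing the odd-indexed blocks (creating holes)...
  Allocated after release: 2.93 GB
  Current Reserved:        4.88 GB
  (Note: Reserved usually does not shrink, because PyTorch still holds this memory)

Step 3: Trying to allocate one 2GB block...
  About 2.44GB is free in theory, but it is split into 5 separate 500MB pieces.
  Unless PyTorch obtains new memory from the driver, this should fail or force a retry.

✅ Allocation succeeded!
  Expert reading: PyTorch served the request. Possible reasons:
  1. The card had plenty of VRAM left, so PyTorch obtained a new segment (Reserved grew).
  2. PyTorch freed cached blocks and retried cudaMalloc.
  Current Reserved: 6.88 GB (compare with Step 2)
```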

Note on the recorded run: Allocated after the release is 2.93 GB, which is six 500MB blocks rather than five. The script version that produced this log created each block through a loop variable `t`, and `t` still referenced block 9 after the list was rebuilt. That left four free 500MB blocks in the cache (about 1.95 GB), less than the 2GB request.

Run the file:
```bash
python exp2_fragmentation.py
```

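2.3 Advanced 3: PCIe Bandwidth Overflow (Offloading Penalty)

Procedure: use the accelerate library with device_map="auto" so that some layers are offloaded to CPU RAM.
Observation: once VRAM is exhausted and the model starts spilling into system memory, inference speed (Tokens/s) falls off a cliff.
Sharp take: at this point VRAM size is no longer the bottleneck; PCIe bandwidth (typically PCIe 4.0/5.0 x16) becomes the thin straw everything has to pass through.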
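
A rough upper bound shows how thin that straw is. The sketch below assumes that every offloaded weight byte has to cross the PCIe link once per generated token, and uses a nominal figure of about 32 GB/s for PCIe 4.0 x16; both are simplifying assumptions, and real transfers usually achieve less. The 14.19 figure is the BF16 footprint from the basic experiment, and 5 is the GPU budget this experiment enforces.

```python
GIB = 1024 ** 3

def max_tokens_per_second(offloaded_bytes, link_bytes_per_s):
    """Upper bound on decode speed if every offloaded byte crosses the link once per token."""
    return link_bytes_per_s / offloaded_bytes

# Assumptions: ~14.19 GiB of BF16 weights with 5 GiB kept on the GPU, and a
# nominal ~32 GB/s for PCIe 4.0 x16 (real transfers usually achieve less).
offloaded = (14.19 - 5.0) * GIB
pcie4_x16 = 32e9

print(f"upper bound: {max_tokens_per_second(offloaded, pcie4_x16):.2f} tokens/s")
```

Even this optimistic bound is only a few tokens per second, so a measured result well below it is consistent with the link, plus per-layer transfer overhead, being the limiting factor.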

Steps

Create the file exp3_offloading.py

Goal: compare full-GPU inference with CPU Offload inference and confirm the PCIe bandwidth bottleneck.

```python
import gc
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# ================= Configuration =================
MODEL_PATH = "/root/autodl-tmp/qwen/Qwen2.5-7B-Instruct"
# =================================================

def clean():
    gc.collect()
    torch.cuda.empty_cache()

def test_offloading():
    print(f"\n{'='*10} Advanced Experiment 3: PCIe Bandwidth Bottleneck Test (Offload Penalty) {'='*10}")

    clean()
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    # Use a longer prompt so generation dominates the measured time
    # (Chinese prompt kept as in the original run)
    input_text = "人工智能在科学研究中的应用前景非常广阔，" * 10
    inputs = tokenizer(input_text, return_tensors="pt")

    # -------------------------------------------------
    # Case A: normal full-GPU mode (The Happy Path)
    # -------------------------------------------------
    print("\n>>> [Case A] Full GPU mode (Standard)...")
    try:
        model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH, device_map="cuda", torch_dtype=torch.bfloat16
        )

        # Warm-up
        print("  Warmup...")
        _ = model.generate(**inputs.to("cuda"), max_new_tokens=5)

        # Timing. min_new_tokens forces exactly 50 tokens so that
        # 50 / elapsed is a valid speed even if EOS would come earlier.
        print("  Generating (50 tokens)...")
        torch.cuda.synchronize()
        t0 = time.time()
        _ = model.generate(**inputs.to("cuda"), max_new_tokens=50, min_new_tokens=50)
        torch.cuda.synchronize()
        t_gpu = time.time() - t0

        speed_gpu = 50 / t_gpu
        print(f"  ✅ GPU time: {t_gpu:.2f}s | speed: {speed_gpu:.2f} tokens/s")

        del model
        clean()

    except Exception as e:
        print(f"GPU test failed: {e}")
        return

    # -------------------------------------------------
    # Case B: forced CPU Offload (The Bottleneck Path)
    # -------------------------------------------------
    print("\n>>> [Case B] Forced CPU Offload mode (simulating insufficient VRAM)...")
    print("  Note: the GPU is limited to 5GB, forcing model weights to be split into CPU RAM.")

    try:
        # max_memory caps GPU 0 at 5GiB; the remaining layers go to the CPU
        model_offload = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH,
            device_map="auto",
            max_memory={0: "5GiB", "cpu": "64GiB"},
            torch_dtype=torch.bfloat16
        )

        # Inputs are left on the CPU here and accelerate's hooks move them
        # between devices. transformers warns about the device mismatch;
        # moving the inputs to model_offload.device would avoid the warning.
        inputs_cpu = tokenizer(input_text, return_tensors="pt")

        print("  Generating (50 tokens) - this can be slow, please be patient...")
        t0 = time.time()
        _ = model_offload.generate(**inputs_cpu, max_new_tokens=50, min_new_tokens=50)
        torch.cuda.synchronize()
        t_offload = time.time() - t0

        speed_offload = 50 / t_offload
        print(f"  🐢 Offload time: {t_offload:.2f}s | speed: {speed_offload:.2f} tokens/s")

        # -------------------------------------------------
        # Comparison
        # -------------------------------------------------
        print(f"\n{'='*30}")
        print("📊 Final comparison report")
        print(f"  Full GPU speed: {speed_gpu:.2f} t/s")
        print(f"  Offload speed:  {speed_offload:.2f} t/s")
        drop_rate = (1 - speed_offload/speed_gpu) * 100
        print(f"  📉 Performance loss: {drop_rate:.2f}%")
        print("  Conclusion: once weights spill out of VRAM, PCIe bandwidth cuts inference speed by orders of magnitude.")

    except Exception as e:
        print(f"Offload test failed: {e}")

if __name__ == "__main__":
    test_offloading()
```

Sample output:

```
root@autodl-container:~/autodl-tmp/MTS-LLM-TestBench/exp-01# python exp3_offloading.py

========== Advanced Experiment 3: PCIe Bandwidth Bottleneck Test (Offload Penalty) ==========

>>> [Case A] Full GPU mode (Standard)...
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.20it/s]
  Warmup...
  Generating (50 tokens)...
  ✅ GPU time: 1.31s | speed: 38.23 tokens/s

>>> [Case B] Forced CPU Offload mode (simulating insufficient VRAM)...
  Note: the GPU is limited to 5GB, forcing model weights to be split into CPU RAM.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  4.67it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
  Generating (50 tokens) - this can be slow, please be patient...
/root/miniconda3/lib/python3.12/site-packages/transformers/generation/utils.py:2534: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cuda') before running `.generate()`.
  warnings.warn(

  🐢 Offload time: 81.95s | speed: 0.61 tokens/s

==============================
📊 Final comparison report
  Full GPU speed: 38.23 t/s
  Offload speed:  0.61 t/s
  📉 Performance loss: 98.40%
  Conclusion: once weights spill out of VRAM, PCIe bandwidth cuts inference speed by orders of magnitude.
```

Run the file:
```bash
python exp3_offloading.py
```

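3. Summary and Core Knowledge Points

3.1 [Core Conclusion]

VRAM is not a simple "bucket" but a highly dynamic "logistics warehouse". Weights are the fixed inventory, the KV Cache is goods in transit, and fragmentation is why a warehouse with empty space still cannot take a large item.

3.2 [Technical Dissection: The Three States of VRAM]

Allocated: space actually occupied by Tensor data.
Reserved: everything PyTorch has obtained from the driver, that is, Allocated plus blocks it is keeping idle for reuse.
Fragmented: space inside Reserved that cannot be used because its addresses are not contiguous.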
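
The gap between the first two states is easy to quantify from numbers this report already prints. The sketch below is illustrative: `idle_fraction` is a hypothetical helper, and the (allocated, reserved) pairs are copied from the recorded runs. In a live process the same two values come from `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`.

```python
def idle_fraction(allocated_gib, reserved_gib):
    """Share of the reserved pool that is not backing any live tensor."""
    if reserved_gib <= 0:
        return 0.0
    return (reserved_gib - allocated_gib) / reserved_gib

# (allocated, reserved) pairs as printed by the runs in this report.
snapshots = {
    "BF16 after load": (14.19, 14.21),
    "Int4 after load": (5.46, 6.71),
    "Fragmentation step 2": (2.93, 4.88),
}
for name, (allocated, reserved) in snapshots.items():
    print(f"{name:22s} idle {idle_fraction(allocated, reserved):.1%}")
```

These two counters alone cannot tell reusable cache from unusable fragments; they only show how much of the pool is idle. Telling the two apart needs the per-block detail in `torch.cuda.memory_summary()`.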

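3.3 [Key Concepts (Knowledge Points)]

OOM (Out Of Memory): comes in two kinds, "real OOM" (physical VRAM is truly exhausted) and "false OOM" (fragmentation prevents a contiguous block from being allocated).
KV Cache: the VRAM killer of Transformer inference. In FP16 each token adds a constant amount of memory, so the longer the context, the more it consumes, and it can even exceed the model weights themselves.
Quantization: Int4 is first of all a way to save VRAM and memory bandwidth. The bitsandbytes weight-only path used in this report dequantizes to 16-bit before each matmul; raising throughput with the integer units of Tensor Cores needs kernels and hardware that support them.
Caching Allocator: PyTorch's VRAM steward. Once you understand it, you understand why nvidia-smi can show the card as full while the program keeps running (as long as the cache still has free blocks).

In 2026, compute (FLOPS) is often in surplus; memory bandwidth and capacity are the real currency of large-model inference. We push hard on memory bus width and capacity so that users running a 70B model are not forced down to 4-bit quantization for want of 1GB of VRAM.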