GPU Memory Usage Analysis (PyTorch)

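We train our models with PyTorch, and from time to time we run out of GPU memory. Besides applying a remedy such as gradient accumulation or automatic mixed precision, it helps to understand which parts of training actually consume large amounts of GPU memory. There are five main areas:

- CUDA runtime memory
- Model parameters
- The forward pass
- The backward pass
- Optimizer statistics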
CUDA runtime memory {#title-0}
=======================

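CUDA (Compute Unified Device Architecture) is the computing platform released by the GPU vendor NVIDIA. It lets us use the processing power of the GPU to speed up computation substantially.

For our purposes, CUDA is essentially a piece of software that runs on the GPU hardware, and our workloads have to run on top of it to make use of the GPU. Because it is software, CUDA itself takes up some GPU memory while running. How much depends on the CUDA version: around 600 MB for some versions, more than 1 GB for others.

First, let's look at how PyTorch manages memory. GPU memory is the full set of resources available to us. Anyone familiar with C/C++ knows that frequently requesting and releasing resources, such as C's malloc/free or C++'s new/delete, badly hurts performance. The resource pool was introduced to cut down on such operations. The idea is to request one large chunk from the available resources in advance. When the program needs a resource, it takes it from the pool, which skips the complex and slow system call. When the resource is no longer needed, it goes back into the pool. Only when the pool is exhausted does the program request more from the system. This makes resource handling considerably more efficient.

PyTorch allocates memory for tensors the same way: it first requests a larger chunk of memory, tensors take what they need from this memory pool, and memory that is no longer used is returned to the pool. Without this caching mechanism, PyTorch would run far more slowly.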
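The resource pool idea described above can be sketched in a few lines of plain Python. This is only a toy illustration, not PyTorch's allocator: the class `BlockPool` and its `system_requests` counter are hypothetical names introduced here, and a `bytearray` stands in for a block obtained from the system.

```python
class BlockPool:
    """Toy resource pool: hands out fixed-size blocks and reuses returned ones."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.free_blocks = []      # blocks given back by the program, ready for reuse
        self.system_requests = 0   # how many times we had to ask the "system"

    def acquire(self):
        # Reuse a pooled block when one is available
        if self.free_blocks:
            return self.free_blocks.pop()
        # Otherwise fall back to the (slow) system allocation
        self.system_requests += 1
        return bytearray(self.block_size)

    def release(self, block):
        # Keep the block in the pool instead of returning it to the system
        self.free_blocks.append(block)


pool = BlockPool(block_size=512)
for _ in range(1000):
    block = pool.acquire()
    pool.release(block)

print(pool.system_requests)  # 1
```

A thousand acquire/release cycles cost a single system request here, which is the saving the pool is designed to deliver.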

Next, let's use a piece of code to verify that the CUDA platform takes up some GPU memory while running. First install a library:

    pip install pynvml

    import torch
    import pynvml

    # Initialize the pynvml library
    pynvml.nvmlInit()

    # Get a handle to the first GPU
    device_object = pynvml.nvmlDeviceGetHandleByIndex(0)


    def convert(x):
        # Convert bytes to whole MB
        return int(x / 1024 / 1024)


    def show_usage():
        # Query the memory information of the GPU
        device_memory = pynvml.nvmlDeviceGetMemoryInfo(device_object)
        # Total GPU memory
        total = convert(device_memory.total)
        # GPU memory in use
        used = convert(device_memory.used)
        # GPU memory still available
        free = convert(device_memory.free)
        print('total:', total, 'used:', used, 'free:', free)


    # 1. Initializing CUDA takes up some GPU memory
    def test01():
        show_usage()
        # A tensor created on the CPU uses no GPU memory and does not initialize CUDA
        torch.tensor(0.0, device='cpu')
        show_usage()
        torch.tensor(0.0, device='cuda')
        # Clear the cache
        torch.cuda.empty_cache()
        show_usage()


    if __name__ == '__main__':
        test01()

Program output:

    total: 5932 used: 0 free: 5932
    total: 5932 used: 0 free: 5932
    total: 5932 used: 586 free: 5346

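If the code above does not clear the cache, the last line reports 588 MB in use instead of 586 MB: 588 = 586 + PyTorch's cache. The CUDA tensor we created is not bound to any variable, so it is reclaimed right after creation, but its memory goes back to PyTorch's cache rather than to the GPU. Only after the cache is cleared does the figure drop to 586. The extra 2 MB is the block PyTorch requested from the GPU to serve the tensor. Within such a block, every tensor created on the CUDA device is given a multiple of 512 bytes, as the following code shows.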

    import torch


    def test02():
        # 0
        print(torch.cuda.memory_allocated())

        # 512
        a = torch.tensor(0.0, device='cuda')
        print(torch.cuda.memory_allocated())

        # 1024
        b = torch.tensor(0.0, device='cuda')
        print(torch.cuda.memory_allocated())

        # 1536
        c = torch.tensor(0.0, device='cuda')
        print(torch.cuda.memory_allocated())


    if __name__ == '__main__':
        test02()

Program output:

    0
    512
    1024
    1536

torch.cuda.memory_allocated returns the number of bytes currently allocated to tensors.
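The readings noted in the comments of test02 follow from simple round-up arithmetic, which can be reproduced without a GPU. The sketch below is plain Python, not a PyTorch API: `rounded_size` is a hypothetical helper, and the 512-byte granularity is taken from the numbers in this article.

```python
BLOCK = 512  # allocation granularity in bytes, as seen in the test02 comments


def rounded_size(nbytes, block=BLOCK):
    """Round nbytes up to the next multiple of block."""
    return (nbytes + block - 1) // block * block


allocated = 0
readings = [allocated]
for _ in range(3):
    # One float32 scalar needs 4 bytes but is charged a full 512-byte unit
    allocated += rounded_size(4)
    readings.append(allocated)

print(readings)  # [0, 512, 1024, 1536]
```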

Model parameters {#title-1}
=====================

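This part is easy to understand: loading a model means loading its parameters, so the parameters take up some GPU memory. By default, PyTorch parameters are of type float32. Consider the following code: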

    import torch
    import torch.nn as nn


    def test01():
        print(torch.cuda.memory_allocated())
        # A linear layer with a single weight and no bias
        linear = nn.Linear(in_features=1, out_features=1, bias=False).cuda()
        print(torch.cuda.memory_allocated())


    if __name__ == '__main__':
        test01()

Program output:

    0
    512

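The linear layer above has no bias and only one parameter, so it should need 4 bytes. Why does it take 512 bytes? Because PyTorch allocates GPU memory in multiples of 512 bytes, that is, in blocks. Why do it this way, and isn't that wasteful? The reason is again efficiency: allocating in blocks simplifies memory management and helps avoid fragmentation.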

    import torch
    import torch.nn as nn


    def test02():
        print(torch.cuda.memory_allocated())
        # 128 float32 weights = 512 bytes, exactly one block
        linear = nn.Linear(in_features=128, out_features=1, bias=False).cuda()
        print(torch.cuda.memory_allocated())


    if __name__ == '__main__':
        test02()

The result is still 512 bytes. If in_features is changed from 128 to 129, 1024 bytes are allocated instead. Keep in mind that each parameter takes 4 bytes.
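The jump from 512 to 1024 bytes between 128 and 129 input features can be checked with a short calculation. This is a minimal sketch in plain Python under the assumptions stated in the text (float32 parameters, 512-byte granularity); `tensor_bytes` is a hypothetical helper, not part of PyTorch.

```python
FLOAT32_BYTES = 4  # size of one float32 parameter
BLOCK = 512        # allocation granularity in bytes


def tensor_bytes(num_elements, block=BLOCK):
    """Bytes charged for a float32 tensor after rounding up to whole blocks."""
    raw = num_elements * FLOAT32_BYTES
    return (raw + block - 1) // block * block


# Weight of nn.Linear(in_features, 1, bias=False) has in_features * 1 elements
for in_features in (1, 128, 129):
    print(in_features, tensor_bytes(in_features * 1))
# 1 512
# 128 512
# 129 1024
```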

Question: how much GPU memory does the following model take?

    import torch
    import torch.nn as nn


    class Net(nn.Module):

        def __init__(self):
            super(Net, self).__init__()
            self.linear1 = nn.Linear(1, 1, bias=False)
            self.linear2 = nn.Linear(1, 1, bias=False)

        def forward(self, inputs):
            inputs = self.linear1(inputs)
            inputs = self.linear2(inputs)
            return inputs


    def test03():
        print(torch.cuda.memory_allocated())
        model = Net().cuda()
        print(torch.cuda.memory_allocated())


    if __name__ == '__main__':
        test03()

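Program output:

    0
    1024

Each of the two weights is a separate tensor and is rounded up to 512 bytes on its own, which gives 1024 bytes in total.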

Forward and backward passes {#title-2}
=====================

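A network stores intermediate results during the forward pass. Why? Because the backward pass needs them to compute gradients. The gradients produced by the backward pass also have to be stored in GPU memory. So both the forward and the backward pass consume GPU memory.

In addition, the larger the input batch_size, the more GPU memory is used.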

    import torch
    import torch.nn as nn


    def test():
        # 0
        print(torch.cuda.memory_allocated())

        # 1024 = 512 (weight) + 512 (bias)
        model = nn.Linear(1, 1).cuda()
        print(torch.cuda.memory_allocated())

        # Forward pass
        # 5120 = 1024 + 4096 (1024 float32 inputs)
        inputs = torch.randn(size=(1024, 1)).cuda()
        print(torch.cuda.memory_allocated())

        # The forward pass has to keep the intermediate results (outputs)
        # Note: binding the result to a variable keeps it in memory
        # 9216 = 5120 + 4096 (1024 stored outputs)
        outputs = model(inputs)
        print(torch.cuda.memory_allocated())

        # Compute the loss
        # 9728 = 9216 + 512 (the stored loss value)
        loss = torch.mean(outputs)
        print(torch.cuda.memory_allocated())

        # Backward pass
        # 10752 = 9728 + 1024 (gradients of weight and bias, 512 each)
        loss.backward()
        print(torch.cuda.memory_allocated())


    if __name__ == '__main__':
        test()

Program output:

    0
    1024
    5120
    9216
    9728
    10752

After backpropagation, variables such as outputs and loss can be released.
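To make the arithmetic in the comments of the example explicit, and to show how the figures scale with batch size, here is a small tally in plain Python. It is a sketch under assumptions: only the tensors named in the example are counted, each float32 tensor is rounded up to a multiple of 512 bytes, and temporary buffers are ignored. `expected_readings` is a hypothetical helper, not a PyTorch function.

```python
def round_up(nbytes, block=512):
    """Round nbytes up to the next multiple of block."""
    return (nbytes + block - 1) // block * block


def expected_readings(batch_size):
    """Cumulative bytes after each step of the nn.Linear(1, 1) example."""
    f32 = 4
    weight = round_up(1 * f32)
    bias = round_up(1 * f32)
    inputs = round_up(batch_size * 1 * f32)
    outputs = round_up(batch_size * 1 * f32)
    loss = round_up(1 * f32)
    grads = weight + bias  # one gradient tensor per parameter

    # Bytes added at each step: start, model, inputs, outputs, loss, backward
    steps = [0, weight + bias, inputs, outputs, loss, grads]
    readings, total = [], 0
    for added in steps:
        total += added
        readings.append(total)
    return readings


print(expected_readings(1024))  # [0, 1024, 5120, 9216, 9728, 10752]
print(expected_readings(2048))  # [0, 1024, 9216, 17408, 17920, 18944]
```

Doubling the batch size doubles the memory for inputs and outputs, while the parameters, loss, and gradients stay the same size.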

Optimizer statistics {#title-3}
=====================

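Different optimizers keep their own statistics. For example, SGD with momentum keeps a momentum buffer (a running average of past gradients) for each parameter, and Adam keeps first-moment and second-moment estimates of the gradient for each parameter. These also take up GPU memory during training, and the more parameters the model has, the more memory the optimizer needs.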

    import torch
    import torch.nn as nn
    import torch.optim as optim


    def test():
        # 0
        print(torch.cuda.memory_allocated())

        # 512
        model = nn.Linear(1, 1, bias=False).cuda()
        print(torch.cuda.memory_allocated())

        # 1024
        inputs = torch.randn(size=(1, 1)).cuda()
        print(torch.cuda.memory_allocated())

        # 1536
        outputs = model(inputs)
        print(torch.cuda.memory_allocated())

        # 2048
        loss = torch.mean(outputs)
        print(torch.cuda.memory_allocated())

        # 2560
        loss.backward()
        print(torch.cuda.memory_allocated())

        # 3584 = 2560 + 1024 (two Adam statistics, 512 each)
        optimizer = optim.Adam(model.parameters(), lr=1e-3)
        optimizer.step()
        print(torch.cuda.memory_allocated())


    if __name__ == '__main__':
        test()

Program output:

    0
    512
    1024
    1536
    2048
    2560
    3584

If momentum is set, SGD keeps one history tensor per parameter. Adam keeps more data per parameter, so it uses more GPU memory.
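The difference between the optimizers can be estimated with a rough calculation. The sketch below is plain Python under simplifying assumptions: float32 state, each state tensor rounded up to 512 bytes, one state tensor per parameter for SGD with momentum and two for Adam, and no other bookkeeping (step counters or optional extras) counted. `optimizer_state_bytes` and the table `STATE_TENSORS_PER_PARAM` are hypothetical names used only for illustration.

```python
def round_up(nbytes, block=512):
    """Round nbytes up to the next multiple of block."""
    return (nbytes + block - 1) // block * block


# Assumed number of parameter-sized state tensors kept per parameter
STATE_TENSORS_PER_PARAM = {
    "sgd": 0,            # plain SGD keeps no history
    "sgd_momentum": 1,   # one momentum buffer
    "adam": 2,           # first and second moment estimates
}


def optimizer_state_bytes(param_sizes, optimizer):
    """Estimate the bytes of optimizer state for float32 parameters.

    param_sizes lists the number of elements of each parameter tensor.
    """
    per_param = STATE_TENSORS_PER_PARAM[optimizer]
    return sum(per_param * round_up(n * 4) for n in param_sizes)


sizes = [1]  # the single weight of nn.Linear(1, 1, bias=False)
for name in ("sgd", "sgd_momentum", "adam"):
    print(name, optimizer_state_bytes(sizes, name))
# sgd 0
# sgd_momentum 512
# adam 1024
```

The Adam estimate of 1024 bytes matches the step from 2560 to 3584 noted in the example's comments.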

厉飞雨
