To set up multi-GPU training with PyTorch on CentOS, follow these steps:
Install CUDA and cuDNN:
Download and install versions of CUDA and cuDNN from the NVIDIA website that match your GPU and driver.
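After installation, you can check that the driver sees your GPUs and that the toolkit is present (nvcc needs the PATH setup from the next step):

nvidia-smi
nvcc --version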
Configure environment variables:
This is typically done by adding the CUDA paths to your ~/.bashrc or /etc/profile.d/cuda.sh file.
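A typical /etc/profile.d/cuda.sh looks like this (a sketch assuming CUDA is installed under /usr/local/cuda; adjust the path to match your version and install location):

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Source the file (or open a new shell) for the changes to take effect.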
Install PyTorch:
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
Here, cu113 should be replaced with the CUDA version you installed.
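Once installed, it is worth sanity-checking from Python that PyTorch was built with CUDA and can see your GPUs:

import torch
print(torch.__version__)          # installed PyTorch version
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # True if at least one GPU is usable
print(torch.cuda.device_count())  # number of visible GPUs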
Write the multi-GPU training code:
Use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel for multi-GPU training. DataParallel is simple to use and suits small networks and datasets: it automatically replicates the model onto every GPU and gives each GPU a different slice of each batch. DistributedDataParallel is the better fit for large networks and datasets, offering better performance and scalability (a minimal DistributedDataParallel sketch appears at the end of this section). Here is a simple example using DataParallel:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# Define a simple convolutional neural network
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)  # flatten: 20 channels x 4 x 4 = 320 features
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
# Create the model, then parallelize it if more than one GPU is available
model = ConvNet()
if torch.cuda.device_count() > 1:
    print(f"Let's use {torch.cuda.device_count()} GPUs!")
    model = nn.DataParallel(model)  # replicates the model across all visible GPUs
else:
    print("Only one GPU (or none) detected; running without DataParallel.")
model.cuda()  # move the model to the GPU
# Load the MNIST training data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = DataLoader(trainset, batch_size=64, shuffle=True)
# Define the loss function and optimizer
criterion = nn.NLLLoss()  # the model returns log-probabilities (log_softmax), so NLLLoss is the matching loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
# Train the model
for epoch in range(10):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # Get the inputs and move them to the GPU
        inputs, labels = data
        inputs, labels = inputs.cuda(), labels.cuda()
        # Zero the parameter gradients
        optimizer.zero_grad()
        # Forward pass + backward pass + optimizer step
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # Accumulate and report statistics
        running_loss += loss.item()
        if i % 200 == 199:  # print every 200 mini-batches (MNIST has ~938 per epoch at batch size 64)
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 200:.3f}')
            running_loss = 0.0
print('Finished Training')
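One practical note: DataParallel wraps the original network, so the trained weights live under model.module. When saving a checkpoint, it is common to unwrap first so the file can later be loaded without the wrapper ('convnet.pt' below is just an example path):

# Save the underlying weights rather than the DataParallel wrapper's
state_dict = model.module.state_dict() if isinstance(model, nn.DataParallel) else model.state_dict()
torch.save(state_dict, 'convnet.pt')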
Run the multi-GPU training:
Launch your training script with python, and PyTorch will automatically detect and use all visible GPUs. Note that multi-GPU training needs more memory and compute, so make sure your hardware can support it; depending on your requirements and model complexity, you may also need to adapt the code.
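As mentioned in step 4, DistributedDataParallel is usually the better choice for larger networks and datasets. Below is a minimal sketch, not a drop-in replacement for the script above: it assumes the ConvNet class from the example is defined in (or imported into) the same file, and the file name ddp_train.py is just an example. Each GPU gets its own process; torchrun sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that the script reads.

# ddp_train.py -- minimal DistributedDataParallel sketch (example file name)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

def main():
    dist.init_process_group(backend='nccl')    # NCCL is the recommended backend for GPUs
    local_rank = int(os.environ['LOCAL_RANK']) # set by torchrun
    torch.cuda.set_device(local_rank)

    model = ConvNet().cuda(local_rank)  # assumes the ConvNet class defined above
    model = DDP(model, device_ids=[local_rank])

    transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
    trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
    sampler = DistributedSampler(trainset)  # gives each process a distinct shard of the data
    trainloader = DataLoader(trainset, batch_size=64, sampler=sampler)

    criterion = torch.nn.NLLLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for inputs, labels in trainloader:
            inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()  # gradients are all-reduced across processes here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()

Launch it with one process per GPU, for example:

torchrun --nproc_per_node=4 ddp_train.py

where 4 should match the number of GPUs in the machine. Unlike DataParallel, each process keeps its own model replica and gradients are synchronized by all-reduce during the backward pass, which scales better across GPUs and nodes.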