Configuring PyTorch multi-GPU training on CentOS involves the following steps:
First, make sure your system has an NVIDIA GPU driver installed, along with CUDA and cuDNN.
sudo yum install epel-release
sudo yum install nvidia-driver-latest-dkms
sudo reboot
Download and install the CUDA Toolkit:
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-10.2.89-1.x86_64.rpm
sudo yum localinstall cuda-repo-rhel7-10.2.89-1.x86_64.rpm
sudo yum clean all
sudo yum install cuda
Configure the environment variables:
echo 'export PATH=/usr/local/cuda-10.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Download the cuDNN library (this requires a free NVIDIA developer account), then extract and install it:
tar -xzvf cudnn-10.2-linux-x64-v8.0.5.39.tgz
sudo cp cuda/include/cudnn*.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
Install PyTorch with pip, making sure to choose a build that matches your CUDA version (cu102 for CUDA 10.2):
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu102
Verify that CUDA and PyTorch are installed correctly and that PyTorch can detect the GPUs:
import torch
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
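To double-check version compatibility, you can also print the CUDA and cuDNN versions this PyTorch build was compiled against (with the cu102 wheel above, you would expect something like '10.2' and a cuDNN 8.x build number):
import torch

print(torch.version.cuda)                   # CUDA version PyTorch was built with, e.g. '10.2'
print(torch.backends.cudnn.is_available())  # True if the bundled cuDNN is usable
print(torch.backends.cudnn.version())       # cuDNN version number, e.g. 8005 for 8.0.5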
In PyTorch, you can use torch.nn.DataParallel or torch.nn.parallel.DistributedDataParallel for multi-GPU training.
Using DataParallel:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Assume you have a model and a dataset
model = YourModel()
model = nn.DataParallel(model)  # replicates the model and splits each batch across GPUs
model.to('cuda')
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Data loader
train_loader = DataLoader(your_dataset, batch_size=your_batch_size, shuffle=True)

for data, target in train_loader:
    data, target = data.to('cuda'), target.to('cuda')
    optimizer.zero_grad()
    output = model(data)
    loss = nn.CrossEntropyLoss()(output, target)
    loss.backward()
    optimizer.step()
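One practical note: nn.DataParallel wraps the original model, so its parameters live under the wrapper's module attribute. A common pattern (sketched here with a hypothetical checkpoint filename) is to save the underlying model's state dict so the checkpoint can later be loaded without the wrapper:
# Save the unwrapped model's weights, not the DataParallel wrapper's
torch.save(model.module.state_dict(), 'checkpoint.pt')

# Later, load them into a plain (unwrapped) model instance
plain_model = YourModel()
plain_model.load_state_dict(torch.load('checkpoint.pt'))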
Using DistributedDataParallel:
DistributedDataParallel is typically used for more complex distributed training scenarios, and the PyTorch documentation recommends it over DataParallel even for single-machine multi-GPU training.
Initialize the distributed environment:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

def train(rank, world_size):
    # One process per GPU; rank identifies this process and its GPU
    dist.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
    model = YourModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # Data loader: DistributedSampler gives each process a distinct shard of the dataset
    train_sampler = torch.utils.data.distributed.DistributedSampler(your_dataset, num_replicas=world_size, rank=rank)
    train_loader = DataLoader(your_dataset, batch_size=your_batch_size, sampler=train_sampler)

    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)  # reshuffles the shards each epoch
        for data, target in train_loader:
            data, target = data.to(rank), target.to(rank)
            optimizer.zero_grad()
            output = ddp_model(data)
            loss = nn.CrossEntropyLoss()(output, target)
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
Set the MASTER_ADDR and MASTER_PORT environment variables before running the script:
export MASTER_ADDR='localhost'
export MASTER_PORT='12345'
Because the script above spawns its own worker processes with mp.spawn, you run it directly:
python YOUR_TRAINING_SCRIPT.py
Alternatively, you can let a PyTorch launcher create the processes (in which case the script should not call mp.spawn itself):
python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE YOUR_TRAINING_SCRIPT.py
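A minimal sketch of that launcher-driven variant, under the assumption that you launch with torchrun or with torch.distributed.launch --use_env so that RANK, WORLD_SIZE, and LOCAL_RANK arrive as environment variables (YourModel is the same placeholder as above):
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The launcher sets RANK/WORLD_SIZE/LOCAL_RANK; init_method='env://' reads them
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')

    model = YourModel().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    # ...build the sampler, loader, and training loop exactly as in the mp.spawn version...

    dist.destroy_process_group()

if __name__ == '__main__':
    main()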
With these steps in place, you should be able to configure and run PyTorch multi-GPU training on CentOS.