Multi-GPU¶
Set the accelerator and devices arguments to train on a single machine (node) with multiple GPUs. By default, Lightly Train uses all available GPUs on the current node for training. The following example shows how to train with two GPUs:
Important

Always run your code inside an if __name__ == "__main__": block when using multiple GPUs!
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        accelerator="gpu",  # Accelerator type
        devices=2,          # Number of GPUs
    )
Or, when using the command line interface:

lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet50" accelerator="gpu" devices=2
Tip

Set devices=[1, 3] to train on GPUs 1 and 3 specifically. When using the command line interface, set devices="[1,3]" instead.
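As a minimal sketch (reusing the same output directory, data, and model as the example above), selecting GPUs 1 and 3 with the Python API looks like this:

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        accelerator="gpu",
        devices=[1, 3],  # Train only on the GPUs with indices 1 and 3
    )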
Adjusting Parameters¶
Parameters such as the batch size (batch_size), learning rate (optim_args.lr), and the number of dataloader workers (num_workers) are automatically adjusted based on the number of GPUs. You do not need to modify these parameters manually when changing the number of GPUs.
Tip

The batch size (batch_size) is the global batch size across all GPUs. Setting batch_size=256 with devices=2 will result in a batch size of 128 per GPU.
Tip

The number of workers (num_workers) is the number of dataloader workers per GPU. Setting num_workers=8 with devices=2 results in 16 dataloader workers in total. The total number of dataloader workers should not exceed the number of CPU cores on the node to avoid training slowdowns. See the sketch below for how both settings combine.
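The following sketch illustrates both tips above; the batch_size and num_workers values are only example settings taken from the tips, not recommendations:

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        accelerator="gpu",
        devices=2,
        batch_size=256,  # Global batch size: 128 samples per GPU
        num_workers=8,   # Workers per GPU: 16 dataloader workers in total
    )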
SLURM¶
Use the following setup to train on a SLURM-managed cluster with multiple GPUs:
When using the Python API, create a SLURM script (my_train_slurm.sh) that looks as follows:
#!/bin/bash -l
#SBATCH --nodes=1            # Number of nodes
#SBATCH --gres=gpu:2         # Number of GPUs
#SBATCH --ntasks-per-node=2  # Must match the number of GPUs
#SBATCH --cpus-per-task=12   # Number of CPU cores per GPU; must be larger than the
                             # number of dataloader workers (num_workers) if num_workers
                             # is set in the training function. Otherwise, num_workers is
                             # automatically set to cpus-per-task - 1.
#SBATCH --mem=0              # Use all available memory
# IMPORTANT: Do not set --ntasks as it is automatically inferred from --ntasks-per-node.
# Activate your virtual environment.
# The command might differ depending on your setup.
# For conda environments, use `conda activate my_env`.
source .venv/bin/activate
# On your cluster you might need to set the network interface:
# export NCCL_SOCKET_IFNAME=^docker0,lo
# Might need to load the latest CUDA version:
# module load NCCL/2.4.7-1-cuda.10.0
# Run the training script.
srun python my_train_script.py
Then create a Python script (my_train_script.py) that calls lightly_train.train():
# my_train_script.py
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        # The following arguments are automatically set based on the SLURM configuration:
        # accelerator="gpu",
        # devices=2,
        # num_workers=11,
    )
Finally, submit the training job to the SLURM cluster with:
sbatch my_train_slurm.sh
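If you want to confirm what SLURM allocated before starting a long run, a small optional check can be added inside the if __name__ == "__main__": block of my_train_script.py. This is a hypothetical addition, not part of the Lightly Train setup; it only prints environment information:

import os

import torch

# Optional sanity check (not required by Lightly Train):
# print the number of SLURM tasks and the GPUs visible to this process.
print("SLURM_NTASKS:", os.environ.get("SLURM_NTASKS"))  # 2 with the script above
print("Visible GPUs in this task:", torch.cuda.device_count())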
Alternatively, when using the command line interface, create a SLURM script (my_train_slurm.sh) that runs lightly-train directly:
#!/bin/bash -l
#SBATCH --nodes=1            # Number of nodes
#SBATCH --gres=gpu:2         # Number of GPUs
#SBATCH --ntasks-per-node=2  # Must match the number of GPUs
#SBATCH --cpus-per-task=12   # Number of CPU cores per GPU; must be larger than the
                             # number of dataloader workers (num_workers) if num_workers
                             # is set in the training function. Otherwise, num_workers is
                             # automatically set to cpus-per-task - 1.
#SBATCH --mem=0              # Use all available memory
# IMPORTANT: Do not set --ntasks as it is automatically inferred from --ntasks-per-node.
# Activate your virtual environment.
# The command might differ depending on your setup.
# For conda environments, use `conda activate my_env`.
source .venv/bin/activate
# On your cluster you might need to set the network interface:
# export NCCL_SOCKET_IFNAME=^docker0,lo
# Might need to load the latest CUDA version:
# module load NCCL/2.4.7-1-cuda.10.0
# Start the training.
srun lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet50"
# The following arguments are automatically set based on the SLURM configuration:
# accelerator="gpu"
# devices=2
# num_workers=11
Then submit the training job to the SLURM cluster with:
sbatch my_train_slurm.sh