Multi-GPU¶
Set the accelerator and devices arguments to train on a single machine (node) with multiple GPUs. By default, Lightly Train uses all available GPUs on the current node for training. The following example shows how to train with two GPUs:
Important

Always run your code inside an if __name__ == "__main__": block when using multiple GPUs!
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        accelerator="gpu",  # Accelerator type
        devices=2,          # Number of GPUs
    )
Or, when using the command line interface:

lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet50" accelerator="gpu" devices=2
Tip

Set devices=[1, 3] to train on GPUs 1 and 3 specifically. When using the command line interface, set devices="[1,3]" instead.
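As a minimal sketch (reusing the same output directory, data, and model as the example above), selecting GPUs 1 and 3 with the Python API looks like this:

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        accelerator="gpu",
        devices=[1, 3],  # Train only on the GPUs with indices 1 and 3
    )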
Adjusting Parameters¶
Parameters such as the batch size (batch_size), learning rate (optim_args.lr), and the number of dataloader workers (num_workers) are automatically adjusted based on the number of GPUs. You do not need to modify these parameters manually when changing the number of GPUs.
Tip

The batch size (batch_size) is the global batch size across all GPUs. Setting batch_size=256 with devices=2 will result in a batch size of 128 per GPU.
Tip

The number of workers (num_workers) is the number of dataloader workers per GPU. Setting num_workers=8 with devices=2 results in 16 dataloader workers in total. The total number of dataloader workers should not exceed the number of CPU cores on the node to avoid training slowdowns. See the sketch below for how both settings combine.
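The following sketch illustrates both tips above; the batch_size and num_workers values are only example settings taken from the tips, not recommendations:

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        accelerator="gpu",
        devices=2,
        batch_size=256,  # Global batch size: 128 samples per GPU
        num_workers=8,   # Workers per GPU: 16 dataloader workers in total
    )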
SLURM¶
Use the following setup to train on a SLURM-managed cluster with multiple GPUs:
When using the Python API, create a SLURM script (my_train_slurm.sh) that looks as follows:
#!/bin/bash -l
#SBATCH --nodes=1            # Number of nodes
#SBATCH --gres=gpu:2         # Number of GPUs
#SBATCH --ntasks-per-node=2  # Must match the number of GPUs
#SBATCH --cpus-per-task=12   # Number of CPU cores per GPU; must be larger than the
                             # number of dataloader workers (num_workers) if num_workers
                             # is set in the training function. Otherwise, num_workers is
                             # automatically set to cpus-per-task - 1.
#SBATCH --mem=0              # Use all available memory
# IMPORTANT: Do not set --ntasks as it is automatically inferred from --ntasks-per-node.
# Activate your virtual environment.
# The command might differ depending on your setup.
# For conda environments, use `conda activate my_env`.
source .venv/bin/activate
# On your cluster you might need to set the network interface:
# export NCCL_SOCKET_IFNAME=^docker0,lo
# Might need to load the latest CUDA version:
# module load NCCL/2.4.7-1-cuda.10.0
# Run the training script.
srun python my_train_script.py
Then create a Python script (my_train_script.py) that calls lightly_train.train():
# my_train_script.py
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        # The following arguments are automatically set based on the SLURM configuration:
        # accelerator="gpu",
        # devices=2,
        # num_workers=11,
    )
Finally, submit the training job to the SLURM cluster with:
sbatch my_train_slurm.sh
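If you want to confirm what SLURM allocated before starting a long run, a small optional check can be added inside the if __name__ == "__main__": block of my_train_script.py. This is a hypothetical addition, not part of the Lightly Train setup; it only prints environment information:

import os

import torch

# Optional sanity check (not required by Lightly Train):
# print the number of SLURM tasks and the GPUs visible to this process.
print("SLURM_NTASKS:", os.environ.get("SLURM_NTASKS"))  # 2 with the script above
print("Visible GPUs in this task:", torch.cuda.device_count())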
Alternatively, when using the command line interface, create a SLURM script (my_train_slurm.sh) that runs lightly-train directly:
#!/bin/bash -l
#SBATCH --nodes=1            # Number of nodes
#SBATCH --gres=gpu:2         # Number of GPUs
#SBATCH --ntasks-per-node=2  # Must match the number of GPUs
#SBATCH --cpus-per-task=12   # Number of CPU cores per GPU; must be larger than the
                             # number of dataloader workers (num_workers) if num_workers
                             # is set in the training function. Otherwise, num_workers is
                             # automatically set to cpus-per-task - 1.
#SBATCH --mem=0              # Use all available memory
# IMPORTANT: Do not set --ntasks as it is automatically inferred from --ntasks-per-node.
# Activate your virtual environment.
# The command might differ depending on your setup.
# For conda environments, use `conda activate my_env`.
source .venv/bin/activate
# On your cluster you might need to set the network interface:
# export NCCL_SOCKET_IFNAME=^docker0,lo
# Might need to load the latest CUDA version:
# module load NCCL/2.4.7-1-cuda.10.0
# Start the training.
srun lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet50"
# The following arguments are automatically set based on the SLURM configuration:
# accelerator="gpu"
# devices=2
# num_workers=11
Then submit the training job to the SLURM cluster with:
sbatch my_train_slurm.sh