Distributed Training

Lightly SSL supports training your model on multiple GPUs using PyTorch Lightning and Distributed Data Parallel (DDP) training. You can find reference implementations for all our models in the Models section.

Training with multiple GPUs is also available from the command line; see Train a model using the CLI.

For details on distributed training, we recommend the PyTorch Lightning documentation on multi-GPU training and the PyTorch DDP documentation.

There are different levels of synchronization in distributed training. For example, one can synchronize only the gradients during the backpropagation step. It is also possible to synchronize special layers such as batch norm so that they compute their statistics over the batches from all GPUs. This additional synchronization can improve the final model accuracy at the cost of longer training times due to the communication overhead.
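With PyTorch Lightning, gradient synchronization is handled automatically by the DDP strategy, while batch norm synchronization is enabled with a single flag on the Trainer. The following is a minimal sketch, assuming a recent PyTorch Lightning version; `MyLightningModule` and `dataloader` are placeholders for your own module and data loader.

```python
# Minimal sketch: multi-GPU training with PyTorch Lightning and DDP.
# Assumes a recent PyTorch Lightning version; `MyLightningModule` and
# `dataloader` are placeholders for your own module and data loader.
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=100,
    accelerator="gpu",
    devices=4,            # number of GPUs
    strategy="ddp",       # gradients are synchronized in the backward pass
    sync_batchnorm=True,  # batch norm statistics are computed across all GPUs
)
# trainer.fit(MyLightningModule(), dataloader)
```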

We ran some simple experiments and share the results here:

Distributed Training Benchmarks

  • Dataset: CIFAR-10

  • Batch size: 512

  • Epochs: 100

Distributed training is done with DDP using PyTorch Lightning, and the batch size is divided by the number of GPUs.

For distributed training we also evaluate whether Synchronized BatchNorm helps and what happens if we gather features from all GPUs before calculating the loss (Gather Distributed). A code sketch of the Gather Distributed setting follows the list below.

  • Synchronized BatchNorm affects all models

  • Gather Distributed only has an effect on SimCLR, BarlowTwins and SwaV.
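The sketch below shows the per-GPU batch size and the Gather Distributed setting. It assumes the `gather_distributed` argument of the SimCLR loss in `lightly.loss` (the BarlowTwins and SwaV losses accept a similar flag); the batch size numbers match the experiments above.

```python
# Minimal sketch: per-GPU batch size and gathering features before the loss.
# Assumes the `gather_distributed` argument of lightly.loss.NTXentLoss
# (the SimCLR loss); batch size numbers match the experiments above.
from lightly.loss import NTXentLoss

global_batch_size = 512
num_gpus = 4
batch_size_per_gpu = global_batch_size // num_gpus  # 128 samples per GPU

# Gather Distributed: collect features from all GPUs before computing the loss.
criterion = NTXentLoss(gather_distributed=True)
```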

Single GPU

Model       | Test Accuracy | GPUs | Time    | Peak GPU usage
------------|---------------|------|---------|---------------
MoCo        | 0.77          | 1    | 329 min | 11.9 GBytes
SimCLR      | 0.79          | 1    | 208 min | 11.9 GBytes
SimSiam     | 0.68          | 1    | 199 min | 12.0 GBytes
BarlowTwins | 0.64          | 1    | 197 min | 7.6 GBytes
BYOL        | 0.76          | 1    | 232 min | 7.8 GBytes
SwaV        | 0.77          | 1    | 199 min | 7.5 GBytes

Multi-GPU with Synchronized BatchNorm and Gather Distributed

Model       | Test Accuracy | GPUs | Time    | Speedup | Peak GPU usage
------------|---------------|------|---------|---------|---------------
MoCo        | 0.77          | 4    | 105 min | 3.13x   | 2.2 GBytes
SimCLR      | 0.75          | 4    | 77 min  | 2.70x   | 2.1 GBytes
SimSiam     | 0.67          | 4    | 79 min  | 2.51x   | 2.3 GBytes
BarlowTwins | 0.71          | 4    | 93 min  | 2.03x   | 2.3 GBytes
BYOL        | 0.75          | 4    | 91 min  | 2.55x   | 2.3 GBytes
SwaV        | 0.77          | 4    | 78 min  | 2.55x   | 2.3 GBytes

Multi-GPU with Gather Distributed

Model       | Test Accuracy | GPUs | Time   | Speedup | Peak GPU usage
------------|---------------|------|--------|---------|---------------
MoCo        | 0.76          | 4    | 89 min | 3.69x   | 2.2 GBytes
SimCLR      | 0.77          | 4    | 73 min | 2.75x   | 2.1 GBytes
SimSiam     | 0.67          | 4    | 75 min | 2.65x   | 2.3 GBytes
BarlowTwins | 0.71          | 4    | 82 min | 2.40x   | 2.3 GBytes
BYOL        | 0.76          | 4    | 91 min | 2.55x   | 2.3 GBytes
SwaV        | 0.75          | 4    | 74 min | 2.69x   | 2.3 GBytes

Observations

  • 4 GPUs are 2-3x faster than 1 GPU.

  • With 4 GPUs a single epoch takes less than 40 seconds, which means that a lot of time is spent between epochs (starting workers, running evaluation). The benefit from using more GPUs could therefore be even greater with a larger dataset.

  • The slowdown from Synchronized BatchNorm is small.