Benchmarks
We show benchmarks of different self-supervised learning models and their performance on public datasets.
We regularly update the benchmarks for the following datasets:
ImageNet
We use the ImageNet-1k ILSVRC2012 split provided here: https://image-net.org/download.php.
Self-supervised training of a SimCLR model for 100 epochs with a total batch size of 256 takes about two days on two GeForce RTX 4090 GPUs. You can reproduce the results with the code at benchmarks/imagenet/resnet50.
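As a rough orientation, a minimal SimCLR training loop with the lightly API could look like the sketch below. Paths, hyperparameters, and the plain SGD schedule are illustrative assumptions; the benchmark code above is the reference implementation.

```python
# Minimal SimCLR sketch (illustrative; not the exact benchmark code).
import torch
import torchvision
from torch import nn

from lightly.data import LightlyDataset
from lightly.loss import NTXentLoss
from lightly.models.modules import SimCLRProjectionHead
from lightly.transforms import SimCLRTransform


class SimCLR(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50()
        # Strip the classification head; keep the convolutional backbone.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.projection_head = SimCLRProjectionHead(2048, 2048, 128)

    def forward(self, x):
        h = self.backbone(x).flatten(start_dim=1)
        return self.projection_head(h)


device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimCLR().to(device)

# SimCLRTransform returns two augmented views per image.
transform = SimCLRTransform(input_size=224)
dataset = LightlyDataset("path/to/imagenet/train", transform=transform)  # hypothetical path
dataloader = torch.utils.data.DataLoader(
    dataset, batch_size=256, shuffle=True, num_workers=8, drop_last=True
)

criterion = NTXentLoss(temperature=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.3, momentum=0.9, weight_decay=1e-6)

for epoch in range(100):
    for views, _, _ in dataloader:
        x0, x1 = views[0].to(device), views[1].to(device)
        loss = criterion(model(x0), model(x1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```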
Model | Backbone | Batch Size | Epochs | Linear Top1 | Linear Top5 | Finetune Top1 | Finetune Top5 | KNN Top1 | KNN Top5 | Tensorboard | Checkpoint
---|---|---|---|---|---|---|---|---|---|---|---
SimCLR | Res50 | 256 | 100 | 63.2 | 85.3 | N/A | N/A | 44.9 | 74.2 | |
SwAV | Res50 | 256 | 100 | 67.2 | 88.1 | 75.4 | 92.7 | 49.5 | 78.6 | |
ImageNette
We use the ImageNette dataset provided here: https://github.com/fastai/imagenette
For our benchmarks we use the 160px version and resize the input images to 128 pixels. Training a single model for 800 epochs on an A6000 GPU takes about 3-5 hours.
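For reference, the 128 pixel input size would be configured along the following lines; this is an illustrative sketch using lightly's SimCLR transform, while each method in the benchmark uses its own transform.

```python
# Illustrative: 128px inputs for the ImageNette runs (the benchmark script is authoritative).
from lightly.transforms import SimCLRTransform

transform = SimCLRTransform(input_size=128)
```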
Model | Batch Size | Epochs | KNN Test Accuracy | Runtime | GPU Memory
---|---|---|---|---|---
BarlowTwins | 256 | 800 | 0.852 | 298.5 Min | 4.0 GByte
BYOL | 256 | 800 | 0.887 | 214.8 Min | 4.3 GByte
DCL | 256 | 800 | 0.861 | 189.1 Min | 3.7 GByte
DCLW | 256 | 800 | 0.865 | 192.2 Min | 3.7 GByte
DINO (Res18) | 256 | 800 | 0.888 | 312.3 Min | 6.6 GByte
FastSiam | 256 | 800 | 0.873 | 299.6 Min | 7.3 GByte
MAE (ViT-S) | 256 | 800 | 0.610 | 248.2 Min | 4.4 GByte
MSN (ViT-S) | 256 | 800 | 0.828 | 515.5 Min | 14.7 GByte
MoCo | 256 | 800 | 0.874 | 231.7 Min | 4.3 GByte
NNCLR | 256 | 800 | 0.884 | 212.5 Min | 3.8 GByte
PMSN (ViT-S) | 256 | 800 | 0.822 | 505.8 Min | 14.7 GByte
SimCLR | 256 | 800 | 0.889 | 193.5 Min | 3.7 GByte
SimMIM (ViT-B32) | 256 | 800 | 0.343 | 446.5 Min | 9.7 GByte
SimSiam | 256 | 800 | 0.872 | 206.4 Min | 3.9 GByte
SwAV | 256 | 800 | 0.902 | 283.2 Min | 6.4 GByte
SwAVQueue | 256 | 800 | 0.890 | 282.7 Min | 6.4 GByte
SMoG | 256 | 800 | 0.788 | 232.1 Min | 2.6 GByte
TiCo | 256 | 800 | 0.856 | 177.8 Min | 2.5 GByte
VICReg | 256 | 800 | 0.845 | 205.6 Min | 4.0 GByte
VICRegL | 256 | 800 | 0.778 | 218.7 Min | 4.0 GByte
You can reproduce the benchmarks using the following script:
imagenette_benchmark.py
CIFAR-10
CIFAR-10 consists of 50k training images and 10k test images. We train the self-supervised models from scratch on the training data. At the end of every epoch we embed all training images and use the features for a kNN classifier with k=200 on the test set. The reported kNN test accuracy is the maximum accuracy reached over all epochs. All experiments use the same ResNet-18 backbone, and we disable the Gaussian blur augmentation due to the small image size.
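To make the evaluation protocol concrete, here is a simplified sketch of the kNN evaluation. The benchmark scripts implement their own weighted kNN classifier; scikit-learn is used here purely for illustration.

```python
# Simplified kNN evaluation sketch (the benchmarks use their own weighted kNN).
import torch
import torch.nn.functional as F
from sklearn.neighbors import KNeighborsClassifier


@torch.no_grad()
def knn_eval(backbone, train_loader, test_loader, k=200, device="cuda"):
    backbone.eval()

    def embed(loader):
        feats, labels = [], []
        for x, y in loader:  # assumes a plain (image, label) dataset
            f = backbone(x.to(device)).flatten(start_dim=1)
            feats.append(F.normalize(f, dim=1).cpu())
            labels.append(y)
        return torch.cat(feats).numpy(), torch.cat(labels).numpy()

    x_train, y_train = embed(train_loader)
    x_test, y_test = embed(test_loader)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(x_train, y_train)
    return knn.score(x_test, y_test)
```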
Note
The ResNet-18 backbone in this benchmark differs slightly from the torchvision variant: it starts with a 3x3 convolution with stride 1 and omits the initial MaxPool2d layer. This is a typical modification for CIFAR-10 benchmarks of SSL methods.
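A sketch of this modification, starting from the standard torchvision ResNet-18:

```python
# CIFAR-10 variant of ResNet-18: 3x3 stride-1 stem, no MaxPool2d.
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet18()
# Replace the 7x7 stride-2 stem with a 3x3 stride-1 convolution ...
resnet.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
# ... and drop the initial MaxPool2d, which would shrink 32x32 inputs too much.
resnet.maxpool = nn.Identity()
# For self-supervised training, the classification head is removed as well.
backbone = nn.Sequential(*list(resnet.children())[:-1])
```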
Model | Batch Size | Epochs | KNN Test Accuracy | Runtime | GPU Memory
---|---|---|---|---|---
BarlowTwins | 128 | 200 | 0.842 | 375.9 Min | 1.7 GByte
BYOL | 128 | 200 | 0.869 | 121.9 Min | 1.6 GByte
DCL | 128 | 200 | 0.844 | 102.2 Min | 1.5 GByte
DCLW | 128 | 200 | 0.833 | 100.4 Min | 1.5 GByte
DINO | 128 | 200 | 0.840 | 120.3 Min | 1.6 GByte
FastSiam | 128 | 200 | 0.906 | 164.0 Min | 2.7 GByte
MoCo | 128 | 200 | 0.838 | 128.8 Min | 1.7 GByte
NNCLR | 128 | 200 | 0.834 | 101.5 Min | 1.5 GByte
SimCLR | 128 | 200 | 0.847 | 97.7 Min | 1.5 GByte
SimSiam | 128 | 200 | 0.819 | 97.3 Min | 1.6 GByte
SwAV | 128 | 200 | 0.812 | 99.6 Min | 1.5 GByte
SMoG | 128 | 200 | 0.743 | 192.2 Min | 1.2 GByte
BarlowTwins | 512 | 200 | 0.819 | 153.3 Min | 5.1 GByte
BYOL | 512 | 200 | 0.868 | 108.3 Min | 5.6 GByte
DCL | 512 | 200 | 0.840 | 88.2 Min | 4.9 GByte
DCLW | 512 | 200 | 0.824 | 87.9 Min | 4.9 GByte
DINO | 512 | 200 | 0.813 | 108.6 Min | 5.0 GByte
FastSiam | 512 | 200 | 0.788 | 146.9 Min | 9.5 GByte
MoCo (*) | 512 | 200 | 0.847 | 112.2 Min | 5.6 GByte
NNCLR (*) | 512 | 200 | 0.815 | 88.1 Min | 5.0 GByte
SimCLR | 512 | 200 | 0.848 | 87.1 Min | 4.9 GByte
SimSiam | 512 | 200 | 0.764 | 87.8 Min | 5.0 GByte
SwAV | 512 | 200 | 0.842 | 88.7 Min | 4.9 GByte
SMoG | 512 | 200 | 0.686 | 110.0 Min | 3.4 GByte
BarlowTwins | 512 | 800 | 0.859 | 517.5 Min | 7.9 GByte
BYOL | 512 | 800 | 0.910 | 400.9 Min | 5.4 GByte
DCL | 512 | 800 | 0.874 | 334.6 Min | 4.9 GByte
DCLW | 512 | 800 | 0.871 | 333.3 Min | 4.9 GByte
DINO | 512 | 800 | 0.848 | 405.2 Min | 5.0 GByte
FastSiam | 512 | 800 | 0.902 | 582.0 Min | 9.5 GByte
MoCo (*) | 512 | 800 | 0.899 | 417.8 Min | 5.4 GByte
NNCLR (*) | 512 | 800 | 0.892 | 335.0 Min | 5.0 GByte
SimCLR | 512 | 800 | 0.879 | 331.1 Min | 4.9 GByte
SimSiam | 512 | 800 | 0.904 | 333.7 Min | 5.1 GByte
SwAV | 512 | 800 | 0.884 | 330.5 Min | 5.0 GByte
SMoG | 512 | 800 | 0.800 | 415.6 Min | 3.2 GByte
(*): The memory bank size was increased from 4096 to 8192 to prevent the memory bank from changing too quickly with the larger batch size.
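As a sketch, in lightly this corresponds to configuring a larger memory bank, for example for the NTXentLoss used by MoCo and for NNCLR's nearest-neighbour support set. The exact argument form may differ between lightly versions.

```python
# Illustrative sketch of the (*) adjustment; check your lightly version's API.
from lightly.loss import NTXentLoss
from lightly.models.modules import NNMemoryBankModule

# MoCo: contrastive loss drawing negatives from an enlarged memory bank.
moco_criterion = NTXentLoss(temperature=0.1, memory_bank_size=8192)

# NNCLR: nearest-neighbour support set, also enlarged to 8192 entries.
nnclr_memory_bank = NNMemoryBankModule(size=8192)
```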
We make the following observations from running the benchmark:
- Self-supervised models benefit from larger batch sizes and longer training.
- All models need around 3-4 hours to complete the 200 epoch benchmark and 11-13 hours for the 800 epoch benchmark.
- Memory consumption is roughly the same for all models.
- Some models, like MoCo or SwAV, learn quickly in the beginning and then plateau. Others, like SimSiam or NNCLR, take longer to warm up but catch up when training for 800 epochs. This can also be seen in the figure below.

Figure: kNN accuracy on the test set for models trained for 800 epochs with batch size 512.
Interactive plots of the 800 epoch accuracy and training loss are hosted on TensorBoard.
You can reproduce the benchmarks using the following script:
cifar10_benchmark.py
ImageNet100
ImageNet100 is a subset of the popular ImageNet-1k dataset. It consists of 100 classes with 1300 training and 50 validation images per class. We train the self-supervised models from scratch on the training data. At the end of every epoch we embed all training images and use the features for a kNN classifier with k=20 on the test set. The reported kNN test accuracy is the maximum accuracy reached over all epochs. All experiments use the same ResNet-18 backbone with the default ImageNet-1k training parameters from the respective papers.
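ImageNet100 is not shipped as a separate download. One common approach, sketched here as an assumption rather than the exact procedure used for these numbers, is to symlink 100 chosen class folders out of an existing ImageNet-1k ImageFolder directory:

```python
# Hypothetical helper: build an ImageNet100 split by symlinking 100 class
# folders from ImageNet-1k. class_names would hold 100 WordNet IDs,
# e.g. "n01558993"; adjust paths to your local layout.
import os


def make_imagenet100(imagenet_dir, out_dir, class_names):
    os.makedirs(out_dir, exist_ok=True)
    for name in class_names:
        os.symlink(os.path.join(imagenet_dir, name), os.path.join(out_dir, name))
```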
Model | Batch Size | Epochs | KNN Test Accuracy | Runtime | GPU Memory
---|---|---|---|---|---
BarlowTwins | 256 | 200 | 0.465 | 1319.3 Min | 11.3 GByte
BYOL | 256 | 200 | 0.439 | 1315.4 Min | 12.9 GByte
DINO | 256 | 200 | 0.518 | 1868.5 Min | 17.4 GByte
FastSiam | 256 | 200 | 0.559 | 1856.2 Min | 22.0 GByte
MoCo | 256 | 200 | 0.560 | 1314.2 Min | 13.1 GByte
NNCLR | 256 | 200 | 0.453 | 1198.6 Min | 11.8 GByte
SimCLR | 256 | 200 | 0.469 | 1207.7 Min | 11.3 GByte
SimSiam | 256 | 200 | 0.534 | 1175.0 Min | 11.1 GByte
SwAV | 256 | 200 | 0.678 | 1569.2 Min | 16.9 GByte
You can reproduce the benchmarks using the following script:
imagenet100_benchmark.py
Next Steps
Now that you understand the performance of the different lightly methods, how about looking into a tutorial to implement your favorite model?