lightly.models

The lightly.models package provides model implementations.

Note that the high-level building blocks will be deprecated with lightly version 1.3.0. Instead, use low-level building blocks to build the models yourself.

Example implementations for all models can be found here: Model Examples

The package contains an implementation of the commonly used ResNet and adaptations of the architecture which make self-supervised learning simpler.

The package also hosts the Lightly model zoo - a list of downloadable ResNet checkpoints.

.resnet

Custom ResNet Implementation

Note that the architecture we present here differs from the one used in torchvision. We replace the first 7x7 convolution by a 3x3 convolution to make the model faster and run better on smaller input image resolutions.

Furthermore, we introduce a resnet-9 variant for extra small models. These can run for example on a microcontroller with 100kBytes of storage.

class lightly.models.resnet.BasicBlock(in_planes: int, planes: int, stride: int = 1, num_splits: int = 0)

Implementation of the ResNet Basic Block.

in_planes: Number of input channels.

planes: Number of channels.

stride: Stride of the first convolutional.

forward(x: Tensor) → Tensor

Forward pass through basic ResNet block.

Parameters: x – Tensor of shape bsz x channels x W x H
Returns: Tensor of shape bsz x channels x W x H

class lightly.models.resnet.Bottleneck(in_planes: int, planes: int, stride: int = 1, num_splits: int = 0)

Implementation of the ResNet Bottleneck Block.

in_planes: Number of input channels.

planes: Number of channels.

stride: Stride of the first convolutional.

forward(x: Tensor) → Tensor

Forward pass through bottleneck ResNet block.

Parameters: x – Tensor of shape bsz x channels x W x H
Returns: Tensor of shape bsz x channels x W x H

class lightly.models.resnet.ResNet(block: type[lightly.models.resnet.BasicBlock] = <class 'lightly.models.resnet.BasicBlock'>, layers: ~typing.List[int] = [2, 2, 2, 2], num_classes: int = 10, width: float = 1.0, num_splits: int = 0)

ResNet implementation.

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun Deep Residual Learning for Image Recognition. arXiv:1512.03385

block: ResNet building block type.

layers: List of blocks per layer.

num_classes: Number of classes in final softmax layer.

width: Multiplier for ResNet width.

forward(x: Tensor) → Tensor

Forward pass through ResNet.

Parameters: x – Tensor of shape bsz x channels x W x H
Returns: Output tensor of shape bsz x num_classes

lightly.models.resnet.ResNetGenerator(name: str = 'resnet-18', width: float = 1, num_classes: int = 10, num_splits: int = 0) → ResNet

Builds and returns the specified ResNet.

Parameters

name – ResNet version from resnet-{9, 18, 34, 50, 101, 152}.
width – ResNet width.
num_classes – Output dim of the last layer.
num_splits – Number of splits to use for SplitBatchNorm (for MoCo model). Increase this number to simulate multi-gpu behavior. E.g. num_splits=8 simulates a 8-GPU cluster. num_splits=0 uses normal PyTorch BatchNorm.

Returns

ResNet as nn.Module.

Examples

>>> # binary classifier with ResNet-34
>>> from lightly.models import ResNetGenerator
>>> resnet = ResNetGenerator('resnet-34', num_classes=2)

.zoo

Lightly Model Zoo

lightly.models.zoo.checkpoints() → List[str]

Returns the Lightly model zoo as a list of checkpoints.

Checkpoints:

ResNet-9:: SimCLR with width = 0.0625 and num_ftrs = 16
ResNet-9:: SimCLR with width = 0.125 and num_ftrs = 16
ResNet-18:: SimCLR with width = 1.0 and num_ftrs = 16
ResNet-18:: SimCLR with width = 1.0 and num_ftrs = 32
ResNet-34:: SimCLR with width = 1.0 and num_ftrs = 16
ResNet-34:: SimCLR with width = 1.0 and num_ftrs = 32

Returns: A list of available checkpoints as URLs.

The lightly.models.modules package provides reusable modules.

This package contains reusable modules such as the NNmemoryBankModule which can be combined with any lightly model.

.nn_memory_bank

Nearest Neighbour Memory Bank Module

class lightly.models.modules.nn_memory_bank.NNMemoryBankModule(size: Union[int, Sequence[int]] = 65536)

Nearest Neighbour Memory Bank implementation

This class implements a nearest neighbour memory bank as described in the NNCLR paper[0]. During the forward pass we return the nearest neighbour from the memory bank.

[0] NNCLR, 2021, https://arxiv.org/abs/2104.14548

size: Size of the memory bank as (num_features, dim) tuple. If num_features is 0 then the memory bank is disabled. Deprecated: If only a single integer is passed, it is interpreted as the number of features and the feature dimension is inferred from the first batch stored in the memory bank. Leaving out the feature dimension might lead to errors in distributed training.

Examples

>>> model = NNCLR(backbone)
>>> criterion = NTXentLoss(temperature=0.1)
>>>
>>> nn_replacer = NNmemoryBankModule(size=(2 ** 16, 128))
>>>
>>> # forward pass
>>> (z0, p0), (z1, p1) = model(x0, x1)
>>> z0 = nn_replacer(z0.detach(), update=False)
>>> z1 = nn_replacer(z1.detach(), update=True)
>>>
>>> loss = 0.5 * (criterion(z0, p1) + criterion(z1, p0))

forward(output: Tensor, update: bool = False) → Tensor

Returns nearest neighbour of output tensor from memory bank

Parameters

output – The torch tensor for which you want the nearest neighbour
update – If True updated the memory bank by adding output to it

.heads

Projection and Prediction Heads for Self-supervised Learning

class lightly.models.modules.heads.BYOLPredictionHead(input_dim: int = 256, hidden_dim: int = 4096, output_dim: int = 256)

Prediction head used for BYOL.

“This MLP consists in a linear layer with output size 4096 followed by batch normalization, rectified linear units (ReLU), and a final linear layer with output dimension 256.” [0]

[0]: BYOL, 2020, https://arxiv.org/abs/2006.07733

class lightly.models.modules.heads.BYOLProjectionHead(input_dim: int = 2048, hidden_dim: int = 4096, output_dim: int = 256)

Projection head used for BYOL.

“This MLP consists in a linear layer with output size 4096 followed by batch normalization, rectified linear units (ReLU), and a final linear layer with output dimension 256.” [0]

[0]: BYOL, 2020, https://arxiv.org/abs/2006.07733

class lightly.models.modules.heads.BarlowTwinsProjectionHead(input_dim: int = 2048, hidden_dim: int = 8192, output_dim: int = 8192)

Projection head used for Barlow Twins.

“The projector network has three linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and rectified linear units.” [0]

[0]: 2021, Barlow Twins, https://arxiv.org/abs/2103.03230

class lightly.models.modules.heads.DINOProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, bottleneck_dim: int = 256, output_dim: int = 65536, batch_norm: bool = False, freeze_last_layer: int = -1, norm_last_layer: bool = True)

Projection head used in DINO.

“The projection head consists of a 3-layer multi-layer perceptron (MLP) with hidden dimension 2048 followed by l2 normalization and a weight normalized fully connected layer with K dimensions, which is similar to the design from SwAV [1].” [0]

[0]: DINO, 2021, https://arxiv.org/abs/2104.14294
[1]: SwAV, 2020, https://arxiv.org/abs/2006.09882

input_dim: The input dimension of the head.

hidden_dim: The hidden dimension.

bottleneck_dim: Dimension of the bottleneck in the last layer of the head.

output_dim: The output dimension of the head.

batch_norm: Whether to use batch norm or not. Should be set to False when using a vision transformer backbone.

freeze_last_layer: Number of epochs during which we keep the output layer fixed. Typically doing so during the first epoch helps training. Try increasing this value if the loss does not decrease.

norm_last_layer: Whether or not to weight normalize the last layer of the DINO head. Not normalizing leads to better performance but can make the training unstable.

cancel_last_layer_gradients(current_epoch: int) → None: Cancel last layer gradients to stabilize the training.

forward(x: Tensor) → Tensor: Computes one forward pass through the head.

class lightly.models.modules.heads.DINOv2ProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, bottleneck_dim: int = 256, output_dim: int = 65536, batch_norm: bool = False)

forward(x: Tensor) → Tensor

Computes one forward pass through the projection head.

Parameters: x – Input of shape bsz x num_ftrs.

class lightly.models.modules.heads.DenseCLProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, output_dim: int = 128)

Projection head for DenseCL [0].

The projection head consists of a 2-layer MLP. It can be used for global and local features.

[0]: 2021, DenseCL: https://arxiv.org/abs/2011.09157

class lightly.models.modules.heads.MMCRProjectionHead(input_dim: int = 2048, hidden_dim: int = 8192, output_dim: int = 512, num_layers: int = 2, batch_norm: bool = True, use_bias: bool = False)

Projection head used for MMCR.

“Following Chen et al. (14), we append a small perceptron to the output of the average pooling layer of the ResNet so that zi = g(h(xi)), where h is the ResNet and g is the MLP.” [0]

[0]: MMCR, 2023, https://arxiv.org/abs/2303.03307

class lightly.models.modules.heads.MSNProjectionHead(input_dim: int = 768, hidden_dim: int = 2048, output_dim: int = 256)

Projection head for MSN [0].

“We train with a 3-layer projection head with output dimension 256 and batch-normalization at the input and hidden layers..” [0]

Code inspired by [1].

[0]: Masked Siamese Networks, 2022, https://arxiv.org/abs/2204.07141
[1]: https://github.com/facebookresearch/msn

input_dim: Input dimension, default value 768 is for a ViT base model.

hidden_dim: Hidden dimension.

output_dim: Output dimension.

class lightly.models.modules.heads.MoCoProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, output_dim: int = 128, num_layers: int = 2, batch_norm: bool = False)

Projection head used for MoCo.

“(…) we replace the fc head in MoCo with a 2-layer MLP head (hidden layer 2048-d, with ReLU)” [1]

“The projection head is a 3-layer MLP. The prediction head is a 2-layer MLP. The hidden layers of both MLPs are 4096-d and are with ReLU; the output layers of both MLPs are 256-d, without ReLU. In MoCo v3, all layers in both MLPs have BN” [2]

[0]: MoCo v1, 2020, https://arxiv.org/abs/1911.05722
[1]: MoCo v2, 2020, https://arxiv.org/abs/2003.04297
[2]: MoCo v3, 2021, https://arxiv.org/abs/2104.02057

class lightly.models.modules.heads.NNCLRPredictionHead(input_dim: int = 256, hidden_dim: int = 4096, output_dim: int = 256)

Prediction head used for NNCLR.

“The architecture of the prediction MLP g is 2 fully-connected layers of size [4096,d]. The hidden layer of the prediction MLP is followed by batch-norm and ReLU. The last layer has no batch-norm or activation.” [0]

[0]: NNCLR, 2021, https://arxiv.org/abs/2104.14548

class lightly.models.modules.heads.NNCLRProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, output_dim: int = 256)

Projection head used for NNCLR.

“The architectureof the projection MLP is 3 fully connected layers of sizes [2048,2048,d] where d is the embedding size used to apply the loss. We use d = 256 in the experiments unless otherwise stated. All fully-connected layers are followed by batch-normalization [36]. All the batch-norm layers except the last layer are followed by ReLU activation.” [0]

[0]: NNCLR, 2021, https://arxiv.org/abs/2104.14548

class lightly.models.modules.heads.ProjectionHead(blocks: Sequence[Union[Tuple[int, int, Optional[Module], Optional[Module]], Tuple[int, int, Optional[Module], Optional[Module], bool]]])

Base class for all projection and prediction heads.

Parameters: blocks – List of tuples, each denoting one block of the projection head MLP. Each tuple reads (in_features, out_features, batch_norm_layer, non_linearity_layer, use_bias (optional)).

Examples

>>> # the following projection head has two blocks
>>> # the first block uses batch norm an a ReLU non-linearity
>>> # the second block is a simple linear layer
>>> projection_head = ProjectionHead([
>>>     (256, 256, nn.BatchNorm1d(256), nn.ReLU()),
>>>     (256, 128, None, None)
>>> ])

forward(x: Tensor) → Tensor

Computes one forward pass through the projection head.

Parameters: x – Input of shape bsz x num_ftrs.

class lightly.models.modules.heads.SMoGPredictionHead(input_dim: int = 128, hidden_dim: int = 2048, output_dim: int = 128)

Prediction head used for SMoG.

“The two kinds of head are both a two-layer MLP and their hidden layer is followed by a BatchNorm [28] and an activation function. (…) The output layer of projection head also has BN” [0]

[0]: SMoG, 2022, https://arxiv.org/pdf/2207.06167.pdf

class lightly.models.modules.heads.SMoGProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, output_dim: int = 128)

Projection head used for SMoG.

“The two kinds of head are both a two-layer MLP and their hidden layer is followed by a BatchNorm [28] and an activation function. (…) The output layer of projection head also has BN” [0]

[0]: SMoG, 2022, https://arxiv.org/pdf/2207.06167.pdf

class lightly.models.modules.heads.SMoGPrototypes(group_features: Tensor, beta: float)

SMoG prototypes module for synchronous momentum grouping.

Parameters

group_features – Tensor containing the group features.
beta – Beta parameter for momentum updating.

assign_groups(x: Tensor) → Tensor

Assigns each representation in x to a group based on cosine similarity.

Parameters: x – Tensor of shape (bsz, dim).
Returns: Tensor of shape (bsz,) indicating group assignments.

forward(x: Tensor, group_features: Tensor, temperature: float = 0.1) → Tensor

Computes the logits for given model outputs and group features.

Parameters

x – Tensor of shape bsz x dim.
group_features – Momentum updated group features of shape n_groups x dim.
temperature – Temperature parameter for calculating the logits.

Returns

The computed logits.

get_updated_group_features(x: Tensor) → Tensor

Performs the synchronous momentum update of the group vectors.

Parameters: x – Tensor of shape bsz x dim.
Returns: The updated group features.

set_group_features(x: Tensor) → None: Sets the group features and asserts they don’t require gradient.

class lightly.models.modules.heads.SimCLRProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, output_dim: int = 128, num_layers: int = 2, batch_norm: bool = True)

Projection head used for SimCLR.

“We use a MLP with one hidden layer to obtain zi = g(h) = W_2 * σ(W_1 * h) where σ is a ReLU non-linearity.” [0]

“We use a 3-layer MLP projection head on top of a ResNet encoder.” [1]

[0] SimCLR v1, 2020, https://arxiv.org/abs/2002.05709
[1] SimCLR v2, 2020, https://arxiv.org/abs/2006.10029

class lightly.models.modules.heads.SimSiamPredictionHead(input_dim: int = 2048, hidden_dim: int = 512, output_dim: int = 2048)

Prediction head used for SimSiam.

“The prediction MLP (h) has BN applied to its hidden fc layers. Its output fc does not have BN (…) or ReLU. This MLP has 2 layers.” [0]

[0]: SimSiam, 2020, https://arxiv.org/abs/2011.10566

class lightly.models.modules.heads.SimSiamProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, output_dim: int = 2048)

Projection head used for SimSiam.

“The projection MLP (in f) has BN applied to each fully-connected (fc) layer, including its output fc. Its output fc has no ReLU. The hidden fc is 2048-d. This MLP has 3 layers.” [0]

[0]: SimSiam, 2020, https://arxiv.org/abs/2011.10566

class lightly.models.modules.heads.SwaVProjectionHead(input_dim: int = 2048, hidden_dim: int = 2048, output_dim: int = 128)

Projection head used for SwaV.

[0]: SwAV, 2020, https://arxiv.org/abs/2006.09882

class lightly.models.modules.heads.SwaVPrototypes(input_dim: int = 128, n_prototypes: Union[List[int], int] = 3000, n_steps_frozen_prototypes: int = 0)

Multihead Prototypes used for SwaV.

Each output feature is assigned to a prototype, SwaV solves the swapped prediction problem where the features of one augmentation are used to predict the assigned prototypes of the other augmentation.

input_dim: The input dimension of the head.

n_prototypes: Number of prototypes.

n_steps_frozen_prototypes: Number of steps during which we keep the prototypes fixed.

Examples

>>> # use features with 128 dimensions and 512 prototypes
>>> prototypes = SwaVPrototypes(128, 512)
>>>
>>> # pass batch through backbone and projection head.
>>> features = model(x)
>>> features = nn.functional.normalize(features, dim=1, p=2)
>>>
>>> # logits has shape bsz x 512
>>> logits = prototypes(features)

forward(x: Tensor, step: Optional[int] = None) → Union[Tensor, List[Tensor]]

Forward pass of the SwaVPrototypes module.

Parameters

x – Input tensor.
step – Current training step.

Returns

The logits after passing through the prototype heads. Returns a single tensor if there’s one prototype head, otherwise returns a list of tensors.

normalize() → None: Normalizes the prototypes so that they are on the unit sphere.

class lightly.models.modules.heads.TiCoProjectionHead(input_dim: int = 2048, hidden_dim: int = 4096, output_dim: int = 256)

Projection head used for TiCo.

“This MLP consists in a linear layer with output size 4096 followed by batch normalization, rectified linear units (ReLU), and a final linear layer with output dimension 256.” [0]

[0]: TiCo, 2022, https://arxiv.org/pdf/2206.10698.pdf

class lightly.models.modules.heads.VICRegProjectionHead(input_dim: int = 2048, hidden_dim: int = 8192, output_dim: int = 8192, num_layers: int = 3)

Projection head used for VICReg.

“The projector network has three linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and rectified linear units.” [0]

[0]: 2022, VICReg, https://arxiv.org/pdf/2105.04906.pdf

class lightly.models.modules.heads.VicRegLLocalProjectionHead(input_dim: int = 2048, hidden_dim: int = 8192, output_dim: int = 8192)

Projection head used for the local head of VICRegL.

“The projector network has three linear layers. The first two layers of the projector are followed by a batch normalization layer and rectified linear units.” [0]

[0]: 2022, VICRegL, https://arxiv.org/abs/2210.01571