DINOv2

This page describes how to use DINOv2 models with LightlyTrain.

DINOv2 models are Vision Transformers (ViTs) pretrained by Meta using the DINOv2 self-supervised learning method on large image datasets. They serve as high-quality feature extractors and strong backbones for downstream tasks such as object detection, segmentation, and image classification.
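The "14" in model names like dinov2/vitb14 is the ViT patch size: an input image is split into 14×14 pixel patches, and each patch becomes one token. As a quick illustrative sketch (plain Python, independent of the LightlyTrain API), the number of patch tokens for a given input size is:

```python
# DINOv2 model names encode the ViT patch size
# (e.g. "dinov2/vitb14" = ViT-Base with 14x14 pixel patches).
# A ViT produces (H / P) * (W / P) patch tokens for a HxW input.

def num_patch_tokens(height: int, width: int, patch_size: int = 14) -> int:
    """Number of patch tokens for an image of the given size.

    Both image sides must be divisible by the patch size.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)

print(num_patch_tokens(224, 224))  # 16 * 16 = 256 patch tokens
print(num_patch_tokens(518, 518))  # 37 * 37 = 1369 patch tokens
```

This is also why input resolutions for these models are typically chosen as multiples of 14.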

Note

DINOv2 models are released under the Apache 2.0 license.

Pretrain and Fine-tune a DINOv2 Model

Pretrain

DINOv2 models can be pretrained from scratch or starting from Meta's pretrained weights using the DINOv2 method. Below we provide minimal Python and command-line examples using dinov2/vitb14:

import lightly_train

if __name__ == "__main__":
    lightly_train.pretrain(
        out="out/my_experiment",                # Output directory.
        data="my_data_dir",                     # Directory with images.
        model="dinov2/vitb14",                  # Pass the DINOv2 model.
        method="dinov2",                        # Use the DINOv2 pretraining method.
    )
Or equivalently via the command line:

lightly-train pretrain out="out/my_experiment" data="my_data_dir" model="dinov2/vitb14" method="dinov2"

See DINOv2 method for more details on the pretraining method and its configuration options.
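As a sketch of how such options are typically passed, the pretraining call accepts additional keyword arguments. The specific argument names below (epochs, batch_size) are assumptions to be checked against the method documentation:

```python
import lightly_train

if __name__ == "__main__":
    lightly_train.pretrain(
        out="out/my_experiment",   # Output directory.
        data="my_data_dir",        # Directory with images.
        model="dinov2/vitb14",     # Pass the DINOv2 model.
        method="dinov2",           # Use the DINOv2 pretraining method.
        epochs=100,                # Assumed option: number of pretraining epochs.
        batch_size=128,            # Assumed option: global batch size.
    )
```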

Fine-tune

After pretraining, the exported DINOv2 backbone can be loaded into downstream task models via the backbone_weights argument. Refer to the task-specific pages for fine-tuning instructions and example code.
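A minimal sketch of this hand-off is shown below. Both the task-training function name and the exported checkpoint path are assumptions for illustration; take the real function name and filename from the task-specific pages:

```python
import lightly_train

if __name__ == "__main__":
    # Hypothetical downstream entry point -- the real function name and
    # its data format depend on the task (detection, segmentation, ...).
    lightly_train.train_semantic_segmentation(
        out="out/my_finetune_experiment",
        data="my_finetune_data_dir",
        model="dinov2/vitb14",
        # Load the backbone exported by the pretraining run above; the
        # exact exported filename is an assumption here.
        backbone_weights="out/my_experiment/exported_models/exported_last.pt",
    )
```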

Supported Models

Pretrained Models

The following models are pretrained by Meta and loaded automatically when used.

  • dinov2/vits14

  • dinov2/vitb14

  • dinov2/vitl14

  • dinov2/vitg14

Not Pretrained Models

The following models start from random initialization and are useful when pretraining from scratch with the DINOv2 method on a custom dataset without starting from Meta’s weights.

  • dinov2/vits14-notpretrained

  • dinov2/vitb14-notpretrained

  • dinov2/vitl14-notpretrained

  • dinov2/vitg14-notpretrained
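To pretrain fully from scratch, the pretraining call shown earlier stays the same except for the model name, e.g.:

```python
import lightly_train

if __name__ == "__main__":
    lightly_train.pretrain(
        out="out/my_experiment",
        data="my_data_dir",
        model="dinov2/vitb14-notpretrained",  # Random initialization, no Meta weights.
        method="dinov2",
    )
```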