DINOv2 (beta 🔬)

DINOv2 is a state-of-the-art self-supervised learning method for training vision foundation models. It is optimized for large-scale models and datasets. DINOv2 pretrained models are effective across a wide range of tasks, including image classification, object detection, and segmentation. They are also known to generate high-quality features that can be used without fine-tuning the model.

Use DINOv2 in LightlyTrain

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment", 
        data="my_data_dir",
        model="dinov2_vit/vitb14_pretrain",
        method="dinov2",
    )

The same run from the command line:

lightly-train train out=out/my_experiment data=my_data_dir model="dinov2_vit/vitb14_pretrain" method="dinov2"

The following models are available for DINOv2 pretraining:

  • dinov2_vit/vits14

  • dinov2_vit/vits14_pretrain

  • dinov2_vit/vitb14

  • dinov2_vit/vitb14_pretrain

  • dinov2_vit/vitl14

  • dinov2_vit/vitl14_pretrain

  • dinov2_vit/vitg14

  • dinov2_vit/vitg14_pretrain

Models with a _pretrain suffix are pretrained by Meta.

What’s Under the Hood

DINOv2 combines the strengths of DINO and iBOT, two previous self-supervised learning methods. Following DINO, it trains a student network to match the output of a momentum-averaged teacher network without labels. It also incorporates the masked image modeling loss from iBOT, which helps the model learn strong local semantic features.
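As a rough illustration of the student-teacher idea described above, the following pure-Python sketch shows a DINO-style cross-entropy between a sharpened teacher distribution and the student distribution, together with the momentum (exponential moving average) update of the teacher weights. It is deliberately simplified: it omits centering, multi-crop views, the iBOT masked-patch loss, and the KoLeo regularizer, and the function names are hypothetical rather than part of LightlyTrain.

```python
import math


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_norm for x in scaled]


def dino_style_loss(student_logits, teacher_logits,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy H(teacher, student); the teacher acts as a fixed target."""
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, teacher_temp)]
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))


def ema_update(teacher_weights, student_weights, momentum=0.992):
    """Move each teacher weight a small step toward the student weight."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


# The loss is small when the student agrees with the teacher and large otherwise.
teacher = [2.0, 0.0, 0.0]
loss_aligned = dino_style_loss([2.0, 0.0, 0.0], teacher)
loss_misaligned = dino_style_loss([0.0, 2.0, 0.0], teacher)
```

Because the teacher temperature is lower than the student temperature, the teacher distribution is sharper, which gives the student a confident target to match.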

Lightly Recommendations

  • Models: DINOv2 can only be used with ViTs. If you want to use a different model, we recommend first pretraining a ViT with DINOv2 and then distilling the knowledge of the ViT into your model of choice with the distillation method.

  • Batch Size: We recommend a batch size of around 3072 for DINOv2, as suggested in the original paper.

  • Number of Epochs: We recommend between 100 and 300 epochs. However, DINOv2 benefits from longer schedules and may still improve when trained for more than 300 epochs.

  • Large Datasets: DINOv2 is optimized for large datasets. We recommend at least 1 million images for training from scratch.
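The recommendations above can be collected into a small configuration helper. This is a sketch under assumptions: `recommended_dinov2_kwargs` and `run_training` are hypothetical helpers, the choice to fall back to Meta-pretrained weights below one million images is an inference from the last bullet, and `epochs` and `batch_size` are assumed to be accepted by `lightly_train.train` as keyword arguments. Check the LightlyTrain API reference before relying on it.

```python
def recommended_dinov2_kwargs(out, data, num_images, epochs=300, batch_size=3072):
    """Hypothetical helper: build train() keyword arguments from the recommendations."""
    # Training from scratch is recommended only with at least ~1M images,
    # so smaller datasets start from weights pretrained by Meta (assumption).
    from_scratch = num_images >= 1_000_000
    model = "dinov2_vit/vitb14" if from_scratch else "dinov2_vit/vitb14_pretrain"
    return {
        "out": out,
        "data": data,
        "model": model,
        "method": "dinov2",
        "epochs": epochs,          # assumed keyword argument of lightly_train.train
        "batch_size": batch_size,  # assumed keyword argument of lightly_train.train
    }


def run_training(kwargs):
    """Start training; lightly_train is imported lazily so the helper above stays testable."""
    import lightly_train

    lightly_train.train(**kwargs)
```

Calling `run_training(recommended_dinov2_kwargs("out/my_experiment", "my_data_dir", num_images=50_000))` would then start a run that begins from the pretrained ViT-B/14 weights.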

Default Method Arguments

The following are the default method arguments for DINOv2. To learn how you can override these settings, see Method Arguments.

{
    "batch_norm": false,
    "center_method": "softmax",
    "center_momentum": 0.9,
    "dino_bottleneck_dim": 256,
    "dino_loss_weight": 1.0,
    "gradient_clip_val": 3.0,
    "hidden_dim": 2048,
    "ibot_bottleneck_dim": 256,
    "ibot_loss_weight": 1.0,
    "ibot_separate_head": false,
    "koleo_loss_weight": 0.1,
    "layerwise_decay": 0.9,
    "mask_probability": 0.5,
    "mask_ratio_max": 0.5,
    "mask_ratio_min": 0.1,
    "min_lr": 1e-06,
    "momentum_end": 1.0,
    "momentum_start": 0.992,
    "norm_last_layer": false,
    "output_dim": 65536,
    "patch_embed_lr_multiplier": 0.2,
    "student_freeze_last_layer_epochs": 1,
    "student_temp": 0.1,
    "teacher_temp_end": 0.07,
    "teacher_temp_start": 0.04,
    "teacher_temp_warmup_epochs": 30,
    "warmup_epochs": 10,
    "weight_decay_end": 0.4,
    "weight_decay_start": "auto"
}
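Several of these defaults come in start/end pairs, which suggests they are scheduled over the course of training. The sketch below shows how such schedules are commonly computed in DINO-style training: a linear warmup for the teacher temperature and a cosine schedule for the teacher momentum. The schedule shapes and the function names are assumptions for illustration, not a description of LightlyTrain's internals.

```python
import math


def teacher_temp(epoch, start=0.04, end=0.07, warmup_epochs=30):
    """Linear warmup from start to end, then constant (assumed schedule shape)."""
    if epoch >= warmup_epochs:
        return end
    return start + (end - start) * epoch / warmup_epochs


def teacher_momentum(step, total_steps, start=0.992, end=1.0):
    """Cosine interpolation from start to end (assumed schedule shape)."""
    progress = step / max(1, total_steps)
    return end - (end - start) * (math.cos(math.pi * progress) + 1.0) / 2.0


# Print a few values across a hypothetical 100-epoch run.
for epoch in (0, 15, 30, 100):
    print(epoch, round(teacher_temp(epoch), 4), round(teacher_momentum(epoch, 100), 5))
```

With `momentum_end` set to 1.0, the teacher changes more and more slowly as training progresses and is effectively frozen at the very end.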

Default Image Transform Arguments

The following are the default transform arguments for DINOv2. To learn how you can override these settings, see Configuring Image Transforms.

Default Image Transforms
{
    "color_jitter": {
        "brightness": 0.8,
        "contrast": 0.8,
        "hue": 0.2,
        "prob": 0.8,
        "saturation": 0.4,
        "strength": 0.5
    },
    "gaussian_blur": {
        "blur_limit": 0,
        "prob": 1.0,
        "sigmas": [
            0.1,
            2.0
        ]
    },
    "global_view_1": {
        "gaussian_blur": {
            "blur_limit": 0,
            "prob": 0.1,
            "sigmas": [
                0.1,
                2.0
            ]
        },
        "solarize": {
            "prob": 0.2,
            "threshold": 0.5
        }
    },
    "image_size": [
        224,
        224
    ],
    "local_view": {
        "gaussian_blur": {
            "blur_limit": 0,
            "prob": 0.5,
            "sigmas": [
                0.1,
                2.0
            ]
        },
        "num_views": 8,
        "random_resize": {
            "max_scale": 0.32,
            "min_scale": 0.05
        },
        "view_size": [
            98,
            98
        ]
    },
    "normalize": {
        "mean": [
            0.485,
            0.456,
            0.406
        ],
        "std": [
            0.229,
            0.224,
            0.225
        ]
    },
    "random_flip": {
        "horizontal_prob": 0.5,
        "vertical_prob": 0.0
    },
    "random_gray_scale": 0.2,
    "random_resize": {
        "max_scale": 1.0,
        "min_scale": 0.32
    },
    "random_rotation": null,
    "solarize": null
}
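To relate the view sizes above to what the ViT actually sees, it can help to count patches per view. The "14" in model names such as vitb14 is the patch size in pixels, so both default view sizes divide evenly into patches. The helper below is a hypothetical illustration, not a LightlyTrain function.

```python
def num_patches(view_size, patch_size=14):
    """Number of non-overlapping patches a ViT extracts from one view."""
    height, width = view_size
    if height % patch_size or width % patch_size:
        raise ValueError(
            f"view size {view_size} is not divisible by patch size {patch_size}"
        )
    return (height // patch_size) * (width // patch_size)


global_patches = num_patches((224, 224))  # 16 x 16 patches
local_patches = num_patches((98, 98))     # 7 x 7 patches
print(global_patches, local_patches)
```

When changing `image_size` or `view_size`, keeping both dimensions a multiple of the patch size avoids partial patches at the image border.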