DINOv3

This page describes how to use DINOv3 models with LightlyTrain.

DINOv3 models are Vision Transformer (ViT) and ConvNeXt models pretrained by Meta with the DINOv3 self-supervised learning method on large-scale datasets (LVD-1689M for the web-image models, SAT-493M for the satellite-image models). They are state-of-the-art vision foundation models and serve as strong backbones for downstream tasks such as object detection, segmentation, and image classification.

Note

DINOv3 models are released under the DINOv3 license. If you need a more permissive license, use DINOv2 models instead; they are available under Apache 2.0.

Pretrain and Fine-tune a DINOv3 Model

Pretrain

DINOv3 ViT-T/16 models (dinov3/vitt16 and dinov3/vitt16plus) are efficient tiny models trained by Lightly using the distillation method with DINOv3 ViT-L/16 as the teacher on ImageNet-1K. They are not part of Meta’s official DINOv3 release but follow the same architecture. The ViT-T architecture is based on the design proposed in Touvron et al., 2022.
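For a sense of scale, the parameter count of the tiny architecture can be estimated from the standard ViT-T hyperparameters (embed dim 192, depth 12, MLP ratio 4, patch size 16, as in Touvron et al., 2022). The sketch below ignores DINOv3-specific details such as register tokens, so it only gives the order of magnitude:

```python
def vit_param_count(embed_dim=192, depth=12, mlp_ratio=4, patch_size=16,
                    in_chans=3, num_patches=196):
    # Patch embedding: a patch_size x patch_size conv projection plus bias.
    patch_embed = patch_size * patch_size * in_chans * embed_dim + embed_dim
    hidden = mlp_ratio * embed_dim
    per_block = (
        (3 * embed_dim * embed_dim + 3 * embed_dim)  # fused qkv projection
        + (embed_dim * embed_dim + embed_dim)        # attention output projection
        + (embed_dim * hidden + hidden)              # MLP up-projection
        + (hidden * embed_dim + embed_dim)           # MLP down-projection
        + 2 * (2 * embed_dim)                        # two LayerNorms (weight + bias)
    )
    # Class token, learned position embeddings, final LayerNorm.
    extras = embed_dim + (num_patches + 1) * embed_dim + 2 * embed_dim
    return patch_embed + depth * per_block + extras

total = vit_param_count()
print(f"~{total / 1e6:.1f}M parameters")  # ~5.5M parameters
```

This matches the commonly cited ~5–6M parameter size of ViT-T backbones; the ViT-T/16-plus variant is somewhat larger.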

You can distill your own DINOv3 ViT-T/16 model from DINOv3 ViT-L/16 on your custom dataset as follows:

import lightly_train

if __name__ == "__main__":
    lightly_train.pretrain(
        out="out/my_experiment",                # Output directory.
        data="my_data_dir",                     # Directory with images.
        model="dinov3/vitt16",                  # Student: DINOv3 ViT-T/16.
        method="distillation",
        method_args={
            "teacher": "dinov3/vitl16",         # Teacher: DINOv3 ViT-L/16.
        },
    )
Or equivalently, using the command-line interface:

lightly-train pretrain out="out/my_experiment" data="my_data_dir" model="dinov3/vitt16" method="distillation" method_args.teacher="dinov3/vitl16"

See Distillation method for more details on the pretraining method and its configuration options.
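Independent of LightlyTrain's exact loss formulation (documented on the Distillation page), the core idea of distillation pretraining is that the student network is trained to reproduce the representations of a frozen teacher. A minimal illustrative sketch, using a mean-squared error between L2-normalized feature vectors:

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def distillation_loss(student_feats, teacher_feats):
    # Match the student's L2-normalized features to the (frozen)
    # teacher's features; the teacher receives no gradient updates.
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)
    return sum((a - b) ** 2 for a, b in zip(s, t)) / len(s)

# Features pointing in the same direction give zero loss,
# orthogonal features do not.
print(distillation_loss([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(distillation_loss([1.0, 0.0], [0.0, 1.0]))  # 1.0
```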

Fine-tune

DINOv3 models come with high-quality pretrained weights from Meta and can be used directly as fine-tuning backbones without additional pretraining. After pretraining on a custom dataset, the exported backbone can also be loaded via the backbone_weights argument. Refer to the task-specific fine-tuning pages for instructions and example code.

Supported Models

ViT Models

The following ViT models are supported. The LVD-1689M and SAT-493M models are pretrained by Meta and are under the DINOv3 license. The EUPE models are pretrained by Meta using the EUPE method and are under the FAIR Noncommercial Research License. The ViT-T/16 models, except the EUPE one, are trained by Lightly using knowledge distillation from DINOv3 ViT-L/16.

  • ViT-T (Lightly, distilled from DINOv3 ViT-L/16 on ImageNet-1K)

    • dinov3/vitt16 — distillationv2 weights; recommended for dense tasks (object detection, segmentation)

    • dinov3/vitt16plus — distillationv2 weights; recommended for dense tasks

    • dinov3/vitt16-distillationv1 — distillationv1 weights; recommended for global tasks (image classification)

    • dinov3/vitt16plus-distillationv1 — distillationv1 weights; recommended for global tasks

    • dinov3/vitt16-notpretrained — random initialization; for training from scratch

    • dinov3/vitt16plus-notpretrained — random initialization; for training from scratch

  • ViT-T (Meta, LVD-1689M)

    • dinov3/vitt16-eupe — EUPE weights

  • ViT-S (Meta, LVD-1689M)

    • dinov3/vits16

    • dinov3/vits16-eupe — EUPE weights

    • dinov3/vits16plus

  • ViT-B (Meta, LVD-1689M)

    • dinov3/vitb16

  • ViT-L (Meta)

    • dinov3/vitl16 (LVD-1689M)

    • dinov3/vitl16-sat493m (SAT-493M)

  • ViT-H (Meta, LVD-1689M)

    • dinov3/vith16plus

  • ViT-7B (Meta)

    • dinov3/vit7b16 (LVD-1689M)

    • dinov3/vit7b16-sat493m (SAT-493M)
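All ViT models above use a 16×16 patch size (the "16" in the model names). For dense tasks such as detection and segmentation, this determines the resolution of the feature grid the head operates on, a quick sketch of the arithmetic:

```python
def patch_grid(height, width, patch_size=16):
    # A ViT with patch size p splits an image into an
    # (height // p) x (width // p) grid of patch tokens;
    # dense prediction heads consume this grid.
    assert height % patch_size == 0 and width % patch_size == 0, (
        "image dimensions must be divisible by the patch size"
    )
    return height // patch_size, width // patch_size

gh, gw = patch_grid(224, 224)
print(gh, gw, gh * gw)  # 14 14 196
```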

ConvNeXt Models

The following ConvNeXt models are supported. All are pretrained by Meta on the LVD-1689M dataset. The DINOv3 models are under the DINOv3 license. The EUPE models are pretrained by Meta using the EUPE method and are under the FAIR Noncommercial Research License.

  • dinov3/convnext-tiny

  • dinov3/convnext-tiny-eupe — EUPE weights

  • dinov3/convnext-small

  • dinov3/convnext-small-eupe — EUPE weights

  • dinov3/convnext-base

  • dinov3/convnext-base-eupe — EUPE weights

  • dinov3/convnext-large