(models-dinov3)=

# DINOv3

This page describes how to use DINOv3 models with LightlyTrain.

[DINOv3](https://github.com/facebookresearch/dinov3) models are Vision Transformers
(ViTs) and ConvNeXt models pretrained by Meta using the DINOv3 self-supervised learning
method on the large-scale LVD-1689M dataset. They are state-of-the-art vision foundation
models and serve as strong backbones for downstream tasks such as object detection,
segmentation, and image classification.

```{note}
DINOv3 models are released under the
[DINOv3 license](https://github.com/lightly-ai/lightly-train/blob/main/licences/DINOv3_LICENSE.md).
Use [DINOv2](#models-dinov2) models instead for a more permissive Apache 2.0 license.
```

## Pretrain and Fine-tune a DINOv3 Model

### Pretrain

DINOv3 ViT-T/16 models (`dinov3/vitt16` and `dinov3/vitt16plus`) are efficient tiny
models trained by Lightly using the [distillation method](#methods-distillation) with
DINOv3 ViT-L/16 as the teacher on ImageNet-1K. They are not part of Meta's official
DINOv3 release but follow the same architecture. The ViT-T architecture is based on the
design proposed in [Touvron et al., 2022](https://arxiv.org/abs/2207.10666).

You can distill your own DINOv3 ViT-T/16 model from DINOv3 ViT-L/16 on your custom
dataset as follows:

````{tab} Python
```python
import lightly_train

if __name__ == "__main__":
    lightly_train.pretrain(
        out="out/my_experiment",                # Output directory.
        data="my_data_dir",                     # Directory with images.
        model="dinov3/vitt16",                  # Student: DINOv3 ViT-T/16.
        method="distillation",
        method_args={
            "teacher": "dinov3/vitl16",         # Teacher: DINOv3 ViT-L/16.
        },
    )
```
````

````{tab} Command Line
```bash
lightly-train pretrain out="out/my_experiment" data="my_data_dir" model="dinov3/vitt16" method="distillation" method_args.teacher="dinov3/vitl16"
````

See [Distillation method](#methods-distillation) for more details on the pretraining
method and its configuration options.

### Fine-tune

DINOv3 models come with high-quality pretrained weights from Meta and can be used
directly as fine-tuning backbones without additional pretraining. After pretraining on a
custom dataset, the exported backbone can also be loaded via the `backbone_weights`
argument. Refer to the following pages for fine-tuning instructions and example code:

- [Object Detection](#object-detection) — fine-tune a DINOv3-based LTDETR model;
  supports loading custom pretrained backbone weights via `backbone_weights` (see
  [Pretrain and Fine-tune](#object-detection-pretrain-finetune)).
- [Semantic Segmentation](#semantic-segmentation) — fine-tune a DINOv3-based EoMT model;
  supports loading custom pretrained backbone weights via `backbone_weights` (see
  [Pretrain and Fine-tune](#semantic-segmentation-pretrain-finetune)).
- [Instance Segmentation](#instance-segmentation) — fine-tune a DINOv3-based EoMT model.
- [Panoptic Segmentation](#panoptic-segmentation) — fine-tune a DINOv3-based EoMT model.
- [Image Classification](#image-classification) — fine-tune a DINOv3 backbone for
  classification.

## Supported Models

### ViT Models

The following ViT models are supported. The LVD-1689M and SAT-493M models are
[pretrained by Meta](https://github.com/facebookresearch/dinov3/tree/main?tab=readme-ov-file#pretrained-models)
and are under the
[DINOv3 license](https://github.com/facebookresearch/dinov3?tab=License-1-ov-file). The
EUPE models are pretrained by Meta using the
[EUPE method](https://github.com/facebookresearch/EUPE) and are under the
[FAIR Noncommercial Research License](https://github.com/facebookresearch/EUPE?tab=License-1-ov-file).
The ViT-T/16 models, except the EUPE one, are trained by Lightly using knowledge
distillation from DINOv3 ViT-L/16.

- ViT-T (Lightly, distilled from DINOv3 ViT-L/16 on ImageNet-1K)
  - `dinov3/vitt16` — distillationv2 weights; recommended for dense tasks (object
    detection, segmentation)
  - `dinov3/vitt16plus` — distillationv2 weights; recommended for dense tasks
  - `dinov3/vitt16-distillationv1` — distillationv1 weights; recommended for global
    tasks (image classification)
  - `dinov3/vitt16plus-distillationv1` — distillationv1 weights; recommended for global
    tasks
  - `dinov3/vitt16-notpretrained` — random initialization; for training from scratch
  - `dinov3/vitt16plus-notpretrained` — random initialization; for training from scratch
- ViT-T (Meta, LVD-1689M)
  - `dinov3/vitt16-eupe` - [EUPE weights](https://github.com/facebookresearch/EUPE)
- ViT-S (Meta, LVD-1689M)
  - `dinov3/vits16`
  - `dinov3/vits16-eupe` - [EUPE weights](https://github.com/facebookresearch/EUPE)
  - `dinov3/vits16plus`
- ViT-B (Meta, LVD-1689M)
  - `dinov3/vitb16`
  - `dinov3/vitb16-eupe` - [EUPE weights](https://github.com/facebookresearch/EUPE)
- ViT-L (Meta)
  - `dinov3/vitl16` (LVD-1689M)
  - `dinov3/vitl16-sat493m` (SAT-493M)
- ViT-H (Meta, LVD-1689M)
  - `dinov3/vith16plus`
- ViT-7B (Meta)
  - `dinov3/vit7b16` (LVD-1689M)
  - `dinov3/vit7b16-sat493m` (SAT-493M)

### ConvNeXt Models

The following ConvNeXt models are supported. All are
[pretrained by Meta](https://github.com/facebookresearch/dinov3/tree/main?tab=readme-ov-file#pretrained-models)
on the LVD-1689M dataset. The DINOv3 models are under the
[DINOv3 license](https://github.com/facebookresearch/dinov3?tab=License-1-ov-file). The
EUPE models are pretrained by Meta using the
[EUPE method](https://github.com/facebookresearch/EUPE) and are under the
[FAIR Noncommercial Research License](https://github.com/facebookresearch/EUPE?tab=License-1-ov-file).

- `dinov3/convnext-tiny`
- `dinov3/convnext-tiny-eupe` - [EUPE weights](https://github.com/facebookresearch/EUPE)
- `dinov3/convnext-small`
- `dinov3/convnext-small-eupe` -
  [EUPE weights](https://github.com/facebookresearch/EUPE)
- `dinov3/convnext-base`
- `dinov3/convnext-base-eupe` - [EUPE weights](https://github.com/facebookresearch/EUPE)
- `dinov3/convnext-large`