DINO

DINO (Distillation with No Labels) is a self-supervised framework for visual representation learning that applies knowledge distillation without labels. As in standard knowledge distillation, DINO uses a teacher-student setup in which the student learns to mimic the teacher’s outputs. The key difference is that the teacher is not a separately trained model: its weights are an exponential moving average of the student’s weights. DINO achieves strong performance on image clustering, segmentation, and zero-shot transfer tasks.

Use DINO in LightlyTrain

import lightly_train

if __name__ == "__main__":
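    lightly_train.train(
        out="out/my_experiment",  # output directory for this run
        data="my_data_dir",  # directory containing the training images
        model="torchvision/resnet18",  # backbone to pretrain
        method="dino",  # self-supervised method
    )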

The equivalent command-line call:

lightly-train train out=out/my_experiment data=my_data_dir model="torchvision/resnet18" method="dino"

What’s Under the Hood

DINO trains a student network to match the output of a momentum-averaged teacher network, without labels. The self-distillation objective is a cross-entropy loss between the student and teacher outputs. To create diverse views of the same image, DINO applies random cropping, resizing, color jittering, and Gaussian blur. In particular, its multi-crop augmentation strategy generates two global views and several local views, which are smaller crops of the original image. Additionally, centering and sharpening of the teacher’s pseudo-labels are used to stabilize training.
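The objective and the two moving averages described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single pair of views, not LightlyTrain's implementation: the function names are hypothetical, and the temperatures (0.1 for the student, 0.04 for the teacher) and momentum values (0.9 for the center, 0.996 for the teacher) are the values reported in the original DINO paper, assumed here only for illustration.

```python
import math


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions for one view pair."""
    # Centering: subtract a running mean of the teacher outputs.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    # Sharpening: a low teacher temperature makes the pseudo-label peaked.
    teacher_probs = [math.exp(lp) for lp in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lp for p, lp in zip(teacher_probs, student_log_probs))


def ema_update(old, new, momentum):
    """Exponential moving average: keep `momentum` of old, take the rest from new."""
    return [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]


# Toy example with a 3-dimensional output (hypothetical values).
teacher_out = [2.0, 0.0, -1.0]
center = [0.0, 0.0, 0.0]
agreeing_student = [2.0, 0.0, -1.0]
disagreeing_student = [-1.0, 0.0, 2.0]

loss_agree = dino_loss(agreeing_student, teacher_out, center)
loss_disagree = dino_loss(disagreeing_student, teacher_out, center)
print(loss_agree < loss_disagree)

# After each step, the center tracks the teacher outputs (in practice the
# batch mean; a single sample is used here) and the teacher weights track
# the student weights.
center = ema_update(center, teacher_out, momentum=0.9)
teacher_weights = ema_update([0.5, -0.5], [1.0, 0.0], momentum=0.996)
```

No gradient flows through the teacher in this scheme: only the student is optimized on the loss, while the teacher changes solely through the moving average.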

Lightly Recommendations

  • Models: DINO works well with both ViTs and CNNs.

  • Batch Size: We recommend a batch size between 256 and 1024 for DINO, as suggested in the original paper.

  • Number of Epochs: We recommend between 100 and 300 epochs. However, DINO benefits from longer schedules and may still improve when trained for more than 300 epochs.
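As a rough sketch, the recommendations above could be collected into keyword arguments for the training call. This assumes that `lightly_train.train` accepts `batch_size` and `epochs` keyword arguments; check the API reference for your installed version. The specific values 512 and 200 are hypothetical picks from within the recommended ranges.

```python
# Hypothetical settings that follow the recommendations above.
train_kwargs = {
    "out": "out/my_experiment",
    "data": "my_data_dir",
    "model": "torchvision/resnet18",
    "method": "dino",
    "batch_size": 512,  # within the recommended 256 to 1024 range
    "epochs": 200,  # within the recommended 100 to 300 range
}

# With LightlyTrain installed, these could be passed on as follows
# (assuming the parameter names above match your version):
#     import lightly_train
#     lightly_train.train(**train_kwargs)
```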

Default Augmentation Settings

The following are the default augmentation settings for DINO. To learn how you can override these settings, see Configuring Image Augmentations.

{
    "color_jitter": {
        "brightness": 0.8,
        "contrast": 0.8,
        "hue": 0.2,
        "prob": 0.8,
        "saturation": 0.4,
        "strength": 0.5
    },
    "gaussian_blur": {
        "blur_limit": 0,
        "prob": 1.0,
        "sigmas": [
            0.1,
            2.0
        ]
    },
    "global_view_1": {
        "gaussian_blur": {
            "blur_limit": 0,
            "prob": 0.1,
            "sigmas": [
                0.1,
                2.0
            ]
        },
        "solarize": {
            "prob": 0.2,
            "threshold": 0.5
        }
    },
    "image_size": [
        224,
        224
    ],
    "local_view": {
        "gaussian_blur": {
            "blur_limit": 0,
            "prob": 0.5,
            "sigmas": [
                0.1,
                2.0
            ]
        },
        "num_views": 6,
        "random_resize": {
            "max_scale": 0.14,
            "min_scale": 0.05
        },
        "view_size": [
            96,
            96
        ]
    },
    "normalize": {
        "mean": [
            0.485,
            0.456,
            0.406
        ],
        "std": [
            0.229,
            0.224,
            0.225
        ]
    },
    "random_flip": {
        "horizontal_prob": 0.5,
        "vertical_prob": 0.0
    },
    "random_gray_scale": 0.2,
    "random_resize": {
        "max_scale": 1.0,
        "min_scale": 0.14
    },
    "random_rotation": null,
    "solarize": null
}
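To get a feel for what these defaults imply, the sketch below counts the views generated per image and estimates the pixel volume relative to a single global view. It uses only a subset of the settings above. The count of two global views comes from the method description; the pairing rule (the teacher sees only global views, the student sees all views, and pairs of identical views are skipped) follows the original DINO paper and is assumed here. The function name `summarize_views` is hypothetical.

```python
# Subset of the default settings shown above.
defaults = {
    "image_size": [224, 224],
    "local_view": {"num_views": 6, "view_size": [96, 96]},
}

NUM_GLOBAL_VIEWS = 2  # two global views per image, per the method description


def summarize_views(cfg, num_global=NUM_GLOBAL_VIEWS):
    """Count views, loss terms, and relative pixel volume for one image."""
    global_h, global_w = cfg["image_size"]
    local_h, local_w = cfg["local_view"]["view_size"]
    num_local = cfg["local_view"]["num_views"]

    total_views = num_global + num_local
    # Assumed pairing rule: each global (teacher) view is compared with
    # every other (student) view, excluding itself.
    loss_terms = num_global * (total_views - 1)
    total_pixels = (num_global * global_h * global_w
                    + num_local * local_h * local_w)
    return {
        "total_views": total_views,
        "loss_terms": loss_terms,
        "pixels_vs_one_global_view": total_pixels / (global_h * global_w),
    }


summary = summarize_views(defaults)
print(summary)
```

With the defaults, this gives 8 views and 14 loss terms per image, at roughly 3.1 times the pixel volume of a single 224×224 view. The six local views together add only about as much pixel volume as one extra global view, which is why multi-crop is comparatively cheap.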