First Steps

The Lightly Docker solution follows a train, embed, select flow using self-supervised learning.

+-------+      +-------+      +--------+
| Train +----->+ Embed +----->+ Select |
+-------+      +-------+      +--------+
  1. You can either use a pre-trained model from the model zoo or fine-tune a model on your unlabeled dataset using self-supervised learning. The output of the train step is a model checkpoint.

  2. The embed step creates embeddings of the input dataset. Each sample gets represented using a low-dimensional vector. The output of the embed step is a .csv file.

  3. Finally, based on the embeddings and additional information, one of the sampling algorithms picks the relevant data for you. The output of the select step is a list of filenames as well as analytics in the form of a PDF report with plots.

You can also use each of the three steps independently. For example, you can use the Lightly Docker to embed a dataset and then train a linear classifier on top of the embeddings.

The docker solution can be used as a command-line interface. You run the container, tell it where to find the data, and tell it where to store the result. That’s it. There are various parameters you can pass to the container. We put a lot of effort into also exposing the full lightly framework configuration, so you could use the docker solution to train a self-supervised model instead of using the Python framework.

Before jumping into the details, let’s have a look at some basics. The docker container can be used like a simple script. You control its behavior by changing parameters on the command line.

Use the following command to get an overview of the available parameters:

docker run --gpus all --rm -it lightly/sampling:latest --help

Note

If the command fails because docker does not detect your GPU, make sure nvidia-docker is installed. You can follow the guide here.

Storage Access

We use volume mapping provided by the docker run command to process datasets. A docker container itself is not considered to be a good place to store your data. Volume mapping allows the container to work with the filesystem of the host system.

There are three types of volume mappings:

  • Input Directory:

    The input directory contains the dataset we want to process. The input data should be either a single folder containing all the images or a folder containing a subfolder which holds the images. See Tutorial 1: Structure Your Input for more information. The container has only read access to this directory (note the :ro at the end of the volume mapping).

  • Shared Directory:

    The shared directory allows the user to pass additional inputs such as embeddings or model checkpoints to the container. The checkpoints should be generated by the lightly Python package or by the docker container and the embeddings should be in the format specified in the tutorial “Structure Your Input”. The container requires only read access to this directory.

  • Output Directory:

    The output directory is the place where the results from all computations made by the container are stored. See Reporting and Docker Output for additional information. The container requires read and write access to this directory.

Note

Docker volume and port mappings always list the host side first and the container side second: you specify the path (or port) on the host system, followed by the path (or port) inside the container. E.g. -v /datasets:/home/datasets would mount /datasets from your system to /home/datasets in the docker container.

Typically, your docker command would start like this:

  • Map INPUT_DIR (from your system) to /home/input_dir in the container

    e.g. /path/to/my/cat/dataset:/home/input_dir:ro

  • Map OUTPUT_DIR (from your system) to /home/output_dir in the container

    e.g. /path/where/I/want/the/docker/output:/home/output_dir

  • Specify the token to authenticate your user

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN

Now, let’s see how this will look in action!

Note

Learn how to obtain your Authentication API Token.

Warning

Don’t forget to replace INPUT_DIR and OUTPUT_DIR with the path to your local input and output directory. You must not change the path after the : since this path is describing the internal file system within the container!
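
The mapping rules above can also be applied from a script. The following is a minimal Python sketch, not part of Lightly Docker, that assembles the docker run arguments from host paths. The helper name build_docker_command is hypothetical; it only uses the flags, container paths, and image name that appear in the commands on this page.

```python
import shlex
from pathlib import Path


def build_docker_command(input_dir, output_dir, token, shared_dir=None, options=None):
    """Assemble the docker run arguments shown on this page (hypothetical helper)."""
    cmd = ["docker", "run", "--gpus", "all", "--rm", "-it"]
    # Host path first, container path second; the input is mounted read-only.
    cmd += ["-v", f"{Path(input_dir).resolve()}:/home/input_dir:ro"]
    if shared_dir is not None:
        cmd += ["-v", f"{Path(shared_dir).resolve()}:/home/shared_dir:ro"]
    cmd += ["-v", f"{Path(output_dir).resolve()}:/home/output_dir"]
    cmd += ["lightly/sampling:latest", f"token={token}"]
    # Extra key=value parameters, e.g. {"stopping_condition.n_samples": 0.3}.
    for key, value in (options or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


command = build_docker_command(
    "/path/to/my/cat/dataset",
    "/path/where/I/want/the/docker/output",
    token="MYAWESOMETOKEN",
    options={"stopping_condition.n_samples": 0.3},
)
# Print the command for inspection instead of running it.
print(shlex.join(command))
```

The returned list could be passed to subprocess.run once the printed command looks right. Only the host side of each mapping is taken from the arguments; the container paths stay fixed, as required by the warning above.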

Embedding and Sampling a Dataset

To embed your images with a pre-trained model, you can run the docker solution with this command:

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=True \
    stopping_condition.n_samples=0.3

The command above does the following:

  • remove_exact_duplicates=True Removes exact duplicates

  • enable_corruptness_check=True Checks your dataset for corrupt images

  • stopping_condition.n_samples=0.3 Samples 30% of the images using the default method (coreset). Sampling 30% means that the remaining dataset will be 30% of the initial dataset size. You can also specify the exact number of remaining images by setting n_samples to an integer value.

As an alternative stopping condition, you can specify the minimum allowed distance between two image embeddings in the output dataset:

  • stopping_condition.min_distance=0.2 Removes samples until no two remaining samples are closer to each other than 0.2. After normalizing the input embeddings to unit length, this value should be between 0 and 2. This is often a more convenient method when working with different data sources and trying to combine them in a balanced way.
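
To see why min_distance lies between 0 and 2 for unit-length embeddings, here is a small Python sketch. The greedy filter in it is only an illustration of what a minimum-distance constraint means; it is an assumption made for this example and not the coreset implementation used inside the container.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def greedy_min_distance_filter(embeddings, min_distance):
    """Keep a sample only if it is at least min_distance away from all kept ones.

    Illustration only; this is NOT the algorithm used by Lightly Docker.
    """
    kept = []
    for index, candidate in enumerate(embeddings):
        if all(math.dist(candidate, embeddings[k]) >= min_distance for k in kept):
            kept.append(index)
    return kept


raw = [[1.0, 0.0], [2.0, 0.1], [0.0, 3.0], [-4.0, 0.0]]
unit = [normalize(v) for v in raw]

# Opposite unit vectors form the farthest possible pair.
farthest = math.dist(unit[0], unit[3])
print(farthest)

# Samples 0 and 1 point in almost the same direction, so sample 1 is dropped.
kept_indices = greedy_min_distance_filter(unit, min_distance=0.2)
print(kept_indices)
```

Identical embeddings have distance 0 and opposite ones have distance 2, which is where the stated range comes from.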

By default, the container only creates an output file with the selected filenames. You can also tell the program to copy the selected files into the output folder by adding the parameter dump_dataset=True to the command.

Train a Self-Supervised Model

Sometimes it may be beneficial to finetune a self-supervised model on your dataset before embedding the images. This may be the case when the dataset is from a specific domain (e.g. for medical images).

The command below will train a self-supervised model for a number of epochs (default: 100) on the images stored in the input directory before embedding and sampling them.

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    enable_training=True

The training of the model is identical to using the lightly open-source package with the following command:

lightly-train input_dir=INPUT_DIR

Checkpoints from your training process will be stored in the output directory. You can continue training from such a checkpoint by copying the checkpoint to the shared directory and then passing the checkpoint filename to the container:

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    stopping_condition.n_samples=0.3 \
    enable_training=True \
    checkpoint=lightly_epoch_99.ckpt
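
Copying the checkpoint by hand works fine; if you prefer to script it, the following Python sketch picks the checkpoint with the highest epoch number and copies it to the shared directory. It assumes checkpoints are named lightly_epoch_<N>.ckpt and sit directly in the output directory, as in the example output structure on this page. The helper name copy_latest_checkpoint is hypothetical.

```python
import re
import shutil
from pathlib import Path


def copy_latest_checkpoint(output_dir, shared_dir):
    """Copy the checkpoint with the highest epoch number into shared_dir.

    Assumes files named lightly_epoch_<N>.ckpt directly inside output_dir.
    Returns the filename to pass as checkpoint=..., or None if none exists.
    """
    pattern = re.compile(r"lightly_epoch_(\d+)\.ckpt$")
    candidates = []
    for path in Path(output_dir).glob("lightly_epoch_*.ckpt"):
        match = pattern.match(path.name)
        if match:
            # Compare epochs numerically so that epoch 99 beats epoch 9.
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    _, latest = max(candidates, key=lambda item: item[0])
    Path(shared_dir).mkdir(parents=True, exist_ok=True)
    shutil.copy2(latest, Path(shared_dir) / latest.name)
    return latest.name
```

The returned name is what you would pass as checkpoint=... in the command above, with shared_dir being the host directory you mount as SHARED_DIR.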

You may not always want to train for exactly 100 epochs with the default settings. The next section will explain how to customize the default settings.

Accessing Lightly Input Parameters

The docker container is a wrapper around the lightly Python package. Hence, for training and embedding, the user can access all the settings from the lightly command-line tool. Just prefix the parameter with lightly. to do so.

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=True \
    stopping_condition.n_samples=0.3 \
    enable_training=True \
    lightly.trainer.max_epochs=10 \
    lightly.collate.input_size=64 \
    lightly.loader.batch_size=256 \
    lightly.trainer.precision=16 \
    lightly.model.name=resnet-101

A list of all input parameters can be found here: List of Parameters
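
If you keep your settings in a nested structure, the dotted parameter names can be generated instead of typed. This is a minimal Python sketch with a hypothetical helper, flatten_overrides; the parameter names in the example are the ones used in the command above.

```python
def flatten_overrides(config, prefix="lightly"):
    """Turn a nested dict into dotted key=value command-line parameters."""
    overrides = []
    for key, value in config.items():
        dotted = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse into sub-sections such as trainer or loader.
            overrides.extend(flatten_overrides(value, prefix=dotted))
        else:
            overrides.append(f"{dotted}={value}")
    return overrides


settings = {
    "trainer": {"max_epochs": 10, "precision": 16},
    "collate": {"input_size": 64},
    "loader": {"batch_size": 256},
    "model": {"name": "resnet-101"},
}
overrides = flatten_overrides(settings)
for override in overrides:
    print(override)
```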

Sampling from Embeddings File

It is also possible to sample directly from embedding files generated by previous runs. For this, move the embeddings file to the shared directory, and specify the filename like so:

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=False \
    stopping_condition.n_samples=0.3 \
    embeddings=my_embeddings.csv

The embeddings file should follow the structure of the .csv file created by the lightly CLI: Create embeddings using the CLI or as described in Meta Information.

Manually Inspecting the Embeddings

Every time you run Lightly Docker you will find an embeddings.csv file in the output directory. This file contains the embeddings of all samples in your dataset. You can use the embeddings for clustering or manual inspection of your dataset.

Example plot of working with embeddings.csv


We provide an example notebook to learn more about how to work with the embeddings.
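
As a starting point for manual inspection, the sketch below loads an embeddings file with the Python standard library and lists the nearest neighbors of one sample. It assumes the file has a header row with a filenames column and numeric columns named embedding_0, embedding_1, and so on; check this against your own embeddings.csv before relying on it. Both function names are hypothetical.

```python
import csv
import math


def load_embeddings(csv_path):
    """Read an embeddings file into a {filename: vector} dictionary.

    Assumes a 'filenames' column and numeric 'embedding_<i>' columns;
    any other column (for example labels) is ignored.
    """
    embeddings = {}
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = sorted(
            (name for name in (reader.fieldnames or []) if name.startswith("embedding_")),
            # Sort by numeric suffix so embedding_10 comes after embedding_9.
            key=lambda name: int(name.rsplit("_", 1)[1]),
        )
        for row in reader:
            embeddings[row["filenames"]] = [float(row[name]) for name in columns]
    return embeddings


def nearest_neighbors(embeddings, query, k=3):
    """Return the k filenames closest to `query` by Euclidean distance."""
    reference = embeddings[query]
    others = [
        (math.dist(reference, vector), name)
        for name, vector in embeddings.items()
        if name != query
    ]
    return [name for _, name in sorted(others)[:k]]
```

Calling nearest_neighbors(load_embeddings("embeddings.csv"), "some_image.jpg") would then list the images that the model considers most similar to some_image.jpg, where both names are placeholders.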

Sampling from Video Files

If you are working with video files, you can point the docker container directly to the video files. This removes the need to extract the individual frames beforehand. To do so, simply store all videos you want to work with in a single directory; the lightly software will automatically load all frames from the videos.

# work on a single video
data/
+-- my_video.mp4

# work on several videos
data/
+-- my_video_1.mp4
+-- my_video_2.avi

As you can see, the videos do not need to be in the same file format. An example command for a folder structure as shown above could then look like this:

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    stopping_condition.n_samples=0.3

Where INPUT_DIR is the path to the directory containing the video files.

You can let Lightly Docker automatically extract the sampled frames and save them in the output folder using dump_dataset=True.

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    stopping_condition.n_samples=0.3 \
    dump_dataset=True

Note

The dump_dataset feature by default saves the images in the png format. This can take a lot of time when working with high-resolution videos. You can speed up the process by specifying the output format output_image_format='jpg' or the resolution output_image_size=X of the images.

Removing Exact Duplicates

With the docker solution, it is possible to remove only exact duplicates from the dataset. For this, simply set the stopping condition n_samples to 1.0 (which translates to 100% of the data). The exact command is:

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    stopping_condition.n_samples=1.0

Upload Sampled Dataset To Lightly Platform

Lightly Docker can automatically push the sampled dataset as well as its embeddings to the Lightly Platform.

Imagine you have a dataset of 100 videos with 10’000 frames each, that is 1 million frames in total. Using Lightly Docker and the Coreset method we sample the most diverse 50’000 images (a reduction of 20x). Now we push the 50’000 images to the Lightly Platform for a more interactive analysis. We can access all metadata as well as the embedding view to explore the dataset, find clusters, and further curate the dataset. Finally, we can use the Active Learning capabilities of the Lightly Platform to iteratively train, predict, and label the dataset in chunks until we reach the desired model accuracy.

To push the sampled dataset automatically after running Lightly Docker you can append upload_dataset=True to the docker run command.

E.g.

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    stopping_condition.n_samples=50000 \
    stopping_condition.min_distance=0.3 \
    upload_dataset=True

You can upload only thumbnails (to save bandwidth) or only metadata (for privacy-sensitive data) by adding the argument lightly.upload=thumbnails or lightly.upload=meta.

Note

You must specify the stopping condition n_samples and set the value below 75’000 (the current limit of a dataset in the Lightly Platform). We recommend setting both stopping conditions (min_distance and n_samples) in which case sampling stops as soon as the first condition is met.
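
The arithmetic behind the example can be checked with a few lines of Python. The sketch assumes, based on the description of n_samples on this page, that a float is read as a fraction of the dataset and an integer as an exact count; how the container rounds fractional results is an assumption. The helper name resolve_n_samples is hypothetical.

```python
UPLOAD_LIMIT = 75_000  # dataset size limit of the Lightly Platform stated in the note


def resolve_n_samples(n_samples, dataset_size):
    """Translate n_samples into an absolute number of retained images.

    A float is treated as a fraction of the dataset, an int as an exact count.
    Rounding of fractional results is an assumption made for this sketch.
    """
    if isinstance(n_samples, float):
        return round(n_samples * dataset_size)
    return min(n_samples, dataset_size)


total_frames = 100 * 10_000  # 100 videos with 10'000 frames each
kept = resolve_n_samples(50_000, total_frames)
reduction = total_frames // kept
print(reduction)

# 50'000 images fit below the upload limit ...
fits_absolute = kept < UPLOAD_LIMIT
# ... but 30% of one million frames would not.
fits_fraction = resolve_n_samples(0.3, total_frames) < UPLOAD_LIMIT
print(fits_absolute, fits_fraction)
```

For large datasets an absolute n_samples value is therefore the safer choice when upload_dataset=True is set.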

Reporting

To facilitate sustainability and reproducibility in ML, the docker container has an integrated reporting component. For every dataset you run through the container, an output directory is created that contains the exact configuration used for the experiment. Additionally, plots, statistics, and further information collected during the training of the self-supervised model, the embedding, or the sampling of the dataset are provided.

To make it easier for you to understand and discuss the dataset we put the essential information into an automatically generated PDF report. Sample reports can be found on the Lightly website.

Live View of Docker Status

You can get a live status update of the currently running docker runs through the cloud platform.

To use the new feature simply follow the steps:

  1. Make sure you have the latest docker version installed (see Download the Docker Image)

  2. Open a browser and navigate to the Lightly Platform

  3. In the navigation menu on the top click on My Docker Runs

  4. Once you start the Lightly Docker you should see the dashboard of the current run. Please make sure that you use the same token for the docker run as you find in the dashboard.

In the dashboard, you see a list of your docker runs and a live update of the active runs. Use this view to see whether the data selection is still running as expected.

../../_images/docker_runs_overview.png

Note

Note that only status updates and error messages are transmitted.

Docker Output

The output directory is structured in the following way:

  • config:

    A directory containing copies of the configuration files and overwrites.

  • data:

    The data directory contains everything to do with data.

    • If enable_corruptness_check=True, it will contain a “clean” version of the dataset.

    • If remove_exact_duplicates=True, it will contain a copy of the embeddings.csv where all duplicates are removed. Otherwise, it will simply store the embeddings computed by the model.

  • filenames:

    This directory contains lists of filenames of the corrupt images, removed images, sampled images and the images which were removed because they have an exact duplicate in the dataset.

  • plots:

    A directory containing the plots which were produced for the report.

  • report.pdf

    To provide a simple overview of the filtering process the docker container automatically generates a report. The report contains

    • information about the job (duration, processed files etc.)

    • estimated savings in terms of labeling costs and CO2 due to the smaller dataset

    • statistics about the dataset before and after sampling

    • histogram before and after filtering

    • visualizations of the dataset

    • nearest neighbors of retained images among the removed ones

  • report.json

    The report is also available as a report.json file, so any value from the PDF report can be accessed easily.

Below you find a typical output folder structure.

|-- config
|   |-- config.yaml
|   |-- hydra.yaml
|   '-- overrides.yaml
|-- data
|   |-- al_score_embeddings.csv
|   |-- bounding_boxes.json
|   |-- bounding_boxes_examples
|   |-- embeddings.csv
|   |-- normalized_embeddings.csv
|   |-- sampled
|   '-- selected_embeddings.csv
|-- filenames
|   |-- corrupt_filenames.txt
|   |-- duplicate_filenames.txt
|   |-- removed_filenames.txt
|   '-- sampled_filenames.txt
|-- lightly_epoch_1.ckpt
|-- plots
|   |-- distance_distr_after.png
|   |-- distance_distr_before.png
|   |-- filter_decision_0.png
|   |-- filter_decision_11.png
|   |-- filter_decision_22.png
|   |-- filter_decision_33.png
|   |-- filter_decision_44.png
|   |-- filter_decision_55.png
|   |-- pretagging_histogram_after.png
|   |-- pretagging_histogram_before.png
|   |-- scatter_pca.png
|   |-- scatter_pca_no_overlay.png
|   |-- scatter_umap_k_15.png
|   |-- scatter_umap_k_15_no_overlay.png
|   |-- scatter_umap_k_5.png
|   |-- scatter_umap_k_50.png
|   |-- scatter_umap_k_50_no_overlay.png
|   '-- scatter_umap_k_5_no_overlay.png
|-- report.json
'-- report.pdf
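
A quick way to check a finished run is to count the entries in the filename lists and look at what report.json contains. The Python sketch below assumes that each file in the filenames directory holds one filename per line; because the schema of report.json is not described on this page, it only lists the top-level keys. The helper name summarize_output is hypothetical.

```python
import json
from pathlib import Path


def summarize_output(output_dir):
    """Count the entries in filenames/*.txt and list the keys of report.json.

    Assumes one filename per line in each text file.
    """
    output_dir = Path(output_dir)
    counts = {}
    for path in sorted((output_dir / "filenames").glob("*.txt")):
        lines = path.read_text().splitlines()
        # Ignore empty lines such as a trailing newline at the end of the file.
        counts[path.name] = sum(1 for line in lines if line.strip())
    report_keys = []
    report_path = output_dir / "report.json"
    if report_path.exists():
        report = json.loads(report_path.read_text())
        # The schema is not documented here, so only top-level keys are listed.
        report_keys = sorted(report) if isinstance(report, dict) else []
    return counts, report_keys
```

Pass the host path you mapped as OUTPUT_DIR. Comparing the count for sampled_filenames.txt with the n_samples value you requested is a simple sanity check.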

Evaluation of the Sampling Process

Histograms and Plots

The report contains histograms of the pairwise distance between images before and after the sampling.

An example of such a histogram before and after filtering for the CamVid dataset consisting of 367 samples is shown below. We marked the region which is of special interest with an orange rectangle. Our goal is to make this histogram more symmetric by removing samples of short distances from each other.

If we remove 25 samples (7%) out of the 367 samples of the CamVid dataset the histogram looks more symmetric as shown below. In our experiments, removing 7% of the dataset results in a model with higher validation set accuracy.

../../_images/histogram_before_after.jpg

Note

Why symmetric histograms are preferred: An asymmetric histogram can be the result of either a dataset with outliers or inliers. A heavy tail for low distances means that there is at least one high-density region with many samples very close to each other within the main cluster. Having such a high-density region can lead to biased models trained on this particular dataset. A heavy tail towards high distances shows that there is at least one high-density region outside the main cluster of samples.
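
If you want to reproduce this kind of analysis on your own embeddings, the following Python sketch computes pairwise distances, bins them into a histogram, and uses sample skewness as one possible measure of asymmetry. Skewness is a choice made for this illustration, not necessarily what the report computes: a value near 0 indicates a symmetric distribution, a negative value a tail towards low distances, and a positive value a tail towards high distances.

```python
import math
from itertools import combinations


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of vectors."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]


def histogram(values, n_bins=10):
    """Count how many values fall into each of n_bins equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * n_bins
    for value in values:
        # The maximum value belongs to the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


def skewness(values):
    """Sample skewness: 0 if symmetric, negative for a low tail, positive for a high tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0
```

Comparing skewness(pairwise_distances(...)) for the full and the sampled embeddings gives a rough numeric counterpart to the before and after histograms. Note that the number of pairs grows quadratically, so this is only practical for small datasets such as the 367 CamVid samples.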

Retained/Removed Image Pairs

The report also displays examples of retained images with their nearest neighbor among the removed images. This is a good heuristic to see whether the number of retained samples is too small or too large: If the pairs are very different, this may be a sign that too many samples were removed. If the pairs are similar, more images can likely be removed.

With the argument stopping_condition.n_samples=X you can set the number of samples which should be kept.

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=False \
    stopping_condition.n_samples=500

With the argument n_example_images you can determine how many pairs are shown. Note that this must be an even number.

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=False \
    stopping_condition.n_samples=0.3 \
    n_example_images=32