The Lightly Docker solution follows a train, embed, select flow using self-supervised learning.
+-------+      +-------+      +--------+
| Train +----->+ Embed +----->+ Select |
+-------+      +-------+      +--------+
You can either use a pre-trained model from the model zoo or fine-tune a model on your unlabeled dataset using self-supervised learning. The output of the train step is a model checkpoint.
The embed step creates embeddings of the input dataset. Each sample gets represented using a low-dimensional vector. The output of the embed step is a .csv file.
Finally, based on the embeddings and additional information, one of the sampling algorithms picks the relevant data for you. The output of the select step is a list of filenames as well as analytics in the form of a PDF report with plots.
You can also use each of the three steps independently. For example, you can use the Lightly Docker to embed a dataset and then train a linear classifier on top of the embeddings.
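As a rough illustration of using the embed step on its own, the sketch below parses an embeddings file and fits a tiny perceptron on top of it. The column layout (filenames, embedding_0 ... embedding_N, labels) is an assumption made for this example; check the tutorial "Structure Your Input" for the actual format. The helper names are hypothetical, and the perceptron is only a stand-in for whichever linear classifier you prefer.

```python
import csv
import io


def load_embeddings(csv_text):
    """Parse CSV text into (filenames, vectors, labels).

    Assumed header: filenames, embedding_0, ..., embedding_N, labels.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    filenames, vectors, labels = [], [], []
    for row in reader:
        # Order the embedding columns by their numeric suffix.
        dims = sorted(
            (key for key in row if key.startswith("embedding_")),
            key=lambda key: int(key.split("_")[1]),
        )
        filenames.append(row["filenames"])
        vectors.append([float(row[key]) for key in dims])
        labels.append(int(row["labels"]))
    return filenames, vectors, labels


def train_perceptron(vectors, labels, epochs=20):
    """Fit a binary linear classifier (labels 0/1) with the perceptron rule."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            target = 1 if label == 1 else -1
            score = sum(w * v for w, v in zip(weights, vector)) + bias
            if target * score <= 0:  # misclassified: move the boundary
                weights = [w + target * v for w, v in zip(weights, vector)]
                bias += target
    return weights, bias


def predict(weights, bias, vector):
    """Return 1 if the vector lies on the positive side of the boundary."""
    score = sum(w * v for w, v in zip(weights, vector)) + bias
    return 1 if score > 0 else 0


# Toy data standing in for a real embeddings file.
EXAMPLE_CSV = """filenames,embedding_0,embedding_1,labels
a.jpg,1.0,0.1,0
b.jpg,0.9,-0.2,0
c.jpg,-1.0,0.2,1
d.jpg,-0.8,-0.1,1
"""

names, X, y = load_embeddings(EXAMPLE_CSV)
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])
```

For real datasets you would read the file from the output directory instead of a string and hold out part of the data for evaluation.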
The docker solution can be used as a command-line interface. You run the container, tell it where to find the data, and where to store the result. That’s it. There are various parameters you can pass to the container. We put a lot of effort into also exposing the full lightly framework configuration, so you could use the docker solution to train a self-supervised model instead of using the Python framework.
Before jumping into the details, let’s have a look at some basics. The docker container can be used like a simple script. You control its parameters by changing flags.
Use the following command to get an overview of the available parameters:
docker run --gpus all --rm -it lightly/sampling:latest --help
We use volume mapping provided by the docker run command to process datasets. A docker container itself is not considered to be a good place to store your data. Volume mapping allows the container to work with the filesystem of the host system.
There are three types of volume mappings:
- Input Directory:
The input directory contains the dataset we want to process. The format of the input data should be either a single folder containing all the images or a folder containing a subfolder which holds the images. See the tutorial “Structure Your Input” for more information. The container has only read access to this directory (note the :ro at the end of the volume mapping).
- Shared Directory:
The shared directory allows the user to pass additional inputs such as embeddings or model checkpoints to the container. The checkpoints should be generated by the lightly Python package or by the docker container and the embeddings should be in the format specified in the tutorial “Structure Your Input”. The container requires only read access to this directory.
- Output Directory:
The output directory is the place where the results from all computations made by the container are stored. See Reporting and Docker Output for additional information. The container requires read and write access to this directory.
Docker volume and port mappings always follow the same scheme: you first specify the path (or port) on the host system, followed by the path (or port) inside the container. E.g. -v /datasets:/home/datasets would mount /datasets from your system to /home/datasets in the docker container.
Typically, your docker command would start like this:
- Map INPUT_DIR (from your system) to /home/input_dir in the container
- Map OUTPUT_DIR (from your system) to /home/output_dir in the container
- Specify the token to authenticate your user

docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN
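If you launch the container from a script, the host:container mapping scheme can be assembled programmatically. The following is a minimal sketch: build_docker_command is a hypothetical helper, the host paths are placeholders, and only flags that appear on this page are used.

```python
import shlex


def build_docker_command(input_dir, output_dir, token, shared_dir=None, options=None):
    """Assemble the docker run call as an argument list.

    Each volume mapping lists the host path first and the container path
    second; the input and shared directories are mounted read-only (:ro).
    """
    command = ["docker", "run", "--gpus", "all", "--rm", "-it"]
    command += ["-v", f"{input_dir}:/home/input_dir:ro"]
    if shared_dir is not None:
        command += ["-v", f"{shared_dir}:/home/shared_dir:ro"]
    command += ["-v", f"{output_dir}:/home/output_dir"]
    command += ["lightly/sampling:latest", f"token={token}"]
    # Extra key=value parameters, e.g. stopping_condition.n_samples=0.3
    for key, value in (options or {}).items():
        command.append(f"{key}={value}")
    return command


# Placeholder host paths and token, for illustration only.
command = build_docker_command(
    "/datasets/raw",
    "/datasets/out",
    "MYAWESOMETOKEN",
    options={"stopping_condition.n_samples": 0.3},
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run as is; the -it flag may need to be dropped when no terminal is attached.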
Now, let’s see how this will look in action!
Learn how to obtain your Authentication API Token.
Embedding and Sampling a Dataset
To embed your images with a pre-trained model, you can run the docker solution with this command:
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=True \
    stopping_condition.n_samples=0.3
The command above does the following:
- remove_exact_duplicates=True
Removes exact duplicates.
- enable_corruptness_check=True
Checks your dataset for corrupt images.
- stopping_condition.n_samples=0.3
Samples 30% of the images using the default method (coreset). Sampling 30% means that the remaining dataset will be 30% of the initial dataset size. You can also specify the exact number of remaining images by setting n_samples to an integer value.
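To make the two readings of n_samples concrete, the sketch below resolves a fraction or an integer to a number of retained images and then picks that many samples with a greedy k-center heuristic, one common way to build a coreset. Both helpers are hypothetical: the rounding rule is an assumption, and the container's actual coreset implementation is not described here and may differ.

```python
import math


def resolve_n_samples(n_samples, dataset_size):
    """Translate n_samples into an absolute number of images to keep.

    A float is read as a fraction of the dataset, an int as an exact count.
    Rounding and clamping are assumptions made for this sketch.
    """
    if isinstance(n_samples, bool):
        raise TypeError("n_samples must be a float or an int")
    if isinstance(n_samples, float):
        if not 0.0 < n_samples <= 1.0:
            raise ValueError("a fractional n_samples must lie in (0, 1]")
        return round(n_samples * dataset_size)
    if isinstance(n_samples, int):
        return min(n_samples, dataset_size)
    raise TypeError("n_samples must be a float or an int")


def greedy_k_center(vectors, n_keep):
    """Return indices of n_keep samples chosen by farthest-point selection."""
    if n_keep <= 0 or not vectors:
        return []
    selected = [0]  # start from the first sample
    # Distance from every sample to its closest selected sample.
    min_dist = [math.dist(v, vectors[0]) for v in vectors]
    while len(selected) < min(n_keep, len(vectors)):
        # Add the sample that is farthest from everything selected so far.
        idx = max(range(len(vectors)), key=lambda i: min_dist[i])
        selected.append(idx)
        min_dist = [
            min(d, math.dist(v, vectors[idx])) for d, v in zip(min_dist, vectors)
        ]
    return selected


embeddings = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0]]
n_keep = resolve_n_samples(0.5, len(embeddings))
print(greedy_k_center(embeddings, n_keep))
```

Note how the near-duplicate at index 1 is the last sample such a heuristic would pick, which matches the intuition of removing redundant images first.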
Train a Self-Supervised Model
Sometimes it may be beneficial to finetune a self-supervised model on your dataset before embedding the images. This may be the case when the dataset is from a specific domain (e.g. for medical images).
The command below will train a self-supervised model for 100 epochs (the default) on the images stored in the input directory before embedding and sampling them.
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    enable_training=True
The training of the model is identical to training with the command-line tool of the lightly open-source package.
Checkpoints from your training process will be stored in the output directory. You can continue training from such a checkpoint by copying the checkpoint to the shared directory and then passing the checkpoint filename to the container:
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    stopping_condition.n_samples=0.3 \
    enable_training=True \
    checkpoint=lightly_epoch_99.ckpt
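Copying the checkpoint into the shared directory can be scripted. The sketch below is one possible way to do it, assuming checkpoints are named like lightly_epoch_99.ckpt (as in the command above) and sit somewhere below the output directory; the exact location is not specified here, so the helper searches recursively. Both function names are hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumed naming scheme, taken from the example filename lightly_epoch_99.ckpt.
CKPT_PATTERN = re.compile(r"lightly_epoch_(\d+)\.ckpt$")


def latest_checkpoint(output_dir):
    """Return the checkpoint with the highest epoch number, or None."""
    best = None
    for path in Path(output_dir).rglob("*.ckpt"):
        match = CKPT_PATTERN.search(path.name)
        if match is None:
            continue
        epoch = int(match.group(1))
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return best[1] if best else None


def stage_checkpoint(output_dir, shared_dir):
    """Copy the latest checkpoint into shared_dir and return its filename."""
    checkpoint = latest_checkpoint(output_dir)
    if checkpoint is None:
        return None
    shared = Path(shared_dir)
    shared.mkdir(parents=True, exist_ok=True)
    shutil.copy2(checkpoint, shared / checkpoint.name)
    return checkpoint.name
```

The returned filename is what you would pass to the container as checkpoint=..., with shared_dir mounted to /home/shared_dir.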
You may not always want to train for exactly 100 epochs with the default settings. The next section will explain how to customize the default settings.
Accessing Lightly Input Parameters
The docker container is a wrapper around the lightly Python package. Hence, for training and embedding, you can access all the settings of the lightly command-line tool. To do so, prefix the parameter name with lightly, e.g. lightly.trainer.max_epochs.
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=True \
    stopping_condition.n_samples=0.3 \
    enable_training=True \
    lightly.trainer.max_epochs=10 \
    lightly.collate.input_size=64 \
    lightly.loader.batch_size=256 \
    lightly.trainer.precision=16 \
    lightly.model.name=resnet-101
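The dotted parameter names mirror a nested configuration. As a small sketch of that correspondence, the hypothetical helper below flattens a nested dictionary into the key=value strings used on the command line; the keys and values are the ones from the command above.

```python
def to_overrides(config, prefix=""):
    """Flatten a nested dict into dotted key=value override strings."""
    overrides = []
    for key, value in config.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested sections, extending the dotted prefix.
            overrides.extend(to_overrides(value, name))
        else:
            overrides.append(f"{name}={value}")
    return overrides


config = {
    "enable_training": True,
    "lightly": {
        "trainer": {"max_epochs": 10, "precision": 16},
        "collate": {"input_size": 64},
        "loader": {"batch_size": 256},
        "model": {"name": "resnet-101"},
    },
}

for override in to_overrides(config):
    print(override)
```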
A list of all input parameters can be found here: List of Parameters
Sampling from Embeddings File
It is also possible to sample directly from embedding files generated by previous runs. For this, move the embeddings file to the shared directory, and specify the filename like so:
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=False \
    stopping_condition.n_samples=0.3 \
    embeddings=my_embeddings.csv
Sampling from Video Files
If you are working with video files, you can point the docker container directly to them. This removes the need to extract the individual frames beforehand. To do so, simply store all videos you want to work with in a single directory. The lightly software will automatically load all frames from the videos.
# work on a single video
data/
+-- my_video.mp4

# work on several videos
data/
+-- my_video_1.mp4
+-- my_video_2.avi
As you can see, the videos do not need to be in the same file format. An example command for a folder structure as shown above could then look like this:
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    stopping_condition.n_samples=0.3
Where INPUT_DIR is the path to the directory containing the video files.
Removing Exact Duplicates
With the docker solution, it is possible to remove only exact duplicates from the dataset. For this, simply set the stopping condition n_samples to 1.0 (which translates to 100% of the data). The exact command is:
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v SHARED_DIR:/home/shared_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    stopping_condition.n_samples=1.0
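For intuition about what an exact duplicate is, the sketch below flags files whose contents are byte-for-byte identical by comparing SHA-256 digests. This is an illustrative stand-in only: how the container detects duplicates is not described here, and find_exact_duplicates is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def find_exact_duplicates(directory):
    """Return (duplicate_name, original_name) pairs of byte-identical files.

    Files are visited in sorted order, so the first file with a given
    content counts as the original and later ones as duplicates.
    """
    seen = {}  # content digest -> first path seen with that content
    duplicates = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append((path.name, seen[digest].name))
        else:
            seen[digest] = path
    return duplicates
```

Reading whole files into memory is fine for images; for large video files you would hash in chunks instead.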
To facilitate sustainability and reproducibility in ML, the docker container has an integrated reporting component. For every dataset you run through the container, an output directory is created that contains the exact configuration used for the experiment. Additionally, it provides plots, statistics, and further information collected during training of the self-supervised model, embedding, or sampling of the dataset.
To make it easier for you to understand and discuss the dataset we put the essential information into an automatically generated PDF report. Sample reports can be found on the Lightly website.
The output directory is structured in the following way:
- config:
A directory containing copies of the configuration files and overrides.
- data:
The data directory contains everything to do with data. If enable_corruptness_check=True, it will contain a “clean” version of the dataset. If remove_exact_duplicates=True, it will contain a copy of the embeddings.csv where all duplicates are removed. Otherwise, it will simply store the embeddings computed by the model.
- filenames:
This directory contains lists of filenames of the corrupt images, removed images, sampled images, and the images which were removed because they have an exact duplicate in the dataset.
- plots:
A directory containing the plots which were produced for the report.
To provide a simple overview of the filtering process, the docker container automatically generates a report. The report contains:
- information about the job (duration, processed files, etc.)
- estimated savings in terms of labeling costs and CO2 due to the smaller dataset
- statistics about the dataset before and after sampling
- histograms before and after filtering
- visualizations of the dataset
- nearest neighbors of retained images among the removed ones
Below you find a typical output folder structure.
|-- config
|   |-- config.yaml
|   |-- hydra.yaml
|   `-- overrides.yaml
|-- data
|   |-- embeddings.csv
|   `-- unique_embeddings.csv
|-- filenames
|   |-- corrupt_filenames.txt
|   |-- duplicate_filenames.txt
|   |-- removed_filenames.txt
|   `-- sampled_filenames.txt
|-- plots
|   |-- distance_distr_after.png
|   |-- distance_distr_before.png
|   |-- filter_decision_0.png
|   |-- filter_decision_166668.png
|   |-- filter_decision_250002.png
|   |-- filter_decision_333336.png
|   |-- filter_decision_416670.png
|   |-- filter_decision_83334.png
|   |-- scatter_pca.png
|   |-- scatter_pca_no_overlay.png
|   |-- scatter_umap.png
|   `-- scatter_umap_no_overlay.png
`-- report.pdf
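The filename lists are convenient for downstream scripts. The sketch below counts the entries of each list in the filenames directory, assuming each .txt file holds one filename per line; that layout is an assumption, and both helpers are hypothetical.

```python
from pathlib import Path


def read_filename_list(path):
    """Read a text file with one filename per line, skipping blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def summarize_output(output_dir):
    """Count the entries in each list under <output_dir>/filenames."""
    filenames_dir = Path(output_dir) / "filenames"
    summary = {}
    for kind in ("corrupt", "duplicate", "removed", "sampled"):
        list_file = filenames_dir / f"{kind}_filenames.txt"
        # A list may be absent if the corresponding step was disabled.
        summary[kind] = len(read_filename_list(list_file)) if list_file.exists() else 0
    return summary
```

Calling summarize_output on the host path mapped to /home/output_dir returns a dictionary with one count per list.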
Evaluation of the Sampling Process
Histograms and Plots
The report contains histograms of the pairwise distance between images before and after the sampling.
An example of such a histogram before and after filtering for the CamVid dataset consisting of 367 samples is shown below. We marked the region which is of special interest with an orange rectangle. Our goal is to make this histogram more symmetric by removing samples that lie very close to each other.
If we remove 25 samples (7%) out of the 367 samples of the CamVid dataset the histogram looks more symmetric as shown below. In our experiments, removing 7% of the dataset results in a model with higher validation set accuracy.
Why symmetric histograms are preferred: An asymmetric histogram can result from a dataset with either outliers or inliers. A heavy tail for low distances means that there is at least one high-density region with many samples very close to each other within the main cluster. Having such a high-density region can lead to biased models trained on this particular dataset. A heavy tail towards high distances shows that there is at least one high-density region outside the main cluster of samples.
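To reproduce this kind of analysis on your own embeddings, a minimal sketch follows. It computes all pairwise Euclidean distances, bins them, and reports the skewness as a rough symmetry measure. The distance metric, the binning, and the use of skewness are assumptions of this example and not necessarily what the report uses.

```python
import math
from itertools import combinations


def pairwise_distances(vectors):
    """Euclidean distance between every unordered pair of vectors."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]


def histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins between min and max."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid zero width if all equal
    counts = [0] * n_bins
    for value in values:
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


def skewness(values):
    """Third standardized moment: 0 for symmetric data, positive when the
    tail points towards high values, negative towards low values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Toy embeddings: a tight cluster of three points plus one distant sample.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
distances = pairwise_distances(embeddings)
print(histogram(distances, n_bins=5))
print(skewness(distances))
```

Comparing the histogram and skewness before and after sampling gives a quick numeric counterpart to the plots in the report.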
Retained/Removed Image Pairs
The report also displays examples of retained images with their nearest neighbor among the removed images. This is a good heuristic to see whether the number of retained samples is too small or too large: if the pairs are very different, this may be a sign that too many samples were removed. If the pairs are similar, more images can likely be removed.
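The pairing itself is straightforward to compute from embeddings. The sketch below finds, for each retained image, its closest removed image under Euclidean distance. The function name and the dictionary inputs are hypothetical, and the metric is an assumption of this example.

```python
import math


def nearest_removed_neighbors(retained, removed):
    """For each retained image, find the closest removed image.

    Both arguments map filenames to embedding vectors. Returns a list of
    (retained_name, removed_name, distance) tuples.
    """
    pairs = []
    for name, vector in retained.items():
        neighbor, neighbor_vector = min(
            removed.items(), key=lambda item: math.dist(vector, item[1])
        )
        pairs.append((name, neighbor, math.dist(vector, neighbor_vector)))
    return pairs


retained = {"kept_1.jpg": [0.0, 0.0], "kept_2.jpg": [10.0, 10.0]}
removed = {"gone_1.jpg": [0.0, 1.0], "gone_2.jpg": [9.0, 10.0]}
for pair in nearest_removed_neighbors(retained, removed):
    print(pair)
```

Large distances across many pairs hint that too much was removed; distances near zero hint that more could go.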
With the argument stopping_condition.n_samples=X you can set the number of samples which should be kept.
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=False \
    stopping_condition.n_samples=500
With the argument n_example_images you can determine how many pairs are shown. Note that this must be an even number.
docker run --gpus all --rm -it \
    -v INPUT_DIR:/home/input_dir:ro \
    -v OUTPUT_DIR:/home/output_dir \
    lightly/sampling:latest \
    token=MYAWESOMETOKEN \
    remove_exact_duplicates=True \
    enable_corruptness_check=False \
    stopping_condition.n_samples=0.3 \
    n_example_images=32