.. _rst-docker-first-steps:

First Steps
===================================

.. warning::
    **The Docker Archive documentation is deprecated**

    The old workflow described in these docs will not be supported with new
    Lightly Worker versions above 2.6. Please switch to our
    `new documentation page `_ instead.

The Lightly Docker solution follows a train, embed, select flow using
self-supervised learning:

.. code-block:: console

    +-------+      +-------+      +--------+
    | Train +----->+ Embed +----->+ Select |
    +-------+      +-------+      +--------+

#. You can either use a pre-trained model from the model zoo or fine-tune
   a model on your unlabeled dataset using self-supervised learning. The
   output of the train step is a model checkpoint.

#. The embed step creates embeddings of the input dataset. Each sample gets
   represented using a low-dimensional vector. The output of the embed step
   is a .csv file.

#. Finally, based on the embeddings and additional information, we can use
   one of the selection strategies to pick the relevant data for you. The
   output of the select step is a list of filenames as well as analytics in
   the form of a PDF report with plots.

You can also use each of the three steps independently. For example, you can
use the Lightly Docker to embed a dataset and then train a linear classifier
on top of the embeddings.

The docker solution can be used as a command-line interface. You run the
container, tell it where to find data, and where to store the result. That's
it. There are various parameters you can pass to the container. We put a lot
of effort into also exposing the full Lightly SSL framework configuration,
so you could use the docker solution to train a self-supervised model
instead of using the Python framework.

Before jumping into the details, let's have a look at some basics. The
docker container can be used as a simple script. You can control parameters
by changing flags. Use the following command to get an overview of the
available parameters:

.. code-block:: console

    docker run --gpus all --rm -it lightly/worker:latest --help

.. note:: In case the command fails because docker does not detect your GPU,
          make sure `nvidia-docker` is installed. You can follow the guide
          `here `_.


Storage Access
-----------------------------------

We use the volume mapping provided by the docker run command to process
datasets. A docker container itself is not considered to be a good place to
store your data; volume mapping allows the container to work with the
filesystem of the host system.

There are **three** types of volume mappings:

* **Input Directory:**
  The input directory contains the dataset we want to process. The format of
  the input data should be either a single folder containing all the images
  or a folder containing a subfolder which holds the images. See the
  tutorial :ref:`input-structure-label` for more information. The container
  has only **read access** to this directory (note the *:ro* at the end of
  the volume mapping). Instead of using a local input directory, you can
  also use a cloud storage bucket on S3, GCS, or Azure as a remote
  datasource. For reference, head to :ref:`ref-docker-with-datasource`.
* **Shared Directory:**
  The shared directory allows the user to pass additional inputs such as
  embeddings or model checkpoints to the container. The checkpoints should
  be generated by the lightly Python package or by the docker container, and
  the embeddings should be in the format specified in the tutorial
  "Structure Your Input". The container requires only **read access** to
  this directory.
* **Output Directory:**
  The output directory is the place where the results from all computations
  made by the container are stored. See `Reporting`_ and `Docker Output`_
  for additional information. The container requires **read and write
  access** to this directory.

.. note:: Docker volume or port mappings always follow the scheme that you
          first specify the host system's path or port, followed by the
          internal path or port of the container. E.g.
          **-v /datasets:/home/datasets** would mount */datasets* from your
          system to */home/datasets* in the docker container.

Typically, your docker command would start like this:

- Map *{INPUT_DIR}* (from your system) to */home/input_dir* in the
  container, e.g. */path/to/my/cat/dataset:/home/input_dir:ro*
- Map *{OUTPUT_DIR}* (from your system) to */home/output_dir* in the
  container, e.g. */path/where/I/want/the/docker/output:/home/output_dir*
- Specify the token to authenticate your user

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN

Now, let's see how this will look in action!

.. note:: Learn how to obtain your :ref:`ref-authentication-token`.

.. warning:: Don't forget to replace **{INPUT_DIR}** and **{OUTPUT_DIR}**
             with the paths to your local input and output directories. You
             must not change the path after the **:** since it describes the
             internal file system within the container!

When running the above docker command, you will find a new folder named
after the current date and time in the {OUTPUT_DIR} folder. This can be
inconvenient if you want to run the docker in an automated pipeline, as the
current date and time change. Using the **run_directory** parameter, you can
use a custom and deterministic output folder. The following docker run
command would, for example, store the output in the
*{OUTPUT_DIR}/docker_out* folder.

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        run_directory="docker_out"


Specify Relevant Files
----------------------------

Oftentimes not all files in a directory are relevant. In that case, it's
possible to pass a list of filenames to the Lightly docker using the
`relevant_filenames_file` configuration option. It will then only consider
the listed filenames and ignore all others. To do so, you can create a text
file which contains one relevant filename per line and then pass its path to
the docker run command. This works for videos and images.

For example, if this is your input directory:

.. code-block:: console

    /path/to/my/data/
    +-- my-video.mp4
    +-- my-other-video.mp4
    +-- some/subfolder/
        +-- my-third-video.mp4

Then you can specify two input files by creating the following
**filenames.txt**:

.. code-block:: console

    my-video.mp4
    some/subfolder/my-third-video.mp4

If you use a cloud bucket as input datasource, upload the file to it and
copy the path of the file relative to the datasource root. If you use a
cloud bucket and specified a separate input and output bucket, put the file
in the `.lightly` folder of the output bucket and copy the path of the file
relative to the output datasource root. E.g. if your dataset is at
`path/to/dataset` and your relevant_filenames.txt at
`path/to/dataset/subdir/relevant_filenames.txt`, then copy the path
`subdir/relevant_filenames.txt`. If you use a local input directory, place
the file in the shared directory and copy the path relative to it.

Then you can add `relevant_filenames_file='subdir/relevant_filenames.txt'`
to the docker run command and the Lightly docker will only consider
**my-video.mp4** and **my-third-video.mp4**.
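Putting this together, below is a minimal sketch of the full command for a
local setup. It assumes the file was placed at
`subdir/relevant_filenames.txt` inside the shared directory; adjust the
paths to your setup.

.. code-block:: console

    # sketch: only the files listed in subdir/relevant_filenames.txt are considered
    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {SHARED_DIR}:/home/shared_dir \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        relevant_filenames_file='subdir/relevant_filenames.txt'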
Embedding a Dataset and Selecting from it
-----------------------------------------

To embed your images with a pre-trained model, you can run the docker
solution with this command:

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        remove_exact_duplicates=True \
        enable_corruptness_check=True \
        stopping_condition.n_samples=0.3

The command above does the following:

- **remove_exact_duplicates=True** removes exact duplicates from your
  dataset.

- **enable_corruptness_check=True** checks your dataset for corrupt images.

- **stopping_condition.n_samples=0.3** selects 30% of the images using the
  default method (coreset). Selecting 30% means that the remaining dataset
  will be 30% of the initial dataset size. You can also specify the exact
  number of remaining images by setting **n_samples** to an integer value.

- **stopping_condition.min_distance=0.2** would remove all samples which are
  closer to each other than 0.2. This allows you to specify the minimum
  allowed distance between two image embeddings in the output dataset. After
  normalizing the input embeddings to unit length, this value should be
  between 0 and 2. This is often a more convenient method when working with
  different data sources and trying to combine them in a balanced way.

By default, the docker only creates an output file with the selected
filenames for you. You can also tell the program to copy the selected files
into the output folder by adding the parameter **dump_dataset=True** to the
command.


Train a Self-Supervised Model
-----------------------------------

Sometimes it may be beneficial to finetune a self-supervised model on your
dataset before embedding the images. This may be the case when the dataset
is from a specific domain (e.g. medical images).

The command below will **train a self-supervised model** for (default: 100)
epochs on the images stored in the input directory before embedding the
images and selecting from them.

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        enable_training=True

The training of the model is identical to using the lightly open-source
package with the following command:

.. code-block:: console

    lightly-train input_dir={INPUT_DIR}

**Checkpoints** from your training process will be stored in the output
directory. You can continue training from such a checkpoint by copying the
checkpoint to the shared directory and then passing the checkpoint filename
to the container:

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {SHARED_DIR}:/home/shared_dir \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        stopping_condition.n_samples=0.3 \
        enable_training=True \
        checkpoint=lightly_epoch_99.ckpt
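The copy step that has to happen before the run above could look like the
following sketch. It assumes the previous run used
`run_directory="docker_out"` and produced the checkpoint
`lightly_epoch_99.ckpt`; both names are placeholders for your actual run
folder and checkpoint filename.

.. code-block:: console

    # copy the checkpoint of a previous run into the shared directory so the
    # container can resume training from it (paths are placeholders)
    cp {OUTPUT_DIR}/docker_out/lightly_epoch_99.ckpt {SHARED_DIR}/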
You may not always want to train for exactly 100 epochs with the default
settings. The next section will explain how to customize them.


Accessing Lightly Input Parameters
-----------------------------------

The docker container is a wrapper around the lightly Python package. Hence,
for training and embedding, the user can access all the settings from the
lightly command-line tool. Just prepend the parameter with **lightly** to do
so.

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        remove_exact_duplicates=True \
        enable_corruptness_check=True \
        stopping_condition.n_samples=0.3 \
        enable_training=True \
        lightly.trainer.max_epochs=10 \
        lightly.collate.input_size=64 \
        lightly.loader.batch_size=256 \
        lightly.trainer.precision=16 \
        lightly.model.name=resnet-101

A list of all input parameters can be found here:
:ref:`rst-docker-parameters`

.. _docker-sampling-from-embeddings:

Selecting from Embeddings File
----------------------------------

It is also possible to select directly from embedding files generated by
previous runs. For this, move the embeddings file to the shared directory
and specify the filename like so:

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {SHARED_DIR}:/home/shared_dir \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        remove_exact_duplicates=True \
        enable_corruptness_check=False \
        stopping_condition.n_samples=0.3 \
        embeddings=my_embeddings.csv

The embeddings file should follow the structure of the .csv file created by
the lightly CLI (:ref:`ref-cli-embeddings-lightly`) or the format described
in :ref:`ref-docker-meta-information`.

Manually Inspecting the Embeddings
----------------------------------

Every time you run Lightly Docker you will find an `embeddings.csv` file in
the output directory. This file contains the embeddings of all samples in
your dataset. You can use the embeddings for clustering or manual inspection
of your dataset.

.. figure:: images/colab_embeddings_example.png
    :align: center
    :alt: Example plot of working with embeddings.csv

    Example plot of working with embeddings.csv

We provide an `example notebook `_ to learn more about how to work with the
embeddings.

Selecting from Video Files
--------------------------

In case you are working with video files, it is possible to point the docker
container directly to the video files. This avoids having to extract the
individual frames beforehand. To do so, simply store all videos you want to
work with in a single directory; the lightly software will automatically
load all frames from the videos.

.. code-block:: console

    # work on a single video
    data/
    +-- my_video.mp4

    # work on several videos
    data/
    +-- my_video_1.mp4
    +-- my_video_2.avi

As you can see, the videos do not need to be in the same file format. An
example command for a folder structure as shown above could then look like
this:

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {SHARED_DIR}:/home/shared_dir \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        stopping_condition.n_samples=0.3

Where {INPUT_DIR} is the path to the directory containing the video files.

You can let Lightly Docker automatically extract the selected frames and
save them in the output folder using `dump_dataset=True`.

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {SHARED_DIR}:/home/shared_dir \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        stopping_condition.n_samples=0.3 \
        dump_dataset=True

.. note:: The `dump_dataset` feature by default saves the images in the
          `png` format. This can take a lot of time when working with
          high-resolution videos. You can speed up the process by specifying
          the output format `output_image_format='jpg'` or the resolution
          `output_image_size=X` of the images.
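As a sketch, a run that dumps the selected frames as resized JPGs could look
like this; the value `512` for `output_image_size` is purely illustrative.

.. code-block:: console

    # sketch: dump selected frames as JPGs instead of full-resolution PNGs;
    # output_image_size=512 is an illustrative value
    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {SHARED_DIR}:/home/shared_dir \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        stopping_condition.n_samples=0.3 \
        dump_dataset=True \
        output_image_format='jpg' \
        output_image_size=512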
Removing Exact Duplicates
---------------------------

With the docker solution, it is possible to remove **only exact duplicates**
from the dataset. For this, simply set the stopping condition `n_samples` to
1.0 (which translates to 100% of the data). The exact command is:

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {SHARED_DIR}:/home/shared_dir \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        remove_exact_duplicates=True \
        stopping_condition.n_samples=1.

.. _ref-docker-upload-to-platform:

Upload Sampled Dataset To Lightly Platform
------------------------------------------

Lightly Docker can automatically push the selected dataset as well as its
embeddings to the Lightly Platform.

Imagine you have a dataset of 100 videos with 10'000 frames each, i.e.
1 million frames in total. Using Lightly Docker and the coreset method, we
select the most diverse 50'000 images (a reduction of 20x). Now we push the
50'000 images to the Lightly Platform for a more interactive analysis. We
can access all metadata as well as the embedding view to explore the
dataset, find clusters, and further curate the dataset. Finally, we can use
the Active Learning capabilities of the Lightly Platform to iteratively
train, predict, and label the dataset in chunks until we reach the desired
model accuracy.

To push the selected dataset automatically after running Lightly Docker, you
can append `upload_dataset=True` to the docker run command. E.g.

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {SHARED_DIR}:/home/shared_dir \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        stopping_condition.n_samples=50000 \
        stopping_condition.min_distance=0.3 \
        upload_dataset=True

You can upload only thumbnails (to save bandwidth) or only metadata (for
privacy-sensitive data) by adding the argument `lightly.upload=thumbnails`
or `lightly.upload=meta`.

.. note:: You must specify the stopping condition `n_samples` and set the
          value below 75'000 (the current limit of a dataset in the Lightly
          Platform). We recommend setting both stopping conditions
          (`min_distance` and `n_samples`), in which case selection stops as
          soon as the first condition is met.
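For privacy-sensitive data, a minimal sketch of a metadata-only upload could
look like this:

.. code-block:: console

    # sketch: push only metadata (no image content) to the Lightly Platform
    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        stopping_condition.n_samples=50000 \
        upload_dataset=True \
        lightly.upload=meta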
Reporting
-----------------------------------

To facilitate sustainability and reproducibility in ML, the docker container
has an integrated reporting component. For every dataset you run through the
container, an output directory gets created with the exact configuration
used for the experiment. Additionally, plots, statistics, and more
information collected during the various processing steps are provided, e.g.
information about the corruptness check, the embedding process, and the
selection process.

To make it easier for you to understand and discuss the dataset, we put the
essential information into an automatically generated PDF report. Sample
reports can be found on the `Lightly website `_.

.. _ref-docker-runs:

Live View of Docker Status
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can get a live status update of the currently running docker runs
through the `cloud platform `_.

To use the new feature simply follow the steps:

#. Make sure you have the latest docker version installed
   (see :ref:`ref-docker-download-and-install`)
#. Open a browser and navigate to the `Lightly Platform `_
#. In the navigation menu on the top, click on **My Docker Runs**
#. Once you start the Lightly Docker, you should see the dashboard of the
   current run. Please make sure that you use the same token for the docker
   run as you find in the dashboard.

In the dashboard, you see a list of your docker runs and a live update of
the active runs. Use this view to see whether the data selection is still
running as expected.

.. image:: images/docker_runs_overview.png

.. note:: Note that only status updates and error messages are transmitted.

Docker Output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The output directory is structured in the following way:

* config:
  A directory containing copies of the configuration files and overwrites.
* data:
  The data directory contains everything to do with data.

  * If `enable_corruptness_check=True`, it will contain a "clean" version
    of the dataset.
  * If `remove_exact_duplicates=True`, it will contain a copy of the
    `embeddings.csv` where all duplicates are removed. Otherwise, it will
    simply store the embeddings computed by the model.

* filenames:
  This directory contains lists of filenames of the corrupt images, removed
  images, selected images, and the images which were removed because they
  have an exact duplicate in the dataset.
* plots:
  A directory containing the plots which were produced for the report.
* report.pdf:
  To provide a simple overview of the filtering process, the docker
  container automatically generates a report. The report contains

  * information about the job (duration, processed files etc.)
  * estimated savings in terms of labeling costs and CO2 due to the smaller
    dataset
  * statistics about the dataset before and after the selection process
  * histograms before and after filtering
  * visualizations of the dataset
  * nearest neighbors of retained images among the removed ones

* **NEW** report.json:
  The report is also available as a report.json file. Any value from the PDF
  report can easily be accessed.

Below you find a typical output folder structure.

.. code-block:: console

    |-- config
    |   |-- config.yaml
    |   |-- hydra.yaml
    |   '-- overrides.yaml
    |-- data
    |   |-- al_score_embeddings.csv
    |   |-- bounding_boxes.json
    |   |-- bounding_boxes_examples
    |   |-- embeddings.csv
    |   |-- normalized_embeddings.csv
    |   |-- sampled
    |   '-- selected_embeddings.csv
    |-- filenames
    |   |-- corrupt_filenames.txt
    |   |-- duplicate_filenames.txt
    |   |-- removed_filenames.txt
    |   '-- sampled_filenames.txt
    |-- lightly_epoch_1.ckpt
    |-- plots
    |   |-- distance_distr_after.png
    |   |-- distance_distr_before.png
    |   |-- filter_decision_0.png
    |   |-- filter_decision_11.png
    |   |-- filter_decision_22.png
    |   |-- filter_decision_33.png
    |   |-- filter_decision_44.png
    |   |-- filter_decision_55.png
    |   |-- pretagging_histogram_after.png
    |   |-- pretagging_histogram_before.png
    |   |-- scatter_pca.png
    |   |-- scatter_pca_no_overlay.png
    |   |-- scatter_umap_k_15.png
    |   |-- scatter_umap_k_15_no_overlay.png
    |   |-- scatter_umap_k_5.png
    |   |-- scatter_umap_k_50.png
    |   |-- scatter_umap_k_50_no_overlay.png
    |   '-- scatter_umap_k_5_no_overlay.png
    |-- report.json
    '-- report.pdf
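Since report.json is plain JSON, its values can be consumed
programmatically. A minimal sketch using `jq` (assuming it is installed on
the host and the run used `run_directory="docker_out"`):

.. code-block:: console

    # sketch: pretty-print the machine-readable report of a run
    jq '.' {OUTPUT_DIR}/docker_out/report.json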
Evaluation of the Selection Process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**Histograms and Plots**

The report contains histograms of the pairwise distances between images
before and after the selection process. An example of such a histogram
before and after filtering for the CamVid dataset consisting of 367 samples
is shown below. We marked the region which is of special interest with an
orange rectangle.

Our goal is to make this histogram more symmetric by removing samples which
are a short distance from each other. If we remove 25 samples (7%) out of
the 367 samples of the CamVid dataset, the histogram looks more symmetric,
as shown below. In our experiments, removing 7% of the dataset results in a
model with higher validation set accuracy.

.. image:: images/histogram_before_after.jpg

.. note:: Why symmetric histograms are preferred: An asymmetric histogram
          can be the result of a dataset containing either outliers or
          inliers. A heavy tail for low distances means that there is at
          least one high-density region with many samples very close to
          each other within the main cluster. Having such a high-density
          region can lead to biased models trained on this particular
          dataset. A heavy tail towards high distances shows that there is
          at least one high-density region outside the main cluster of
          samples.

**Retained/Removed Image Pairs**

The report also displays examples of retained images with their nearest
neighbor among the removed images. This is a good heuristic to see whether
the number of retained samples is too small or too large: If the pairs are
very different, this may be a sign that too many samples were removed. If
the pairs are very similar, this suggests that more images could be removed.

With the argument stopping_condition.n_samples=X you can set the number of
samples which should be kept.

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        remove_exact_duplicates=True \
        enable_corruptness_check=False \
        stopping_condition.n_samples=500

With the argument n_example_images you can determine how many pairs are
shown. Note that this must be an even number.

.. code-block:: console

    docker run --gpus all --rm -it \
        -v {INPUT_DIR}:/home/input_dir:ro \
        -v {OUTPUT_DIR}:/home/output_dir \
        lightly/worker:latest \
        token=MYAWESOMETOKEN \
        remove_exact_duplicates=True \
        enable_corruptness_check=False \
        stopping_condition.n_samples=0.3 \
        n_example_images=32