Datapool
=================

.. warning::
    **The Docker Archive documentation is deprecated**

    The old workflow described in these docs will not be supported with new Lightly Worker versions above 2.6.
    Please switch to our `new documentation page <https://docs.lightly.ai/docs>`_ instead.

The Lightly Datapool is a tool that lets you build up a dataset for your
project incrementally. It keeps track of the representations of previously
selected samples and uses this information to pick new samples in order to
maximize the quality of the final dataset. It can also be used to combine two
different datasets into one.

- | If you're interested in how the datapool works, go to
  | --> `How It Works`_

- | To see how you can use the datapool, check out
  | --> `Usage`_


How It Works
---------------

The Lightly Datapool keeps track of the selected samples in a CSV file called
`datapool_latest.csv`. It contains the filenames of the selected images, their
embeddings, and their weak labels. Additionally, after training a self-supervised
model, the datapool contains the checkpoint `checkpoint_latest.ckpt`, which was
used to generate the embeddings.

The datapool is located in the `shared` directory. In general, it is a directory
with the following structure:


.. code-block:: bash

    # example of a datapool
    datapool/
    +--- datapool_latest.csv
    +--- checkpoint_latest.ckpt
    +--- history/
  
The files `datapool_latest.csv` and `checkpoint_latest.ckpt` are updated after every
run of the Lightly Docker. The `history` folder contains the previous versions of
the datapool. This feature is meant to prevent accidental overwrites and can be
deactivated from the command line (see `Usage`_ for more information).
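
As a rough illustration of the layout above, the following Python sketch reports
which of the expected entries exist in a datapool directory. The helper name
`describe_datapool` is hypothetical and not part of the Lightly Docker; only the
file and folder names are taken from the structure shown above.

.. code-block:: python

    from pathlib import Path

    # File and folder names taken from the datapool structure described above.
    EXPECTED_FILES = ("datapool_latest.csv", "checkpoint_latest.ckpt")
    HISTORY_DIR = "history"


    def describe_datapool(datapool_dir):
        """Report which expected datapool entries exist (hypothetical helper)."""
        root = Path(datapool_dir)
        report = {name: (root / name).is_file() for name in EXPECTED_FILES}
        # The history folder is only expected when archiving is active.
        report[HISTORY_DIR] = (root / HISTORY_DIR).is_dir()
        return report

A missing `checkpoint_latest.ckpt` is not necessarily an error, since the
checkpoint is only described as present after a self-supervised model was trained.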

Usage
---------------

To **initialize** a datapool, pass the name of the datapool as an argument
to your docker run command and sample from a dataset as usual. The Lightly Docker
will automatically create a datapool directory and populate it with the required
files.

.. note:: To use the datapool feature, the Lightly Docker requires write access
          to a shared directory. This directory can be passed with the `-v` flag.

.. code-block:: console

   docker run --gpus all --rm -it \
      -v {INPUT_DIR}:/home/input_dir:ro \
      -v {SHARED_DIR}:/home/shared_dir \
      -v {OUTPUT_DIR}:/home/output_dir \
      lightly/worker:latest \
      token=MYAWESOMETOKEN \
      append_weak_labels=False \
      stopping_condition.min_distance=0.1 \
      datapool.name=my_datapool

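Since the datapool needs write access to the shared directory, it can be useful
to check this on the host before starting the container. The sketch below is one
possible way to do so in Python; the helper name `is_writable_dir` is hypothetical.
It checks access for the current host user only, which is an assumption: the user
inside the container may have different permissions.

.. code-block:: python

    import os
    import tempfile


    def is_writable_dir(path):
        """Return True if `path` is an existing directory we can create files in."""
        if not os.path.isdir(path):
            return False
        try:
            # Creating a temporary file (removed again on close) tests real
            # write access instead of relying on permission bits alone.
            with tempfile.TemporaryFile(dir=path):
                return True
        except OSError:
            return False

For example, `is_writable_dir("/path/to/shared_dir")` could be called with the
directory that is passed as `{SHARED_DIR}` in the command above.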

To **append** to your datapool, pass the name of an existing datapool as an argument.
The Lightly Docker will read the embeddings and filenames from the existing pool and
consider them during selection. Then, it will update the datapool and checkpoint files.

.. note:: You can't change the dimension of the embeddings once the datapool has
          been initialized, so choose carefully!

.. code-block:: console

   docker run --gpus all --rm -it \
      -v {OTHER_INPUT_DIR}:/home/input_dir:ro \
      -v {SHARED_DIR}:/home/shared_dir \
      -v {OUTPUT_DIR}:/home/output_dir \
      lightly/worker:latest \
      token=MYAWESOMETOKEN \
      append_weak_labels=False \
      stopping_condition.min_distance=0.1 \
      datapool.name=my_datapool

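Because the embedding dimension is fixed once the datapool exists, it may help to
look up the dimension of an existing pool before appending to it. The sketch below
counts embedding columns in the header of `datapool_latest.csv`. It assumes that
the file has a header row and that embedding columns are named `embedding_0`,
`embedding_1`, and so on; this naming scheme is an assumption and is not confirmed
by this page, so adjust `prefix` to match your file. The helper name
`embedding_dimension` is hypothetical.

.. code-block:: python

    import csv


    def embedding_dimension(csv_path, prefix="embedding_"):
        """Count header columns starting with `prefix` (assumed naming scheme)."""
        with open(csv_path, newline="") as f:
            # Only the header row is read; an empty file yields no columns.
            header = next(csv.reader(f), [])
        return sum(1 for column in header if column.startswith(prefix))

A result of `0` most likely means that the assumed column prefix does not match
the actual file.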

To **deactivate automatic archiving** of the past datapool versions, set the
flag `datapool.keep_history` to False.

.. code-block:: console

   docker run --gpus all --rm -it \
      -v {INPUT_DIR}:/home/input_dir:ro \
      -v {SHARED_DIR}:/home/shared_dir \
      -v {OUTPUT_DIR}:/home/output_dir \
      lightly/worker:latest \
      token=MYAWESOMETOKEN \
      append_weak_labels=False \
      stopping_condition.min_distance=0.1 \
      datapool.name=my_datapool \
      datapool.keep_history=False