Datapool

Warning

The Docker Archive documentation is deprecated

The old workflow described in these docs will not be supported with new Lightly Worker versions above 2.6. Please switch to our new documentation page instead.

The Lightly Datapool is a tool which allows users to incrementally build up a dataset for their project. It keeps track of the representations of previously selected samples and uses this information to pick new samples in order to maximize the quality of the final dataset. It also allows for combining two different datasets into one.

  • If you’re interested in how the datapool works, go to How It Works.
  • To see how you can use the datapool, check out Usage.

How It Works

The Lightly Datapool keeps track of the selected samples in a CSV file called datapool_latest.csv. This file contains the filenames of the selected images, their embeddings, and their weak labels. Additionally, after a self-supervised model has been trained, the datapool contains the checkpoint checkpoint_latest.ckpt that was used to generate the embeddings.

The datapool is located in the shared directory. In general, it is a directory with the following structure:

# example of a datapool
datapool/
+--- datapool_latest.csv
+--- checkpoint_latest.ckpt
+--- history/

The files datapool_latest.csv and checkpoint_latest.ckpt are updated after every run of the Lightly Docker. The history folder contains the previous versions of the datapool. This feature is meant to prevent accidental overwrites and can be deactivated from the command line (see Usage for more information).
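As a quick sanity check, the layout described above can be verified from the host with a small shell function. This is a minimal sketch and not part of the Lightly tooling: the name check_datapool is hypothetical, and treating the checkpoint and the history folder as optional is an assumption based on the description above.

```shell
# check_datapool DIR (hypothetical helper, not a Lightly command)
# Returns 0 if DIR contains datapool_latest.csv, 1 otherwise.
check_datapool() {
    pool_dir="$1"
    status=0

    # The CSV with filenames, embeddings, and weak labels is the core file.
    if [ ! -f "$pool_dir/datapool_latest.csv" ]; then
        echo "missing: datapool_latest.csv"
        status=1
    fi

    # Assumption: the checkpoint only exists after a self-supervised
    # model has been trained, so its absence is reported but not an error.
    if [ ! -f "$pool_dir/checkpoint_latest.ckpt" ]; then
        echo "note: no checkpoint_latest.ckpt found"
    fi

    # Assumption: history/ may be absent, for example if archiving is off.
    if [ ! -d "$pool_dir/history" ]; then
        echo "note: no history/ directory found"
    fi

    return "$status"
}
```

For example, if the datapool was created as my_datapool directly inside the shared directory (its exact location within the shared directory is an assumption here), the check could be run as check_datapool "{SHARED_DIR}/my_datapool".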

Usage

To initialize a datapool, pass the name of the datapool as an argument (datapool.name) to your docker run command and sample from a dataset as usual. The Lightly Docker will automatically create a datapool directory and populate it with the required files.

Note

To use the datapool feature, the Lightly Docker requires write access to a shared directory. This directory can be passed with the -v flag.

docker run --gpus all --rm -it \
   -v {INPUT_DIR}:/home/input_dir:ro \
   -v {SHARED_DIR}:/home/shared_dir \
   -v {OUTPUT_DIR}:/home/output_dir \
   lightly/worker:latest \
   token=MYAWESOMETOKEN \
   append_weak_labels=False \
   stopping_condition.min_distance=0.1 \
   datapool.name=my_datapool
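Because the run needs write access to the shared directory, it can help to check this on the host before starting the container. The following is a minimal sketch assuming a POSIX shell. The helper ensure_writable_dir is hypothetical, and a successful check on the host does not guarantee that the user inside the container has the same permissions.

```shell
# ensure_writable_dir DIR (hypothetical helper)
# Creates DIR if needed and returns 0 only if the current user can write to it.
ensure_writable_dir() {
    dir="$1"
    # Create the directory (and parents) if it does not exist yet.
    mkdir -p "$dir" || return 1
    # Succeeds only if the directory is writable by the current user.
    [ -w "$dir" ]
}
```

A typical call before docker run would be ensure_writable_dir "{SHARED_DIR}" || echo "shared directory is not writable", with {SHARED_DIR} replaced by the actual path.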

To append to your datapool, pass the name of an existing datapool as an argument. The Lightly Docker will read the embeddings and filenames from the existing pool and consider them during selection. Then, it will update the datapool and checkpoint files.

Note

You can’t change the dimension of the embeddings once the datapool has been initialized, so choose carefully!

docker run --gpus all --rm -it \
   -v {OTHER_INPUT_DIR}:/home/input_dir:ro \
   -v {SHARED_DIR}:/home/shared_dir \
   -v {OUTPUT_DIR}:/home/output_dir \
   lightly/worker:latest \
   token=MYAWESOMETOKEN \
   append_weak_labels=False \
   stopping_condition.min_distance=0.1 \
   datapool.name=my_datapool
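One rough way to confirm that an append run has changed the datapool is to compare the number of lines in datapool_latest.csv before and after the run. This is only a sketch: it assumes the file holds one line per selected sample (possibly plus a header line), which the description above suggests but does not state, and datapool_line_count is a hypothetical helper.

```shell
# datapool_line_count FILE (hypothetical helper)
# Prints the number of lines in FILE; returns 1 if FILE does not exist.
datapool_line_count() {
    csv_file="$1"
    if [ ! -f "$csv_file" ]; then
        echo "no such file: $csv_file" >&2
        return 1
    fi
    # NR is the number of records (lines) awk has read at end of input.
    awk 'END { print NR }' "$csv_file"
}
```

Running the function on datapool_latest.csv once before and once after the second docker run gives two counts to compare. Whether the difference equals the number of newly selected samples depends on the assumed file layout.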

To deactivate automatic archiving of past datapool versions, set the flag datapool.keep_history to False.

docker run --gpus all --rm -it \
   -v {INPUT_DIR}:/home/input_dir:ro \
   -v {SHARED_DIR}:/home/shared_dir \
   -v {OUTPUT_DIR}:/home/output_dir \
   lightly/worker:latest \
   token=MYAWESOMETOKEN \
   append_weak_labels=False \
   stopping_condition.min_distance=0.1 \
   datapool.name=my_datapool \
   datapool.keep_history=False
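With archiving deactivated, previous datapool versions are no longer kept, so a manual copy before each run is one way to retain protection against accidental overwrites. The sketch below assumes a POSIX shell. The helper backup_datapool and the .bak.<timestamp> naming are hypothetical and not part of the Lightly Docker.

```shell
# backup_datapool DIR (hypothetical helper)
# Copies DIR to DIR.bak.<timestamp> and prints the path of the copy.
backup_datapool() {
    pool_dir="${1%/}"   # strip a trailing slash, if any
    if [ ! -d "$pool_dir" ]; then
        echo "not a directory: $pool_dir" >&2
        return 1
    fi

    backup_dir="${pool_dir}.bak.$(date +%Y%m%d-%H%M%S)"
    # Refuse to reuse an existing backup path (e.g. two calls in one second).
    if [ -e "$backup_dir" ]; then
        echo "backup already exists: $backup_dir" >&2
        return 1
    fi

    cp -R "$pool_dir" "$backup_dir" && echo "$backup_dir"
}
```

The function prints the location of the copy, so the output can be stored and used to restore the datapool by copying the directory back.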