Let’s have a look at how to run the Docker container to analyze and filter the well-known ImageNet dataset. You can reproduce the sample report with the following command.

docker run --gpus all --rm -it \
    -v /datasets/imagenet/train/:/home/input_dir:ro \
    -v /datasets/docker_imagenet_500k:/home/output_dir \
    --ipc="host" \
    lightly/sampling:latest \
    lightly.collate.input_size=64 \
    lightly.loader.batch_size=256 \
    lightly.loader.num_workers=8 \
    lightly.trainer.max_epochs=0 \
    stopping_condition.n_samples=500000 \
    remove_exact_duplicates=True
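
If you want to adapt the run to your own data, one option is to wrap the command in a small script so the paths and the sample budget are easy to change. The sketch below only assembles and prints the command; the `INPUT_DIR`, `OUTPUT_DIR`, and `N_SAMPLES` variables are conveniences introduced here, not part of the Lightly CLI.

```shell
# Placeholders for your own setup -- adjust these three values.
INPUT_DIR=/datasets/imagenet/train
OUTPUT_DIR=/datasets/docker_imagenet_500k
N_SAMPLES=500000

# Assemble the same command as above, with the paths parameterized.
CMD="docker run --gpus all --rm -it \
    -v ${INPUT_DIR}:/home/input_dir:ro \
    -v ${OUTPUT_DIR}:/home/output_dir \
    --ipc=host \
    lightly/sampling:latest \
    lightly.collate.input_size=64 \
    lightly.loader.batch_size=256 \
    lightly.loader.num_workers=8 \
    lightly.trainer.max_epochs=0 \
    stopping_condition.n_samples=${N_SAMPLES} \
    remove_exact_duplicates=True"

# Print the command for inspection before running it.
echo "$CMD"
```

Echoing the assembled command first lets you double-check the mounts (the input directory is mounted read-only) before launching a multi-hour job.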

The complete processing time was 4h 37m 02s. The machine used for this experiment was a cloud instance with 8 CPU cores, 30 GB of RAM, and a V100 GPU; the dataset was stored on an SSD.
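
To put the wall time in perspective, a back-of-the-envelope throughput calculation is straightforward. The ImageNet train split size of 1,281,167 images is an assumption here (it is the standard figure for ILSVRC-2012 train, not a number stated in the report):

```shell
# Convert 04h 37m 02s to seconds.
TOTAL_SECONDS=$((4 * 3600 + 37 * 60 + 2))

# Assumed input size: the full ImageNet train split.
N_IMAGES=1281167

echo "wall time: ${TOTAL_SECONDS}s"
echo "throughput: $((N_IMAGES / TOTAL_SECONDS)) images/s"
```

Under that assumption the run processed roughly 77 images per second end to end, including embedding and sampling.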

You can also use the direct link for the ImageNet report.