Known Issues & FAQ

Install Docker with GPU Support

If you install Docker using apt-get install docker or by following the official Docker installation guide, you might end up with a version that does not support GPU drivers. Instead, you should follow the Docker installation docs by NVIDIA.

Furthermore, ensure you can run Docker as a non-root user (recommended for security).
Follow the instructions from the official Docker docs.

Here is a quick summary of the required shell commands:

  1. Set up the package repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
    && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
  2. Update the package lists
sudo apt-get update
  3. Install nvidia-docker2
sudo apt-get install -y nvidia-docker2
  4. Restart the Docker service
sudo systemctl restart docker
  5. Test the installation
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
  6. Make sure you can run Docker as a non-root user (recommended for security)
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
  7. Test that you can run Docker as a non-root user with GPU support
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Lightly Worker Is Slow When Working with Long Videos

We are working on this issue internally. For now, we suggest splitting long videos into chunks. You can do this with ffmpeg without losing quality, because the following command only copies the streams and does not re-encode them.

ffmpeg -i input.mp4 -c copy -map 0 -segment_time 01:00:00 -f segment -reset_timestamps 1 output%03d.mp4

What exactly happens here?

  • input.mp4 is your input video
  • -c copy -map 0 makes sure we copy all streams and don’t re-encode the video
  • -segment_time 01:00:00 -f segment defines that we want chunks of 1 h each
  • -reset_timestamps 1 ensures the timestamps are reset (each chunk starts from 0)
  • output%03d.mp4 is the name of the output videos (output001.mp4, output002.mp4, …)
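
If you have many long videos, a small shell loop can apply the same split to each file. This is only a sketch; it assumes all videos are .mp4 files in the current directory and writes the chunks as <name>_000.mp4, <name>_001.mp4, and so on:

# split every .mp4 in the current directory into 1 h chunks without re-encoding
for f in *.mp4; do
    ffmpeg -i "$f" -c copy -map 0 -segment_time 01:00:00 -f segment -reset_timestamps 1 "${f%.mp4}_%03d.mp4"
done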

Lightly Worker Crashes When Running with GPUs

Do you run a Docker container with --gpus all and encounter the following error?

Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

This might be because your Docker installation does not support GPUs. Try to install nvidia-docker following this guide.
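
To quickly check whether the NVIDIA container runtime is registered with your Docker installation, you can inspect the daemon information (just a sanity check):

# the output should list an "nvidia" runtime next to the default "runc"
docker info | grep -i runtimes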

Shared Memory Error When Running Lightly Worker

The following error message appears when the Docker runtime runs out of shared memory. By default, Docker uses 64 megabytes.

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/envs/env/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
File "/opt/conda/envs/env/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
File "/opt/conda/envs/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_31_1030151126>

To solve this problem, we need to reduce the number of workers or increase the shared memory of the Docker runtime.

Use fewer workers:
Lightly determines the number of available CPU cores and sets the number of workers to the same value. If your machine has many cores but comparatively little memory (e.g., less than 2 GB of memory per core), you may run out of memory; in that case, it is better to reduce the number of workers than to increase the shared memory.
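
For example, assuming the Lightly Worker accepts Hydra-style overrides on the command line (as in the other docker run examples in this section), you could cap the number of data-loading workers explicitly; the value 8 below is only an illustration:

# cap the number of data-loading workers at 8 (illustrative value)
docker run --gpus all --rm -it \
    lightly/worker:latest \
    worker.worker_id={MY_WORKER_ID} \
    lightly.loader.num_workers=8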

Increase the shared memory limit:
You can change the shared memory from 64 megabytes to 512 megabytes by adding --shm-size="512m" to the docker run command:

# example of docker run with setting shared memory to 512 MBytes
docker run --shm-size="512m" --gpus all

# you can also increase it to 2 Gigabytes using
docker run --shm-size="2G" --gpus all
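
If you want to verify that the limit is actually applied, you can check the size of /dev/shm inside a container. This is only a quick sanity check; the ubuntu image is used as a stand-in:

# /dev/shm should be reported with a size of 2.0G
docker run --rm --shm-size="2G" ubuntu df -h /dev/shm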

CUDA Error: All CUDA-capable devices are busy or unavailable

CUDA error: all CUDA-capable devices are busy or unavailable CUDA kernel
errors might be asynchronously reported at some other API call,so the
stacktrace below might be incorrect. For debugging consider
passing CUDA_LAUNCH_BLOCKING=1.

This error is most likely because some processes on your machine reserved resources on the GPU without properly releasing them. A simple reboot will often resolve the problem, as all GPU resources will be freshly allocated during the reboot. However, if a reboot does not help, we suggest switching to a different CUDA version on your system.
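
Before rebooting, it can be worth checking which processes still hold GPU memory. nvidia-smi lists them, and you can then stop a stale process manually (the PID below is just a placeholder):

# list processes that currently hold GPU resources
nvidia-smi

# optionally stop a stale process by its PID (placeholder PID)
kill -9 <PID>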

Lightly Worker Crashes Because of Too Many Open Files

The following error message appears when the Docker runtime lacks enough file handlers. By default, Docker uses nofile=1024. However, this might not be enough when using multiple workers for data fetching with lightly.loader.num_workers.

Error [Errno 24] Too many open files

To solve this problem, increase the number of file handlers for the Docker runtime.

You can change the number of file handlers to 90000 by adding --ulimit nofile=90000:90000 to the docker run command:

docker run --ulimit nofile=90000:90000 --gpus all

More documentation on docker file handlers is provided here.
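
To double-check that the new limit is active inside the container, you can print the soft limit for open files. This is only a minimal sketch; the ubuntu image is used as a stand-in:

# should print 90000
docker run --rm --ulimit nofile=90000:90000 ubuntu bash -c 'ulimit -n'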

Token Printed to Shared Stdout or Logs

The token (along with other Hydra configurations) will be printed to stdout and, therefore, appear in logs in an automated setup. This can be avoided by setting your token via the LIGHTLY_TOKEN environment variable:

docker run --shm-size="1024m" --gpus all --rm -it \
    -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
    lightly/worker:latest \
    worker.worker_id={MY_WORKER_ID}
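
If you prefer to keep the token out of the command line entirely (where it could still end up in your shell history), Docker's --env-file flag is an alternative. This sketch assumes a file named lightly.env that you keep out of version control:

# contents of lightly.env
LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN}

# pass the file to the container instead of using -e
docker run --shm-size="1024m" --gpus all --rm -it \
    --env-file lightly.env \
    lightly/worker:latest \
    worker.worker_id={MY_WORKER_ID}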

Locked Jobs

Lightly tries to protect different Lightly Workers and their runs from each other and ensures that only one job can be active per dataset, while still allowing multiple Lightly Workers (with different worker_ids) to run in parallel.
In certain situations a job can remain locked, e.g., when the Lightly Worker crashes or is forcefully shut down without a grace period to clean up while executing a job. From Lightly's point of view, the job is then still being processed and the Lightly Worker is still considered online.

This can lead to a warning message such as Found 4 LOCKED jobs of the worker with id when starting a Lightly Worker with the same worker_id.

Resolving Locked Jobs

By default, worker.force_start is set to True, which bypasses our detection of a potential second Lightly Worker that is still running and can result in the aforementioned warning. In this case, we do not cancel any locked jobs; the Lightly Worker starts normally and takes the next job.

Setting worker.force_start to False does not bypass our detection of a potential second Lightly Worker. If a second Lightly Worker with the same worker_id is detected, your new Lightly Worker will not process any jobs. If no other Lightly Worker is running, all locked/failed jobs previously handled by that worker_id are cleaned up, and the Lightly Worker starts normally and takes the next job.
In a multi-Lightly Worker setup, it is recommended to set worker.force_start to False.

docker run --shm-size="1024m" --gpus all --rm -it \
    -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
    lightly/worker:latest \
    worker.worker_id={MY_WORKER_ID} \
    worker.force_start=False