Known Issues & FAQ
Install Docker with GPU Support
If you install Docker with `apt-get install docker` or by following the official Docker installation guide, you might end up with a version that does not support GPU drivers. Instead, follow the Docker installation docs by NVIDIA.
Furthermore, ensure you can run Docker as a non-root user (recommended for security). Follow the instructions from the official Docker docs.
Here is a quick summary of the required shell commands:
- Set up the package repository
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
- Update the repository
```bash
sudo apt-get update
```
- Install `nvidia-docker`
```bash
sudo apt-get install -y nvidia-docker2
```
- Restart the Docker service
```bash
sudo systemctl restart docker
```
- Test the installation
```bash
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
- Make sure you can run Docker as non-root (recommended for security)
```bash
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```
- Test that you can run Docker as non-root with GPU support
```bash
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```
Lightly Worker Is Slow When Working with Long Videos
We are working on this issue internally. For now, we suggest splitting large videos into chunks. You can do this with `ffmpeg` without losing quality. The following command splits the video so that no re-encoding is needed.
```bash
ffmpeg -i input.mp4 -c copy -map 0 -segment_time 01:00:00 -f segment -reset_timestamps 1 output%03d.mp4
```
What exactly happens here?
- `input.mp4` is your input video.
- `-c copy -map 0` makes sure we copy all streams and don't re-encode the video.
- `-segment_time 01:00:00 -f segment` defines that we want chunks of 1 hour each.
- `-reset_timestamps 1` ensures the timestamps are reset (each chunk starts from 0).
- `output%03d.mp4` is the naming pattern for the output videos (output000.mp4, output001.mp4, …).
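If you have many long videos, a small shell loop can apply the same command to a whole directory. This is a minimal sketch; the file pattern, output directory, and chunk length are assumptions to adapt to your data:
```bash
# Split every .mp4 in the current directory into 1-hour chunks without
# re-encoding; outputs land in chunks/ as <name>_000.mp4, <name>_001.mp4, ...
mkdir -p chunks
for f in *.mp4; do
  ffmpeg -i "$f" -c copy -map 0 -segment_time 01:00:00 -f segment \
    -reset_timestamps 1 "chunks/${f%.mp4}_%03d.mp4"
done
```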
Lightly Worker Uses Too Much Memory and Crashes
When running the Lightly Worker and observing the logs, you might encounter messages reporting a high percentage of memory consumption.
```
[2023-02-14 08:45:40,634] Memory consumption is at 98.6%.
```
To reduce memory consumption, we recommend reducing the number of background workers that load the data. These background workers are spawned as separate processes and can consume quite a bit of memory.
You can set `num_workers` directly in the Lightly config. In the example below, we configure the run to use 2 workers. The default is to spawn a new worker for every available CPU core. On a machine with many cores but little memory, the default can cause excessive memory usage. Check out our Hardware Recommendations for more info about the recommended number of CPU cores and amount of memory.
```python
from lightly.api import ApiWorkflowClient

# Create a client with your token and configure it to use your dataset ID.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Configure and schedule a run.
scheduled_run_id = client.schedule_compute_worker_run(
    worker_config={},
    selection_config={
        "n_samples": 50,
        "strategies": [
            {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}}
        ],
    },
    lightly_config={
        'loader': {
            'num_workers': 2,  # use 2 workers (default: -1)
        },
    },
)
print(scheduled_run_id)
```
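To pick a sensible value yourself, you can compare the number of CPU cores with the available memory. The sketch below is only a heuristic based on the rule of thumb of roughly 2 GB of memory per worker; adapt it to your workload:
```bash
# Heuristic: one worker per CPU core, but at most one worker per ~2 GB of RAM.
CORES=$(nproc)
MEM_GB=$(free -g | awk '/^Mem:/ {print $2}')
echo "suggested num_workers: $(( MEM_GB / 2 < CORES ? MEM_GB / 2 : CORES ))"
```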
Lightly Worker Crashes When Running with GPUs
You run a Docker container with `--gpus all` and encounter the following error?
```
Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
```
This might be because your Docker installation does not support GPUs. Try installing `nvidia-docker` following this guide.
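After the installation, you can sanity-check that the NVIDIA runtime is registered with Docker. The commands below are a quick check; the exact output format may differ between Docker versions:
```bash
# The NVIDIA driver must work on the host first.
nvidia-smi
# "nvidia" should appear among the registered Docker runtimes.
docker info | grep -i runtimes
```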
Shared Memory Error When Running Lightly Worker
The following error message appears when the Docker runtime runs out of shared memory. By default, Docker uses 64 megabytes.
```
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/envs/env/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/envs/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_31_1030151126>
```
To solve this problem, we need to reduce the number of workers or increase the shared memory of the Docker runtime.
Use fewer workers:
Lightly determines the number of available CPU cores and sets the number of workers to the same value. If you have a machine with many cores but comparatively little memory (e.g., less than 2 GB of memory per core), you may run out of memory; in that case, reduce the number of workers instead of increasing the shared memory.
Increase the shared memory limit:
You can change the shared memory from 64 MB to 512 MB by adding `--shm-size="512m"` to the Docker run command:
```bash
# example of docker run with shared memory set to 512 MB
docker run --shm-size="512m" --gpus all

# you can also increase it to 2 GB using
docker run --shm-size="2G" --gpus all
```
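To double-check how much shared memory is actually available inside a container, you can inspect the `/dev/shm` mount. A quick sketch using a plain Ubuntu image:
```bash
# shows ~64M with the Docker default
docker run --rm ubuntu:20.04 df -h /dev/shm
# shows ~512M with the increased limit
docker run --rm --shm-size="512m" ubuntu:20.04 df -h /dev/shm
```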
CUDA Error: All CUDA-capable devices are busy or unavailable
```
CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect. For debugging consider
passing CUDA_LAUNCH_BLOCKING=1.
```
This error is most likely because some processes on your machine reserved resources on the GPU without properly releasing them. A simple reboot will often resolve the problem, as all GPU resources will be freshly allocated during the reboot. However, if a reboot does not help, we suggest switching to a different CUDA version on your system.
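Before rebooting, it can be worth checking which processes are still holding the GPU; terminating them may already free the device. A sketch, assuming `fuser` from the psmisc package is installed:
```bash
# Show processes currently using the GPU.
nvidia-smi
# List all processes that keep the NVIDIA device files open
# (requires the psmisc package for fuser).
sudo fuser -v /dev/nvidia*
```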
Lightly Worker Crashes Because of Too Many Open Files
The following error message appears when the Docker runtime lacks enough file handlers. By default, Docker uses `nofile=1024`. However, this might not be enough when using multiple workers for data fetching with `lightly.loader.num_workers`.
```
Error [Errno 24] Too many open files
```
To solve this problem, increase the number of file handlers for the Docker runtime.
You can change the number of file handlers to 90000 by adding `--ulimit nofile=90000:90000` to the Docker run command:
```bash
docker run --ulimit nofile=90000:90000 --gpus all
```
More documentation on docker file handlers is provided here.
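You can verify that the limit is applied inside the container by printing the soft limit for open files. A quick check with a plain Ubuntu image:
```bash
# prints 1024 with the Docker default ...
docker run --rm ubuntu:20.04 bash -c 'ulimit -n'
# ... and 90000 with the increased limit
docker run --rm --ulimit nofile=90000:90000 ubuntu:20.04 bash -c 'ulimit -n'
```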
Token Printed to Shared Stdout or Logs
The token (along with other Hydra configuration) will be printed to stdout and, therefore, appear in logs in an automated setup. This can be avoided by setting your token via the `LIGHTLY_TOKEN` environment variable:
```bash
docker run --shm-size="1024m" --gpus all --rm -it \
  -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
  -e LIGHTLY_WORKER_ID={MY_WORKER_ID} \
  lightly/worker:latest
```
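To additionally keep the token out of your shell history and scripts, you can read it from a file at run time. A sketch, where `~/.lightly_token` is a hypothetical file containing only the token and readable only by you:
```bash
# Read the token from a file instead of pasting it on the command line.
docker run --shm-size="1024m" --gpus all --rm -it \
  -e LIGHTLY_TOKEN="$(cat ~/.lightly_token)" \
  -e LIGHTLY_WORKER_ID={MY_WORKER_ID} \
  lightly/worker:latest
```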
Locked Jobs
Lightly tries to protect different Lightly Workers and their runs from each other and ensures that only one job can be active per dataset, while still allowing multiple Lightly Workers (with different worker_ids) to be running in parallel.
In certain situations a job can remain locked, e.g., when the Lightly Worker crashes or when it is forcefully shut down without a grace period to clean up while executing a job. From the viewpoint of Lightly, the job is still being processed and the Lightly Worker is still considered online.
This can lead to a warning message when starting the Lightly Worker with the same worker_id, such as `Found 4 LOCKED jobs of the worker with id`.
Resolving Locked Jobs
By default, `worker.force_start` is set to `True`, which bypasses our detection of a second Lightly Worker that might still be running and can result in the aforementioned warning. In this case, we will not cancel any locked jobs; the Lightly Worker starts normally and takes the next job.
Setting `worker.force_start` to `False` will not bypass this detection. Your new Lightly Worker will then not process jobs if a second Lightly Worker with the same worker_id is detected. If no other Lightly Worker is running, all locked/failed jobs previously handled by this worker_id are cleaned up, and the Lightly Worker starts normally and takes the next job.
In a setup with multiple Lightly Workers, it is recommended to set `worker.force_start` to `False`.
```bash
docker run --shm-size="1024m" --gpus all --rm -it \
  -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
  -e LIGHTLY_WORKER_ID={MY_WORKER_ID} \
  lightly/worker:latest \
  worker.force_start=False
```
Lightly Worker on WSL
We don't recommend running the Lightly Worker on the Windows Subsystem for Linux (WSL), as it may lead to connection issues. This is likely due to Windows acting as a proxy between the Lightly Worker and the Lightly Platform or cloud providers. Instead, we recommend using a Linux machine or a cloud-hosted instance such as AWS EC2, Google Compute Engine, or Azure Compute.
Why Is My Scheduled Run Not Picked Up by a Worker?
There can be several reasons for this:
- Make sure that you have a Lightly Worker running and polling for jobs. Look for it on the workers overview page on the Lightly Platform.
- If you specified labels for your worker, make sure that the `runs_on` labels you specified when scheduling a job are a subset of the worker labels. See the docs on how label matching works.
- If you use a Lightly Worker with version >= v2.5, then the Lightly Python Client used for scheduling runs must have version >= v1.3, and vice versa. Please see our compatibility table for a list of compatible Lightly Worker and Lightly Python Client versions.
- When a worker picks up a scheduled run, it is removed from the list of scheduled runs and moved to the worker runs list. Please check if your scheduled run was already picked up.
- The Lightly Worker polls for new jobs every 20 seconds, so it might take a moment until a scheduled run is picked up.