Known Issues & FAQ
Install Docker with GPU Support
See our in-depth installation guide.
LightlyOne Worker Crashes with Error response from daemon
Error response from daemon: could not select device driver
You run a Docker container with --gpus all
and encounter the following error?
Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
This might be because your Docker installation does not support GPUs. Try to install nvidia-docker
following this guide.
Error response from daemon: failed to create task for container
You run a Docker container with --gpus all
and encounter the following error?
docker: Error response from daemon: failed to create task for container: failed to create shim task:
OCI runtime create failed: runc create failed: unable to start container process:
error during container init: error running hook #0: error running hook:
exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown.
ERRO[0000] error waiting for container: context canceled
This is probably because no NVIDIA GPUs are available. Run the command nvidia-smi
to list all available GPUs.
- If it fails with, e.g., "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.", first install an NVIDIA GPU driver.
- If it shows available GPUs, try to install nvidia-docker following this guide.
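The two cases above can also be told apart from a script. The following is a minimal sketch, not part of LightlyOne: the helper names interpret_nvidia_smi and check_gpus are hypothetical, and the only external command used is nvidia-smi itself.

```python
import shutil
import subprocess


def interpret_nvidia_smi(returncode, output):
    """Map an nvidia-smi result to the next troubleshooting step."""
    if returncode is None:
        return "nvidia-smi not found: install an NVIDIA GPU driver first."
    if returncode != 0 or "couldn't communicate with the NVIDIA driver" in output:
        return "Driver problem: install or repair the NVIDIA GPU driver."
    return "GPUs are visible on the host: check the nvidia-docker installation."


def check_gpus():
    """Run nvidia-smi on the host and return a hint (hypothetical helper)."""
    if shutil.which("nvidia-smi") is None:
        return interpret_nvidia_smi(None, "")
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return interpret_nvidia_smi(result.returncode, result.stdout + result.stderr)
```

Calling print(check_gpus()) on the host (not inside the container) prints which of the two fixes to try first.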
LightlyOne Worker Is Slow When Working with Long Videos
We are working on this issue internally. For now, we suggest splitting large videos into chunks. You can do this with ffmpeg
without losing quality. The following command splits the video without re-encoding it.
ffmpeg -i input.mp4 -c copy -map 0 -segment_time 01:00:00 -f segment -reset_timestamps 1 output%03d.mp4
What exactly happens here?
- -i input.mp4 is your input video
- -c copy -map 0 makes sure we copy and don’t re-encode the video
- -segment_time 01:00:00 -f segment defines that we want chunks of 1h each
- -reset_timestamps 1 ensures the timestamps are reset (each video starts from 0)
- output%03d.mp4 is the name pattern of the output videos (output000.mp4, output001.mp4, …)
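If you need to split many videos, the same command can be assembled from Python. This is a sketch under the assumption that ffmpeg is installed and on the PATH; the function names build_split_command and split_video are illustrative and not part of any Lightly package.

```python
import subprocess


def build_split_command(input_path, segment_time="01:00:00", pattern="output%03d.mp4"):
    """Return the ffmpeg argument list that splits a video without re-encoding."""
    return [
        "ffmpeg",
        "-i", input_path,
        "-c", "copy",  # copy the streams instead of re-encoding them
        "-map", "0",  # keep all streams of the input
        "-segment_time", segment_time,
        "-f", "segment",
        "-reset_timestamps", "1",  # each chunk starts at timestamp 0
        pattern,
    ]


def split_video(input_path, **kwargs):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_split_command(input_path, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.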
LightlyOne Worker Uses Too Much Memory and Crashes
When running the LightlyOne Worker and observing the logs, you might encounter log messages that report very high memory consumption.
[2023-02-14 08:45:40,634] Memory consumption is at 98.6%.
To reduce memory consumption, we recommend reducing the number of processes. The LightlyOne Worker spawns separate processes, and they can consume quite a bit of memory.
You can set num_processes directly in the worker config. In the example below, we set the config to use 2 processes. The default option is to spawn a new process for every available CPU core. If you have a machine with many cores but little memory, the default option might cause memory usage that is too high. Check out our Hardware Recommendations for more info about the recommended number of CPU cores and memory.
from lightly.api import ApiWorkflowClient

# Create a client with your token and configure it to use your dataset ID.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Configure and schedule a run.
scheduled_run_id = client.schedule_compute_worker_run(
    worker_config={
        # Number of data loading processes. If -1, then one process per CPU core
        # is created. Set to 0 to load data in the main process. Set to low number
        # to reduce memory usage at cost of slower processing.
        "num_processes": 2,  # manually cap to max 2 processes
        # Number of data loading threads. If -1, then two threads per CPU core
        # are created. Is always at least one.
        "num_threads": -1,
    },
    selection_config={
        "n_samples": 50,
        "strategies": [
            {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}}
        ],
    },
    lightly_config={},
)
print(scheduled_run_id)
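To find out whether a past run came close to the memory limit, you can scan its log for the memory messages shown above. The sketch below assumes the log lines follow the format of that single example line; the function name peak_memory_percent is hypothetical.

```python
import re

# Matches worker log lines such as:
# [2023-02-14 08:45:40,634] Memory consumption is at 98.6%.
MEMORY_LINE = re.compile(r"Memory consumption is at (\d+(?:\.\d+)?)%")


def peak_memory_percent(log_lines):
    """Return the highest reported memory percentage, or None if none is found."""
    values = []
    for line in log_lines:
        match = MEMORY_LINE.search(line)
        if match:
            values.append(float(match.group(1)))
    return max(values) if values else None
```

A peak close to 100% suggests lowering num_processes before the next run.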
Shared Memory Error When Running LightlyOne Worker
The following error message appears when the Docker runtime runs out of shared memory. By default, Docker uses 64 megabytes.
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/envs/env/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/envs/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_31_1030151126>
To solve this problem, we need to reduce the number of workers or increase the shared memory of the Docker runtime.
Use fewer workers:
LightlyOne determines the number of CPU cores available and sets the number of workers to the same number. If you have a machine with many cores but little memory (e.g., less than 2 GB of memory per core), you may run out of memory. In that case, reduce the number of workers rather than increasing the shared memory.
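As a rough starting point, the 2 GB per core figure above can be turned into a cap for num_processes. This is only a rule-of-thumb sketch, not an official LightlyOne formula; the helper names and the gb_per_process parameter are assumptions made for illustration.

```python
import os


def suggest_num_processes(cpu_cores, total_memory_gb, gb_per_process=2.0):
    """Cap the process count so each process has roughly gb_per_process of memory."""
    by_memory = int(total_memory_gb // gb_per_process)
    return max(1, min(cpu_cores, by_memory))


def total_memory_gb():
    """Physical memory in GB (Linux only, read via os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
```

For example, suggest_num_processes(os.cpu_count(), total_memory_gb()) yields a value you can pass as "num_processes" in the worker config.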
Increase the shared memory limit:
You can change the shared memory from 64 megabytes to 512 megabytes by adding --shm-size="512m"
to the Docker run command:
# example of docker run with setting shared memory to 512 MBytes
docker run --shm-size="512m" --gpus all
# you can also increase it to 2 Gigabytes using
docker run --shm-size="2G" --gpus all
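To confirm that the new limit is in effect, you can check the size of /dev/shm from inside the running container. A minimal sketch using only the Python standard library (the function name shared_memory_mb is illustrative):

```python
import shutil


def shared_memory_mb(path="/dev/shm"):
    """Return (total, free) size in MB of the filesystem mounted at path."""
    usage = shutil.disk_usage(path)
    return usage.total / 1024**2, usage.free / 1024**2
```

Inside a container, the total reported for /dev/shm should roughly correspond to the value passed via --shm-size.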
CUDA Error: All CUDA-capable devices are busy or unavailable
CUDA error: all CUDA-capable devices are busy or unavailable CUDA kernel
errors might be asynchronously reported at some other API call,so the
stacktrace below might be incorrect. For debugging consider
passing CUDA_LAUNCH_BLOCKING=1.
This error is likely caused by some other processes on your machine reserving resources on the GPU without properly releasing them. A simple reboot will often resolve the problem, as all GPU resources will be freshly allocated during the reboot. However, if a reboot does not help, we suggest switching to a different CUDA version on your system.
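Before rebooting, it can help to see which processes currently hold GPU memory. The sketch below queries nvidia-smi for its list of compute processes and parses the CSV output; the function names are hypothetical, and the parser assumes process names do not contain commas.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, process_name, used_memory) tuples."""
    rows = []
    for line in csv_text.splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) == 3:
            rows.append((int(parts[0]), parts[1], parts[2]))
    return rows


def list_gpu_processes():
    """Return the processes that currently hold GPU memory."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

If list_gpu_processes() shows stale processes, stopping them may free the GPU without a reboot.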
CUDA Unknown Error
CUDA unknown error - this may be due to an incorrectly set up environment,
e.g. changing env variable CUDA_VISIBLE_DEVICES after program start.
Setting the available devices to be zero.
This error is raised if your system cannot communicate with the GPU, which might be caused, e.g., by a driver update without a reboot or by another setup issue. Try a reboot first. For more information on the error and potential solutions, see this and this PyTorch issue.
cuDNN Errors
If you encounter cuDNN-related errors, such as CUDNN_STATUS_INTERNAL_ERROR, please verify that your CUDA driver is up-to-date. The LightlyOne Worker requires at least CUDA driver version 455. We recommend using the latest driver version if possible. New drivers can be downloaded from here.
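The installed driver version can be compared against the minimum of 455 stated above. This is a sketch that reads the version through nvidia-smi; the helper names are made up for illustration, and only the major version number is compared.

```python
import subprocess

MIN_DRIVER_MAJOR = 455  # minimum driver version stated in the text above


def driver_is_supported(version, minimum_major=MIN_DRIVER_MAJOR):
    """Compare the major part of a driver version such as '470.82.01'."""
    return int(version.strip().split(".")[0]) >= minimum_major


def installed_driver_version():
    """Read the driver version of the first GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a machine with a working driver, driver_is_supported(installed_driver_version()) tells you whether an update is needed.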
LightlyOne Worker Crashes Because of Too Many Open Files
The following error message appears when the Docker runtime lacks enough file handlers. By default, Docker uses nofile=1024. However, this might not be enough when using multiple workers for data fetching with lightly.loader.num_workers.
Error [Errno 24] Too many open files>
To solve this problem, increase the number of file handlers for the Docker runtime.
You can change the number of file handlers to 90000 by adding --ulimit nofile=90000:90000
to the Docker run command:
docker run --ulimit nofile=90000:90000 --gpus all
More documentation on docker file handlers is provided here.
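To verify which limit is actually in effect, you can read it from inside the container with Python's standard resource module (available on Linux and other Unix systems). The function name nofile_is_sufficient and the 90000 threshold taken from the example above are illustrative.

```python
import resource


def nofile_is_sufficient(soft_limit, required=90000):
    """Return True if the soft open-file limit is at least `required`."""
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required


# Read the limits of the current process (soft limit is the one enforced).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open files: soft={soft}, hard={hard}, ok={nofile_is_sufficient(soft)}")
```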
LightlyOne Worker on WSL
We don't recommend running the LightlyOne Worker on Windows Subsystem for Linux (WSL), as connection issues may occur. This is likely due to Windows acting as a proxy between the LightlyOne Worker and the LightlyOne Platform or cloud providers. Instead, we recommend using a Linux machine or a cloud-hosted instance such as AWS EC2, Google Compute Engine, or Azure Compute.
Permission
Local Storage Permission Errors
The worker needs access to the mounted directories of your local storage. The operating system manages this access. In the example docker run command below, the mounted directories are MOUNTED_INPUT_DIR and MOUNTED_LIGHTLY_DIR.
docker run ... \
-v /MOUNTED_INPUT_DIR:/input_mount:ro \
-v /MOUNTED_LIGHTLY_DIR:/lightly_mount \
...
No List Permission for Files in '/input_mount/ '!
This error means the LightlyOne Worker cannot access the mounted input directory. It requires not only read but also execute permissions, as reading files in a directory requires execute access. Because the user that accesses the directory can depend on how Docker was installed, we recommend granting the permissions to all users. Furthermore, the same permissions are required for all subdirectories and files in the mounted directory.
Fix it with chmod -R a+rX /MOUNTED_INPUT_DIR and change the path according to what you set in your docker run command. Use sudo if needed. Create the directory first if needed. If you cannot change the permissions of the directory, see our docs on how to run the LightlyOne Worker with a custom user and group.
As we constantly improve the LightlyOne Worker by improving error messages and fixing bugs, please also try updating the version using the command docker pull lightly/worker:latest.
No Write or Overwrite Permission for '/lightly_mount/'!
Similar to the error No list permission for files in '/input_mount/ '!, the permissions are not sufficiently set. Read, write, and execute permissions to the MOUNTED_LIGHTLY_DIR are needed.
Fix it with chmod -R a+rwx /MOUNTED_LIGHTLY_DIR and change the path according to what you set in your docker run command. Use sudo if needed. Create the directory first if needed. If you cannot change the permissions of the directory, see our docs on how to run the LightlyOne Worker with a custom user and group.
Additionally, make sure that you only use /lightly_mount and not /lightly_mount:ro in your docker run command.
As we constantly improve the LightlyOne Worker by improving error messages and fixing bugs, please also try updating the version using the command docker pull lightly/worker:latest.
[Errno 13] Permission Denied
This error most likely means that the LightlyOne Worker has permission to access the mounted input and lightly directories but cannot access a file or subdirectory within the mounted directories. In that case, verify that permissions are set recursively using the -R flag:
chmod -R a+rX /MOUNTED_INPUT_DIR
chmod -R a+rwx /MOUNTED_LIGHTLY_DIR
If you cannot change the permissions of the directories, see our docs on how to run the LightlyOne Worker with a custom user and group.
As we constantly improve the LightlyOne Worker by improving error messages and fixing bugs, please also try updating the version using the command docker pull lightly/worker:latest.
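To locate the files or subdirectories that still lack the permissions described in this section, you can scan the mounted directory on the host. The sketch below only inspects the permission bits for "other" users, which corresponds to the a+rX and a+rwx recommendations above; the function name find_permission_problems is hypothetical, and the script should be run by a user who can read the whole tree (use sudo if needed).

```python
import os
import stat


def find_permission_problems(root, need_write=False):
    """List paths under root that 'other' users cannot read (or write)."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, n) for n in dirnames + filenames)

    problems = []
    for path in candidates:
        try:
            mode = os.stat(path).st_mode
        except OSError:
            problems.append(path)  # cannot even inspect this path
            continue
        required = stat.S_IROTH
        if stat.S_ISDIR(mode):
            required |= stat.S_IXOTH  # directories need execute to be listed
        if need_write:
            required |= stat.S_IWOTH
        if mode & required != required:
            problems.append(path)
    return problems
```

Use find_permission_problems("/MOUNTED_INPUT_DIR") for the input mount and find_permission_problems("/MOUNTED_LIGHTLY_DIR", need_write=True) for the lightly mount; an empty list means the permission bits match the recommendations.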
More Restrictive Permissions
If you want to set more restrictive permissions for security reasons, we recommend:
- Mounting more specific directories, e.g., /MOUNTED_INPUT_DIR/project_xyz/data_for_lightly/input.
  - Note that by doing so, you will need to adjust lightly-serve every time you wish to view a dataset in the LightlyOne Platform.
- Running the LightlyOne Worker with a custom user and group.
Running the Worker Behind a Proxy
If you are running the worker behind a corporate proxy such as Zscaler or other security solutions that terminate the SSL connection, you can set the environment variables HTTPS_PROXY and LIGHTLY_CA_CERTS when starting the worker.
docker run ... \
-v "/home/user/ssl/cert/mycert.crt":/etc/ssl/certs/mycert.crt \
-e LIGHTLY_CA_CERTS="/etc/ssl/certs/mycert.crt" \
-e HTTPS_PROXY="https://user:password@proxyIP:proxyPort" \
...
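A malformed HTTPS_PROXY value is an easy mistake to make. The following sketch checks that a value has the shape used in the example above (scheme, host, and port). It is a plain sanity check using Python's urllib, not something the worker itself runs; the function name check_proxy_url is hypothetical, and treating a missing port as a problem is an assumption based on that example.

```python
from urllib.parse import urlsplit


def check_proxy_url(url):
    """Return a list of problems found in an HTTPS_PROXY value (empty if none)."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append("scheme must be http or https")
    if not parts.hostname:
        problems.append("missing proxy host")
    try:
        if parts.port is None:
            problems.append("missing proxy port")
    except ValueError:
        problems.append("port is not a valid number")
    return problems
```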
Scheduled Runs
Locked Jobs
LightlyOne tries to protect different LightlyOne Workers and their runs from each other and ensures that only one job can be active per dataset, while still allowing multiple LightlyOne Workers (with different worker_ids) to be running in parallel.
In certain situations, a job can remain locked, e.g., when the LightlyOne Worker crashes or when it is forcefully shut down without a grace period to clean up while executing a job. From the viewpoint of LightlyOne, the job is still being processed, and the LightlyOne Worker is still considered online.
This can lead to a warning message when starting the LightlyOne Worker with the same worker_id, such as "Found 4 LOCKED jobs of the worker with id".
Resolving Locked Jobs
By default, worker.force_start is set to True, which bypasses our detection of a possible second LightlyOne Worker that is still running and can result in the aforementioned warning. In this case, we will not cancel any locked jobs. The LightlyOne Worker starts normally and takes the next job.
Setting worker.force_start to False will not bypass our detection of a possible second LightlyOne Worker. This will cause your new LightlyOne Worker not to process jobs if a second LightlyOne Worker with the same worker_id is detected. If no other LightlyOne Worker is running, all the locked/failed jobs previously handled by the worker_id are cleaned up, and the LightlyOne Worker will start normally and take the next job.
In a multi-LightlyOne Worker setup, it is recommended to set worker.force_start to False.
docker run --shm-size="1024m" --gpus all --rm -it \
-e LIGHTLY_TOKEN="MY_LIGHTLY_TOKEN" \
-e LIGHTLY_WORKER_ID="MY_WORKER_ID" \
lightly/worker:latest \
worker.force_start=False
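The behaviour described in this section can be summarised as a small decision function. This is only an illustration of the rules stated above, not the worker's actual implementation; the function name and the returned strings are made up.

```python
def startup_behaviour(force_start, other_worker_detected):
    """Summarise what a starting worker does, per the description above."""
    if force_start:
        # Detection is bypassed; locked jobs stay as they are.
        return "start normally; locked jobs are not cancelled"
    if other_worker_detected:
        return "do not process jobs"
    return "clean up locked/failed jobs of this worker_id, then start normally"
```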
Why Is My Scheduled Run Not Picked Up by a Worker?
There can be several reasons for this:
- Make sure that you have a LightlyOne Worker running and polling for jobs. Look for it on the workers overview page on the LightlyOne Platform.
- If you specified labels for your worker, make sure that the runs_on labels you specified when scheduling a job are a subset of the Worker labels. See the docs on how label matching works.
- Verify that the job was scheduled with a compatible Lightly Python Client. Please see our compatibility table for a list of compatible LightlyOne Worker and Lightly Python Client versions.
- When a worker picks up a scheduled run, it is removed from the list of scheduled runs and moved to the worker runs list. Please check if your scheduled run was already picked up.
- The LightlyOne Worker polls for new jobs every 20 seconds, so it might take a moment until a scheduled run is picked up.
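The subset rule for labels mentioned in the list above can be checked with a set comparison. This is a sketch of the rule as stated here, not of the platform's matching code; the function name worker_can_run is hypothetical.

```python
def worker_can_run(runs_on, worker_labels):
    """True if every runs_on label of the job is also a label of the worker."""
    return set(runs_on) <= set(worker_labels)
```

For example, a job scheduled with runs_on labels ["gpu", "us"] does not match a worker labelled ["gpu", "eu"], because "us" is missing from the worker labels.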
Miscellaneous
Token Printed to Shared Stdout or Logs
The token (along with other Hydra configurations) will be printed to stdout and, therefore, appear in logs in an automated setup. This can be avoided by setting your token via the LIGHTLY_TOKEN
environment variable:
docker run --shm-size="1024m" --gpus all --rm -it \
-e LIGHTLY_TOKEN="MY_LIGHTLY_TOKEN" \
lightly/worker:latest
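The same idea applies to your own Python scripts: read the token from the environment instead of hard-coding it, and shorten it before writing it to a log. A minimal sketch; the helper names read_token and mask are illustrative and not part of the lightly package.

```python
import os


def read_token(env=os.environ):
    """Return the token from LIGHTLY_TOKEN or raise a clear error."""
    token = env.get("LIGHTLY_TOKEN", "").strip()
    if not token:
        raise RuntimeError("LIGHTLY_TOKEN is not set")
    return token


def mask(token, visible=4):
    """Shorten a token for log output, keeping only the last few characters."""
    return "*" * max(0, len(token) - visible) + token[-visible:]
```

The result of read_token() can then be passed as the token argument when creating the ApiWorkflowClient.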
Create a Frame-Level Tag Based on a Child Tag
Sometimes, you find interesting crops in the crop-level dataset and create a new tag to save them. However, working with the frame-level dataset based on the newly created crop-level tag is not directly possible. To overcome this limitation, we can use the API client to create a tag in the frame-level dataset based on a tag in the crop-level dataset.
from lightly.api import ApiWorkflowClient

TOKEN = "YOUR_TOKEN"
CROP_DATASET_NAME = "CROP_DATASET_NAME"
CROP_DATASET_SOURCE_TAG_NAME = "CROP_DATASET_SOURCE_TAG_NAME"
PARENT_DATASET_NAME = "PARENT_DATASET_NAME"
PARENT_DATASET_NEW_TAG_NAME = "PARENT_DATASET_NEW_TAG_NAME"


def find_matching_files(crops, frames):
    """
    Given two lists of filenames, 'crops' and 'frames', this function identifies and returns
    the filenames from 'frames' that match the base names in 'crops'. The 'crops' filenames
    have a unique crop ID encoding at the end, which is not present in the 'frames' filenames.

    Parameters:
    - crops (list of str): List of filenames with crop ID encoding.
    - frames (list of str): List of filenames without crop ID encoding.

    Returns:
    - list of str: List of matching filenames from 'frames'.
    """
    # Extract base names from 'crops' by removing the last 3 sections separated by '-'
    crop_base_names = [crop.rsplit('-', 3)[0] for crop in crops]
    # Extract base names from 'frames' by removing the file extension section
    frame_base_names = [frame.rsplit('-', 1)[0] for frame in frames]
    matching_files = [
        original_frame
        for original_frame, base_name_frame in zip(frames, frame_base_names)
        if base_name_frame in crop_base_names
    ]
    return matching_files


client = ApiWorkflowClient(token=TOKEN)

# get filenames from crops tag
client.set_dataset_id_by_name(CROP_DATASET_NAME)
filenames_crops = client.export_filenames_by_tag_name(
    CROP_DATASET_SOURCE_TAG_NAME
).split('\n')

# get filenames of frame dataset
# Already created some LightlyOne Worker runs with this dataset
client.set_dataset_id_by_name(PARENT_DATASET_NAME)
filenames_frames = client.export_filenames_by_tag_name(
    "initial-tag"  # initial-tag always consists of all the files
).split('\n')

# find matching filenames between the two datasets
filenames_new_tag = find_matching_files(filenames_crops, filenames_frames)

# create a new tag in the frame level dataset
client.create_tag_from_filenames(
    fnames_new_tag=filenames_new_tag, new_tag_name=PARENT_DATASET_NEW_TAG_NAME
)
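To see how the matching works without calling the API, the function can be tried on a few filenames. The names below are made up for illustration and only follow the pattern the function expects (three extra '-'-separated sections on crops, one on frames); the filenames in your datasets may look different, so check a few real ones before relying on the result. This variant uses a set for faster lookups but otherwise matches the same way.

```python
def find_matching_files(crops, frames):
    """Return the frames whose base name also appears among the crops."""
    # Drop the last 3 '-'-separated sections of each crop filename.
    crop_base_names = {crop.rsplit("-", 3)[0] for crop in crops}
    # Drop the last '-'-separated section of each frame filename.
    return [frame for frame in frames if frame.rsplit("-", 1)[0] in crop_base_names]


# Hypothetical filenames, for illustration only.
crops = [
    "drive-0042-mp4-crop-0.png",
    "drive-0042-mp4-crop-1.png",
    "drive-0100-mp4-crop-0.png",
]
frames = ["drive-0042-mp4.png", "drive-0077-mp4.png", "drive-0100-mp4.png"]

print(find_matching_files(crops, frames))
# ['drive-0042-mp4.png', 'drive-0100-mp4.png']
```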