Known Issues & FAQ
Install Docker with GPU Support
If you install Docker with `apt-get install docker` or by following the official Docker installation guide, you might end up with a version that does not support GPU drivers. Instead, follow the Docker installation docs by NVIDIA.
Furthermore, ensure you can run Docker as a non-root user (recommended for security). Follow the instructions from the official Docker docs.
Here is a quick summary of the required shell commands:
- Set up the package repository
```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
```
- Update the repository
```bash
sudo apt-get update
```
- Install `nvidia-docker`
```bash
sudo apt-get install -y nvidia-docker2
```
- Restart the Docker service
```bash
sudo systemctl restart docker
```
- Test the installation
```bash
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
- Make sure you can run Docker as non-root (recommended for security)
```bash
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```
- Test that you can run Docker as non-root with GPU support
```bash
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```
Lightly Worker Is Slow When Working with Long Videos
We are working on this issue internally. For now, we suggest splitting large videos into chunks. You can do this with `ffmpeg` without losing quality. The following command splits the video so that no re-encoding is needed.
```bash
ffmpeg -i input.mp4 -c copy -map 0 -segment_time 01:00:00 -f segment -reset_timestamps 1 output%03d.mp4
```
What exactly happens here?
- `input.mp4` is your input video.
- `-c copy -map 0` makes sure we copy all streams and don't re-encode the video.
- `-segment_time 01:00:00 -f segment` defines that we want chunks of 1 hour each.
- `-reset_timestamps 1` ensures the timestamps are reset (each chunk starts from 0).
- `output%03d.mp4` is the naming pattern for the output videos (output000.mp4, output001.mp4, …).
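If you have many long videos, a small shell loop can apply the same command to a whole directory. This is a minimal sketch; the file pattern, output directory, and chunk length are assumptions to adapt to your data:
```bash
# Split every .mp4 in the current directory into 1-hour chunks without
# re-encoding; outputs land in chunks/ as <name>_000.mp4, <name>_001.mp4, ...
mkdir -p chunks
for f in *.mp4; do
  ffmpeg -i "$f" -c copy -map 0 -segment_time 01:00:00 -f segment \
    -reset_timestamps 1 "chunks/${f%.mp4}_%03d.mp4"
done
```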
Lightly Worker Uses Too Much Memory and Crashes
When running the Lightly Worker and observing the logs, you might encounter messages reporting a high percentage of memory consumption.
```
[2023-02-14 08:45:40,634] Memory consumption is at 98.6%.
```
To reduce memory consumption, we recommend reducing the number of background workers that load the data. These background workers are spawned as separate processes and can consume quite a bit of memory.
You can set `num_workers` directly in the Lightly config. In the example below, we configure the run to use 2 workers. The default is to spawn a new worker for every available CPU core. On a machine with many cores but little memory, the default can cause excessive memory usage. Check out our Hardware Recommendations for more info about the recommended number of CPU cores and amount of memory.
```python
from lightly.api import ApiWorkflowClient

# Create a client with your token and configure it to use your dataset ID.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Configure and schedule a run.
scheduled_run_id = client.schedule_compute_worker_run(
    worker_config={},
    selection_config={
        "n_samples": 50,
        "strategies": [
            {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}}
        ],
    },
    lightly_config={
        'loader': {
            'num_workers': 2,  # use 2 workers (default: -1)
        },
    },
)
print(scheduled_run_id)
```
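To pick a sensible value yourself, you can compare the number of CPU cores with the available memory. The sketch below is only a heuristic based on the rule of thumb of roughly 2 GB of memory per worker; adapt it to your workload:
```bash
# Heuristic: one worker per CPU core, but at most one worker per ~2 GB of RAM.
CORES=$(nproc)
MEM_GB=$(free -g | awk '/^Mem:/ {print $2}')
echo "suggested num_workers: $(( MEM_GB / 2 < CORES ? MEM_GB / 2 : CORES ))"
```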
Lightly Worker Crashes When Running with GPUs
You run a Docker container with `--gpus all` and encounter the following error?
```
Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
```
This might be because your Docker installation does not support GPUs. Try installing `nvidia-docker` following this guide.
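After the installation, you can sanity-check that the NVIDIA runtime is registered with Docker. The commands below are a quick check; the exact output format may differ between Docker versions:
```bash
# The NVIDIA driver must work on the host first.
nvidia-smi
# "nvidia" should appear among the registered Docker runtimes.
docker info | grep -i runtimes
```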
Shared Memory Error When Running Lightly Worker
The following error message appears when the Docker runtime runs out of shared memory. By default, Docker uses 64 megabytes.
```
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/envs/env/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/envs/env/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/opt/conda/envs/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_31_1030151126>
```
To solve this problem, we need to reduce the number of workers or increase the shared memory of the Docker runtime.
Use fewer workers:
Lightly determines the number of available CPU cores and sets the number of workers to the same value. If you have a machine with many cores but comparatively little memory (e.g., less than 2 GB of memory per core), you may run out of memory; in that case, reduce the number of workers instead of increasing the shared memory.
Increase the shared memory limit:
You can change the shared memory from 64 MB to 512 MB by adding `--shm-size="512m"` to the Docker run command:
```bash
# example of docker run with shared memory set to 512 MB
docker run --shm-size="512m" --gpus all

# you can also increase it to 2 GB using
docker run --shm-size="2G" --gpus all
```
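To double-check how much shared memory is actually available inside a container, you can inspect the `/dev/shm` mount. A quick sketch using a plain Ubuntu image:
```bash
# shows ~64M with the Docker default
docker run --rm ubuntu:20.04 df -h /dev/shm
# shows ~512M with the increased limit
docker run --rm --shm-size="512m" ubuntu:20.04 df -h /dev/shm
```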
CUDA Error: All CUDA-capable devices are busy or unavailable
```
CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect. For debugging consider
passing CUDA_LAUNCH_BLOCKING=1.
```
This error is most likely because some processes on your machine reserved resources on the GPU without properly releasing them. A simple reboot will often resolve the problem, as all GPU resources will be freshly allocated during the reboot. However, if a reboot does not help, we suggest switching to a different CUDA version on your system.
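Before rebooting, it can be worth checking which processes are still holding the GPU; terminating them may already free the device. A sketch, assuming `fuser` from the psmisc package is installed:
```bash
# Show processes currently using the GPU.
nvidia-smi
# List all processes that keep the NVIDIA device files open
# (requires the psmisc package for fuser).
sudo fuser -v /dev/nvidia*
```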
Lightly Worker Crashes Because of Too Many Open Files
The following error message appears when the Docker runtime lacks enough file handlers. By default, Docker uses `nofile=1024`. However, this might not be enough when using multiple workers for data fetching with `lightly.loader.num_workers`.
```
Error [Errno 24] Too many open files
```
To solve this problem, increase the number of file handlers for the Docker runtime.
You can change the number of file handlers to 90000 by adding `--ulimit nofile=90000:90000` to the Docker run command:
```bash
docker run --ulimit nofile=90000:90000 --gpus all
```
More documentation on docker file handlers is provided here.
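You can verify that the limit is applied inside the container by printing the soft limit for open files. A quick check with a plain Ubuntu image:
```bash
# prints 1024 with the Docker default ...
docker run --rm ubuntu:20.04 bash -c 'ulimit -n'
# ... and 90000 with the increased limit
docker run --rm --ulimit nofile=90000:90000 ubuntu:20.04 bash -c 'ulimit -n'
```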
Token Printed to Shared Stdout or Logs
The token (along with other Hydra configuration) will be printed to stdout and, therefore, appear in logs in an automated setup. This can be avoided by setting your token via the `LIGHTLY_TOKEN` environment variable:
```bash
docker run --shm-size="1024m" --gpus all --rm -it \
  -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
  -e LIGHTLY_WORKER_ID={MY_WORKER_ID} \
  lightly/worker:latest
```
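To additionally keep the token out of your shell history and scripts, you can read it from a file at run time. A sketch, where `~/.lightly_token` is a hypothetical file containing only the token and readable only by you:
```bash
# Read the token from a file instead of pasting it on the command line.
docker run --shm-size="1024m" --gpus all --rm -it \
  -e LIGHTLY_TOKEN="$(cat ~/.lightly_token)" \
  -e LIGHTLY_WORKER_ID={MY_WORKER_ID} \
  lightly/worker:latest
```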
Locked Jobs
Lightly tries to protect different Lightly Workers and their runs from each other and ensures that only one job can be active per dataset, while still allowing multiple Lightly Workers (with different worker_ids) to be running in parallel.
In certain situations a job can remain locked, e.g., when the Lightly Worker crashes or when it is forcefully shut down without a grace period to clean up while executing a job. From the viewpoint of Lightly, the job is still being processed and the Lightly Worker is still considered online.
This can lead to a warning message when starting the Lightly Worker with the same worker_id, such as `Found 4 LOCKED jobs of the worker with id`.
Resolving Locked Jobs
By default, `worker.force_start` is set to `True`, which bypasses our detection of a second Lightly Worker that might still be running and can result in the aforementioned warning. In this case, we will not cancel any locked jobs; the Lightly Worker starts normally and takes the next job.
Setting `worker.force_start` to `False` will not bypass this detection. Your new Lightly Worker will then not process jobs if a second Lightly Worker with the same worker_id is detected. If no other Lightly Worker is running, all locked/failed jobs previously handled by this worker_id are cleaned up, and the Lightly Worker starts normally and takes the next job.
In a setup with multiple Lightly Workers, it is recommended to set `worker.force_start` to `False`.
```bash
docker run --shm-size="1024m" --gpus all --rm -it \
  -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
  -e LIGHTLY_WORKER_ID={MY_WORKER_ID} \
  lightly/worker:latest \
  worker.force_start=False
```
Lightly Worker on WSL
We don't recommend running the Lightly Worker on the Windows Subsystem for Linux (WSL), as it may lead to connection issues. This is likely due to Windows acting as a proxy between the Lightly Worker and the Lightly Platform or cloud providers. Instead, we recommend using a Linux machine or a cloud-hosted instance such as AWS EC2, Google Compute Engine, or Azure Compute.
Why Is My Scheduled Run Not Picked Up by a Worker?
There can be several reasons for this:
- Make sure that you have a Lightly Worker running and polling for jobs. Look for it on the workers overview page on the Lightly Platform.
- If you specified labels for your worker, make sure that the `runs_on` labels you specified when scheduling a job are a subset of the worker labels. See the docs on how label matching works.
- If you use a Lightly Worker with version >= v2.5, then the Lightly Python Client used for scheduling runs must have version >= v1.3, and vice versa. Please see our compatibility table for a list of compatible Lightly Worker and Lightly Python Client versions.
- When a worker picks up a scheduled run, it is removed from the list of scheduled runs and moved to the worker runs list. Please check if your scheduled run was already picked up.
- The Lightly Worker polls for new jobs every 20 seconds, so it might take a moment until a scheduled run is picked up.