Install Docker with GPU Support

Lightly Worker Crashes When Running with GPUs

Are you running a Docker container with --gpus all and encountering the following error?

Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

This might be because your Docker installation does not support GPUs. Try installing nvidia-docker by following this guide.

Lightly Worker Is Slow When Working with Long Videos

We are working on this issue internally. For now, we suggest splitting the large videos into chunks. You can do this using ffmpeg without losing quality. The following code breaks up the video so that no re-encoding is needed.

ffmpeg -i input.mp4 -c copy -map 0 -segment_time 01:00:00 -f segment -reset_timestamps 1 output%03d.mp4

What exactly happens here?

-i input.mp4 is your input video
-c copy -map 0 makes sure we copy all streams and don't re-encode the video
-segment_time 01:00:00 -f segment defines that we want chunks of 1h each
-reset_timestamps 1 ensures the timestamps are reset (each video starts from 0)
output%03d.mp4 is the name pattern of the output videos (output000.mp4, output001.mp4, …)

Lightly Worker Uses Too Much Memory and Crashes

When running the Lightly Worker and observing the logs, you might encounter log messages reporting very high memory consumption.

[2023-02-14 08:45:40,634] Memory consumption is at 98.6%.

To reduce memory consumption, we recommend reducing the number of processes. The Lightly Worker spawns separate processes, and they can consume quite a bit of memory.

You can set num_processes directly in the worker config. In the example below, we configure the worker to use 2 processes. By default, one process is spawned for every available CPU core. If your machine has many cores but little memory, the default might cause memory usage that is too high. Check out our Hardware Recommendations for more info about the recommended number of CPU cores and memory.

from lightly.api import ApiWorkflowClient

# Create a client with your token and configure it to use your dataset ID.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Configure and schedule a run.
scheduled_run_id = client.schedule_compute_worker_run(
    worker_config={
        # Number of data loading processes. If -1, then one process per CPU core
        # is created. Set to 0 to load data in the main process. Set to low number
        # to reduce memory usage at cost of slower processing.
        "num_processes": 2, # manually cap to max 2 processes
      
        # Number of data loading threads. If -1, then two threads per CPU core
        # are created. Is always at least one.
        "num_threads": -1,
    },
    selection_config={
        "n_samples": 50,
        "strategies": [
            {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}}
        ],
    },
    lightly_config={
    },
)
print(scheduled_run_id)
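
One way to pick a value for num_processes is to bound it by the memory of the machine as well as by its core count. The following is a rough sketch, not a Lightly API: the helper suggest_num_processes is hypothetical, and the budget of 2 GB per process is an assumption that you should adjust to your own data and the Hardware Recommendations.

```python
import os


def suggest_num_processes(total_memory_gb, cpu_cores=None, gb_per_process=2.0):
    """Suggest a num_processes value bounded by both CPU cores and memory.

    gb_per_process is an assumed budget, not a measured Lightly requirement.
    """
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1
    by_memory = int(total_memory_gb // gb_per_process)
    # Never suggest fewer than one process or more than the available cores.
    return max(1, min(cpu_cores, by_memory))


# A machine with 16 cores but only 8 GB of memory is limited by memory.
print(suggest_num_processes(8, cpu_cores=16))  # 4
```

The returned value can then be used as the "num_processes" entry of the worker config.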

Shared Memory Error When Running Lightly Worker

The following error message appears when the Docker runtime runs out of shared memory. By default, Docker uses 64 megabytes.

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/envs/env/lib/python3.7/multiprocessing/queues.py", line 236, in _feed
    obj = _ForkingPickler.dumps(obj)
File "/opt/conda/envs/env/lib/python3.7/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
File "/opt/conda/envs/env/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
    fd, size = storage._share_fd_()
RuntimeError: unable to write to file </torch_31_1030151126>

To solve this problem, we need to reduce the number of workers or increase the shared memory of the Docker runtime.

Use fewer workers:
Lightly determines the number of CPU cores available and sets the number of workers to the same number. If you have a machine with many cores but not much memory (e.g., less than 2 GB of memory per core), you may run out of memory. In that case, reduce the number of workers rather than increasing the shared memory.

Increase the shared memory limit:
You can change the shared memory from 64 megabytes to 512 megabytes by adding --shm-size="512m" to the Docker run command:

# example of docker run with setting shared memory to 512 MBytes
docker run --shm-size="512m" --gpus all

# you can also increase it to 2 Gigabytes using
docker run --shm-size="2G" --gpus all
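
To confirm that the new limit is in effect, you can inspect the size of the shared memory filesystem from inside the container. The sketch below assumes a Linux container where shared memory is mounted at /dev/shm. The helper name shm_size_mb is hypothetical.

```python
import shutil


def shm_size_mb(path="/dev/shm"):
    """Return the total size of the filesystem mounted at path, in megabytes."""
    return shutil.disk_usage(path).total / (1024 * 1024)
```

Calling shm_size_mb() inside a container started with --shm-size="512m" should report a value close to 512.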

CUDA Error: All CUDA-capable devices are busy or unavailable

CUDA error: all CUDA-capable devices are busy or unavailable CUDA kernel
errors might be asynchronously reported at some other API call,so the
stacktrace below might be incorrect. For debugging consider
passing CUDA_LAUNCH_BLOCKING=1.

This error is likely caused by some other processes on your machine reserving resources on the GPU without properly releasing them. A simple reboot will often resolve the problem, as all GPU resources will be freshly allocated during the reboot. However, if a reboot does not help, we suggest switching to a different CUDA version on your system.

CUDA Unknown Error

CUDA unknown error - this may be due to an incorrectly set up environment, 
e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. 
Setting the available devices to be zero.

This error is raised if your system cannot communicate with the GPU, which might be caused, for example, by a driver update without a reboot or by another setup issue. A reboot is the first thing to try. For more information on the error and potential solutions, see this and this PyTorch issue.

cuDNN Errors

If you encounter cuDNN-related errors, such as CUDNN_STATUS_INTERNAL_ERROR, please verify that your CUDA driver is up-to-date. The Lightly Worker requires at least CUDA driver version 455. We recommend using the latest driver version if possible. New drivers can be downloaded from here.

Lightly Worker Crashes Because of Too Many Open Files

The following error message appears when the Docker runtime runs out of file descriptors. By default, Docker uses nofile=1024. However, this might not be enough when using multiple workers for data fetching with lightly.loader.num_workers.

Error [Errno 24] Too many open files>

To solve this problem, increase the file descriptor limit for the Docker runtime.

You can raise the limit to 90000 by adding --ulimit nofile=90000:90000 to the Docker run command:

docker run --ulimit nofile=90000:90000 --gpus all

More documentation on Docker ulimit settings is provided here.
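
To check which limit a process actually sees, you can query it with Python's standard resource module (available on Unix-like systems). This is a small sketch, and the helper name open_file_limits is hypothetical.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"nofile soft limit: {soft}, hard limit: {hard}")
```

Run inside a container started with --ulimit nofile=90000:90000, both values should be 90000.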

Lightly Worker on WSL

We don't recommend running the Lightly Worker on Windows Subsystem for Linux (WSL), as connection issues may occur. This is likely due to Windows acting as a proxy between the Lightly Worker and the Lightly Platform or cloud providers. Instead, we recommend using a Linux machine or a cloud-hosted instance such as AWS EC2, Google Cloud compute engines or Azure Compute.

Permission

Local Storage Permission Errors

The worker needs access to the mounted directories of your local storage. The operating system manages this access. In the example docker run command below, the mounted directories are MOUNTED_INPUT_DIR and MOUNTED_LIGHTLY_DIR.

docker run ... \
  -v /MOUNTED_INPUT_DIR:/input_mount:ro \
  -v /MOUNTED_LIGHTLY_DIR:/lightly_mount \
  ...

No List Permission for Files in '/input_mount/ '!

This error means the Lightly Worker cannot access the mounted input directory. It requires not only read but also execute permissions, as reading files in a directory requires execute access. Because the user that accesses the directory can depend on how Docker was installed, we recommend granting the permissions to all users. The same permissions are also required for all subdirectories and files in the mounted directory.

Fix it with chmod -R a+rX /MOUNTED_INPUT_DIR, changing the path to match what you set in your docker run command. Use sudo if needed. Create the directory first if needed. If you cannot change the permissions of the directory, see our docs on how to run the Lightly Worker with a custom user and group.

As we constantly improve the Lightly Worker by improving error messages and fixing bugs, please also try updating the version using the command docker pull lightly/worker:latest.

No Write or Overwrite Permission for '/lightly_mount/'!

As with the error No list permission for files in '/input_mount/ '!, the permissions are insufficient. Read, write, and execute permissions on MOUNTED_LIGHTLY_DIR are needed.

Fix it with chmod -R a+rwx /MOUNTED_LIGHTLY_DIR, changing the path to match what you set in your docker run command. Use sudo if needed. Create the directory first if needed. If you cannot change the permissions of the directory, see our docs on how to run the Lightly Worker with a custom user and group.

Additionally, make sure that you only use /lightly_mount and not /lightly_mount:ro in your docker run command.

As we constantly improve the Lightly Worker by improving error messages and fixing bugs, please also try updating the version using the command docker pull lightly/worker:latest.

[Errno 13] Permission Denied

This error most likely means that the Lightly Worker has permission to access the mounted input and lightly directories but cannot access a file or subdirectory within the mounted directories. In that case, verify that permissions are set recursively using the -R flag:

chmod -R a+rX /MOUNTED_INPUT_DIR
chmod -R a+rwx /MOUNTED_LIGHTLY_DIR

If you cannot change the permissions of the directories, see our docs on how to run the Lightly Worker with a custom user and group.

As we constantly improve the Lightly Worker by improving error messages and fixing bugs, please also try updating the version using the command docker pull lightly/worker:latest.
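
To locate the file or subdirectory that causes the error, you can scan the mounted directory on the host for paths that are missing the permissions granted by the chmod commands above. The following is a sketch, not a Lightly tool: the helper find_inaccessible_paths is hypothetical, and it only checks the permission bits for "other" users, so the result is a hint rather than a guarantee about the user the container actually runs as.

```python
import os
import stat


def _lacks_permissions(path, need_write):
    """Return True if path is missing the required 'other' permission bits."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return True  # paths that cannot even be inspected are reported too
    required = stat.S_IROTH
    if stat.S_ISDIR(mode):
        required |= stat.S_IXOTH  # directories must also be traversable
    if need_write:
        required |= stat.S_IWOTH
    return mode & required != required


def find_inaccessible_paths(root, need_write=False):
    """List root and all paths below it that lack read (and optionally write) access."""
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        candidates.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    return [path for path in candidates if _lacks_permissions(path, need_write)]
```

For example, find_inaccessible_paths("/MOUNTED_INPUT_DIR") checks the input directory, and find_inaccessible_paths("/MOUNTED_LIGHTLY_DIR", need_write=True) checks the Lightly directory.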

More Restrictive Permissions

If you want to set more restrictive permissions for security reasons, we recommend:

Mounting more specific directories, e.g. /MOUNTED_INPUT_DIR/project_xyz/data_for_lightly/input.
- Note that by doing so, you will need to adjust lightly-serve every time you wish to view a dataset in the Lightly Platform.
Running the Lightly Worker with a custom user and group.

Running the Worker Behind a Proxy

If you are running the worker behind a corporate proxy such as Zscaler or another security solution that terminates the SSL connection, you can set the environment variables HTTPS_PROXY and LIGHTLY_CA_CERTS when starting the worker.

docker run ... \
	-v "/home/user/ssl/cert/mycert.crt":/etc/ssl/certs/mycert.crt \
	-e LIGHTLY_CA_CERTS="/etc/ssl/certs/mycert.crt" \
	-e HTTPS_PROXY="https://user:password@proxyIP:proxyPort" \
	...
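
If the proxy user name or password contains characters such as '@' or ':', they usually need to be percent-encoded before being placed in a URL, because most HTTP clients would otherwise treat them as delimiters. The sketch below shows one way to build such a value for HTTPS_PROXY with Python's standard library. The helper build_proxy_url is hypothetical, and whether encoding is needed depends on your credentials and proxy.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="https"):
    """Build a proxy URL, percent-encoding reserved characters in the credentials."""
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    return f"{scheme}://{credentials}@{host}:{port}"


print(build_proxy_url("user", "p@ss:word", "proxyIP", "proxyPort"))
# https://user:p%40ss%3Aword@proxyIP:proxyPort
```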

Scheduled Runs

Locked Jobs

Lightly tries to protect different Lightly Workers and their runs from each other and ensures that only one job can be active per dataset, while still allowing multiple Lightly Workers (with different worker_ids) to be running in parallel.
In certain situations, a job can remain locked, e.g., when the Lightly Worker crashes or when it is forcefully shut down without a grace period to clean up while executing a job. From the viewpoint of Lightly, the job is still being processed, and the Lightly Worker will be considered as being online.

This can lead to a warning message when starting the Lightly Worker with the same worker_id such as Found 4 LOCKED jobs of the worker with id.

Resolving Locked Jobs

By default, worker.force_start is set to True, which allows one to bypass our detection of a potentially second Lightly Worker still running and can result in the aforementioned warning. In this case, we will not cancel any locked jobs. The Lightly Worker starts normally and takes the next job.

Setting worker.force_start to False does not bypass our detection of a potentially second Lightly Worker. Your new Lightly Worker will then not process jobs if a second Lightly Worker with the same worker_id is detected. If no other Lightly Worker is running, all the locked/failed jobs previously handled by the worker_id are cleaned up and the Lightly Worker will start normally and take the next job.
In a multi-Lightly Worker setup, it is recommended to set worker.force_start to False.

docker run --shm-size="1024m" --gpus all --rm -it \
	-e LIGHTLY_TOKEN="MY_LIGHTLY_TOKEN" \
	-e LIGHTLY_WORKER_ID="MY_WORKER_ID" \
	lightly/worker:latest \
	worker.force_start=False

Why Is My Scheduled Run Not Picked Up by a Worker?

There can be several reasons for this:

Make sure that you have a Lightly Worker running and polling for jobs. Look for it on the workers overview page on the Lightly Platform.
If you specified labels for your worker, make sure that the runs_on labels you specified when scheduling a job are a subset of the Worker labels. See the docs on how label matching works.
Verify that the job was scheduled with a compatible Lightly Python Client. Please see our compatibility table for a list of compatible Lightly Worker and Lightly Python Client versions.
When a worker picks up a scheduled run, it is removed from the list of scheduled runs and moved to the worker runs list. Please check if your scheduled run was already picked up.
The Lightly Worker polls for new jobs every 20s, so it might take a while until a scheduled run is picked up.
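
The subset rule for labels mentioned above can be illustrated with plain Python sets. This is only a sketch of the rule as described here, not the platform's implementation. The helper labels_match and the example labels are hypothetical.

```python
def labels_match(runs_on, worker_labels):
    """Return True if every runs_on label is also one of the worker's labels."""
    return set(runs_on) <= set(worker_labels)


# The worker has both labels, so a job that requires "gpu" can be picked up.
print(labels_match(["gpu"], ["gpu", "office"]))           # True
# The job also requires "cloud", which this worker does not have.
print(labels_match(["gpu", "cloud"], ["gpu", "office"]))  # False
```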

Miscellaneous

Token Printed to Shared Stdout or Logs

The token (along with other Hydra configurations) will be printed to stdout and, therefore, appear in logs in an automated setup. This can be avoided by setting your token via the LIGHTLY_TOKEN environment variable:

docker run --shm-size="1024m" --gpus all --rm -it \
	-e LIGHTLY_TOKEN="MY_LIGHTLY_TOKEN" \
	-e LIGHTLY_WORKER_ID="MY_WORKER_ID" \
	lightly/worker:latest

Create a Frame-Level Tag Based on a Child Tag

Sometimes, you find interesting crops in the crop-level dataset and create a new tag to save them. However, working with the frame-level dataset based on the newly created crop-level tag is not directly possible. To overcome this limitation, we can use the API client to create a tag in the frame-level dataset based on a tag in the crop-level dataset.

from lightly.api import ApiWorkflowClient

TOKEN = "YOUR_TOKEN"

CROP_DATASET_NAME = "CROP_DATASET_NAME"
CROP_DATASET_SOURCE_TAG_NAME = "CROP_DATASET_SOURCE_TAG_NAME"

PARENT_DATASET_NAME = "PARENT_DATASET_NAME"
PARENT_DATASET_NEW_TAG_NAME = "PARENT_DATASET_NEW_TAG_NAME"

def find_matching_files(crops, frames):
    """
    Given two lists of filenames, 'crops' and 'frames', this function identifies and returns
    the filenames from 'frames' that match the base names in 'crops'. The 'crops' filenames 
    have a unique crop ID encoding at the end, which is not present in the 'frames' filenames.

    Parameters:
    - crops (list of str): List of filenames with crop ID encoding.
    - frames (list of str): List of filenames without crop ID encoding.

    Returns:
    - list of str: List of matching filenames from 'frames'.
    """
    
    # Extract base names from 'crops' by removing the last 3 sections separated by '-'.
    # A set keeps the membership checks below fast for large datasets.
    crop_base_names = {crop.rsplit('-', 3)[0] for crop in crops}

    # Keep every frame whose base name (the filename without its last
    # '-'-separated section, which holds the file extension) matches a crop.
    matching_files = [
        frame for frame in frames if frame.rsplit('-', 1)[0] in crop_base_names
    ]

    return matching_files


client = ApiWorkflowClient(token=TOKEN)

# get filenames from crops tag
client.set_dataset_id_by_name(CROP_DATASET_NAME)
filenames_crops = client.export_filenames_by_tag_name(
    CROP_DATASET_SOURCE_TAG_NAME
).split('\n')

# get filenames of frame dataset
# (this dataset must already have been processed by at least one Lightly Worker run)
client.set_dataset_id_by_name(PARENT_DATASET_NAME)
filenames_frames = client.export_filenames_by_tag_name(
    "initial-tag" # initial-tag always consists of all the files
).split('\n')

# find matching filenames between the two datasets
filenames_new_tag = find_matching_files(filenames_crops, filenames_frames)

# create a new tag in the frame level dataset
client.create_tag_from_filenames(fnames_new_tag=filenames_new_tag, new_tag_name=PARENT_DATASET_NEW_TAG_NAME)
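
To see how the matching works without calling the API, here is a small self-contained example. The filenames are made up for illustration and are only chosen so that a crop name carries three extra '-'-separated sections after the shared base name. Real filenames in your datasets may look different.

```python
def find_matching_files(crops, frames):
    """Return the frames whose base name also appears among the crops."""
    crop_base_names = {crop.rsplit("-", 3)[0] for crop in crops}
    return [frame for frame in frames if frame.rsplit("-", 1)[0] in crop_base_names]


# Made-up filenames for illustration only; real Lightly filenames may differ.
crops = ["video1-0042-mp4-crop-3.png"]
frames = ["video1-0042-mp4.png", "video1-0043-mp4.png"]

print(find_matching_files(crops, frames))  # ['video1-0042-mp4.png']
```

Only the first frame is returned, because its base name video1-0042 is the only one that also occurs among the crop base names.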