Relevant Filenames

Often not all files in a datasource are relevant. In that case, you can pass a list of filenames to the Lightly Worker using the relevant_filenames_file configuration option. The worker then considers only the listed files and ignores all others. To do so, create a text file that contains one relevant filename or directory per line and pass the path to that text file when scheduling the run. This works for both videos and images.

For example, let's say you're working with the following file structure in your input datasource (in this case, an AWS S3 bucket) where you are only interested in image_1.png, subdir/image_2.png and subdir/image_3.png:

s3://bucket/input/
├── image_1.png
└── subdir/
    ├── image_2.png
    ├── image_3.png
    ├── image_40.png
    ├── image_41.png
    └── image_42.png

Then you can add a file called relevant_filenames.txt to your Lightly datasource with the following content:

image_1.png
subdir/image_2.png
subdir/image_3.png

🚧

The relevant_filenames_file is expected to be in the Lightly datasource and must always be located in a subdirectory called .lightly.

Only file paths relative to the input datasource are supported, and relative paths cannot include dot notations ./ or ../.
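A quick local check before uploading the file can catch entries that break these rules. The following is a minimal sketch in plain Python, not part of the Lightly API: the helper name find_invalid_lines is hypothetical, and treating a leading / or a URL scheme such as s3:// as "not relative" is an assumption about how the rule is meant.

```python
def find_invalid_lines(lines):
    """Return (line_number, entry) pairs that break the path rules.

    Hypothetical helper for a local sanity check, not a Lightly function.
    Each line is split on whitespace so that every pattern on a line with
    exclusions is checked on its own.
    """
    invalid = []
    for number, line in enumerate(lines, start=1):
        for entry in line.split():
            # A leading "!" is part of the exclusion syntax, not of the path.
            path = entry.lstrip("!")
            parts = path.split("/")
            # Assumption: absolute paths and full URLs are not relative paths.
            is_absolute = path.startswith("/") or "://" in path
            # Dot notation: a path component that is exactly "." or "..".
            has_dot_notation = "." in parts or ".." in parts
            if is_absolute or has_dot_notation:
                invalid.append((number, entry))
    return invalid
```

An empty result means that no entry uses ./, ../, or an absolute path. Names such as .lightly are not flagged, because only path components that are exactly . or .. count as dot notation.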

When using this feature, the Lightly datasource should look like this:

s3://bucket/lightly/
└── .lightly/
    └── relevant_filenames.txt

The corresponding Python command to submit a run would then be as follows:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

client.schedule_compute_worker_run(
    worker_config={
        "relevant_filenames_file": ".lightly/relevant_filenames.txt",
    },
    selection_config={
        "n_samples": 50,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS"
                },
                "strategy": {
                    "type": "DIVERSITY"
                }
            }
        ]
    }
)

Select Directories

To include whole directories instead of listing many files individually, specify a file path prefix and mark its end with an asterisk *. Everything up to the first * on a line is treated as the prefix. For example:

image_1.png
subdir/*

Exclude Files

You can also use prefixes to exclude certain files. To do so, start a line with a prefix pattern (e.g. your/directory/*) and then add the exclusion patterns, separated by spaces:

your/directory/* your/directory/ignore_me/*

You can also re-include files that an exclusion would otherwise remove by prefixing them with the ! operator. See below for an example.

To understand how to use ! correctly, it's helpful to understand how Lightly parses the relevant filenames. Lightly uses the following logic:

  1. Iterate over the file line by line and return relevant filenames.
  2. If a prefix pattern is encountered:
    a. Switch to listing the datasource.
    b. List until the prefix is depleted while applying the exclude pattern.
    c. Go back to 1.

For example, let's assume the following relevant filenames file, which includes images one through three explicitly, includes all images in foo/bar/, and includes all images in foo/baz/ except foo/baz/image_1.jpg:

foo/image_1.jpg
foo/image_2.jpg
foo/bar/*
foo/baz/* foo/baz/image_1.jpg
foo/image_3.jpg

Lightly will first list foo/image_1.jpg and foo/image_2.jpg. Then it will switch to listing all files in foo/bar/. When the prefix is depleted, Lightly starts listing foo/baz/. It will list all files and filter out foo/baz/image_1.jpg. Finally, Lightly returns the last item in the list: foo/image_3.jpg.
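The ordering described above can be reproduced with a small toy model. This is a sketch only, not Lightly's implementation: the function name resolve_relevant_filenames is hypothetical, the datasource is simulated by a plain list of paths, matching exclusion patterns with fnmatch is an assumption, and the ! operator is left out.

```python
from fnmatch import fnmatchcase


def resolve_relevant_filenames(lines, datasource_files):
    """Toy model of the parsing steps; not Lightly's actual code."""
    resolved = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if "*" not in line:
            # Step 1: an explicit filename is returned as-is.
            resolved.append(line)
            continue
        # Step 2: a prefix pattern switches to listing the datasource.
        include, *excludes = line.split()
        # Everything up to the first "*" is the prefix.
        prefix = include.split("*", 1)[0]
        for path in sorted(datasource_files):
            if not path.startswith(prefix):
                continue
            # Assumption: exclusions are shell-style wildcard patterns.
            if any(fnmatchcase(path, pattern) for pattern in excludes):
                continue
            resolved.append(path)
        # Step 2c: the loop continues with the next line of the file.
    return resolved


# Simulated datasource listing (made-up filenames for illustration).
datasource_files = [
    "foo/image_1.jpg", "foo/image_2.jpg", "foo/image_3.jpg",
    "foo/bar/a.jpg", "foo/bar/b.jpg",
    "foo/baz/image_1.jpg", "foo/baz/image_2.jpg",
]
lines = [
    "foo/image_1.jpg",
    "foo/image_2.jpg",
    "foo/bar/*",
    "foo/baz/* foo/baz/image_1.jpg",
    "foo/image_3.jpg",
]
for name in resolve_relevant_filenames(lines, datasource_files):
    print(name)
```

In this model the files come back in the order of the lines in the relevant filenames file, not in the order of the datasource listing, which is why foo/image_3.jpg appears last.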

Include Directory Except Subdirectory

Include everything in foo/ except the directory foo/bar:

foo/* foo/bar/*

Include Directory Except Files by Suffix

Include everything in foo/ except .png files in foo/:

foo/* foo/*.png

Include Directory Except Specific Images

Include everything in foo/ except foo/bar/image_1.jpg and foo/baz/image_1.jpg:

foo/* foo/bar/image_1.jpg foo/baz/image_1.jpg

Include Directory but Exclude Images by Prefix

Include everything in foo/ except images whose path starts with foo/image_1:

foo/* foo/image_1*

Exclude Subdirectory Except a Specific Image

Include everything in foo/ except the subdirectory foo/bar, but keep foo/bar/image_1.jpg:

foo/* foo/bar/* !foo/bar/image_1.jpg
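To reason about a single pattern line, including the ! operator, it can help to test individual paths against it. The predicate below is an illustrative sketch under assumptions: is_selected is a hypothetical name, wildcard matching uses Python's fnmatch (where * also matches /), and a ! entry is assumed to take precedence over any exclusion. Lightly's actual matching rules may differ in these details.

```python
from fnmatch import fnmatchcase


def is_selected(path, line):
    """Return True if `path` would be kept by one relevant-filenames line.

    Hypothetical helper that models the recipes above; not Lightly code.
    """
    line = line.strip()
    if "*" not in line:
        # A line without a wildcard names exactly one file.
        return path == line
    include, *rest = line.split()
    # Everything up to the first "*" is the include prefix.
    prefix = include.split("*", 1)[0]
    if not path.startswith(prefix):
        return False
    excludes = [p for p in rest if not p.startswith("!")]
    keeps = [p[1:] for p in rest if p.startswith("!")]
    # Assumption: a "!" entry wins over any exclusion pattern.
    if any(fnmatchcase(path, p) for p in keeps):
        return True
    return not any(fnmatchcase(path, p) for p in excludes)


line = "foo/* foo/bar/* !foo/bar/image_1.jpg"
print(is_selected("foo/bar/image_1.jpg", line))  # True
print(is_selected("foo/bar/image_2.jpg", line))  # False
```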