Relevant Filenames

Often, not all files in a bucket are relevant. In that case, you can pass a list of filenames to the Lightly Worker using the relevant_filenames_file configuration option. The worker then considers only the listed files and ignores all others. To do so, create a text file that contains one relevant filename or [folder](#select-folders) per line and pass the path to this text file when scheduling the run. This works for both videos and images.

For example, let's say you're working with the following file structure in your input datasource (in this case, an AWS S3 bucket) where you are only interested in image_1.png, subdir/image_2.png and subdir/image_3.png:

s3://bucket/input/
├── image_1.png
└── subdir/
    ├── image_2.png
    ├── image_3.png
    ├── image_40.png
    ├── image_41.png
    └── image_42.png

Then you can add a file called relevant_filenames.txt to your Lightly datasource with the following content:

image_1.png
subdir/image_2.png
subdir/image_3.png

🚧

The file referenced by relevant_filenames_file must be stored in the Lightly bucket, inside a subdirectory called .lightly.

Only file paths relative to the input bucket are supported, and relative paths cannot include dot notations ./ or ../.
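Before uploading the file, it can help to check its entries against the path rules above. The following is a minimal sketch, not part of the Lightly API: find_invalid_entries is a hypothetical helper, and treating a leading / or a URL scheme such as s3:// as "not relative" is an assumption about how the rule is meant.

```python
def find_invalid_entries(lines):
    """Return the entries that break the relative-path rules.

    Hypothetical helper for local checks; the Lightly Worker performs
    its own validation, which may differ from this sketch.
    """
    invalid = []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines
        segments = entry.split("/")
        # "./" and "../" show up as "." or ".." path segments.
        has_dot_segment = any(seg in (".", "..") for seg in segments)
        # Assumption: absolute paths and full URLs are not relative paths.
        is_absolute = entry.startswith("/") or "://" in entry
        if has_dot_segment or is_absolute:
            invalid.append(entry)
    return invalid


entries = [
    "image_1.png",
    "./subdir/image_2.png",
    "subdir/../image_3.png",
]
print(find_invalid_entries(entries))
```

Here only image_1.png passes the check; the other two entries are reported because they contain ./ and ../.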

When using this feature, the Lightly bucket should look like this:

s3://bucket/lightly/
└── .lightly/
    └── relevant_filenames.txt

The corresponding Python command to submit a run would then be as follows:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Schedule a run that only considers the files listed in
# .lightly/relevant_filenames.txt.
client.schedule_compute_worker_run(
    worker_config={
        "relevant_filenames_file": ".lightly/relevant_filenames.txt",
    },
    selection_config={
        "n_samples": 50,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS"
                },
                "strategy": {
                    "type": "DIVERSITY"
                }
            }
        ]
    }
)

Select Folders

To include whole folders instead of listing many files individually, you can specify a file path prefix and mark it with an asterisk *. Everything up to the first * on a line is treated as the prefix.

image_1.png
subdir/*
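The prefix rule can be illustrated with a few lines of Python. This is only a sketch of the behavior described above, not the Lightly Worker's implementation; select_by_prefix is a hypothetical name, and the file list mirrors the example bucket from earlier on this page.

```python
def select_by_prefix(entries, all_files):
    """Keep files that equal an entry or start with an entry's prefix."""
    selected = []
    for path in all_files:
        for entry in entries:
            if "*" in entry:
                # Everything before the first "*" is the prefix.
                prefix = entry.split("*", 1)[0]
                matched = path.startswith(prefix)
            else:
                # Entries without "*" name exactly one file.
                matched = path == entry
            if matched:
                selected.append(path)
                break
    return selected


all_files = [
    "image_1.png",
    "subdir/image_2.png",
    "subdir/image_3.png",
    "subdir/image_40.png",
    "subdir/image_41.png",
    "subdir/image_42.png",
]
print(select_by_prefix(["image_1.png", "subdir/*"], all_files))
```

With the two entries above, all six files of the example bucket are selected, because every file under subdir/ starts with the prefix subdir/.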

Exclude Files

You can also combine a prefix with gitignore-style patterns to exclude specific files. All gitignore patterns on a line are evaluated together, so you can exclude a group of files and then re-include individual ones by negating a pattern with a !, as shown below:

image_1.png
subdir/* subdir/image_4* !subdir/image_41.png !subdir/image_42.png
^^^^^^^^ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
prefix   gitignore patterns separated by a whitespace

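This selects all files that match the prefix and ignores those that match the gitignore patterns. In the example above, the pattern subdir/image_4* first excludes subdir/image_40.png, subdir/image_41.png, and subdir/image_42.png, and the two negated patterns then re-include subdir/image_41.png and subdir/image_42.png. As a result, image_1.png, subdir/image_2.png, subdir/image_3.png, subdir/image_41.png, and subdir/image_42.png are considered, while subdir/image_40.png is ignored.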
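The example above can be approximated locally to preview which files a line would select. The sketch below is an assumption-heavy stand-in, not the Lightly Worker's logic: it uses Python's fnmatch module in place of real gitignore matching (fnmatch lets * match across /, which gitignore does not), and is_selected and select are hypothetical names.

```python
from fnmatch import fnmatchcase


def is_selected(path, line):
    """Check one file path against one line of the relevant filenames file."""
    tokens = line.split()
    if not tokens:
        return False
    head, patterns = tokens[0], tokens[1:]
    if "*" not in head:
        return path == head  # plain filename entry
    # Everything before the first "*" is the prefix.
    if not path.startswith(head.split("*", 1)[0]):
        return False
    excluded = False
    # As in gitignore, the last matching pattern decides the outcome.
    for pattern in patterns:
        if pattern.startswith("!"):
            if fnmatchcase(path, pattern[1:]):
                excluded = False  # negated pattern re-includes the file
        elif fnmatchcase(path, pattern):
            excluded = True
    return not excluded


def select(lines, all_files):
    """Return the files selected by at least one line."""
    return [p for p in all_files if any(is_selected(p, l) for l in lines)]


lines = [
    "image_1.png",
    "subdir/* subdir/image_4* !subdir/image_41.png !subdir/image_42.png",
]
files = ["image_1.png"] + [f"subdir/image_{i}.png" for i in (2, 3, 40, 41, 42)]
print(select(lines, files))
```

Running this against the example bucket keeps every file except subdir/image_40.png, which matches the description above. For patterns that rely on gitignore-specific rules, the result of this approximation may differ from what the worker selects.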