Selection

Lightly lets you select a subset of the unlabeled data stored in your datasource, so you can mine your data efficiently based on objectives you define.

For example, you can specify that the images in the subset should be visually diverse, be ones the model struggles with (active learning), be sharp, or follow a certain class distribution, e.g. 50% from sunny, 30% from cloudy, and 20% from rainy weather. See further examples and use cases.

Each of these objectives is defined by a pair of settings, the input and the strategy:

  • The input specifies the data the objective operates on. This data is either a scalar number or a vector for each sample in the dataset. See selection input for more information.
  • The strategy defines the objective to apply on the input data. See selection strategies for more information.

Lightly allows you to specify several objectives at the same time. The algorithms try to fulfill all objectives simultaneously.
For details on how the different selection strategies are combined, see selection combination.
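
As a rough sketch of what combining objectives can look like, the configuration below asks for visually diverse images while keeping only sharp ones. The EMBEDDINGS input and DIVERSITY strategy are the ones used in the scheduling example on this page. The METADATA input with the lightly.sharpness key, the THRESHOLD strategy, and its threshold and operation fields are assumptions made for illustration, so check selection input and selection strategies for the exact names and values.

```python
# Two objectives in one selection config (illustrative values only).
selection_config = {
    "n_samples": 50,
    "strategies": [
        # Objective 1: visually diverse images, based on embeddings.
        {
            "input": {"type": "EMBEDDINGS"},
            "strategy": {"type": "DIVERSITY"},
        },
        # Objective 2: keep only sharp images.
        # ASSUMPTION: the type names, key, and fields below are not taken
        # from this page; verify them against the selection reference.
        {
            "input": {"type": "METADATA", "key": "lightly.sharpness"},
            "strategy": {
                "type": "THRESHOLD",
                "threshold": 20,
                "operation": "BIGGER",
            },
        },
    ],
}
```

Each entry in the strategies list is one objective, and the whole dictionary would be passed as the selection_config argument when scheduling a run.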

Lightly's data selection algorithms support different input types:

  • Embeddings computed using our Lightly Framework for self-supervised learning.
  • Lightly metadata, such as sharpness, which Lightly computes from the images themselves.
  • (Optional) Model predictions such as classifications, object detections, or segmentations.
  • (Optional) Custom metadata, which can be any additional key-value information you can encode in a JSON file (from numbers to categorical strings), such as weather conditions, temperature, timestamp, location, etc.

Prerequisites

In order to use the selection feature, you need to:

Scheduling a Run

To schedule a Lightly Worker run with a custom selection, use the schedule_compute_worker_run method of the Python Lightly Framework and pass the selection via its selection_config argument. See Run Your First Selection for reference.

Here is an example of scheduling a Lightly Worker run with a selection configuration:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Schedule the compute run using a custom config.
# You can edit the values according to your needs.
scheduled_run_id = client.schedule_compute_worker_run(
    selection_config={
        "n_samples": 50,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS"
                },
                "strategy": {
                    "type": "DIVERSITY"
                }
            }
        ]
    },
)

Selection Configuration

The configuration of a selection needs to specify both the maximum number of samples to select and the strategies:

{
    "n_samples": 50,  # set either n_samples ...
    "proportion_samples": 0.1,  # ... or proportion_samples, but not both
    "strategies": [
        {
            "input": {
                "type": ...
            },
            "strategy": {
                "type": ...
            }
        },
        ... more strategies
    ]
}

The n_samples value must be a positive integer specifying the absolute number of samples to select. Alternatively, set proportion_samples to specify the number of samples relative to the input dataset size, e.g. 0.1 to select 10% of all samples. Set exactly one of the two: setting both or neither causes an error.
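
To catch a misconfigured sample count before scheduling a run, a small local check can mirror the rule above. This is a minimal sketch: check_sample_count is a hypothetical helper, not part of the Lightly API, and the accepted range for proportion_samples (greater than 0, at most 1) is an assumption.

```python
def check_sample_count(selection_config):
    """Hypothetical pre-flight check; not part of the Lightly API."""
    has_n = "n_samples" in selection_config
    has_p = "proportion_samples" in selection_config

    # Exactly one of the two keys must be present.
    if has_n == has_p:
        raise ValueError(
            "Set exactly one of 'n_samples' or 'proportion_samples'."
        )

    if has_n:
        n = selection_config["n_samples"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(n, bool) or not isinstance(n, int) or n <= 0:
            raise ValueError("'n_samples' must be a positive integer.")
    else:
        p = selection_config["proportion_samples"]
        # ASSUMPTION: valid proportions lie in the interval (0, 1].
        if isinstance(p, bool) or not isinstance(p, (int, float)) or not 0 < p <= 1:
            raise ValueError("'proportion_samples' must be in (0, 1].")


# Passes silently: only one of the two keys is set.
check_sample_count({"proportion_samples": 0.1, "strategies": []})
```

Running the check locally gives a readable error message right away instead of a failed run.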

Each strategy is specified by a dictionary, which is always made up of an input and the actual strategy.

{
    "input": {
        "type": ...
    },
    "strategy": {
        "type": ...
    }
}
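
The shape described above can also be verified locally before a configuration is sent. The following is a minimal sketch: check_strategy_entry is a hypothetical helper, not part of the Lightly API, and it only checks that both parts are present and carry a type, not that the type values are valid.

```python
def check_strategy_entry(entry):
    """Hypothetical helper; not part of the Lightly API.

    Checks that an entry has an 'input' and a 'strategy' dictionary,
    each with a 'type' key, and returns the entry unchanged.
    """
    for key in ("input", "strategy"):
        part = entry.get(key)
        if not isinstance(part, dict) or "type" not in part:
            raise ValueError(
                f"Each entry needs a '{key}' dictionary with a 'type'."
            )
    return entry


# The input/strategy pair used in the scheduling example on this page.
entry = check_strategy_entry(
    {
        "input": {"type": "EMBEDDINGS"},
        "strategy": {"type": "DIVERSITY"},
    }
)
```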