Selection

With the power of LightlyOne, you can select a subset of your unlabeled data stored within your datasource. This allows you to mine your data efficiently based on several objectives you define.

For example, you can specify that the images in the subset should be visually diverse, be ones the model struggles with (active learning), be sharp, or follow a certain class distribution, e.g. 50% sunny, 30% cloudy, and 20% rainy weather. See further examples and use cases.

Each of these objectives is defined by a pair of settings, the input and the strategy:

  • The input defines the data the objective operates on. This data is either a scalar number or a vector for each sample in the dataset. See selection input for more information.
  • The strategy defines the objective to apply on the input data. See selection strategies for more information.

LightlyOne allows you to specify several objectives at the same time. The algorithms try to fulfill all objectives simultaneously.
For details on how the different selection strategies are combined, see selection combination.

LightlyOne data selection algorithms support different input types:

  • Embeddings computed using our Lightly Framework for self-supervised learning.
  • Lightly metadata, such as sharpness, which LightlyOne computes from the images themselves.
  • (Optional) Model predictions such as classifications, object detections, or segmentations.
  • (Optional) Custom Metadata can be any additional key-value information you can encode in a JSON file (from numbers to categorical strings) such as weather conditions, temperature, timestamp, location, etc.

Prerequisites

In order to use the selection feature, you need to:

  • Start the LightlyOne Worker in worker mode.
  • Set up a dataset in the LightlyOne Platform with cloud storage as datasource. See Create a Dataset.

Scheduling a Run

To schedule a LightlyOne Worker run with a custom selection, use the Lightly Python Client and its schedule_compute_worker_run method, passing the selection in the selection_config argument. See Run Your First Selection for reference.

Here is an example of scheduling a LightlyOne Worker run with a selection configuration:

from lightly.api import ApiWorkflowClient

# Create the LightlyOne client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Schedule the compute run using a custom config.
# You can edit the values according to your needs.
scheduled_run_id = client.schedule_compute_worker_run(
    selection_config={
        "n_samples": 50,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS"
                },
                "strategy": {
                    "type": "DIVERSITY"
                }
            }
        ]
    },
)

Selection Configuration

The configuration of a selection needs to specify both the maximum number of samples to select and the strategies:

{
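    "n_samples": 50,               # absolute number of samples to select, OR
    "proportion_samples": 0.1,     # fraction of the input dataset; set only one of the two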
    "strategies": [
        {
            "input": {
                "type": ...
            },
            "strategy": {
                "type": ...
            }
        },
        ... more strategies
    ]
}

The n_samples setting must be a positive integer giving the absolute number of samples to select. Instead of n_samples, you can set proportion_samples to specify the number of samples relative to the input dataset size, e.g. 0.1 selects 10% of all samples. Set exactly one of the two: setting both or neither causes an error.
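
The either-or rule can also be checked on the client side before a run is scheduled. The following is a minimal sketch under stated assumptions: check_sample_count is a hypothetical helper and not part of the Lightly Python Client, and the accepted range for proportion_samples (greater than 0, at most 1) is an assumption rather than documented behavior.

```python
def check_sample_count(selection_config: dict) -> str:
    """Return the name of the sample-count key that is set.

    Hypothetical client-side check; the LightlyOne API does its own validation.
    Raises ValueError if both or neither of the two keys are present.
    """
    has_n = "n_samples" in selection_config
    has_p = "proportion_samples" in selection_config
    if has_n == has_p:  # both set, or neither set
        raise ValueError("Set exactly one of 'n_samples' or 'proportion_samples'.")

    if has_n:
        n = selection_config["n_samples"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(n, bool) or not isinstance(n, int) or n <= 0:
            raise ValueError("'n_samples' must be a positive integer.")
        return "n_samples"

    p = selection_config["proportion_samples"]
    # Assumed valid range: 0 < p <= 1.
    if isinstance(p, bool) or not isinstance(p, (int, float)) or not 0 < p <= 1:
        raise ValueError("'proportion_samples' must be in the range (0, 1].")
    return "proportion_samples"
```

Calling check_sample_count on a config before passing it as selection_config lets a misconfiguration surface locally instead of at scheduling time.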

Each strategy is specified by a dictionary, which is always made up of an input and the actual strategy.

{
    "input": {
        "type": ...
    },
    "strategy": {
        "type": ...
    }
}
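
To illustrate how several such dictionaries form one selection configuration, here is a hedged sketch that pairs inputs with strategies. make_objective is a hypothetical helper. The first entry reuses the EMBEDDINGS and DIVERSITY types from the scheduling example; the second entry (a METADATA input with a "key" field, a BALANCE strategy with a "target" field, and the custom metadata key "weather") is an assumption made for illustration and should be checked against the selection input and selection strategies pages.

```python
import json


def make_objective(input_cfg: dict, strategy_cfg: dict) -> dict:
    """Pair an input with a strategy to form one entry of "strategies".

    Hypothetical helper; it only checks that both parts declare a "type".
    """
    for name, cfg in (("input", input_cfg), ("strategy", strategy_cfg)):
        if "type" not in cfg:
            raise ValueError(f"The {name} dictionary needs a 'type' field.")
    return {"input": dict(input_cfg), "strategy": dict(strategy_cfg)}


selection_config = {
    "n_samples": 50,
    "strategies": [
        # Objective 1: visually diverse images (types taken from the example above).
        make_objective({"type": "EMBEDDINGS"}, {"type": "DIVERSITY"}),
        # Objective 2: 50% sunny, 30% cloudy, 20% rainy.
        # ASSUMPTION: the field names "key" and "target", the BALANCE type,
        # and the custom metadata key "weather" are illustrative.
        make_objective(
            {"type": "METADATA", "key": "weather"},
            {"type": "BALANCE", "target": {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}},
        ),
    ],
}

print(json.dumps(selection_config, indent=4))
```

The resulting dictionary has the same shape as the selection_config argument shown in the scheduling example, with one entry per objective.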