Selection

Lightly allows you to specify the subset to be selected based on several objectives.

E.g. you can specify that the images in the subset should be visually diverse, should be images the model struggles with (active learning), should only be sharp images, or should follow a certain class distribution, e.g. 50% from sunny, 30% from cloudy and 20% from rainy weather.

Each of these objectives is defined by a strategy. A strategy consists of two parts:

  • The input defines which data the objective is defined on. This data is either a scalar number or a vector for each sample in the dataset.

  • The strategy itself defines the objective to apply on the input data.

Lightly allows you to specify several objectives at the same time. The algorithms try to fulfil all objectives simultaneously.

Lightly’s data selection algorithms support four types of input: EMBEDDINGS, SCORES, METADATA, and PREDICTIONS.

Prerequisites

In order to use the selection feature, you need to:

Scheduling a Lightly Worker run with selection

For scheduling a Lightly Worker run with a specific selection, you can use the Python client and its schedule_compute_worker_run method. You specify the selection with the selection_config argument. See Scheduling a Simple Job for reference.

Here is an example for scheduling a Lightly worker run with a specific selection configuration:

# You can reuse the client from previous scripts. If you want to create a new
# one, you can uncomment the following lines:
# import lightly
# client = lightly.api.ApiWorkflowClient(token="TOKEN", dataset_id="DATASET_ID")

# Schedule the compute run using a custom config.
# You can easily edit the values according to your needs.


scheduled_run_id = client.schedule_compute_worker_run(
    worker_config={
        'enable_corruptness_check': True,
        'remove_exact_duplicates': True,
        'enable_training': False,
    },
    selection_config={
        "n_samples": 50,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS"
                },
                "strategy": {
                    "type": "DIVERSITY"
                }
            }
        ]
    },
    lightly_config={
        'loader': {
            'batch_size': 16,
            'shuffle': True,
            'num_workers': -1,
            'drop_last': True
        },
        'model': {
            'name': 'resnet-18',
            'out_dim': 128,
            'num_ftrs': 32,
            'width': 1
        },
        'trainer': {
            'gpus': 1,
            'max_epochs': 100,
            'precision': 32
        },
        'criterion': {
            'temperature': 0.5
        },
        'optimizer': {
            'lr': 1,
            'weight_decay': 0.00001
        },
        'collate': {
            'input_size': 64,
            'cj_prob': 0.8,
            'cj_bright': 0.7,
            'cj_contrast': 0.7,
            'cj_sat': 0.7,
            'cj_hue': 0.2,
            'min_scale': 0.15,
            'random_gray_scale': 0.2,
            'gaussian_blur': 0.5,
            'kernel_size': 0.1,
            'vf_prob': 0,
            'hf_prob': 0.5,
            'rr_prob': 0
        }
    }
)

"""
Optionally, you can use this code to track and print the state of the compute worker.
The loop will end once the compute worker run has finished, been canceled, or failed.
"""
for run_info in client.compute_worker_run_info_generator(scheduled_run_id=scheduled_run_id):
    print(f"Compute worker run is now in state='{run_info.state}' with message='{run_info.message}'")

if run_info.ended_successfully():
    print("SUCCESS")
else:
    print("FAILURE")

Selection Configuration

The configuration of a selection needs to specify both the maximum number of samples to select and the strategies:

{
    "n_samples": 50, # alternatively, use "proportion_samples" (but never both)
    "strategies": [
        {
            "input": {
                "type": ...
            },
            "strategy": {
                "type": ...
            }
        },
        ... more strategies
    ]
}

The variable n_samples must be a positive integer specifying the absolute number of samples to select. As an alternative to n_samples, you can set proportion_samples to define the number of samples relative to the input dataset size, e.g. set it to 0.1 to select 10% of all samples. Set exactly one of them: setting both or neither will cause an error.
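This either/or rule can be sketched as a small validation check (check_sample_count is a hypothetical helper for illustration, not part of the Lightly API):

```python
def check_sample_count(selection_config: dict) -> None:
    # Exactly one of "n_samples" and "proportion_samples" must be set;
    # setting both or neither is an error.
    has_absolute = "n_samples" in selection_config
    has_relative = "proportion_samples" in selection_config
    if has_absolute == has_relative:
        raise ValueError(
            "Set either n_samples or proportion_samples, but not both or neither."
        )

check_sample_count({"n_samples": 50, "strategies": []})            # valid
check_sample_count({"proportion_samples": 0.1, "strategies": []})  # valid
```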

Each strategy is specified by a dictionary, which is always made up of an input and the actual strategy.

{
    "input": {
        "type": ...
    },
    "strategy": {
        "type": ...
    }
},

Selection Input

The input can be one of the following: EMBEDDINGS, SCORES, METADATA, or PREDICTIONS.

Embeddings

The Lightly OSS framework for self-supervised learning is used to compute the embeddings. They are a vector of numbers for each sample.

You can define embeddings as input using:

"input": {
    "type": "EMBEDDINGS"
}

Selection Strategy

There are several types of selection strategies, each pursuing a different objective.

Diversity

Use this strategy to select samples that are as different as possible from each other.

It can be used with EMBEDDINGS, SCORES and numerical METADATA. Samples with a high distance between their embeddings/scores/metadata are considered to be more different from each other than samples with a low distance. The strategy is specified like this:

"strategy": {
    "type": "DIVERSITY"
}

If you want to preserve a minimum distance between chosen samples, you can specify it as an additional stopping condition. The selection process will stop as soon as one of the stopping criteria has been reached.

"strategy": {
    "type": "DIVERSITY",
    "stopping_condition_minimum_distance": 0.2
}

Setting "stopping_condition_minimum_distance": 0.2 will remove all samples which are closer to each other than 0.2. This allows you to specify the minimum allowed distance between two images in the output dataset. If you use embeddings as input, this value should be between 0 and 2.0, as the embeddings are normalized to unit length. This is often a convenient method when working with different data sources and trying to combine them in a balanced way. If you want to use this stopping condition to stop the selection early, make sure that you allow selecting enough samples by setting n_samples or proportion_samples high enough.

Note

Higher minimum distance in the embedding space results in more diverse images being selected. Furthermore, increasing the minimum distance will result in fewer samples being selected.
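A minimal sketch of how DIVERSITY with a minimum-distance stopping condition behaves is greedy farthest-point selection on unit-normalized embeddings (an illustration under simplifying assumptions, not Lightly's actual implementation):

```python
import numpy as np

def select_diverse(embeddings, n_samples, min_distance=0.0):
    # Greedy farthest-point selection: start from the first sample and
    # repeatedly add the sample farthest from everything selected so far.
    # Stops when n_samples is reached or when no remaining sample is
    # farther away than min_distance (the stopping condition).
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [0]
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < n_samples:
        candidate = int(np.argmax(dists))
        if dists[candidate] <= min_distance:
            break  # stopping condition reached
        selected.append(candidate)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[candidate], axis=1))
    return selected
```

Selection ends either when n_samples is reached or when every remaining sample is within min_distance of an already selected one, matching the two stopping criteria described above. Because the embeddings are normalized to unit length, no two samples can be farther apart than 2.0.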

Configuration Examples

Here are examples for the full configuration including the input for several objectives:

Visual Diversity (CORESET)

Choosing 100 samples that are visually diverse corresponds to diversifying samples based on their embeddings:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS"
            },
            "strategy": {
                "type": "DIVERSITY"
            }
        }
    ]
}

Active Learning

Active Learning corresponds to weighting samples based on active learning scores:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "SCORES",
                "task": "my_object_detection_task", # change to your task
                "score": "uncertainty_entropy" # change to your preferred score
            },
            "strategy": {
                "type": "WEIGHTS"
            }
        }
    ]
}

Note

This works for Image Classification or Segmentation as well!
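For intuition on what an entropy-based uncertainty score measures, here is a sketch of how such a score could be computed from a model's predicted class probabilities (an illustration only; the Lightly Worker computes the scores from the predictions you upload):

```python
import math

def uncertainty_entropy(class_probabilities):
    # Entropy of the predicted class distribution: zero for a fully
    # confident prediction, maximal for a uniform one. Higher entropy
    # means the model is less certain, so the sample is weighted higher.
    return -sum(p * math.log(p) for p in class_probabilities if p > 0.0)

uncertainty_entropy([1.0, 0.0])  # confident prediction: entropy 0
uncertainty_entropy([0.5, 0.5])  # maximally uncertain: entropy log(2)
```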

Visual Diversity and Active Learning (CORAL)

For combining two strategies, just specify both of them:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS"
            },
            "strategy": {
                "type": "DIVERSITY"
            }
        },
        {
            "input": {
                "type": "SCORES",
                "task": "my_object_detection_task", # change to your task
                "score": "uncertainty_entropy" # change to your preferred score
            },
            "strategy": {
                "type": "WEIGHTS"
            }
        }
    ]
}

Metadata Thresholding

This can be used to remove e.g. blurry images, which corresponds to selecting only samples whose sharpness is above a threshold:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "METADATA",
                "key": "lightly.sharpness"
            },
            "strategy": {
                "type": "THRESHOLD",
                "threshold": 20,
                "operation": "BIGGER"
            }
        }
    ]
}
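The effect of a THRESHOLD strategy can be sketched as a simple filter over per-sample metadata values ("BIGGER" mirrors the example above; the "SMALLER" operation shown alongside it is an assumed counterpart for illustration):

```python
def apply_threshold(values, threshold, operation):
    # Return the indices of samples whose metadata value passes the
    # comparison. "BIGGER" mirrors the sharpness example above; the
    # "SMALLER" operation is included here for illustration.
    operations = {
        "BIGGER": lambda v: v > threshold,
        "SMALLER": lambda v: v < threshold,
    }
    return [i for i, v in enumerate(values) if operations[operation](v)]

sharpness = [12.0, 25.5, 31.2, 19.9]
apply_threshold(sharpness, threshold=20, operation="BIGGER")  # -> [1, 2]
```
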
Object Balancing

Use Lightly pretagging to detect the objects, then specify a target distribution of classes:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "PREDICTIONS",
                "task": "lightly_pretagging", # (optional) change to your task
                "name": "CLASS_DISTRIBUTION"
            },
            "strategy": {
                "type": "BALANCE",
                "target": {
                    "car": 0.1,
                    "bicycle": 0.5,
                    "bus": 0.1,
                    "motorcycle": 0.1,
                    "person": 0.1,
                    "train": 0.05,
                    "truck": 0.05
                }
            }
        }
    ]
}

Note

To use Lightly pretagging, you need to enable it by setting 'pretagging': True in the worker config. See Pretagging for reference.

Metadata Balancing

Let’s assume you have specified metadata with the path weather.description and want your selected subset to have 20% sunny, 40% cloudy and the rest other images:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "METADATA",
                "key": "weather.description"
            },
            "strategy": {
                "type": "BALANCE",
                "target": {
                    "sunny": 0.2,
                    "cloudy": 0.4
                }
            }
        }
    ]
}
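To see how closely a selected subset matches a BALANCE target, you can compare achieved and target proportions per class (balance_gap is a hypothetical helper for illustration; classes not listed in the target, like rainy above, are left unconstrained):

```python
from collections import Counter

def balance_gap(selected_values, target):
    # Difference between achieved and target proportion for each class
    # in the target; a gap of 0.0 means the target was met exactly.
    counts = Counter(selected_values)
    total = len(selected_values)
    return {cls: counts[cls] / total - frac for cls, frac in target.items()}

subset = ["sunny"] * 2 + ["cloudy"] * 4 + ["rainy"] * 4
balance_gap(subset, {"sunny": 0.2, "cloudy": 0.4})  # -> {'sunny': 0.0, 'cloudy': 0.0}
```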

Application of Strategies

Generally, the order in which the strategies are defined in the config does not matter: first, all thresholding strategies are applied; then, all other strategies are applied in parallel.

Note

Note that different tasks can also be combined. E.g. you can use predictions from “my_weather_classification_task” for one strategy combined with predictions from “my_object_detection_task” for another strategy.

The Lightly optimizer tries to fulfil all strategies as well as possible. Potential reasons why your objectives were not satisfied:

  • Tradeoff between different objectives. The optimizer always has to trade off between different objectives. E.g. it may happen that all samples with high WEIGHTS are close together. If you also specified the DIVERSITY objective, then only a few of these high-weight samples may be chosen; instead, other samples that are more diverse but have lower weights are selected.

  • Restrictions in the input dataset. This applies especially to BALANCE: e.g. if there are only 10 images of ambulances in the input dataset and a total of 1000 images are selected, the output can have at most 1% ambulances. Thus a BALANCE target of 20% ambulances cannot be fulfilled.

  • Too few samples to choose from. If the selection algorithm can only choose a small number of samples, it may not be possible to fulfil the objectives. You can solve this by increasing n_samples or proportion_samples.

Selection on object level

Lightly supports doing selection on Object Level.

While embeddings are fully available, there are some limitations when using METADATA, and predictions for SCORES and PREDICTIONS, as input:

  • When using the object level workflow, the object detections used to create the object crops out of the images are available and can be used for both the SCORES and PREDICTIONS input. However, predictions from other tasks are NOT available at the moment.

  • Lightly metadata is generated on the fly for the object crops and can thus be used for selection. However, other metadata is on image level and thus NOT available at the moment.

If your use case would profit from using image-level data for object-level selection, please reach out to us.