Customize a Selection

Lightly allows you to specify the subset to be selected based on several objectives.

For example, you can specify that the images in the subset should be visually diverse, be images the model struggles with (active learning), include only sharp images, or have a certain distribution of classes, e.g. 50% sunny, 30% cloudy, and 20% rainy weather.

Each of these objectives is defined by a pair of settings, the input and the strategy:

  • The input defines the data the objective operates on. For each sample in the dataset, this data is either a scalar number or a vector.
  • The strategy defines the objective to apply to the input data.

Lightly allows you to specify several objectives at the same time. The algorithms try to fulfill all objectives simultaneously.

Lightly's data selection algorithms support four types of input: embeddings, scores, predictions, and metadata.

Prerequisites

In order to use the selection feature, you need to:

Scheduling a Run

To schedule a Lightly Worker run with a custom selection, use the Lightly Python client and its schedule_compute_worker_run method. You specify the selection with the selection_config argument. See Run Your First Selection for reference.

Here is an example of scheduling a Lightly Worker run with a selection configuration:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Schedule the compute run using a custom config.
# You can edit the values according to your needs.
scheduled_run_id = client.schedule_compute_worker_run(
    selection_config={
        "n_samples": 50,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS"
                },
                "strategy": {
                    "type": "DIVERSITY"
                }
            }
        ]
    },
)

Selection Configuration

The configuration of a selection needs to specify both the maximum number of samples to select and the strategies:

{
    "n_samples": 50, # set either n_samples ...
    "proportion_samples": 0.1, # ... or proportion_samples, but never both
    "strategies": [
        {
            "input": {
                "type": ...
            },
            "strategy": {
                "type": ...
            }
        },
        ... more strategies
    ]
}

The variable n_samples must be a positive integer specifying the absolute number of samples to select. Alternatively, you can set proportion_samples to select a number of samples relative to the input dataset size, e.g. set it to 0.1 to select 10% of all samples. Set either one or the other; setting both or neither will cause an error.
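As an illustration of that rule, the check below could be run on a config dict before scheduling; validate_selection_config is a hypothetical helper sketch, not part of the Lightly API:

```python
def validate_selection_config(config: dict) -> None:
    # Hypothetical helper (not part of the Lightly API): enforce that
    # exactly one of "n_samples" and "proportion_samples" is set.
    has_absolute = "n_samples" in config
    has_relative = "proportion_samples" in config
    if has_absolute == has_relative:  # both set, or neither set
        raise ValueError(
            "Set exactly one of 'n_samples' and 'proportion_samples'."
        )


# A config with only n_samples passes the check.
validate_selection_config({"n_samples": 50, "strategies": []})
```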

Each strategy is specified by a dictionary, which is always made up of an input and the actual strategy.

{
    "input": {
        "type": ...
    },
    "strategy": {
        "type": ...
    }
},

Selection Input

The input can be one of the following:

Embeddings

The Lightly Framework for self-supervised learning is used to compute the embeddings. They are a vector of numbers for each sample.

You can access embeddings as input using:

"input": {
    "type": "EMBEDDINGS"
}

You can also use embeddings from other datasets for strategies such as similarity search:

"input": {
    "type": "EMBEDDINGS",
    "dataset_id": "DATASET_ID_OF_THE_QUERY_IMAGES",
    "tag_name": "TAG_NAME_OF_THE_QUERY_IMAGES" # e.g. "initial-tag"
},

Or object embeddings from an object prediction task to select images with diverse objects:

"input": {
    "type": "EMBEDDINGS",
    "task": "my_object_detection_task" # or "lightly_pretagging"
},

Scores

Scores are scalar numbers, one for each sample. They are specified by the prediction task and the score type:

# using your own predictions
"input": {
    "type": "SCORES",
    "task": "YOUR_TASK_NAME",
    "score": "uncertainty_entropy"
}

# using the lightly pretagging model
"input": {
    "type": "SCORES",
    "task": "lightly_pretagging",
    "score": "uncertainty_entropy"
}

You can use one of the tasks you defined in your datasource; see Work with Predictions for reference. Alternatively, set the task to lightly_pretagging to use object detections created by the Lightly Worker itself. See Lightly Pretagging for reference.

The supported score types are explained in Scorers.

Predictions

The class distribution probability vector of predictions can be used as well. Here, three cases have to be distinguished:

  • Image Classification: The probability vector of each sample's prediction is used directly.
  • Object Detection: The probability vectors of the class predictions of all objects in an image are summed up.
  • Object Detection and using Lightly for Crop Selection: Each sample is a cropped object and has a single object prediction, whose probability vector is used.

This input is specified using the prediction task. Remember the class names, as they are needed in later steps.

If you use your own predictions, the task name and class names are taken from the specification in the prediction schema.json.

Alternatively, set the task to lightly_pretagging to use object detections created by the Lightly Worker itself. Its class names are specified here: Lightly Pretagging.

# using your own predictions
"input": {
    "type": "PREDICTIONS",
    "task": "my_object_detection_task",
    "name": "CLASS_DISTRIBUTION"
}

# using the lightly pretagging model
"input": {
    "type": "PREDICTIONS",
    "task": "lightly_pretagging",
    "name": "CLASS_DISTRIBUTION"
}

Metadata

Metadata is specified by the metadata key. It can be divided along two dimensions:

  • Custom Metadata vs. Lightly Metadata
  • Numerical vs. Categorical values

Custom Metadata can be uploaded to the datasource and accessed from there. See Metadata Format for reference. An example configuration:

"input": {
    "type": "METADATA",
    "key": "weather.temperature"
}

As key, use the “path” you specified when creating the metadata in the datasource.

Lightly Metadata is calculated by the Lightly Worker. It is specified by prepending lightly to the key. An example configuration:

"input": {
    "type": "METADATA",
    "key": "lightly.sharpness"
}

Currently supported Lightly metadata keys are sharpness, snr (signal-to-noise ratio), and sizeInBytes.

📘

Types of Metadata

Not all metadata types can be used in all selection strategies. Lightly differentiates between numerical and categorical metadata.

Numerical metadata refers to numbers (int, float), such as lightly.sharpness or weather.temperature.

Categorical metadata refers to discrete categories, for example: video.location_id or weather.description. It can be either an integer or a string.

Selection Strategy

There are several types of selection strategies, all trying to reach different objectives:

Diversity

Use this strategy to select samples such that they are as different as possible from each other.

Can be used with Embeddings. Samples with a high distance between their embeddings are considered to be more different from each other than samples with a low distance. The strategy is specified like this:

"strategy": {
    "type": "DIVERSITY"
}

If you want to preserve a minimum distance between chosen samples, you can specify it as an additional stopping condition. The selection process will stop as soon as one of the stopping criteria has been reached.

"strategy": {
    "type": "DIVERSITY",
    "stopping_condition_minimum_distance": 0.2
}

Setting "stopping_condition_minimum_distance": 0.2 will remove all samples that are closer to each other than 0.2, which lets you specify the minimum allowed distance between two images in the output dataset. If you use embeddings as input, this value should be between 0 and 2.0, as the embeddings are normalized to unit length. This is often convenient when combining data from different sources in a balanced way.

If you want to use this stopping condition to stop the selection early, make sure to allow enough samples to be selected by setting n_samples or proportion_samples high enough in the selection configuration.

📘

A higher minimum distance in the embedding space results in more diverse images being selected; it also results in fewer samples being selected.
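Putting this together, a full configuration combining an upper bound on the number of samples with the minimum-distance stopping condition could look as follows (the values 500 and 0.2 are illustrative):

```
{
    "n_samples": 500, # upper bound; the stopping condition may stop earlier
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS"
            },
            "strategy": {
                "type": "DIVERSITY",
                "stopping_condition_minimum_distance": 0.2
            }
        }
    ]
}
```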

Weights

The objective of this strategy is to select samples that have a high numerical value.

Can be used with Scores and numerical Metadata inputs. It can be specified with:

"strategy": {
    "type": "WEIGHTS"
}

Threshold

The objective of this strategy is to select only samples whose numerical value fulfills a threshold criterion, e.g. is larger or smaller than a certain value.

Can be used with Scores and numerical Metadata inputs. It is specified as follows:

"strategy": {
    "type": "THRESHOLD",
    "threshold": 20,
    "operation": "BIGGER_EQUAL"
}

This will keep all samples whose value (specified by the input) is >= 20 and remove all others. The allowed operations are SMALLER, SMALLER_EQUAL, BIGGER, BIGGER_EQUAL.
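As a sketch of these semantics (an illustration, not Lightly's actual implementation), the four operations correspond to ordinary comparisons:

```python
import operator

# Illustrative mapping of THRESHOLD operations to comparisons; the names
# mirror the allowed operations listed above.
OPERATIONS = {
    "SMALLER": operator.lt,
    "SMALLER_EQUAL": operator.le,
    "BIGGER": operator.gt,
    "BIGGER_EQUAL": operator.ge,
}


def passes_threshold(value: float, threshold: float, operation: str) -> bool:
    # A sample is kept if its input value satisfies the comparison.
    return OPERATIONS[operation](value, threshold)
```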

Balance

The objective of this strategy is to select samples such that the distribution of classes in them is as close to a target distribution as possible.

E.g. the samples chosen should have 50% sunny and 50% rainy weather. Or, the objects of the samples chosen should be 40% ambulance and 60% buses.

Can be used with Predictions and categorical Metadata.

"strategy": {
    "type": "BALANCE",
    "target": {
        "Ambulance": 0.4, # `Ambulance` should be a valid class in your `schema.json`
        "Bus": 0.6
    }
}

If the values of the target do not sum up to 1, the remainder is assumed to be the target for all other classes. For example, if we set the target to 20% ambulance and 40% bus, the implicit assumption is that the remaining 40% should come from any other class, e.g. from cars, bicycles, or pedestrians.
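The remainder rule can be sketched in a few lines; implicit_other_target is a hypothetical illustration, not a Lightly function:

```python
def implicit_other_target(target: dict) -> float:
    # Fraction implicitly assigned to all classes not named in the target.
    explicit_total = sum(target.values())
    if not 0.0 <= explicit_total <= 1.0:
        raise ValueError("Target fractions must total between 0 and 1.")
    return 1.0 - explicit_total
```

For example, a target of 20% ambulance and 40% bus leaves an implicit 40% for all other classes.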

Note that classes that are not specified do not influence the selection process!

Similarity

With this strategy you can use the input embeddings from another dataset to select similar images. This can be useful if you are looking for more examples of certain edge cases.

Can be used with Embeddings.

"strategy": {
    "type": "SIMILARITY"
}

Configuration Examples

Here are examples for the full configuration including the input for several objectives:

Visual Diversity

Choosing 100 samples that are visually diverse corresponds to diversifying the samples based on their embeddings:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS"
            },
            "strategy": {
                "type": "DIVERSITY"
            }
        }
    ]
}

Active Learning

Active learning corresponds to weighting samples based on active learning scores:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "SCORES",
                "task": "my_object_detection_task", # change to your task
                "score": "uncertainty_entropy" # change to your preferred score
            },
            "strategy": {
                "type": "WEIGHTS"
            }
        }
    ]
}

📘

This works as well for Image Classification or Segmentation! Just change the input task to a classification or segmentation task.

Visual Diversity and Active Learning

To combine two strategies, simply specify both of them:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS"
            },
            "strategy": {
                "type": "DIVERSITY"
            }
        },
        {
            "input": {
                "type": "SCORES",
                "task": "my_object_detection_task", # change to your task
                "score": "uncertainty_entropy" # change to your preferred score
            },
            "strategy": {
                "type": "WEIGHTS"
            }
        }
    ]
}

Metadata Thresholding

This can be used to remove, for example, blurry images, which corresponds to selecting samples whose sharpness is above a threshold:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "METADATA",
                "key": "lightly.sharpness"
            },
            "strategy": {
                "type": "THRESHOLD",
                "threshold": 20,
                "operation": "BIGGER"
            }
        }
    ]
}

Object Balancing

Use lightly pretagging to get the objects, then specify a target distribution of classes:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "PREDICTIONS",
                "task": "lightly_pretagging", # (optional) change to your task
                "name": "CLASS_DISTRIBUTION"
            },
            "strategy": {
                "type": "BALANCE",
                "target": {
                    "car": 0.1,
                    "bicycle": 0.5,
                    "bus": 0.1,
                    "motorcycle": 0.1,
                    "person": 0.1,
                    "train": 0.05,
                    "truck": 0.05
                }
            }
        }
    ]
}

📘

To use the lightly_pretagging task you need to enable it by setting pretagging to True in the worker config. See Lightly Pretagging for details.

Metadata Balancing

Let's assume you have specified metadata with the path weather.description and want your selected subset to have 20% sunny, 40% cloudy and the rest other images:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "METADATA",
                "key": "weather.description"
            },
            "strategy": {
                "type": "BALANCE",
                "target": {
                    "sunny": 0.2,
                    "cloudy": 0.4
                }
            }
        }
    ]
}

Similarity Search

To perform similarity search, you need a dataset and a tag containing the query images.

We can then use the following configuration to find similar images in the input dataset. This example selects the 100 images from the input dataset that are most similar to the images in the tag of the query dataset.

{
    "n_samples": 100, # put your number here
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS",
                "dataset_id": "DATASET_ID_OF_THE_QUERY_IMAGES", 
                "tag_name": "TAG_NAME_OF_THE_QUERY_IMAGES" # e.g. "initial-tag"
            },
            "strategy": {
                "type": "SIMILARITY"
            }
        }
    ]
}

Object Diversity

To select images with diverse objects on them you can use a diversity strategy with object embeddings:

{
    "n_samples": 100, # put your number here
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS",
                "task": "my_object_detections" # or "lightly_pretagging"
            },
            "strategy": {
                "type": "DIVERSITY"
            }
        }
    ]
}

📘

This is different from the Crop Selection workflow, which selects n_samples object crops. Instead, the strategy above selects n_samples images, each chosen such that the objects in it are as different as possible from the objects in the other selected images.

Application of Strategies

Generally, the order in which the different strategies are defined in the config does not matter: first, all thresholding strategies are applied; then, all other strategies are applied in parallel.

📘

Different tasks can also be combined. E.g. you can use predictions from "my_weather_classification_task" for one strategy combined with predictions from "my_object_detection_task" from another strategy.

The Lightly optimizer tries to fulfill all strategies as well as possible.

Potential reasons why your objectives were not satisfied:

  • Tradeoff between different objectives. The optimizer always has to trade off between different objectives. For example, it may happen that all samples with high WEIGHTS are close together; if you also specified the DIVERSITY objective, only a few of these high-weight samples may be chosen, and other samples that are more diverse but have lower weights are chosen instead.
  • Restrictions in the input dataset. This applies especially for BALANCE: For example, if there are only 10 images of ambulances in the input dataset and a total of 1000 images are selected, the output can only have a maximum of 1% ambulances. Thus a BALANCE target of having 20% ambulances cannot be fulfilled.
  • Too few samples to choose. If the selection algorithm can only choose a small number of samples, it may not be possible to fulfill the objectives. You can solve this by increasing n_samples or proportion_samples.

Selection on Object Crops

Lightly supports selection on object crops instead of full images. While embeddings are fully available, there are some limitations regarding the use of Metadata and predictions for the Scores and Predictions inputs:

  • When using the object level workflow, the object detections used to create the object crops out of the images are available and can be used for both the Scores and Predictions input. However, predictions from other tasks are NOT available at the moment.
  • Lightly metadata is generated on the fly for the object crops and can thus be used for selection. However, other metadata is on image level and NOT available at the moment.

If your use case would benefit from using image-level data for object-level selection, please reach out to us.