Selection Strategies

Lightly supports several selection strategies that act on the different selection inputs to achieve your objective. When you combine multiple strategies and inputs, you can set a strength for each strategy to control how strongly it influences the selection. A negative strength inverts the effect of the strategy.

Lightly offers the following selection strategies to help you achieve your objective: Diversity, Typicality, Weights, Threshold, Balance, and Similarity.

Not all strategies can be combined with every selection input. Please see input and strategy combinations for detailed information.

Diversity

Use this strategy to select samples such that they are as different as possible from each other.

Can be used with Embeddings. Samples with a high distance between their embeddings are considered to be more different from each other than samples with a low distance. The strategy is specified like this:

"strategy": {
    "type": "DIVERSITY",
    "strength": 0.6 # optional
}

If you want to preserve a minimum distance between chosen samples, you can specify it as an additional stopping condition. The selection process will stop as soon as one of the stopping criteria has been reached.

"strategy": {
    "type": "DIVERSITY",
    "stopping_condition_minimum_distance": 0.2,
    "strength": 0.6 # optional
}

Setting "stopping_condition_minimum_distance": 0.2 ensures that no two selected samples are closer to each other than 0.2; samples that would violate this are removed. In other words, it sets the minimum allowed distance between any two images in the output dataset. If you use embeddings as input, the value should be between 0 and 2.0 because the embeddings are normalized to unit length. This is often convenient when you combine different data sources and want to balance them. If you want this stopping condition to end the selection early, set n_samples or proportion_samples in the selection configuration high enough that the sample limit is not reached first.

Higher minimum distance in the embedding space results in more diverse images being selected. Increasing the minimum distance will result in fewer samples being selected.
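
The effect of the minimum distance can be illustrated with a small greedy sketch. This is not Lightly's actual selection algorithm: `select_with_min_distance` is a hypothetical helper, and the embeddings are made-up 2D vectors normalized to unit length. A sample is kept only if it is at least `min_distance` away from every sample already kept.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def select_with_min_distance(embeddings, min_distance, n_samples):
    """Greedily keep samples that are at least min_distance from all kept ones.

    Stops as soon as n_samples are selected or no candidates remain.
    """
    selected = []
    for idx, emb in enumerate(embeddings):
        if len(selected) >= n_samples:
            break
        if all(math.dist(emb, embeddings[j]) >= min_distance for j in selected):
            selected.append(idx)
    return selected


# Hypothetical embeddings: two near-duplicate pairs and one opposite vector.
raw = [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (-1.0, 0.0), (0.05, 0.99)]
embeddings = [normalize(v) for v in raw]

loose = select_with_min_distance(embeddings, 0.2, n_samples=5)
strict = select_with_min_distance(embeddings, 1.5, n_samples=5)
```

With the larger minimum distance, fewer samples pass the check, and the distance between two unit-length vectors never exceeds 2.0, which is why larger values are not meaningful.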

Typicality

Typicality selection requires Lightly worker version at least 2.10.

The typicality strategy selects samples that are representative of the data distribution by favoring samples from high-density regions. Typicality is computed as the similarity of a sample to its nearest neighbors: samples with high typicality are very similar to their nearest neighbors and therefore lie in high-density regions.

Can be used with Embeddings.
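
As a rough illustration of the idea, the sketch below scores each sample by its mean cosine similarity to its k most similar neighbors. The exact formula Lightly uses is not given here, so both the choice of cosine similarity and the hypothetical `typicality` helper are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def typicality(embeddings, k=2):
    """Mean cosine similarity of each sample to its k most similar neighbors."""
    scores = []
    for i, emb in enumerate(embeddings):
        sims = sorted(
            (cosine_similarity(emb, other)
             for j, other in enumerate(embeddings) if j != i),
            reverse=True,
        )
        scores.append(sum(sims[:k]) / k)
    return scores


# Hypothetical data: three samples form a tight cluster, the last is an outlier.
embeddings = [(1.0, 0.0), (0.98, 0.1), (0.97, -0.1), (0.0, 1.0)]
scores = typicality(embeddings, k=2)
most_typical = max(range(len(scores)), key=scores.__getitem__)
least_typical = min(range(len(scores)), key=scores.__getitem__)
```

The sample in the middle of the cluster receives the highest score and the outlier the lowest, which matches the high-density intuition described above.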

We suggest using typicality in combination with diversity rather than as a standalone selection method, because on its own it could select data from only a single cluster. Diversity selects samples that are as different from each other as possible, which ensures that edge cases are covered. Typicality selects the samples that are most representative of your data. By adjusting the strengths of the two strategies, you can calibrate the balance between them.

The simplest way to set the typicality strategy combined with diversity is the following:

strategies = [
    {
        "input": {"type": "EMBEDDINGS"},
        "strategy": {"type": "TYPICALITY", "strength": 1.0},
    },
    {
        "input": {"type": "EMBEDDINGS"},
        "strategy": {"type": "DIVERSITY", "strength": 1.0},
    },
]

Weights

The objective of this strategy is to select samples that have a high numerical value.

Can be used with Scores, numerical Metadata and Random inputs. It can be specified with:

"strategy": {
    "type": "WEIGHTS",
    "strength": 0.6 # optional
}
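
Used on its own, this objective amounts to preferring the samples with the highest values. A minimal sketch, assuming a hypothetical `select_by_weight` helper and made-up scores (how Lightly weighs this against other strategies is not shown):

```python
def select_by_weight(values, n_samples):
    """Return the indices of the n_samples highest values, highest first."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:n_samples]


scores = [0.1, 0.9, 0.4, 0.7]  # hypothetical per-sample scores
top_two = select_by_weight(scores, n_samples=2)
```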

Threshold

The objective of this strategy is to select only samples whose numerical value fulfills a threshold criterion, e.g. is bigger or smaller than a certain value.

Can be used with Scores and numerical Metadata inputs. It is specified as follows:

"strategy": {
    "type": "THRESHOLD",
    "threshold": 20,
    "operation": "BIGGER_EQUAL"
}

This will keep all samples whose value (specified by the input) is >= 20 and remove all others. The allowed operations are SMALLER, SMALLER_EQUAL, BIGGER, BIGGER_EQUAL.
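
The filter semantics can be sketched in a few lines of Python. The `apply_threshold` helper and the sample values are hypothetical; only the operation names and the strategy fields come from the configuration above.

```python
import operator

# Maps the allowed operation names to Python comparison functions.
OPERATIONS = {
    "SMALLER": operator.lt,
    "SMALLER_EQUAL": operator.le,
    "BIGGER": operator.gt,
    "BIGGER_EQUAL": operator.ge,
}


def apply_threshold(values, strategy):
    """Return the indices of samples whose value passes the threshold."""
    compare = OPERATIONS[strategy["operation"]]
    return [i for i, v in enumerate(values) if compare(v, strategy["threshold"])]


strategy = {"type": "THRESHOLD", "threshold": 20, "operation": "BIGGER_EQUAL"}
values = [5, 20, 35, 19.9]  # hypothetical per-sample scores or metadata values
kept = apply_threshold(values, strategy)
```

Here only the samples at indices 1 and 2 have a value of at least 20, so they are the only ones kept.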

Threshold does not support strength, as it is a hard filter that is applied as a first step. See Selection Algorithm for more information.

Balance

The objective of this strategy is to select samples such that the distribution of classes in them is as close to a target distribution as possible.

E.g. the samples chosen should have 50% sunny and 50% rainy weather. Or, the objects of the samples chosen should be 40% ambulances and 60% buses.

Can be used with Predictions and categorical string Metadata. Categorical int and categorical boolean metadata cannot be used for selection at the moment.

"strategy": {
    "type": "BALANCE",
    "target": {
        "Ambulance": 0.4, # `Ambulance` should be a valid class in your `schema.json`
        "Bus": 0.6
    },
    "strength": 0.6 # optional
}

If the values of the target do not sum up to 1, the remainder is assumed to be the target for the other classes. For example, if we set the target to 20% ambulance and 40% bus, the implicit assumption is that the remaining 40% should come from any other class, e.g. cars, bicycles, or pedestrians.

Note that classes that are not specified in the target do not influence the selection process!
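
The arithmetic behind the target can be sketched as follows. The helpers `implicit_remainder` and `distance_to_target` are hypothetical and only illustrate the idea: the remainder is whatever the listed targets leave over, and only the listed classes are compared against their target share. How Lightly measures closeness internally is not specified here.

```python
from collections import Counter


def implicit_remainder(target):
    """Share left over for classes that are not listed in the target."""
    return max(0.0, 1.0 - sum(target.values()))


def class_distribution(labels):
    """Fraction of samples per class in a non-empty list of labels."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


def distance_to_target(labels, target):
    """Sum of absolute gaps between actual and target share of each listed class."""
    actual = class_distribution(labels)
    return sum(abs(actual.get(cls, 0.0) - share) for cls, share in target.items())


target = {"Ambulance": 0.2, "Bus": 0.4}
remainder = implicit_remainder(target)

# Two hypothetical selections of ten samples each.
selection_a = ["Ambulance"] * 2 + ["Bus"] * 4 + ["Car"] * 4
selection_b = ["Ambulance"] * 5 + ["Bus"] * 5
gap_a = distance_to_target(selection_a, target)
gap_b = distance_to_target(selection_b, target)
```

The first selection matches the target exactly, with the remaining 40% coming from an unlisted class, while the second overrepresents both listed classes.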

Similarity

With this strategy, you can use the input embeddings from another dataset to select similar images. This can be useful if you are looking for more examples of certain edge cases.

Can be used with Embeddings.

"strategy": {
    "type": "SIMILARITY",
    "strength": 0.6 # optional
}
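
Conceptually, similarity search ranks candidate samples by how close their embeddings are to the embeddings of the reference dataset. The sketch below is an assumption about how such a ranking could look, not Lightly's implementation: the hypothetical `rank_by_similarity` helper scores each candidate by its highest cosine similarity to any query embedding.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def rank_by_similarity(candidates, queries):
    """Order candidate indices by their best cosine similarity to any query."""
    best = [max(cosine_similarity(c, q) for q in queries) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: best[i], reverse=True)


queries = [(0.0, 1.0)]  # hypothetical embedding of a known edge case
candidates = [(1.0, 0.0), (0.1, 0.9), (0.7, 0.7)]
ranking = rank_by_similarity(candidates, queries)
```

The candidate pointing in nearly the same direction as the query is ranked first, and the orthogonal one last.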