Selection Strategies

LightlyOne supports several selection strategies. Each strategy processes a selection input in a different way to help you achieve your objective. When you combine multiple objectives and selection inputs, you can specify a strength for each strategy so that it influences the objective with either a positive or an inverse (negative) effect.

LightlyOne offers the following selection strategies to help you achieve your objective:

Diversity
Typicality
Weights
Threshold
Balance
Similarity

Not all strategies can be combined with every selection input. Please see input and strategy combinations for detailed information.

Diversity

Use this strategy to select samples such that they are as different as possible from each other.

Can be used with Embeddings. Samples with a high distance between their embeddings are considered to be more different from each other than samples with a low distance. The strategy is specified like this:

"strategy": {
    "type": "DIVERSITY",
    "strength": 0.6 # optional
}

If you want to preserve a minimum distance between chosen samples, you can specify it as an additional stopping condition. The selection process will stop as soon as one of the stopping criteria has been reached.

"strategy": {
    "type": "DIVERSITY",
    "stopping_condition_minimum_distance": 0.2,
    "strength": 0.6 # optional
}

Setting "stopping_condition_minimum_distance": 0.2 will remove all samples which are closer to each other than 0.2. This allows you to specify the minimum allowed distance between two images in the output dataset. If you use embeddings as input, this value should be between 0 and 2.0, as the embeddings are normalized to unit length. This is often a convenient method when working with different data sources and trying to combine them in a balanced way. If you want to use this stopping condition to stop the selection early, make sure that you allow selecting enough samples by setting n_samples or proportion_samples high enough in the selection configuration.

📘
Higher minimum distance in the embedding space results in more diverse images being selected. Increasing the minimum distance will result in fewer samples being selected.
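
The interplay between the sample budget and the minimum distance can be illustrated with a small greedy farthest-point sketch. This is not LightlyOne's actual selection algorithm: the function name select_diverse, its parameters, and the choice of the first sample as the seed are assumptions made for illustration. Embeddings are normalized to unit length, so Euclidean distances fall between 0 and 2.0.

```python
import math


def normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_diverse(embeddings, n_samples, min_distance=None):
    """Greedy farthest-point selection (illustrative, hypothetical helper).

    Stops when n_samples are selected or when the best remaining candidate
    is closer than min_distance to an already selected sample.
    """
    points = [normalize(e) for e in embeddings]
    selected = [0]  # assumption: seed with the first sample
    # Distance from every point to its nearest selected sample.
    nearest = [euclidean(p, points[0]) for p in points]
    while len(selected) < min(n_samples, len(points)):
        remaining = [i for i in range(len(points)) if i not in selected]
        candidate = max(remaining, key=lambda i: nearest[i])
        if min_distance is not None and nearest[candidate] < min_distance:
            break  # minimum-distance stopping condition reached first
        selected.append(candidate)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], euclidean(p, points[candidate]))
    return selected


embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(select_diverse(embeddings, n_samples=4, min_distance=0.2))  # -> [0, 3, 2]
```

In this toy example the second sample is almost identical to the first, so it is skipped even though the budget of four samples has not been used up.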

Typicality

📘
Typicality selection requires LightlyOne Worker version 2.10 or later.

Typicality selection helps when the machine learning model only has a few samples to learn from and/or is already struggling with the easy or common cases. The typicality strategy allows you to select samples that are representative of the distribution. It does so by selecting samples from high-density regions. The typicality of a sample is computed as its similarity to its nearest neighbors. High-typicality samples are very similar to their nearest neighbors and therefore correspond to high-density regions.

Can be used with Embeddings.
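
The idea of typicality as similarity to nearest neighbors can be sketched in a few lines. The exact formula used by LightlyOne is not given here, so this sketch assumes cosine similarity averaged over the k most similar neighbors; the function name typicality_scores and the parameter k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def typicality_scores(embeddings, k=2):
    """Mean cosine similarity of each sample to its k most similar neighbors.

    Assumption: this is one plausible density proxy, not the official formula.
    Requires at least two non-zero embeddings.
    """
    scores = []
    for i, embedding in enumerate(embeddings):
        similarities = sorted(
            (
                cosine_similarity(embedding, other)
                for j, other in enumerate(embeddings)
                if j != i
            ),
            reverse=True,
        )
        top_k = similarities[:k]
        scores.append(sum(top_k) / len(top_k))
    return scores


# Three samples in a dense cluster and one isolated sample.
embeddings = [[1.0, 0.0], [0.98, 0.2], [0.95, 0.3], [-1.0, 0.1]]
scores = typicality_scores(embeddings, k=2)
print(scores)
```

The isolated sample receives the lowest score, which is why a typicality-only selection tends to stay inside dense clusters.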

🚧
Always combine typicality with diversity rather than using it as a standalone selection method, because typicality alone could result in data being selected only from a single cluster. Diversity selects the most different samples possible, thus ensuring you consider edge cases in your selection. Typicality selects the samples that are the most representative of your data. By adjusting the strengths of the two strategies, you can calibrate the selection between edge cases and representative samples.
Furthermore, we strongly discourage using typicality for datasets with more than 100,000 input samples. For large datasets, typicality not only fails to improve the selection but also leads to long worker runtimes.

The simplest way to set the typicality strategy combined with diversity is the following:

strategies = [
    {
        "input": {"type": "EMBEDDINGS"},
        "strategy": {"type": "TYPICALITY", "strength": 1.0},
    },
    {
        "input": {"type": "EMBEDDINGS"},
        "strategy": {"type": "DIVERSITY", "strength": 1.0},
    },
]

Weights

The objective of this strategy is to select samples that have a high numerical value.

Can be used with Scores, numerical Metadata and Random inputs. It can be specified with:

"strategy": {
    "type": "WEIGHTS",
    "strength": 0.6 # optional
}

Threshold

The objective of this strategy is to only select samples that have a numerical value fulfilling a threshold criterion. E.g. they should be bigger or smaller than a certain value.

Can be used with Scores and numerical Metadata inputs. It is specified as follows:

"strategy": {
    "type": "THRESHOLD",
    "threshold": 20,
    "operation": "BIGGER_EQUAL"
}

This will keep all samples whose value (specified by the input) is >= 20 and remove all others. The allowed operations are SMALLER, SMALLER_EQUAL, BIGGER, BIGGER_EQUAL.

Threshold does not support strength, as it is a hard filter that is applied as a first step. See Selection Algorithm for more information.
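
The hard-filter behavior can be sketched as follows. The operation names are the four listed above; the mapping to Python comparison operators and the helper name apply_threshold are assumptions for illustration, not part of the LightlyOne API.

```python
import operator

# Assumed mapping from the documented operation names to Python comparisons.
OPERATIONS = {
    "SMALLER": operator.lt,
    "SMALLER_EQUAL": operator.le,
    "BIGGER": operator.gt,
    "BIGGER_EQUAL": operator.ge,
}


def apply_threshold(values, strategy):
    """Return the indices of the samples that pass the threshold filter."""
    compare = OPERATIONS[strategy["operation"]]
    threshold = strategy["threshold"]
    return [i for i, value in enumerate(values) if compare(value, threshold)]


strategy = {"type": "THRESHOLD", "threshold": 20, "operation": "BIGGER_EQUAL"}
values = [5, 20, 35]  # e.g. one score or metadata value per sample
print(apply_threshold(values, strategy))  # -> [1, 2]
```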

Balance

It can be used with Predictions and categorical string Metadata. Categorical int and categorical boolean metadata cannot be used for selection at the moment.

This strategy aims to select samples so that the distribution of classes or metadata of the selected set satisfies certain properties. These properties can be defined by choosing an appropriate value for the distribution field of the configuration of the strategy.

Three different values for distribution are supported:

TARGET: In this case, the target distribution of classes or metadata of the selected set is explicitly defined. For example, if one would like to specify the distribution of classes of the selected set to be 40% ambulances and 60% buses, the balance strategy should be configured as shown below:
```
"strategy": {  
    "type": "BALANCE",  
    "distribution": "TARGET",  
    "target": {  
        "Ambulance": 0.4, # `Ambulance` should be a valid class in your `schema.json`  
        "Bus": 0.6  
    },  
    "strength": 0.6 # optional  
}
```
If the values of the target distribution do not sum up to 1, the remainder is assumed to be the target for the other classes. For example, if we set the target to be 20% ambulance and 40% bus, there is the implicit assumption that the remaining 40% should come from any other class (e.g. cars, bicycles, or pedestrians).
UNIFORM: In this case, the distribution of the classes or metadata of the selected set is chosen to be a uniform distribution. Choose this configuration when you want all classes to appear with equal probability in the selected set. For example, suppose one is dealing with four classes (e.g. car, ambulance, bus, and pedestrians) and would like the selected set to be composed of 25% cars, 25% ambulances, 25% buses, and 25% pedestrians. In that case, the balance strategy should be configured as follows:
```
"strategy": {  
    "type": "BALANCE",  
    "distribution": "UNIFORM"  
}
```
INPUT: In this case, the distribution of the classes or metadata of the selected set is chosen to be the same as that of the input set. For example, if there are 50% cars, 30% bicycles, and 20% pedestrians in the input set, and if one would like to maintain these proportions in the selected set, the balance strategy should be configured as follows:
```
"strategy": {  
    "type": "BALANCE",  
    "distribution": "INPUT"  
}
```
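
The three distribution values can be compared by computing the target proportions each one implies. The sketch below is only an illustration under stated assumptions: the helper resolve_target_distribution and the placeholder key for the unspecified remainder are hypothetical, and UNIFORM is assumed to range over the classes that occur in the input.

```python
from collections import Counter

OTHER = "<other classes>"  # hypothetical placeholder for the unspecified remainder


def resolve_target_distribution(strategy, input_labels):
    """Return the class proportions a BALANCE strategy aims for (illustrative)."""
    counts = Counter(input_labels)
    # Without a distribution key, the strategy behaves like TARGET.
    distribution = strategy.get("distribution", "TARGET")
    if distribution == "UNIFORM":
        return {label: 1.0 / len(counts) for label in counts}
    if distribution == "INPUT":
        total = sum(counts.values())
        return {label: count / total for label, count in counts.items()}
    if distribution == "TARGET":
        target = dict(strategy["target"])
        remainder = 1.0 - sum(target.values())
        if remainder > 1e-9:
            # The remainder is left to any class not named in the target.
            target[OTHER] = remainder
        return target
    raise ValueError(f"Unknown distribution: {distribution}")


labels = ["car"] * 5 + ["bicycle"] * 3 + ["pedestrian"] * 2
print(resolve_target_distribution({"type": "BALANCE", "distribution": "INPUT"}, labels))
print(resolve_target_distribution({"type": "BALANCE", "distribution": "UNIFORM"}, labels))
print(
    resolve_target_distribution(
        {"type": "BALANCE", "distribution": "TARGET", "target": {"Ambulance": 0.2, "Bus": 0.4}},
        labels,
    )
)
```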

📘
The UNIFORM and INPUT options for the distribution of the balance strategy require LightlyOne Worker version 2.12 or later and Lightly Python Client version 1.5.4 or later.
Prior to LightlyOne Worker version 2.12, the BALANCE selection strategy always behaved like distribution=TARGET. If you are using a LightlyOne Worker version prior to 2.12, please configure the BALANCE strategy without specifying the distribution key, as shown below:
"strategy": {  
    "type": "BALANCE",  
    "target": {  
        "Ambulance": 0.4, # `Ambulance` should be a valid class in your `schema.json`  
        "Bus": 0.6  
    },  
    "strength": 0.6 # optional  
}

Similarity

With this strategy, you can use the input embeddings from another dataset to select similar images. This can be useful if you are looking for more examples of certain edge cases.

Can be used with Embeddings.

"strategy": {
    "type": "SIMILARITY",
    "strength": 0.6 # optional
}
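
Conceptually, similarity selection favors candidates whose embeddings are close to the embeddings of the reference dataset. The following sketch ranks candidates by their highest cosine similarity to any reference embedding. How LightlyOne scores similarity internally is not described here, so the scoring rule and the helper name rank_by_similarity are assumptions.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(candidates, references):
    """Order candidate indices from most to least similar to the references.

    Assumption: a candidate's score is its best match among the references.
    """
    scores = [
        max(cosine_similarity(candidate, reference) for reference in references)
        for candidate in candidates
    ]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


references = [[0.0, 1.0]]  # embeddings of known edge cases
candidates = [[1.0, 0.0], [0.1, 0.9], [0.7, 0.7]]
print(rank_by_similarity(candidates, references))  # -> [1, 2, 0]
```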