Selection Strategy Combination
Selection Algorithm
The Lightly selection algorithm selects samples greedily, i.e. one sample after the other. This is the only algorithm that can scale to millions of samples. In each step, it selects the sample with the highest overall score, which is the product of the scores of all strategies. Taking the product has the advantage that the scale of each strategy is irrelevant: multiplying all scores of one strategy by a positive constant has the same effect as multiplying the overall scores by that constant, so it does not change the order of the overall scores at all.
For example, suppose you specify both the Visual Diversity and the Active Learning strategy:
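The scale argument can be checked with a few lines of Python. This is only an illustrative sketch: the score values and the helper `ranking` are made up for the demonstration and are not part of Lightly.

```python
# Hypothetical scores of three samples from two strategies (made-up values).
scores_a = [0.2, 0.9, 0.5]
scores_b = [3.0, 1.0, 2.5]


def ranking(a, b):
    """Return sample indices ordered by descending product of the two scores."""
    overall = [x * y for x, y in zip(a, b)]
    return sorted(range(len(overall)), key=lambda i: overall[i], reverse=True)


# Rescale strategy B by an arbitrary positive constant.
scaled_b = [100.0 * y for y in scores_b]

original_order = ranking(scores_a, scores_b)
rescaled_order = ranking(scores_a, scaled_b)

# Both rankings are identical, so the same sample would be picked first.
print(original_order, rescaled_order)
```

Every overall score is multiplied by the same factor, so the ranking, and therefore the selected sample, stays the same.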
The visual diversity strategy assigns each sample a visual diversity score. At the same time, the active learning strategy assigns each sample an active learning score. Consider this example with 3 samples:
Sample | Diversity score | Active Learning score | Overall score |
---|---|---|---|
sample 1 | 1.0 | 0.3 | 0.3 |
sample 2 | 0.8 | 0.8 | 0.64 |
sample 3 | 0.5 | 1.0 | 0.5 |
The sample selected by the Lightly selection algorithm is sample 2 in this case, as it has the highest overall score.
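The arithmetic of this example can be reproduced in a short sketch. The dictionary layout and key names are chosen for illustration only and do not reflect any Lightly data format.

```python
from math import prod

# Strategy scores of the three example samples.
strategy_scores = {
    "sample 1": {"diversity": 1.0, "active_learning": 0.3},
    "sample 2": {"diversity": 0.8, "active_learning": 0.8},
    "sample 3": {"diversity": 0.5, "active_learning": 1.0},
}

# Overall score = product of all strategy scores of a sample.
overall = {name: prod(scores.values()) for name, scores in strategy_scores.items()}

# The sample with the highest overall score is selected in this step.
selected = max(overall, key=overall.get)
print(selected)  # sample 2
```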
The scores of the Diversity and Balance strategies depend on the already selected samples, thus they are recomputed after every step. The scores of the Weighting and Similarity strategies are constant.
Thresholding is applied before the combined selection process starts and is therefore not part of the score combination of the other strategies.
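The interplay of a score that changes after every step and a constant score can be sketched as a small greedy loop. This is a simplified illustration, not Lightly's implementation: the function names, the toy embeddings, and the choice to give every sample a diversity factor of 1.0 before anything is selected are all assumptions. Normalization of the distances is omitted because a constant factor does not change the order.

```python
import math


def nearest_selected_distance(i, embeddings, selected):
    """Dynamic score: distance from sample i to its closest selected sample."""
    if not selected:
        # Assumption: before the first pick, all samples are equally diverse.
        return 1.0
    return min(math.dist(embeddings[i], embeddings[j]) for j in selected)


def greedy_select(embeddings, weights, n_select):
    """Pick n_select indices one by one, each time maximizing the score product."""
    selected = []
    remaining = list(range(len(embeddings)))
    while remaining and len(selected) < n_select:
        # The dynamic score is recomputed in every step; the weights stay constant.
        best = max(
            remaining,
            key=lambda i: nearest_selected_distance(i, embeddings, selected) * weights[i],
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical 2D embeddings and constant weighting scores.
embeddings = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 1.0)]
weights = [0.9, 1.0, 0.5, 0.6]
order = greedy_select(embeddings, weights, 3)
print(order)
```

In this toy run, the sample at index 0 has a high weight but is never picked among the first three, because it lies very close to the first selected sample and its diversity score is therefore small.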
Strategy Scores
The score of each strategy is computed in a different way and has a different range. In any case, each strategy score is >= 0.
Diversity
The Diversity strategy outputs the Euclidean distance between a sample and its closest already selected neighbour as the strategy score, normalized to a maximum distance of 1. Thus the scores are in the range (0, 1]. Because exact duplicates with a distance of 0 are removed, the distance is always greater than 0.
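A minimal sketch of such a score is shown below. How exactly Lightly normalizes the distances is not specified here; the sketch assumes that each distance is divided by the largest nearest-neighbour distance among the candidates. The function name and the sample coordinates are hypothetical.

```python
import math


def diversity_scores(candidates, selected):
    """Distance of each candidate to its closest selected sample, scaled to a maximum of 1.

    Assumes at least one selected sample and no exact duplicates,
    so the largest distance is greater than 0.
    """
    nearest = [min(math.dist(c, s) for s in selected) for c in candidates]
    largest = max(nearest)  # assumed normalization constant
    return [d / largest for d in nearest]


selected = [(0.0, 0.0)]
candidates = [(3.0, 4.0), (0.0, 1.0), (6.0, 8.0)]
scores = diversity_scores(candidates, selected)
print(scores)
```

The candidate farthest from all selected samples gets a score of 1, and candidates close to an already selected sample get scores near 0.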
Weights
Weighting uses its input directly as scores. Thus the range of scores depends on the input:
- The active learning scores are all in the range of [0, 1]. Samples with a missing, invalid or empty predictions file have a default active learning score of 0.0.
- Custom metadata is in the range it is defined within. E.g. if you use weather.temperature for indoor temperature in degrees Celsius, it might be in the range [10, 30]. Samples with a missing, invalid or empty custom metadata file have a default score of 0. Scores < 0 are set to 0.
- For the definition and thus the range of Lightly Metadata, look into the glossary.
- The random input is in the range of [0, 1].
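The default and clamping rules above can be summarized in a small helper. The function `weighting_score` is hypothetical, and representing a missing, invalid or empty input as `None` is an assumption made for this sketch.

```python
def weighting_score(value):
    """Map a raw input value to a weighting score (illustrative sketch)."""
    if value is None:
        # Assumption: None stands for a missing, invalid or empty input file.
        return 0.0
    # Negative values are set to 0; all other values are used directly.
    return max(0.0, float(value))


print(weighting_score(21.5))  # a temperature used directly as the score
print(weighting_score(-5))    # negative input is clamped
print(weighting_score(None))  # missing input falls back to the default
```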
Balance
The Balance scores are in the range of [0, 2]:
- A score of 0 is assigned if the sample has only an already strongly oversampled class and will thus worsen the distribution as much as possible.
- A score of 1 is assigned if the sample will neither improve nor worsen the distribution.
- A score of 2 is assigned if the sample has only a strongly undersampled class and will thus improve the distribution as much as possible.
A sample with a missing, invalid, or empty prediction file is assumed to have no effect on the distribution and thus has a score of 1.
Similarity
The Similarity strategy outputs the maximum cosine similarity from each sample to the key samples, linearly interpolated to a range of [0, 1].
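One way to picture this score is the sketch below. The exact interpolation Lightly uses is not stated here; the sketch assumes that the cosine similarity range [-1, 1] is mapped linearly to [0, 1]. The function names and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def similarity_score(sample, key_samples):
    """Maximum cosine similarity to any key sample, mapped from [-1, 1] to [0, 1]."""
    best = max(cosine_similarity(sample, key) for key in key_samples)
    return (best + 1.0) / 2.0  # assumed linear interpolation


key_samples = [(1.0, 0.0)]
print(similarity_score((2.0, 0.0), key_samples))   # same direction as the key sample
print(similarity_score((0.0, 3.0), key_samples))   # orthogonal to the key sample
print(similarity_score((-1.0, 0.0), key_samples))  # opposite direction
```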
What happens if a strategy score is 0?
If at least one strategy outputs a score of 0 for a sample, this sample will not be selected. The only exception is if all samples have at least one strategy score of 0, in which case the factor of 0 is ignored. In that case, the sample with the highest product of its other strategy scores is selected.
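The rule can be sketched as follows. The helper names are hypothetical, and giving a sample whose scores are all 0 an overall score of 0 in the fallback case is an assumption of this sketch, not documented behaviour.

```python
from math import prod


def overall_score(scores, ignore_zeros=False):
    """Product of the strategy scores, optionally skipping factors of 0."""
    factors = [s for s in scores if s != 0] if ignore_zeros else list(scores)
    # Assumption: a sample with no remaining factors gets an overall score of 0.
    return prod(factors) if factors else 0.0


def pick_next(candidates):
    """Return the name of the sample to select from {name: [strategy scores]}."""
    # The 0 factors are only ignored if every sample has at least one score of 0.
    all_have_zero = all(0 in scores for scores in candidates.values())
    return max(
        candidates,
        key=lambda name: overall_score(candidates[name], ignore_zeros=all_have_zero),
    )


# Normal case: "a" has a score of 0 and is therefore not selected.
print(pick_next({"a": [0.0, 0.9], "b": [0.2, 0.1]}))
# Exception: every sample has a 0, so the 0 factors are ignored.
print(pick_next({"a": [0.0, 0.9], "b": [0.5, 0.0]}))
```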