# Dataset Metrics

Dataset metrics are designed to help you understand the selected dataset better. They are the bridge between the selection algorithms offered by the Lightly Worker and the performance of your machine-learning model in production:

- Understanding the dataset metrics helps you decide which kind of dataset to select
- Optimizing the dataset metrics to fulfill your requirements can improve the performance of your model

Dataset metrics are computed with every run of the Lightly Worker. They compare the input dataset from which samples are selected to the output dataset.

Lightly computes the following dataset metrics:

- Embedding Diversity Metrics
- Embedding Coverage Metrics
- Prediction Metrics
- Active Learning Score Metrics

All the metrics can be found in `report.json` under the key `"Statistics"`. The metrics are computed whenever the necessary data is available. Thus they also show whether adding a selection strategy to optimize a given metric would make sense.

```json
{
  "Statistics": {
    "Dataset metrics": {
      "embedding": {
        "input": {
          "min_diversity": 0.006739342585206032,
          "mean_diversity": 0.12419968843460083,
          "max_diversity": 0.4394771456718445
        },
        "output": {
          "min_diversity": 0.006739342585206032,
          "mean_diversity": 0.12072598934173584,
          "max_diversity": 0.4394771456718445
        },
        "comparison": {
          "mean_distance_input_to_output": 0.05417616665363312,
          "max_distance_input_to_output": 0.41954725980758667
        }
      },
      "prediction": {
        "prediction_task_name": {
          "input": {
            "balance": 0.407775715441531,
            "distribution": [
              0.135401200697179,
              0.058195081014782775,
              0.7704150797237106,
              0.002937189335743335,
              0.0036472790652637014,
              0.006100316312697696,
              0.023303853850622943
            ]
          },
          "output": {
            "balance": 0.5428204462121771,
            "distribution": [
              0.21480637813211845,
              0.09891799544419135,
              0.633997722095672,
              0.003986332574031891,
              0.00489749430523918,
              0.010421412300683372,
              0.03297266514806378
            ]
          },
          "comparison": {
            "difference_input_to_output": 0.16336479102743945
          }
        }
      },
      "active_learning_score": {
        "prediction_task_name": {
          "uncertainty_least_confidence": {
            "input": {
              "min": 0,
              "median": 1,
              "mean": 0.9886363636363636,
              "max": 1
            },
            "output": {
              "min": 0,
              "median": 1,
              "mean": 0.9848451080900379,
              "max": 1
            },
            "comparison": {
              "distribution_similarity_mannwhitneyu": 0.08997804140183326
            }
          },
          ...
        }
      }
    }
  }
}
```
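These metrics can be read programmatically from the report. A minimal sketch, with the relevant excerpt of `report.json` inlined as a string for illustration (in practice, load the file written by the Lightly Worker run instead):

```python
import json

# Inline excerpt of a report.json for illustration; replace this with
# open("report.json") on the file produced by your run.
report_json = """
{
  "Statistics": {
    "Dataset metrics": {
      "embedding": {
        "output": {
          "min_diversity": 0.0067,
          "mean_diversity": 0.1207,
          "max_diversity": 0.4395
        }
      }
    }
  }
}
"""

report = json.loads(report_json)
data = report["Statistics"]["Dataset metrics"]

# Access the embedding diversity of the output dataset.
mean_diversity = data["embedding"]["output"]["mean_diversity"]
print(f"Mean output diversity: {mean_diversity:.4f}")
```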

## Embedding Diversity Metrics

Embeddings capture properties of images as points in a Euclidean space. The distance between the embeddings of a pair of images therefore measures the diversity of these images. When training a machine learning model on a set of images, it can help to have diverse images in it, as they help the model generalize better. Conversely, very similar images contain the same information, and training the model on one of them is enough.

Lightly measures the diversity of samples within a dataset by computing the distance from each sample to its most similar sample, i.e., the Euclidean distance to its nearest neighbor in the embedding space. The embedding diversity metrics are computed for the input and output datasets separately and independently. In the example, they range from `data["Dataset metrics"]["embedding"]["input"]["min_diversity"]` to `data["Dataset metrics"]["embedding"]["output"]["max_diversity"]`.

Lightly provides the minimum, mean and maximum of these nearest-neighbor distances. Larger minimum and mean distances in the output indicate that it contains fewer near-duplicate images.

To visualize this metric, consider the graph below, showing input samples (red) and output samples (blue) in a 2-dimensional embedding space. The green lines connect each blue sample to its nearest neighbor. The three metrics are then the mean, minimum and maximum length of the green lines. The output samples (blue points) were chosen such that the distance between them (length of green lines) is maximized.
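The nearest-neighbor distances underlying these metrics can be sketched from scratch with NumPy. The Lightly Worker computes them internally on its own embeddings; this is only an illustration of the definition:

```python
import numpy as np

def diversity_metrics(embeddings: np.ndarray) -> dict:
    """Min, mean and max of each sample's distance to its nearest neighbor."""
    # Pairwise Euclidean distances, shape (n, n).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # A sample is not its own neighbor: mask the zero diagonal.
    np.fill_diagonal(dists, np.inf)
    nearest = dists.min(axis=1)
    return {
        "min_diversity": float(nearest.min()),
        "mean_diversity": float(nearest.mean()),
        "max_diversity": float(nearest.max()),
    }

# Three 1-D embeddings at 0, 1 and 3: nearest-neighbor distances are 1, 1 and 2.
metrics = diversity_metrics(np.array([[0.0], [1.0], [3.0]]))
```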

## Embedding Coverage Metrics

The embedding coverage metrics measure how well the output set covers the samples in the input set. More precisely, the distance from each sample in the input to its closest sample in the output is computed. If these distances are low, it means that every input sample is covered by similar samples in the output. We provide the mean and maximum of these distances under `data["Dataset metrics"]["embedding"]["comparison"]["mean_distance_input_to_output"]` and `data["Dataset metrics"]["embedding"]["comparison"]["max_distance_input_to_output"]`.

The `max_distance_input_to_output` is mathematically proven to upper-bound the difference in model performance between training on the input and the output dataset under certain conditions; for the proof, see Theorem 1 of Sener & Savarese, 2017.

Thus, minimizing this metric allows you to train the model on the smaller output set, with much less labeling and training effort, while still reaching a performance similar to a model trained on the entire input dataset. The `mean_distance_input_to_output` intuitively also bounds the performance difference. It has the advantage of being a smoother metric that is less vulnerable to a single outlier.

To visualize this metric, consider the graph below, showing input samples (red) and output samples (blue) in a 2-dimensional embedding space. The black lines connect each input (red) sample to its closest output (blue) neighbor. The two metrics computed are the mean and maximum length of the black lines. Intuitively, these lengths should be minimized.

The output samples (blue points) were chosen such that their coverage of the input samples (red points) was optimized.
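The coverage computation can likewise be sketched with NumPy; again, the Lightly Worker computes this internally, and this sketch only illustrates the definition:

```python
import numpy as np

def coverage_metrics(input_emb: np.ndarray, output_emb: np.ndarray) -> dict:
    """Mean and max distance from each input sample to its closest output sample."""
    # Pairwise distances, shape (n_input, n_output).
    diff = input_emb[:, None, :] - output_emb[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    # Distance of each input sample to the output set.
    closest = dists.min(axis=1)
    return {
        "mean_distance_input_to_output": float(closest.mean()),
        "max_distance_input_to_output": float(closest.max()),
    }

# Four input samples, two of which are kept as output: the uncovered samples
# at 1.0 and 2.0 dominate the metrics.
inp = np.array([[0.0], [1.0], [2.0], [10.0]])
out = np.array([[0.0], [10.0]])
metrics = coverage_metrics(inp, out)
```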

## Prediction Metrics

The prediction metrics contain information about the categorical distribution of the predicted classes in the datasets. `data["Dataset metrics"]["prediction"]["prediction_task_name"]["input"]["distribution"]` is the distribution of the classes within the dataset according to the prediction task. As it is a categorical distribution, the values sum up to 1. `data["Dataset metrics"]["prediction"]["prediction_task_name"]["input"]["balance"]` measures how balanced this distribution is. It has a minimum of 0 if the distribution is maximally imbalanced, i.e., if all predictions are from the same class, and a maximum of 1 if all classes have the same frequency in the dataset, e.g., if each of 4 classes makes up 25% of all predictions. These metrics are computed for the input and output distribution separately.

The metric under `data["Dataset metrics"]["prediction"]["prediction_task_name"]["comparison"]["difference_input_to_output"]` measures the difference between the input and output distribution. It is computed as the Euclidean distance between the two distribution vectors.

Note that these metrics are computed for every prediction task separately.
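The difference is defined above as a Euclidean distance, and the balance values in the example are reproduced by a normalized Shannon entropy. A sketch under the assumption that balance is indeed normalized entropy (the exact formula Lightly uses is not documented here):

```python
import numpy as np

def balance(distribution: np.ndarray) -> float:
    """Balance of a categorical distribution: 0 if all predictions fall into
    one class, 1 if all classes are equally frequent. Computed as normalized
    Shannon entropy, which reproduces the example values above; the exact
    formula used by Lightly is an assumption of this sketch."""
    p = distribution[distribution > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(distribution)))

def difference(dist_in: np.ndarray, dist_out: np.ndarray) -> float:
    """Euclidean distance between the input and output class distributions."""
    return float(np.linalg.norm(dist_in - dist_out))

# Distributions from the example report above.
dist_in = np.array([0.135401200697179, 0.058195081014782775, 0.7704150797237106,
                    0.002937189335743335, 0.0036472790652637014,
                    0.006100316312697696, 0.023303853850622943])
dist_out = np.array([0.21480637813211845, 0.09891799544419135, 0.633997722095672,
                     0.003986332574031891, 0.00489749430523918,
                     0.010421412300683372, 0.03297266514806378])
print(round(balance(dist_in), 4))              # 0.4078, as in the example
print(round(difference(dist_in, dist_out), 4)) # 0.1634, as in the example
```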

## Active Learning Score Metrics

The active learning score metrics contain information about the numerical distribution of active learning scores. Like the prediction metrics, they are computed for every task separately. The metrics from `data["Dataset metrics"]["active_learning_score"]["prediction_task_name"]["active_learning_score_name"]["input"]["min"]` to `data["Dataset metrics"]["active_learning_score"]["prediction_task_name"]["active_learning_score_name"]["output"]["max"]` contain statistics about the distribution of active learning scores in the input and output datasets.

The `data["Dataset metrics"]["active_learning_score"]["prediction_task_name"]["active_learning_score_name"]["comparison"]["distribution_similarity_mannwhitneyu"]` metric measures the similarity of the numerical input and output distributions based on the Mann-Whitney U test statistic. The similarity is close to 0 if the output contains only the samples with the highest (or lowest) active learning scores. It reaches its maximum of 1 if the distribution of active learning scores in the output is similar to the one in the input.
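As a sketch of how such a similarity can be computed: the Mann-Whitney U statistic counts, over all input/output pairs, how often the input score is smaller than the output score. The normalization to [0, 1] below is an assumption of this sketch, not necessarily the exact formula Lightly applies:

```python
import numpy as np

def mannwhitneyu_similarity(scores_in: np.ndarray, scores_out: np.ndarray) -> float:
    """Similarity of two score distributions based on the Mann-Whitney U statistic.

    Close to 1 when the output scores are spread like the input scores, and
    low when the output holds only the highest (or lowest) scores. The
    normalization used here is an assumption of this sketch."""
    n, m = len(scores_in), len(scores_out)
    # U counts input/output pairs where the input score is smaller
    # (ties count as 0.5).
    u = sum(
        1.0 if a < b else 0.5 if a == b else 0.0
        for a in scores_in for b in scores_out
    )
    # U is near n*m/2 for similar distributions and near 0 or n*m for
    # one-sided selections; map this to a [0, 1] similarity.
    return 2.0 * min(u, n * m - u) / (n * m)

rng = np.random.default_rng(0)
scores = rng.random(200)
# A uniformly drawn output subset -> similarity close to 1.
sim_random = mannwhitneyu_similarity(scores, rng.choice(scores, 50, replace=False))
# An output holding only the top scores -> low similarity.
sim_top = mannwhitneyu_similarity(scores, np.sort(scores)[-50:])
```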
