Customize a Selection

Lightly allows you to specify the subset to be selected based on several objectives.

For example, you can specify that the images in the subset should be visually diverse, be ones the model struggles with (active learning), be sharp, or follow a certain class distribution, e.g. 50% from sunny, 30% from cloudy and 20% from rainy weather.

Each of these objectives is defined by a pair of settings, the input and the strategy:

  • The input defines which data the objective is defined on. This data is either a scalar number or a vector for each sample in the dataset.
  • The strategy defines the objective to apply on the input data.

Lightly allows you to specify several objectives at the same time. The algorithms try to fulfill all objectives simultaneously.
For details on how the different selection strategies are combined, see Selection Strategy Combination.

Lightly's data selection algorithms support several types of input, which are described under Selection Input below.

Prerequisites

In order to use the selection feature, you need to:

  • Start the Lightly Worker in worker mode.
  • Set up a dataset in the Lightly Platform with cloud storage as datasource. See Create a Dataset.

Scheduling a Run

For scheduling a Lightly Worker run with a custom selection, you can use the Python Lightly Framework and its schedule_compute_worker_run method. You specify the selection with the selection_config argument. See Run Your First Selection for reference.

Here is an example of scheduling a Lightly Worker run with a selection configuration:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

# Schedule the compute run using a custom config.
# You can edit the values according to your needs.
scheduled_run_id = client.schedule_compute_worker_run(
    selection_config={
        "n_samples": 50,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS"
                },
                "strategy": {
                    "type": "DIVERSITY"
                }
            }
        ]
    },
)

Selection Configuration

The configuration of a selection needs to specify both the maximum number of samples to select and the strategies:

{
    "n_samples": 50,  # alternatively "proportion_samples": 0.1, but never both
    "strategies": [
        {
            "input": {
                "type": ...
            },
            "strategy": {
                "type": ...
            }
        },
        ... more strategies
    ]
}

The variable n_samples must be a positive integer specifying the absolute number of samples to select. Instead of n_samples, you can set proportion_samples to specify the number of samples relative to the input dataset size, e.g. 0.1 to select 10% of all samples. Set exactly one of the two. Setting both or neither will cause an error.
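
The either-or rule can be checked locally before a run is scheduled. The following is a minimal sketch; `check_sample_budget` is a hypothetical helper written for this illustration and is not part of the Lightly API.

```python
def check_sample_budget(selection_config):
    """Return the budget key in use, or raise if the config sets both or neither.

    Hypothetical helper for illustration; not part of the Lightly API.
    """
    has_absolute = "n_samples" in selection_config
    has_relative = "proportion_samples" in selection_config
    # Exactly one of the two keys must be present.
    if has_absolute == has_relative:
        raise ValueError("Set exactly one of n_samples and proportion_samples.")
    if has_absolute:
        n_samples = selection_config["n_samples"]
        # n_samples must be a positive integer (bool is excluded explicitly).
        if isinstance(n_samples, bool) or not isinstance(n_samples, int) or n_samples <= 0:
            raise ValueError("n_samples must be a positive integer.")
        return "n_samples"
    return "proportion_samples"
```

Calling it on the configuration dictionary before passing the dictionary to schedule_compute_worker_run surfaces the mistake early instead of at scheduling time.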

Each strategy is specified by a dictionary, which is always made up of an input and the actual strategy.

{
    "input": {
        "type": ...
    },
    "strategy": {
        "type": ...
    }
},

Selection Input

The input can be one of the following:

Embeddings

The Lightly Framework for self-supervised learning is used to compute the embeddings. An embedding is a vector of numbers for each sample.

You can access embeddings as input using:

"input": {
  "type": "EMBEDDINGS"
}

You can also use embeddings from other datasets for strategies such as similarity search:

"input": {
    "type": "EMBEDDINGS",
    "dataset_id": "DATASET_ID_OF_THE_QUERY_IMAGES",
    "tag_name": "TAG_NAME_OF_THE_QUERY_IMAGES" # e.g. "initial-tag"
},

Or use object embeddings from an object or keypoint detection task to select images with diverse objects:

"input": {
    "type": "EMBEDDINGS",
    "task": "my_detection_task",   # or "lightly_pretagging"
},

Scores

Active learning scores estimate for each sample how much adding that sample to the training set would improve the model. Since the label of each sample is unknown, the improvement cannot be calculated directly. Instead, a proxy score is used for each sample based on the prediction of the currently trained model. For example, samples where the model is uncertain offer the greatest learning potential for the model, as captured by high uncertainty active learning scores.
For more details on active learning scores and a list of all scores supported by Lightly, see Scorers for reference.

To use scores as input, specify the prediction task and the score keys:

# using your own predictions
"input": {
    "type": "SCORES",
    "task": "YOUR_TASK_NAME",
    "score": "uncertainty_entropy"
}

# using the lightly pretagging model
"input": {
    "type": "SCORES",
    "task": "lightly_pretagging",
    "score": "uncertainty_entropy"
}

You can use any of the tasks defined in your datasource; see Work with Predictions for reference. Alternatively, set the task to lightly_pretagging to use object detections created by the Lightly Worker itself. See Lightly Pretagging for reference.

We generally recommend using the uncertainty_entropy score.

Predictions

The class distribution probability vector of predictions can be used as well. Here, two cases have to be distinguished:

  • Image Classification: The probability vector of each sample's prediction is used directly.
  • Object Detection: The probability vectors of the class predictions of all objects in an image are summed up.

This input is specified using the prediction task. Remember the class names, as they are needed in later steps.

If you use your own predictions, the task name and class names are taken from the specification in the prediction schema.json.

Alternatively, set the task to lightly_pretagging to use object detections created by the Lightly Worker itself. Its class names are specified here: Lightly Pretagging.

# using your own predictions
"input": {
    "type": "PREDICTIONS",
    "task": "my_object_detection_task",
    "name": "CLASS_DISTRIBUTION"
}

# using the lightly pretagging model
"input": {
    "type": "PREDICTIONS",
    "task": "lightly_pretagging",
    "name": "CLASS_DISTRIBUTION"
}
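
For the object detection case described above, the per-image input can be pictured as the element-wise sum of the class probability vectors of all detected objects. The sketch below illustrates only that summation; the function name is hypothetical, and whether the Lightly Worker normalizes the result afterwards is not stated here.

```python
def class_distribution(object_probabilities):
    """Sum the class probability vectors of all objects detected in one image.

    Illustrative sketch only; not Lightly's implementation.
    """
    if not object_probabilities:
        return []
    totals = [0.0] * len(object_probabilities[0])
    for probabilities in object_probabilities:
        for class_index, probability in enumerate(probabilities):
            totals[class_index] += probability
    return totals


# Two detected objects, three classes.
image_input = class_distribution([[0.5, 0.5, 0.0], [0.25, 0.25, 0.5]])
```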

Alternatively, the number of detected objects in certain categories can be used as input by setting the input name to CATEGORY_COUNT. The following example takes the number of cars and people detected in an image as input.

"input": {
    "type": "PREDICTIONS",
    "task": "my_object_detection_task",
    "name": "CATEGORY_COUNT",
    "categories": ["car", "person"]
}

Categories of interest should be specified in categories as a list of category names. Category names are case sensitive and should be defined in the prediction schema.json. Note that categories must be and can only be specified when CATEGORY_COUNT is in use.

The final input value of a prediction is an integer: the total number of detected objects that belong to the categories specified in categories. For instance, with the input defined above, if a prediction contains two cars, one person, and one dog, the final value of this prediction is three. CATEGORY_COUNT can then be paired with selection strategies such as weights and threshold.
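
The counting rule from the example above can be written out in a few lines. This is an illustrative sketch with a hypothetical function name, not the Lightly Worker's code; it takes the category names of the detected objects in one prediction and counts the ones listed in categories.

```python
def category_count(predicted_categories, categories):
    """Count detected objects whose category is listed in `categories`.

    The comparison is case sensitive, matching the rule for category names.
    """
    wanted = set(categories)
    return sum(1 for name in predicted_categories if name in wanted)


# Two cars, one person and one dog; only cars and people are counted.
count = category_count(["car", "car", "person", "dog"], ["car", "person"])
```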

Metadata

Metadata is specified by the metadata key. It can be divided across two dimensions:

  • Custom Metadata vs. Lightly Metadata
  • Numerical vs. Categorical values

Custom Metadata can be uploaded to the datasource and accessed from there. See Metadata Format for reference. An example configuration:

"input": {
    "type": "METADATA",
    "key": "weather.temperature"
}

Use the “path” you specified when creating the metadata in the datasource as the key.

Lightly Metadata is calculated by the Lightly Worker. It is specified by prepending lightly to the key. We currently support these keys:

  • lightly.sharpness: Sharpness. Calculated as the standard deviation of values after application of a Laplacian 3x3 kernel on the image.
  • lightly.snr: Signal to noise ratio. Computed as the mean of color values divided by their standard deviation.
  • lightly.uniformRowRatio: (New in 2.7.0) Uniform row ratio. A row is considered uniform if its pixel color values differ only marginally. More precisely, we apply a Laplacian 3x3 kernel on a resized grayscale image and consider a row uniform if at least 97% of its pixels are below a threshold. The metadata value is between 0 and 1. Higher values typically indicate undesired artifacts from image decoding.
  • lightly.luminance: (New in 2.7.4) Luminance. Computed from the mean color value as perceived lightness L* in the CIELAB color space. The value ranges from 0 to 100.

An example configuration:

"input": {
    "type": "METADATA",
    "key": "lightly.sharpness"
}
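
To make the definitions of lightly.sharpness and lightly.snr more concrete, here is a rough pure-Python sketch of both formulas. It rests on several assumptions that the text above does not confirm: a 4-neighbour Laplacian kernel, a grayscale image given as nested lists, border pixels skipped, and the population standard deviation. Values computed this way are not guaranteed to match those of the Lightly Worker.

```python
import statistics

# Assumption: a common 4-neighbour 3x3 Laplacian kernel. The exact kernel
# used by the Lightly Worker is not stated in this document.
LAPLACIAN_3X3 = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def sharpness(gray):
    """Standard deviation of the Laplacian response over interior pixels."""
    height, width = len(gray), len(gray[0])
    responses = []
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            response = sum(
                LAPLACIAN_3X3[dy][dx] * gray[row + dy - 1][col + dx - 1]
                for dy in range(3)
                for dx in range(3)
            )
            responses.append(response)
    return statistics.pstdev(responses)


def signal_to_noise_ratio(values):
    """Mean of the color values divided by their standard deviation.

    Raises ZeroDivisionError for a perfectly uniform image.
    """
    return statistics.mean(values) / statistics.pstdev(values)
```

A uniform image yields a sharpness of zero, while an image with strong local contrast yields a large value, which is why a lower bound on lightly.sharpness can filter out blurry images.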

📘

Types of Metadata

Not all metadata types can be used in all selection strategies. Lightly differentiates between numerical and categorical metadata.

Numerical metadata refers to numbers (int, float), such as lightly.sharpness or weather.temperature.

Categorical metadata refers to discrete categories, for example: video.location_id or weather.description. It can be either an integer or a string.

Categorical boolean metadata cannot be used for selection at the moment.

Random

This selection input generates a random value for each sample; a seed can optionally be set. See Random Selection for how to combine it with weights to perform a random selection.

"input": {
    "type": "RANDOM",
    "random_seed": 42  # Optional
}

Selection Strategy

There are several types of selection strategies, all trying to reach different objectives:

Diversity

Use this strategy to select samples such that they are as different as possible from each other.

Can be used with Embeddings. Samples with a high distance between their embeddings are considered to be more different from each other than samples with a low distance. The strategy is specified like this:

"strategy": {
    "type": "DIVERSITY"
}

If you want to preserve a minimum distance between chosen samples, you can specify it as an additional stopping condition. The selection process will stop as soon as one of the stopping criteria has been reached.

"strategy": {
    "type": "DIVERSITY",
    "stopping_condition_minimum_distance": 0.2
}

Setting "stopping_condition_minimum_distance": 0.2 will remove all samples which are closer to each other than 0.2. This allows you to specify the minimum allowed distance between two images in the output dataset. If you use embeddings as input, this value should be between 0 and 2.0, as the embeddings are normalised to unit length. This is often a convenient method when working with different data sources and trying to combine them in a balanced way. If you want to use this stopping condition to stop the selection early, make sure that you allow selecting enough samples by setting n_samples or proportion_samples high enough in the selection configuration.

📘

Higher minimum distance in the embedding space results in more diverse images being selected. Increasing the minimum distance will result in fewer samples being selected.
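
The effect of the minimum-distance stopping condition can be illustrated with a greedy farthest-point selection. This is a simplified sketch and not Lightly's actual algorithm, which is not described here; the function name and the choice of the first sample as starting point are assumptions.

```python
import math


def select_diverse(embeddings, n_samples, minimum_distance=0.0):
    """Greedy farthest-point selection with an optional minimum-distance stop.

    Returns the indices of the selected samples. Illustrative sketch only.
    """
    if not embeddings or n_samples <= 0:
        return []
    selected = [0]  # Assumption: start from the first sample.
    # Distance from every sample to its nearest already-selected sample.
    nearest = [math.dist(e, embeddings[0]) for e in embeddings]
    while len(selected) < n_samples:
        candidate = max(range(len(embeddings)), key=nearest.__getitem__)
        # Stop when nothing is left or the best candidate is too close.
        if nearest[candidate] <= 0.0 or nearest[candidate] < minimum_distance:
            break
        selected.append(candidate)
        nearest = [
            min(old, math.dist(e, embeddings[candidate]))
            for old, e in zip(nearest, embeddings)
        ]
    return selected


# The last embedding is a near-duplicate of the first one.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.99875, 0.04998]]
with_stop = select_diverse(embeddings, n_samples=4, minimum_distance=0.2)
without_stop = select_diverse(embeddings, n_samples=4)
```

With the stopping condition, the near-duplicate is never selected even though n_samples would allow it, so fewer than n_samples samples are returned.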

Weights

The objective of this strategy is to select samples that have a high numerical value.

Can be used with Scores, numerical Metadata and Random inputs. It can be specified with:

"strategy": {
    "type": "WEIGHTS"
}

Threshold

The objective of this strategy is to only select samples that have a numerical value fulfilling a threshold criterion. E.g. they should be bigger or smaller than a certain value.

Can be used with Scores and numerical Metadata inputs. It is specified as follows:

"strategy": {
    "type": "THRESHOLD",
    "threshold": 20,
    "operation": "BIGGER_EQUAL"
}

This will keep all samples whose value (specified by the input) is >= 20 and remove all others. The allowed operations are SMALLER, SMALLER_EQUAL, BIGGER, BIGGER_EQUAL.
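
The four operations map directly onto the usual comparison operators. The sketch below shows this mapping with Python's operator module; `passes_threshold` is a hypothetical helper for illustration, not a Lightly function.

```python
import operator

# Mapping of the allowed operation names to comparison functions.
OPERATIONS = {
    "SMALLER": operator.lt,
    "SMALLER_EQUAL": operator.le,
    "BIGGER": operator.gt,
    "BIGGER_EQUAL": operator.ge,
}


def passes_threshold(value, strategy):
    """Return True if `value` fulfills the threshold criterion of `strategy`."""
    compare = OPERATIONS[strategy["operation"]]
    return compare(value, strategy["threshold"])


strategy = {"type": "THRESHOLD", "threshold": 20, "operation": "BIGGER_EQUAL"}
kept = [value for value in [5, 20, 35.5] if passes_threshold(value, strategy)]
```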

Balance

The objective of this strategy is to select samples such that the distribution of classes in them is as close to a target distribution as possible.

E.g. the samples chosen should have 50% sunny and 50% rainy weather. Or, the objects of the samples chosen should be 40% ambulance and 60% buses.

Can be used with Predictions and categorical string Metadata. Categorical int and categorical boolean metadata cannot be used for selection at the moment.

"strategy": {
    "type": "BALANCE",
    "target": {
        "Ambulance": 0.4, # `Ambulance` should be a valid class in your `schema.json`
        "Bus": 0.6
    }
}

If the values of the target do not sum up to 1, the remainder is assumed to be the target for the other classes. For example, if we set the target to 20% ambulance and 40% bus, there is the implicit assumption that the remaining 40% should come from any other class, e.g. from cars, bicycles or pedestrians.

Note that classes not specified in the target do not influence the selection process!
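
The implicit remainder described above is simple arithmetic on the target dictionary. A minimal sketch, with a hypothetical helper name:

```python
def implicit_remainder(target):
    """Share of the selection left for all classes not listed in `target`."""
    return 1.0 - sum(target.values())


# 20% ambulance and 40% bus leave roughly 40% for all other classes.
remainder = implicit_remainder({"Ambulance": 0.2, "Bus": 0.4})
```

Because of floating-point rounding, the result should be compared with a tolerance rather than with exact equality.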

Similarity

With this strategy you can use the input embeddings from another dataset to select similar images. This can be useful if you are looking for more examples of certain edge cases.

Can be used with Embeddings.

"strategy": {
    "type": "SIMILARITY",
}

Strategy Strength

Each strategy has a strength configuration parameter, which sets its relative strength compared to the other strategies. The default is 1.0. Negative strengths are allowed and invert the strategy. See below for example use cases.

This config option is available since worker version 2.8.2.

Add a tiny bit of randomness as a "tie-breaker" if multiple samples have the same objective score:

# Select images with the highest number of objects.
{
  "input": {
    "type": "SCORES",
    "task": "my_object_detection_task",
    "score": "object_frequency"
  },
  "strategy": {
    "type": "WEIGHTS"
  }
},
# Add a bit of random noise to select randomly if multiple
# images have the same number of objects.
{
  "input": {
    "type": "RANDOM",
  },
  "strategy": {
    "type": "WEIGHTS",
    "strength": 0.01,
  }
},

Enforce balancing across video metadata:

# Select the same number of frames from every video by setting a high strength.
{
  "input": {
    "type": "METADATA",
    "key": "video_name",
  },
  "strategy": {
    "type": "BALANCE",
    "target": {
      video_name: 1/len(videos) for video_name in videos
    },
    "strength": float(1e9),
  }
},
# Within the same video, select the most diverse frames by setting a low strength
# to give this strategy less importance than the balancing strategy.
{
  "input": {
    "type": "EMBEDDINGS",
  },
  "strategy": {
    "type": "DIVERSITY",
    "strength": 1.0,
  }
}
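
The balancing target in the example above is a Python dictionary comprehension, so it needs a list of video names to iterate over. The following sketch shows how such a list could be turned into a uniform target; the video names are made up for illustration.

```python
# Hypothetical video names; use the values of your own "video_name" metadata.
videos = ["video_a.mp4", "video_b.mp4", "video_c.mp4", "video_d.mp4"]

balance_strategy = {
    "input": {
        "type": "METADATA",
        "key": "video_name",
    },
    "strategy": {
        "type": "BALANCE",
        # Every video gets the same share of the selected frames.
        "target": {video_name: 1 / len(videos) for video_name in videos},
        "strength": float(1e9),
    },
}
```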

Prefer dark images by selecting samples with low luminance:

{
  "input": {
    "type": "METADATA",
    "key": "lightly.luminance",
  },
  "strategy": {
    "type": "WEIGHTS",
    "strength": -1.0,
  }
}

For numerical reasons, there are two restrictions for the strength parameter:

  • It must be in [-1e9, 1e9].
  • The ratio between the highest and lowest strength must be smaller than 1e10. E.g. if you have a strategy with a strength of 1e9, the other strategies should have a strength whose absolute value is at least 0.1.
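
Both restrictions can be checked before scheduling a run. The sketch below is a hypothetical helper, not part of the Lightly API. It assumes that the ratio is taken between absolute values, as the example with 1e9 suggests, and it rejects a strength of exactly zero because the ratio is undefined in that case; how Lightly treats zero and the exact boundary is not stated here.

```python
def check_strengths(strengths):
    """Raise ValueError if the strengths violate the two restrictions.

    Assumptions of this sketch: ratios use absolute values, and a
    strength of exactly zero is rejected.
    """
    if not strengths:
        return
    magnitudes = [abs(strength) for strength in strengths]
    if max(magnitudes) > 1e9:
        raise ValueError("Every strength must be in [-1e9, 1e9].")
    # Written as a multiplication to avoid dividing by zero.
    if max(magnitudes) >= 1e10 * min(magnitudes):
        raise ValueError("The ratio between the highest and lowest strength is too large.")


check_strengths([1e9, 1.0])   # balancing plus diversity, as in the video example
check_strengths([1.0, 0.01])  # tie-breaker example
check_strengths([-1.0])       # inverted weights
```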

Configuration Examples

Here are examples of the full configuration including the input for several objectives:

Visual Diversity

Choosing 100 samples that are visually diverse is equivalent to diversifying samples based on their embeddings:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS"
            },
            "strategy": {
                "type": "DIVERSITY"
            }
        }
    ]
}

Active Learning

Active learning is equivalent to weighting samples based on active learning scores:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "SCORES",
                "task": "my_object_detection_task", # change to your task
                "score": "uncertainty_entropy" # change to your preferred score
            },
            "strategy": {
                "type": "WEIGHTS"
            }
        }
    ]
}

📘

This works as well for Image Classification or Segmentation! Just change the input task to a classification or segmentation task.

Visual Diversity and Active Learning

For combining two strategies, just specify both of them:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS"
            },
            "strategy": {
                "type": "DIVERSITY"
            }
        },
        {
            "input": {
                "type": "SCORES",
                "task": "my_object_detection_task", # change to your task
                "score": "uncertainty_entropy" # change to your preferred score
            },
            "strategy": {
                "type": "WEIGHTS"
            }
        }
    ]
}

Metadata Thresholding

This can be used to remove, for example, blurry images by selecting only samples whose sharpness is above a threshold:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "METADATA",
                "key": "lightly.sharpness"
            },
            "strategy": {
                "type": "THRESHOLD",
                "threshold": 20,
                "operation": "BIGGER"
            }
        }
    ]
}

Another use case is to remove images with many uniform rows which can filter out images with decoding artifacts. The following configuration keeps images with less than 2.5% of uniform rows.

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "METADATA",
                "key": "lightly.uniformRowRatio"
            },
            "strategy": {
                "type": "THRESHOLD",
                "threshold": 0.025,
                "operation": "SMALLER"
            }
        }
    ]
}

Object Balancing

Use lightly pretagging to get the objects, then specify a target distribution of classes:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "PREDICTIONS",
                "task": "lightly_pretagging", # (optional) change to your task
                "name": "CLASS_DISTRIBUTION"
            },
            "strategy": {
                "type": "BALANCE",
                "target": {
                    "car": 0.1,
                    "bicycle": 0.5,
                    "bus": 0.1,
                    "motorcycle": 0.1,
                    "person": 0.1,
                    "train": 0.05,
                    "truck": 0.05
                }
            }
        }
    ]
}

📘

To use the lightly_pretagging task you need to enable it by setting pretagging to True in the worker config. See Lightly Pretagging for details.

Metadata Balancing

Let's assume you have specified metadata with the path weather.description and want your selected subset to have 20% sunny, 40% cloudy and the rest other images:

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "METADATA",
                "key": "weather.description"
            },
            "strategy": {
                "type": "BALANCE",
                "target": {
                    "sunny": 0.2,
                    "cloudy": 0.4
                }
            }
        }
    ]
}

Similarity Search

To perform similarity search you need a dataset and tag consisting of the query images.

We can then use the following configuration to find similar images from the input dataset. This example will select 100 images from the input dataset that are the most similar to the images in the tag from the query dataset.

{
    "n_samples": 100, # put your number here
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS",
                "dataset_id": "DATASET_ID_OF_THE_QUERY_IMAGES", 
                "tag_name": "TAG_NAME_OF_THE_QUERY_IMAGES" # e.g. "initial-tag"
            },
            "strategy": {
                "type": "SIMILARITY",
            }
        }
    ]
}

Object Diversity

To select images that contain diverse objects, you can use a diversity strategy with object embeddings. With this setup, the objects can be inspected in the Lightly Platform after selection.

{
    "n_samples": 100, # put your number here
    "strategies": [
        {
            "input": {
                "type": "EMBEDDINGS",
                "task": "my_object_detections", # or "lightly_pretagging"
            },
            "strategy": {
                "type": "DIVERSITY",
            }
        }
    ]
}

Random Selection

You can combine a random input with the strategy weights. As the only strategy, this chooses random samples and can be used e.g., for benchmarking. Combining it with other strategies can soften their decision boundary and lead to more inliers / common cases being chosen.

{
    "n_samples": 100, # put your number here
    "strategies": [
        {
            "input": {
                "type": "RANDOM",
                "random_seed": 42, # optional, for reproducibility
            },
            "strategy": {
                "type": "WEIGHTS",
            }
        }
    ]
}

Application of Strategies

Generally, the order in which the different strategies are defined in the config does not matter. In a first step, all the thresholding strategies are applied. In the next step, all other strategies are applied in parallel.
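
The two stages can be pictured by splitting the strategies of a configuration by type. This sketch only illustrates the grouping described above; the helper name is hypothetical, and the sketch says nothing about how the Lightly optimizer processes each group internally.

```python
def split_by_stage(selection_config):
    """Split strategies into the threshold stage and the stage of all others."""
    strategies = selection_config["strategies"]
    thresholds = [s for s in strategies if s["strategy"]["type"] == "THRESHOLD"]
    others = [s for s in strategies if s["strategy"]["type"] != "THRESHOLD"]
    return thresholds, others


config = {
    "n_samples": 100,
    "strategies": [
        {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}},
        {
            "input": {"type": "METADATA", "key": "lightly.sharpness"},
            "strategy": {"type": "THRESHOLD", "threshold": 20, "operation": "BIGGER"},
        },
    ],
}
first_stage, second_stage = split_by_stage(config)
```

Even though the threshold strategy is listed second in this configuration, it ends up in the first stage.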

📘

Different tasks can also be combined. E.g. you can use predictions from "my_weather_classification_task" for one strategy combined with predictions from "my_object_detection_task" from another strategy.

The Lightly optimizer tries to fulfill all strategies as well as possible.

Potential reasons why your objectives were not satisfied:

  • Trade-off between different objectives. The optimizer always has to trade off between different objectives. E.g. it may happen that all samples with high WEIGHTS are close together. If you also specified the objective DIVERSITY, then only a few of these high-weight samples may be chosen. Instead, other samples that are more diverse, but have lower weights, are also chosen. You can control the relative importance of the objectives with the strength parameter of each strategy.
  • Restrictions in the input dataset. This applies especially for BALANCE: For example, if there are only 10 images of ambulances in the input dataset and a total of 1000 images are selected, the output can only have a maximum of 1% ambulances. Thus a BALANCE target of having 20% ambulances cannot be fulfilled.
  • Too few samples to choose from. If the selection algorithm can only choose a small number of samples, it may not be possible to fulfill the objectives. You can solve this by increasing n_samples or proportion_samples.