The Lightly Platform

The lightly framework itself makes it simple to use self-supervised learning and to create embeddings of your dataset. However, we can do much more than just train models and embed datasets. Once you have embeddings of an unlabeled dataset, you may still need some labels to train a model. But which samples should you pick for labeling and training?

This is exactly why we built the Lightly Data Curation Platform. The platform helps you analyze your dataset and, using various methods, pick the samples that are relevant for your task.

The video below gives you a quick tour through the platform:


Head to our tutorials to see the many use-cases of the Lightly Platform.

Basic Concepts

The Lightly Platform is built around datasets, tags, embeddings, samples and their metadata.

Learn more about the different concepts in our Glossary.

Create a Dataset from a local folder or cloud bucket

There are several ways to create a dataset on the Lightly Platform.

The basic way is to upload your local dataset, including all images or videos, to the Lightly Platform.

If you don’t have your data locally but store it at a cloud provider like AWS S3, Google Cloud Storage or Azure, you can create a dataset that directly references the images in your bucket. This keeps all images and videos in your own bucket and only streams them from there when they are needed. The advantage is that you don’t need to upload your data to Lightly and can preserve its privacy.

If you want Lightly to take care of the data handling, upload your data to our servers (European location).

For datasets stored in your cloud bucket:

There is another option for using Lightly. If you don’t want to upload any data to the cloud or to Lightly but still want to use all the features, we can stream the data from a local file server:

Custom Metadata

With the custom metadata option, you can upload any information about your images to the Lightly Platform and analyze it there. For example, in autonomous driving, companies are often interested in different weather scenarios or the number of pedestrians in an image. The Lightly Platform supports the upload of arbitrary custom metadata as long as it’s correctly formatted.


You can pass custom metadata when creating a dataset and later configure it for inspection in the web-app. Simply add the argument custom_metadata to the lightly-magic command.

lightly-magic trainer.max_epochs=0 token='YOUR_API_TOKEN' new_dataset_name='my-dataset' input_dir='/path/to/my/dataset' custom_metadata='my-custom-metadata.json'

As with images and embeddings before, it’s also possible to upload custom metadata from your Python code:

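import json
from lightly.api.api_workflow_client import ApiWorkflowClient

# Replace the token and dataset ID with your own values.
client = ApiWorkflowClient(token='123', dataset_id='xyz')

# Load the custom metadata file and upload it to the dataset.
with open('my-custom-metadata.json') as f:
    client.upload_custom_metadata(json.load(f))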


To save the custom metadata in the correct format, use the helpers format_custom_metadata and save_custom_metadata or learn more about the custom metadata format below.


Check out Dataset Identifier to see how to get the dataset identifier.


To use the custom metadata on the Lightly Platform, it must be configured first. For this, follow these steps:

  1. Go to your dataset and click on “Configurator” on the left side.

  2. Click on “Add entry” to add a new configuration.

  3. Click on “Path”. Lightly should now propose different custom metadata keys.

  4. Pick the key you are interested in, set the data type, display name, and fallback value.

  5. Click on “Save changes” at the bottom.

Done! You can now use the custom metadata in the “Explore” and “Analyze & Filter” screens.

Example of a custom metadata configuration for the key weather.temperature.
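A key such as weather.temperature reads as a dotted path into the nested metadata of a sample. The sketch below illustrates that idea together with a fallback value for samples that lack the key. The helper name get_by_path is hypothetical and this is not the platform's own implementation, only a minimal illustration of how such a path could be resolved.

```python
from typing import Any


def get_by_path(metadata: dict, path: str, fallback: Any = None) -> Any:
    """Resolve a dotted path such as 'weather.temperature' in nested metadata.

    Returns `fallback` as soon as a key along the path is missing.
    """
    current: Any = metadata
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return fallback
        current = current[key]
    return current


# Metadata of a single sample (illustrative values).
sample = {
    "number_of_pedestrians": 3,
    "weather": {"scenario": "cloudy", "temperature": 20.3},
}

print(get_by_path(sample, "weather.temperature", fallback=0.0))  # 20.3
print(get_by_path(sample, "weather.humidity", fallback=-1))      # -1
```

The fallback plays the same role as the fallback value in the configurator: it is what you get for samples whose metadata does not contain the chosen key.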


To upload the custom metadata, you need to save it to a .json file in a COCO-like format. The following things are important:

  • Information about the images is stored under the key images.

  • Each image must have a file_name and an id.

  • Custom metadata is stored under the key metadata.

  • Each custom metadata entry must have an image_id to match it with the corresponding image.

For the example of an autonomous driving company mentioned above, the custom metadata file would need to look like this:

{
    "images": [
        {
            "file_name": "image0.jpg",
            "id": 0
        },
        {
            "file_name": "image1.jpg",
            "id": 1
        }
    ],
    "metadata": [
        {
            "image_id": 0,
            "number_of_pedestrians": 3,
            "weather": {
                "scenario": "cloudy",
                "temperature": 20.3
            }
        },
        {
            "image_id": 1,
            "number_of_pedestrians": 1,
            "weather": {
                "scenario": "rainy",
                "temperature": 15.0
            }
        }
    ]
}
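Before uploading, it can help to check a file against the rules listed above. The following is a minimal sketch using only the rules stated in this section (keys images and metadata, file_name and id per image, image_id per metadata entry). The function name validate_custom_metadata is hypothetical and is not part of the lightly package.

```python
def validate_custom_metadata(data: dict) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for key in ("images", "metadata"):
        if key not in data:
            problems.append(f"missing top-level key '{key}'")

    # Every image needs a file_name and an id.
    image_ids = set()
    for i, image in enumerate(data.get("images", [])):
        for key in ("file_name", "id"):
            if key not in image:
                problems.append(f"images[{i}] is missing '{key}'")
        if "id" in image:
            image_ids.add(image["id"])

    # Every metadata entry needs an image_id that matches a known image.
    covered = set()
    for i, entry in enumerate(data.get("metadata", [])):
        if "image_id" not in entry:
            problems.append(f"metadata[{i}] is missing 'image_id'")
        elif entry["image_id"] not in image_ids:
            problems.append(
                f"metadata[{i}] refers to unknown image_id {entry['image_id']}"
            )
        else:
            covered.add(entry["image_id"])

    # Report images that have no metadata entry at all.
    for missing in sorted(image_ids - covered, key=str):
        problems.append(f"image id {missing} has no metadata entry")
    return problems


example = {
    "images": [
        {"file_name": "image0.jpg", "id": 0},
        {"file_name": "image1.jpg", "id": 1},
    ],
    "metadata": [{"image_id": 0, "number_of_pedestrians": 3}],
}
print(validate_custom_metadata(example))
```

For a real file, load it with json.load first and pass the resulting dictionary to the function.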

If you don’t have your data in COCO format yet but, for example, as a pandas dataframe, you can use a simple script to convert it to the COCO format:

import pandas as pd

from lightly.utils import save_custom_metadata

# Define the pandas dataframe
column_names = ["filename", "number_of_pedestrians", "scenario", "temperature"]
rows = [
    ["image0.jpg", 3, "cloudy", 20.3],
    ["image1.jpg", 1, "rainy", 15.0],
]
df = pd.DataFrame(rows, columns=column_names)

# create a list of pairs of (filename, metadata)
custom_metadata = []
for index, row in df.iterrows():
    filename = row.filename
    metadata = {
        "number_of_pedestrians": int(row.number_of_pedestrians),
        "weather": {
            "scenario": str(row.scenario),
            "temperature": float(row.temperature),
        },
    }
    custom_metadata.append((filename, metadata))

# save custom metadata in the correct json format
output_file = "custom_metadata.json"
save_custom_metadata(output_file, custom_metadata)
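To make the target structure more concrete, here is a standard-library sketch that turns a list of (filename, metadata) pairs into the COCO-like layout described above. It is not the implementation of save_custom_metadata. The function name build_coco_like is hypothetical, and numbering the images by their position in the list is an assumption made for illustration.

```python
import json


def build_coco_like(pairs):
    """Turn (filename, metadata) pairs into a dict with 'images' and 'metadata' keys."""
    images, entries = [], []
    for image_id, (filename, metadata) in enumerate(pairs):
        # Assumption: the position in the list serves as the image id.
        images.append({"file_name": filename, "id": image_id})
        # Each metadata entry points back to its image through image_id.
        entries.append({"image_id": image_id, **metadata})
    return {"images": images, "metadata": entries}


pairs = [
    ("image0.jpg", {"number_of_pedestrians": 3,
                    "weather": {"scenario": "cloudy", "temperature": 20.3}}),
    ("image1.jpg", {"number_of_pedestrians": 1,
                    "weather": {"scenario": "rainy", "temperature": 15.0}}),
]

coco_like = build_coco_like(pairs)
print(json.dumps(coco_like, indent=4))
```

The printed JSON has the same shape as the example file shown earlier in this section.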


Make sure that custom metadata is present for every image. The metadata does not have to include the same keys for all images, but this is strongly recommended.


Lightly supports integers, floats, strings, booleans, and even nested objects for custom metadata. Every metadata item must be a valid JSON object. Thus numpy datatypes are not supported and must be cast to float or int before saving. Otherwise there will be an error similar to TypeError: Object of type ndarray is not JSON serializable.
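One way to avoid that error is to convert all values to plain Python types before saving. The sketch below relies on the fact that numpy arrays and numpy scalars both provide a tolist() method. The helper name to_json_safe is hypothetical and not part of the lightly package, and numpy is only imported for the optional demonstration at the end.

```python
import json


def to_json_safe(value):
    """Recursively convert values into plain Python types that json can serialize."""
    if isinstance(value, dict):
        return {str(key): to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    if hasattr(value, "tolist"):
        # numpy arrays and numpy scalars both offer tolist(), which returns
        # nested lists or a native Python scalar.
        return to_json_safe(value.tolist())
    return value


try:
    import numpy as np  # optional, only needed for the demonstration below
except ImportError:
    np = None

if np is not None:
    raw = {
        "number_of_pedestrians": np.int64(3),
        "temperatures": np.array([20.3, 15.0]),
    }
    # Without the conversion, json.dumps(raw) would raise a TypeError.
    print(json.dumps(to_json_safe(raw)))
```

Applying the conversion to each metadata dictionary right before saving keeps the rest of the pipeline free to work with numpy types.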


Before you start selecting, make sure you have created a dataset and uploaded images and embeddings. See Create a Dataset.

Now, let’s get started with selecting!

Follow these steps to select the most representative images from your dataset:

  1. Choose the dataset you want to work on from the “My Datasets” section by clicking on it.

  2. Navigate to “Analyze & Filter” → “Sampling” through the menu on the left.

  3. Choose the embedding and selection strategy to use for this selection.

  4. Give your selection a name so that you can compare the different selections later.

  5. Hit “Process” to start selecting the data. Each sample is now assigned an “importance score”.

    You can create a selection once you have uploaded a dataset and an embedding. Since selecting requires more compute resources, it can take a while.

  6. Move the slider to select the number of images you want to keep and save your selection by creating a new tag.

    You can move the slider to change the number of selected samples.

Dataset Identifier

Every dataset has a unique identifier called “Dataset ID”. You can find it on the dataset overview page.

The Dataset ID is a unique identifier.

Authentication API Token

To authenticate yourself on the platform when using the pip package, we provide you with an authentication token. You can retrieve it when creating a new dataset or by clicking on your account (top right) → preferences in the web application.

With the API token you can authenticate yourself.


Keep the token to yourself and don’t share it. Anyone with the token could access your datasets!
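A common way to keep the token out of scripts and version control is to read it from an environment variable instead of hard-coding it. The sketch below shows one possible approach. The helper name load_api_token and the choice of LIGHTLY_TOKEN as the variable name are assumptions made for illustration.

```python
import os


def load_api_token(env_var: str = "LIGHTLY_TOKEN") -> str:
    """Read the API token from an environment variable instead of the source code."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message rather than sending an empty token.
        raise RuntimeError(
            f"Set the {env_var} environment variable to your API token."
        )
    return token
```

The returned value can then be passed on, for example as ApiWorkflowClient(token=load_api_token(), dataset_id='xyz').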