What is Lightly?

A machine learning model can only be as good as the data it is trained on. Figuring out which data is best can be very time-consuming and expensive. With Lightly, you can automate data curation and process millions of images or thousands of videos every day.


Lightly fits perfectly into your existing ML pipeline. Use it to select the most valuable data for labeling.

Automatically Select Data that Matters

With Lightly, you can do Active Learning at scale. You can use inputs such as embeddings, metadata and model predictions to select the most valuable subset you want to use for labeling and model training.

By combining the three inputs, you can build your own active learning strategies that, for example:

  • find images that are potential outliers or out of distribution based on embeddings
  • balance the selected data based on locations and weather conditions provided as metadata
  • find crowded scenes where the model predictions have low confidence

Lightly supports using a combination of embeddings, metadata and predictions for efficient data selection.
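
To make the idea of combining the three inputs more concrete, here is a minimal sketch in plain Python. It is not Lightly's implementation or API: the function names, the sample dictionary layout, and the scoring rule (distance to the mean embedding plus prediction uncertainty, with greedy balancing over a `weather` field) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def outlier_scores(embeddings):
    """Distance of each embedding to the mean embedding (larger = more unusual)."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return [math.dist(e, mean) for e in embeddings]


def select_samples(samples, n):
    """Greedily pick n sample names, balancing over the 'weather' metadata field.

    Hypothetical sample layout: name, embedding, weather, confidence.
    """
    scores = outlier_scores([s["embedding"] for s in samples])
    # Combine embedding-based novelty with prediction uncertainty.
    remaining = sorted(
        zip(samples, scores),
        key=lambda pair: pair[1] + (1.0 - pair[0]["confidence"]),
        reverse=True,
    )
    picked, counts = [], Counter()
    while remaining and len(picked) < n:
        # Prefer the metadata group that is least represented so far.
        least = min(counts[s["weather"]] for s, _ in remaining)
        for i, (s, _) in enumerate(remaining):
            if counts[s["weather"]] == least:
                picked.append(s["name"])
                counts[s["weather"]] += 1
                del remaining[i]
                break
    return picked


samples = [
    {"name": "a.jpg", "embedding": [0.0, 0.0], "weather": "sunny", "confidence": 0.9},
    {"name": "b.jpg", "embedding": [0.1, 0.0], "weather": "sunny", "confidence": 0.8},
    {"name": "c.jpg", "embedding": [5.0, 5.0], "weather": "sunny", "confidence": 0.95},
    {"name": "d.jpg", "embedding": [0.0, 0.2], "weather": "rainy", "confidence": 0.3},
]
print(select_samples(samples, 2))  # ['c.jpg', 'd.jpg']
```

In this toy data, `c.jpg` is picked first because its embedding lies far from the others, and `d.jpg` is picked second because it is the only rainy image and has a low-confidence prediction.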

How does Lightly Work?

The Lightly solution is an intelligent system designed to process raw, unlabeled image data, select the most informative samples for labeling, and mitigate dataset bias. Lightly scales to big datasets with millions of images or thousands of videos. Processing these large datasets requires a special architecture.

You can use Lightly in just four simple steps:

  1. Create an account on the Lightly Platform.
  2. Create a new data curation job using our Python SDK.
  3. Spin up the Lightly Worker Docker container to process the new job.
  4. Access the curated dataset either through our easy-to-use API or on the Platform.

The Lightly Workflow

The following is a brief overview of the architecture.


A datasource provides Lightly with access to the data. Currently, Lightly supports the following types of datasources:

  • AWS Simple Storage Service (S3)
  • Google Cloud Storage (GCS)
  • Azure Blob Storage
  • Local storage (Local drives, NFS, CIFS/SMB)

Lightly accesses the data directly from your datasource and streams it from there. Your data remains secure.

Lightly Worker

The Lightly Worker is a Docker container designed to process large datasets. You host it yourself on a machine of your choice. The Lightly Worker picks up scheduled runs from a run queue, processes them, and stores the outputs back to your cloud storage.
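
The run-queue pattern described above can be sketched as follows. This is a generic illustration, not the Lightly Worker's actual code: `process_run`, `worker_loop`, the run dictionary fields, and the in-memory `storage` dict standing in for cloud storage are all hypothetical.

```python
import queue


def process_run(run, storage):
    """Placeholder processing step: keep the first n filenames in sorted order
    and write the result under a key that stands in for a storage path."""
    selected = sorted(run["filenames"])[: run["n_samples"]]
    storage[f"{run['run_id']}/selected.txt"] = "\n".join(selected)


def worker_loop(run_queue, storage):
    """Take runs from the queue until it is empty; return the processed run ids."""
    processed = []
    while True:
        try:
            run = run_queue.get_nowait()
        except queue.Empty:
            break
        process_run(run, storage)
        run_queue.task_done()
        processed.append(run["run_id"])
    return processed


run_queue = queue.Queue()
run_queue.put({"run_id": "run-1", "filenames": ["b.jpg", "a.jpg", "c.jpg"], "n_samples": 2})
storage = {}  # stands in for your cloud storage bucket
print(worker_loop(run_queue, storage))
print(storage)
```

Decoupling scheduling from processing in this way means runs can be queued at any time and are handled whenever a worker is available.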

Lightly Platform

The Lightly Platform is used for the orchestration of workflows and analytics. It keeps track of the state of your dataset, allows sharing datasets with co-workers or labeling partners, and much more. To use Lightly, you need to create an account for the Lightly Platform.

Lightly Python Client

Use the Lightly Python client to send commands to the Lightly Platform and workers. You can schedule runs directly from your Python code. This gives you complete control over the process, makes runs easy to reproduce, and lets you automate your data selection pipeline.

Next Step

See how to set up Lightly on your machine in our getting started section, or follow our quick start guide!
