Lightly is designed so that you can incrementally build up a dataset for your project. It automatically keeps track of the representations of previously selected samples and uses this information when picking new samples, in order to maximize the quality of the final dataset. It also allows combining two different datasets into one.

For example, imagine you have a couple of raw videos. After processing them with the Lightly Worker once, you end up with a dataset of diverse frames in the Lightly Platform. Then, you add two more videos to your storage bucket. The new raw data might include unseen samples which should be added to your dataset in the Lightly Platform. You can do this by simply running the Lightly Worker again. It will automatically find the new videos, extract the frames, and compare them to the images already in the dataset on the Lightly Platform. The selection strategy takes the existing data in your dataset into account when selecting new data.
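Conceptually, the datapool behaves like the following toy sketch (plain Python, not Lightly's actual implementation; the embeddings, threshold, and would_add helper are made up for illustration): a frame from a new video is only added if its embedding is far enough from everything already in the pool.

```python
from math import dist

def would_add(candidate, pool, min_distance):
    """Toy stand-in for datapool selection: a candidate embedding is only
    worth adding if it is at least min_distance away from every embedding
    already in the pool."""
    return all(dist(candidate, p) >= min_distance for p in pool)

# Embeddings of frames selected in a previous run (2-D for illustration).
pool = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Embeddings of frames extracted from the newly added videos.
new_frames = [(0.05, 0.0), (2.0, 2.0), (1.0, 1.0)]

for frame in new_frames:
    if would_add(frame, pool, min_distance=0.5):
        pool.append(frame)

# (0.05, 0.0) is too close to the existing (0.0, 0.0) and is skipped.
print(len(pool))  # → 5
```

The real Lightly Worker does this on learned image embeddings at scale, but the principle is the same: only samples that add diversity relative to the existing pool make it into the dataset.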


Embedding view on Lightly Platform with samples from the first (gray) and second (green) iteration.

After the Lightly Worker run, you can go to the embedding view of the Lightly Platform to see the newly added samples there under a new tag. You'll see that the new samples (in green) fill some of the gaps left by the images from the first iteration (in gray). However, some gaps remain, which could be filled by adding more videos to the bucket and running the Lightly Worker again.

This workflow of iteratively growing your dataset with the Lightly Worker has the following advantages:

  • You can learn from your findings after each iteration to know which raw data you need to collect next.
  • Only your new data is processed, saving you time and compute cost.
  • You don’t need to configure anything, just run the same command again.
  • Only samples which are different to the existing ones are added to the dataset.

Example

This example covers how to:

  1. Schedule a run to process a storage bucket with three videos.
  2. Add two more videos to the same bucket.
  3. Run the Lightly Worker with the same config again to use the datapool feature.

This is the content of the storage bucket before running the Lightly Worker for the first time:

s3://bucket/input/videos/
├── campus4-c0.avi
├── passageway1-c1.avi
└── terrace1-c0.avi

Create a dataset which uses that bucket. Pick the configuration for your cloud provider. With S3 delegated access:

import json
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="pedestrian-videos-datapool",
    dataset_type=DatasetType.VIDEOS,
)

# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/videos/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY,
)
Or with S3 access keys:

import json
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="pedestrian-videos-datapool",
    dataset_type=DatasetType.VIDEOS,
)

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/videos/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)
For Google Cloud Storage:

import json
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="pedestrian-videos-datapool",
    dataset_type=DatasetType.VIDEOS,
)

# Configure the Input datasource.
client.set_gcs_config(
    resource_path="gs://bucket/input/videos/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_read.json"))),
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_gcs_config(
    resource_path="gs://bucket/lightly/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_write.json"))),
    purpose=DatasourcePurpose.LIGHTLY,
)
For Azure:

import json
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="pedestrian-videos-datapool",
    dataset_type=DatasetType.VIDEOS,
)

# Configure the Input datasource.
client.set_azure_config(
    container_name="my-container/input/videos/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_azure_config(
    container_name="my-container/lightly/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.LIGHTLY,
)

First Run

Now, run the following code to select a subset with the stopping_condition_minimum_distance: 0.1 stopping condition. In this first selection run, only frames whose embeddings are at least this minimum distance apart from each other are selected:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Let's fetch the dataset we created above.
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# Schedule the run.
client.schedule_compute_worker_run(
    worker_config={
        "enable_training": False,
    },
    selection_config={
        "n_samples": 100,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS",
                },
                "strategy": {
                    "type": "DIVERSITY",
                    "stopping_condition_minimum_distance": 0.1,
                },
            }
        ],
    },
)

🚧

Training is currently not supported when using the datapool feature. Please make sure that enable_training is set to False in the worker_config.

After running the code, make sure you have a running Lightly Worker to process the run. If not, start the Lightly Worker using the following command:

docker run --shm-size="1024m" --gpus all --rm -it \
    -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
    lightly/worker:latest \
    worker.worker_id={MY_WORKER_ID}

Add More Files to Bucket

After processing the initial set of videos, add more data to the bucket. Its contents now look like this:

s3://bucket/input/videos/
├── campus4-c0.avi
├── campus7-c0.avi
├── passageway1-c1.avi
├── terrace1-c0.avi
└── terrace1-c3.avi

Second Run

Run the same script again; it won't create a new dataset but reuse the existing one, which is looked up by its name. Increase stopping_condition_minimum_distance to 0.2 to increase the diversity of the selected frames:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Let's fetch the dataset we created above.
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# Schedule the run.
client.schedule_compute_worker_run(
    worker_config={
        "enable_training": False,
    },
    selection_config={
        "n_samples": 100,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS",
                },
                "strategy": {
                    "type": "DIVERSITY",
                    "stopping_condition_minimum_distance": 0.2,
                },
            }
        ],
    },
)
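To see why a larger stopping distance yields a more diverse (and smaller) selection, consider this toy sketch of a greedy diversity selection (plain Python; the select_diverse helper and the 1-D embeddings are illustrative, not Lightly's internals):

```python
from math import dist

def select_diverse(embeddings, min_distance):
    """Greedy sketch of a DIVERSITY-style selection: keep an embedding
    only if it is at least min_distance from everything kept so far."""
    selected = []
    for e in embeddings:
        if all(dist(e, s) >= min_distance for s in selected):
            selected.append(e)
    return selected

# Toy 1-D embeddings standing in for frame embeddings.
embeddings = [(0.0,), (0.05,), (0.12,), (0.25,), (0.31,), (0.5,)]

many = select_diverse(embeddings, min_distance=0.1)
few = select_diverse(embeddings, min_distance=0.2)

# A larger minimum distance keeps fewer, more spread-out samples.
print(len(many), len(few))  # → 4 3
```

Raising the threshold from 0.1 to 0.2 therefore tightens the diversity requirement: each newly selected frame must be further away from everything already selected, including the samples from the first run.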

The samples selected in the second run will be uploaded to the dataset under a new tag in the Lightly Platform. The selected samples from all runs will be available under the initial-tag of the dataset.

Reprocess All Data

If you want to search all data in your bucket for new samples, instead of only the newly added data, set process_all to True in the worker_config. This is useful if you want to increase the dataset size but do not yet have any new data in your bucket, or if your selection requirements changed and you updated the selection configuration. The process_all flag can be set as follows:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Let's fetch the dataset we created above.
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# Schedule the run.
client.schedule_compute_worker_run(
    worker_config={
        "datasource": {
            "process_all": True,
        },
        "enable_training": False,
    },
    selection_config={
        "n_samples": 100,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS",
                },
                "strategy": {
                    "type": "DIVERSITY",
                    "stopping_condition_minimum_distance": 0.2,
                },
            }
        ],
    },
)