Datapool
Lightly is designed so that you can build up a dataset for your project incrementally. It automatically keeps track of the representations of previously selected samples and uses this information when picking new samples in order to maximize the quality of the final dataset. It also allows you to combine two different datasets into one.
For example, imagine you have a couple of raw videos. After processing them with the Lightly Worker once, you end up with a dataset of diverse frames in the Lightly Platform. Then, you add four more videos to your storage bucket. The new raw data might contain unseen samples that should be added to your dataset in the Lightly Platform. You can do this by simply running the Lightly Worker again: it will automatically find the new videos, extract the frames, and compare them to the images already in the dataset on the Lightly Platform. The selection strategy takes the existing data in your dataset into account when selecting new data.

Figure: Embedding view on the Lightly Platform with samples from the first (gray) and second (green) iteration.
After the Lightly Worker run, you can open the embedding view in the Lightly Platform to see the newly added samples under a new tag. You'll see that the new samples (in green) fill some of the gaps left by the images from the first iteration (in gray). However, some gaps remain; they could be filled by adding more videos to the bucket and running the Lightly Worker again.
This workflow of iteratively growing your dataset with the Lightly Worker has the following advantages:
- You can learn from your findings after each iteration to know which raw data you need to collect next.
- Only your new data is processed, saving you time and compute cost.
- You don’t need to configure anything, just run the same command again.
- Only samples which differ from the existing ones are added to the dataset.
Example
This example covers how to:
- Schedule a run to process a storage bucket with three videos.
- Add two more videos to the same bucket.
- Run the Lightly Worker with the same config again to use the datapool feature.
This is the content of the storage bucket before running the Lightly Worker for the first time:
s3://bucket/input/videos/
├── campus4-c0.avi
├── passageway1-c1.avi
└── terrace1-c0.avi
Create a dataset which uses that bucket. Pick the configuration matching your cloud setup (S3 with delegated access, S3 with access keys, GCS, or Azure):
S3 Delegated Access:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="pedestrian-videos-datapool",
    dataset_type=DatasetType.VIDEOS,
)

# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/videos/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY,
)
S3:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="pedestrian-videos-datapool",
    dataset_type=DatasetType.VIDEOS,
)

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/videos/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)
GCS:

import json
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="pedestrian-videos-datapool",
    dataset_type=DatasetType.VIDEOS,
)

# Configure the Input datasource.
client.set_gcs_config(
    resource_path="gs://bucket/input/videos/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_read.json"))),
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_gcs_config(
    resource_path="gs://bucket/lightly/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_write.json"))),
    purpose=DatasourcePurpose.LIGHTLY,
)
Azure:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="pedestrian-videos-datapool",
    dataset_type=DatasetType.VIDEOS,
)

# Configure the Input datasource.
client.set_azure_config(
    container_name="my-container/input/videos/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_azure_config(
    container_name="my-container/lightly/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.LIGHTLY,
)
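
Before scheduling a run, it can help to verify that Lightly can actually access the configured datasources. A minimal sketch, assuming list_datasource_permissions is available in your version of the lightly package; the exact keys of the returned dictionary may differ:

from lightly.api import ApiWorkflowClient

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# Returns a dictionary of permission flags for the configured
# datasources, e.g. whether Lightly can read and write.
permissions = client.list_datasource_permissions()
print(permissions)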
First Run
Now, run the following code to schedule a selection run using the stopping_condition_minimum_distance stopping condition set to 0.1. In this first selection run, only frames whose embeddings are at least this minimum distance apart from each other are selected:
from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Let's fetch the dataset we created above.
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# Schedule the run.
client.schedule_compute_worker_run(
    worker_config={
        "enable_training": False,
    },
    selection_config={
        "n_samples": 100,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS",
                },
                "strategy": {
                    "type": "DIVERSITY",
                    "stopping_condition_minimum_distance": 0.1,
                },
            }
        ],
    },
)
Training is currently not supported when using the datapool feature. Please make sure that enable_training is set to False in the worker_config.
After running the code, make sure you have a running Lightly Worker to process the run. If not, start the Lightly Worker using the following command:
docker run --shm-size="1024m" --gpus all --rm -it \
    -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
    -e LIGHTLY_WORKER_ID={MY_WORKER_ID} \
    lightly/worker:latest
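
The worker id used above is obtained by registering a worker once, and the progress of a scheduled run can be monitored from Python. A minimal sketch, assuming register_compute_worker and compute_worker_run_info_generator from the lightly package; the placeholder scheduled run id stands for the value returned by schedule_compute_worker_run, and the exact fields on the status objects may vary by version:

from lightly.api import ApiWorkflowClient

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Registering a worker returns the id to use as MY_WORKER_ID above.
# This only needs to be done once per worker.
worker_id = client.register_compute_worker(name="my-datapool-worker")
print(worker_id)

# schedule_compute_worker_run() returns the id of the scheduled run.
# Capture it when scheduling; a placeholder is used here.
scheduled_run_id = "MY_SCHEDULED_RUN_ID"

# Poll status updates until the run reaches an end state.
for run_info in client.compute_worker_run_info_generator(
    scheduled_run_id=scheduled_run_id,
):
    print(f"state: {run_info.state}, message: {run_info.message}")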
Add More Files to Bucket
After processing the initial set of videos, add more data to the bucket. The bucket now looks like this:
s3://bucket/input/videos/
├── campus4-c0.avi
├── campus7-c0.avi
├── passageway1-c1.avi
├── terrace1-c0.avi
└── terrace1-c3.avi
Second Run
Run the same script again. It will not create a new dataset but fetch the existing one based on the dataset name.
The process_all flag
By default, Lightly only processes images that were added to the bucket after the previous run; for this, Lightly remembers the exact time the bucket was last processed. If you want to process all images in the bucket, set the process_all flag to True. You can find more information under Process All Data below.
from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Let's fetch the dataset we created above.
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# Schedule the run.
client.schedule_compute_worker_run(
    worker_config={
        "enable_training": False,
    },
    selection_config={
        "n_samples": 100,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS",
                },
                "strategy": {
                    "type": "DIVERSITY",
                    "stopping_condition_minimum_distance": 0.1,
                },
            }
        ],
    },
)
The samples selected in the second run will be uploaded to the dataset under a new tag in the Lightly Platform. The selected samples from all runs will be available under the initial-tag of the dataset.
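
To inspect these tags programmatically, or to export the filenames of the samples selected across all runs, the API client can be used. A minimal sketch, assuming get_all_tags and export_filenames_by_tag_name from the lightly package; the exact output format is not guaranteed:

from lightly.api import ApiWorkflowClient

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# List all tags of the dataset; each datapool run adds a new tag.
for tag in client.get_all_tags():
    print(tag.name)

# Export the filenames of all samples selected so far; the
# initial-tag contains the samples from all runs.
filenames = client.export_filenames_by_tag_name(tag_name="initial-tag")
print(filenames)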
Process All Data
If you want to search all data in your bucket for new samples instead of only newly added data, set process_all to True in the worker_config. This is useful if you want to increase the dataset size but do not yet have any new data in your bucket, or if your selection requirements changed and you updated your selection configuration. The process_all flag can be set as follows:
from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Let's fetch the dataset we created above.
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# Schedule the run.
client.schedule_compute_worker_run(
    worker_config={
        "datasource": {
            "process_all": True,
        },
        "enable_training": False,
    },
    selection_config={
        "n_samples": 100,
        "strategies": [
            {
                "input": {
                    "type": "EMBEDDINGS",
                },
                "strategy": {
                    "type": "DIVERSITY",
                    "stopping_condition_minimum_distance": 0.2,
                },
            }
        ],
    },
)
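
If you want to review which runs have already contributed to the datapool, the run history can be fetched through the API. A minimal sketch, assuming get_compute_worker_runs from the lightly package; the exact fields on the returned run objects may vary by version:

from lightly.api import ApiWorkflowClient

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name(dataset_name="pedestrian-videos-datapool")

# Each datapool iteration shows up as its own run for this dataset.
for run in client.get_compute_worker_runs(dataset_id=client.dataset_id):
    print(run.id, run.state)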