Cloud Storage

Datasources

Datasources are LightlyOne's way of accessing data in your cloud storage. They are always associated with a dataset and need to be configured with credentials from your cloud provider. Currently, LightlyOne integrates with the following cloud providers:

  • AWS S3
  • Google Cloud Storage
  • Azure Blob Storage
  • Other S3-compatible storage (OBS)

To create a datasource, you must specify a dataset, the credentials, and a resource_path. The resource_path must point to an existing directory within your storage bucket; the directory can be empty, but it must exist.

LightlyOne requires you to configure an Input and a Lightly datasource. They are explained in detail below.

Dataset

As shown in Set Up Your First Dataset, you can easily set up a dataset from Python. The dataset stores the results from your LightlyOne Worker runs and provides access to the selected images.

You can choose the input type for your dataset (image or video):

Create an image dataset:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType

# Create the LightlyOne client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the LightlyOne Platform.
client.create_dataset(dataset_name="dataset-name", dataset_type=DatasetType.IMAGES)
dataset_id = client.dataset_id

Or create a video dataset:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType

# Create the LightlyOne client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the LightlyOne Platform.
client.create_dataset(dataset_name="dataset-name", dataset_type=DatasetType.VIDEOS)
dataset_id = client.dataset_id

To work with an existing dataset, pass its ID when creating the client:

from lightly.api import ApiWorkflowClient

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

Supported File Types

LightlyOne supports the following file types in your cloud storage:

Images

  • png
  • jpg/jpeg
  • bmp
  • gif
  • tiff

Videos

  • mov
  • mp4
  • avi

See Video as Input for a detailed list of supported video containers and codecs.
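
For a quick sanity check before uploading, a small helper like the following can filter filenames by the extensions listed above (this is an illustrative sketch, not part of the LightlyOne API):

from pathlib import Path

# Supported extensions, mirroring the lists above.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tiff"}
VIDEO_EXTENSIONS = {".mov", ".mp4", ".avi"}

def supported_files(filenames, dataset_type="images"):
    """Return only the filenames with an extension LightlyOne supports."""
    extensions = IMAGE_EXTENSIONS if dataset_type == "images" else VIDEO_EXTENSIONS
    return [f for f in filenames if Path(f).suffix.lower() in extensions]

print(supported_files(["a.jpg", "b.webp", "c.PNG"]))        # ['a.jpg', 'c.PNG']
print(supported_files(["clip.mp4", "clip.mkv"], "videos"))  # ['clip.mp4']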

Input Datasource

The Input datasource is where LightlyOne reads your raw input data from. LightlyOne requires list and read access to it. Please refer to the documentation of the cloud storage provider you use for the specific permissions needed.

You can configure your Input datasource from Python as follows:

AWS S3 (delegated access):

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT,
)

AWS S3 (access keys):

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)

Azure Blob Storage:

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_azure_config(
    container_name="my-container/input/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.INPUT,
)

Google Cloud Storage:

import json
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_gcs_config(
    resource_path="gs://bucket/input/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_read.json"))),
    purpose=DatasourcePurpose.INPUT,
)

Other S3-compatible storage (OBS):

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_obs_config(
    resource_path="obs://bucket/input/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)

📘

Input Structure

LightlyOne is agnostic to nested directories, so there are no requirements on the input data structure within the input datasource. However, LightlyOne can only access data within the resource_path of the Input datasource, so make sure all the data you want to process is located there.
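
For example, with an Input datasource configured at s3://bucket/input/, all of the following (hypothetical) files would be found and processed:

s3://bucket/input/car_01.jpg
s3://bucket/input/2023-06-01/cam_front/frame_0001.png
s3://bucket/input/2023-06-01/cam_rear/frame_0001.png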

📘

Datatypes

LightlyOne currently works on images and videos. You can specify which type of input data you want to process when creating a dataset.

Lightly Datasource

The Lightly bucket serves as a working bucket that LightlyOne both reads from and writes output data to. LightlyOne therefore requires list, read, and write access to the Lightly bucket. Please refer to the documentation of the cloud storage provider you are using for the specific permissions needed. You can use the same credentials as for the Input bucket or separate ones. The Lightly bucket can point to a different directory in the same bucket or to another bucket (even one located at a different cloud storage provider).

Here is an overview of what the Lightly bucket is used for:

  • Saving thumbnails of images for a more responsive experience in the LightlyOne Platform.
  • Saving frames of videos if your input consists of videos.
  • Providing the relevant filenames file if you want to run the LightlyOne Worker only on a subset of input files (see the example after this list).
  • Providing predictions for the selection process. See also Prediction Format.
  • Providing metadata as additional information for the selection process. See also Metadata Format.
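
As a minimal sketch of the relevant filenames file mentioned above: it is a plain text file stored in the Lightly bucket with one filename per line, relative to the Input datasource (the filenames here are hypothetical):

2023-06-01/cam_front/frame_0001.png
2023-06-01/cam_rear/frame_0001.png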

You can configure your Lightly datasource from Python as follows:

AWS S3 (delegated access):

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY,
)

AWS S3 (access keys):

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)

Azure Blob Storage:

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_azure_config(
    container_name="my-container/lightly/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.LIGHTLY,
)

Google Cloud Storage:

import json
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_gcs_config(
    resource_path="gs://bucket/lightly/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_write.json"))),
    purpose=DatasourcePurpose.LIGHTLY,
)

Other S3-compatible storage (OBS):

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_obs_config(
    resource_path="obs://bucket/lightly/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)

Lightly Path Structure

Most files created by a LightlyOne Worker run are placed in the Lightly bucket under .lightly, e.g. s3://bucket/lightly/.lightly.

Image Artifacts

  • Thumbnails, saved under .lightly/thumbnails, are used by the LightlyOne Platform to speed up loading times.
  • Frames, saved under .lightly/frames, are created when running the LightlyOne Worker on videos. These are important to send to labeling to improve your model.
  • Crops, saved under .lightly/crops, are created when running the LightlyOne Worker with object diversity. These are important to send to labeling to improve your model.

Run Artifacts

Other artifacts (report.pdf and checkpoint.ckpt) that are specific to a particular run and contain potentially sensitive information are placed under .lightly/runs/<run_id>, where run_id is a string like "64a29e5f5836c16dac7dc6e3". For example:

  • s3://bucket/lightly/.lightly/runs/<run_id>/report.pdf
  • s3://bucket/lightly/.lightly/runs/<run_id>/checkpoint.ckpt
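
Putting it together, a Lightly bucket on S3 might contain the following after a run (hypothetical run ID and filenames, for illustration):

s3://bucket/lightly/.lightly/thumbnails/car_01_thumb.jpg
s3://bucket/lightly/.lightly/frames/video_001-0001.png
s3://bucket/lightly/.lightly/crops/car_01-crop-0.jpg
s3://bucket/lightly/.lightly/runs/64a29e5f5836c16dac7dc6e3/report.pdf
s3://bucket/lightly/.lightly/runs/64a29e5f5836c16dac7dc6e3/checkpoint.ckpt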

Further Artifacts

A run with the LightlyOne Worker also creates other artifacts that are saved on the LightlyOne Platform rather than in your bucket. This allows us to provide support and to show enhanced information about your run in the UI. None of these files contain any sensitive information besides the filenames.
These files include:

  • The log.txt and memlog.txt
  • The report.json
  • The embeddings.csv
  • Internal and debugging-relevant information such as sequence_information.json and corruptness_check_information.json
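
These Platform-side artifacts can also be downloaded via the Python client; a minimal sketch, assuming a recent lightly client version and at least one completed run for the dataset:

# Fetch all runs for the current dataset and download artifacts of the
# most recent one from the LightlyOne Platform.
runs = client.get_compute_worker_runs(dataset_id=client.dataset_id)
run = runs[-1]
client.download_compute_worker_run_log(run=run, output_path="log.txt")
client.download_compute_worker_run_report_json(run=run, output_path="report.json")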

Retention Policy

LightlyOne also works on buckets with a retention or lifecycle policy defined for their objects. As the LightlyOne Worker loads data directly from the bucket, objects in the bucket must persist over the whole run duration. We recommend the following retention policies:

  • Input bucket: At least 30 days.
  • Lightly bucket: Unlimited.

There are some limitations if images that have been selected by LightlyOne are deleted from the Input bucket:

  • The images are no longer visible on the LightlyOne Platform.
  • The images can no longer be exported for labeling.
  • It is no longer possible to train a model on datasets with the missing images. We recommend disabling the retention policy of the Input bucket for use cases where a new model is trained on a dataset in every LightlyOne Worker run.

LightlyOne stores all outputs from a LightlyOne Worker run in the Lightly bucket. Because these outputs can be accessed in future runs, for example, to load a model checkpoint from a previous run, we recommend disabling the retention policy of the Lightly bucket.

Handling Expiring Inputs

📘

Handling expiring inputs requires LightlyOne Worker v2.11+ and is currently only supported for AWS S3 buckets.

For buckets with retention policies, an image or video may be removed from the Input bucket while the LightlyOne Worker is processing the data. To avoid any issues with data being removed during a run, the LightlyOne Worker has to be configured to check the expiration date of the processed data. This is done using the datasource.input_expiration options in the worker config:

client.schedule_compute_worker_run(
    worker_config={
        "datasource": {
            "input_expiration": {
                "min_days_to_expiration": 5, 
                "handling_strategy": "SKIP"  # can be "SKIP" or "ABORT"
            }
        }
    },
    selection_config={
        "n_samples": 50,
        "strategies": [
            {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}}
        ],
    },
)

The min_days_to_expiration option defines how many days an input image or video must continue existing in the Input bucket after the LightlyOne Worker run starts. It should be set to a higher value than the expected duration of the LightlyOne Worker run. For quick runs that take a couple of hours, the value should be set to 1. For longer runs, the value should be the expected number of days the run takes + 1. For example, if a run is expected to take 3 days, min_days_to_expiration should be at least 4.

Input images or videos that expire before the min_days_to_expiration period are handled according to the handling_strategy option. If the strategy is set to "SKIP", the expiring inputs are skipped and not processed by the LightlyOne Worker. If the strategy is set to "ABORT", the LightlyOne Worker run is stopped if any expiring input is detected.

Verify Datasource Permissions

Once you have set up the datasources, it is crucial to ensure that LightlyOne has the proper permissions. This can be done via code or in the LightlyOne Platform. The LightlyOne Worker also performs this check when a run is scheduled.

import json
permissions = client.list_datasource_permissions()

# Verify your datasource credentials.
try:
    assert permissions["can_list"]
    assert permissions["can_read"]
    assert permissions["can_write"]
    assert permissions["can_overwrite"]
except AssertionError as e:
    # Show permission related errors.
    print("Datasources are missing permissions. Potential errors are:")
    print(json.dumps(permissions["errors"], indent=4).encode().decode('unicode_escape'))

🚧

Verify Datasource Permissions

To use the above code snippet to verify the datasource permissions you must use version 1.2.42 or newer of the Python package. You can upgrade to the latest version using pip install lightly --upgrade.

Update Datasource Credentials

Certain types of credentials expire after some time or are invalidated at regular intervals. In these scenarios, you can update your datasource credentials using the same commands as above:

AWS S3 (delegated access):

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY,
)

AWS S3 (access keys):

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)

Azure Blob Storage:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_azure_config(
    container_name="my-container/input/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_azure_config(
    container_name="my-container/lightly/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.LIGHTLY,
)

Google Cloud Storage:

import json
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_gcs_config(
    resource_path="gs://bucket/input/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_read.json"))),
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_gcs_config(
    resource_path="gs://bucket/lightly/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_write.json"))),
    purpose=DatasourcePurpose.LIGHTLY,
)

Other S3-compatible storage (OBS):

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_obs_config(
    resource_path="obs://bucket/input/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_obs_config(
    resource_path="obs://bucket/lightly/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)