Set Up Your First Dataset

Datasource Access

To run the Lightly Worker, define an Input datasource and a Lightly datasource. The Input datasource provides data to the Lightly Worker, whereas the Lightly datasource serves as a working directory where both inputs and outputs are stored. Lightly is compatible with the most common cloud storage providers (S3, GCS, Azure) and also works with local storage.

Create a Dataset

You can create a new dataset with the Lightly Python client. In the environment where you installed the client, run the following script:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="dataset-name",
    dataset_type=DatasetType.IMAGES  # can be DatasetType.VIDEOS when working with videos
)
dataset_id = client.dataset_id
print(f"dataset_id: {dataset_id}")
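
Hard-coding the token in a script makes it easy to leak through version control. A minimal sketch of reading it from the environment instead. The helper `read_lightly_token` and the variable name `LIGHTLY_TOKEN` are assumptions made for this example and are not part of the script above.

```python
import os
from typing import Mapping, Optional


def read_lightly_token(
    env: Optional[Mapping[str, str]] = None,
    var_name: str = "LIGHTLY_TOKEN",  # assumed variable name for this example
) -> str:
    """Return the API token stored in the given environment mapping.

    Raises a RuntimeError with a readable message if the variable is
    missing or empty, instead of failing later during authentication.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {var_name} environment variable to your Lightly token."
        )
    return token
```

The result could then replace the placeholder string, for example `ApiWorkflowClient(token=read_lightly_token())`.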

After creating the dataset, configure the datasources. Use the snippet below that matches your storage provider.

S3 (delegated access):

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT
)
# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY
)

S3 (access keys):

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT
)
# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY
)

GCS:

import json
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"


def load_credentials(path):
    """Read a credentials file and return its content as a JSON string."""
    with open(path) as f:
        return json.dumps(json.load(f))


# Configure the Input datasource.
client.set_gcs_config(
    resource_path="gs://bucket/input/",
    project_id="PROJECT-ID",
    credentials=load_credentials("credentials_read.json"),
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_gcs_config(
    resource_path="gs://bucket/lightly/",
    project_id="PROJECT-ID",
    credentials=load_credentials("credentials_write.json"),
    purpose=DatasourcePurpose.LIGHTLY,
)

Azure:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_azure_config(
    container_name="my-container/input/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_azure_config(
    container_name="my-container/lightly/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.LIGHTLY,
)

OBS:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_obs_config(
    resource_path="obs://bucket/input/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_obs_config(
    resource_path="obs://bucket/lightly/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)

Local storage:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# This config depends on the input_mount and lightly_mount folders.
# Make sure you mounted them when starting the Lightly Worker.
# See here: https://docs.lightly.ai/docs/install-lightly#local-storage
# Configure the Input datasource.
client.set_local_config(
    purpose=DatasourcePurpose.INPUT,
    # relative_path="",  # Optional: path relative to the input_mount folder.
)
# Configure the Lightly datasource.
client.set_local_config(
    purpose=DatasourcePurpose.LIGHTLY,
    # relative_path="",  # Optional: path relative to the lightly_mount folder.
)

Warning: The credentials passed above must give Lightly list and read access to the input bucket, and list, read, and write access to the Lightly bucket.
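
For S3, these access levels could translate into an IAM policy along the following lines. This is a sketch under assumptions: the helper `build_s3_policy` is hypothetical, list, read, and write are mapped to the AWS actions `s3:ListBucket`, `s3:GetObject`, and `s3:PutObject`, and your setup may require additional actions or a prefix restriction.

```python
import json


def build_s3_policy(bucket: str, writable: bool) -> dict:
    """Build a minimal IAM policy document for one bucket.

    writable=False: list and read (Input datasource).
    writable=True: list, read, and write (Lightly datasource).
    """
    object_actions = ["s3:GetObject"]
    if writable:
        object_actions.append("s3:PutObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reading and writing are granted on the objects in it.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# "bucket" is the placeholder bucket name used in the snippets above.
input_policy = build_s3_policy("bucket", writable=False)
lightly_policy = build_s3_policy("bucket", writable=True)
print(json.dumps(lightly_policy, indent=2))
```

Keeping the two policies separate makes it possible to hand out read-only credentials for the input data.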

Verify Datasource Access

Before scheduling a run, you can verify that Lightly has the required access to the datasource with the following lines:

import json

# Query which permissions Lightly has on the configured datasources.
permissions = client.list_datasource_permissions()

# Verify your datasource credentials.
try:
    assert permissions["can_list"]
    assert permissions["can_read"]
    assert permissions["can_write"]
    assert permissions["can_overwrite"]
except AssertionError:
    print("Datasources are missing permissions. Potential errors are:")
    # The unicode_escape round trip turns escaped newlines into real line breaks.
    print(json.dumps(permissions["errors"], indent=4).encode().decode("unicode_escape"))
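
The asserts above stop at the first failing check. A small sketch that instead lists every missing permission by name. `missing_permissions` is a hypothetical helper, and the sample dictionary only mimics the keys used above; it is not a real API response.

```python
from typing import Any, List, Mapping

REQUIRED_FLAGS = ("can_list", "can_read", "can_write", "can_overwrite")


def missing_permissions(permissions: Mapping[str, Any]) -> List[str]:
    """Return the names of all required permission flags that are not set."""
    return [flag for flag in REQUIRED_FLAGS if not permissions[flag]]


# Sample data shaped like the dictionary used above (not a real API response).
sample = {
    "can_list": True,
    "can_read": True,
    "can_write": False,
    "can_overwrite": False,
    "errors": {"can_write": "example error message"},
}
print(missing_permissions(sample))  # prints ['can_write', 'can_overwrite']
```

With a real client, the same function could be called on the result of `client.list_datasource_permissions()`, assuming it supports key lookup as in the snippet above.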

Warning: If you set up restrictive access policies for your bucket, the Lightly API might not be able to access the datasource and can report missing permissions even when the credentials are correct. In that case, permissions cannot be verified beforehand. Make sure to set the datasource.bypass_verify configuration option when scheduling a run with restrictive access policies.
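
The option name suggests a nested datasource section in the worker configuration. A sketch of what that could look like, assuming the configuration is passed as a nested dictionary when scheduling the run. The helper `with_bypass_verify` is hypothetical, and the scheduling call itself is not shown.

```python
import copy
from typing import Any, Dict


def with_bypass_verify(worker_config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of worker_config with datasource.bypass_verify enabled."""
    config = copy.deepcopy(worker_config)
    config.setdefault("datasource", {})["bypass_verify"] = True
    return config


# Assumed shape: dotted option names map to nested dictionary keys.
worker_config = with_bypass_verify({})
print(worker_config)  # prints {'datasource': {'bypass_verify': True}}
```

Working on a copy leaves any existing configuration untouched, so the same base settings can be reused for runs that do not need the bypass.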

Note: You can also set up your first dataset and run your first selection with a single Python script, which you can download here.