Set Up Your First Dataset

Datasource Access

To run the Lightly Worker, you must define an input datasource and a Lightly datasource. The input datasource provides data to the Lightly Worker, while the Lightly datasource serves as a working directory where both inputs and outputs are stored. Lightly is compatible with the most common cloud storage providers (S3, GCS, Azure) and works with local storage as well.

Create a Dataset

A new dataset can easily be created from the Python client. In the environment where the Lightly Python client is installed, run the following script:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(
    dataset_name="dataset-name",
    dataset_type=DatasetType.IMAGES  # can be DatasetType.VIDEOS when working with videos
)
my_dataset_id = client.dataset_id
print(my_dataset_id)

After creating the dataset, configure the datasource. Choose the example below that matches your storage provider:

AWS S3 (Delegated Access)

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT
)
# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY
)

AWS S3 (Access Keys)

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT
)
# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY
)

Google Cloud Storage

import json
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_gcs_config(
    resource_path="gs://bucket/input/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_read.json"))),
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_gcs_config(
    resource_path="gs://bucket/lightly/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_write.json"))),
    purpose=DatasourcePurpose.LIGHTLY,
)

Azure Blob Storage

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_azure_config(
    container_name="my-container/input/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_azure_config(
    container_name="my-container/lightly/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.LIGHTLY,
)

OBS (Object Storage Service)

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

# Configure the Input datasource.
client.set_obs_config(
    resource_path="obs://bucket/input/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_obs_config(
    resource_path="obs://bucket/lightly/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)

Local Storage

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the client to use the dataset ID created above.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.dataset_id = "MY_DATASET_ID"

client.set_local_config(
    relative_path="",  # Relative path in the input mount folder
    purpose=DatasourcePurpose.INPUT,
)
client.set_local_config(
    relative_path="",  # Relative path in the lightly mount folder
    purpose=DatasourcePurpose.LIGHTLY,
)

🚧 The credentials passed above need to provide Lightly with list and read access to the input bucket and with list, read, and write access to the Lightly bucket.
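As an illustration, for S3 the required access could be granted with an IAM policy along these lines. This is a hedged sketch, not an official Lightly policy: the bucket name, prefixes, and exact action list are assumptions you should adapt to your setup. It is expressed as a Python dict so it can be serialized with `json.dumps`:

```python
import json

# Hypothetical IAM policy sketch: list/read on the input prefix,
# list/read/write on the Lightly prefix. Adjust bucket and prefixes.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InputListRead",
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetObject"],
            "Resource": [
                "arn:aws:s3:::bucket",
                "arn:aws:s3:::bucket/input/*",
            ],
        },
        {
            "Sid": "LightlyListReadWrite",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject",
            ],
            "Resource": [
                "arn:aws:s3:::bucket",
                "arn:aws:s3:::bucket/lightly/*",
            ],
        },
    ],
}

print(json.dumps(policy, indent=2))
```

The key point is the asymmetry: the input prefix only needs list and read, while the Lightly prefix additionally needs write access.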

Verify Datasource Access

Before scheduling a run, you can verify that Lightly has the required access to the datasource with the following lines:

permissions = client.list_datasource_permissions()

# Show permission related errors.
print(f"Datasource permission errors: {permissions.get('errors')}")

# Make sure Lightly can access the datasource.
assert permissions["can_list"]
assert permissions["can_read"]
assert permissions["can_write"]
assert permissions["can_overwrite"]

⚠️ If you set up restrictive access policies for your bucket, the Lightly API might not have access to the datasource and can report missing permissions even with the correct credentials. In that case, permissions cannot be verified beforehand. Make sure to set the datasource.bypass_verify configuration option when scheduling a run with restrictive access policies.
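The option is passed as part of the worker configuration when scheduling a run. A minimal sketch, assuming the worker config is a plain dict; all other configuration keys are omitted here and the scheduling call itself is shown only as an illustration:

```python
# Sketch: a worker config that tells the Lightly Worker to skip
# datasource verification. Other keys are omitted for brevity.
worker_config = {
    "datasource": {
        "bypass_verify": True,
    },
}

# It would then be passed when scheduling a run, e.g.:
# client.schedule_compute_worker_run(worker_config=worker_config, ...)
```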

📘 Setting up your first dataset and running your first selection can also be done in a single Python script, which you can download here.