Cloud Storage
Datasources
Datasources are Lightly's way of accessing data in your cloud storage. They are always associated with a dataset and need to be configured with credentials from your cloud provider. Currently, Lightly integrates with the following cloud providers:
- Guide to set up AWS S3 for Lightly
- Guide to set up Azure for Lightly
- Guide to set up Google Cloud Storage for Lightly
To create a datasource, you must specify a dataset, the credentials, and a `resource_path`. The `resource_path` must point to an existing directory within your storage bucket; the directory must exist but can be empty.
Lightly requires you to configure an Input and a Lightly datasource. They are explained in detail below.
Dataset
As shown in Set Up Your First Dataset, you can easily set up a dataset from Python. The dataset stores the results from your Lightly Worker runs and provides access to the selected images.
You can choose the input type for your dataset (image or video):
For images:

```python
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(dataset_name="dataset-name", dataset_type=DatasetType.IMAGES)
dataset_id = client.dataset_id
```

For videos:

```python
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform.
client.create_dataset(dataset_name="dataset-name", dataset_type=DatasetType.VIDEOS)
dataset_id = client.dataset_id
```

To work with a dataset that already exists, pass its ID when creating the client:

```python
from lightly.api import ApiWorkflowClient

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")
```
Supported File Types
Lightly supports different file types within your cloud storage:

Images
- png
- jpg/jpeg
- bmp
- gif
- tiff

Videos
- mov
- mp4
- avi
See Video as Input for a detailed list of supported video containers and codecs.
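Before uploading, it can help to check which of your files Lightly will actually process. A minimal sketch based on the extension lists above (the helper name and file list are ours, for illustration):

```python
from pathlib import Path

# File extensions from the lists above (lowercase, without dots).
IMAGE_EXTENSIONS = {"png", "jpg", "jpeg", "bmp", "gif", "tiff"}
VIDEO_EXTENSIONS = {"mov", "mp4", "avi"}


def is_supported(filename: str) -> bool:
    """Return True if the file has an extension Lightly can process."""
    suffix = Path(filename).suffix.lstrip(".").lower()
    return suffix in IMAGE_EXTENSIONS | VIDEO_EXTENSIONS


files = ["clip.MP4", "frame.jpeg", "notes.txt", "scan.tiff"]
supported = [f for f in files if is_supported(f)]
print(supported)  # ['clip.MP4', 'frame.jpeg', 'scan.tiff']
```

Note that the comparison is case-insensitive, so `clip.MP4` passes even though the lists above are lowercase.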
Input Datasource
The Input datasource is where Lightly reads your raw input data from. Lightly requires list and read access to it. Please refer to the documentation of the cloud storage provider you use for the specific permissions needed.
You can configure your Input datasource from Python as follows:
AWS S3 (delegated access):

```python
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT,
)
```
AWS S3 (access key):

```python
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)
```
Azure:

```python
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_azure_config(
    container_name="my-container/input/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.INPUT,
)
```
Google Cloud Storage:

```python
import json

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_gcs_config(
    resource_path="gs://bucket/input/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_read.json"))),
    purpose=DatasourcePurpose.INPUT,
)
```
OBS:

```python
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_obs_config(
    resource_path="obs://bucket/input/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)
```
Input Structure
Lightly is agnostic to nested directories, so there are no requirements on the input data structure within the input datasource. However, Lightly can only access data in the path of the input datasource, so make sure all the data you want to process is in the right place.
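For illustration, a nested layout like the following works fine, as long as everything you want processed sits under the input path (all names here are hypothetical):

```
s3://bucket/input/
├── 2023-08-01/
│   ├── cam_front/frame_0001.jpg
│   └── cam_rear/frame_0001.jpg
└── 2023-08-02/
    └── drive.mp4
```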
Datatypes
Lightly currently works on images and videos. You can specify which type of input data you want to process when creating a dataset.
Lightly Datasource
The Lightly bucket serves as an interactive bucket where Lightly can read things from but also write output data to. Lightly, therefore, requires list, read, and write access to the Lightly bucket. Please refer to the documentation of the cloud storage provider you are using for the specific permissions needed. You can have separate credentials or use the same as for the Input bucket. The Lightly bucket can point to a different directory in the same or another bucket (even located at a different cloud storage provider).
Here is an overview of what the Lightly bucket is used for:
- Saving thumbnails of images for a more responsive experience in the Lightly Platform.
- Saving frames of videos if your input consists of videos.
- Providing the relevant filenames file if you want to run the Lightly Worker only on a subset of input files.
- Providing predictions for the selection process. See also Prediction Format.
- Providing metadata as additional information for the selection process. See also Metadata Format.
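As an example of the relevant filenames use case: this is typically a plain text file with one path per line, relative to the Input datasource root (the paths below are hypothetical; see the Lightly documentation on relevant filenames for the exact format and where to reference it):

```
2023-08-01/cam_front/frame_0001.jpg
2023-08-01/cam_rear/frame_0001.jpg
```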
You can configure your Lightly datasource from Python as follows:
AWS S3 (delegated access):

```python
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY,
)
```
AWS S3 (access key):

```python
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)
```
Azure:

```python
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_azure_config(
    container_name="my-container/lightly/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.LIGHTLY,
)
```
Google Cloud Storage:

```python
import json

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_gcs_config(
    resource_path="gs://bucket/lightly/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_write.json"))),
    purpose=DatasourcePurpose.LIGHTLY,
)
```
OBS:

```python
from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Lightly datasource.
client.set_obs_config(
    resource_path="obs://bucket/lightly/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)
```
Lightly Path Structure
Most files created by a Lightly Worker run are placed in the Lightly bucket under `.lightly`, e.g. `s3://bucket/lightly/.lightly`.

If you have a retention policy configured for the Lightly bucket, these files might be deleted and will thus no longer be available for visualisation in the Lightly Platform, for further analysis, or for labeling.
Image Artifacts
- Thumbnails, saved under `.lightly/thumbnails`, are used by the Lightly Platform to speed up loading times.
- Frames, saved under `.lightly/frames`, are created when running the Lightly Worker on videos. These are important to send to labeling to improve your model.
- Crops, saved under `.lightly/crops`, are created when running the Lightly Worker with object diversity. These are important to send to labeling to improve your model.
Run Artifacts
Other artifacts (`report.pdf` and `checkpoint.ckpt`) specific to a particular run and containing potentially sensitive information are placed under `.lightly/runs/<run_id>`, where the `run_id` is a string like `"64a29e5f5836c16dac7dc6e3"`. For example:

- `s3://bucket/lightly/.lightly/runs/<run_id>/report.pdf`
- `s3://bucket/lightly/.lightly/runs/<run_id>/checkpoint.ckpt`
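Putting the paths above together, the Lightly bucket ends up with a layout roughly like this (the run ID is a placeholder, and which subdirectories exist depends on your input type and run configuration):

```
s3://bucket/lightly/.lightly/
├── thumbnails/
├── frames/
├── crops/
└── runs/
    └── <run_id>/
        ├── report.pdf
        └── checkpoint.ckpt
```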
Further Artifacts
A Lightly Worker run also creates further artifacts, which are saved on the Lightly Platform rather than in your bucket. This allows us to provide support and to show enhanced information about your run in the UI. None of these files contain any sensitive information besides the filenames.
These files include:
- The `log.txt` and `memlog.txt`
- The `report.json`
- The `embeddings.csv`
- Internal and debugging-relevant information such as `sequence_information.json` and `corruptness_check_information.json`
Verify Datasource Permissions
Once you have set up the datasources, it is crucial to verify that Lightly has the proper permissions. This can be done via code or in the Lightly Platform. The Lightly Worker also performs this check when scheduling a run.
```python
permissions = client.list_datasource_permissions()

# Check Lightly access permissions.
try:
    assert permissions.can_list
    assert permissions.can_read
    assert permissions.can_write
    assert permissions.can_overwrite
except AssertionError:
    print("Datasources are missing permissions. Potential errors are:", permissions.errors)
```
Note: To use the above code snippet to verify the datasource permissions, you must use version `1.2.42` or newer of the Python package. You can upgrade to the latest version with `pip install lightly --upgrade`.
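If you are unsure whether your environment meets this requirement, a quick check along these lines can help. The helper below is ours and assumes simple dotted version strings; a plain string comparison would get cases like `1.10.0` vs. `1.2.42` wrong, hence the numeric tuple comparison:

```python
def version_at_least(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, e.g. '1.10.0' >= '1.2.42'."""
    to_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(required)


# The installed version can be read with importlib.metadata.version("lightly").
print(version_at_least("1.2.41", "1.2.42"))  # False
print(version_at_least("1.10.0", "1.2.42"))  # True
```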
Update Datasource Credentials
Certain types of credentials expire after a time or are invalidated at regular intervals. For these scenarios it can be useful to update your datasource credentials. You can use the same commands as above:
AWS S3 (delegated access):

```python
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY,
)
```
AWS S3 (access key):

```python
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)
```
Azure:

```python
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_azure_config(
    container_name="my-container/input/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_azure_config(
    container_name="my-container/lightly/",
    account_name="ACCOUNT-NAME",
    sas_token="SAS-TOKEN",
    purpose=DatasourcePurpose.LIGHTLY,
)
```
Google Cloud Storage:

```python
import json

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_gcs_config(
    resource_path="gs://bucket/input/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_read.json"))),
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_gcs_config(
    resource_path="gs://bucket/lightly/",
    project_id="PROJECT-ID",
    credentials=json.dumps(json.load(open("credentials_write.json"))),
    purpose=DatasourcePurpose.LIGHTLY,
)
```
OBS:

```python
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasourcePurpose

client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")
client.set_dataset_id_by_name("MY_DATASET_NAME")

# Configure the Input datasource.
client.set_obs_config(
    resource_path="obs://bucket/input/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)

# Configure the Lightly datasource.
client.set_obs_config(
    resource_path="obs://bucket/lightly/",
    obs_endpoint="https://obs-endpoint-of-your-cloud-provider.com",
    obs_access_key_id="OBS-ACCESS-KEY",
    obs_secret_access_key="OBS-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)
```