Create a dataset from Google Cloud Storage

Lightly allows you to configure a remote datasource such as Google Cloud Storage. In this guide, we will show you how to set up your Google Cloud Storage bucket, configure your dataset to use that bucket, and upload only metadata to Lightly while keeping your data private.

Setting up Google Cloud Storage

Lightly needs to be able to create so-called presigned URLs/read URLs for displaying your data in your browser. Thus it needs at minimum get and list permissions on your bucket. To upload thumbnails, frames and crops, create and delete permissions are also required.
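The permission requirements above can be written down as a small check. This is a minimal sketch, assuming the IAM permission names used in the role setup later in this guide (storage.objects.get and so on); `missing_permissions` is a hypothetical helper for illustration, not part of Lightly or any Google library.

```python
# Permissions needed to display data (assumed IAM names, see the role setup).
READ_PERMISSIONS = {"storage.objects.get", "storage.objects.list"}
# Additional permissions needed to upload thumbnails, frames and crops.
WRITE_PERMISSIONS = {"storage.objects.create", "storage.objects.delete"}


def missing_permissions(granted: set[str], write_outputs: bool) -> set[str]:
    """Return the permissions still missing for the chosen usage."""
    required = READ_PERMISSIONS | (WRITE_PERMISSIONS if write_outputs else set())
    return required - set(granted)


# A read-only role is enough as long as Lightly does not write outputs.
print(missing_permissions({"storage.objects.get", "storage.objects.list"}, False))
```

Calling the helper with `write_outputs=True` for the same read-only role returns the create and delete permissions that would still have to be added.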

Let us assume the bucket is called lightly-datalake and the folder you want to use with Lightly is located at projects/wild-animals/.

Setting Up a Service Account in IAM

1. Write down your project ID. You can find it in the Google Cloud console under Project Info.

2. Navigate to your bucket in the Google Cloud Storage browser and from there to projects/wild-animals/. Copy the path, in this case lightly-datalake/projects/wild-animals.

Browsing a Google Cloud Storage bucket.

3. Navigate to the tab Permissions. Make sure that your access control is uniform. If it is not, change it to uniform.

Ensuring a Google Cloud bucket has uniform access.

4. Navigate to IAM & Admin -> Roles.

  • Create a new role, with the same title and ID. E.g. call it LIGHTLY_DATASET_ACCESS.

  • Click on “Add Permissions”, search for storage.objects

  • Add the permissions storage.objects.get, storage.objects.list, storage.objects.create, storage.objects.delete and storage.objects.update.

  • After adding the permissions, create the role.

Creating a role for accessing Google Cloud Storage.

5. Navigate to APIs -> Credentials.

  • Click on “Create Credentials”, choose Service Account and insert the name LIGHTLY_USER_WILD_ANIMALS.

  • The description can be, for example, “Service account for the Lightly API to access the wild animals dataset”.

  • Click on “Create and Continue”.

  • Choose the Role you just created, i.e. LIGHTLY_DATASET_ACCESS.

  • Add a condition with the title BUCKET_PROJECTS_WILD_ANIMALS and insert the condition below in the Condition editor. Remember to replace the bucket name and the folder path with your own, but keep the “objects” segment between them.

(
    resource.type == 'storage.googleapis.com/Bucket' &&
    resource.name.startsWith("projects/_/buckets/lightly-datalake")
) || (
    resource.type == 'storage.googleapis.com/Object' &&
    resource.name.startsWith("projects/_/buckets/lightly-datalake/objects/projects/wild-animals")
)

For more information, see the IAM Conditions documentation. The first part of the condition grants listing rights on the whole bucket, because listing can only be handled at the bucket level. The second part grants object-level access rights (e.g. read and create) for all objects in the bucket lightly-datalake whose name starts with projects/wild-animals.
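To see which resources the condition admits, the following minimal Python sketch mirrors its logic. It is only an illustration: `condition_allows` is a hypothetical helper, not part of any Google or Lightly library, and IAM evaluates the real condition on the server side.

```python
BUCKET_TYPE = "storage.googleapis.com/Bucket"
OBJECT_TYPE = "storage.googleapis.com/Object"


def condition_allows(resource_type: str, resource_name: str,
                     bucket: str = "lightly-datalake",
                     folder: str = "projects/wild-animals") -> bool:
    """Mirror the IAM condition: the bucket itself, or objects under folder."""
    bucket_prefix = f"projects/_/buckets/{bucket}"
    # The literal "objects" segment separates the bucket name from the object name.
    object_prefix = f"{bucket_prefix}/objects/{folder}"
    return (
        resource_type == BUCKET_TYPE and resource_name.startswith(bucket_prefix)
    ) or (
        resource_type == OBJECT_TYPE and resource_name.startswith(object_prefix)
    )


base = "projects/_/buckets/lightly-datalake/objects/"
# An image inside the shared folder is covered by the second clause.
print(condition_allows(OBJECT_TYPE, base + "projects/wild-animals/zebra.jpg"))  # True
# An object elsewhere in the same bucket is not.
print(condition_allows(OBJECT_TYPE, base + "projects/other/cat.jpg"))  # False
```

Because the condition is a plain prefix match, a sibling folder such as projects/wild-animals-archive would also match. If that matters for your bucket layout, consider ending the folder path in the condition with a slash.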

Google Cloud Service Account
  • Click on “Done” to create the service account.

  • You can change the roles of the service account later in the IAM.

6. Navigate to APIs -> Credentials again if you are not already there.

  • Find the newly created service account in the list of all service accounts.

  • Click on the service account and navigate to the “Keys” tab.

  • Click on “Add key” and create a new private key in JSON Format. It will download the corresponding key file.

Google Cloud Service Account Key Creation
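Before uploading the key file to Lightly, it can be worth a quick sanity check. The sketch below assumes the usual layout of a service account key file (a JSON object with fields such as type, project_id, private_key and client_email); `check_key` is a hypothetical helper written for this guide.

```python
import json
from pathlib import Path

# Fields assumed to be present in a service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key(key: dict, expected_project_id: str) -> list[str]:
    """Return a list of problems found in a parsed key file (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in key]
    if "type" in key and key["type"] != "service_account":
        problems.append("type is not 'service_account'")
    if "project_id" in key and key["project_id"] != expected_project_id:
        problems.append("project_id does not match the project ID from step 1")
    return problems


def load_key(path: str) -> dict:
    """Read and parse the downloaded JSON key file."""
    return json.loads(Path(path).read_text())
```

For example, `check_key(load_key("key.json"), "my-project-id")` returns an empty list when the file looks as expected. Both the file name and the project ID here are placeholders.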

Preparing your data

To create the dataset and upload embeddings and metadata to it, you need the Lightly command-line tool.

Furthermore, you need to have your data locally on your machine.

  1. Install the gsutil tool

  2. Use the rsync command (https://cloud.google.com/storage/docs/gsutil/commands/rsync) to sync the files

    gsutil -m rsync -r /local/projects/wild-animals gs://lightly-datalake/projects/wild-animals
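If you prefer to drive the sync from a script, the command can be assembled as in the following sketch. It uses the bucket name and folder assumed at the start of this guide; `gcs_uri` and `rsync_command` are hypothetical helpers, and only the gsutil flags already shown in this guide are used.

```python
import shlex


def gcs_uri(bucket: str, folder: str) -> str:
    """Build the gs:// URI for a folder inside a bucket."""
    return f"gs://{bucket}/{folder.strip('/')}"


def rsync_command(local_dir: str, bucket: str, folder: str) -> list[str]:
    """Build the gsutil call that syncs local_dir into the bucket folder."""
    return ["gsutil", "-m", "rsync", "-r", local_dir, gcs_uri(bucket, folder)]


cmd = rsync_command("/local/projects/wild-animals", "lightly-datalake", "projects/wild-animals/")
print(shlex.join(cmd))
# prints: gsutil -m rsync -r /local/projects/wild-animals gs://lightly-datalake/projects/wild-animals

# To actually run it (requires gsutil to be installed and authenticated):
# import subprocess; subprocess.run(cmd, check=True)
```

Building the URI in one place also keeps the bucket name consistent with the resource path you later enter in Lightly.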
    

Uploading your data

Create and configure a dataset

  1. Create a new dataset in Lightly

  2. Edit your dataset and select Google Cloud Storage as your datasource

Configure Google Cloud bucket datasource in Lightly Platform

  3. As the resource path, enter the full URI to your resource, e.g. gs://lightly-datalake/projects/wild-animals

  4. Enter the Google Project ID you wrote down in the first step.

  5. Click on “Select Credentials File” to add the key file you downloaded earlier.

  6. Toggle the “Generate thumbnail” switch if you want Lightly to generate thumbnails for you.

  7. If you want to store outputs from Lightly (like thumbnails or extracted frames) in a different directory, you can toggle “Use a different output datasource” and enter a different path in your bucket. This allows you to keep your input directory clean, as nothing is ever written there.

Note

Lightly requires list, read, write and delete access to the output datasource. Make sure you have configured it accordingly in the previous steps.

  8. Press Save and ensure that at least the lights for List and Read turn green. If you added permissions for writing, the Write light should also turn green.

Use lightly-magic and lightly-upload with the following parameters:

  • Use input_dir=/local/projects/wild-animals

  • If you chose the option to generate thumbnails in your bucket, use upload=thumbnails

  • Otherwise, use upload=metadata instead.
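The choice between the two upload modes can be sketched as follows. This is a minimal illustration that only assembles the two parameters named above; `lightly_args` is a hypothetical helper, and any further arguments the CLI needs (for example to authenticate and to select the dataset) are deliberately left out and should be taken from the Lightly CLI documentation.

```python
import shlex


def lightly_args(command: str, input_dir: str, thumbnails_in_bucket: bool) -> list[str]:
    """Build the argument list for lightly-magic or lightly-upload."""
    if command not in ("lightly-magic", "lightly-upload"):
        raise ValueError(f"unexpected command: {command}")
    # upload=thumbnails if Lightly generates thumbnails in the bucket, else metadata only.
    upload = "thumbnails" if thumbnails_in_bucket else "metadata"
    return [command, f"input_dir={input_dir}", f"upload={upload}"]


print(shlex.join(lightly_args("lightly-magic", "/local/projects/wild-animals", True)))
# prints: lightly-magic input_dir=/local/projects/wild-animals upload=thumbnails
```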