Create a dataset from Google Cloud Storage
Lightly allows you to configure a remote datasource such as Google Cloud Storage. In this guide, we will show you how to set up your Google Cloud Storage bucket, configure your dataset to use that bucket, and upload only metadata to Lightly while keeping your data private.
One decision you need to make first is whether you want to use thumbnails. Using thumbnails makes the Lightly Platform more responsive, because the full images do not always have to be loaded. However, the thumbnails will be stored in your bucket and thus take up storage. You have three options:
You want to use thumbnails, but don’t have them yet. Then you need to give Lightly write access to your bucket to create the thumbnails there for you. The write access can be configured not to allow overwriting and deleting, thus existing data cannot get lost.
You already have thumbnails in your bucket with a consistent naming scheme, e.g. an image called img.jpg has a corresponding thumbnail called img_thumb.jpg. In this case, read access to your bucket is sufficient.
You don’t want to use thumbnails. Then read access to your bucket is sufficient. The Lightly Platform will load the full image even when requesting the thumbnail.
Depending on this decision, the following steps will differ slightly.
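The naming scheme from the second option can be illustrated with a minimal sketch. The helper name thumbnail_name and the default suffix "_thumb" are assumptions chosen to match the img.jpg / img_thumb.jpg example above; they are not part of the Lightly API.

```python
from pathlib import PurePosixPath


def thumbnail_name(image_path: str, suffix: str = "_thumb") -> str:
    """Return the thumbnail path that corresponds to an image path.

    Hypothetical helper: inserts `suffix` between the file stem and the
    extension, e.g. img.jpg -> img_thumb.jpg.
    """
    path = PurePosixPath(image_path)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


print(thumbnail_name("projects/wild-animals/img.jpg"))
# prints: projects/wild-animals/img_thumb.jpg
```

A quick check like this over a few object names can confirm that existing thumbnails really follow one consistent scheme before you rely on read-only access.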
Setting up Google Cloud Storage
Lightly needs to be able to create so-called presigned URLs/read URLs for displaying your data in your browser. Thus it needs at minimum read and list permissions on your bucket.
Let us assume the bucket is called lightly-datalake and the folder you want to use with Lightly is located at projects/wild-animals/.
Setting Up a Service Account in IAM
1. Write down your project ID. You can find it in the Google Cloud console under Project Info.
2. Navigate to your bucket in the Google Cloud Storage browser and from there to projects/wild-animals/. Copy the path, in this case lightly-datalake/projects/wild-animals.
3. Navigate to the tab Permissions. Make sure that your access control is uniform. If it is not, change it to uniform.
Navigate to IAM & Admin -> Roles.
Create a new role with the same title and ID, e.g. LIGHTLY_DATASET_ACCESS.
Click on Add Permissions, search for storage.objects
Add the permissions storage.objects.get, storage.objects.list, and storage.objects.create. The create permission is only needed if you want Lightly to create thumbnails in your bucket. Otherwise you can leave it out.
After adding the permissions, create the role.
Navigate to APIs -> Credentials.
Click on Create Credentials, choose Service Account and insert the name LIGHTLY_USER_WILD_ANIMALS.
The description can be, for example, “Service account for the Lightly API to access the wild animals dataset”.
Click on Create and Continue.
Choose the Role you just created, i.e. LIGHTLY_DATASET_ACCESS.
Add a condition with the title BUCKET_PROJECTS_WILD_ANIMALS and insert the condition below in the Condition editor. Remember to change the bucket name and the path to the folder. However, you must keep the “objects” in between.
( resource.type == 'storage.googleapis.com/Bucket' && resource.name.startsWith("projects/_/buckets/lightly-datalake") ) || ( resource.type == 'storage.googleapis.com/Object' && resource.name.startsWith("projects/_/buckets/lightly-datalake/objects/projects/wild-animals") )
For more information, see the IAM Conditions documentation. The first part of the condition adds listing rights to the whole bucket, as they can only be handled on the bucket level. The second part adds object-level access rights (i.e. read and create) for all objects in the bucket lightly-datalake whose name starts with projects/wild-animals.
Click on Done to create the service account.
You can change the roles of the service account later in the IAM.
Navigate to APIs -> Credentials again if you are not already there.
Find the newly created service account in the list of all service accounts.
Click on the service account and navigate to the Keys tab.
Click on Add key and create a new private key in JSON Format. It will download the corresponding key file.
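Because the condition string and the permission list from the steps above have to be adapted to each bucket and folder, it can help to generate them rather than edit them by hand. The following is a minimal sketch; the function names build_iam_condition and required_permissions are hypothetical, and the output simply reproduces the format of the condition and permissions shown in this guide.

```python
def build_iam_condition(bucket: str, folder: str) -> str:
    """Build the IAM condition expression for a bucket and a folder in it.

    Hypothetical helper: the literal "objects" segment between the bucket
    name and the folder path must be kept, as noted in the steps above.
    """
    folder = folder.strip("/")
    bucket_prefix = f"projects/_/buckets/{bucket}"
    object_prefix = f"{bucket_prefix}/objects/{folder}"
    return (
        "( resource.type == 'storage.googleapis.com/Bucket' && "
        f'resource.name.startsWith("{bucket_prefix}") ) || '
        "( resource.type == 'storage.googleapis.com/Object' && "
        f'resource.name.startsWith("{object_prefix}") )'
    )


def required_permissions(create_thumbnails: bool) -> list[str]:
    """Return the role permissions for the chosen thumbnail option."""
    permissions = ["storage.objects.get", "storage.objects.list"]
    if create_thumbnails:
        # Only needed when Lightly should write thumbnails to the bucket.
        permissions.append("storage.objects.create")
    return permissions


print(build_iam_condition("lightly-datalake", "projects/wild-animals/"))
print(required_permissions(create_thumbnails=False))
```

The printed condition can be pasted into the Condition editor; the permission list is only a reminder of what to select when creating the role.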
Preparing your data
For creating the dataset and uploading embeddings and metadata to it, you need the Lightly command-line tool.
Furthermore, you need to have your data locally on your machine.
Install the gsutil tool
Use the rsync command (https://cloud.google.com/storage/docs/gsutil/commands/rsync) to sync the files:
gsutil -m rsync -r /local/projects/wild-animals gs://lightly-datalake/projects/wild-animals
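If you sync several folders, the command can be assembled programmatically. This is a minimal sketch that only builds the gsutil argument list, using the bucket name lightly-datalake assumed earlier in this guide; the helper name build_rsync_command is hypothetical, and actually running the command requires gsutil to be installed and authenticated.

```python
import shlex


def build_rsync_command(local_dir: str, bucket: str, folder: str) -> list[str]:
    """Build the gsutil rsync argument list for one local folder.

    Hypothetical helper: -m enables parallel transfers and -r makes
    rsync recurse into subdirectories.
    """
    destination = f"gs://{bucket}/{folder.strip('/')}"
    return ["gsutil", "-m", "rsync", "-r", local_dir, destination]


command = build_rsync_command(
    "/local/projects/wild-animals", "lightly-datalake", "projects/wild-animals"
)
print(shlex.join(command))

# To actually run the sync (assumes gsutil is on your PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a local path contains spaces.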
Uploading your data
Create and configure a dataset
Create a new dataset in Lightly
Edit your dataset and select Google Cloud Storage as your datasource
As the resource path, enter the full URI to your resource, e.g. gs://lightly-datalake/projects/wild-animals
Enter the Google Project ID you wrote down in the first step.
Click on Select Credentials File to add the key file you downloaded in the previous step.
The thumbnail suffix depends on the option you chose in the first step:
You want Lightly to create the thumbnail for you. Then choose the naming scheme to your liking.
You already have thumbnails in your bucket. Then choose the thumbnail suffix such that it reflects your naming scheme.
You don’t want to use thumbnails. Then leave the thumbnail suffix undefined/empty.
6. Press save and ensure that at least the lights for List and Read turn green. If you added the permission for writing, the corresponding light should also turn green.
After closing the pop-up by clicking the X, you should be on the dataset creation page again.
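If the lights do not turn green, one thing worth ruling out is a mismatch between the key file and the project ID you entered. The sketch below checks the two fields of a service account key file that are relevant here, "type" and "project_id". The helper names load_key and key_problems and the example file name are hypothetical.

```python
import json


def load_key(key_path: str) -> dict:
    """Load a downloaded service account key file (JSON format)."""
    with open(key_path, encoding="utf-8") as handle:
        return json.load(handle)


def key_problems(key: dict, expected_project_id: str) -> list[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if key.get("type") != "service_account":
        problems.append("key file is not a service account key")
    if key.get("project_id") != expected_project_id:
        problems.append(
            f"key belongs to project {key.get('project_id')!r}, "
            f"expected {expected_project_id!r}"
        )
    return problems


# Example usage (the file name and project ID are placeholders):
# print(key_problems(load_key("lightly-key.json"), "my-project-id"))
```

This only inspects the file locally; it does not verify that the service account actually has the role and condition configured above.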
Use lightly-magic and lightly-upload with the following parameters:
If you chose the option to generate thumbnails in your bucket, use upload=thumbnails
Otherwise, use upload=metadata instead.
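The choice of the upload parameter can be expressed as a small sketch. Only the upload=thumbnails and upload=metadata values come from this guide; the helper name build_upload_command and the other arguments (input_dir, token, dataset_id) are assumptions about the command-line tool and should be checked against its documentation.

```python
def build_upload_command(
    input_dir: str, token: str, dataset_id: str, create_thumbnails: bool
) -> list[str]:
    """Build a lightly-upload argument list for the chosen thumbnail option.

    Hypothetical helper: input_dir, token, and dataset_id are assumed
    argument names; only the `upload` values are taken from this guide.
    """
    upload = "thumbnails" if create_thumbnails else "metadata"
    return [
        "lightly-upload",
        f"input_dir={input_dir}",
        f"token={token}",
        f"dataset_id={dataset_id}",
        f"upload={upload}",
    ]


# Placeholder token and dataset ID; replace them with your own values.
print(" ".join(build_upload_command(
    "/local/projects/wild-animals", "MY_TOKEN", "MY_DATASET_ID",
    create_thumbnails=False,
)))
```

With create_thumbnails=False the command ends in upload=metadata, so only metadata is sent to Lightly and no thumbnails are written to the bucket.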