Create a dataset from an AWS S3 bucket
Lightly allows you to configure a remote datasource like Amazon S3 (Amazon Simple Storage Service). In this guide, we will show you how to set up your S3 bucket, configure your dataset to use that bucket, and upload only metadata to Lightly.
Setting up Amazon S3
To create the so-called presigned URLs (read URLs) used to display your data in your browser, Lightly needs at least read and list permissions on your bucket. If you want Lightly to create optimal thumbnails for you while uploading the metadata of your images, write permissions are also needed.
Let us assume your bucket is called datalake and the folder you want to use with Lightly is located at projects/farm-animals/.
Setting up IAM
Go to the Identity and Access Management (IAM) page and create a new user for Lightly.
Choose a unique name and select “Programmatic access” as “Access type”. Click next.
Create AWS User
We want to create very restrictive permissions for this new user so that it can’t access other resources of your company. Click on “Attach existing policies directly” and then on “Create policy”. This will bring you to a new page.
Setting user permission in AWS
As our policy is very simple, we will use the JSON option and enter the following while substituting datalake with your bucket and projects/farm-animals/ with the folder you want to share.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": [
        "arn:aws:s3:::datalake",
        "arn:aws:s3:::datalake/projects/farm-animals/*"
      ]
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::datalake/projects/farm-animals/*"
      ]
    }
  ]
}
Permission policy in AWS
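The substitution of bucket and folder described above can also be done programmatically. The following is a minimal Python sketch that reproduces the policy shown above for a given bucket and folder; the function name build_user_policy is hypothetical and not part of Lightly or AWS tooling.

```python
import json


def build_user_policy(bucket: str, folder: str) -> dict:
    """Return the IAM policy from this guide with bucket and folder substituted."""
    prefix = folder.strip("/")  # "projects/farm-animals/" -> "projects/farm-animals"
    bucket_arn = f"arn:aws:s3:::{bucket}"
    folder_arn = f"{bucket_arn}/{prefix}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": [bucket_arn, folder_arn],
            },
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": [folder_arn],
            },
        ],
    }


policy = build_user_policy("datalake", "projects/farm-animals/")
print(json.dumps(policy, indent=2))  # paste the output into the JSON editor
```

Generating the JSON this way avoids typos when the same policy has to be written for several buckets or folders.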
Go to the next page and create tags as you see fit (e.g. external or lightly) and give a name to your new policy before creating it.
Review and name permission policy in AWS
Return to the previous page as shown in the screenshot below and reload. Now when filtering policies, your newly created policy will show up. Select it and continue setting up your new user.
Attach permission policy to user in AWS
Write down the Access key ID and the Secret access key in a secure location (such as a password manager), as you will not be able to access this information again. You can generate new keys and revoke old keys under Security credentials on the user's detail page.
Get security credentials (access key id, secret access key) from AWS
S3 IAM Delegated Access
To access your data in your S3 bucket on AWS, Lightly can assume a role in your account which has the necessary permissions to access your data. This is considered best practice by AWS.
To set up IAM Delegated Access:
Go to the AWS IAM Console
Click Create role
Select AWS Account as the trusted entity type
Select Another AWS account and specify the AWS Account ID of Lightly: 311530292373
Check Require external ID, and choose an external ID. The external ID should be treated like a passphrase.
Do not check Require MFA.
Click next.
Select a policy which grants access to your S3 bucket. If no policy has previously been created, here is an example of what the policy should look like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "lightlyS3Access",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::{YOUR_BUCKET}/*",
        "arn:aws:s3:::{YOUR_BUCKET}"
      ]
    }
  ]
}
Name the role Lightly-S3-Integration and create the role.
Edit your new Lightly-S3-Integration role: set the Maximum session duration to 12 hours.
Warning
Make sure to set the Maximum session duration to 12 hours. Otherwise, Lightly will not be able to access your data.
Remember the external ID and the ARN of the newly created role (for example, arn:aws:iam::367053757506:role/Lightly-S3-Integration, where the account ID will be your own).
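To make the delegated access steps more concrete, here is a minimal Python sketch of two things: generating a hard-to-guess external ID, and the kind of trust policy the AWS console typically attaches to a role created with another account and an external ID. The helper names new_external_id and build_trust_policy are hypothetical, and the exact JSON that the console generates for your role may differ, so treat this as an illustration rather than a template to paste.

```python
import json
import secrets

LIGHTLY_ACCOUNT_ID = "311530292373"   # Lightly's AWS account ID, as given in the steps above
MAX_SESSION_SECONDS = 12 * 60 * 60    # the required 12 hours, expressed in seconds


def new_external_id() -> str:
    """Return 32 random hex characters to use as the external ID."""
    return secrets.token_hex(16)


def build_trust_policy(external_id: str) -> dict:
    """Sketch of a trust policy that lets the Lightly account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{LIGHTLY_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                # The role can only be assumed when the caller presents this external ID.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


external_id = new_external_id()
trust_policy = build_trust_policy(external_id)
print(json.dumps(trust_policy, indent=2))
```

Because the external ID acts like a passphrase, generating it randomly and storing it in a password manager is safer than choosing a memorable value.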
Note
We recommend setting up separate input and output datasources (see First Steps). For this, either use two different roles with narrow scope or one role with broader access.
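As an illustration of the two-role option, the following Python sketch builds one read-only policy for an input folder and one read/write policy for an output folder. The function name scoped_policy, the Sid values, and the output folder projects/farm-animals-lightly/ are hypothetical. Note that s3:ListBucket applies to the bucket itself, so this sketch does not limit listing to the folder.

```python
def scoped_policy(sid: str, bucket: str, folder: str, writable: bool) -> dict:
    """Build a policy limited to one folder; write access is optional."""
    actions = ["s3:GetObject", "s3:ListBucket"]  # read and list
    if writable:
        actions += ["s3:PutObject", "s3:DeleteObject"]  # write access for outputs
    prefix = folder.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": sid,
                "Effect": "Allow",
                "Action": actions,
                "Resource": [
                    f"arn:aws:s3:::{bucket}/{prefix}/*",
                    f"arn:aws:s3:::{bucket}",
                ],
            }
        ],
    }


# Input datasource: list and read only.
input_policy = scoped_policy("lightlyInput", "datalake", "projects/farm-animals/", writable=False)
# Output datasource (hypothetical folder): list, read, and write.
output_policy = scoped_policy("lightlyOutput", "datalake", "projects/farm-animals-lightly/", writable=True)
```

Keeping the input policy read-only means nothing can be written to or deleted from your source data, even by mistake.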
Preparing your data
To create embeddings and extract metadata from your data, the command-line tool lightly-magic needs access to your data. You can either download/sync your data from S3 or mount S3 as a drive. We recommend downloading your data from S3 as it makes the overall process faster.
Prepare data by downloading from S3 (recommended)
Install the AWS CLI by following Amazon's guide
Run aws configure and set the credentials
Download/synchronize the folder located on S3 to your current directory
aws s3 sync s3://datalake/projects/farm-animals ./farm
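If you prefer to trigger the download from a script, the sync command above can be assembled in Python. This is a minimal sketch; the helper name s3_sync_command is hypothetical, and it only builds the argument list, so the AWS CLI must still be installed and configured before you actually run it.

```python
import shlex


def s3_sync_command(bucket: str, folder: str, local_dir: str) -> list[str]:
    """Build the `aws s3 sync` invocation for one folder of a bucket."""
    source = f"s3://{bucket}/{folder.strip('/')}"
    return ["aws", "s3", "sync", source, local_dir]


command = s3_sync_command("datalake", "projects/farm-animals/", "./farm")
print(shlex.join(command))

# To execute the command instead of printing it:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a folder name contains spaces.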
Prepare data by mounting S3 as a drive
For Linux and macOS, we recommend using s3fs-fuse to mount S3 buckets as local file storage. You can have a look at our step-by-step guide: Load data directly from S3 buckets using s3fs-fuse.
Uploading your data
Create and configure a dataset
Create a new dataset in Lightly. Make sure that you choose the input type Images or Videos correctly, depending on the type of files in your S3 bucket.
Edit your dataset and select S3 as your datasource
Lightly S3 connection config
As the resource path, enter the full S3 URI to your resource, e.g. s3://datalake/projects/farm-animals/
Enter the access key and the secret access key we obtained from creating a new user in the previous step and select the AWS region in which you created your bucket.
Note
If you are using a delegated access role, toggle the switch Use IAM role based delegated access and pass the external ID and the role ARN from the previous step instead of the secret access key.
Toggle the “Generate thumbnail” switch if you want Lightly to generate thumbnails for you.
If you want to store outputs from Lightly (like thumbnails or extracted frames) in a different directory, you can toggle “Use a different output datasource” and enter a different path in your bucket. This allows you to keep your input directory clean, as nothing is ever written there.
Note
Lightly requires list, read, and write access to the output datasource. Make sure you have configured it accordingly in the steps before.
Press save and ensure that at least the lights for List and Read turn green. If you added permissions for writing, this light should also turn green.
Use lightly-magic and lightly-upload just as you always would, with the following considerations:
Use input_dir=datalake/farm-animals
If you chose the option to generate thumbnails in your bucket, use the argument upload=thumbnails
Otherwise, use upload=metadata instead.
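The choice between the two upload modes above can be sketched as a small Python helper. The function name lightly_args is hypothetical, and only the input_dir and upload arguments named in this guide are included; any other arguments you normally pass to lightly-magic or lightly-upload are omitted here and still need to be added.

```python
import shlex


def lightly_args(input_dir: str, thumbnails_in_bucket: bool) -> list[str]:
    """Return the input_dir and upload arguments described in this guide."""
    # upload=thumbnails if Lightly generates thumbnails in the bucket, else upload=metadata
    upload = "thumbnails" if thumbnails_in_bucket else "metadata"
    return [f"input_dir={input_dir}", f"upload={upload}"]


command = ["lightly-magic", *lightly_args("datalake/farm-animals", thumbnails_in_bucket=True)]
print(shlex.join(command))
```

With thumbnails_in_bucket=False the same helper yields upload=metadata, which matches the case where the “Generate thumbnail” switch is left off.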