Create a dataset from an AWS S3 bucket

Lightly allows you to configure a remote datasource like Amazon S3 (Amazon Simple Storage Service). In this guide, we will show you how to set up your S3 bucket, configure your dataset to use it, and upload only metadata to Lightly.

List, read and write permissions

Lightly needs at minimum read and list permissions (s3:GetObject and s3:ListBucket) on your bucket. It needs them to give the Lightly Worker access to your dataset so that it can process it. Furthermore, the Lightly Platform needs access to show you your images in the webapp.

Some scenarios additionally require write permissions (s3:PutObject):

  • You process videos.

  • You use the Lightly Worker with the object level workflow. (See Object Level)

  • You want the Lightly Worker to create thumbnails for you to increase the performance of the Lightly Platform.

User Access and Delegated Access

There are two ways to set up these permissions:

  1. User Access

This method creates a user with permissions to access your bucket. An access key ID and a secret access key are used to authenticate as this user. We recommend this method as it is easy to set up and provides optimal performance.

  2. Delegated Access

To access your data in your S3 bucket on AWS, Lightly can assume a role in your account which has the necessary permissions to access your data. Use this method if internal or external policies of your organization require it or disallow the other method. It comes with a small overhead for each access to a file in your bucket by Lightly. The overhead is negligible for larger files (e.g. videos or large images), but may become significant for many small files.
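For reference, the role you create for delegated access carries a trust policy that lets Lightly assume it. The sketch below shows the generic shape of such a trust policy; LIGHTLY_ACCOUNT_ID and YOUR_EXTERNAL_ID are placeholders, and you should use the actual values shown to you in the Lightly Platform:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "AWS": "arn:aws:iam::LIGHTLY_ACCOUNT_ID:root" },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": { "sts:ExternalId": "YOUR_EXTERNAL_ID" }
            }
        }
    ]
}
```

The external ID condition ensures that only requests made on behalf of your Lightly account can assume the role.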

To set up one of the access methods:

S3 IAM User Access

  1. Go to the Identity and Access Management IAM page and create a new user for Lightly.

  2. Choose a unique name and select “Programmatic access” as the “Access type”. Click next.

    Create AWS User

  3. We want to create very restrictive permissions for this new user so that it can’t access other resources of your company. Click on “Attach existing policies directly” and then on “Create policy”. This will bring you to a new page.

    Setting user permission in AWS

  4. As our policy is very simple, we will use the JSON option and enter the following. Please substitute datalake with the name of your bucket and projects/farm-animals/ with the folder you want to share.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowListing",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": [
                "arn:aws:s3:::datalake",
                "arn:aws:s3:::datalake/projects/farm-animals/*"
            ]
        },
        {
            "Sid": "AllowAccess",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::datalake/projects/farm-animals/*"
            ]
        }
    ]
}
Permission policy in AWS
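If you prefer to generate the policy instead of editing it by hand, the following Python sketch builds the same document from your bucket name and folder prefix (the helper name and the example values are ours; the output matches the policy above):

```python
import json


def make_policy(bucket: str, prefix: str) -> str:
    """Build the access policy JSON for a bucket and folder prefix."""
    prefix = prefix.strip("/")  # normalize, e.g. "projects/farm-animals"
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowListing",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/{prefix}/*",
                ],
            },
            {
                "Sid": "AllowAccess",
                "Effect": "Allow",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }
    return json.dumps(policy, indent=4)


print(make_policy("datalake", "projects/farm-animals/"))
```

Paste the printed JSON into the policy editor.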

  5. Go to the next page and create tags as you see fit (e.g. external or lightly) and give your new policy a name before creating it.

    Review and name permission policy in AWS

  6. Return to the previous page as shown in the screenshot below and reload it. When you now filter policies, your newly created policy will show up. Select it and continue setting up your new user.

    Attach permission policy to user in AWS

  7. Write down the Access key ID and the Secret access key in a secure location (such as a password manager) as you will not be able to access this information again. (You can generate new keys and revoke old keys under Security credentials on a user’s detail page.)

    Get security credentials (access key id, secret access key) from AWS

Create and configure a dataset

  1. Create a new dataset in Lightly. Make sure that you choose the input type Images or Videos correctly, depending on the type of files in your S3 bucket.

  2. Edit your dataset and select S3 as your datasource.

    Lightly S3 connection config

  3. As the resource path, enter the full S3 URI to your resource, e.g. s3://datalake/projects/farm-animals/

  4. Enter the access key and the secret access key we obtained from creating a new user in the previous step and select the AWS region in which you created your bucket.

    Note

    If you are using a delegated access role, toggle the switch Use IAM role based delegated access and pass the external ID and the role ARN from the previous step instead of the secret access key.

  5. Toggle the “Generate thumbnail” switch if you want Lightly to generate thumbnails for you.

    Note

    If you want to use server side encryption, toggle the switch Use server side encryption and set the KMS key ARN. (see: S3 Server Side Encryption with KMS)

  6. If you want to store outputs from Lightly (like thumbnails, extracted frames, or crops) in a different directory, you can toggle “Use a different output datasource” and enter a different path in your bucket. This allows you to keep your input directory clean, as nothing ever gets written there.

    Note

    Lightly requires list, read, and write access to the output datasource. Make sure you have configured it accordingly in the steps before. You can also use two different policies: listing and reading only for the input datasource, and additionally writing for the output datasource.

  7. Press save and ensure that at least the lights for List and Read turn green. If you added permissions for writing, the Write light should also turn green.
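The resource path you enter in step 3 must agree with the ARNs in the permission policy you created earlier. A small sketch (the helper name is ours) that splits the S3 URI into the bucket and prefix used in those ARNs:

```python
from urllib.parse import urlparse


def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = parse_s3_uri("s3://datalake/projects/farm-animals/")
print(bucket)  # datalake
print(prefix)  # projects/farm-animals/
```

Here the bucket corresponds to arn:aws:s3:::datalake and the prefix to the projects/farm-animals/* part of the object resource ARN.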

More restrictive policies

It is possible to create more restrictive policies, e.g. to only permit certain IP ranges to access your data. (see: Minimum AWS Policy requirements)

Next steps

Set up the Lightly Worker. (see Setup) If you have already set up the Worker, create a dataset with your S3 bucket as the datasource. (see Using the Docker with a Cloud Bucket as Remote Datasource)