Load data directly from S3 buckets using s3fs-fuse
Warning
The Docker Archive documentation is deprecated
The old workflow described in these docs is no longer supported by Lightly Worker versions above 2.6. Please switch to our new documentation page instead.
Learn how to use Lightly Docker directly with an AWS S3 bucket as input.
Very often ML teams don’t have all the data locally on a single machine. Instead, the data is stored in cloud storage such as AWS S3 or Google Cloud Storage. Wouldn’t it be great to connect directly to these storage buckets instead of having to download the data to your local machine every time?
In this example we will do the following:
Set up s3fs-fuse to mount an S3 bucket to your local machine storage
Use the Lightly Docker and set the input directory to the mounted S3 bucket
What is s3fs-fuse?
s3fs-fuse is an open-source project that allows you to mount an S3 bucket as local file storage. Only a limited set of file operations is supported (e.g. appending data to a file is very inefficient and slow because there is no direct support for it). However, for simply reading files to train an ML model this approach is more than sufficient.
Here are some of the limitations pointed out in the GitHub readme:
Note
Generally S3 cannot offer the same performance or semantics as a local file system. More specifically:
random writes or appends to files require rewriting the entire object, optimized with multi-part upload copy
metadata operations such as listing directories have poor performance due to network latency
non-AWS providers may have eventual consistency so reads can temporarily yield stale data (AWS offers read-after-write consistency since Dec 2020)
no atomic renames of files or directories
no coordination between multiple clients mounting the same bucket
no hard links
inotify detects only local modifications, not external ones by other clients or tools
Get an AWS Bucket and Credentials
From the AWS dashboard, go to the S3 service (https://s3.console.aws.amazon.com/s3/home) and create an S3 bucket if you don’t have one yet.
If you don’t have credentials yet, go to the IAM service (https://console.aws.amazon.com/iam/home) on AWS and create a new user. Make sure you add the AmazonS3FullAccess permission. Then create and download the credentials (.csv file). In the credentials file you will find the Access key ID and Secret access key that we will use later.
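If you prefer the command line and have the AWS CLI installed, the same user and credentials can be created as in the sketch below. The user name lightly-s3fs-user is only a placeholder; adapt it to your own naming conventions.
# Create a new IAM user (the name is just an example)
aws iam create-user --user-name lightly-s3fs-user
# Attach the AmazonS3FullAccess managed policy
aws iam attach-user-policy --user-name lightly-s3fs-user --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
# Create an access key pair (prints the Access key ID and Secret access key)
aws iam create-access-key --user-name lightly-s3fs-user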
Install s3fs-fuse
To install s3fs we can follow the instructions from the GitHub readme. On Debian 9 and newer or Ubuntu 16.04 and newer, it can be installed with the following terminal command:
sudo apt install s3fs
Below we show the output for installing s3fs on a Google Cloud Compute instance.
$ sudo apt install s3fs
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
s3fs
0 upgraded, 1 newly installed, 0 to remove and 81 not upgraded.
Need to get 214 kB of archives.
After this operation, 597 kB of additional disk space will be used.
Get:1 http://deb.debian.org/debian buster/main amd64 s3fs amd64 1.84-1 [214 kB]
Fetched 214 kB in 0s (8823 kB/s)
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_CTYPE = "UTF-8",
LANG = "C.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("C.UTF-8").
Selecting previously unselected package s3fs.
(Reading database ... 109361 files and directories currently installed.)
Preparing to unpack .../archives/s3fs_1.84-1_amd64.deb ...
Unpacking s3fs (1.84-1) ...
Setting up s3fs (1.84-1) ...
Processing triggers for man-db (2.8.5-2) ...
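The s3fs-fuse readme also lists packages for other distributions; the exact package names may vary between releases:
# RHEL / CentOS (requires the EPEL repository)
sudo yum install epel-release
sudo yum install s3fs-fuse
# Fedora
sudo dnf install s3fs-fuse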
Configure S3 Credentials
Our freshly installed s3fs-fuse requires access to our S3 bucket. Luckily, we can directly use an AWS credentials file. This file should look like this:
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
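If this file does not exist yet and you have the AWS CLI installed, you can let it write the file for you; it stores the keys in ~/.aws/credentials by default:
# Prompts for the Access key ID, Secret access key, default region and output format
aws configure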
Let’s mount a bucket. First we need to create a local folder where the S3 content should be mounted:
mkdir /s3-mount
Now let’s use s3fs to mount the bucket to our new folder. Run the following command in your terminal.
s3fs simple-test-bucket-igor /s3-mount
Note
If you don’t specify the location of a password file via the passwd_file option, s3fs falls back to your AWS credentials in ~/.aws/credentials.
If everything went well, you should now see the content of your bucket in your /s3-mount folder. If you add a new file to the folder, it is automatically uploaded to the bucket.
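To quickly verify the mount, and to detach it again later, the following commands should work on most Linux systems (using the mount point from above):
# List the bucket content through the mount point
ls /s3-mount
# Check that the FUSE mount is active
mount | grep s3fs
# Unmount the bucket when you are done
fusermount -u /s3-mount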
Optional: use a custom .passwd file for s3fs
If you don’t want to use the default AWS credentials, you can also create a separate passwd file for s3fs:
echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ${HOME}/.passwd-s3fs
chmod 600 ${HOME}/.passwd-s3fs
Now we can mount the S3 bucket using the following command in the terminal.
s3fs simple-test-bucket-igor /s3-mount -o passwd_file=${HOME}/.passwd-s3fs
Use S3 Storage with Lightly Docker
Now we can use the docker run command with the /s3-mount directory as the input directory.
docker run --gpus all --rm -it \
-v /s3-mount:/home/input_dir:ro \
-v /docker/output:/home/output_dir \
lightly/worker:latest \
token=MYAWESOMETOKEN
You can do the same for the Docker output directory (in this example we used /docker/output). We can either use the same bucket and work with subfolders or use another bucket and repeat the procedure.
Using a mounted S3 bucket for the Docker output can be very handy. With this approach the PDF report as well as all other output files are uploaded directly to the S3 storage and can be shared with your team.
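As a sketch, assuming a second bucket has been mounted to /s3-mount-output with the same s3fs procedure, the run command could look like this:
docker run --gpus all --rm -it \
-v /s3-mount:/home/input_dir:ro \
-v /s3-mount-output:/home/output_dir \
lightly/worker:latest \
token=MYAWESOMETOKEN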
Use Caching
If we use the s3fs setup to train an ML model, we iterate multiple times over all the images in the bucket. This is not very efficient because streaming the data from the bucket adds a lot of latency overhead, and the many S3 transactions can also drive up costs.
You can specify a folder for the caching by adding -o use_cache=/tmp to the command:
s3fs simple-test-bucket-igor /s3-mount -o use_cache=/tmp
For more information about caching, check out the FAQ wiki of s3fs.
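Because the cache can grow to the size of the bucket, it may be worth combining use_cache with the ensure_diskfree option, which tells s3fs how much disk space (in MB) to keep free; the value 2048 below is just an example:
s3fs simple-test-bucket-igor /s3-mount -o use_cache=/tmp -o ensure_diskfree=2048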
Common Issues
You need to make sure the AWS S3 region matches the location of your bucket. In your AWS S3 dashboard you can find a list of your S3 buckets as well as their regions.
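If you have the AWS CLI installed, you can also query the region of a bucket directly (using the example bucket from above):
# Prints the LocationConstraint, e.g. eu-central-1 (null means us-east-1)
aws s3api get-bucket-location --bucket simple-test-bucket-igor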
You can manually specify the AWS region by using the url=… flag as shown below:
s3fs simple-test-bucket-igor /s3-mount -o passwd_file=${HOME}/.passwd-s3fs -o url="https://s3-eu-central-1.amazonaws.com"