AWS Batch
AWS Batch lets you use the LightlyOne Worker without managing the instances it runs on. It can:
- Spin up an instance for processing LightlyOne jobs when you create scheduled runs for the LightlyOne Worker.
- Spin up multiple instances for multiple scheduled runs if needed.
- Shut down the instance(s) again once the LightlyOne Worker has processed the scheduled run(s).
This allows processing multiple LightlyOne Worker jobs simultaneously while minimizing instance costs and not needing to manage instances manually.
You can combine it with, e.g., AWS Lambda or a cron job to process new images in your datasource every night or week with the LightlyOne Worker, so that an AWS instance only runs while it is needed.
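As a rough illustration of the Lambda or cron trigger mentioned above, the sketch below submits one AWS Batch job per invocation. It assumes that a job queue and a job definition already exist. The names `submit_nightly_job`, `lambda_handler`, the job name prefix, and the two environment variables are hypothetical and not part of LightlyOne or AWS. The Batch client is passed in as an argument so the helper can be tried out without AWS credentials.

```python
import datetime
import os


def submit_nightly_job(batch_client, job_queue_arn, job_definition_name, now=None):
    """Submit one AWS Batch job with a timestamped name and return its job ID."""
    now = now or datetime.datetime.now()
    # The timestamp keeps job names unique and free of special characters.
    job_name = f"lightly-nightly-{now.strftime('%Y-%m-%d-%H-%M-%S')}"
    response = batch_client.submit_job(
        jobName=job_name,
        jobQueue=job_queue_arn,
        jobDefinition=job_definition_name,
    )
    return response["jobId"]


def lambda_handler(event, context):
    """Hypothetical AWS Lambda entry point, e.g. triggered by a nightly schedule."""
    import boto3  # Imported lazily so the helper above works without boto3.

    job_id = submit_nightly_job(
        batch_client=boto3.client("batch"),
        job_queue_arn=os.environ["BATCH_JOB_QUEUE_ARN"],  # assumed variable name
        job_definition_name=os.environ["BATCH_JOB_DEFINITION"],  # assumed variable name
    )
    return {"jobId": job_id}
```

The job only has work to do if a run has been scheduled for the worker, so the trigger would also need to schedule a run via the LightlyOne API.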
Requirements
You need a machine (e.g., a developer laptop) with:
- The AWS CLI installed. It needs to be configured with aws configure.
- A Python environment and the ability to install the pip packages boto3 and lightly.
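The requirements above can be checked with a few lines of Python. This is only a convenience sketch: the function name `check_requirements` is hypothetical, and the check only tells you whether the packages and the `aws` executable can be found. It does not verify that `aws configure` has been run.

```python
import importlib.util
import shutil


def check_requirements(packages=("boto3", "lightly"), cli_tools=("aws",)):
    """Return a dict mapping each requirement to True if it was found."""
    found = {}
    for package in packages:
        # find_spec returns None when a top-level package is not installed.
        found[package] = importlib.util.find_spec(package) is not None
    for tool in cli_tools:
        # shutil.which returns None when the executable is not on the PATH.
        found[tool] = shutil.which(tool) is not None
    return found


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```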
Setting up AWS Batch
Use the wizard to set up AWS Batch.
- Step 1 - Select orchestration type: Choose Amazon Elastic Compute Cloud (EC2) to allow using GPUs.
- Step 2 - Create a compute environment:
  - Compute environment configuration: Choose the name, e.g. aws-batch-lightly--compute-env. Choose an instance role or create one if needed.
  - Instance configuration: Choose the instance type (family) you want to use, e.g. the g4dn family. Remove optimal to enforce using this instance type family. The size of the actual instance can be configured on the job definition level, allowing you to choose the instance type depending on the job size.
  - Keep the other options at their default.
- Step 3 - Create a job queue: Change the name if you wish, e.g., aws-batch-lightly--job-queue.
- Steps 4, 5, and 6: We will set the job definitions via Python later. Thus, you can click next for these steps.
Once the resources are created, the wizard forwards you to the dashboard, where you see your job queues and compute environments. On the top left, you should be able to click on Jobs. Set the job queue to the one you just created with the wizard and set the filter to Created before to see the jobs. It should show the job created by the wizard in the state RUNNABLE or SUCCEEDED.
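The same check can be done from Python instead of the console. The sketch below uses the boto3 Batch `list_jobs` call, which filters by one job status per request. The function name `list_queue_jobs` is hypothetical, the status names are those of the Batch API, and pagination via `nextToken` is left out for brevity.

```python
def list_queue_jobs(batch_client, job_queue, statuses=("RUNNABLE", "SUCCEEDED")):
    """Return (job name, job ID, status) tuples for jobs in the given states."""
    jobs = []
    for status in statuses:
        # list_jobs accepts a single status per call, so query each one in turn.
        response = batch_client.list_jobs(jobQueue=job_queue, jobStatus=status)
        for summary in response.get("jobSummaryList", []):
            jobs.append((summary["jobName"], summary["jobId"], summary["status"]))
    return jobs
```

With real credentials it could be called as `list_queue_jobs(boto3.client("batch"), "aws-batch-lightly--job-queue")`, using the queue name or ARN you chose in the wizard.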
Scheduling a run for the LightlyOne Worker and processing it with AWS Batch
In a nutshell, we use a single Python script to perform the following operations:
A) We first use the LightlyOne API to schedule a run. This part should be familiar if you followed any of the other tutorials or the getting started guide.
B) We then use the AWS Batch API to process the run. We spin up a new instance just for the run we created above. Thanks to AWS Batch, the instance shuts down automatically once processing is finished. This requires setting the shutdown_when_job_finished flag to True in the worker config of the LightlyOne scheduled run.
C) (Optional) We monitor the job and print updates. This helps with debugging until the script works as expected.
Use the Python script below for all the steps. They are explained in the form of comments.
import time
import datetime
import boto3 # Install it e.g. with `pip install boto3`
from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType, DatasourcePurpose
"""
Step 0: Set the Lightly token and the AWS Batch job queue ARN.
"""
# You must change this. Find it under the preferences page in the LightlyOne Platform: https://app.lightly.ai/preferences
MY_LIGHTLY_TOKEN = "CHANGEME"
# You must change this. Find it under https://console.aws.amazon.com/batch/home#jobDefinitions by clicking on the job queue name.
# It has the format `arn:aws:batch:REGION:AWS_ID:job-queue/JOB_QUEUE_NAME`
MY_AWS_BATCH_JOB_QUEUE_ARN = "CHANGEME"
# Optional to change.
# It is used to make sure that the worker name, worker label, job definition name and job name are unique.
# Useful when processing multiple LightlyOne scheduled runs in parallel.
# Must not contain special characters except for `-` and `_`.
unique_id = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
# Create a LightlyOne API client.
client = ApiWorkflowClient(token=MY_LIGHTLY_TOKEN)
# Create a Boto3 client for AWS Batch.
batch_client = boto3.client("batch")
"""
Step 1: Register a LightlyOne Worker.
A unique LightlyOne Worker allows scheduling runs specifically for that worker.
This is only needed once for each LightlyOne Worker you want to run in parallel.
"""
worker_name = f"aws-batch-worker {unique_id}"
worker_labels = [f"{unique_id}"]
print("Worker labels:", worker_labels)
worker_id = client.register_compute_worker(name=worker_name, labels=worker_labels)
"""
Step 2: Create a new dataset on the LightlyOne Platform and configure the Input and Lightly datasources for the dataset.
This step is the same as without AWS Batch.
"""
client.create_dataset(
    dataset_name="dataset-name",
    dataset_type=DatasetType.IMAGES,  # must be DatasetType.VIDEOS when working with videos
)
my_dataset_id = client.dataset_id
print(f"Dataset ID: {my_dataset_id}")
# Configure the Input datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_s3_delegated_access_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    role_arn="S3-ROLE-ARN",
    external_id="S3-EXTERNAL-ID",
    purpose=DatasourcePurpose.LIGHTLY,
)
"""
Step 3: Configure and schedule a run.
This differs from the usual way of scheduling a run in two aspects:
i) By setting `worker_config={"shutdown_when_job_finished": True}`, the LightlyOne Worker docker container will shut down
as soon as it finishes the first scheduled run. This will then cause the AWS batch job to finish and the EC2 instance to shut down.
Otherwise, the job and machine will stay running with the LightlyOne Worker waiting for the next scheduled run to process.
ii) The `runs_on` argument was set to make the scheduled run be processed by the LightlyOne Worker created by AWS batch.
Please be aware that a LightlyOne Worker without any labels might pick up the job instead, see the label matching docs:
https://lightly-docs.readme.io/docs/assign-scheduled-runs-to-specific-workers#label-matching
"""
scheduled_run_id = client.schedule_compute_worker_run(
    worker_config={
        "shutdown_when_job_finished": True,
    },
    selection_config={
        "n_samples": 50,
        "strategies": [
            {"input": {"type": "EMBEDDINGS"}, "strategy": {"type": "DIVERSITY"}}
        ],
    },
    runs_on=worker_labels,
)
"""
Step 5: Set and register the AWS Batch job definition.
This is only needed once unless you change the job definition.
"""
job_definition_name = f"aws-batch-lightly--job-definition--{unique_id}"
job_name = f"aws-batch-lightly--job--{unique_id}"
print(
    f"Registering AWS Batch job definition {job_definition_name} and submitting job {job_name}."
)
job_definition = {
    "jobDefinitionName": job_definition_name,
    "type": "container",
    "containerProperties": {
        "image": "lightly/worker:latest",
        # Resource requirements for a g4dn.2xlarge instance. Change to your needs.
        # Hardware recommendations: https://docs.lightly.ai/docs/hardware-recommendations
        "resourceRequirements": [
            {"type": "MEMORY", "value": "32768"},  # 32 GB
            {"type": "VCPU", "value": "8"},
        ],
        # Arguments passed to the LightlyOne Worker container.
        "command": [
            f"token={MY_LIGHTLY_TOKEN}",
            f"worker.worker_id={worker_id}",
        ],
    },
}
# Register the job definition. This is only needed once unless you change the job definition.
response = batch_client.register_job_definition(**job_definition)
if "jobDefinitionName" in response and "jobDefinitionArn" in response:
    print("Job Definition Registered Successfully")
    print("Job Definition Name:", response["jobDefinitionName"])
    print("Job Definition ARN:", response["jobDefinitionArn"])
else:
    raise RuntimeError("Job Definition Registration Failed")
"""
Step 6: Submit a job for the job definition.
This is needed once each time you want to process one or a series of Lightly scheduled runs.
"""
submit_response = batch_client.submit_job(
    jobName=job_name,
    jobQueue=MY_AWS_BATCH_JOB_QUEUE_ARN,
    jobDefinition=job_definition_name,
)
if "jobId" in submit_response and "jobName" in submit_response:
    print("Job Submitted Successfully")
    print("Job Name:", submit_response["jobName"])
    print("Job ID:", submit_response["jobId"])
else:
    raise RuntimeError("Job Submission Failed")
job_id = submit_response["jobId"]
print(f"Job ID: {job_id}")
"""
Step 7: Check the AWS Batch job status and the LightlyOne scheduled run state.
This step is optional, but it is useful for monitoring and debugging.
i) At the beginning, the AWS Batch job status should be `SUBMITTED` and the LightlyOne scheduled run state should be `OPEN`.
ii) The AWS Batch job status should change to `RUNNABLE` within a few seconds.
    While the job is `RUNNABLE`, a new EC2 instance has to be created and started, which may take a few minutes.
    If the job is stuck in `RUNNABLE`, use the AWS Batch troubleshooting automation:
    console.aws.amazon.com/systems-manager/automation/execute/AWSSupport-TroubleshootAWSBatchJob
iii) Once the EC2 instance is running, the AWS Batch job status should change to `STARTING`.
    Now the LightlyOne Worker Docker image is downloaded. This may take another few minutes.
iv) After the LightlyOne Worker Docker image has been downloaded, the AWS Batch job status should change to `RUNNING`.
    The worker should pick up the scheduled run and start processing it.
    The AWS Batch job status should now stay `RUNNING` until the LightlyOne scheduled run has gone through all its steps.
v) Once the LightlyOne Worker has finished the scheduled run, the AWS Batch job status should change to `SUCCEEDED`.
"""
while True:
    aws_batch_job_status = batch_client.describe_jobs(jobs=[job_id])["jobs"][0][
        "status"
    ]
    lightly_scheduled_run_info = client.get_compute_worker_run_info(
        scheduled_run_id=scheduled_run_id
    )
    current_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    print(
        f"{current_time} - Current status of AWS Batch job: {aws_batch_job_status} and state of LightlyOne Worker run: {lightly_scheduled_run_info.state}"
    )
    if aws_batch_job_status in ["RUNNING"]:
        print("LightlyOne Worker is running and looking for scheduled runs.")
        print(
            "Check the Worker status in the LightlyOne Platform: https://app.lightly.ai/compute/workers"
        )
    elif aws_batch_job_status in ["SUCCEEDED", "FAILED"]:
        print(f"AWS Batch job {aws_batch_job_status}. Exiting status check loop.")
        break
    if lightly_scheduled_run_info.in_end_state():
        print(
            f"LightlyOne Worker run finished with state {lightly_scheduled_run_info.state}. Exiting status check loop."
        )
        break
    time.sleep(5)  # Sleep for 5 seconds before next check
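The status loop above runs until the job or the run reaches an end state, which can take very long if a job is stuck in `RUNNABLE`. One possible variation, sketched below, stops polling after a timeout. The function name `wait_for_batch_job` and its parameters are hypothetical. It only looks at the AWS Batch job status and uses the same `describe_jobs` call as the script.

```python
import time

END_STATES = ("SUCCEEDED", "FAILED")


def wait_for_batch_job(batch_client, job_id, timeout_s=3600, poll_s=5, sleep=time.sleep):
    """Poll an AWS Batch job until it ends or the timeout expires.

    Returns the last observed status. On timeout this is not an end state.
    """
    waited = 0
    while True:
        status = batch_client.describe_jobs(jobs=[job_id])["jobs"][0]["status"]
        if status in END_STATES or waited >= timeout_s:
            return status
        sleep(poll_s)
        waited += poll_s
```

A caller can then decide what to do when the returned status is still `RUNNABLE`, for example print a hint to check the compute environment.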
Implementation recommendations
Monitoring options
Apart from monitoring via the Python script, you can also monitor the state of LightlyOne scheduled runs in the LightlyOne Platform. Go to My Workers to see the idling and running LightlyOne Workers.
For monitoring AWS Batch jobs, you can also use the AWS Batch jobs overview.
Processing multiple LightlyOne Worker Scheduled Runs
When processing multiple LightlyOne Scheduled Runs, you can decide on the most cost-effective or fastest solution.
Cost-effective solution
For processing multiple scheduled runs cost-effectively, we recommend the following setup:
- Only start one AWS batch job. This saves time and cost for spinning up the instance and downloading the LightlyOne Worker docker container.
- Schedule multiple LightlyOne Worker runs.
- For the last and only the last scheduled run: set worker_config={"shutdown_when_job_finished": True}.
This will make the AWS batch job process the scheduled runs and shut down once the last one is finished.
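The setup above could look roughly as follows. This is a sketch, not the official recipe: the function name `schedule_runs_for_one_job` is hypothetical, and it assumes that the worker processes the runs in the order they were scheduled. Every run is scheduled for the dataset currently selected on the client, so switching datasets between runs is left out here.

```python
def schedule_runs_for_one_job(client, selection_configs, worker_labels):
    """Schedule one run per selection config; only the last one shuts the worker down."""
    scheduled_run_ids = []
    last_index = len(selection_configs) - 1
    for index, selection_config in enumerate(selection_configs):
        # Only the last run carries the shutdown flag, so the worker keeps
        # polling for further runs until the final one has finished.
        worker_config = (
            {"shutdown_when_job_finished": True} if index == last_index else None
        )
        run_id = client.schedule_compute_worker_run(
            worker_config=worker_config,
            selection_config=selection_config,
            runs_on=worker_labels,
        )
        scheduled_run_ids.append(run_id)
    return scheduled_run_ids
```

After scheduling, a single AWS Batch job is submitted as in the script, and it ends once the last run is done.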
Fastest solution
For higher processing speed, you can also submit multiple AWS Batch jobs simultaneously to process multiple LightlyOne scheduled runs in parallel. Don't forget to set worker_config={"shutdown_when_job_finished": True} for every scheduled run.
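A parallel setup could be sketched as below: each run gets its own worker, label, job definition and AWS Batch job, so that no worker picks up another worker's run. The function name `launch_parallel_run` is hypothetical. The API calls and the resource requirements are taken from the script above, and the dataset handling is again left out.

```python
def launch_parallel_run(client, batch_client, token, job_queue_arn, unique_id, selection_config):
    """Register a dedicated worker, schedule one run for it and submit one AWS Batch job."""
    labels = [unique_id]
    worker_id = client.register_compute_worker(
        name=f"aws-batch-worker-{unique_id}", labels=labels
    )
    # Every run shuts its own worker down, so every AWS Batch job ends by itself.
    scheduled_run_id = client.schedule_compute_worker_run(
        worker_config={"shutdown_when_job_finished": True},
        selection_config=selection_config,
        runs_on=labels,
    )
    job_definition_name = f"aws-batch-lightly--job-definition--{unique_id}"
    batch_client.register_job_definition(
        jobDefinitionName=job_definition_name,
        type="container",
        containerProperties={
            "image": "lightly/worker:latest",
            "resourceRequirements": [
                {"type": "MEMORY", "value": "32768"},
                {"type": "VCPU", "value": "8"},
            ],
            "command": [f"token={token}", f"worker.worker_id={worker_id}"],
        },
    )
    job_id = batch_client.submit_job(
        jobName=f"aws-batch-lightly--job--{unique_id}",
        jobQueue=job_queue_arn,
        jobDefinition=job_definition_name,
    )["jobId"]
    return scheduled_run_id, job_id
```

Calling it once per run with a different `unique_id` (for example the timestamp from the script plus a counter) starts one instance per run, provided the compute environment allows that many instances.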
Fixing LightlyOne Worker version
As the above script always loads the lightly/worker:latest Docker image from Docker Hub, it always uses the newest LightlyOne Worker version. For reproducibility, consider pinning a specific version, e.g., by using lightly/worker:X.X.X instead. For the latest version number, please see our changelog.
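A small helper can make the pinned version explicit in the script and catch an accidental `latest`. The helper name `worker_image` is hypothetical, and it assumes that version tags follow the X.X.X pattern mentioned above.

```python
import re

# Assumption: worker versions are tagged as three dot-separated numbers.
_VERSION_PATTERN = re.compile(r"\d+\.\d+\.\d+")


def worker_image(version):
    """Return the Docker image reference for a pinned LightlyOne Worker version."""
    if not _VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"Expected a version like X.X.X, got {version!r}")
    return f"lightly/worker:{version}"
```

The result would then replace the `"image"` value in the job definition, e.g. `"image": worker_image(MY_WORKER_VERSION)`, where `MY_WORKER_VERSION` is a placeholder for a version taken from the changelog.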