Hardware Recommendations

The LightlyOne Worker should run on dedicated hardware to guarantee quick and stable data processing. Most cloud providers offer powerful instances that can be switched on and off depending on the workload. The table below shows hardware requirements and example instances of prominent cloud providers.

Input Images	System Memory	vCPUs	GPU	EC2 Instance	GCP Instance	Azure Instance
< 1'000'000	32GB	8	T4	g4dn.2xlarge	n1-standard-8	Standard_NC8as_T4_v3
< 10'000'000	64GB	16	T4	g4dn.4xlarge	n1-standard-16	Standard_NC16as_T4_v3
> 10'000'000	128GB	32	T4	g4dn.8xlarge	n1-standard-32	Standard_NC64as_T4_v3
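
The table can also be read as a simple lookup. The sketch below encodes the three rows as data and picks one by input size. The names `HardwareTier` and `recommend_hardware` are hypothetical helpers for illustration and are not part of any LightlyOne API. The table does not say which row applies to exactly 10'000'000 images, so the sketch assumes the largest tier in that case.

```python
from typing import NamedTuple


class HardwareTier(NamedTuple):
    """One row of the hardware table (hypothetical helper type)."""
    memory_gb: int
    vcpus: int
    gpu: str
    ec2: str
    gcp: str
    azure: str


# Values copied from the table above.
SMALL = HardwareTier(32, 8, "T4", "g4dn.2xlarge", "n1-standard-8", "Standard_NC8as_T4_v3")
MEDIUM = HardwareTier(64, 16, "T4", "g4dn.4xlarge", "n1-standard-16", "Standard_NC16as_T4_v3")
LARGE = HardwareTier(128, 32, "T4", "g4dn.8xlarge", "n1-standard-32", "Standard_NC64as_T4_v3")


def recommend_hardware(num_input_images: int) -> HardwareTier:
    """Return the table row matching the number of input images."""
    if num_input_images < 0:
        raise ValueError("num_input_images must not be negative")
    if num_input_images < 1_000_000:
        return SMALL
    if num_input_images < 10_000_000:
        return MEDIUM
    # Assumption: exactly 10'000'000 images is treated like the largest tier.
    return LARGE


if __name__ == "__main__":
    tier = recommend_hardware(2_500_000)
    print(tier.ec2, tier.memory_gb, tier.vcpus)
```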

For training self-supervised models or for faster inference, we recommend a V100, an A10, or a better GPU.

📘
Cloud Resource Quota
Requesting GPU resources on AWS, GCP, or Azure can take up to 72 hours. We recommend requesting a quota increase early, even if the resource will only be used later.

Operating System

For cloud instances, we recommend a deep learning image with GPU support that is optimized for PyTorch. For example, on AWS, we recommend Deep Learning AMI GPU PyTorch 1.13.1 (Amazon Linux 2) 20230104 (or similar). These images are regularly updated.

Runtime Estimates

The expected runtime of the LightlyOne Worker depends on the exact configuration. The table below shows the times measured on a g4dn.2xlarge instance on AWS EC2 for a few example datasets and configurations.

Dataset	Crop Objects	Input Type	Input Images	Selected Images	EC2 Instance	Expected Runtime
Openimages	No	Images	100'000	10'000	g4dn.2xlarge	< 60min
Berkeley DeepDrive	No	Videos	150'160	14'966	g4dn.2xlarge	< 60min
Comma10k	Yes	Images	10'000	1'977	g4dn.2xlarge	< 20min

Cost Estimates

Most cloud providers offer a price calculator to estimate costs based on instance uptime and expected egress.

Instance cost estimate

As seen in the Runtime Estimates above, the LightlyOne Worker needs less than 1h to process 100'000 images from the Openimages dataset. Since a g4dn.2xlarge instance on AWS EC2 costs about $0.75/h as a spot instance, processing the whole dataset costs less than $0.75.
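
The same arithmetic can be written as a small helper, which makes it easy to plug in other runtimes or prices. This is only a sketch: `estimate_instance_cost` is a hypothetical name, the $0.75/h spot price is the approximate figure quoted above, and real bills also depend on billing granularity, storage, and egress.

```python
def estimate_instance_cost(runtime_hours: float, hourly_price_usd: float) -> float:
    """Estimate compute cost as runtime multiplied by the hourly instance price."""
    if runtime_hours < 0 or hourly_price_usd < 0:
        raise ValueError("runtime and price must not be negative")
    return runtime_hours * hourly_price_usd


if __name__ == "__main__":
    # Upper bound for the Openimages example: less than 1h at about $0.75/h.
    upper_bound = estimate_instance_cost(runtime_hours=1.0, hourly_price_usd=0.75)
    print(f"Upper bound: ${upper_bound:.2f}")
```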

Egress costs

Streaming data from cloud storage (AWS S3, Google Cloud Storage, Azure) using the datasource feature can lead to high egress costs and slow down data loading. To prevent this, we highly recommend the cloud storage and compute instance be in the same region.

GPU

Supported GPUs

LightlyOne Worker supports all Nvidia GPUs with CUDA support.

CPU only

Although not recommended, it's possible to run the LightlyOne Worker without a GPU. This can be especially useful for dry runs and quick iteration, for example when configuring the LightlyOne Worker for the first time. Processing will be significantly slower, especially for large datasets. To start the worker on a CPU-only machine, remove the --gpus all option from the docker run command.

vCPUs

We recommend at least eight vCPUs to make use of multiprocessing and multithreading. By default, the LightlyOne Worker selects a reasonable number of processes and threads based on the number of available vCPUs. The number of processes and threads can be configured with the following arguments:

num_processes (defaults to -1)
num_threads (defaults to -1)

📘
Maximum Number of Processes/Threads
When the LightlyOne Worker selects the number of processes or threads based on the number of vCPUs, it never goes above 32 processes or 64 threads. To exceed these limits, set the number of processes and threads explicitly with the configuration options num_processes and num_threads.

RAM

We recommend having 4GB of memory per vCPU. If you have too little memory, we recommend reducing the number of processes and threads such that there is 4GB of memory per process and 2GB per thread. For example, if you have 32GB of RAM in total, set num_processes=8 and num_threads=16 when running the worker. See the section on vCPUs above for details.
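
The rule of thumb above can be turned into a quick calculation for other memory sizes. The following is a minimal sketch, assuming the 4GB-per-process and 2GB-per-thread figures apply as stated. `settings_for_memory` is a hypothetical helper and not part of the LightlyOne Worker. It only computes the two values, which you would then pass as num_processes and num_threads in your worker configuration.

```python
from typing import Dict

GB_PER_PROCESS = 4  # from the recommendation above
GB_PER_THREAD = 2   # from the recommendation above


def settings_for_memory(total_ram_gb: int) -> Dict[str, int]:
    """Derive num_processes and num_threads from the total system memory in GB."""
    if total_ram_gb < GB_PER_PROCESS:
        raise ValueError("need at least 4GB of RAM for a single process")
    return {
        "num_processes": total_ram_gb // GB_PER_PROCESS,
        "num_threads": total_ram_gb // GB_PER_THREAD,
    }


if __name__ == "__main__":
    # Reproduces the 32GB example: 8 processes and 16 threads.
    print(settings_for_memory(32))
```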

Running the LightlyOne Worker on Local Machines

Although not recommended, it is also possible to run the LightlyOne Worker on a local (non-cloud) machine. The recommended specs are similar. Make sure you have at least 4 cores and 16GB of system memory. A recent consumer GPU from Nvidia (GeForce GTX 1000 series or newer) should be sufficient.

We recommend using Linux, but some customers have also run the LightlyOne Worker successfully on Windows machines using WSL2. For the time being, you might need to modify our Docker image to make this work.

Find the Compute Speed Bottleneck

The speed of the LightlyOne Worker can be limited by one of three potential bottlenecks. Different steps of the LightlyOne Worker use these resources to different extents, so the bottleneck can change throughout the run.

The potential bottlenecks can be:

GPU
I/O (How fast can data be loaded from the cloud bucket?)
CPU

The GPU is used during three steps:

training an embedding model (optional step)
pretagging your data (optional step)
embedding your data

I/O and CPUs are used during the three steps above and also during the following steps, which may take longer:

initializing the dataset
corruptness check
dataset sync with the LightlyOne Platform

Before changing the hardware configuration of your compute instance, we recommend first determining the bottleneck by monitoring it:

You can see the network usage using the terminal command ifstat.
You can find out your machine's current CPU and RAM usage using the terminal commands top or htop.
You can find out the current GPU usage (both compute and VRAM) using the terminal command watch nvidia-smi.
Note that you might need to install these commands using your package manager.
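
If you prefer to log GPU usage from a script instead of watching the terminal, the sketch below polls nvidia-smi through its CSV query interface and parses the result. The function names are hypothetical. The parser assumes every queried field is a plain number, which may not hold on GPUs that report a value as unsupported.

```python
import shutil
import subprocess
from typing import List, Tuple

# nvidia-smi query for GPU utilization (%) and memory used/total (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(output: str) -> List[Tuple[int, int, int]]:
    """Parse one 'utilization, used, total' line per GPU into integer tuples."""
    stats = []
    for line in output.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append((util, used, total))
    return stats


def read_gpu_stats() -> List[Tuple[int, int, int]]:
    """Return current GPU stats, or an empty list if nvidia-smi is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for index, (util, used, total) in enumerate(read_gpu_stats()):
        print(f"GPU {index}: {util}% compute, {used}/{total} MiB VRAM")
```

A GPU that stays near full utilization during the embedding step suggests the GPU is the limit, while low utilization points towards I/O or CPU.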

In addition to using these tools, you can compare the relative duration of the different steps to find the bottleneck. For example, if the embedding step takes much longer than the corruptness check, then the GPU is the bottleneck. Otherwise, it is the I/O or CPU.
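
The duration comparison can be expressed as a short check. This is a sketch under stated assumptions: `likely_bottleneck` is a hypothetical helper, the step durations are values you read from your own run, and the factor of 2 used for "much longer" is an arbitrary choice that the text above does not specify.

```python
def likely_bottleneck(
    embedding_seconds: float,
    corruptness_check_seconds: float,
    ratio_threshold: float = 2.0,  # assumption: "much longer" means at least 2x
) -> str:
    """Guess the bottleneck from the durations of two worker steps."""
    if embedding_seconds < 0 or corruptness_check_seconds < 0:
        raise ValueError("durations must not be negative")
    if embedding_seconds > ratio_threshold * corruptness_check_seconds:
        return "GPU"
    return "I/O or CPU"


if __name__ == "__main__":
    # Example durations in seconds, made up for illustration.
    print(likely_bottleneck(embedding_seconds=1200, corruptness_check_seconds=200))
    print(likely_bottleneck(embedding_seconds=300, corruptness_check_seconds=250))
```

When the result is "I/O or CPU", the monitoring commands listed above help to tell the two apart.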