Trigger a Docker Job from from the Platform or code

Warning

The Docker Archive documentation is deprecated

The old workflow described in these docs will not be supported with new Lightly Worker versions above 2.6. Please switch to our new documentation page instead.

Introduction

The Lightly workflow can be completely automated by using the Lightly Docker as a background worker. In this mode, the Lightly Docker waits for incoming jobs through the API. After receiving a job, it will download, embed, and subsample the provided dataset. The results are immediately available in the webapp for visualization and the selected samples are sent back to your cloud bucket.

Advantages

  • You can submit jobs through the API, fully automating the Lightly workflow.

  • You can automatically trigger a new job when data is added to your dataset.

  • Use Lightly docker a background worker and processes new jobs automatically.

Download the Lightly Docker

Please follow the instructions for the Setup.

Register the Lightly Docker as a Worker

To control the Lightly Docker from the API you have to register it as a worker. You can simply go to the Lightly web app and click on My Docker Runs –> My Compute Workers. Or just click on the direct link here: Docker Workers

../../_images/docker_workers_overview_empty.png

Click on “Register” in the bottom right corner and enter a name for your worker. After confirmation the worker should show up in the worker list.

../../_images/docker_workers_overview_registered.png

Copy the worker id and head over to your terminal. You can now start the docker with the worker id and it will connect to the API and wait for jobs. To start the docker execute the following command:

docker run --gpus all --rm -it \
    -v ${OUTPUT_DIR}:/home/output_dir \
    lightly/worker:latest \
    token=${YOUR_LIGHTLY_PLATFORM_TOKEN} \
    worker.worker_id=${YOUR_WORKER_ID}

The state of the worker on the Docker Workers page should now indicate that the worker is in an idle state.

Create a Dataset and Trigger a Job

There are two ways to trigger a new job. You can either use the user interface provided through our Web App or you can use our Python package and build a script.

Create a Dataset

This recipe requires that you already have a dataset in the Lightly Platform configured to use the data in your cloud bucket. Create such a dataset in two steps:

  1. Create a new dataset in Lightly. Make sure that you choose the input type Images or Videos correctly, depending on the type of files in your cloud storage bucket.

  2. Edit your dataset, select your cloud storage provider as your datasource and fill out the form. In our example we use an S3 bucket.

    Lightly S3 connection config

    Lightly S3 connection config

    If you don’t know how to fill out the form, follow the full tutorial to create a Lightly dataset connected to your bucket: AWS Secure Storage Solution (S3), Google Cloud Storage (GCS), Azure Blob Storage (Azure).

And now we can schedule a new job.

Trigger the Job

To trigger a new job you can click on the schedule run button on the dataset overview as shown in the screenshot below:

../../_images/schedule-compute-run.png

After clicking on the button you will see a wizard to configure the parameters for the job.

../../_images/schedule-compute-run-config.png

In our example we use the following parameters.

Docker Config
{
  enable_corruptness_check: true,
  remove_exact_duplicates: true,
  enable_training: false,
  pretagging: false,
  pretagging_debug: false,
  method: 'coreset',
  stopping_condition: {
    n_samples: 0.1,
    min_distance: -1
  },
  scorer: 'object-frequency',
  scorer_config: {
    frequency_penalty: 0.25,
    min_score: 0.9
  },
  active_learning: {
    task_name: '',
    score_name: 'uncertainty_margin'
  }
}
Lightly Config
{
  loader: {
    batch_size: 16,
    shuffle: true,
    num_workers: -1,
    drop_last: true
  },
  model: {
    name: 'resnet-18',
    out_dim: 128,
    num_ftrs: 32,
    width: 1
  },
  trainer: {
    gpus: 1,
    max_epochs: 100,
    precision: 32
  },
  criterion: {
    temperature: 0.5
  },
  optimizer: {
    lr: 1,
    weight_decay: 0.00001
  },
  collate: {
    input_size: 64,
    cj_prob: 0.8,
    cj_bright: 0.7,
    cj_contrast: 0.7,
    cj_sat: 0.7,
    cj_hue: 0.2,
    min_scale: 0.15,
    random_gray_scale: 0.2,
    gaussian_blur: 0.5,
    kernel_size: 0.1,
    vf_prob: 0,
    hf_prob: 0.5,
    rr_prob: 0
  }
}

Once the parameters are set you can schedule the run using a click on schedule.

View the progress of the Lightly Docker

To see the progress of your docker run, go to the Lightly Platform and head to My Docker Runs

../../_images/docker_runs_overview.png

Use your selected dataset

Once the docker run has finished, you can see your selected dataset in the Lightly Platform:

../../_images/webapp-explore-after-docker.jpg

In our case, we had 4 short street videos with about 1000 frames each in our cloud storage bucket and selected 50 frames from it. Now you can analyze your dataset in the embedding and metadata view of the Lightly Platform, subsample it further, or export it for labeling. In our case we come to the conclusion that the raw data we have does not cover enough cases and thus decide that we want to first collect more street videos.

Process new data in your bucket using a datapool

Over time you most likely will be receiving new raw data from your various sources and add these to your cloud storage bucket. In our case we added 4 additional street videos to the bucket. The new raw data might include samples which should be added to your dataset in the Lightly Platform, so you want to add a subset of them to your dataset.

This workflow is supported by the Lightly Platform using a datapool. It remembers which raw data in your bucket has already been processed and will ignore it in future docker runs. Thus you can send the same job again to the worker. It will find your new raw data in the bucket, stream, embed and subsample it and then add it to your existing dataset. The selection strategies will take the existing data in your dataset into account when selecting new data to be added to your dataset.

../../_images/webapp-embedding-after-2nd-docker1.png

After the docker run we can go to the embedding view of the Lightly Platform to see the newly added samples there in a new tag. We see that the new samples (in green) fill some gaps left by the images in the first iteration (in grey). However, there are still some gaps left, which could be filled by adding more videos to the bucket and running the docker again.

This workflow of iteratively growing your dataset with the Lightly Docker has the following advantages:

  • You can learn from your findings after each iteration to know which raw data you need to collect next.

  • Only your new data is processed, saving you time and compute cost.

  • You don’t need to configure anything, just run the same job again.

  • Only samples which are different to the existing ones are added to the dataset.

If you want to search all data in your bucket for new samples instead of only newly added data, then set datasource.process_all=True in your docker run command.