The Lightly Worker has an integrated reporting component that provides plots, statistics, and other information collected during the various processing steps, to facilitate sustainability and reproducibility in machine learning. For example, it covers the corruptness check, the embedding process, and the selection process.

Lightly puts the essential information into an automatically generated PDF report to make it easier for you to understand and discuss the dataset. You can download it for all completed worker runs from the runs page in the Lightly Platform.

Download report from the Lightly Platform.

The report is also available as a report.json file. Both versions can be downloaded with the Lightly Python client:

from lightly.api import ApiWorkflowClient

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN", dataset_id="MY_DATASET_ID")

scheduled_run_id = client.schedule_compute_worker_run(...)

# Get the scheduled run given its id.
run = client.get_compute_worker_run_from_scheduled_run(scheduled_run_id=scheduled_run_id)
# Download the report as pdf and json files.
client.download_compute_worker_run_report_pdf(run=run, output_path="my_run/artifacts/report.pdf")
client.download_compute_worker_run_report_json(run=run, output_path="my_run/artifacts/report.json")

# Alternatively, get all runs for a given dataset_id.
runs = client.get_compute_worker_runs(dataset_id=client.dataset_id)
run = runs[-1]  # get the latest run

# Download the artifacts as before.
client.download_compute_worker_run_report_pdf(run=run, output_path="my_run/artifacts/report.pdf")
client.download_compute_worker_run_report_json(run=run, output_path="my_run/artifacts/report.json")

The report contains information about the number of available, corrupt, duplicate, and selected images. It also includes information on selected and discarded samples and different diagrams showing the difference between them with respect to specific properties.
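To get a first look at what a downloaded report.json contains, a minimal sketch like the following may help. It assumes nothing about the report's schema: the helper name `summarize_report` is hypothetical and not part of the Lightly client, and it only lists the top-level keys that happen to be present in the file.

```python
import json
from pathlib import Path


def summarize_report(report_path):
    """Return the sorted top-level keys of a downloaded report.json file.

    Hypothetical helper for illustration; the report schema is not assumed.
    """
    with Path(report_path).open("r", encoding="utf-8") as f:
        report = json.load(f)
    # Only report what is actually present in the file.
    if isinstance(report, dict):
        return sorted(report.keys())
    return []


if __name__ == "__main__":
    # Path taken from the download example above.
    print(summarize_report("my_run/artifacts/report.json"))
```

Listing the keys first avoids hard-coding field names that may differ between Lightly Worker versions.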

Histograms and Plots

The report contains histograms of the pairwise distance between images before and after the selection process.
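As a rough illustration of what a pairwise distance histogram is, the sketch below computes one with NumPy from an array of embeddings. This is not how the Lightly Worker produces its plots: the function names are hypothetical, the random embeddings stand in for real ones, and the use of Euclidean distance is an assumption, since the distance metric is not stated here.

```python
import numpy as np


def pairwise_distances(embeddings):
    """Return the Euclidean distances between all unique pairs of rows."""
    x = np.asarray(embeddings, dtype=float)
    # Broadcast to an (n, n, d) array of differences, then reduce over d.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (upper triangle, excluding the diagonal).
    i, j = np.triu_indices(len(x), k=1)
    return dist[i, j]


def distance_histogram(embeddings, bins=20):
    """Return histogram counts and bin edges of the pairwise distances."""
    distances = pairwise_distances(embeddings)
    counts, edges = np.histogram(distances, bins=bins)
    return counts, edges


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(50, 8))  # placeholder data, not real embeddings
    counts, edges = distance_histogram(embeddings)
    print(counts)
```

The broadcasted difference needs memory proportional to n² times the embedding dimension, so this form is only suitable for small datasets.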

An example of such a histogram before and after filtering for the CamVid dataset consisting of 367 samples is shown below. The region of particular interest is marked with an orange rectangle. The goal is to make this histogram more symmetric by removing samples that lie at short distances from each other.

If we remove 25 samples (7%) out of the 367 samples of the CamVid dataset, the histogram looks more symmetric, as shown below. In our experiments, removing 7% of the dataset resulted in a model with higher validation set accuracy.

📘

Why symmetric histograms are preferred: An asymmetric histogram can result from a dataset with either outliers or inliers. A heavy tail for low distances means that there is at least one high-density region with many samples very close to each other within the main cluster. Having such a high-density region can lead to biased models trained on this particular dataset. A heavy tail towards high distances shows that there is at least one high-density region outside the main cluster of samples.
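One way to put a number on histogram asymmetry is the sample skewness of the distances. The sketch below uses the standard moment-based (Fisher-Pearson) coefficient; it is a general statistics tool chosen for illustration, not a metric the report is known to compute, and the function name `skewness` is hypothetical. A value near zero indicates a roughly symmetric distribution, a negative value a heavier tail towards low distances, and a positive value a heavier tail towards high distances.

```python
import numpy as np


def skewness(values):
    """Return the moment-based sample skewness of a 1-D collection of values."""
    v = np.asarray(values, dtype=float)
    centered = v - v.mean()
    std = v.std()
    if std == 0:
        # All values are identical, so there is no asymmetry to measure.
        return 0.0
    # Third central moment divided by the cubed standard deviation.
    return float((centered ** 3).mean() / std ** 3)


if __name__ == "__main__":
    print(skewness([1.0, 2.0, 3.0]))         # symmetric sample
    print(skewness([0.0, 0.0, 0.0, 10.0]))   # tail towards high values
    print(skewness([0.0, 10.0, 10.0, 10.0])) # tail towards low values
```

Comparing this value for the pairwise distances before and after selection gives a simple check of whether the selection moved the histogram towards symmetry.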