Class-Balancing a Dataset Using Predictions From Detectron2

In this tutorial, you will select images from your dataset based on the diversity of the objects they contain. It applies the concepts of object balancing to a concrete example.

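You will learn how to set up a dataset whose datasource contains predictions, schedule a selection that balances the predicted classes, and monitor the run and download its report.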

Prerequisites

To follow this tutorial, you will need the following:

  • The lightly Python package, which you can install with:

    pip install lightly

  • A configured datasource with predictions. You can find a tutorial on how to do that under Work with Predictions. This tutorial is intended as an extension of that tutorial.

Start the Lightly Worker

Start the Lightly Worker in waiting mode. In this mode, the worker will long-poll the Lightly API for new jobs to process.

docker run --shm-size="1024m" --gpus all --rm -it \
    -e LIGHTLY_TOKEN={MY_LIGHTLY_TOKEN} \
    lightly/worker:latest \
    worker.worker_id={MY_WORKER_ID}

Set Up a Dataset and Link It to Your Datasource

If you followed all the prerequisites, you should already have a datasource with predictions in your preferred cloud infrastructure. If not, you can follow the documentation page Set up your first Dataset to set up your dataset. In this tutorial, AWS S3 is used. You can create a dataset from the Python client using this script:

from lightly.api import ApiWorkflowClient
from lightly.openapi_generated.swagger_client import DatasetType

# Create the Lightly client to connect to the API.
client = ApiWorkflowClient(token="MY_LIGHTLY_TOKEN")

# Create a new dataset on the Lightly Platform. We name it comma10k for continuity
# with the tutorial "Adding Predictions"
client.create_dataset(dataset_name="comma10k", dataset_type=DatasetType.IMAGES)
dataset_id = client.dataset_id
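
The script above passes the token as a string literal. As a minimal sketch, you could instead read it from an environment variable so that it does not end up in your source code. The helper name `read_lightly_token` and the variable name `LIGHTLY_TOKEN` are assumptions made for this example (the name mirrors the variable passed to the Docker command); neither is part of the Lightly client.

```python
import os


def read_lightly_token(env_var: str = "LIGHTLY_TOKEN") -> str:
    """Return the API token stored in the given environment variable.

    Hypothetical helper for illustration; not part of the lightly package.
    Raises a RuntimeError if the variable is unset or empty.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this script."
        )
    return token
```

You would then create the client with `ApiWorkflowClient(token=read_lightly_token())` instead of a hard-coded string.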

After creating the dataset, you can configure the datasource for it:

from lightly.openapi_generated.swagger_client import DatasourcePurpose

# Configure the Input datasource.
client.set_s3_config(
    resource_path="s3://bucket/input/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.INPUT,
)
# Configure the Lightly datasource.
client.set_s3_config(
    resource_path="s3://bucket/lightly/",
    region="eu-central-1",
    access_key="S3-ACCESS-KEY",
    secret_access_key="S3-SECRET-ACCESS-KEY",
    purpose=DatasourcePurpose.LIGHTLY,
)

Balancing by Predictions

Lightly can balance the selected images by their predictions. This lets you oversample underrepresented classes and debias your dataset. To apply a balancing strategy, recall the category names specified in the prediction schema. The prerequisite tutorial used the COCO classes:

coco_classes = [
        "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", 
        "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
        "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella",
        "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball", "kite", "baseball bat",
        "baseball glove", "skateboard", "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork", 
        "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", 
        "donut", "cake", "chair", "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", 
        "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", 
        "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

The following selection strategy tells the Lightly Worker to make a selection such that every predicted class makes up an equal proportion of the resulting dataset.

scheduled_run_id = client.schedule_compute_worker_run(
    worker_config={},
    selection_config={
        "n_samples": 100,
        "strategies": [
            {
                "input": {
                    "type": "PREDICTIONS",
                    "name": "CLASS_DISTRIBUTION",
                    "task": "object_detection_comma10k",
                },
                "strategy": {
                    "type": "BALANCE",
                    "target": {
                        class_name: 1 / len(coco_classes) for class_name in coco_classes
                    },
                },
            }
        ],
    },
)

You can find more information about selection strategies in Customize a Selection.
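
The `target` in the strategy above is built with a dictionary comprehension that gives every class the same share. The following sketch shows what that comprehension evaluates to, using a shortened stand-in list so the result is easy to read. The names `example_classes` and `uniform_target` are illustrative only and are not part of the Lightly API.

```python
# Stand-in for the 80-entry coco_classes list, shortened for illustration.
example_classes = ["person", "bicycle", "car", "motorcycle"]


def uniform_target(class_names):
    """Map every class name to the same target proportion."""
    return {name: 1 / len(class_names) for name in class_names}


target = uniform_target(example_classes)
print(target)
# {'person': 0.25, 'bicycle': 0.25, 'car': 0.25, 'motorcycle': 0.25}
```

With the full list of 80 COCO classes, the same expression assigns each class a target of 1/80, or 0.0125.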

Monitor the Run and Download the Results

The Lightly Worker will pick up the run and start working on it within a few seconds. The status of the current run and other scheduled runs can be seen on the runs view of the Lightly Platform. Alternatively, you can also monitor it from Python:

# You can use this code to track and print the state of the Lightly Worker.
# The loop will end once the run has finished, was canceled, or failed.
print(scheduled_run_id)
for run_info in client.compute_worker_run_info_generator(scheduled_run_id=scheduled_run_id):
    print(f"Lightly Worker run is now in state='{run_info.state}' with message='{run_info.message}'")

if run_info.ended_successfully():
    print("SUCCESS")
else:
    print("FAILURE")

Lightly puts the essential information about the selection process into an automatically generated PDF report to make it easier for you to understand your dataset before and after the selection. You can download it for all completed worker runs from the runs page in the Lightly Platform, or you can use this script to download it with the Lightly Python client:

# Get the scheduled run given its id.
run = client.get_compute_worker_run_from_scheduled_run(scheduled_run_id=scheduled_run_id)
# Download the report as pdf and json files.
client.download_compute_worker_run_report_pdf(run=run, output_path="my_run/artifacts/report.pdf")
client.download_compute_worker_run_report_json(run=run, output_path="my_run/artifacts/report.json")

In the report, you will find a histogram of the predicted classes before and after the selection by Lightly. It should look similar to the image below. You can see that the classes that are underrepresented in the input dataset are more frequent in the output. Of course, certain classes, such as zebras, never appear in the comma10k dataset and therefore cannot be oversampled.

Change in object distribution before and after the selection.
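
To see in advance which classes cannot be oversampled, you could count how often each class is predicted in the input. The sketch below is a rough illustration and assumes that each prediction file has been loaded as a dictionary with a `"predictions"` list whose entries carry a `"category_id"` that indexes into the class list. That structure is an assumption about the prediction format from the prerequisite tutorial, and the function names and sample data are made up for this example.

```python
def count_predicted_classes(prediction_files, class_names):
    """Count the predicted objects per class, keeping classes with zero objects."""
    counts = {name: 0 for name in class_names}
    for prediction_file in prediction_files:
        for prediction in prediction_file["predictions"]:
            # Assumption: category_id is an index into class_names.
            counts[class_names[prediction["category_id"]]] += 1
    return counts


def missing_classes(counts):
    """Return the classes that were never predicted and cannot be oversampled."""
    return [name for name, count in counts.items() if count == 0]


# Made-up sample data with three classes and two images.
sample_classes = ["person", "car", "zebra"]
sample_files = [
    {"file_name": "a.png", "predictions": [{"category_id": 0}, {"category_id": 1}]},
    {"file_name": "b.png", "predictions": [{"category_id": 0}]},
]

counts = count_predicted_classes(sample_files, sample_classes)
print(counts)                   # {'person': 2, 'car': 1, 'zebra': 0}
print(missing_classes(counts))  # ['zebra']
```

Classes reported by `missing_classes` stay at zero in the output no matter which target you set, which matches what the histogram shows for zebras.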