Corruptness Check

When processing large raw datasets, some of the files are often broken. The Lightly Worker has a built-in capability to deal with what we call "corrupt" images or frames.

The corruptness check is done independently of where the data is stored. To figure out which files were marked as corrupted by the Lightly Worker, please see Corruptness Check Information.

Why is this required?

Whenever you process a dataset, the Lightly Worker needs to list and index the configured datasource to ensure it knows which files should be processed. If you're working with videos, we keep track of which frames might be broken and which frames are fine to use for processing.

Corruptness Check for Images

When working with images, the following tests are performed during the corruptness check:

  • File can be accessed and opened
  • Image can be decoded and is a valid image file (readable by Pillow)
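The two tests above can be sketched with Pillow as follows. This is a minimal illustration under assumptions, not the Lightly Worker's actual implementation; the function name is_image_readable is hypothetical.

```python
from PIL import Image


def is_image_readable(path):
    """Return True if the file can be opened and fully decoded by Pillow.

    Hypothetical helper for illustration only.
    """
    try:
        # Image.open only reads the header, so call load() to force a full decode.
        with Image.open(path) as img:
            img.load()
        return True
    except (OSError, SyntaxError, ValueError):
        # Missing files, unknown formats and truncated data end up here.
        return False
```

Calling load() matters in this sketch because Image.open is lazy: a truncated file can pass the open step and only fail once the pixel data is decoded.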

Corruptness Check for Videos

When working with videos, we perform the image-level corruptness check for every frame. Additionally, we perform the following checks:

  • Video can be accessed and opened
  • Video can be decoded and is a valid file
  • Video frame can be decoded
  • Video frame is not broken (see below)

Broken Video Frames

When working with static cameras, there might be errors in some of the frames. These errors can appear due to various issues (rolling shutter that did not have enough time to capture the full image, sensor or hardware of the camera being damaged, etc.). The Lightly Worker has a simple but effective built-in detection system for finding common issues that result in fixed color stripes and artifacts in a video frame.

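The Lightly Worker uses FFmpeg internally to decode videos. If you've ever used FFmpeg to play videos, you know that it can decode and play almost everything, even if there are missing pieces :) A frame can therefore decode without errors and still contain visible artifacts, which is why the broken-frame check goes beyond decodability.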

Broken video frame example
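To make the idea of a per-frame score for stripe artifacts concrete, here is a toy heuristic. How the Lightly Worker computes its score internally is not described here, so everything in this sketch is an assumption: the function name uniform_row_fraction is hypothetical, and frames are represented as plain nested lists of grayscale values.

```python
def uniform_row_fraction(frame):
    """Toy score in [0, 1]: share of rows whose pixels are all identical.

    `frame` is a list of rows, each row a list of pixel values.
    This is an illustrative heuristic, not the Lightly Worker's method.
    """
    if not frame:
        return 0.0
    uniform_rows = sum(1 for row in frame if len(set(row)) <= 1)
    return uniform_rows / len(frame)


healthy = [[10, 52, 31, 77], [12, 50, 35, 70], [9, 48, 33, 74], [11, 55, 30, 79]]
# Two rows replaced by a fixed-color stripe, as in a partially captured frame.
striped = [[10, 52, 31, 77], [128] * 4, [128] * 4, [11, 55, 30, 79]]
# A scene that really is one flat color.
flat_scene = [[200] * 4 for _ in range(4)]

for name, frame in [("healthy", healthy), ("striped", striped), ("flat", flat_scene)]:
    print(name, uniform_row_fraction(frame))
```

With a heuristic of this kind, a scene that is genuinely a single flat color scores as high as a badly damaged frame, so any stripe-based score can produce false positives on such footage.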

Configure Corruptness Check

Since the corruptness check for videos does more than simple readability checks of the frames, we also provide a parameter (corruption_threshold) to tweak the feature.
With the default threshold, the check might flag frames as corrupt too aggressively (false positives) if the video is of low quality or if the captured scene primarily consists of a single color.

worker_config={
    "corruptness_check": {
        # Threshold in [0, 1] which determines the sensitivity of the corruptness check
        # for video frames. Every frame which has an internally computed corruptness
        # score larger than the specified threshold will be classified as corrupted.
        "corruption_threshold": 0.1,
    },
}

If you notice that the corruptness check is too sensitive (too many frames are flagged as corrupt), you can increase the corruption_threshold from its default value to give it more leeway. You can find the number of corrupt images in the PDF report that is created for every run.

worker_config={
    "corruptness_check": {
        "corruption_threshold": 0.5,
    },
}
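The effect of raising the threshold can be illustrated with a few made-up scores. The values below are hypothetical and only show the comparison rule stated in the config comment (a frame is classified as corrupted when its score is larger than the threshold); count_flagged is not part of any Lightly API.

```python
def count_flagged(scores, corruption_threshold):
    """Count frames whose score is strictly larger than the threshold."""
    return sum(1 for score in scores if score > corruption_threshold)


# Hypothetical per-frame corruptness scores, for illustration only.
scores = [0.02, 0.08, 0.12, 0.30, 0.45, 0.70]

for threshold in (0.1, 0.5):
    flagged = count_flagged(scores, threshold)
    print(f"corruption_threshold={threshold}: {flagged} of {len(scores)} frames flagged")
```

In this example, moving from 0.1 to 0.5 reduces the flagged frames from four to one. A higher threshold is more lenient, at the cost of letting mildly damaged frames through.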