Corruption Check

When processing large raw datasets, some files can be broken. The Lightly Worker has a built-in capability to deal with what we call "corrupt" images or frames.

The corruption check is done independently of where the data is stored. To figure out which files were marked as corrupt by the Lightly Worker, please see Corruptness Check Information.

Why is this required?

Whenever you process a dataset, the Lightly Worker lists and indexes the configured datasource so that it knows which files to process. For videos, it also keeps track of which frames are broken and which frames are fine to use for processing.

What checks are performed?

In simple terms, an image or video frame is flagged corrupt if its pixel data cannot be loaded.

When working with images, the following tests are performed:

  • File can be accessed and opened
  • Image can be decoded and is a valid image file (readable by Pillow)
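
The two image checks above can be approximated with a few lines of Pillow. This is only a minimal sketch, not the Lightly Worker's actual implementation; the function name `is_image_corrupt` is hypothetical, and catching every exception is a simplifying assumption.

```python
from PIL import Image


def is_image_corrupt(path):
    """Return True if the file cannot be opened or its pixel data cannot be decoded.

    Hypothetical helper for illustration; not part of the Lightly API.
    """
    try:
        # Image.open only reads the header, so the file must exist and be
        # recognized as an image format by Pillow.
        with Image.open(path) as img:
            # load() forces Pillow to decode the full pixel data.
            img.load()
    except Exception:
        # Assumption: any failure to open or decode counts as corrupt.
        return True
    return False
```

Calling `load()` matters here because a truncated file can have a valid header and still fail once the pixel data is decoded.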

When working with videos, we perform the image-level corruption check for every frame. Additionally, we check the following:

  • Video can be accessed and opened
  • Video can be decoded and is a valid file
  • Every video frame can be decoded

Broken Video Frames

When working with static cameras, there might be artifacts in the pixel data of some frames. These artifacts can appear for various reasons, for example because the rolling shutter did not have enough time to capture the image or because the camera's sensor or hardware is damaged. Typically, they show up as fixed color stripes in a video frame.

Because such frames can still be decoded, they are not flagged as corrupt.

Broken video frame example.

Configure Selection to Remove Broken Frames

(New in 2.7.) The Lightly Worker provides a simple way to remove such broken frames. Use the lightly.uniformRowRatio metadata to remove images with uniform rows, which typically indicate decoding artifacts. The following selection configuration keeps only images in which less than 2.5% of the rows are uniform.

{
    "n_samples": 100, # set to the number of samples you want to select
    "strategies": [
        {
            "input": {
                "type": "METADATA",
                "key": "lightly.uniformRowRatio"
            },
            "strategy": {
                "type": "THRESHOLD",
                "threshold": 0.025,
                "operation": "SMALLER"
            }
        }
    ]
}

We recommend setting the threshold to 0.025. If you find the check too sensitive, try increasing the value. You can also inspect the distribution of the metadata values in the Lightly Platform and choose the threshold based on that.
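
To get a feeling for how a given threshold behaves, you can apply the same comparison to a handful of metadata values yourself. The sketch below assumes that SMALLER means a strict "less than" comparison, as the description above suggests; the function name and the file names are hypothetical.

```python
def keep_below_threshold(ratios, threshold=0.025):
    """Keep samples whose uniform row ratio is strictly below the threshold.

    Hypothetical helper that mimics the THRESHOLD strategy locally.
    """
    return {name: ratio for name, ratio in ratios.items() if ratio < threshold}


# Made-up metadata values for three frames.
ratios = {"frame_001.png": 0.0, "frame_002.png": 0.31, "frame_003.png": 0.02}
kept = keep_below_threshold(ratios)
```

With these made-up values, only the frame with a ratio of 0.31 is dropped. Raising the threshold above 0.31 would keep all three frames.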

Understanding Uniform Row Ratio

The algorithm to compute lightly.uniformRowRatio is explained in detail here. The metadata value is the proportion of image rows that are considered uniform, that is, rows that essentially consist of a single color. The actual condition is more involved so that it is robust to image encoding.

For example, the dark green rows in the image above would be considered uniform. A row with an artifact that does not span its whole width is not considered uniform. Note that this check does not detect other quality problems such as blurry frames.
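
The basic idea can be illustrated with a deliberately simplified version of the metric. This is not the algorithm used by the Lightly Worker: it treats a row as uniform only if all pixels are exactly equal, whereas the real condition is more tolerant. The function name and the nested-list image representation are assumptions made for this sketch.

```python
def uniform_row_ratio(rows):
    """Return the fraction of rows in which every pixel has the same value.

    `rows` is a list of image rows, each a list of pixel values such as
    (r, g, b) tuples. Simplified illustration only.
    """
    if not rows:
        return 0.0
    uniform = sum(1 for row in rows if row and all(p == row[0] for p in row))
    return uniform / len(rows)


green = (0, 100, 0)
other = (200, 30, 30)
image = [
    [green, green, green, green],  # stripe spanning the full width: uniform
    [green, green, other, other],  # artifact covers only part of the row
    [other, green, other, green],  # regular image content
    [green, other, green, other],  # regular image content
]
ratio = uniform_row_ratio(image)
```

Here one of the four rows is uniform, so the ratio is 0.25, well above the recommended threshold of 0.025. The second row is not counted because its stripe does not span the whole width.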