Prediction Format

Lightly can use the images you provide in a datasource together with predictions from a machine learning model. The predictions improve your selection results, either with an active learning or a balancing strategy. Object or keypoint detection predictions can also be used to run Lightly with object diversity.

By providing predictions in the datasource, you have complete control over them. If you add new samples to your datasource, you can simultaneously add their predictions. If you already have labels instead of predictions, you can treat them just like predictions and upload them the same way.

Predictions Folder Structure

In the following, we outline the prediction format required by the Lightly Worker. Everything related to predictions is stored in a subfolder of your configured Lightly datasource called .lightly/predictions. The general structure of your input and Lightly datasource looks like this:

s3://bucket/input/
├── image_0.png
└── subdir/
    ├── image_1.png
    ├── image_2.png
    ├── ...
    └── image_N.png

s3://bucket/lightly/
└── .lightly/predictions/
    ├── tasks.json
    ├── task_1/
    │   ├── schema.json
    │   ├── image_0.json
    │   └── subdir/
    │       ├── image_1.json
    │       ├── image_2.json
    │       ├── ...
    │       └── image_N.json
    └── task_2/
        ├── schema.json
        ├── image_0.json
        └── subdir/
            ├── image_1.json
            ├── image_2.json
            ├── ...
            └── image_N.json

Each subfolder in .lightly/predictions corresponds to one prediction task (e.g., a classification task or an object detection task). All of the files are explained in the following sections.

📘

Check out our reference project

If you want to see a reference project with the proper folder structure for working with predictions and metadata, have a look at our example here: https://github.com/lightly-ai/object_detection_example_structure

Prediction Tasks

Lightly identifies prediction tasks by their names. A task name is the name of the corresponding subfolder in .lightly/predictions. You make the tasks available to Lightly by adding the list of task names to a tasks.json file in the .lightly/predictions directory.

For example, let’s say we are working with the following folder structure:

s3://bucket/lightly/
└── .lightly/predictions/
    ├── tasks.json
    ├── classification_weather/
    │   ├── schema.json
    │   └── ...
    ├── object_detection_people/
    │   ├── schema.json
    │   └── ...
    ├── semantic_segmentation_cars/
    │   ├── schema.json
    │   └── ...
    └── some_directory_containing_irrelevant_things/
        └── ...

Then we can specify which subfolders contain relevant predictions in the tasks.json file:

[
    "classification_weather",
    "object_detection_people",
    "semantic_segmentation_cars"
]

🚧

Always add a tasks.json

Only the task names listed within tasks.json will be considered by the Lightly Worker! When adding a new subfolder with predictions, always remember to add the subfolder name to the tasks.json file.

🚧

Don't forget the schema.json

If you list a folder in tasks.json that doesn't contain a valid schema.json file, the Lightly Worker will report an error! See below for how to create a suitable schema.json file.
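As a quick sanity check, the following sketch verifies that every task listed in tasks.json comes with a schema.json. It assumes the .lightly folder has been synced to your local machine (the lightly/ path is a placeholder):

import json
from pathlib import Path

# Hypothetical local copy of the datasource's .lightly/predictions folder.
predictions_dir = Path("lightly/.lightly/predictions")

with open(predictions_dir / "tasks.json") as f:
    tasks = json.load(f)

for task in tasks:
    if not (predictions_dir / task / "schema.json").exists():
        print(f"Task '{task}' is missing its schema.json!")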

Prediction Schema

Every prediction task needs a schema defining the format of the predictions. The Lightly Platform uses the schema to identify and display prediction classes correctly. It also helps to prevent errors as all loaded predictions are validated against this schema.

You can provide this information to Lightly by adding a schema.json file to the folder of a prediction task. The schema.json file must have a key categories with a list of categories following the COCO annotation format. It must also have a key task_type indicating the type of the predictions. The task_type must be one of the following:

  • classification
  • object-detection
  • keypoint-detection
  • instance-segmentation
  • semantic-segmentation

For example, let’s say we are working with a classification model predicting the weather on an image. The three classes are sunny, clouded, and rainy. Then the schema.json file should look as follows:

{
    "task_type": "classification",
    "categories": [
        {
            "id": 0,
            "name": "sunny"
        },
        {
            "id": 1,
            "name": "clouded"
        },
        {
            "id": 2,
            "name": "rainy"
        }
    ]
}
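If you prefer to generate the schema programmatically, a minimal sketch could derive it from a list of class names (the class names and output path here are just placeholders):

import json

class_names = ["sunny", "clouded", "rainy"]  # placeholder class names

schema = {
    "task_type": "classification",
    "categories": [{"id": i, "name": name} for i, name in enumerate(class_names)],
}

with open("schema.json", "w") as f:
    json.dump(schema, f, indent=4)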

Prediction Files

Lightly requires a single prediction file per image. Predictions are saved as JSON files following the Prediction Format. They are stored in the subfolder .lightly/predictions/${TASK_NAME} in the Lightly datasource the dataset was configured with. To make sure Lightly can match the predictions to the correct source image, it’s necessary to follow the naming convention:

# Filename of the prediction for image in s3://bucket/input/FILENAME.EXT
s3://bucket/lightly/.lightly/predictions/${TASK_NAME}/${FILENAME}.json

# Example
# Image: s3://bucket/input/subdir/image_1.png
# Task name: my_classification_task
# Prediction file must be at:
s3://bucket/lightly/.lightly/predictions/my_classification_task/subdir/image_1.json

# Example
# Image: s3://bucket/input/image_0.png
# Task name: my_classification_task
# Prediction file must be at:
s3://bucket/lightly/.lightly/predictions/my_classification_task/image_0.json
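A small helper sketch illustrating this convention (the function is our own illustration; paths are relative to the respective datasource roots):

from pathlib import Path


def prediction_path(image_path: str, task_name: str) -> str:
    """Maps an image path (relative to the input datasource root) to the
    corresponding prediction file path in the Lightly datasource."""
    json_name = Path(image_path).with_suffix(".json")
    return str(Path(".lightly/predictions") / task_name / json_name)


print(prediction_path("subdir/image_1.png", "my_classification_task"))
# .lightly/predictions/my_classification_task/subdir/image_1.json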

See Create Object Detection Prediction Files from COCO for how to automatically convert COCO predictions into the format required by Lightly.

Prediction Files for Videos

When working with videos, Lightly requires a prediction file per frame. The prediction file name must contain the original video name, video extension, and frame number in the following format:

{VIDEO_NAME}-{FRAME_NUMBER}-{VIDEO_EXTENSION}.json

Frame numbers are zero-padded to the number of digits of the total frame count of the video. A video with 200 frames must have frame numbers padded to length three; for example, the frame number for frame 99 becomes 099. A video with 1000 frames must have frame numbers padded to length four (99 becomes 0099).

Examples are shown below:

# Filename of the predictions of the Xth frame of video s3://bucket/input/FILENAME.EXT
# with 200 frames (padding: len(str(200)) = 3)
s3://bucket/lightly/.lightly/predictions/${TASK_NAME}/${FILENAME}-${X:03d}-${EXT}.json

# Example
# Video: s3://bucket/input/subdir/video_1.mp4
# Task name: my_classification_task
# Prediction file for frame 99 of 200 frames:
s3://bucket/lightly/.lightly/predictions/my_classification_task/subdir/video_1-099-mp4.json

# Example
# Video: s3://bucket/input/video_0.mp4
# Task name: my_classification_task
# Prediction file for frame 99 of 200 frames:
s3://bucket/lightly/.lightly/predictions/my_classification_task/video_0-099-mp4.json
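The padding and naming rules can be expressed in a few lines of Python. The helper below (our own illustration) returns both the prediction file name and the corresponding file_name entry with the .png ending described in the note that follows:

from pathlib import Path
from typing import Tuple


def frame_file_names(video_path: str, frame_index: int, num_frames: int) -> Tuple[str, str]:
    """Returns (prediction file name, file_name entry) for a video frame.

    `video_path` is relative to the input datasource root.
    """
    video = Path(video_path)
    padding = len(str(num_frames))
    stem = f"{video.with_suffix('')}-{frame_index:0{padding}d}-{video.suffix[1:]}"
    return f"{stem}.json", f"{stem}.png"


print(frame_file_names("subdir/video_1.mp4", 99, 200))
# ('subdir/video_1-099-mp4.json', 'subdir/video_1-099-mp4.png')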

See Create Prediction Files for Videos on how to extract video frames and create predictions using FFmpeg or Python.

🚧

Prediction Filename for Videos

When creating predictions for videos, keep in mind that the Lightly Worker automatically extracts the frames as PNGs.
If you have a prediction for frame 123 of a video myVideo.mp4, the file_name within the myVideo-123-mp4.json file must have a .png ending.
The file_name would therefore be myVideo-123-mp4.png!

Prediction Format

Predictions for an image must have a file_name and a list of predictions. Here, file_name serves as a unique identifier to retrieve the image for which the predictions are made, while predictions is a list of Prediction Singletons. Each entry in the predictions list contains the information for a single prediction. This is typically an image classification, a detected object, or a segmentation mask.

Predictions have the following entries:

  • category_id is the id of the predicted class
  • score is the final prediction score/confidence; values must be in [0, 1]
  • probabilities are the per-class probabilities of the prediction; values must be in [0, 1]
    and sum up to 1.0

Depending on the prediction task, additional entries might be required. For details, see Prediction Singletons.

Example Classification

{
    "file_name": "subdir/image_1.png",
    "predictions": [ // classes: [sunny, clouded, rainy]
        {
            "category_id": 0,                // category in [0, num categories - 1]
            "probabilities": [0.8, 0.1, 0.1] // values in [0, 1], sum up to 1.0
        }
    ]
}

Example Object Detection

{
    "file_name": "subdir/image_1.png",
    "predictions": [ // classes: [person, car]
        {
            "category_id": 0,           // category in [0, num categories - 1]
            "bbox": [140, 100, 80, 90], // x, y, w, h coordinates in pixels
                                        // x, y >= 0 and w, h >= 1
            "score": 0.8,               // prediction score in [0, 1]
            "probabilities": [0.2, 0.8] // optional, values in [0, 1], sum up to 1.0
        },
        {
            "category_id": 1,
            "bbox": [...],
            "score": 0.9,
            "probabilities": [0.9, 0.1]
        },
        {
            "category_id": 0,
            "bbox": [...],
            "score": 0.5,
            "probabilities": [0.6, 0.4]
        }
    ]
}

Example Keypoint Detection

{
    "file_name": "subdir/image_1.png",
    "predictions": [ // classes: [person, car]
        {
            "category_id": 0,           // category in [0, num categories - 1]
            // keypoints in [x1, y1, s1, x2, y2, s2, ...] format
            // x, y are coordinates in pixels
            // s is a keypoint score in [0, 1]
            "keypoints": [100, 100, 0.95, 13, 29, 0.8, 30, 35, 0.5],
            "score": 0.8,               // prediction score in [0, 1]
            "bbox": [140, 100, 80, 90], // optional, x, y, w, h coordinates in pixels
                                        // x, y >= 0 and w, h >= 1
            "probabilities": [0.2, 0.8] // optional, values in [0, 1], sum up to 1.0
        },
        {
            "category_id": 1,
            "keypoints": [10, 20, 1, 30, 40, 1],
            "score": 0.9,
            "probabilities": [0.9, 0.1]
        }
    ]
}

Example Instance Segmentation

{
    "file_name": "subdir/image_1.png",
    "predictions": [ // classes: [person, car]
        {
            "category_id": 0,           // category in [0, num categories - 1]
            "segmentation": [100, 80, 90, 85, ...], //run length encoded binary segmentation mask
            "score": 0.8,               // prediction score in [0, 1]
            "bbox": [140, 100, 80, 90], // x, y, w, h coordinates in pixels
                                        // x, y >= 0 and w, h >= 1
            "probabilities": [0.8, 0.2] // optional, values in [0, 1], sum up to 1.0
        },
        {
            "category_id": 1,
            "segmentation": [...],
            "score": 0.9,
            "bbox": [...],
            "probabilities": [0.1, 0.9]
        }
    ]
}

Example Semantic Segmentation

{
    "file_name": "subdir/image_1.png",
    "predictions": [ // classes: [background, car, tree]
        {
            "category_id": 0,                  // category in [0, num categories - 1]
            "segmentation": [100, 80, 90, 85, ...], //run length encoded binary segmentation mask
            "score": 0.8,                      // prediction score in [0, 1]	
            "probabilities": [0.15, 0.8, 0.05] // optional, values in [0, 1], sum up to 1.0
        },
        {
            "category_id": 1,
            "segmentation": [...],
            "score": 0.9,
            "probabilities": [0.02, 0.08, 0.9]
        }
    ]
}

❗️

file_name should always be the path of the image relative to the root directory of your input datasource.

For example, if the input datasource has s3://bucket/input/ as the root directory and the image is saved at s3://bucket/input/subdir/image_1.png, then file_name should be subdir/image_1.png.

Prediction Singletons

The prediction singletons follow the COCO results format while dropping the image_id. Note that the category_id must be the same as the one defined in the schema and that the probabilities, if provided, must follow the order of the category ids.

Please use the following formats for each specific task type.

Classification

For classification, please use the following format:

[{
    "category_id"       : int,              // category in [0, num categories - 1]
    "probabilities"     : [p0, p1, ..., pN] // optional, sum up to 1.0
}]

Object Detection

For detection with bounding boxes, please use the following format:

[{
    "category_id"   : int,               // category in [0, num categories - 1]
    "bbox"          : [x, y, w, h],      // coordinates in pixels from the top left image corner
                                         // x, y >= 0 and w, h >= 1
    "score"         : float,             // prediction score in [0, 1]
    "probabilities" : [p0, p1, ..., pN]  // optional, values in [0, 1], sum up to 1.0
}]

The bounding box format follows the COCO results documentation.

🚧

Bounding box format

Following COCO, Lightly uses [x, y, width, height] lists as bounding box format. Remember to convert your bounding boxes if you use a different format such as [x1, y1, x2, y2]!
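For example, a minimal conversion from the [x1, y1, x2, y2] corner format to the COCO [x, y, w, h] format could look like this:

from typing import List


def xyxy_to_xywh(box: List[float]) -> List[float]:
    """Converts a [x1, y1, x2, y2] bounding box to the COCO [x, y, w, h] format."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


print(xyxy_to_xywh([140, 100, 220, 190]))  # [140, 100, 80, 90]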

📘

Objectness scores and class probabilities

Some frameworks only provide the score as the model output. The score is typically calculated during Non-Maximum Suppression (NMS) by multiplying the objectness probability with the highest class probability.

Having a separate objectness score and class probabilities instead of only a single score can be valuable information for active learning. For example, an object detection model could predict a tree with a score of 0.6. Without class probabilities, we cannot know the prediction margin or entropy. With them, we would also know whether the model thought it's 0.5 tree, 0.4 person, and 0.1 car, or 0.5 tree, 0.25 person, and 0.25 car. Providing the probabilities makes the computation of active learning scores and the class distribution more precise.

The active learning scores are usually computed from the probabilities vector. If it is not provided, the probabilities vector is approximated: the class defined by the category_id is assigned the score as its probability, and the remaining classes are assumed to have equal probabilities such that all probabilities sum up to 1.0. For example, if the category_id is 0, the score is 0.7, and there are 4 classes defined in the schema.json, the probabilities are approximated as [0.7, 0.1, 0.1, 0.1].
If only 1 class is defined in the schema.json, a second class is assumed to exist, but only for computing the active learning scores.

The class distribution is usually set to the probabilities vector. If the probabilities are missing, it is set to 1.0 for the class defined by the category_id and 0.0 for all other classes.
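A sketch of the approximation described above (our own illustration; the single-class special case is left out):

from typing import List


def approximate_probabilities(category_id: int, score: float, num_categories: int) -> List[float]:
    """Approximates the probabilities vector from a single score: the predicted
    class receives the score, the remaining classes share the rest equally.

    Assumes num_categories >= 2; Lightly handles the single-class case by
    assuming a second class for the active learning scores.
    """
    remainder = (1.0 - score) / (num_categories - 1)
    probabilities = [remainder] * num_categories
    probabilities[category_id] = score
    return probabilities


print(approximate_probabilities(category_id=0, score=0.7, num_categories=4))
# [0.7, 0.1, 0.1, 0.1]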

Keypoint Detection

For keypoint detection, please use the following format:

[{
    "category_id"   : int,               // category in [0, num categories - 1]
    // keypoints in [x1, y1, s1, x2, y2, s2, ...] format
    // x, y are coordinates in pixels
    // s is a keypoint score in [0, 1]
    "keypoints"     : [x0, y0, s0, x1, y1, s1, ...]
    "score"         : float,             // prediction score in [0, 1]
    "bbox"          : [x, y, w, h],      // optional, coordinates in pixels from the top left image corner
                                         // x, y >= 0 and w, h >= 1
    "probabilities" : [p0, p1, ..., pN]  // optional, values in [0, 1], sum up to 1.0
}]

The keypoint detection format follows the COCO results documentation. The x and y coordinates represent pixels from the top left corner of the image.

Each keypoint prediction contains the keypoints, an optional bounding box, and optional class probabilities. If the bounding box is omitted, Lightly will infer it from the keypoints directly by drawing a tight bounding box around all keypoints (including non-visible ones).
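A sketch of how such a tight box can be derived from the keypoints (this is our illustration of the behavior, not Lightly's internal implementation):

from typing import List


def bbox_from_keypoints(keypoints: List[float]) -> List[float]:
    """Draws a tight [x, y, w, h] bounding box around all keypoints."""
    xs = keypoints[0::3]  # every third value is an x coordinate
    ys = keypoints[1::3]  # every third value (offset 1) is a y coordinate
    x, y = min(xs), min(ys)
    return [x, y, max(xs) - x, max(ys) - y]


print(bbox_from_keypoints([100, 100, 0.95, 13, 29, 0.8, 30, 35, 0.5]))
# [13, 29, 87, 71]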

📘

Multi-class Keypoint Detections

Lightly supports multi-class keypoint detections with a variable number of keypoints per class. For example, a keypoint prediction could consist of a detection for the class "Person" with 13 keypoints and a detection for a class "Car" with 10 keypoints. Each of the detections is then represented by one keypoint detection singleton.

Semantic Segmentation

For semantic segmentation, please use the following format:

[{
    "category_id"       : int,              // category in [0, num categories - 1]
    "segmentation"      : [int, int, ...],  // run length encoded binary segmentation mask
    "score"             : float,            // prediction score in [0, 1]
    "probabilities"     : [p0, p1, ..., pN] // optional, values in [0, 1], sum up to 1.0
}]

Each segmentation prediction contains the binary mask for one category and a corresponding score. The score indicates the likelihood that the segmentation belongs to that category. Optionally, a list of probabilities can be provided, containing a probability for each category and indicating the likelihood that the segment belongs to that category.

To kickstart using Lightly with semantic segmentation predictions, we created example scripts that take model predictions and convert them to the correct format. Below we provide examples for predictions as NumPy arrays and as PyTorch tensors.

Segmentations are defined with binary masks where each pixel is set to 1 if it belongs to the object and to 0 if it belongs to the background. The segmentation masks are compressed using run length encoding to reduce file size. Binary segmentation masks can be converted to the required format using the following function:

import numpy as np
from numpy.typing import NDArray
from typing import List


def encode(binary_mask: NDArray[np.int_]) -> List[int]:
    """Encodes a (H, W) binary segmentation mask with run length encoding.

    The run length encoding is an array with counts of subsequent 0s and 1s
    in the binary mask. The first value in the array is always the count of
    initial 0s.

    Examples:

        >>> binary_mask = np.array([
        >>>     [0, 0, 1, 1],
        >>>     [0, 1, 1, 1],
        >>>     [0, 0, 0, 1],
        >>> ])
        >>> encode(binary_mask)
        [2, 2, 1, 3, 3, 1]
    """
    assert np.all((binary_mask == 1) | (binary_mask == 0))
    flat = np.concatenate(([-1], np.ravel(binary_mask), [-1]))
    borders = np.nonzero(np.diff(flat))[0]
    rle = np.diff(borders)
    if flat[1]:
        rle = np.concatenate(([0], rle))
    return rle.tolist()
The equivalent function for PyTorch tensors:
import numpy as np
from typing import List
import torch


def encode(binary_mask_tensor: torch.Tensor) -> List[int]:
    """Encodes a (H, W) binary segmentation mask with run length encoding.

    The run length encoding is an array with counts of subsequent 0s and 1s
    in the binary mask. The first value in the array is always the count of
    initial 0s.
    
    Note that the shape of the input mask must be (H, W). Other libraries might
    give masks in a different shape.

    Examples:

        >>> binary_mask = torch.tensor([
        >>>     [0, 0, 1, 1],
        >>>     [0, 1, 1, 1],
        >>>     [0, 0, 0, 1],
        >>> ], dtype=torch.int)
        >>> encode(binary_mask)
        [2, 2, 1, 3, 3, 1]
    """
    binary_mask = binary_mask_tensor.detach().cpu().numpy()
    assert np.all((binary_mask == 1) | (binary_mask == 0))
    flat = np.concatenate(([-1], np.ravel(binary_mask), [-1]))
    borders = np.nonzero(np.diff(flat))[0]
    rle = np.diff(borders)
    if flat[1]:
        rle = np.concatenate(([0], rle))
    return rle.tolist()

🚧

The shape of the input mask must be (H, W). Input masks acquired with some libraries, e.g., TensorFlow, might have a different shape.

Segmentation models often output a probability for each pixel and category. Storing such probabilities can quickly result in large file sizes if the input images have a high resolution. Lightly expects only a single score or probability per segmentation to reduce storage requirements. If you have scores or probabilities for each pixel in the image, you must first aggregate them into a single score/probability. We recommend taking the median or mean score/probability over all pixels within the segmentation mask. The example below shows how pixel-wise segmentation predictions can be converted to the format required by Lightly.

import numpy as np
from numpy.typing import NDArray
from typing import Dict, List, Union

PredictionType = Dict[str, Union[int, float, List[int]]]


def convert_to_lightly_predictions(model_predictions: NDArray[np.float64]) -> List[PredictionType]:
    """Converts model predictions to Lightly semantic segmentation predictions.

    Shape of `model_predictions`: (N, C, H, W)
        - N: number of images
        - C: category count
        - H: image height
        - W: image width

    Examples:
        >>> images = np.random.randn(3, 4, 5, 6)
        >>> convert_to_lightly_predictions(images)
        [{'category_id': 0, 'segmentation': [6, 1, 3, ..., 5], 'score': 0.95}, ...]

    Args:
        model_predictions:
            Predictions generated by a model for semantic segmentation.

    Returns:
        A list of Lightly semantic segmentation predictions.
    """
    lightly_predictions: List[PredictionType] = []

    for prediction in model_predictions:
        prediction_argmax = np.argmax(prediction, axis=0)
        for category_id in np.unique(prediction_argmax):
            binary_mask = prediction_argmax == category_id
            median_score = np.median(prediction[category_id, binary_mask])
            lightly_predictions.append(
                {
                    "category_id": int(category_id),
                    "segmentation": encode(binary_mask),
                    "score": float(median_score),
                }
            )

    return lightly_predictions
The equivalent conversion for PyTorch tensors:
from typing import List, Dict, Union
import torch
import numpy as np

PredictionType = Dict[str, Union[int, float, List[int]]]


def convert_to_lightly_predictions(model_predictions: torch.Tensor) -> List[PredictionType]:
    """Converts model predictions to Lightly semantic segmentation predictions.

    Shape of `model_predictions`: (N, C, H, W)
        - N: number of images
        - C: category count
        - H: image height
        - W: image width

    Examples:
        >>> images = torch.randn(3, 4, 5, 6)
        >>> convert_to_lightly_predictions(images)
        [{'category_id': 0, 'segmentation': [6, 1, 3, ..., 5], 'score': 0.95}, ...]

    Args:
        model_predictions:
            Predictions generated by a model for semantic segmentation.

    Returns:
        A list of Lightly semantic segmentation predictions.
    """
    lightly_predictions: List[PredictionType] = []

    for prediction in model_predictions.detach().cpu().numpy():
        prediction_argmax = np.argmax(prediction, axis=0)
        for category_id in np.unique(prediction_argmax):
            binary_mask = prediction_argmax == category_id
            median_score = np.median(prediction[category_id, binary_mask])
            lightly_predictions.append(
                {
                    "category_id": int(category_id),
                    "segmentation": encode(binary_mask),
                    "score": float(median_score),
                }
            )

    return lightly_predictions

Instance Segmentation

For instance segmentation, please use the following format:

[{
    "category_id"   : int,               // category in [0, num categories - 1]
    "segmentation"  : [int, int, ...]    // run length encoded binary segmentation mask 
    "score"         : float,             // prediction score in [0, 1]
    "bbox"          : [x, y, w, h],      // coordinates in pixels from the top left image corner
                                         // x, y >= 0 and w, h >= 1
    "probabilities" : [p0, p1, ..., pN]  // optional, values in [0, 1], sum up to 1.0
}]

Each instance segmentation prediction contains the run length encoding (RLE) of the binary mask for the object instance, a bounding box that encloses the object instance, a score, and optional class probabilities. The score indicates the likelihood that the segmentation belongs to the predicted category. Optionally, a list of probabilities can be provided, containing a probability for each category and indicating the likelihood that the instance belongs to that category.

The bounding box format follows the COCO results documentation, as for object detection, while the segmentation mask follows the format described in semantic segmentation. The segmentation mask must span the full input image, i.e., the size of the decoded segmentation mask must match the size of the input image.
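To verify that a mask spans the full image, you can decode the run length encoding and compare shapes. The decode function below is a sketch of the inverse of the encode function from the semantic segmentation section:

import numpy as np
from numpy.typing import NDArray
from typing import List


def decode(rle: List[int], height: int, width: int) -> NDArray[np.int_]:
    """Decodes a run length encoding back into a (H, W) binary mask."""
    mask = np.zeros(sum(rle), dtype=np.int_)
    index = 0
    for i, count in enumerate(rle):
        if i % 2 == 1:  # runs alternate between 0s and 1s, starting with 0s
            mask[index : index + count] = 1
        index += count
    assert mask.size == height * width, "mask does not span the full image"
    return mask.reshape(height, width)


print(decode([2, 2, 1, 3, 3, 1], height=3, width=4))
# [[0 0 1 1]
#  [0 1 1 1]
#  [0 0 0 1]]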

Create Object Detection Prediction Files from COCO

To create the predictions folder .lightly/predictions, we recommend writing a script that takes your predictions and saves them in the format outlined above. You can either save the predictions on your local machine and then upload them to your datasource, or save them directly to your datasource.

For example, the following script converts an object detection COCO predictions file. It takes as input the path to the predictions file and the Lightly datasource where the .lightly folder should be created. Don't forget to change these two parameters at the top of the script.

import json
import os
from pathlib import Path


### CHANGE THESE PARAMETERS
output_filepath = "/path/to/create/.lightly/dir"
annotation_filepath = "/path/to/_annotations.coco.json"

### Optionally change these parameters
task_name = "my_object_detection_task"
task_type = "object-detection"

# create prediction directory
path_predictions = os.path.join(output_filepath, ".lightly/predictions")
Path(path_predictions).mkdir(exist_ok=True, parents=True)

# Create tasks.json
path_task_json = os.path.join(path_predictions, "tasks.json")
tasks = [task_name]
with open(path_task_json, "w") as f:
    json.dump(tasks, f)

# read coco annotations
with open(annotation_filepath, "r") as f:
    coco_dict = json.load(f)

# Create schema.json for task
path_predictions_task = os.path.join(path_predictions, tasks[0])
Path(path_predictions_task).mkdir(exist_ok=True)
schema = {"task_type": task_type, "categories": coco_dict["categories"]}
path_schema_json = os.path.join(path_predictions_task, "schema.json")
with open(path_schema_json, "w") as f:
    json.dump(schema, f)

# Create predictions themselves
image_id_to_prediction = dict()
for image in coco_dict["images"]:
    prediction = {
        "file_name": image["file_name"],
        "predictions": [],
    }
    image_id_to_prediction[image["id"]] = prediction
for ann in coco_dict["annotations"]:
    pred = {
        "category_id": ann["category_id"],
        "bbox": ann["bbox"],
        "score": ann.get("score", 0),
    }
    image_id_to_prediction[ann["image_id"]]["predictions"].append(pred)

for prediction in image_id_to_prediction.values():
    filename_prediction = os.path.splitext(prediction["file_name"])[0] + ".json"
    path_to_prediction = os.path.join(path_predictions_task, filename_prediction)
    # create subdirectories in case file_name contains a relative path
    Path(path_to_prediction).parent.mkdir(exist_ok=True, parents=True)
    with open(path_to_prediction, "w") as f:
        json.dump(prediction, f)

Create Prediction Files for Videos

Lightly expects one prediction file per frame in a video. Predictions can be created following the Python example code below. Make sure that PyAV is installed on your system for it to work correctly.

import av
import json
from pathlib import Path
from typing import List, Dict

dataset_dir = Path("/datasets/my_dataset")
predictions_dir = dataset_dir / ".lightly" / "predictions" / "my_prediction_task"


def model_predict(frame) -> List[Dict]:
    # This function must be overwritten to generate predictions for a frame using
    # a prediction model of your choice. Here we just return an example prediction.
    # See https://docs.lightly.ai/docker/advanced/datasource_predictions.html#prediction-format
    # for possible prediction formats.
    return [{"category_id": 0, "bbox": [0, 10, 100, 30], "score": 0.8}]


for video_path in dataset_dir.glob("**/*.mp4"):
    # get predictions for frames
    predictions = []
    with av.open(str(video_path)) as container:
        stream = container.streams.video[0]
        for frame in container.decode(stream):
            predictions.append(model_predict(frame.to_image()))

    # save predictions
    num_frames = len(predictions)
    zero_padding = len(str(num_frames))
    for frame_index, frame_predictions in enumerate(predictions):
        video_name = video_path.relative_to(dataset_dir).with_suffix("")
        frame_name = Path(
            f"{video_name}-{frame_index:0{zero_padding}}-{video_path.suffix[1:]}.png"
        )
        prediction = {
            "file_name": str(frame_name),
            "predictions": frame_predictions,
        }
        out_path = predictions_dir / frame_name.with_suffix(".json")
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with open(out_path, "w") as file:
            json.dump(prediction, file)


# example directory structure before
# .
# ├── test
# │   └── video_0.mp4
# └── train
#     ├── video_1.mp4
#     └── video_2.mp4
#
# example directory structure after
# .
# ├── .lightly
# │   └── predictions
# │       └── my_prediction_task
# │           ├── test
# │           │   ├── video_0-000-mp4.json
# │           │   ├── video_0-001-mp4.json
# │           │   ├── video_0-002-mp4.json
# │           │   └── ...
# │           └── train
# │               ├── video_1-000-mp4.json
# │               ├── video_1-001-mp4.json
# │               ├── video_1-002-mp4.json
# │               ├── ...
# │               ├── video_2-000-mp4.json
# │               ├── video_2-001-mp4.json
# │               ├── video_2-002-mp4.json
# │               └── ...
# ├── test
# │   └── video_0.mp4
# └── train
#     ├── video_1.mp4
#     └── video_2.mp4

🚧

We discourage using a library other than PyAV for loading videos with Python, as the order and number of loaded frames might differ otherwise.

Extract Frames with FFmpeg

Instead of creating predictions directly from video, frames can first be extracted as images with FFmpeg and then further processed by any prediction model that supports images. The example command below shows how to extract frames and save them with the filename expected by Lightly. Ensure that FFmpeg is installed on your system before running the command.

VIDEO=video.mp4; NUM_FRAMES=$(ffprobe -v error -select_streams v:0 -count_frames -show_entries stream=nb_read_frames -of csv=p=0 ${VIDEO}); ffmpeg -r 1 -i ${VIDEO} -start_number 0 ${VIDEO%.mp4}-%0${#NUM_FRAMES}d-mp4.png

# results in the following file structure
.
├── video.mp4
├── video-000-mp4.png
├── video-001-mp4.png
├── video-002-mp4.png
├── video-003-mp4.png
└── ...