
Dataset

LightlyStudio Dataset.

Dataset

Dataset(collection: CollectionTable)

Bases: Generic[T], ABC

A LightlyStudio Dataset, a generic base for all dataset classes.

collection_id property

collection_id: UUID

Get the collection ID.

dataset_id property

dataset_id: UUID

Get the dataset ID.

name property

name: str

Get the dataset name.

__getitem__

__getitem__(key: _SliceType) -> DatasetQuery[T]

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

Name Type Description Default
key _SliceType

A slice object (e.g., [10:20], [:50], [100:]).

required

Returns:

Type Description
DatasetQuery[T]

DatasetQuery with slice applied.

Raises:

Type Description
TypeError

If key is not a slice object.

ValueError

If slice contains unsupported features or conflicts with existing slice.

__iter__

__iter__() -> Iterator[T]

Iterate over samples in the dataset.

compute_similarity_metadata

compute_similarity_metadata(
    query_tag_name: str, embedding_model_name: str | None = None, metadata_name: str | None = None
) -> str

Computes similarity with respect to a query tag.

Parameters:

Name Type Description Default
query_tag_name str

The name of the tag to use for the query.

required
embedding_model_name str | None

The name of the embedding model to use. If not given, the default embedding model is used.

None
metadata_name str | None

The name of the metadata to store the similarity values in. If not given, a name is generated automatically.

None

Returns:

Type Description
str

The name of the metadata storing the similarity values.
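Conceptually, similarity-to-tag scoring can be sketched as the cosine similarity between each sample embedding and the mean embedding of the tagged samples. This is an illustrative approximation only; the library's actual scoring may differ:

```python
import math

# Illustrative sketch of similarity-to-query-tag scoring; not the actual
# LightlyStudio implementation.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mean_embedding(embeddings: list[list[float]]) -> list[float]:
    n = len(embeddings)
    return [sum(e[d] for e in embeddings) / n for d in range(len(embeddings[0]))]

# Embeddings of samples carrying the query tag (toy 2-D vectors).
tagged = [[1.0, 0.0], [0.9, 0.1]]
query = mean_embedding(tagged)

# Similarity metadata value for one candidate sample.
candidate = [1.0, 0.05]
score = cosine(candidate, query)
```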

compute_typicality_metadata

compute_typicality_metadata(
    embedding_model_name: str | None = None, metadata_name: str = "typicality"
) -> None

Computes typicality from embeddings using the K nearest neighbors.

Parameters:

Name Type Description Default
embedding_model_name str | None

The name of the embedding model to use. If not given, the default embedding model is used.

None
metadata_name str

The name of the metadata to store the typicality values in. If not given, the default name "typicality" is used.

'typicality'
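A common definition of KNN typicality is the mean similarity between a sample's embedding and its K nearest neighbors; samples in dense regions score high, outliers score low. The sketch below illustrates that idea and is an assumption, not the library's exact implementation:

```python
import math

# Illustrative KNN-typicality sketch; the exact definition used by
# LightlyStudio may differ.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def typicality(embeddings: list[list[float]], index: int, k: int = 2) -> float:
    """Mean cosine similarity between one embedding and its k nearest neighbors."""
    sims = sorted(
        (cosine(embeddings[index], e) for j, e in enumerate(embeddings) if j != index),
        reverse=True,
    )
    return sum(sims[:k]) / k

embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.95, 0.05]]
score_typical = typicality(embs, 0)  # sample inside the dense cluster
score_outlier = typicality(embs, 2)  # the lone outlier
```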

get_sample abstractmethod

get_sample(sample_id: UUID) -> T

Get a single sample from the dataset by its ID.

match

match(match_expression: MatchExpression) -> DatasetQuery[T]

Create a query on the dataset and store a field condition for filtering.

Parameters:

Name Type Description Default
match_expression MatchExpression

Defines the filter.

required

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

order_by

order_by(*order_by: OrderByExpression) -> DatasetQuery[T]

Create a query on the dataset and store ordering expressions.

Parameters:

Name Type Description Default
order_by OrderByExpression

One or more ordering expressions, applied in order. E.g., ordering first by sample width and then by file_name sorts samples by width and breaks ties by file_name.

()

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

query

query() -> DatasetQuery[T]

Create a DatasetQuery for this dataset.

Returns:

Type Description
DatasetQuery[T]

A DatasetQuery instance for querying samples in this dataset.

sample_class abstractmethod staticmethod

sample_class() -> type[T]

Returns the sample class type.

sample_type abstractmethod staticmethod

sample_type() -> SampleType

Returns the sample type.

slice

slice(offset: int = 0, limit: int | None = None) -> DatasetQuery[T]

Create a query on the dataset and apply offset and limit to results.

Parameters:

Name Type Description Default
offset int

Number of items to skip from the beginning (default: 0).

0
limit int | None

Maximum number of items to return (None = no limit).

None

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

update_metadata

update_metadata(sample_metadata: list[tuple[UUID, Mapping[str, Any]]]) -> None

Bulk update metadata for multiple samples in the dataset.

If a sample does not have metadata, a new metadata row is created. If a sample already has metadata, the new key-value pairs are merged with the existing metadata.

Note: For performance reasons, we do not check whether the sample IDs actually belong to this dataset.

Parameters:

Name Type Description Default
sample_metadata list[tuple[UUID, Mapping[str, Any]]]

List of (sample ID, metadata_map) tuples, where metadata_map is a mapping from string to any type, for example {"weather": "cloudy", "temperature": 25}.

required
Example
dataset.update_metadata([
    (UUID("..."), {"weather": "sunny"}),
    (UUID("..."), {"weather": "cloudy", "temperature": 25}),
])
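The merge behavior described above follows ordinary dictionary-merge semantics, sketched below under the assumption (typical for metadata updates) that new values take precedence for duplicate keys:

```python
# Sketch of the documented merge semantics: new key-value pairs are merged
# into a sample's existing metadata. The assumption that new values win for
# duplicate keys is illustrative.
existing = {"weather": "sunny"}
update = {"weather": "cloudy", "temperature": 25}
merged = {**existing, **update}
print(merged)  # {'weather': 'cloudy', 'temperature': 25}
```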

ImageDataset

ImageDataset(collection: CollectionTable)

Bases: BaseSampleDataset[ImageSample]

Image dataset.

It can be created or loaded using one of the static methods:

dataset = ImageDataset.create()
dataset = ImageDataset.load()
dataset = ImageDataset.load_or_create()

Samples can be added to the dataset using various methods:

dataset.add_images_from_path(...)
dataset.add_samples_from_yolo(...)
dataset.add_samples_from_coco(...)
dataset.add_samples_from_coco_caption(...)
dataset.add_samples_from_labelformat(...)

The dataset samples can be queried directly by iterating over it or slicing it:

dataset = ImageDataset.load("my_dataset")
first_ten_samples = dataset[:10]
for sample in dataset:
    print(sample.file_name)
    sample.metadata["new_key"] = "new_value"

For filtering or ordering samples first, use the query interface:

from lightly_studio.core.dataset_query.image_sample_field import ImageSampleField

dataset = ImageDataset.load("my_dataset")
query = dataset.match(ImageSampleField.width > 10).order_by(ImageSampleField.file_name)
for sample in query:
    ...

collection_id property

collection_id: UUID

Get the collection ID.

dataset_id property

dataset_id: UUID

Get the dataset ID.

name property

name: str

Get the dataset name.

__getitem__

__getitem__(key: _SliceType) -> DatasetQuery[T]

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

Name Type Description Default
key _SliceType

A slice object (e.g., [10:20], [:50], [100:]).

required

Returns:

Type Description
DatasetQuery[T]

DatasetQuery with slice applied.

Raises:

Type Description
TypeError

If key is not a slice object.

ValueError

If slice contains unsupported features or conflicts with existing slice.

__iter__

__iter__() -> Iterator[T]

Iterate over samples in the dataset.

add_images_from_path

add_images_from_path(
    path: PathLike,
    allowed_extensions: Iterable[str] | None = None,
    embed: bool = True,
    tag_depth: int = 0,
) -> None

Add images from the specified path to the dataset.

Parameters:

Name Type Description Default
path PathLike

Path to the folder containing the images to add.

required
allowed_extensions Iterable[str] | None

An iterable container of allowed image file extensions.

None
embed bool

If True, generate embeddings for the newly added images.

True
tag_depth int

Defines the tagging behavior based on directory depth. tag_depth=0 (default): no automatic tagging is performed. tag_depth=1: automatically creates a tag for each image based on its parent directory's name.

0

Raises:

Type Description
NotImplementedError

If tag_depth > 1.
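With tag_depth=1, the tag for each image comes from its parent directory's name, which can be sketched with pathlib (paths below are illustrative):

```python
from pathlib import Path

# With tag_depth=1, each image is tagged with its parent directory's name.
# The file paths here are hypothetical examples.
image_paths = [
    Path("data/train/img_001.jpg"),
    Path("data/val/img_002.jpg"),
]
tags = [p.parent.name for p in image_paths]
print(tags)  # ['train', 'val']
```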

add_samples_from_coco

add_samples_from_coco(
    annotations_json: PathLike,
    images_path: PathLike,
    annotation_type: AnnotationType = OBJECT_DETECTION,
    split: str | None = None,
    embed: bool = True,
) -> None

Load a dataset in COCO Object Detection format and store in DB.

Parameters:

Name Type Description Default
annotations_json PathLike

Path to the COCO annotations JSON file.

required
images_path PathLike

Path to the folder containing the images.

required
annotation_type AnnotationType

The type of annotation to be loaded (e.g., 'ObjectDetection', 'InstanceSegmentation').

OBJECT_DETECTION
split str | None

Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.

None
embed bool

If True, generate embeddings for the newly added samples.

True

add_samples_from_coco_caption

add_samples_from_coco_caption(
    annotations_json: PathLike, images_path: PathLike, split: str | None = None, embed: bool = True
) -> None

Load a dataset in COCO caption format and store in DB.

Parameters:

Name Type Description Default
annotations_json PathLike

Path to the COCO caption JSON file.

required
images_path PathLike

Path to the folder containing the images.

required
split str | None

Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.

None
embed bool

If True, generate embeddings for the newly added samples.

True

add_samples_from_labelformat

add_samples_from_labelformat(
    input_labels: ObjectDetectionInput | InstanceSegmentationInput,
    images_path: PathLike,
    split: str | None = None,
    embed: bool = True,
) -> None

Load a dataset from a labelformat object and store in database.

Parameters:

Name Type Description Default
input_labels ObjectDetectionInput | InstanceSegmentationInput

The labelformat input object.

required
images_path PathLike

Path to the folder containing the images.

required
split str | None

Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.

None
embed bool

If True, generate embeddings for the newly added samples.

True

add_samples_from_lightly

add_samples_from_lightly(
    input_folder: PathLike,
    images_rel_path: str = "../images",
    split: str | None = None,
    embed: bool = True,
) -> None

Load a dataset in Lightly format and store in DB.

Parameters:

Name Type Description Default
input_folder PathLike

Path to the folder containing the annotations/predictions.

required
images_rel_path str

Relative path to images folder from label folder.

'../images'
split str | None

Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.

None
embed bool

If True, generate embeddings for the newly added samples.

True

add_samples_from_pascal_voc_segmentations

add_samples_from_pascal_voc_segmentations(
    images_path: PathLike,
    masks_path: PathLike,
    class_id_to_name: Mapping[int, str],
    split: str | None = None,
    embed: bool = True,
) -> None

Load a Pascal VOC segmentation dataset and store in DB.

Pascal VOC masks encode class IDs per pixel (semantic segmentation). Imported masks are persisted as AnnotationType.INSTANCE_SEGMENTATION. Query and export workflows should use instance segmentation type filters.

Parameters:

Name Type Description Default
images_path PathLike

Path to the folder containing the images.

required
masks_path PathLike

Path to the folder containing the masks.

required
class_id_to_name Mapping[int, str]

Mapping from class IDs to class names.

required
split str | None

Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.

None
embed bool

If True, generate embeddings for the newly added samples.

True

add_samples_from_yolo

add_samples_from_yolo(
    data_yaml: PathLike, input_split: str | None = None, embed: bool = True
) -> None

Load a dataset in YOLO format and store in DB.

Parameters:

Name Type Description Default
data_yaml PathLike

Path to the YOLO data.yaml file.

required
input_split str | None

The split to load (e.g., 'train', 'val', 'test'). If None, all available splits will be loaded and assigned a corresponding tag.

None
embed bool

If True, generate embeddings for the newly added samples.

True
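For reference, a minimal data.yaml in the standard YOLO layout looks roughly like this; paths and class names are placeholders:

```yaml
# Hypothetical minimal YOLO data.yaml
path: ./dataset          # dataset root
train: images/train      # train images, relative to path
val: images/val          # val images, relative to path
names:
  0: person
  1: car
```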

compute_similarity_metadata

compute_similarity_metadata(
    query_tag_name: str, embedding_model_name: str | None = None, metadata_name: str | None = None
) -> str

Computes similarity with respect to a query tag.

Parameters:

Name Type Description Default
query_tag_name str

The name of the tag to use for the query.

required
embedding_model_name str | None

The name of the embedding model to use. If not given, the default embedding model is used.

None
metadata_name str | None

The name of the metadata to store the similarity values in. If not given, a name is generated automatically.

None

Returns:

Type Description
str

The name of the metadata storing the similarity values.

compute_typicality_metadata

compute_typicality_metadata(
    embedding_model_name: str | None = None, metadata_name: str = "typicality"
) -> None

Computes typicality from embeddings using the K nearest neighbors.

Parameters:

Name Type Description Default
embedding_model_name str | None

The name of the embedding model to use. If not given, the default embedding model is used.

None
metadata_name str

The name of the metadata to store the typicality values in. If not given, the default name "typicality" is used.

'typicality'

create classmethod

create(name: str | None = None) -> Self

Create a new dataset.

Parameters:

Name Type Description Default
name str | None

The name of the dataset. If None, a default name is used.

None

export

export(query: DatasetQuery | None = None) -> ImageDatasetExport

Return an ImageDatasetExport instance which can export the dataset in various formats.

Parameters:

Name Type Description Default
query DatasetQuery | None

The dataset query to export. If None, the default query self.query() is used.

None

get_sample

get_sample(sample_id: UUID) -> ImageSample

Get a single sample from the dataset by its ID.

Parameters:

Name Type Description Default
sample_id UUID

The UUID of the sample to retrieve.

required

Returns:

Type Description
ImageSample

A single ImageSample object.

Raises:

Type Description
IndexError

If no sample is found with the given sample_id.

load classmethod

load(name: str | None = None) -> Self

Load an existing dataset.

load_or_create classmethod

load_or_create(name: str | None = None) -> Self

Create a new image dataset or load an existing one.

Parameters:

Name Type Description Default
name str | None

The name of the dataset. If None, a default name is used.

None

match

match(match_expression: MatchExpression) -> DatasetQuery[T]

Create a query on the dataset and store a field condition for filtering.

Parameters:

Name Type Description Default
match_expression MatchExpression

Defines the filter.

required

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

order_by

order_by(*order_by: OrderByExpression) -> DatasetQuery[T]

Create a query on the dataset and store ordering expressions.

Parameters:

Name Type Description Default
order_by OrderByExpression

One or more ordering expressions, applied in order. E.g., ordering first by sample width and then by file_name sorts samples by width and breaks ties by file_name.

()

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

query

query() -> DatasetQuery[T]

Create a DatasetQuery for this dataset.

Returns:

Type Description
DatasetQuery[T]

A DatasetQuery instance for querying samples in this dataset.

sample_class staticmethod

sample_class() -> type[ImageSample]

Returns the sample class.

sample_type staticmethod

sample_type() -> SampleType

Returns the sample type.

slice

slice(offset: int = 0, limit: int | None = None) -> DatasetQuery[T]

Create a query on the dataset and apply offset and limit to results.

Parameters:

Name Type Description Default
offset int

Number of items to skip from the beginning (default: 0).

0
limit int | None

Maximum number of items to return (None = no limit).

None

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

update_metadata

update_metadata(sample_metadata: list[tuple[UUID, Mapping[str, Any]]]) -> None

Bulk update metadata for multiple samples in the dataset.

If a sample does not have metadata, a new metadata row is created. If a sample already has metadata, the new key-value pairs are merged with the existing metadata.

Note: For performance reasons, we do not check whether the sample IDs actually belong to this dataset.

Parameters:

Name Type Description Default
sample_metadata list[tuple[UUID, Mapping[str, Any]]]

List of (sample ID, metadata_map) tuples, where metadata_map is a mapping from string to any type, for example {"weather": "cloudy", "temperature": 25}.

required
Example
dataset.update_metadata([
    (UUID("..."), {"weather": "sunny"}),
    (UUID("..."), {"weather": "cloudy", "temperature": 25}),
])

VideoDataset

VideoDataset(collection: CollectionTable)

Bases: BaseSampleDataset[VideoSample]

Video dataset.

It can be created or loaded using one of the static methods:

dataset = VideoDataset.create()
dataset = VideoDataset.load()
dataset = VideoDataset.load_or_create()

Samples can be added to the dataset using:

dataset.add_videos_from_path(...)

The dataset samples can be queried directly by iterating over it or slicing it:

dataset = VideoDataset.load("my_dataset")
first_ten_samples = dataset[:10]
for sample in dataset:
    print(sample.file_name)
    sample.metadata["new_key"] = "new_value"

For filtering or ordering samples first, use the query interface:

from lightly_studio.core.dataset_query.video_sample_field import VideoSampleField

dataset = VideoDataset.load("my_dataset")
query = dataset.match(VideoSampleField.width > 10).order_by(VideoSampleField.file_name)
for sample in query:
    ...

collection_id property

collection_id: UUID

Get the collection ID.

dataset_id property

dataset_id: UUID

Get the dataset ID.

name property

name: str

Get the dataset name.

__getitem__

__getitem__(key: _SliceType) -> DatasetQuery[T]

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

Name Type Description Default
key _SliceType

A slice object (e.g., [10:20], [:50], [100:]).

required

Returns:

Type Description
DatasetQuery[T]

DatasetQuery with slice applied.

Raises:

Type Description
TypeError

If key is not a slice object.

ValueError

If slice contains unsupported features or conflicts with existing slice.

__iter__

__iter__() -> Iterator[T]

Iterate over samples in the dataset.

add_videos_from_path

add_videos_from_path(
    path: PathLike,
    allowed_extensions: Iterable[str] | None = None,
    num_decode_threads: int | None = None,
    embed: bool = True,
) -> None

Add video frames from the videos in the specified path to the dataset.

Parameters:

Name Type Description Default
path PathLike

Path to the folder containing the videos to add.

required
allowed_extensions Iterable[str] | None

An iterable container of allowed video file extensions in lowercase, including the leading dot. If None, uses default VIDEO_EXTENSIONS.

None
num_decode_threads int | None

Optional override for the number of FFmpeg decode threads. If omitted, the number of available CPU cores minus one (capped at 16) is used.

None
embed bool

If True, generate embeddings for the newly added videos.

True
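The documented default thread count (available CPU cores minus one, capped at 16) can be sketched as follows; the helper name is hypothetical:

```python
import os

# Mirrors the documented default for num_decode_threads:
# available CPU cores minus one, capped at 16. Helper name is illustrative.
def default_decode_threads() -> int:
    cores = os.cpu_count() or 1
    return max(1, min(cores - 1, 16))

threads = default_decode_threads()
```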

add_videos_from_youtube_vis

add_videos_from_youtube_vis(
    annotations_json: PathLike,
    videos_path: PathLike,
    allowed_extensions: Iterable[str] | None = None,
    annotation_type: AnnotationType = OBJECT_DETECTION,
    embed: bool = True,
) -> None

Load videos and YouTube-VIS annotations and store them in the database.

Parameters:

Name Type Description Default
annotations_json PathLike

Path to the YouTube-VIS annotations JSON file.

required
videos_path PathLike

Path to the folder containing the videos.

required
allowed_extensions Iterable[str] | None

An iterable container of allowed video file extensions in lowercase, including the leading dot. If None, uses default VIDEO_EXTENSIONS. Note: This is used when a path in YouTube-VIS does not contain the file extension.

None
annotation_type AnnotationType

The type of annotation to be loaded (e.g., 'ObjectDetection', 'InstanceSegmentation').

OBJECT_DETECTION
embed bool

If True, generate embeddings for the newly added videos.

True

compute_similarity_metadata

compute_similarity_metadata(
    query_tag_name: str, embedding_model_name: str | None = None, metadata_name: str | None = None
) -> str

Computes similarity with respect to a query tag.

Parameters:

Name Type Description Default
query_tag_name str

The name of the tag to use for the query.

required
embedding_model_name str | None

The name of the embedding model to use. If not given, the default embedding model is used.

None
metadata_name str | None

The name of the metadata to store the similarity values in. If not given, a name is generated automatically.

None

Returns:

Type Description
str

The name of the metadata storing the similarity values.

compute_typicality_metadata

compute_typicality_metadata(
    embedding_model_name: str | None = None, metadata_name: str = "typicality"
) -> None

Computes typicality from embeddings using the K nearest neighbors.

Parameters:

Name Type Description Default
embedding_model_name str | None

The name of the embedding model to use. If not given, the default embedding model is used.

None
metadata_name str

The name of the metadata to store the typicality values in. If not given, the default name "typicality" is used.

'typicality'

create classmethod

create(name: str | None = None) -> Self

Create a new dataset.

Parameters:

Name Type Description Default
name str | None

The name of the dataset. If None, a default name is used.

None

export

export(query: DatasetQuery[VideoSample] | None = None) -> VideoDatasetExport

Return an export interface for the (optionally filtered) video dataset.

get_sample

get_sample(sample_id: UUID) -> VideoSample

Get a single sample from the dataset by its ID.

Parameters:

Name Type Description Default
sample_id UUID

The UUID of the sample to retrieve.

required

Returns:

Type Description
VideoSample

A single VideoSample object.

Raises:

Type Description
IndexError

If no sample is found with the given sample_id.

load classmethod

load(name: str | None = None) -> Self

Load an existing dataset.

load_or_create classmethod

load_or_create(name: str | None = None) -> Self

Create a new video dataset or load an existing one.

Parameters:

Name Type Description Default
name str | None

The name of the dataset. If None, a default name is used.

None

match

match(match_expression: MatchExpression) -> DatasetQuery[T]

Create a query on the dataset and store a field condition for filtering.

Parameters:

Name Type Description Default
match_expression MatchExpression

Defines the filter.

required

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

order_by

order_by(*order_by: OrderByExpression) -> DatasetQuery[T]

Create a query on the dataset and store ordering expressions.

Parameters:

Name Type Description Default
order_by OrderByExpression

One or more ordering expressions, applied in order. E.g., ordering first by sample width and then by file_name sorts samples by width and breaks ties by file_name.

()

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

query

query() -> DatasetQuery[T]

Create a DatasetQuery for this dataset.

Returns:

Type Description
DatasetQuery[T]

A DatasetQuery instance for querying samples in this dataset.

sample_class staticmethod

sample_class() -> type[VideoSample]

Returns the sample class.

sample_type staticmethod

sample_type() -> SampleType

Returns the sample type.

slice

slice(offset: int = 0, limit: int | None = None) -> DatasetQuery[T]

Create a query on the dataset and apply offset and limit to results.

Parameters:

Name Type Description Default
offset int

Number of items to skip from the beginning (default: 0).

0
limit int | None

Maximum number of items to return (None = no limit).

None

Returns:

Type Description
DatasetQuery[T]

DatasetQuery for method chaining.

update_metadata

update_metadata(sample_metadata: list[tuple[UUID, Mapping[str, Any]]]) -> None

Bulk update metadata for multiple samples in the dataset.

If a sample does not have metadata, a new metadata row is created. If a sample already has metadata, the new key-value pairs are merged with the existing metadata.

Note: For performance reasons, we do not check whether the sample IDs actually belong to this dataset.

Parameters:

Name Type Description Default
sample_metadata list[tuple[UUID, Mapping[str, Any]]]

List of (sample ID, metadata_map) tuples, where metadata_map is a mapping from string to any type, for example {"weather": "cloudy", "temperature": 25}.

required
Example
dataset.update_metadata([
    (UUID("..."), {"weather": "sunny"}),
    (UUID("..."), {"weather": "cloudy", "temperature": 25}),
])

ImageDatasetExport

Exports datasets from Lightly Studio into various formats.

ImageDatasetExport

ImageDatasetExport(session: Session, dataset_id: UUID, samples: Iterable[ImageSample])

Provides methods to export a dataset or a subset of it.

This class is typically not instantiated directly but returned by Dataset.export(). It allows exporting data in various formats.

Parameters:

Name Type Description Default
session Session

The database session.

required
dataset_id UUID

The dataset ID for label retrieval.

required
samples Iterable[ImageSample]

Samples to export.

required

to_coco_captions

to_coco_captions(output_json: PathLike | None = None) -> None

Exports captions to a COCO format JSON file.

Parameters:

Name Type Description Default
output_json PathLike | None

The path to the output COCO JSON file. If not provided, defaults to "coco_export.json" in the current working directory.

None

to_coco_instance_segmentations

to_coco_instance_segmentations(output_json: PathLike | None = None) -> None

Exports instance segmentations to a COCO format JSON file.

Parameters:

Name Type Description Default
output_json PathLike | None

The path to the output COCO JSON file. If not provided, defaults to "coco_export.json" in the current working directory.

None

to_coco_object_detections

to_coco_object_detections(output_json: PathLike | None = None) -> None

Exports object detection annotations to a COCO format JSON file.

Parameters:

Name Type Description Default
output_json PathLike | None

The path to the output COCO JSON file. If not provided, defaults to "coco_export.json" in the current working directory.

None

Raises:

Type Description
ValueError

If the annotation task with the given name does not exist.

to_pascalvoc_instance_segmentation

to_pascalvoc_instance_segmentation(output_folder: PathLike) -> None

Exports instance segmentation annotations to Pascal VOC format.

Creates a folder with per-pixel class masks (PNG) and a class map (JSON).

Parameters:

Name Type Description Default
output_folder PathLike

The folder where Pascal VOC segmentation files are written. The folder contains a SegmentationClass subfolder with PNG masks and a class_id_to_name.json file.

required
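The class map written alongside the masks can be sketched as a plain JSON mapping from class ID to name. The exact on-disk schema of class_id_to_name.json is an assumption here:

```python
import json

# Hypothetical content of class_id_to_name.json; the actual schema written
# by the exporter may differ.
class_id_to_name = {1: "person", 2: "car"}
text = json.dumps({str(k): v for k, v in class_id_to_name.items()})
print(text)  # {"1": "person", "2": "car"}
```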

VideoDatasetExport

Exports video datasets from Lightly Studio into various formats.

VideoDatasetExport

VideoDatasetExport(session: Session, samples: Iterable[VideoSample])

Provides methods to export a video dataset or a subset of it.

to_youtube_vis_instance_segmentation

to_youtube_vis_instance_segmentation(output_json: PathLike = DEFAULT_EXPORT_FILENAME) -> None

Export video instance segmentation tracks to YouTube-VIS format JSON file.

Parameters:

Name Type Description Default
output_json PathLike

Optional path to the output JSON file. If not provided, defaults to "youtube_vis_export.json".

DEFAULT_EXPORT_FILENAME