
Dataset


ImageDataset

ImageDataset(collection: CollectionTable)

Bases: Dataset[ImageSample]

Image dataset.

It can be created or loaded using one of the class methods:

dataset = ImageDataset.create()
dataset = ImageDataset.load()
dataset = ImageDataset.load_or_create()

Samples can be added to the dataset using various methods:

dataset.add_images_from_path(...)
dataset.add_samples_from_yolo(...)
dataset.add_samples_from_coco(...)
dataset.add_samples_from_coco_caption(...)
dataset.add_samples_from_labelformat(...)

The dataset's samples can be accessed directly by iterating over the dataset or slicing it:

dataset = ImageDataset.load("my_dataset")
first_ten_samples = dataset[:10]
for sample in dataset:
    print(sample.file_name)
    sample.metadata["new_key"] = "new_value"

To filter or order samples before iterating, use the query interface:

from lightly_studio.core.dataset_query.image_sample_field import ImageSampleField

dataset = ImageDataset.load("my_dataset")
query = dataset.match(ImageSampleField.width > 10).order_by(ImageSampleField.file_name)
for sample in query:
    ...

dataset_id property

dataset_id: UUID

Get the dataset ID.

name property

name: str

Get the dataset name.

__getitem__

__getitem__(key: _SliceType) -> DatasetQuery[T]

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

    key (_SliceType, required):
        A slice object (e.g., [10:20], [:50], [100:]).

Returns:

    DatasetQuery[T]: DatasetQuery with slice applied.

Raises:

    TypeError: If key is not a slice object.
    ValueError: If slice contains unsupported features or conflicts with existing slice.

__iter__

__iter__() -> Iterator[T]

Iterate over samples in the dataset.

add_images_from_path

add_images_from_path(
    path: PathLike,
    allowed_extensions: Iterable[str] | None = None,
    embed: bool = True,
    tag_depth: int = 0,
) -> None

Add images from the specified path to the dataset.

Parameters:

    path (PathLike, required):
        Path to the folder containing the images to add.
    allowed_extensions (Iterable[str] | None, default: None):
        An iterable container of allowed image file extensions.
    embed (bool, default: True):
        If True, generate embeddings for the newly added images.
    tag_depth (int, default: 0):
        Defines the tagging behavior based on directory depth.
        - tag_depth=0 (default): No automatic tagging is performed.
        - tag_depth=1: Automatically creates a tag for each image based on its parent directory's name.

Raises:

    NotImplementedError: If tag_depth > 1.
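
For example, to add all images from a folder and tag each image with the name of its parent directory (the path and dataset name are illustrative):

dataset = ImageDataset.load_or_create("animals")
dataset.add_images_from_path("data/animals", tag_depth=1)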

add_samples_from_coco

add_samples_from_coco(
    annotations_json: PathLike,
    images_path: PathLike,
    annotation_type: AnnotationType = AnnotationType.OBJECT_DETECTION,
    split: str | None = None,
    embed: bool = True,
) -> None

Load a dataset in COCO Object Detection format and store in DB.

Parameters:

    annotations_json (PathLike, required):
        Path to the COCO annotations JSON file.
    images_path (PathLike, required):
        Path to the folder containing the images.
    annotation_type (AnnotationType, default: OBJECT_DETECTION):
        The type of annotation to be loaded (e.g., 'ObjectDetection', 'InstanceSegmentation').
    split (str | None, default: None):
        Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
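
For example, to load COCO object detection annotations and tag all samples as the training split (paths and dataset name are illustrative):

dataset = ImageDataset.load_or_create("coco_train")
dataset.add_samples_from_coco(
    annotations_json="coco/annotations/instances_train2017.json",
    images_path="coco/train2017",
    split="train",
)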

add_samples_from_coco_caption

add_samples_from_coco_caption(
    annotations_json: PathLike, images_path: PathLike, split: str | None = None, embed: bool = True
) -> None

Load a dataset in COCO caption format and store in DB.

Parameters:

    annotations_json (PathLike, required):
        Path to the COCO caption JSON file.
    images_path (PathLike, required):
        Path to the folder containing the images.
    split (str | None, default: None):
        Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
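
For example, to load COCO captions for a validation split (paths are illustrative):

dataset.add_samples_from_coco_caption(
    annotations_json="coco/annotations/captions_val2017.json",
    images_path="coco/val2017",
    split="val",
)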

add_samples_from_labelformat

add_samples_from_labelformat(
    input_labels: ObjectDetectionInput | InstanceSegmentationInput,
    images_path: PathLike,
    embed: bool = True,
) -> None

Load a dataset from a labelformat object and store in database.

Parameters:

    input_labels (ObjectDetectionInput | InstanceSegmentationInput, required):
        The labelformat input object.
    images_path (PathLike, required):
        Path to the folder containing the images.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
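
For example, a labelformat input object can be passed directly. This sketch assumes labelformat's COCOObjectDetectionInput class and its input_file constructor argument; the paths are illustrative:

from pathlib import Path

from labelformat.formats import COCOObjectDetectionInput

# Build a labelformat object detection input from a COCO annotations file.
input_labels = COCOObjectDetectionInput(
    input_file=Path("coco/annotations/instances_train2017.json")
)
dataset.add_samples_from_labelformat(
    input_labels=input_labels,
    images_path="coco/train2017",
)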

add_samples_from_lightly

add_samples_from_lightly(
    input_folder: PathLike, images_rel_path: str = "../images", embed: bool = True
) -> None

Load a dataset in Lightly format and store in DB.

Parameters:

    input_folder (PathLike, required):
        Path to the folder containing the annotations/predictions.
    images_rel_path (str, default: '../images'):
        Relative path to the images folder from the label folder.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
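
For example, to load Lightly-format annotations that live next to an images folder (paths are illustrative):

dataset.add_samples_from_lightly(
    input_folder="data/lightly/annotations",
    images_rel_path="../images",
)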

add_samples_from_yolo

add_samples_from_yolo(
    data_yaml: PathLike, input_split: str | None = None, embed: bool = True
) -> None

Load a dataset in YOLO format and store in DB.

Parameters:

    data_yaml (PathLike, required):
        Path to the YOLO data.yaml file.
    input_split (str | None, default: None):
        The split to load (e.g., 'train', 'val', 'test'). If None, all available splits will be loaded and assigned a corresponding tag.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
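
For example, to load only the validation split of a YOLO dataset (the path and dataset name are illustrative):

dataset = ImageDataset.load_or_create("yolo_val")
dataset.add_samples_from_yolo(data_yaml="data/yolo/data.yaml", input_split="val")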

compute_similarity_metadata

compute_similarity_metadata(
    query_tag_name: str, embedding_model_name: str | None = None, metadata_name: str | None = None
) -> str

Computes similarity with respect to a query tag.

Parameters:

    query_tag_name (str, required):
        The name of the tag to use for the query.
    embedding_model_name (str | None, default: None):
        The name of the embedding model to use. If not given, the default embedding model is used.
    metadata_name (str | None, default: None):
        The name of the metadata to store the similarity values in. If not given, a name is generated automatically.

Returns:

    str: The name of the metadata storing the similarity values.
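
For example, to compute similarity with respect to samples tagged "query_images" and read the values back per sample (the tag name is illustrative; reading metadata with bracket access mirrors the write syntax shown above):

metadata_name = dataset.compute_similarity_metadata(query_tag_name="query_images")
for sample in dataset:
    print(sample.file_name, sample.metadata[metadata_name])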

compute_typicality_metadata

compute_typicality_metadata(
    embedding_model_name: str | None = None, metadata_name: str = "typicality"
) -> None

Computes typicality from embeddings using the K nearest neighbors.

Parameters:

    embedding_model_name (str | None, default: None):
        The name of the embedding model to use. If not given, the default embedding model is used.
    metadata_name (str, default: 'typicality'):
        The name of the metadata to store the typicality values in. If not given, the default name "typicality" is used.
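
For example, to compute typicality with the default embedding model and inspect the stored values:

dataset.compute_typicality_metadata()
for sample in dataset:
    print(sample.file_name, sample.metadata["typicality"])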

create classmethod

create(name: str | None = None) -> Self

Create a new dataset.

Parameters:

    name (str | None, default: None):
        The name of the dataset. If None, a default name is used.

export

export(query: DatasetQuery | None = None) -> DatasetExport

Return a DatasetExport instance which can export the dataset in various formats.

Parameters:

    query (DatasetQuery | None, default: None):
        The dataset query to export. If None, the default query self.query() is used.
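
For example, an export can be limited to a filtered query. This sketch only obtains the DatasetExport instance and does not assume any particular export method:

from lightly_studio.core.dataset_query.image_sample_field import ImageSampleField

exporter = dataset.export(query=dataset.match(ImageSampleField.width > 100))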

get_sample

get_sample(sample_id: UUID) -> ImageSample

Get a single sample from the dataset by its ID.

Parameters:

    sample_id (UUID, required):
        The UUID of the sample to retrieve.

Returns:

    ImageSample: A single ImageSample object.

Raises:

    IndexError: If no sample is found with the given sample_id.
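
For example, a sample can be re-fetched by its UUID. This sketch assumes an ImageSample exposes its UUID via a sample_id attribute, which is not confirmed above:

first_sample = next(iter(dataset))
same_sample = dataset.get_sample(first_sample.sample_id)  # sample_id attribute is an assumption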

load classmethod

load(name: str | None = None) -> Self

Load an existing dataset.

load_or_create classmethod

load_or_create(name: str | None = None) -> Self

Create a new image dataset or load an existing one.

Parameters:

    name (str | None, default: None):
        The name of the dataset. If None, a default name is used.

match

match(match_expression: MatchExpression) -> DatasetQuery[T]

Create a query on the dataset and store a field condition for filtering.

Parameters:

    match_expression (MatchExpression, required):
        Defines the filter.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.

order_by

order_by(*order_by: OrderByExpression) -> DatasetQuery[T]

Create a query on the dataset and store ordering expressions.

Parameters:

    *order_by (OrderByExpression):
        One or more ordering expressions, applied in order. For example, ordering first by sample width and then by file_name only reorders samples that share the same width by file_name.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.
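
For example, to order by width and break ties by file name:

from lightly_studio.core.dataset_query.image_sample_field import ImageSampleField

query = dataset.order_by(ImageSampleField.width, ImageSampleField.file_name)
for sample in query:
    print(sample.file_name)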

query

query() -> DatasetQuery[T]

Create a DatasetQuery for this dataset.

Returns:

    DatasetQuery[T]: A DatasetQuery instance for querying samples in this dataset.

sample_class staticmethod

sample_class() -> type[ImageSample]

Returns the sample class.

sample_type staticmethod

sample_type() -> SampleType

Returns the sample type.

slice

slice(offset: int = 0, limit: int | None = None) -> DatasetQuery[T]

Create a query on the dataset and apply offset and limit to results.

Parameters:

    offset (int, default: 0):
        Number of items to skip from the beginning.
    limit (int | None, default: None):
        Maximum number of items to return. None means no limit.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.
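
For example, to skip the first 100 samples and iterate over the next 25:

page = dataset.slice(offset=100, limit=25)
for sample in page:
    print(sample.file_name)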


VideoDataset

VideoDataset(collection: CollectionTable)

Bases: Dataset[VideoSample]

Video dataset.

It can be created or loaded using one of the class methods:

dataset = VideoDataset.create()
dataset = VideoDataset.load()
dataset = VideoDataset.load_or_create()

Samples can be added to the dataset using:

dataset.add_videos_from_path(...)

The dataset's samples can be accessed directly by iterating over the dataset or slicing it:

dataset = VideoDataset.load("my_dataset")
first_ten_samples = dataset[:10]
for sample in dataset:
    print(sample.file_name)
    sample.metadata["new_key"] = "new_value"

To filter or order samples before iterating, use the query interface:

from lightly_studio.core.dataset_query.video_sample_field import VideoSampleField

dataset = VideoDataset.load("my_dataset")
query = dataset.match(VideoSampleField.width > 10).order_by(VideoSampleField.file_name)
for sample in query:
    ...

dataset_id property

dataset_id: UUID

Get the dataset ID.

name property

name: str

Get the dataset name.

__getitem__

__getitem__(key: _SliceType) -> DatasetQuery[T]

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

    key (_SliceType, required):
        A slice object (e.g., [10:20], [:50], [100:]).

Returns:

    DatasetQuery[T]: DatasetQuery with slice applied.

Raises:

    TypeError: If key is not a slice object.
    ValueError: If slice contains unsupported features or conflicts with existing slice.

__iter__

__iter__() -> Iterator[T]

Iterate over samples in the dataset.

add_videos_from_path

add_videos_from_path(
    path: PathLike,
    allowed_extensions: Iterable[str] | None = None,
    num_decode_threads: int | None = None,
    embed: bool = True,
) -> None

Add video frames from the specified path to the dataset.

Parameters:

    path (PathLike, required):
        Path to the folder containing the videos to add.
    allowed_extensions (Iterable[str] | None, default: None):
        An iterable container of allowed video file extensions in lowercase, including the leading dot. If None, the default VIDEO_EXTENSIONS are used.
    num_decode_threads (int | None, default: None):
        Optional override for the number of FFmpeg decode threads. If omitted, the number of available CPU cores minus one (capped at 16) is used.
    embed (bool, default: True):
        If True, generate embeddings for the newly added videos.
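
For example, to add only MP4 videos from a folder (the path and dataset name are illustrative):

dataset = VideoDataset.load_or_create("my_videos")
dataset.add_videos_from_path("data/videos", allowed_extensions=[".mp4"])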

compute_similarity_metadata

compute_similarity_metadata(
    query_tag_name: str, embedding_model_name: str | None = None, metadata_name: str | None = None
) -> str

Computes similarity with respect to a query tag.

Parameters:

    query_tag_name (str, required):
        The name of the tag to use for the query.
    embedding_model_name (str | None, default: None):
        The name of the embedding model to use. If not given, the default embedding model is used.
    metadata_name (str | None, default: None):
        The name of the metadata to store the similarity values in. If not given, a name is generated automatically.

Returns:

    str: The name of the metadata storing the similarity values.

compute_typicality_metadata

compute_typicality_metadata(
    embedding_model_name: str | None = None, metadata_name: str = "typicality"
) -> None

Computes typicality from embeddings using the K nearest neighbors.

Parameters:

    embedding_model_name (str | None, default: None):
        The name of the embedding model to use. If not given, the default embedding model is used.
    metadata_name (str, default: 'typicality'):
        The name of the metadata to store the typicality values in. If not given, the default name "typicality" is used.

create classmethod

create(name: str | None = None) -> Self

Create a new dataset.

Parameters:

    name (str | None, default: None):
        The name of the dataset. If None, a default name is used.

get_sample

get_sample(sample_id: UUID) -> VideoSample

Get a single sample from the dataset by its ID.

Parameters:

    sample_id (UUID, required):
        The UUID of the sample to retrieve.

Returns:

    VideoSample: A single VideoSample object.

Raises:

    IndexError: If no sample is found with the given sample_id.

load classmethod

load(name: str | None = None) -> Self

Load an existing dataset.

load_or_create classmethod

load_or_create(name: str | None = None) -> Self

Create a new video dataset or load an existing one.

Parameters:

    name (str | None, default: None):
        The name of the dataset. If None, a default name is used.

match

match(match_expression: MatchExpression) -> DatasetQuery[T]

Create a query on the dataset and store a field condition for filtering.

Parameters:

    match_expression (MatchExpression, required):
        Defines the filter.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.

order_by

order_by(*order_by: OrderByExpression) -> DatasetQuery[T]

Create a query on the dataset and store ordering expressions.

Parameters:

    *order_by (OrderByExpression):
        One or more ordering expressions, applied in order. For example, ordering first by sample width and then by file_name only reorders samples that share the same width by file_name.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.

query

query() -> DatasetQuery[T]

Create a DatasetQuery for this dataset.

Returns:

    DatasetQuery[T]: A DatasetQuery instance for querying samples in this dataset.

sample_class staticmethod

sample_class() -> type[VideoSample]

Returns the sample class.

sample_type staticmethod

sample_type() -> SampleType

Returns the sample type.

slice

slice(offset: int = 0, limit: int | None = None) -> DatasetQuery[T]

Create a query on the dataset and apply offset and limit to results.

Parameters:

    offset (int, default: 0):
        Number of items to skip from the beginning.
    limit (int | None, default: None):
        Maximum number of items to return. None means no limit.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.