
Dataset


ImageDataset

ImageDataset(collection: CollectionTable)

Bases: Dataset[ImageSample]

Image dataset.

It can be created or loaded using one of the class methods:

dataset = ImageDataset.create()
dataset = ImageDataset.load()
dataset = ImageDataset.load_or_create()

Samples can be added to the dataset using various methods:

dataset.add_images_from_path(...)
dataset.add_samples_from_yolo(...)
dataset.add_samples_from_coco(...)
dataset.add_samples_from_coco_caption(...)
dataset.add_samples_from_labelformat(...)

The dataset's samples can be accessed directly by iterating over the dataset or slicing it:

dataset = ImageDataset.load("my_dataset")
first_ten_samples = dataset[:10]
for sample in dataset:
    print(sample.file_name)
    sample.metadata["new_key"] = "new_value"

To filter or order samples before iterating, use the query interface:

from lightly_studio.core.dataset_query.image_sample_field import ImageSampleField

dataset = ImageDataset.load("my_dataset")
query = dataset.match(ImageSampleField.width > 10).order_by(ImageSampleField.file_name)
for sample in query:
    ...

dataset_id property

dataset_id: UUID

Get the dataset ID.

name property

name: str

Get the dataset name.

__getitem__

__getitem__(key: _SliceType) -> DatasetQuery[T]

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

    key (_SliceType, required):
        A slice object (e.g., [10:20], [:50], [100:]).

Returns:

    DatasetQuery[T]: DatasetQuery with slice applied.

Raises:

    TypeError: If key is not a slice object.
    ValueError: If slice contains unsupported features or conflicts with existing slice.

__iter__

__iter__() -> Iterator[T]

Iterate over samples in the dataset.

add_images_from_path

add_images_from_path(
    path: PathLike,
    allowed_extensions: Iterable[str] | None = None,
    embed: bool = True,
    tag_depth: int = 0,
) -> None

Add images from the specified path to the dataset.

Parameters:

    path (PathLike, required):
        Path to the folder containing the images to add.
    allowed_extensions (Iterable[str] | None, default: None):
        An iterable container of allowed image file extensions.
    embed (bool, default: True):
        If True, generate embeddings for the newly added images.
    tag_depth (int, default: 0):
        Defines the tagging behavior based on directory depth.
        - tag_depth=0 (default): No automatic tagging is performed.
        - tag_depth=1: Automatically creates a tag for each image based on its parent directory's name.

Raises:

    NotImplementedError: If tag_depth > 1.
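
For example, to add all images from a folder and tag each image with the name of its parent directory (the path and dataset name are illustrative):

dataset = ImageDataset.load_or_create("animals")
dataset.add_images_from_path("data/animals", tag_depth=1)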

add_samples_from_coco

add_samples_from_coco(
    annotations_json: PathLike,
    images_path: PathLike,
    annotation_type: AnnotationType = AnnotationType.OBJECT_DETECTION,
    split: str | None = None,
    embed: bool = True,
) -> None

Load a dataset in COCO Object Detection format and store in DB.

Parameters:

    annotations_json (PathLike, required):
        Path to the COCO annotations JSON file.
    images_path (PathLike, required):
        Path to the folder containing the images.
    annotation_type (AnnotationType, default: OBJECT_DETECTION):
        The type of annotation to be loaded (e.g., 'ObjectDetection', 'InstanceSegmentation').
    split (str | None, default: None):
        Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
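
For example, to load COCO object detection annotations and tag all samples as the training split (paths and dataset name are illustrative):

dataset = ImageDataset.load_or_create("coco_train")
dataset.add_samples_from_coco(
    annotations_json="coco/annotations/instances_train2017.json",
    images_path="coco/train2017",
    split="train",
)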

add_samples_from_coco_caption

add_samples_from_coco_caption(
    annotations_json: PathLike, images_path: PathLike, split: str | None = None, embed: bool = True
) -> None

Load a dataset in COCO caption format and store in DB.

Parameters:

    annotations_json (PathLike, required):
        Path to the COCO caption JSON file.
    images_path (PathLike, required):
        Path to the folder containing the images.
    split (str | None, default: None):
        Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
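
For example, to load COCO captions for a validation split (paths are illustrative):

dataset.add_samples_from_coco_caption(
    annotations_json="coco/annotations/captions_val2017.json",
    images_path="coco/val2017",
    split="val",
)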

add_samples_from_labelformat

add_samples_from_labelformat(
    input_labels: ObjectDetectionInput | InstanceSegmentationInput,
    images_path: PathLike,
    embed: bool = True,
) -> None

Load a dataset from a labelformat object and store in database.

Parameters:

    input_labels (ObjectDetectionInput | InstanceSegmentationInput, required):
        The labelformat input object.
    images_path (PathLike, required):
        Path to the folder containing the images.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
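
For example, a labelformat input object can be passed directly. This sketch assumes labelformat's COCOObjectDetectionInput class and its input_file constructor argument; the paths are illustrative:

from pathlib import Path

from labelformat.formats import COCOObjectDetectionInput

# Build a labelformat object detection input from a COCO annotations file.
input_labels = COCOObjectDetectionInput(
    input_file=Path("coco/annotations/instances_train2017.json")
)
dataset.add_samples_from_labelformat(
    input_labels=input_labels,
    images_path="coco/train2017",
)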

add_samples_from_lightly

add_samples_from_lightly(
    input_folder: PathLike, images_rel_path: str = "../images", embed: bool = True
) -> None

Load a dataset in Lightly format and store in DB.

Parameters:

    input_folder (PathLike, required):
        Path to the folder containing the annotations/predictions.
    images_rel_path (str, default: '../images'):
        Relative path to the images folder from the label folder.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
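
For example, to load Lightly-format annotations that live next to an images folder (paths are illustrative):

dataset.add_samples_from_lightly(
    input_folder="data/lightly/annotations",
    images_rel_path="../images",
)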

add_samples_from_yolo

add_samples_from_yolo(
    data_yaml: PathLike, input_split: str | None = None, embed: bool = True
) -> None

Load a dataset in YOLO format and store in DB.

Parameters:

    data_yaml (PathLike, required):
        Path to the YOLO data.yaml file.
    input_split (str | None, default: None):
        The split to load (e.g., 'train', 'val', 'test'). If None, all available splits will be loaded and assigned a corresponding tag.
    embed (bool, default: True):
        If True, generate embeddings for the newly added samples.
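
For example, to load only the validation split of a YOLO dataset (the path and dataset name are illustrative):

dataset = ImageDataset.load_or_create("yolo_val")
dataset.add_samples_from_yolo(data_yaml="data/yolo/data.yaml", input_split="val")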

compute_similarity_metadata

compute_similarity_metadata(
    query_tag_name: str, embedding_model_name: str | None = None, metadata_name: str | None = None
) -> str

Computes similarity with respect to a query tag.

Parameters:

    query_tag_name (str, required):
        The name of the tag to use for the query.
    embedding_model_name (str | None, default: None):
        The name of the embedding model to use. If not given, the default embedding model is used.
    metadata_name (str | None, default: None):
        The name of the metadata to store the similarity values in. If not given, a name is generated automatically.

Returns:

    str: The name of the metadata storing the similarity values.
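
For example, to compute similarity with respect to samples tagged "query_images" and read the values back per sample (the tag name is illustrative; reading metadata with bracket access mirrors the write syntax shown above):

metadata_name = dataset.compute_similarity_metadata(query_tag_name="query_images")
for sample in dataset:
    print(sample.file_name, sample.metadata[metadata_name])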

compute_typicality_metadata

compute_typicality_metadata(
    embedding_model_name: str | None = None, metadata_name: str = "typicality"
) -> None

Computes typicality from embeddings using the K nearest neighbors.

Parameters:

    embedding_model_name (str | None, default: None):
        The name of the embedding model to use. If not given, the default embedding model is used.
    metadata_name (str, default: 'typicality'):
        The name of the metadata to store the typicality values in. If not given, the default name "typicality" is used.
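
For example, to compute typicality with the default embedding model and inspect the stored values:

dataset.compute_typicality_metadata()
for sample in dataset:
    print(sample.file_name, sample.metadata["typicality"])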

create classmethod

create(name: str | None = None) -> Self

Create a new dataset.

Parameters:

    name (str | None, default: None):
        The name of the dataset. If None, a default name is used.

export

export(query: DatasetQuery | None = None) -> DatasetExport

Return a DatasetExport instance which can export the dataset in various formats.

Parameters:

    query (DatasetQuery | None, default: None):
        The dataset query to export. If None, the default query self.query() is used.
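
For example, an export can be limited to a filtered query. This sketch only obtains the DatasetExport instance and does not assume any particular export method:

from lightly_studio.core.dataset_query.image_sample_field import ImageSampleField

exporter = dataset.export(query=dataset.match(ImageSampleField.width > 100))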

get_sample

get_sample(sample_id: UUID) -> ImageSample

Get a single sample from the dataset by its ID.

Parameters:

    sample_id (UUID, required):
        The UUID of the sample to retrieve.

Returns:

    ImageSample: A single ImageSample object.

Raises:

    IndexError: If no sample is found with the given sample_id.
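
For example, a sample can be re-fetched by its UUID. This sketch assumes an ImageSample exposes its UUID via a sample_id attribute, which is not confirmed above:

first_sample = next(iter(dataset))
same_sample = dataset.get_sample(first_sample.sample_id)  # sample_id attribute is an assumption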

load classmethod

load(name: str | None = None) -> Self

Load an existing dataset.

load_or_create classmethod

load_or_create(name: str | None = None) -> Self

Create a new image dataset or load an existing one.

Parameters:

    name (str | None, default: None):
        The name of the dataset. If None, a default name is used.

match

match(match_expression: MatchExpression) -> DatasetQuery[T]

Create a query on the dataset and store a field condition for filtering.

Parameters:

    match_expression (MatchExpression, required):
        Defines the filter.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.

order_by

order_by(*order_by: OrderByExpression) -> DatasetQuery[T]

Create a query on the dataset and store ordering expressions.

Parameters:

    *order_by (OrderByExpression):
        One or more ordering expressions, applied in order. For example, ordering first by sample width and then by file_name only reorders samples that share the same width by file_name.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.
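
For example, to order by width and break ties by file name:

from lightly_studio.core.dataset_query.image_sample_field import ImageSampleField

query = dataset.order_by(ImageSampleField.width, ImageSampleField.file_name)
for sample in query:
    print(sample.file_name)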

query

query() -> DatasetQuery[T]

Create a DatasetQuery for this dataset.

Returns:

    DatasetQuery[T]: A DatasetQuery instance for querying samples in this dataset.

sample_class staticmethod

sample_class() -> type[ImageSample]

Returns the sample class.

sample_type staticmethod

sample_type() -> SampleType

Returns the sample type.

slice

slice(offset: int = 0, limit: int | None = None) -> DatasetQuery[T]

Create a query on the dataset and apply offset and limit to results.

Parameters:

    offset (int, default: 0):
        Number of items to skip from the beginning.
    limit (int | None, default: None):
        Maximum number of items to return. None means no limit.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.
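
For example, to skip the first 100 samples and iterate over the next 25:

page = dataset.slice(offset=100, limit=25)
for sample in page:
    print(sample.file_name)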


VideoDataset

VideoDataset(collection: CollectionTable)

Bases: Dataset[VideoSample]

Video dataset.

It can be created or loaded using one of the class methods:

dataset = VideoDataset.create()
dataset = VideoDataset.load()
dataset = VideoDataset.load_or_create()

Samples can be added to the dataset using:

dataset.add_videos_from_path(...)

The dataset's samples can be accessed directly by iterating over the dataset or slicing it:

dataset = VideoDataset.load("my_dataset")
first_ten_samples = dataset[:10]
for sample in dataset:
    print(sample.file_name)
    sample.metadata["new_key"] = "new_value"

To filter or order samples before iterating, use the query interface:

from lightly_studio.core.dataset_query.video_sample_field import VideoSampleField

dataset = VideoDataset.load("my_dataset")
query = dataset.match(VideoSampleField.width > 10).order_by(VideoSampleField.file_name)
for sample in query:
    ...

dataset_id property

dataset_id: UUID

Get the dataset ID.

name property

name: str

Get the dataset name.

__getitem__

__getitem__(key: _SliceType) -> DatasetQuery[T]

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

    key (_SliceType, required):
        A slice object (e.g., [10:20], [:50], [100:]).

Returns:

    DatasetQuery[T]: DatasetQuery with slice applied.

Raises:

    TypeError: If key is not a slice object.
    ValueError: If slice contains unsupported features or conflicts with existing slice.

__iter__

__iter__() -> Iterator[T]

Iterate over samples in the dataset.

add_videos_from_path

add_videos_from_path(
    path: PathLike,
    allowed_extensions: Iterable[str] | None = None,
    num_decode_threads: int | None = None,
    embed: bool = True,
) -> None

Add video frames from the specified path to the dataset.

Parameters:

    path (PathLike, required):
        Path to the folder containing the videos to add.
    allowed_extensions (Iterable[str] | None, default: None):
        An iterable container of allowed video file extensions in lowercase, including the leading dot. If None, the default VIDEO_EXTENSIONS are used.
    num_decode_threads (int | None, default: None):
        Optional override for the number of FFmpeg decode threads. If omitted, the number of available CPU cores minus one (capped at 16) is used.
    embed (bool, default: True):
        If True, generate embeddings for the newly added videos.
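
For example, to add only MP4 videos from a folder (the path and dataset name are illustrative):

dataset = VideoDataset.load_or_create("my_videos")
dataset.add_videos_from_path("data/videos", allowed_extensions=[".mp4"])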

compute_similarity_metadata

compute_similarity_metadata(
    query_tag_name: str, embedding_model_name: str | None = None, metadata_name: str | None = None
) -> str

Computes similarity with respect to a query tag.

Parameters:

    query_tag_name (str, required):
        The name of the tag to use for the query.
    embedding_model_name (str | None, default: None):
        The name of the embedding model to use. If not given, the default embedding model is used.
    metadata_name (str | None, default: None):
        The name of the metadata to store the similarity values in. If not given, a name is generated automatically.

Returns:

    str: The name of the metadata storing the similarity values.

compute_typicality_metadata

compute_typicality_metadata(
    embedding_model_name: str | None = None, metadata_name: str = "typicality"
) -> None

Computes typicality from embeddings using the K nearest neighbors.

Parameters:

    embedding_model_name (str | None, default: None):
        The name of the embedding model to use. If not given, the default embedding model is used.
    metadata_name (str, default: 'typicality'):
        The name of the metadata to store the typicality values in. If not given, the default name "typicality" is used.

create classmethod

create(name: str | None = None) -> Self

Create a new dataset.

Parameters:

    name (str | None, default: None):
        The name of the dataset. If None, a default name is used.

get_sample

get_sample(sample_id: UUID) -> VideoSample

Get a single sample from the dataset by its ID.

Parameters:

    sample_id (UUID, required):
        The UUID of the sample to retrieve.

Returns:

    VideoSample: A single VideoSample object.

Raises:

    IndexError: If no sample is found with the given sample_id.

load classmethod

load(name: str | None = None) -> Self

Load an existing dataset.

load_or_create classmethod

load_or_create(name: str | None = None) -> Self

Create a new video dataset or load an existing one.

Parameters:

    name (str | None, default: None):
        The name of the dataset. If None, a default name is used.

match

match(match_expression: MatchExpression) -> DatasetQuery[T]

Create a query on the dataset and store a field condition for filtering.

Parameters:

    match_expression (MatchExpression, required):
        Defines the filter.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.

order_by

order_by(*order_by: OrderByExpression) -> DatasetQuery[T]

Create a query on the dataset and store ordering expressions.

Parameters:

    *order_by (OrderByExpression):
        One or more ordering expressions, applied in order. For example, ordering first by sample width and then by file_name only reorders samples that share the same width by file_name.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.

query

query() -> DatasetQuery[T]

Create a DatasetQuery for this dataset.

Returns:

    DatasetQuery[T]: A DatasetQuery instance for querying samples in this dataset.

sample_class staticmethod

sample_class() -> type[VideoSample]

Returns the sample class.

sample_type staticmethod

sample_type() -> SampleType

Returns the sample type.

slice

slice(offset: int = 0, limit: int | None = None) -> DatasetQuery[T]

Create a query on the dataset and apply offset and limit to results.

Parameters:

    offset (int, default: 0):
        Number of items to skip from the beginning.
    limit (int | None, default: None):
        Maximum number of items to return. None means no limit.

Returns:

    DatasetQuery[T]: DatasetQuery for method chaining.