Dataset

LightlyStudio Dataset.

Dataset

Dataset(dataset: DatasetTable)

A LightlyStudio Dataset.

It can be created or loaded using one of the static methods:

dataset = Dataset.create()
dataset = Dataset.load()
dataset = Dataset.load_or_create()

Samples can be added to the dataset using various methods:

dataset.add_samples_from_path(...)
dataset.add_samples_from_yolo(...)
dataset.add_samples_from_coco(...)
dataset.add_samples_from_coco_caption(...)
dataset.add_samples_from_labelformat(...)

Dataset samples can be queried directly by iterating over the dataset or slicing it:

dataset = Dataset.load("my_dataset")
first_ten_samples = dataset[:10]
for sample in dataset:
    print(sample.file_name)
    sample.metadata["new_key"] = "new_value"

To filter or order samples first, use the query interface:

from lightly_studio.core.dataset_query.sample_field import SampleField

dataset = Dataset.load("my_dataset")
query = dataset.match(SampleField.width > 10).order_by(SampleField.file_name)
for sample in query:
    ...

dataset_id property

dataset_id: UUID

Get the dataset ID.

name property

name: str

Get the dataset name.

__getitem__

__getitem__(key: _SliceType) -> DatasetQuery

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

- key (_SliceType, required): A slice object (e.g., [10:20], [:50], [100:]).

Returns:

- DatasetQuery: DatasetQuery with the slice applied.

Raises:

- TypeError: If key is not a slice object.
- ValueError: If the slice contains unsupported features or conflicts with an existing slice.

__iter__

__iter__() -> Iterator[Sample]

Iterate over samples in the dataset.

add_samples_from_coco

add_samples_from_coco(
    annotations_json: PathLike,
    images_path: PathLike,
    annotation_type: AnnotationType = AnnotationType.OBJECT_DETECTION,
    split: str | None = None,
    embed: bool = True,
) -> None

Load a dataset in COCO Object Detection format and store in DB.

Parameters:

- annotations_json (PathLike, required): Path to the COCO annotations JSON file.
- images_path (PathLike, required): Path to the folder containing the images.
- annotation_type (AnnotationType, default OBJECT_DETECTION): The type of annotation to be loaded (e.g., 'ObjectDetection', 'InstanceSegmentation').
- split (str | None, default None): Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.
- embed (bool, default True): If True, generate embeddings for the newly added samples.
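A minimal usage sketch (the dataset name and file paths are placeholders; `Dataset` is used as in the examples above):

```python
dataset = Dataset.create("coco_example")

# Import object-detection annotations and tag every sample with 'train'.
dataset.add_samples_from_coco(
    annotations_json="annotations/instances_train.json",
    images_path="images/train",
    split="train",
)
```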

add_samples_from_coco_caption

add_samples_from_coco_caption(
    annotations_json: PathLike, images_path: PathLike, split: str | None = None, embed: bool = True
) -> None

Load a dataset in COCO caption format and store in DB.

Parameters:

- annotations_json (PathLike, required): Path to the COCO caption JSON file.
- images_path (PathLike, required): Path to the folder containing the images.
- split (str | None, default None): Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name.
- embed (bool, default True): If True, generate embeddings for the newly added samples.

add_samples_from_labelformat

add_samples_from_labelformat(
    input_labels: ObjectDetectionInput | InstanceSegmentationInput,
    images_path: PathLike,
    embed: bool = True,
) -> None

Load a dataset from a labelformat object and store in database.

Parameters:

- input_labels (ObjectDetectionInput | InstanceSegmentationInput, required): The labelformat input object.
- images_path (PathLike, required): Path to the folder containing the images.
- embed (bool, default True): If True, generate embeddings for the newly added samples.

add_samples_from_path

add_samples_from_path(
    path: PathLike, allowed_extensions: Iterable[str] | None = None, embed: bool = True
) -> None

Add samples from the specified path to the dataset.

Parameters:

- path (PathLike, required): Path to the folder containing the images to add.
- allowed_extensions (Iterable[str] | None, default None): An iterable of allowed image file extensions.
- embed (bool, default True): If True, generate embeddings for the newly added samples.
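A sketch restricting the import to common image types (the path and dataset name are placeholders, and the extension spelling with a leading dot is an assumption):

```python
dataset = Dataset.load_or_create("my_images")

# Only add JPEG and PNG files; skip embedding generation for a faster import.
dataset.add_samples_from_path(
    path="path/to/images",
    allowed_extensions=[".jpg", ".jpeg", ".png"],
    embed=False,
)
```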

add_samples_from_yolo

add_samples_from_yolo(
    data_yaml: PathLike, input_split: str | None = None, embed: bool = True
) -> None

Load a dataset in YOLO format and store in DB.

Parameters:

- data_yaml (PathLike, required): Path to the YOLO data.yaml file.
- input_split (str | None, default None): The split to load (e.g., 'train', 'val', 'test'). If None, all available splits will be loaded and assigned a corresponding tag.
- embed (bool, default True): If True, generate embeddings for the newly added samples.
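A sketch loading only the validation split (the data.yaml path and dataset name are placeholders):

```python
dataset = Dataset.create("yolo_example")

# Load just the 'val' split; with input_split=None every split listed in
# data.yaml would be loaded and tagged with its split name.
dataset.add_samples_from_yolo(
    data_yaml="path/to/data.yaml",
    input_split="val",
)
```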

compute_typicality_metadata

compute_typicality_metadata(
    embedding_model_name: str | None = None, metadata_name: str = "typicality"
) -> None

Compute typicality scores from embeddings using the K nearest neighbors.

Parameters:

- embedding_model_name (str | None, default None): The name of the embedding model to use. If not given, the default embedding model is used.
- metadata_name (str, default 'typicality'): The name of the metadata field to store the typicality values in. If not given, the default name "typicality" is used.
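The exact scoring formula is not documented here; one common definition of typicality scores each sample by the inverse of its mean distance to its k nearest neighbors in embedding space. A self-contained sketch of that idea (illustrative only, not LightlyStudio's implementation):

```python
import numpy as np

def typicality(embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    """Score each row by 1 / (mean distance to its k nearest neighbors)."""
    # Pairwise Euclidean distances, shape (n, n).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude each point's self-distance
    knn = np.sort(dists, axis=1)[:, :k]  # k smallest distances per row
    return 1.0 / (knn.mean(axis=1) + 1e-12)

# Three clustered points and one outlier: the outlier scores lowest.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
scores = typicality(emb, k=2)
assert scores.argmin() == 3
```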

create staticmethod

create(name: str | None = None) -> Dataset

Create a new dataset.

get_sample

get_sample(sample_id: UUID) -> Sample

Get a single sample from the dataset by its ID.

Parameters:

- sample_id (UUID, required): The UUID of the sample to retrieve.

Returns:

- Sample: The sample with the given ID.

Raises:

- IndexError: If no sample is found with the given sample_id.

load staticmethod

load(name: str | None = None) -> Dataset

Load an existing dataset.

load_or_create staticmethod

load_or_create(name: str | None = None) -> Dataset

Create a new dataset or load an existing one.
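The three constructors cover the typical lifecycle ("my_dataset" is a placeholder name):

```python
# First run: create a new dataset.
dataset = Dataset.create("my_dataset")

# Later runs: load the existing dataset by name.
dataset = Dataset.load("my_dataset")

# Or do either in one call: load it if it exists, create it otherwise.
dataset = Dataset.load_or_create("my_dataset")
```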

match

match(match_expression: MatchExpression) -> DatasetQuery

Create a query on the dataset and store a field condition for filtering.

Parameters:

- match_expression (MatchExpression, required): Defines the filter.

Returns:

- DatasetQuery: DatasetQuery for method chaining.
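A sketch chaining a match with ordering and slicing, reusing `SampleField` from the example above (chaining these same methods on the returned DatasetQuery is an assumption):

```python
from lightly_studio.core.dataset_query.sample_field import SampleField

dataset = Dataset.load("my_dataset")
query = (
    dataset.match(SampleField.width > 512)
    .order_by(SampleField.file_name)
    .slice(offset=0, limit=25)
)
for sample in query:
    print(sample.file_name)
```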

order_by

order_by(*order_by: OrderByExpression) -> DatasetQuery

Create a query on the dataset and store ordering expressions.

Parameters:

- order_by (OrderByExpression, default ()): One or more ordering expressions, applied in order. For example, ordering first by sample width and then by file_name sorts by file_name only among samples with the same width.

Returns:

- DatasetQuery: DatasetQuery for method chaining.
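This tie-breaking behaviour matches a stable multi-key sort, which can be illustrated without the library (the (width, file_name) tuples below are hypothetical stand-ins for samples):

```python
# Hypothetical (width, file_name) pairs standing in for samples.
samples = [(640, "b.jpg"), (480, "z.jpg"), (640, "a.jpg"), (480, "a.jpg")]

# Order by width first, then by file_name: file_name only decides the order
# among samples that share the same width.
ordered = sorted(samples, key=lambda s: (s[0], s[1]))
assert ordered == [
    (480, "a.jpg"),
    (480, "z.jpg"),
    (640, "a.jpg"),
    (640, "b.jpg"),
]
```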

query

query() -> DatasetQuery

Create a DatasetQuery for this dataset.

Returns:

- DatasetQuery: A DatasetQuery instance for querying samples in this dataset.

slice

slice(offset: int = 0, limit: int | None = None) -> DatasetQuery

Create a query on the dataset and apply offset and limit to results.

Parameters:

- offset (int, default 0): Number of items to skip from the beginning.
- limit (int | None, default None): Maximum number of items to return (None = no limit).

Returns:

- DatasetQuery: DatasetQuery for method chaining.