Dataset¶
LightlyStudio Dataset.
Dataset ¶

```python
Dataset(dataset: DatasetTable)
```

A LightlyStudio Dataset.

It can be created or loaded using one of the static methods:

```python
dataset = Dataset.create()
dataset = Dataset.load()
dataset = Dataset.load_or_create()
```
Samples can be added to the dataset using various methods:

```python
dataset.add_samples_from_path(...)
dataset.add_samples_from_yolo(...)
dataset.add_samples_from_coco(...)
dataset.add_samples_from_coco_caption(...)
dataset.add_samples_from_labelformat(...)
```
The dataset samples can be queried directly by iterating over it or slicing it:

```python
dataset = Dataset.load("my_dataset")
first_ten_samples = dataset[:10]
for sample in dataset:
    print(sample.file_name)
    sample.metadata["new_key"] = "new_value"
```
For filtering or ordering samples first, use the query interface:

```python
from lightly_studio.core.dataset_query.sample_field import SampleField

dataset = Dataset.load("my_dataset")
query = dataset.match(SampleField.width > 10).order_by(SampleField.file_name)
for sample in query:
    ...
```
__getitem__ ¶

```python
__getitem__(key: _SliceType) -> DatasetQuery
```

Create a query on the dataset and enable bracket notation for slicing.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `key` | `_SliceType` | A slice object (e.g., `[10:20]`, `[:50]`, `[100:]`). | *required* |

Returns:

| Type | Description |
|---|---|
| `DatasetQuery` | `DatasetQuery` with the slice applied. |

Raises:

| Type | Description |
|---|---|
| `TypeError` | If `key` is not a slice object. |
| `ValueError` | If the slice contains unsupported features or conflicts with an existing slice. |
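Because slicing returns a `DatasetQuery` rather than a list, the result can be iterated lazily. A minimal sketch, assuming a dataset named "my_dataset" already exists:

```python
dataset = Dataset.load("my_dataset")

# Bracket notation delegates to __getitem__ and yields a DatasetQuery.
first_fifty = dataset[:50]
middle = dataset[10:20]

for sample in first_fifty:
    print(sample.file_name)
```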
add_samples_from_coco ¶

```python
add_samples_from_coco(
    annotations_json: PathLike,
    images_path: PathLike,
    annotation_type: AnnotationType = AnnotationType.OBJECT_DETECTION,
    split: str | None = None,
    embed: bool = True,
) -> None
```

Load a dataset in COCO Object Detection format and store it in the DB.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `annotations_json` | `PathLike` | Path to the COCO annotations JSON file. | *required* |
| `images_path` | `PathLike` | Path to the folder containing the images. | *required* |
| `annotation_type` | `AnnotationType` | The type of annotation to be loaded (e.g., 'ObjectDetection', 'InstanceSegmentation'). | `OBJECT_DETECTION` |
| `split` | `str \| None` | Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name. | `None` |
| `embed` | `bool` | If True, generate embeddings for the newly added samples. | `True` |
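A usage sketch; the file paths are placeholders and the import location of `AnnotationType` is an assumption about your setup:

```python
# Import path for AnnotationType is an assumption; adjust to your install.
from lightly_studio import AnnotationType, Dataset

dataset = Dataset.load_or_create("coco_example")
dataset.add_samples_from_coco(
    annotations_json="annotations/instances_train.json",  # placeholder path
    images_path="images/train",                           # placeholder path
    annotation_type=AnnotationType.OBJECT_DETECTION,
    split="train",  # tag all added samples with 'train'
)
```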
add_samples_from_coco_caption ¶

```python
add_samples_from_coco_caption(
    annotations_json: PathLike,
    images_path: PathLike,
    split: str | None = None,
    embed: bool = True,
) -> None
```

Load a dataset in COCO caption format and store it in the DB.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `annotations_json` | `PathLike` | Path to the COCO caption JSON file. | *required* |
| `images_path` | `PathLike` | Path to the folder containing the images. | *required* |
| `split` | `str \| None` | Optional split name to tag samples (e.g., 'train', 'val'). If provided, all samples will be tagged with this name. | `None` |
| `embed` | `bool` | If True, generate embeddings for the newly added samples. | `True` |
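A usage sketch with placeholder paths, mirroring the object-detection loader but for caption annotations:

```python
dataset = Dataset.load_or_create("coco_captions_example")
dataset.add_samples_from_coco_caption(
    annotations_json="annotations/captions_val.json",  # placeholder path
    images_path="images/val",                          # placeholder path
    split="val",
    embed=False,  # skip embedding generation for a faster import
)
```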
add_samples_from_labelformat ¶

```python
add_samples_from_labelformat(
    input_labels: ObjectDetectionInput | InstanceSegmentationInput,
    images_path: PathLike,
    embed: bool = True,
) -> None
```

Load a dataset from a labelformat object and store it in the database.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_labels` | `ObjectDetectionInput \| InstanceSegmentationInput` | The labelformat input object. | *required* |
| `images_path` | `PathLike` | Path to the folder containing the images. | *required* |
| `embed` | `bool` | If True, generate embeddings for the newly added samples. | `True` |
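A sketch of the intended flow: build a labelformat input object, then hand it to the dataset. The exact labelformat class name and constructor arguments are assumptions; consult the labelformat documentation for the loader matching your annotation format:

```python
# Class name and arguments are assumptions about the labelformat API.
from labelformat.formats import COCOObjectDetectionInput

input_labels = COCOObjectDetectionInput(
    input_file="annotations/instances_train.json",  # placeholder path
)

dataset = Dataset.load_or_create("labelformat_example")
dataset.add_samples_from_labelformat(
    input_labels=input_labels,
    images_path="images/train",  # placeholder path
)
```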
add_samples_from_path ¶

```python
add_samples_from_path(
    path: PathLike,
    allowed_extensions: Iterable[str] | None = None,
    embed: bool = True,
) -> None
```

Add samples from the specified path to the dataset.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `PathLike` | Path to the folder containing the images to add. | *required* |
| `allowed_extensions` | `Iterable[str] \| None` | An iterable of allowed image file extensions. | `None` |
| `embed` | `bool` | If True, generate embeddings for the newly added samples. | `True` |
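A usage sketch with a placeholder folder; the extension strings' exact expected form (with or without the leading dot) is an assumption:

```python
dataset = Dataset.load_or_create("my_images")
dataset.add_samples_from_path(
    path="data/images",                   # placeholder folder
    allowed_extensions=[".jpg", ".png"],  # only add these file types (format assumed)
)
```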
add_samples_from_yolo ¶

```python
add_samples_from_yolo(
    data_yaml: PathLike,
    input_split: str | None = None,
    embed: bool = True,
) -> None
```

Load a dataset in YOLO format and store it in the DB.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_yaml` | `PathLike` | Path to the YOLO data.yaml file. | *required* |
| `input_split` | `str \| None` | The split to load (e.g., 'train', 'val', 'test'). If None, all available splits will be loaded and assigned a corresponding tag. | `None` |
| `embed` | `bool` | If True, generate embeddings for the newly added samples. | `True` |
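A usage sketch with a placeholder path, loading only the train split:

```python
dataset = Dataset.load_or_create("yolo_example")
dataset.add_samples_from_yolo(
    data_yaml="datasets/data.yaml",  # placeholder path to the YOLO config
    input_split="train",             # omit to load all splits with tags
)
```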
compute_typicality_metadata ¶

```python
compute_typicality_metadata(
    embedding_model_name: str | None = None,
    metadata_name: str = "typicality",
) -> None
```

Compute typicality from embeddings, based on each sample's K nearest neighbors.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `embedding_model_name` | `str \| None` | The name of the embedding model to use. If not given, the default embedding model is used. | `None` |
| `metadata_name` | `str` | The name of the metadata to store the typicality values in. If not given, the default name "typicality" is used. | `'typicality'` |
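A sketch of computing typicality and reading it back through each sample's metadata, assuming embeddings already exist for the dataset:

```python
dataset = Dataset.load("my_dataset")
dataset.compute_typicality_metadata()  # stores values under the "typicality" key

for sample in dataset[:5]:
    print(sample.file_name, sample.metadata["typicality"])
```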
get_sample ¶

```python
get_sample(sample_id: UUID) -> Sample
```

Get a single sample from the dataset by its ID.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `sample_id` | `UUID` | The UUID of the sample to retrieve. | *required* |

Returns:

| Type | Description |
|---|---|
| `Sample` | A single `Sample` object. |

Raises:

| Type | Description |
|---|---|
| `IndexError` | If no sample is found with the given `sample_id`. |
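A sketch of a round trip: take any sample from the dataset, then fetch it again by ID. The attribute holding the sample's UUID is an assumption; check the `Sample` reference for the actual name:

```python
dataset = Dataset.load("my_dataset")

first = next(iter(dataset))              # grab any sample
same = dataset.get_sample(first.sample_id)  # attribute name is an assumption
```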
load_or_create staticmethod ¶

```python
load_or_create(name: str | None = None) -> Dataset
```

Create a new dataset or load an existing one.
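This makes scripts idempotent: the first run creates the dataset, later runs load it. For example:

```python
# First call creates "my_dataset"; subsequent calls load the existing one.
dataset = Dataset.load_or_create("my_dataset")
```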
match ¶

```python
match(match_expression: MatchExpression) -> DatasetQuery
```

Create a query on the dataset and store a field condition for filtering.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `match_expression` | `MatchExpression` | Defines the filter. | *required* |

Returns:

| Type | Description |
|---|---|
| `DatasetQuery` | `DatasetQuery` for method chaining. |
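A sketch of filtering by a sample field, using the `SampleField` comparison syntax shown in the class overview; the threshold is arbitrary:

```python
from lightly_studio.core.dataset_query.sample_field import SampleField

dataset = Dataset.load("my_dataset")
wide_samples = dataset.match(SampleField.width > 1024)
for sample in wide_samples:
    print(sample.file_name)
```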
order_by ¶

```python
order_by(*order_by: OrderByExpression) -> DatasetQuery
```

Create a query on the dataset and store ordering expressions.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `order_by` | `OrderByExpression` | One or more ordering expressions, applied in order. E.g., ordering first by sample width and then by file_name sorts samples by width and breaks ties (equal widths) by file_name. | `()` |

Returns:

| Type | Description |
|---|---|
| `DatasetQuery` | `DatasetQuery` for method chaining. |
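A sketch of a two-key ordering; whether a bare `SampleField` is accepted directly as an `OrderByExpression` (versus needing an explicit ascending/descending wrapper) is an assumption:

```python
from lightly_studio.core.dataset_query.sample_field import SampleField

dataset = Dataset.load("my_dataset")
# Order by width; samples with equal width are ordered by file_name.
ordered = dataset.order_by(SampleField.width, SampleField.file_name)
for sample in ordered:
    print(sample.file_name)
```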
query ¶

```python
query() -> DatasetQuery
```

Create a DatasetQuery for this dataset.

Returns:

| Type | Description |
|---|---|
| `DatasetQuery` | A `DatasetQuery` instance for querying samples in this dataset. |
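This is the explicit starting point for building a query; `match`, `order_by`, and slicing are shorthands that create one implicitly. A minimal sketch:

```python
dataset = Dataset.load("my_dataset")
query = dataset.query()
for sample in query:
    print(sample.file_name)
```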
slice ¶

```python
slice(offset: int = 0, limit: int | None = None) -> DatasetQuery
```

Create a query on the dataset and apply offset and limit to the results.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `offset` | `int` | Number of items to skip from the beginning (default: 0). | `0` |
| `limit` | `int \| None` | Maximum number of items to return (None = no limit). | `None` |

Returns:

| Type | Description |
|---|---|
| `DatasetQuery` | `DatasetQuery` for method chaining. |
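A sketch of offset/limit pagination; the equivalence to bracket slicing follows from the `__getitem__` documentation above:

```python
dataset = Dataset.load("my_dataset")
page = dataset.slice(offset=20, limit=10)  # same window as dataset[20:30]
for sample in page:
    print(sample.file_name)
```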