Train Settings¶
This page covers the settings available for training tasks like object detection and segmentation in LightlyTrain. For settings related to pretraining and distillation, please refer to the Pretrain & Distill Settings page.
| Name | Type | Default | Description |
|---|---|---|---|
| `out` | `str` | — | Output directory where checkpoints, logs, and exported models are written. |
| `data` | `dict` or `str` | — | Dataset configuration dict, or path to a YAML file containing the dataset configuration. |
| `model` | `str` | — | Model identifier (e.g. `"dinov3/vitt16-ltdetr-coco"`) or path to a checkpoint file. |
| `model_args` | `dict` | `None` | Task/model-specific training hyperparameters. |
| `steps` | `int` | `"auto"` | Number of training steps. |
| `precision` | `str` | `"bf16-mixed"` | Numeric precision mode (e.g. `"bf16-mixed"`, `"32-true"`). |
| `batch_size` | `int` | `"auto"` | Global batch size across all devices. |
| `num_workers` | `int` | `"auto"` | DataLoader worker processes per device. |
| `devices` | `int`, `list`, or `str` | `"auto"` | Devices to use for training. |
| `num_nodes` | `int` | `1` | Number of nodes for distributed training. |
| `resume_interrupted` | `bool` | `False` | Resume an interrupted/crashed run from the same `out` directory. |
| `overwrite` | `bool` | `False` | If `True`, overwrite the contents of an existing `out` directory. |
| `accelerator` | `str` | `"auto"` | Hardware backend: `"cpu"`, `"gpu"`, `"mps"`, or `"auto"`. |
| `strategy` | `str` | `"auto"` | Distributed training strategy (e.g. `"ddp"`). |
| `seed` | `int` | `0` | Random seed for reproducibility. Set to `None` for a random seed. |
| `logger_args` | `dict` | `None` | Logger configuration dict. |
| `transform_args` | `dict` | `None` | Data transform configuration (e.g. image size, normalization). |
| `save_checkpoint_args` | `dict` | `None` | Checkpoint saving configuration (e.g. save frequency). |
Tip
LightlyTrain automatically selects suitable default values based on the chosen model, dataset, and hardware. You only need to set parameters that you want to customize.
Look for the Resolved Args dictionary in the train.log file in the output directory
of your run to see the final settings that were applied. This will include any
overrides, automatically resolved values, and model-specific settings that are not
listed on this page.
Output¶
out¶
The output directory where the model checkpoints and logs are saved. Create a new
directory for each run! LightlyTrain will raise an error if the output directory already
exists unless the overwrite flag is set to True.
overwrite¶
Set to True to overwrite the contents of an existing out directory. By default,
LightlyTrain raises an error if the specified output directory already exists to prevent
accidental data loss.
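Combining the two settings, a run that deliberately reuses an existing output directory looks like this (the directory name is illustrative):

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    out="out/my_experiment",  # Raises an error if this directory already exists...
    overwrite=True,           # ...unless overwrite is enabled.
)
```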
Data¶
data¶
Dataset configuration. You can either provide a dictionary with dataset parameters or a path to a YAML file containing the dataset configuration. Each task (detection, segmentation, etc.) has different dataset requirements. Refer to the task documentation for details on the expected dataset structure and configuration options.
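As a sketch, the dataset configuration can be passed as a file path (the path below is illustrative; the expected keys inside the file depend on the task):

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    data="configs/my_dataset.yaml",  # Path to a YAML dataset configuration.
)
```

Alternatively, pass the same configuration inline as a dictionary with the task-specific keys described on the respective task page.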
batch_size¶
Global batch size across all devices. The batch size per device/GPU is computed as
batch_size / (num_devices * num_nodes). By default, batch_size is set to "auto",
which selects a model-dependent default value.
num_workers¶
Number of background worker processes per device used by the PyTorch DataLoader. By
default, this is set to "auto", which selects a value based on the number of available
CPU cores.
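The per-device batch size from the formula above can be checked with a quick calculation (the numbers are illustrative, not LightlyTrain defaults):

```python
# Illustrative values; LightlyTrain's "auto" mode picks suitable defaults for you.
batch_size = 128   # Global batch size across all devices.
num_devices = 4    # Devices (e.g. GPUs) per node.
num_nodes = 1      # Single-node training.

# Per-device batch size as described above: batch_size / (num_devices * num_nodes).
per_device_batch_size = batch_size // (num_devices * num_nodes)
print(per_device_batch_size)  # -> 32
```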
Model¶
model¶
Model identifier (for example "dinov3/vitt16-ltdetr-coco") or the path to a checkpoint
or exported model file. LightlyTrain automatically downloads weights if needed.
To resume from a crashed or interrupted run, use the
resume_interrupted setting instead of pointing model to a
previous checkpoint. This ensures that optimizer state and training progress are
restored correctly.
model_args¶
Dictionary with model-specific training parameters. The available keys vary by architecture. The table lists the most commonly tuned options:
| Key | Type | Description |
|---|---|---|
| `lr` | `float` | Base learning rate. |
| `backbone_weights` | `str` | Path to backbone weights to load. |
| `metric_log_classwise` | `bool` | Whether to log class-wise metrics. |
lr¶
Base learning rate for the optimizer. All models come with a good default learning rate, which is automatically scaled based on the global batch size, so it rarely needs manual adjustment. To find the optimal learning rate for your dataset, we recommend performing a learning rate sweep, increasing and decreasing the learning rate by factors of 3x.
import lightly_train
lightly_train.train_object_detection(
...,
model_args={
"lr": 0.0001,
},
)
backbone_weights¶
Path to a checkpoint or exported model containing backbone weights to load before training. This enables loading custom pretrained weights. See Pretrain & Distill for more details on pretraining on unlabeled data.
import lightly_train
lightly_train.train_object_detection(
...,
model="dinov3/vitt16-ltdetr", # Model without fine-tuned weights.
model_args={
"backbone_weights": "/path/to/backbone_weights.ckpt",
},
)
The backbone weights argument is ignored when loading an existing checkpoint via the
model argument:
import lightly_train
lightly_train.train_object_detection(
...,
model="out/my_experiment/checkpoints/last.ckpt", # Loads full checkpoint including backbone weights.
model_args={
"backbone_weights": "/path/to/backbone_weights.ckpt", # Ignored when loading from checkpoint.
},
)
Similarly, the backbone weights argument is also ignored when loading one of the built-in fine-tuned models:
import lightly_train
lightly_train.train_object_detection(
...,
model="dinov3/vitt16-ltdetr-coco", # Loads built-in fine-tuned model.
model_args={
"backbone_weights": "/path/to/backbone_weights.ckpt", # Ignored when loading built-in model.
},
)
The backbone weights are only loaded when training starts from scratch using a model
identifier without a dataset suffix (e.g. -coco, -cityscapes, etc.).
metric_log_classwise¶
If set to True, class-wise metrics (for example AP per class) are logged during
validation. Default is False to reduce logging overhead. Not all models support this
feature.
import lightly_train
lightly_train.train_object_detection(
...,
model_args={
"metric_log_classwise": True,
},
)
Training Loop¶
steps¶
Total number of training steps. The default is "auto", which selects a model-dependent
value. Reduce for shorter training or increase to train longer.
Epoch-based training is currently not supported.
precision¶
Training precision setting. Must be one of the following strings:
"bf16-mixed": Default. Operations run in bfloat16 where supported, weights are saved in float32. Not supported on all hardware."16-true": All operations and weights are in float16. Fastest but may be unstable depending on model, hardware, and dataset."16-mixed": Most operations run in float16 precision. Not supported on all hardware."32-true": All operations and weights are in float32. Slower but more stable. Supported on all hardware.
seed¶
Controls reproducibility for data order, augmentation randomness, and initialization.
Set to None to use a random seed on each run. Default is 0.
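For example, to fix the seed for a reproducible run (the value is illustrative):

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    seed=42,  # Any fixed integer gives reproducible runs; None picks a random seed.
)
```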
Hardware¶
devices¶
Number of devices (CPUs/GPUs) to use for training. Accepts an integer (number of
devices), an explicit list of device indices, or a string with device ids such as
"1,2,3".
accelerator¶
Type of hardware accelerator to use. Valid options are "cpu", "gpu", "mps", or
"auto". "auto" selects the best available accelerator on the system.
num_nodes¶
Number of nodes for distributed training. By default, a single node is used. We recommend keeping this at 1.
strategy¶
Distributed training strategy, for example "ddp" or "fsdp". By default, this is set
to "auto", which selects a suitable strategy based on the chosen accelerator and
number of devices. We recommend to keep this at "auto" unless you have specific
requirements.
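Putting the hardware settings together, a two-GPU single-node run could be configured as follows (the device count is illustrative):

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    accelerator="gpu",  # Use GPUs; "auto" picks the best available backend.
    devices=2,          # Two devices; also accepts a list like [0, 1] or a string "0,1".
    num_nodes=1,        # Single node (recommended).
    strategy="auto",    # Let LightlyTrain choose the distributed strategy.
)
```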
Resume Training¶
There are two ways to continue training from a previous run:

1. Resume an interrupted/crashed run and finish training with the same parameters.
   - You CANNOT change any training parameters (including `steps`).
   - You CANNOT change the `out` directory.
   - You CANNOT change the dataset.

   This restores the exact training state, including optimizer parameters and current step.
2. Load a checkpoint from a previous run and fine-tune with different parameters.
   - You CAN change training parameters.
   - You MUST specify a new `out` directory.
   - You CAN change the dataset.

   This initializes model weights from the checkpoint but starts a fresh training state.
resume_interrupted¶
Use when a run terminates unexpectedly and you want to continue from the latest
checkpoint stored in out/checkpoints/last.ckpt. Do not modify any other training
arguments! This will restore the exact training state, including optimizer parameters,
current step, and any learning rate or other schedules from the previous run. The flag
is intended for crash recovery only. See Load Checkpoint for a New Run for
continuing training with different parameters, for example to train for more steps.
import lightly_train
lightly_train.train_object_detection(
out="out/my_experiment", # Same output directory as the interrupted run.
resume_interrupted=True, # Resume from last.ckpt in out directory.
)
Load Checkpoint for a New Run¶
To continue training from a previous run but change training parameters (for example to
train for more steps), set the model argument to the path of an exported model from a
previous run and specify a new out directory. This way, training starts fresh but
initializes weights from the provided checkpoint.
We recommend using the exported best model weights from
out/my_experiment/exported_models/exported_best.pt for this purpose.
See resume_interrupted if you want to recover from a crashed
run instead.
import lightly_train
lightly_train.train_object_detection(
out="out/my_new_experiment", # New output directory for new run.
model="out/my_experiment/exported_models/exported_best.pt", # Load model from previous run.
steps=2000, # Change training parameters as needed.
)
Checkpoint Saving¶
LightlyTrain saves two types of checkpoints during training:
- `out/my_experiment/checkpoints`: Full checkpoints including optimizer, scheduler, and training state. Used to resume training with `resume_interrupted`.
  - `last.ckpt`: Latest checkpoint saved at regular intervals.
  - `best.ckpt`: Best-performing checkpoint based on a validation metric.
- `out/my_experiment/exported_models`: Lightweight exported models containing only model weights. Used for inference and any further fine-tuning.
  - `exported_last.pt`: Model weights from the latest checkpoint.
  - `exported_best.pt`: Model weights from the best-performing checkpoint.
Use the exported models in out/my_experiment/exported_models/ for any downstream tasks
whenever training completes successfully. Use out/my_experiment/checkpoints/ only to
resume training with resume_interrupted after an unexpected
interruption.
save_checkpoint_args¶
Settings to configure checkpoint saving behavior. By default, LightlyTrain saves
last.ckpt and best.ckpt while tracking a validation metric defined by the selected
model.
| Key | Type | Description |
|---|---|---|
| `save_every_num_steps` | `int` | Training step interval for saving checkpoints. |
| `save_last` | `bool` | Persist `last.ckpt` and `exported_last.pt` at each save interval. |
| `save_best` | `bool` | Track the best-performing checkpoint according to `watch_metric`. |
| `watch_metric` | `str` | Validation metric name (for example `"val_metric/map"`). |
| `mode` | `str` | Operation used when selecting the best checkpoint based on `watch_metric` (`"min"` or `"max"`). |
save_every_num_steps¶
Number of training steps between each checkpoint save. Default is 1000. Decrease to
save more frequently. Too frequent saving may slow down training.
import lightly_train
lightly_train.train_object_detection(
...,
save_checkpoint_args={
"save_every_num_steps": 500, # Save checkpoint every 500 steps.
},
)
save_last¶
If set to True, the latest checkpoint and exported model (last.ckpt and
exported_last.pt) are saved at each save interval. Default is True. Disable only
when storage space is limited.
import lightly_train
lightly_train.train_object_detection(
...,
save_checkpoint_args={
"save_last": False, # Disable saving last.ckpt
},
)
save_best¶
If set to True, the best-performing checkpoint and exported model (best.ckpt and
exported_best.pt) are tracked and saved based on the validation metric defined by
watch_metric. Default is True.
import lightly_train
lightly_train.train_object_detection(
...,
save_checkpoint_args={
"save_best": False, # Disable saving best.ckpt
},
)
watch_metric¶
Validation metric used to determine the best checkpoint when save_best
is True. The default metric depends on the selected model.
Default metrics:
- Object Detection: `"val_metric/map"` (Mean Average Precision)
- Instance Segmentation: `"val_metric/map"` (Mean Average Precision)
- Panoptic Segmentation: `"val_metric/pq"` (Panoptic Quality)
- Semantic Segmentation: `"val_metric/miou"` (Mean Intersection over Union)
Check the logs for all available validation metrics for your task and model. See also
metric_log_classwise to enable class-wise metric logging.
import lightly_train
lightly_train.train_object_detection(
...,
save_checkpoint_args={
"watch_metric": "val_metric/map", # Use mAP as the best-checkpoint metric.
"mode": "max", # Higher is better for mAP. Set to "min" for metrics where lower is better.
},
)
mode¶
Operation used when selecting the best checkpoint based on
watch_metric. Must be either "min" (lower is better) or "max"
(higher is better). Default depends on the selected watch_metric.
Logging¶
logger_args¶
Dictionary to configure logging behavior. By default, LightlyTrain uses the built-in TensorBoard logger. You can customize logging frequency and enable/disable additional loggers like MLflow and Weights & Biases.
| Key | Type | Description |
|---|---|---|
| `mlflow` | `dict` | MLflow logger configuration. Disabled by default. |
| `wandb` | `dict` | Weights & Biases logger configuration. Disabled by default. |
| `tensorboard` | `dict` or `None` | TensorBoard logger configuration. Set to `None` to disable. |
| `log_every_num_steps` | `int` | Training step interval for logging training metrics. |
| `val_every_num_steps` | `int` | Training step interval that triggers a validation run. |
| `val_log_every_num_steps` | `int` | Validation step interval for logging validation metrics. |
mlflow¶
MLflow logger configuration. It is disabled by default. Requires MLflow to be installed with:
pip install "lightly-train[mlflow]"
import lightly_train
lightly_train.train_object_detection(
...,
logger_args={
"mlflow": {
# Optional experiment name.
"experiment_name": "my_experiment",
# Optional custom run name.
"run_name": "my_run",
# Optional tags dictionary.
"tags": {"team": "research"},
# Optional address of local or remote tracking server, e.g. "http://localhost:5000"
"tracking_uri": "tracking_uri",
# Enable checkpoint uploading to MLflow. (default: False)
"log_model": True,
# Optional string to put at the beginning of metric keys.
"prefix": "",
# Optional location where artifacts are stored.
"artifact_location": "./mlruns",
},
},
)
See the PyTorch Lightning MLflow Logger documentation for details on the available configuration options.
wandb¶
Weights & Biases logger configuration. It is disabled by default. Requires Weights & Biases to be installed with:
pip install "lightly-train[wandb]"
import lightly_train
lightly_train.train_object_detection(
...,
logger_args={
"wandb": {
# Optional display name for the run.
"name": "my_run",
# Optional project name.
"project": "my_project",
# Optional version, mainly used to resume a previous run.
"version": "my_version",
# Optional, upload model checkpoints as artifacts. (default: False)
"log_model": False,
# Optional name for uploaded checkpoints. (default: None)
"checkpoint_name": "checkpoint.ckpt",
# Optional, run offline without syncing to the W&B server. (default: False)
"offline": False,
# Optional, configure anonymous logging. (default: False)
"anonymous": False,
# Optional string to put at the beginning of metric keys.
"prefix": "",
},
},
)
See the PyTorch Lightning Weights & Biases Logger documentation for details on the available configuration options.
tensorboard¶
Configuration for the built-in TensorBoard logger. TensorBoard logs are by default enabled and automatically saved to the output directory. Run TensorBoard in a new terminal to visualize the training progress:
tensorboard --logdir out/my_experiment
Disable TensorBoard logging by setting this argument to None:
import lightly_train
lightly_train.train_object_detection(
...,
logger_args={
"tensorboard": None, # Disable TensorBoard logging.
},
)
log_every_num_steps¶
Controls how frequently training metrics are written. Defaults to "auto", which
selects a value based on the dataset size. Decrease the value for more frequent
updates.
import lightly_train
lightly_train.train_object_detection(
...,
logger_args={
"log_every_num_steps": 50, # Log every 50 training steps.
},
)
val_every_num_steps¶
Controls how frequently validation is performed during training. Defaults to "auto",
which selects a value based on the dataset size and validates at least once every
1000 steps. Decrease the value to validate more often.
import lightly_train
lightly_train.train_object_detection(
...,
logger_args={
"val_every_num_steps": 500, # Validate every 500 training steps.
},
)
val_log_every_num_steps¶
Controls how frequently progress is logged during validation runs. Defaults to "auto",
which selects a value based on the dataset size.
import lightly_train
lightly_train.train_object_detection(
...,
logger_args={
"val_log_every_num_steps": 20, # Log every 20 validation steps.
},
)
Transforms¶
LightlyTrain automatically applies suitable data augmentations and preprocessing steps
for each model and task. The default transforms are designed to work well in most
scenarios. You can customize transform parameters via the
transform_args setting.
transform_args¶
Dictionary to configure data transforms applied during training. The most commonly customized parameters are listed in the table below:
| Key | Type | Description |
|---|---|---|
| `image_size` | `tuple` | Image height and width after random cropping and resize. |
| `normalize` | `dict` | Mean and standard deviation used for input normalization. |
| `random_flip` | `dict` | Horizontal or vertical flip probabilities. |
| `random_rotate` | `dict` | Rotation angle range and probability. |
| `random_rotate_90` | `dict` | 90-degree rotation probability. |
| `color_jitter` | `dict` | Strength of color jitter augmentation. |
| `channel_drop` | `dict` | Channel dropping configuration for multi-channel datasets. |
| `val` | `dict` | Validation transform configuration. |
Check the respective task pages for the default transforms applied to each task.
image_size¶
Tuple specifying the height and width of input images after cropping and resizing. The default size depends on the selected model. Increase for higher-resolution inputs or decrease to speed up training. Not all image sizes are supported by all models.
import lightly_train
lightly_train.train_object_detection(
...,
transform_args={
"image_size": (512, 512), # Random crop and resize images to (height, width)
},
)
normalize¶
Dictionary specifying the mean and standard deviation used for input normalization. ImageNet statistics are used by default. Change these values when working with datasets that have different color distributions.
import lightly_train
lightly_train.train_object_detection(
...,
transform_args={
"normalize": {
"mean": [0.485, 0.456, 0.406],
"std": [0.229, 0.224, 0.225],
},
},
)
random_flip¶
Dictionary to configure random flipping augmentation. By default, horizontal flipping is
applied with a probability of 0.5 and vertical flipping is disabled. Adjust the
probabilities as needed.
import lightly_train
lightly_train.train_object_detection(
...,
transform_args={
"random_flip": {
"horizontal_prob": 0.7, # 70% chance to flip horizontally
"vertical_prob": 0.2, # 20% chance to flip vertically
},
},
)
random_rotate¶
Dictionary to configure random rotation augmentation. By default, rotation is disabled. Specify the maximum rotation angle and probability to enable.
import lightly_train
lightly_train.train_object_detection(
...,
transform_args={
"random_rotate": {
"prob": 0.5, # 50% chance to apply rotation
"degrees": (-30, 30), # Rotate between -30 and +30 degrees
},
},
)
random_rotate_90¶
Dictionary to configure random rotation by a multiple of 90 degrees. By default, this is disabled. Specify the probability to enable.
import lightly_train
lightly_train.train_object_detection(
...,
transform_args={
"random_rotate_90": {
"prob": 0.3, # 30% chance to rotate by 90/180/270 degrees
},
},
)
color_jitter¶
Dictionary to configure color jitter augmentation. By default, color jitter is disabled. Not all models support color jitter augmentation.
import lightly_train
lightly_train.train_object_detection(
...,
transform_args={
"color_jitter": {
"prob": 0.8, # 80% chance to apply color jitter
"strength": 2.0, # Strength of color jitter. Multiplied with the individual parameters below.
"brightness": 0.4,
"contrast": 0.4,
"saturation": 0.4,
"hue": 0.1,
},
},
)
channel_drop¶
Dictionary to configure channel dropping augmentation for multi-channel datasets. It
randomly drops channels until only a specified number of channels remain. Useful for
training models on datasets with varying channel availability. Requires
LIGHTLY_TRAIN_IMAGE_MODE="UNCHANGED" to be set in the environment. See
Single- and Multi-Channel Images for details.
import lightly_train
lightly_train.train_object_detection(
...,
transform_args={
"channel_drop": {
"num_channels_keep": 3, # Number of channels to keep
"weight_drop": [1.0, 1.0, 0.0, 0.0], # Drop channels 1 and 2 with equal probability. Don't drop channels 3 and 4.
},
},
)
val¶
Dictionary to configure validation transforms. Can be used to override validation transforms separately from training transforms. By default, validation transforms use the same image size and normalization as training transforms, but disable other augmentations.
import lightly_train
lightly_train.train_object_detection(
...,
transform_args={
"image_size": (518, 518), # Resize training images to (height, width)
"val": {
"image_size": (512, 512), # Resize validation images to (height, width)
},
},
)