(train-settings)=

# Train Settings

This page covers the settings available for training tasks like object detection and segmentation in LightlyTrain. For settings related to pretraining and distillation, please refer to the [](pretrain-settings) page.

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| [`out`](#out) | `str`<br>`Path` | — | Output directory where checkpoints, logs, and exported models are written. |
| [`data`](#data) | `dict`<br>`str` | — | Dataset configuration dict, or path to a YAML file containing the dataset configuration. |
| [`model`](#model) | `str`<br>`Path` | — | Model identifier (e.g. `"dinov3/vitt16-ltdetr-coco"`) or path to a local checkpoint to fine-tune from. |
| [`model_args`](#model_args) | `dict` | `None` | Task/model-specific training hyperparameters. |
| [`steps`](#steps) | `int` | `"auto"` | Number of training steps. `"auto"` selects a model-dependent default. |
| [`precision`](#precision) | `str` | `"bf16-mixed"` | Numeric precision mode (e.g. `"16-true"`, `"32-true"`, `"bf16-mixed"`). |
| [`batch_size`](#batch_size) | `int` | `"auto"` | Global batch size across all devices. |
| [`num_workers`](#num_workers) | `int` | `"auto"` | DataLoader worker processes per device. `"auto"` chooses a value based on available CPU cores. |
| [`devices`](#devices) | `int`<br>`str`<br>`list[int]` | `"auto"` | Devices to use for training. `"auto"` selects all available devices for the chosen `accelerator`. |
| [`num_nodes`](#num_nodes) | `int` | `1` | Number of nodes for distributed training. |
| [`resume_interrupted`](#resume_interrupted) | `bool` | `False` | Resume an interrupted/crashed run from the same `out` directory, including optimizer state and current step. Do not change any training parameters when using this. |
| [`overwrite`](#overwrite) | `bool` | `False` | If `True`, overwrite the `out` directory if it already exists. |
| [`accelerator`](#accelerator) | `str` | `"auto"` | Hardware backend: `"cpu"`, `"gpu"`, `"mps"`, or `"auto"` to pick the best available. |
| [`strategy`](#strategy) | `str` | `"auto"` | Distributed training strategy (e.g. `"ddp"`). `"auto"` selects a suitable default. |
| [`seed`](#seed) | `int`<br>`None` | `0` | Random seed for reproducibility. Set to `None` to disable seeding. |
| [`logger_args`](#logger_args) | `dict` | `None` | Logger configuration dict. `None` uses defaults; keys configure or disable individual loggers. |
| [`transform_args`](#transform_args) | `dict` | `None` | Data transform configuration (e.g. image size, normalization). |
| [`save_checkpoint_args`](#save_checkpoint_args) | `dict` | `None` | Checkpoint saving configuration (e.g. save frequency). |

```{tip}
LightlyTrain automatically selects suitable default values based on the chosen model, dataset, and hardware. You only need to set parameters that you want to customize. Look for the `Resolved Args` dictionary in the `train.log` file in the output directory of your run to see the final settings that were applied. This will include any overrides, automatically resolved values, and model-specific settings that are not listed on this page.
```

(train-settings-output)=

## Output

### `out`

The output directory where the model checkpoints and logs are saved. Create a new directory for each run! LightlyTrain will raise an error if the output directory already exists unless the [`overwrite`](#overwrite) flag is set to `True`.

### `overwrite`

Set to `True` to overwrite the contents of an existing `out` directory. By default, LightlyTrain raises an error if the specified output directory already exists to prevent accidental data loss.

(train-settings-data)=

## Data

### `data`

Dataset configuration. You can either provide a dictionary with dataset parameters or a path to a YAML file containing the dataset configuration. Each task (detection, segmentation, etc.) has different dataset requirements. Refer to the task documentation for details on the expected dataset structure and configuration options.

- [Object Detection](object-detection-data)
- [Instance Segmentation](instance-segmentation-data)
- [Panoptic Segmentation](panoptic-segmentation-data)
- [Semantic Segmentation](semantic-segmentation-data)

### `batch_size`

Global batch size across all devices. The batch size per device/GPU is computed as `batch_size / (num_devices * num_nodes)`. By default, `batch_size` is set to `"auto"`, which selects a model-dependent default value.

### `num_workers`

Number of background worker processes per device used by the PyTorch DataLoader. By default, this is set to `"auto"`, which selects a value based on the number of available CPU cores.

(train-settings-model)=

## Model

### `model`

Model identifier (for example `"dinov3/vitt16-ltdetr-coco"`) or the path to a checkpoint or exported model file. LightlyTrain automatically downloads weights if needed.

To resume from a crashed or interrupted run, use the [`resume_interrupted`](#resume_interrupted) setting instead of pointing `model` to a previous checkpoint. This ensures that optimizer state and training progress are restored correctly.

### `model_args`

Dictionary with model-specific training parameters. The available keys vary by architecture. The table lists the most commonly tuned options:

| Key | Type | Description |
| --- | --- | --- |
| [`lr`](#lr) | `float` | Base learning rate. |
| [`backbone_weights`](#backbone_weights) | `Path`<br>`str`<br>`None` | Path to backbone weights to load. |
| [`metric_log_classwise`](#metric_log_classwise) | `bool` | Whether to log class-wise metrics. |

#### `lr`

Base learning rate for the optimizer. All models come with a good default learning rate. The learning rate is automatically scaled based on the global batch size, so it does not have to be adjusted manually in most cases. To find the optimal learning rate for your dataset, we recommend performing learning rate sweeps, increasing and decreasing the learning rate by a factor of 3.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    model_args={
        "lr": 0.0001,
    },
)
```

#### `backbone_weights`

Path to a checkpoint or exported model containing backbone weights to load before training. This enables loading custom pretrained weights. See [](pretrain-distill) for more details on pretraining on unlabeled data.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    model="dinov3/vitt16-ltdetr",  # Model without fine-tuned weights.
    model_args={
        "backbone_weights": "/path/to/backbone_weights.ckpt",
    },
)
```

The backbone weights argument is ignored when loading an existing checkpoint via the [`model`](#model) argument:

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    model="out/my_experiment/checkpoints/last.ckpt",  # Loads full checkpoint including backbone weights.
    model_args={
        "backbone_weights": "/path/to/backbone_weights.ckpt",  # Ignored when loading from checkpoint.
    },
)
```

Similarly, the backbone weights argument is also ignored when loading one of the built-in fine-tuned models:

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    model="dinov3/vitt16-ltdetr-coco",  # Loads built-in fine-tuned model.
    model_args={
        "backbone_weights": "/path/to/backbone_weights.ckpt",  # Ignored when loading built-in model.
    },
)
```

The backbone weights are only loaded when training starts from scratch using a model identifier without a dataset suffix (e.g.
`-coco`, `-cityscapes`, etc.).

#### `metric_log_classwise`

If set to `True`, class-wise metrics (for example AP per class) are logged during validation. Default is `False` to reduce logging overhead. Not all models support this feature.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    model_args={
        "metric_log_classwise": True,
    },
)
```

(train-settings-training-loop)=

## Training Loop

### `steps`

Total number of training steps. The default is `"auto"`, which selects a model-dependent value. Reduce it for shorter training or increase it to train longer. Epoch-based training is currently not supported.

### `precision`

Training precision setting. Must be one of the following strings:

- `"bf16-mixed"`: Default. Operations run in bfloat16 where supported, weights are saved in float32. Not supported on all hardware.
- `"16-true"`: All operations and weights are in float16. Fastest but may be unstable depending on model, hardware, and dataset.
- `"16-mixed"`: Most operations run in float16 precision. Not supported on all hardware.
- `"32-true"`: All operations and weights are in float32. Slower but more stable. Supported on all hardware.

### `seed`

Controls reproducibility for data order, augmentation randomness, and initialization. Set to `None` to use a random seed on each run. Default is `0`.

(train-settings-hardware)=

## Hardware

### `devices`

Number of devices (CPUs/GPUs) to use for training. Accepts an integer (number of devices), an explicit list of device indices, or a string with device ids such as `"1,2,3"`.

### `accelerator`

Type of hardware accelerator to use. Valid options are `"cpu"`, `"gpu"`, `"mps"`, or `"auto"`. `"auto"` selects the best available accelerator on the system.

### `num_nodes`

Number of nodes for distributed training. By default a single node is used. We recommend keeping this at `1`.

### `strategy`

Distributed training strategy, for example `"ddp"` or `"fsdp"`.
By default, this is set to `"auto"`, which selects a suitable strategy based on the chosen accelerator and number of devices. We recommend keeping this at `"auto"` unless you have specific requirements.

(train-settings-resume-training)=

## Resume Training

There are two ways to continue training from a previous run:

1. [Resume an interrupted/crashed run](#resume_interrupted) and finish training with the same parameters.
   - You **CANNOT** change any training parameters (including steps)!
   - You **CANNOT** change the `out` directory.
   - You **CANNOT** change the dataset.
   - This restores the exact training state, including optimizer parameters and current step.
1. [Load a checkpoint from a previous run](#load-checkpoint-for-a-new-run) and fine-tune with different parameters.
   - You **CAN** change training parameters.
   - You **MUST** specify a new `out` directory.
   - You **CAN** change the dataset.
   - This initializes model weights from the checkpoint but starts a fresh training state.

### `resume_interrupted`

Use when a run terminates unexpectedly and you want to continue from the latest checkpoint stored in `out/checkpoints/last.ckpt`. Do not modify any other training arguments! This will restore the exact training state, including optimizer parameters, current step, and any learning rate or other schedules from the previous run. The flag is intended for crash recovery only. See [](#load-checkpoint-for-a-new-run) for continuing training with different parameters, for example to train for more steps.

```python
import lightly_train

lightly_train.train_object_detection(
    out="out/my_experiment",  # Same output directory as the interrupted run.
    resume_interrupted=True,  # Resume from last.ckpt in out directory.
)
```

### Load Checkpoint for a New Run

To continue training from a previous run but change training parameters (for example to train for more steps), set the `model` argument to the path of an exported model from a previous run and specify a new `out` directory.
This way, training starts fresh but initializes weights from the provided checkpoint. We recommend using the exported best model weights from `out/my_experiment/exported_models/exported_best.pt` for this purpose. See [`resume_interrupted`](#resume_interrupted) if you want to recover from a crashed run instead.

```python
import lightly_train

lightly_train.train_object_detection(
    out="out/my_new_experiment",  # New output directory for new run.
    model="out/my_experiment/exported_models/exported_best.pt",  # Load model from previous run.
    steps=2000,  # Change training parameters as needed.
)
```

(train-settings-checkpoint-saving)=

## Checkpoint Saving

LightlyTrain saves two types of checkpoints during training:

1. `out/my_experiment/checkpoints`: Full checkpoints including optimizer, scheduler, and training state. Used to resume training with [`resume_interrupted`](#resume_interrupted).
   - `last.ckpt`: Latest checkpoint saved at regular intervals.
   - `best.ckpt`: Best-performing checkpoint based on a validation metric.
1. `out/my_experiment/exported_models`: Lightweight exported models containing only model weights. Used for inference and any further fine-tuning.
   - `exported_last.pt`: Model weights from the latest checkpoint.
   - `exported_best.pt`: Model weights from the best-performing checkpoint.

Use the exported models in `out/my_experiment/exported_models/` for any downstream tasks whenever training completed successfully. Use `out/my_experiment/checkpoints/` only to resume training with [`resume_interrupted`](#resume_interrupted) after an unexpected interruption.

### `save_checkpoint_args`

Settings to configure checkpoint saving behavior. By default, LightlyTrain saves `last.ckpt` and `best.ckpt` while tracking a validation metric defined by the selected model.
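
The directory layout described in this section can be sketched with `pathlib`. The helper name `run_artifacts` is hypothetical and not part of the LightlyTrain API; it only assembles the file names listed above for a given `out` directory, which can be handy for checking which files a run has produced.

```python
from pathlib import Path
from typing import Dict


def run_artifacts(out: str) -> Dict[str, Path]:
    """Return the checkpoint and exported-model paths named in this section.

    Hypothetical helper for illustration only; not part of LightlyTrain.
    """
    out_dir = Path(out)
    return {
        # Full checkpoints: intended for resume_interrupted only.
        "last_ckpt": out_dir / "checkpoints" / "last.ckpt",
        "best_ckpt": out_dir / "checkpoints" / "best.ckpt",
        # Exported weights: intended for inference and further fine-tuning.
        "exported_last": out_dir / "exported_models" / "exported_last.pt",
        "exported_best": out_dir / "exported_models" / "exported_best.pt",
    }


artifacts = run_artifacts("out/my_experiment")
for name, path in artifacts.items():
    # Report each expected file and whether it is present on disk.
    print(f"{name}: {path.as_posix()} (exists: {path.exists()})")
```

A path such as `artifacts["exported_best"]` is the one you would typically pass as `model` when starting a new run from previous weights.
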

| Key | Type | Description |
| --- | --- | --- |
| [`save_every_num_steps`](#save_every_num_steps) | `int` | Training step interval for saving checkpoints. |
| [`save_last`](#save_last) | `bool` | Persist `last.ckpt` after each save cycle. Disable only when storage is constrained. |
| [`save_best`](#save_best) | `bool` | Track the best-performing checkpoint according to [`watch_metric`](#watch_metric). |
| [`watch_metric`](#watch_metric) | `str` | Validation metric name (for example `"val_metric/map"`) monitored when selecting the best checkpoint. |
| [`mode`](#mode) | `"min"`<br>`"max"` | Operation used when selecting the best checkpoint based on [`watch_metric`](#watch_metric). |

#### `save_every_num_steps`

Number of training steps between each checkpoint save. Default is `1000`. Decrease to save more frequently. Too frequent saving may slow down training.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    save_checkpoint_args={
        "save_every_num_steps": 500,  # Save checkpoint every 500 steps.
    },
)
```

#### `save_last`

If set to `True`, the latest checkpoint and exported model (`last.ckpt` and `exported_last.pt`) are saved at each save interval. Default is `True`. Disable only when storage space is limited.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    save_checkpoint_args={
        "save_last": False,  # Disable saving last.ckpt
    },
)
```

#### `save_best`

If set to `True`, the best-performing checkpoint and exported model (`best.ckpt` and `exported_best.pt`) are tracked and saved based on the validation metric defined by [`watch_metric`](#watch_metric). Default is `True`.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    save_checkpoint_args={
        "save_best": False,  # Disable saving best.ckpt
    },
)
```

#### `watch_metric`

Validation metric used to determine the best checkpoint when [`save_best`](#save_best) is `True`. The default metric depends on the selected model.

Default metrics:

- Object Detection: `"val_metric/map"` (Mean Average Precision)
- Instance Segmentation: `"val_metric/map"` (Mean Average Precision)
- Panoptic Segmentation: `"val_metric/pq"` (Panoptic Quality)
- Semantic Segmentation: `"val_metric/miou"` (Mean Intersection over Union)

Check the logs for all available validation metrics for your task and model. See also [`metric_log_classwise`](#metric_log_classwise) to enable class-wise metric logging.
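
To illustrate how the watched metric and the [`mode`](#mode) setting interact, here is a minimal sketch of best-checkpoint selection in plain Python. The function `is_better` and the metric values are made up for illustration; this is not LightlyTrain's implementation, only the comparison rule that `"min"` and `"max"` imply.

```python
from typing import Optional


def is_better(new_value: float, best_value: Optional[float], mode: str) -> bool:
    """Return True if new_value should replace best_value under the given mode.

    Hypothetical helper for illustration only; not part of LightlyTrain.
    """
    if mode not in ("min", "max"):
        raise ValueError(f"mode must be 'min' or 'max', got {mode!r}")
    if best_value is None:
        # The first validation result is always the best so far.
        return True
    if mode == "max":
        return new_value > best_value
    return new_value < best_value


# Made-up mAP values from four successive validation runs.
history = [0.31, 0.42, 0.40, 0.45]
best: Optional[float] = None
best_index: Optional[int] = None
for index, value in enumerate(history):
    if is_better(value, best, mode="max"):
        best, best_index = value, index

print(best_index, best)  # prints: 3 0.45
```

With `mode="min"` the same loop would instead keep the run with the lowest value, which is the appropriate choice for loss-like metrics.
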
```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    save_checkpoint_args={
        "watch_metric": "val_metric/map",  # Use mAP as the best-checkpoint metric.
        "mode": "max",  # Higher is better for mAP. Set to "min" for metrics where lower is better.
    },
)
```

#### `mode`

Operation used when selecting the best checkpoint based on [`watch_metric`](#watch_metric). Must be either `"min"` (lower is better) or `"max"` (higher is better). Default depends on the selected [`watch_metric`](#watch_metric).

(train-settings-logging)=

## Logging

### `logger_args`

Dictionary to configure logging behavior. By default, LightlyTrain uses the built-in TensorBoard logger. You can customize logging frequency and enable/disable additional loggers like MLflow and Weights & Biases.

| Key | Type | Description |
| --- | --- | --- |
| [`mlflow`](#mlflow) | `dict`<br>`None` | MLflow logger configuration. Disabled by default. |
| [`wandb`](#wandb) | `dict`<br>`None` | Weights & Biases logger configuration. Disabled by default. |
| [`tensorboard`](#tensorboard) | `dict`<br>`None` | TensorBoard logger configuration. Set to `None` to disable. |
| [`log_every_num_steps`](#log_every_num_steps) | `int`<br>`"auto"` | Training step interval for logging training metrics. |
| [`val_every_num_steps`](#val_every_num_steps) | `int`<br>`"auto"` | Training step interval that triggers a validation run. |
| [`val_log_every_num_steps`](#val_log_every_num_steps) | `int`<br>`"auto"` | Validation step interval for logging validation metrics. |

#### `mlflow`

MLflow logger configuration. It is disabled by default. Requires MLflow to be installed with:

```bash
pip install "lightly-train[mlflow]"
```

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    logger_args={
        "mlflow": {
            # Optional experiment name.
            "experiment_name": "my_experiment",
            # Optional custom run name.
            "run_name": "my_run",
            # Optional tags dictionary.
            "tags": {"team": "research"},
            # Optional address of local or remote tracking server, e.g. "http://localhost:5000"
            "tracking_uri": "tracking_uri",
            # Enable checkpoint uploading to MLflow. (default: False)
            "log_model": True,
            # Optional string to put at the beginning of metric keys.
            "prefix": "",
            # Optional location where artifacts are stored.
            "artifact_location": "./mlruns",
        },
    },
)
```

See the [PyTorch Lightning MLflow Logger documentation](https://lightning.ai/docs/pytorch/stable/extensions/generated/lightning.pytorch.loggers.MLFlowLogger.html#mlflowlogger) for details on the available configuration options.

#### `wandb`

Weights & Biases logger configuration. It is disabled by default. Requires Weights & Biases to be installed with:

```bash
pip install "lightly-train[wandb]"
```

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    logger_args={
        "wandb": {
            # Optional display name for the run.
            "name": "my_run",
            # Optional project name.
            "project": "my_project",
            # Optional version, mainly used to resume a previous run.
            "version": "my_version",
            # Optional, upload model checkpoints as artifacts. (default: False)
            "log_model": False,
            # Optional name for uploaded checkpoints. (default: None)
            "checkpoint_name": "checkpoint.ckpt",
            # Optional, run offline without syncing to the W&B server. (default: False)
            "offline": False,
            # Optional, configure anonymous logging. (default: False)
            "anonymous": False,
            # Optional string to put at the beginning of metric keys.
            "prefix": "",
        },
    },
)
```

See the [PyTorch Lightning Weights & Biases Logger documentation](https://lightning.ai/docs/pytorch/stable/extensions/generated/lightning.pytorch.loggers.WandbLogger.html#wandblogger) for details on the available configuration options.

#### `tensorboard`

Configuration for the built-in TensorBoard logger. TensorBoard logging is enabled by default and the logs are automatically saved to the output directory. Run TensorBoard in a new terminal to visualize the training progress:

```bash
tensorboard --logdir out/my_experiment
```

Disable TensorBoard logging by setting this argument to `None`:

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    logger_args={
        "tensorboard": None,  # Disable TensorBoard logging.
    },
)
```

#### `log_every_num_steps`

Controls how frequently training metrics are written. Defaults to `"auto"`, which selects a value based on the dataset size. Decrease the value for more frequent updates.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    logger_args={
        "log_every_num_steps": 50,  # Log every 50 training steps.
    },
)
```

#### `val_every_num_steps`

Controls how frequently validation is performed during training. Defaults to `"auto"`, which selects a value based on the dataset size and validates at least once every 1000 steps. Decrease the value to validate more often.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    logger_args={
        "val_every_num_steps": 500,  # Validate every 500 training steps.
    },
)
```

#### `val_log_every_num_steps`

Controls how frequently progress is logged during validation runs. Defaults to `"auto"`, which selects a value based on the dataset size.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    logger_args={
        "val_log_every_num_steps": 20,  # Log every 20 validation steps.
    },
)
```

(train-settings-transforms)=

## Transforms

LightlyTrain automatically applies suitable data augmentations and preprocessing steps for each model and task. The default transforms are designed to work well in most scenarios. You can customize transform parameters via the [`transform_args`](#transform_args) setting.

### `transform_args`

Dictionary to configure data transforms applied during training. The most commonly customized parameters are listed in the table below:

| Key | Type | Description |
| --- | --- | --- |
| [`image_size`](#image_size) | `tuple[int, int]` | Image height and width after random cropping and resize. |
| [`normalize`](#normalize) | `dict` | Mean and standard deviation used for input normalization. |
| [`random_flip`](#random_flip) | `dict` | Horizontal or vertical flip probabilities. |
| [`random_rotate`](#random_rotate) | `dict` | Rotation angle range and probability. |
| [`random_rotate_90`](#random_rotate_90) | `dict` | 90-degree rotation probability. |
| [`color_jitter`](#color_jitter) | `dict` | Strength of color jitter augmentation. |
| [`channel_drop`](#channel_drop) | `dict` | Channel dropping configuration for multi-channel datasets. |
| [`val`](#val) | `dict` | Validation transform configuration. |

Check the respective task pages for the default transforms applied:

- [Object Detection](object-detection-transform-args)
- [Instance Segmentation](instance-segmentation-transform-args)
- [Panoptic Segmentation](panoptic-segmentation-transform-args)
- [Semantic Segmentation](semantic-segmentation-transform-args)

#### `image_size`

Tuple specifying the height and width of input images after cropping and resizing. The default size depends on the selected model. Increase for higher-resolution inputs or decrease to speed up training. Not all image sizes are supported by all models.
```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    transform_args={
        "image_size": (512, 512),  # Random crop and resize images to (height, width)
    },
)
```

#### `normalize`

Dictionary specifying the mean and standard deviation used for input normalization. ImageNet statistics are used by default. Change these values when working with datasets that have different color distributions.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    transform_args={
        "normalize": {
            "mean": [0.485, 0.456, 0.406],
            "std": [0.229, 0.224, 0.225],
        },
    },
)
```

#### `random_flip`

Dictionary to configure random flipping augmentation. By default, horizontal flipping is applied with a probability of `0.5` and vertical flipping is disabled. Adjust the probabilities as needed.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    transform_args={
        "random_flip": {
            "horizontal_prob": 0.7,  # 70% chance to flip horizontally
            "vertical_prob": 0.2,  # 20% chance to flip vertically
        },
    },
)
```

#### `random_rotate`

Dictionary to configure random rotation augmentation. By default, rotation is disabled. Specify the rotation angle range and the probability to enable it.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    transform_args={
        "random_rotate": {
            "prob": 0.5,  # 50% chance to apply rotation
            "degrees": (-30, 30),  # Rotate between -30 and +30 degrees
        },
    },
)
```

#### `random_rotate_90`

Dictionary to configure random rotation by a multiple of 90 degrees. By default, this is disabled. Specify the probability to enable it.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    transform_args={
        "random_rotate_90": {
            "prob": 0.3,  # 30% chance to rotate by 90/180/270 degrees
        },
    },
)
```

#### `color_jitter`

Dictionary to configure color jitter augmentation. By default, color jitter is disabled. Not all models support color jitter augmentation.
```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    transform_args={
        "color_jitter": {
            "prob": 0.8,  # 80% chance to apply color jitter
            "strength": 2.0,  # Strength of color jitter. Multiplied with the individual parameters below.
            "brightness": 0.4,
            "contrast": 0.4,
            "saturation": 0.4,
            "hue": 0.1,
        },
    },
)
```

#### `channel_drop`

Dictionary to configure channel dropping augmentation for multi-channel datasets. It randomly drops channels until only a specified number of channels remain. Useful for training models on datasets with varying channel availability. Requires `LIGHTLY_TRAIN_IMAGE_MODE="UNCHANGED"` to be set in the environment. See [](multi-channel) for details.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    transform_args={
        "channel_drop": {
            "num_channels_keep": 3,  # Number of channels to keep
            "weight_drop": [1.0, 1.0, 0.0, 0.0],  # Drop channels 1 and 2 with equal probability. Don't drop channels 3 and 4.
        },
    },
)
```

#### `val`

Dictionary to configure validation transforms. Can be used to override validation transforms separately from training transforms. By default, validation transforms use the same image size and normalization as training transforms, but disable other augmentations.

```python
import lightly_train

lightly_train.train_object_detection(
    ...,
    transform_args={
        "image_size": (518, 518),  # Resize training images to (height, width)
        "val": {
            "image_size": (512, 512),  # Resize validation images to (height, width)
        },
    },
)
```

```{toctree}
---
hidden:
maxdepth: 1
---
self
pretrain_settings
```