Train
The train command is a simple interface to pretrain a large number of models using different SSL methods. An example command looks like this:
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        method="distillation",
        epochs=100,
        batch_size=128,
    )
lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet50" method="distillation" epochs=100 batch_size=128
Important
The default pretraining method distillation is recommended, as it consistently outperforms the other methods in extensive experiments. Batch sizes between 128 and 1536 strike a good balance between speed and performance. Moreover, long training runs, such as 2,000 epochs on COCO, significantly improve results. Check the Methods page for more details on why distillation is the best choice.
This will pretrain a ResNet-50 model from TorchVision using images from my_data_dir and the DINOv2 distillation pretraining method. All training logs, model exports, and checkpoints are saved to the output directory at out/my_experiment.
Tip
See lightly_train.train() for a complete list of available arguments.
Out
The out argument specifies the output directory where all training logs, model exports, and checkpoints are saved. It looks like this after training:
out/my_experiment
├── checkpoints
│   ├── epoch=99-step=123.ckpt                      # Intermediate checkpoint
│   └── last.ckpt                                   # Last checkpoint
├── events.out.tfevents.1721899772.host.1839736.0   # TensorBoard logs
├── exported_models
│   └── exported_last.pt                            # Final model exported
├── metrics.jsonl                                   # Training metrics
└── train.log                                       # Training logs
The final model checkpoint is saved to out/my_experiment/checkpoints/last.ckpt. The file out/my_experiment/exported_models/exported_last.pt contains the final model, exported in the default format (package_default) of the library used (see export format for more details).
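For TorchVision models, the exported file can typically be loaded as a plain state dict into the corresponding architecture. The snippet below is a minimal sketch under that assumption; consult the export format documentation for the exact format used by your library.

import torch
import torchvision

# Rebuild the architecture that was pretrained.
model = torchvision.models.resnet50()

# Load the exported weights (assumed here to be a plain state dict).
state_dict = torch.load(
    "out/my_experiment/exported_models/exported_last.pt", weights_only=True
)
model.load_state_dict(state_dict)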
Tip
Create a new output directory for each experiment to keep training logs, model exports, and checkpoints organized.
Data
LightlyTrain expects a folder containing images or a list of (possibly mixed) folders and image files. Any folder is traversed recursively, and all image files within it are included (even in nested subdirectories).
The following image formats are supported:
jpg
jpeg
png
ppm
bmp
pgm
tif
tiff
webp
Example of passing a single folder my_data_dir:
my_data_dir
├── dir0
│   ├── image0.jpg
│   └── image1.jpg
└── dir1
    └── image0.jpg
lightly_train.train(
    out="out/my_experiment",       # Output directory
    data="my_data_dir",            # Directory with images
    model="torchvision/resnet18",  # Model to train
)
lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet18"
Example of passing a (mixed) list of files and folders:
├── image2.jpg
├── image3.jpg
└── my_data_dir
    ├── dir0
    │   ├── image0.jpg
    │   └── image1.jpg
    └── dir1
        └── image0.jpg
lightly_train.train(
    out="out/my_experiment",                           # Output directory
    data=["image2.jpg", "image3.jpg", "my_data_dir"],  # List of files and folders
    model="torchvision/resnet18",                      # Model to train
)
lightly-train train out="out/my_experiment" data='["image2.jpg", "image3.jpg", "my_data_dir"]' model="torchvision/resnet18"
Model
See the Models page for a detailed list of all supported libraries, and their respective documentation pages for all supported models.
Method
See Methods for a list of all supported methods.
Loggers
Logging is configured with the loggers argument. The following loggers are supported:
jsonl: Logs training metrics to a .jsonl file (enabled by default)
mlflow: Logs training metrics to MLflow (disabled by default, requires MLflow to be installed)
tensorboard: Logs training metrics to TensorBoard (enabled by default, requires TensorBoard to be installed)
wandb: Logs training metrics to Weights & Biases (disabled by default, requires Weights & Biases to be installed)
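Several loggers can be configured in a single loggers dictionary. The following sketch combines options from the sections below (the project name is illustrative); loggers that are not mentioned presumably keep their default behavior.

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        loggers={
            "tensorboard": None,                 # Disable the TensorBoard logger
            "wandb": {"project": "my_project"},  # Enable the Weights & Biases logger
        },
    )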
JSONL
The JSONL logger is enabled by default and logs training metrics to a .jsonl file at out/my_experiment/metrics.jsonl.
Disable the JSONL logger with:
loggers={"jsonl": None}
loggers.jsonl=null
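The metrics file can be inspected with a few lines of Python. The following is a minimal sketch, assuming each line of metrics.jsonl is a standalone JSON object (the exact keys depend on the method and loggers used):

import json

# Read the logged metrics; each line is assumed to be one JSON object.
with open("out/my_experiment/metrics.jsonl") as f:
    metrics = [json.loads(line) for line in f if line.strip()]

# Inspect the keys of the last logged entry.
print(metrics[-1].keys())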
MLflow
Important
MLflow must be installed with pip install "lightly-train[mlflow]".
The mlflow logger can be configured with the following arguments:
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        loggers={
            "mlflow": {
                "experiment_name": "my_experiment",
                "run_name": "my_run",
                "tracking_uri": "tracking_uri",
                # "run_id": "my_run_id",  # Use if resuming a training with resume=True
                # "log_model": True,      # Currently not supported
            },
        },
    )
lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet50" loggers.mlflow.experiment_name="my_experiment" loggers.mlflow.run_name="my_run" loggers.mlflow.tracking_uri=tracking_uri
TensorBoard
TensorBoard logs are automatically saved to the output directory. Run TensorBoard in a new terminal to visualize the training progress:
tensorboard --logdir out/my_experiment
Disable the TensorBoard logger with:
loggers={"tensorboard": None}
loggers.tensorboard=null
Weights & Biases
Important
Weights & Biases must be installed with pip install "lightly-train[wandb]".
The Weights & Biases logger can be configured with the following arguments:
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",
        data="my_data_dir",
        model="torchvision/resnet50",
        loggers={
            "wandb": {
                "project": "my_project",
                "name": "my_experiment",
                "log_model": False,  # Set to True to upload model checkpoints
            },
        },
    )
lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet50" loggers.wandb.project="my_project" loggers.wandb.name="my_experiment" loggers.wandb.log_model=False
More configuration options are available through the Weights & Biases environment variables. See the Weights & Biases documentation for more information.
Disable the Weights & Biases logger with:
loggers={"wandb": None}
loggers.wandb=null
Resume Training
There are two distinct ways to continue training, depending on your intention.
Resume Interrupted Training
Use resume=True to resume a previously interrupted or crashed training run. This will pick up exactly where the training left off.
You must use the same out directory as the original run.
You must not change any training parameters (e.g., learning rate, batch size, data, etc.).
This is intended for continuing the same run without modification.
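For example, resuming the run from the first example might look like this (a sketch; all arguments must match the interrupted run):

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",       # Same output directory as the interrupted run
        data="my_data_dir",            # Same data as the interrupted run
        model="torchvision/resnet50",
        method="distillation",
        epochs=100,
        batch_size=128,
        resume=True,                   # Pick up exactly where training left off
    )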
Load Weights for a New Run
Use checkpoint="path/to/checkpoint.ckpt" to load model weights from a checkpoint, but start a new training run.
You are free to change training parameters.
This is useful for continuing training with a different setup.
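For example, starting a new run from a previous checkpoint might look like this (a sketch; the new output directory and changed parameters are illustrative):

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_new_experiment",                           # New output directory
        data="my_data_dir",
        model="torchvision/resnet50",
        method="distillation",
        checkpoint="out/my_experiment/checkpoints/last.ckpt",  # Initialize weights from a previous run
        epochs=50,                                             # Parameters may differ from the original run
        batch_size=256,
    )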
General Notes
Important
resume=True and checkpoint=... are mutually exclusive and cannot be used together.
If overwrite=True is set, training will start fresh, overwriting existing outputs or checkpoints in the specified output directory.
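For instance, restarting an experiment from scratch in an existing output directory might look like this (a sketch; note that previous outputs are discarded):

import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",       # Existing output directory
        data="my_data_dir",
        model="torchvision/resnet50",
        overwrite=True,                # Start fresh, overwriting previous outputs and checkpoints
    )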
Advanced Options
Input Image Resolution
The input image resolution can be set with the transform_args argument. By default a resolution of 224x224 pixels is used. A custom resolution can be set like this:
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",                    # Output directory
        data="my_data_dir",                         # Directory with images
        model="torchvision/resnet18",               # Model to train
        transform_args={"image_size": (448, 448)},  # (height, width)
    )
lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet18" transform_args.image_size="[448,448]"
Warning
Not all models support all image sizes.
Image Transforms
See Configuring Image Transforms on how to configure image transformations.
Method Arguments
Warning
In 99% of cases, it is not necessary to modify the default method arguments in LightlyTrain. The default settings are carefully tuned to work well for most use cases.
The method arguments can be set with the method_args argument:
import lightly_train

if __name__ == "__main__":
    lightly_train.train(
        out="out/my_experiment",       # Output directory
        data="my_data_dir",            # Directory with images
        model="torchvision/resnet18",  # Model to train
        method="distillation",         # Pretraining method
        method_args={                  # Override the default teacher model
            "teacher": "dinov2_vit/vitl14_pretrain",
        },
    )
lightly-train train out="out/my_experiment" data="my_data_dir" model="torchvision/resnet18" method="distillation" method_args.teacher="dinov2_vit/vitl14_pretrain"
Each pretraining method has its own set of arguments that can be configured. LightlyTrain provides sensible defaults that are adjusted depending on the dataset and model used. The defaults for each method are listed in the respective Methods documentation pages.
Performance Optimizations
For performance optimizations, e.g. using accelerators, multi-GPU, multi-node, and half precision training, see the performance page.