(object-detection)=
# Object Detection
```{note}
🔥 LightlyTrain now supports training **LT-DETR**: **DINOv3**- and **DINOv2**-based object detection models
with the super fast RT-DETR detection architecture! Our largest model achieves an mAP50:95 of 60.0 on the COCO validation set!
```
(object-detection-benchmark-results)=
## Benchmark Results
Below we provide the model checkpoints and report the validation mAP50:95 and inference latency of different DINOv3- and DINOv2-based models fine-tuned on the COCO dataset. See [here](object-detection-use-model-weights) for how to use these model checkpoints for further fine-tuning. The latency values were measured with TensorRT version `10.13.3.9` on an NVIDIA T4 GPU with batch size 1.
### COCO
| Implementation | Backbone Model | mAP50:95 | Latency (ms) | # Params (M) | Input Size | Checkpoint Name |
|:--------------:|:----------------------------:|:------------------:|:------------:|:------------:|:----------:|:---------------------------------:|
| LightlyTrain | dinov2/vits14-ltdetr | 55.7 | 16.87 | 55.3 | 644×644 | dinov2/vits14-noreg-ltdetr-coco |
| LightlyTrain | dinov3/convnext-tiny-ltdetr | 54.4 | 13.29 | 61.1 | 640×640 | dinov3/convnext-tiny-ltdetr-coco |
| LightlyTrain | dinov3/convnext-small-ltdetr | 56.9 | 17.65 | 82.7 | 640×640 | dinov3/convnext-small-ltdetr-coco |
| LightlyTrain | dinov3/convnext-base-ltdetr | 58.6 | 24.68 | 121.0 | 640×640 | dinov3/convnext-base-ltdetr-coco |
| LightlyTrain | dinov3/convnext-large-ltdetr | 60.0 | 42.30 | 230.0 | 640×640 | dinov3/convnext-large-ltdetr-coco |
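If you want a rough latency estimate on your own hardware, a minimal timing sketch like the one below can help. Note that this measures plain PyTorch inference through `predict` (including image loading), not the TensorRT setup used for the numbers above, so expect higher values; the warm-up and run counts are arbitrary choices.
```python
import time

import torch

import lightly_train

# Load one of the pre-trained checkpoints from the table above.
model = lightly_train.load_model("dinov3/convnext-tiny-ltdetr-coco")

image = "path/to/image.jpg"
for _ in range(10):  # Warm-up runs to exclude one-time setup costs.
    model.predict(image)

if torch.cuda.is_available():
    torch.cuda.synchronize()  # Make sure all GPU work finished before timing.
start = time.perf_counter()
num_runs = 100
for _ in range(num_runs):
    model.predict(image)
if torch.cuda.is_available():
    torch.cuda.synchronize()
print(f"Average latency: {(time.perf_counter() - start) / num_runs * 1000:.2f} ms")
```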
## Object Detection with LT-DETR
Training an object detection model with LightlyTrain is straightforward and only
requires a few lines of code. See [data](#object-detection-data) for details on how
to prepare your dataset.
### Train an Object Detection Model
```python
import lightly_train

if __name__ == "__main__":
    lightly_train.train_object_detection(
        out="out/my_experiment",
        model="dinov3/convnext-small-ltdetr-coco",
        data={
            "path": "base_path_to_your_dataset",
            "train": "images/train",
            "val": "images/val",
            "names": {
                0: "person",
                1: "bicycle",
                # ...
            },
        },
    )
```
During training, both the best (highest validation mAP50:95) and the last (most recent
validation round, as determined by `save_checkpoint_args.save_every_num_steps`) model
weights are exported to `out/my_experiment/exported_models/`, unless disabled in
`save_checkpoint_args`. You can use these weights to continue fine-tuning on another
task by passing their path via the `model` argument:
```python
import lightly_train

if __name__ == "__main__":
    lightly_train.train_object_detection(
        out="out/my_experiment_continued",  # Use a fresh output directory.
        model="out/my_experiment/exported_models/exported_best.pt",  # Continue from the best model.
        data={...},
    )
```
(object-detection-use-model-weights)=
### Load the Trained Model from Checkpoint and Predict
After training completes, you can load the best model checkpoint for inference like this:
```python
import lightly_train
model = lightly_train.load_model("out/my_experiment/exported_models/exported_best.pt")
results = model.predict("path/to/image.jpg")
```
Or use one of the pre-trained model weights directly from LightlyTrain:
```python
import lightly_train
model = lightly_train.load_model("dinov3/convnext-tiny-ltdetr-coco")
results = model.predict("path/to/image.jpg")
```
### Visualize the Result
After making predictions, you can visualize the predicted bounding boxes like this:
```python
import matplotlib.pyplot as plt
from torchvision import io, utils

import lightly_train

model = lightly_train.load_model("dinov3/convnext-tiny-ltdetr-coco")
labels, boxes, scores = model.predict("path/to/image.jpg").values()

# Draw the predicted boxes and class names onto the image.
image_with_boxes = utils.draw_bounding_boxes(
    image=io.read_image("path/to/image.jpg"),
    boxes=boxes,
    labels=[model.classes[i.item()] for i in labels],
)

# Save the visualization.
fig, ax = plt.subplots(figsize=(30, 30))
ax.imshow(image_with_boxes.permute(1, 2, 0))
fig.savefig("predictions.png")
```
The predicted boxes are in absolute (x_min, y_min, x_max, y_max) format, i.e. the coordinates are measured in pixels of the input image.
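If you need the predictions in the normalized (x_center, y_center, width, height) format used by the annotation files in the [data](#object-detection-data) section, a small conversion sketch could look like this; `xyxy_to_yolo` is a hypothetical helper, and `boxes` is the `(N, 4)` tensor returned by `predict`:
```python
import torch


def xyxy_to_yolo(boxes: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
    # Convert absolute (x_min, y_min, x_max, y_max) boxes to normalized
    # YOLO (x_center, y_center, width, height) format.
    x_min, y_min, x_max, y_max = boxes.unbind(dim=-1)
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return torch.stack([x_center, y_center, width, height], dim=-1)
```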
## Out
The `out` argument specifies the output directory where all training logs, model exports,
and checkpoints are saved. It looks like this after training:
```text
out/my_experiment
├── checkpoints
│   └── last.ckpt                                  # Last checkpoint
├── exported_models
│   ├── exported_best.pt                           # Best model exported (unless disabled)
│   └── exported_last.pt                           # Last model exported (unless disabled)
├── events.out.tfevents.1721899772.host.1839736.0  # TensorBoard logs
└── train.log                                      # Training logs
```
The final model checkpoint is saved to `out/my_experiment/checkpoints/last.ckpt`. The last and best model weights are exported to `out/my_experiment/exported_models/` unless disabled in `save_checkpoint_args`.
```{tip}
Create a new output directory for each experiment to keep training logs, model exports,
and checkpoints organized.
```
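One hypothetical way to follow this tip is to derive the output directory from a timestamp so that repeated runs never overwrite each other:
```python
from datetime import datetime

import lightly_train

if __name__ == "__main__":
    # Hypothetical naming scheme: one timestamped directory per run.
    out = f"out/my_experiment_{datetime.now():%Y%m%d_%H%M%S}"
    lightly_train.train_object_detection(
        out=out,
        model="dinov3/convnext-small-ltdetr-coco",
        data={...},  # Your data dictionary as shown above.
    )
```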
(object-detection-data)=
## Data
Lightly**Train** supports training object detection models with images and bounding boxes.
Every image must have a corresponding annotation file in [YOLO format](https://labelformat.com/formats/object-detection/yolov5/): a `.txt` file with one line per object, each containing the class ID followed by the four normalized bounding box coordinates (x_center, y_center, width, height). An example annotation file for an image with two objects could look like this:
```text
0 0.716797 0.395833 0.216406 0.147222
1 0.687500 0.379167 0.255208 0.175000
```
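As an illustration of the format, a minimal sketch for reading such a label file (`read_yolo_labels` is a hypothetical helper, not part of LightlyTrain) might look like this:
```python
def read_yolo_labels(path: str) -> list[tuple[int, float, float, float, float]]:
    # Read a YOLO-format label file into (class_id, x_center, y_center, width, height) tuples.
    labels = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # Skip empty lines.
            class_id, x_center, y_center, width, height = parts
            labels.append(
                (int(class_id), float(x_center), float(y_center), float(width), float(height))
            )
    return labels
```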
The following image formats are supported:
- jpg
- jpeg
- png
- ppm
- bmp
- pgm
- tif
- tiff
- webp
Your dataset directory should be organized like this:
```text
base_path_to_your_dataset/
├── images
│ ├── train
│ │ ├── image1.jpg
│ │ ├── image2.jpg
│ │ └── ...
│ └── val
│ ├── image1.jpg
│ ├── image2.jpg
│ └── ...
└── labels
├── train
│ ├── image1.txt
│ ├── image2.txt
│ └── ...
└── val
├── image1.txt
├── image2.txt
└── ...
```
Alternatively, the splits can also be at the top level:
```text
base_path_to_your_dataset/
├── train
│ ├── images
│ │ ├── image1.jpg
│ │ ├── image2.jpg
│ │ └── ...
│ └── labels
│ ├── image1.txt
│ ├── image2.txt
│ └── ...
└── val
├── images
│ ├── image1.jpg
│ ├── image2.jpg
│ └── ...
└── labels
├── image1.txt
├── image2.txt
└── ...
```
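To catch missing annotations before you start training, a small validation sketch for the first layout above (paths are placeholders) could pair every image with its label file:
```python
from pathlib import Path

# Hypothetical dataset root following the first layout above.
root = Path("base_path_to_your_dataset")
image_extensions = {".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp"}

for split in ("train", "val"):
    for image_path in sorted((root / "images" / split).iterdir()):
        if image_path.suffix.lower() not in image_extensions:
            continue  # Skip non-image files.
        label_path = root / "labels" / split / f"{image_path.stem}.txt"
        if not label_path.exists():
            print(f"Missing label file for {image_path}")
```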