Depth Estimation (NEW)¶

Note

LightlyTrain supports depth estimation inference with Depth Anything V2 and Depth Anything V3 models. Training support will be released soon!

LightlyTrain ports the Depth Anything V2 (DAv2) and V3 (DAv3) monocular depth estimation models. Both come in two flavors:

Relative depth predicts an unscaled depth map: it captures the ordering of the scene (what is closer and what is farther) but the values have no physical unit.
Metric depth predicts depth in meters, suitable for 3D reconstruction and any application that needs absolute scale.

Warning

The meaning of the predicted values is not the same across models. For DAv2 relative models, larger values are nearer. For DAv3 relative and for all metric models, larger values are farther.
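
If you mix models in one pipeline, it can help to bring every relative map into the same orientation first. The following is a minimal sketch, not part of LightlyTrain: `to_larger_is_farther` is a hypothetical helper that flips a "larger = nearer" map around its value range. It only reverses the ordering; it does not turn relative values into meters.

```python
import numpy as np


def to_larger_is_farther(depth: np.ndarray, larger_is_nearer: bool) -> np.ndarray:
    """Return a map in which larger values always mean farther.

    Hypothetical helper: for relative depth only the ordering is meaningful,
    so flipping around the value range is enough for display or ranking.
    """
    depth = np.asarray(depth, dtype=np.float32)
    if larger_is_nearer:
        return depth.max() - depth
    return depth


# Toy 2x2 map in the DAv2 relative convention: 9.0 is the nearest pixel.
dav2_like = np.array([[9.0, 4.0], [1.0, 0.0]])
flipped = to_larger_is_farther(dav2_like, larger_is_nearer=True)
```

For a tensor returned by predict, convert it first, for example with `depth.cpu().numpy()`.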

Models¶

All models use a DINOv2 ViT backbone.

Depth Anything V3¶

Model	Type	Backbone
`dinov2/dav3-relative-large`	Relative	ViT-L/14
`dinov2/dav3-metric-large`	Metric	ViT-L/14

Depth Anything V2¶

Model	Type	Backbone
`dinov2/dav2-relative-small`	Relative	ViT-S/14
`dinov2/dav2-relative-base`*	Relative	ViT-B/14
`dinov2/dav2-relative-large`*	Relative	ViT-L/14
`dinov2/dav2-metric-small-hypersim`	Metric	ViT-S/14
`dinov2/dav2-metric-base-hypersim`*	Metric	ViT-B/14
`dinov2/dav2-metric-large-hypersim`*	Metric	ViT-L/14
`dinov2/dav2-metric-small-vkitti`*	Metric	ViT-S/14
`dinov2/dav2-metric-base-vkitti`*	Metric	ViT-B/14
`dinov2/dav2-metric-large-vkitti`*	Metric	ViT-L/14

* Not hosted by LightlyTrain. See the note below.

Note

All Depth Anything V3 models are hosted by LightlyTrain and downloaded automatically by load_model. For Depth Anything V2, only the two small Apache-2.0 models (dinov2/dav2-relative-small and dinov2/dav2-metric-small-hypersim) are hosted. The remaining DAv2 models (marked with * above), i.e. the ViT-B/ViT-L variants and all VKITTI variants, are released under non-commercial licenses (CC-BY-NC-4.0 for the relative base/large and the Hypersim metric variants, CC-BY-NC-SA-3.0 for the VKITTI metric variants), so LightlyTrain does not host them. You can convert them from the official weights yourself; see Using Non-Hosted Depth Anything V2 Checkpoints.

Which model should I use?¶

Do you need depth in meters? If yes, pick a metric model. If you only need the relative ordering of the scene (closer vs. farther), pick a relative model: it is simpler to use and needs no camera information.
DAv3 or DAv2? DAv3 is the more recent model and generally the more accurate one. Choose DAv2 if you need a smaller and faster ViT-S or ViT-B model, or a model under a permissive license for commercial use: the two hosted DAv2 small models are Apache-2.0.
Which DAv2 metric model? The metric DAv2 models are trained per domain: use a hypersim model for indoor scenes (depth up to 20 m) and a vkitti model for outdoor driving scenes (depth up to 80 m).
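
The questions above can be written down as a small lookup. This is only an illustrative sketch: `choose_depth_model` and its parameters are hypothetical and not part of LightlyTrain, and the function covers just the large DAv3 and small DAv2 models from the tables. Check the license terms of the chosen model yourself.

```python
def choose_depth_model(
    need_meters: bool,
    use_dav2_small: bool = False,
    outdoor_driving: bool = False,
) -> str:
    """Map the selection questions to a model name (hypothetical helper)."""
    if not use_dav2_small:
        # DAv3: most recent models; the metric variant needs camera intrinsics.
        return "dinov2/dav3-metric-large" if need_meters else "dinov2/dav3-relative-large"
    if not need_meters:
        return "dinov2/dav2-relative-small"  # Hosted, Apache-2.0.
    if outdoor_driving:
        # Not hosted and non-commercial: must be converted from the official weights.
        return "dinov2/dav2-metric-small-vkitti"
    return "dinov2/dav2-metric-small-hypersim"  # Hosted, Apache-2.0, indoor scenes.


model_name = choose_depth_model(need_meters=True, use_dav2_small=True)
```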

Quick Start¶

Load a model and call predict on an image. The image can be a file path, a URL, a PIL image, or a (C, H, W) tensor. The result is a single (H, W) tensor with the same height and width as the input image.

import lightly_train

# Load a model hosted by LightlyTrain (downloaded and cached automatically).
model = lightly_train.load_model("dinov2/dav3-relative-large")

# Predict a relative-depth map. Returns a (H, W) tensor matching the input resolution.
depth = model.predict("image.jpg")

Tip

By default load_model runs on a GPU ("cuda" or "mps") if one is available and falls back to CPU otherwise. Pass device= to choose explicitly, e.g. lightly_train.load_model("dinov2/dav3-relative-large", device="cuda"). The ViT-L models used in these examples are large, so a GPU is recommended.

Visualize the Result¶

The depth map is a plain tensor, so you can colorize and display it with matplotlib:

import matplotlib.pyplot as plt
from PIL import Image

import lightly_train

model = lightly_train.load_model("dinov2/dav3-relative-large")
depth = model.predict("image.jpg")

# Plot the input image and the colorized depth map side by side, then save the figure.
image = Image.open("image.jpg")
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].imshow(image)
axes[0].set_title("Input")
axes[0].axis("off")
depth_vis = axes[1].imshow(depth.cpu(), cmap="Spectral_r")
axes[1].set_title("Relative depth (larger = farther)")
axes[1].axis("off")
fig.colorbar(depth_vis, ax=axes[1], fraction=0.046, pad=0.04)
fig.savefig("depth.png")
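
If you only need an image file rather than a figure, the depth map can also be rescaled to 8-bit grayscale. The sketch below uses a hypothetical helper, `depth_to_uint8`, that min-max normalizes a single map; it assumes the map has already been converted to a NumPy array (for example with `depth.cpu().numpy()`). Because each map is normalized on its own, gray levels are not comparable across images.

```python
import numpy as np


def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Min-max normalize an (H, W) depth map to the range 0..255 (hypothetical helper)."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-12:
        # Constant map: avoid dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)
    return np.round(scaled * 255.0).astype(np.uint8)


gray = depth_to_uint8(np.array([[0.0, 1.0], [2.0, 4.0]]))
# The result can be saved with Pillow, e.g. Image.fromarray(gray).save("depth_gray.png").
```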

Predict Metric Depth¶

Metric models return depth in meters, with larger values corresponding to farther scene content.

Depth Anything V3¶

DAv3 metric models require the camera intrinsics of the input image, which set the absolute scale of the prediction. Pass a (3, 3) intrinsics matrix in the original image’s pixel coordinates via the intrinsics argument:

import math

import torch
from PIL import Image

import lightly_train

model = lightly_train.load_model("dinov2/dav3-metric-large")

# Approximate intrinsics from an assumed 60° horizontal field of view.
image = Image.open("image.jpg")
width, height = image.size
focal_px = (width / 2) / math.tan(math.radians(60.0) / 2)
intrinsics = torch.tensor(
    [
        [focal_px, 0.0, width / 2],
        [0.0, focal_px, height / 2],
        [0.0, 0.0, 1.0],
    ]
)

depth_m = model.predict("image.jpg", intrinsics=intrinsics)
print(f"depth range: {depth_m.min():.2f} m – {depth_m.max():.2f} m")

If you do not know the true intrinsics, an approximation from the field of view (as above) still gives a reasonable scale.
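
If you know the lens focal length in millimeters and the sensor width (for example from EXIF metadata or the camera's data sheet), the focal length in pixels follows from the pinhole relation focal_px = focal_mm * image_width / sensor_width_mm. The sketch below is a hypothetical helper, not part of LightlyTrain. It assumes square pixels, a principal point at the image center, and an image that has not been cropped.

```python
def intrinsics_from_focal_mm(
    focal_mm: float,
    sensor_width_mm: float,
    width: int,
    height: int,
) -> list[list[float]]:
    """Build a 3x3 pinhole intrinsics matrix as nested lists (hypothetical helper)."""
    # Convert the focal length from millimeters to pixels.
    focal_px = focal_mm * width / sensor_width_mm
    return [
        [focal_px, 0.0, width / 2],
        [0.0, focal_px, height / 2],
        [0.0, 0.0, 1.0],
    ]


# 50 mm lens on a 36 mm wide sensor, 3600 x 2400 pixel image.
K = intrinsics_from_focal_mm(50.0, 36.0, width=3600, height=2400)
```

Wrap the result in `torch.tensor(K)` before passing it as the intrinsics argument.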

Depth Anything V2¶

DAv2 metric models are trained for a fixed domain and do not take camera intrinsics. Choose the model that matches your scene:

hypersim models are trained on indoor scenes (depth up to 20 m).
vkitti models are trained on outdoor driving scenes (depth up to 80 m).

import lightly_train

model = lightly_train.load_model("dinov2/dav2-metric-small-hypersim")
depth_m = model.predict("image.jpg")  # Metric depth in meters.
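
Because metric depth is expressed in meters, it can be combined with pinhole intrinsics to back-project every pixel into a 3D point. The following is a minimal sketch of that standard step, not a LightlyTrain API: `backproject` is a hypothetical helper, and it assumes the depth values are distances along the optical axis (z-depth) and that the intrinsics refer to the resolution of the depth map.

```python
import numpy as np


def backproject(depth_m: np.ndarray, intrinsics: np.ndarray) -> np.ndarray:
    """Turn an (H, W) z-depth map into an (H, W, 3) array of camera-frame XYZ points."""
    height, width = depth_m.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Pixel grids: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1)


# Toy example: a flat wall 2 m away seen by a 4x4 pixel camera.
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
points = backproject(np.full((4, 4), 2.0), K)
```

With real predictions, convert the tensors first, for example `backproject(depth_m.cpu().numpy(), intrinsics.numpy())`.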

Batch Inference¶

Use predict_batch to run inference on several images at once. It returns a list of (H, W) tensors, one per image, each resized back to its original resolution.

import lightly_train

model = lightly_train.load_model("dinov2/dav3-relative-large")
depths = model.predict_batch(["image1.jpg", "image2.jpg"])

For DAv3 metric models, pass one intrinsics matrix per image:

model = lightly_train.load_model("dinov2/dav3-metric-large")
depths = model.predict_batch(
    ["image1.jpg", "image2.jpg"],
    intrinsics=[intrinsics1, intrinsics2],
)

Note

By default (process_res_method="square_resize") every image is resized to the same square processing resolution, so batches of differently sized images are handled without any cropping. The aspect-preserving methods ("upper_bound_resize" and "lower_bound_resize") can yield different processed sizes across a batch; those images are then center-cropped to the smallest processed size before inference, so their depth maps are slightly stretched when resized back to the original resolution. For pixel-perfect results with an aspect-preserving method, call predict on each image individually.
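
One way to keep batching with an aspect-preserving method is to batch only images of identical size. The sketch below rests on the assumption that images with the same input size end up with the same processed size, so no cropping is needed within a group. `group_by_size` is a hypothetical helper, not part of LightlyTrain; the sizes could be read with Pillow via `Image.open(path).size`.

```python
from collections import defaultdict


def group_by_size(image_sizes: dict[str, tuple[int, int]]) -> list[list[str]]:
    """Group image paths by their (width, height) (hypothetical helper)."""
    groups: dict[tuple[int, int], list[str]] = defaultdict(list)
    for path, size in image_sizes.items():
        groups[tuple(size)].append(path)
    return list(groups.values())


sizes = {"a.jpg": (640, 480), "b.jpg": (1920, 1080), "c.jpg": (640, 480)}
batches = group_by_size(sizes)
```

Each inner list can then be passed to predict_batch on its own.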

Using Non-Hosted Depth Anything V2 Checkpoints¶

To use a DAv2 model that LightlyTrain does not host, convert the official Depth Anything V2 weights into a LightlyTrain checkpoint. You are responsible for complying with each model’s license terms.

Download the official weights for the model you want from the corresponding Depth Anything V2 Hugging Face repository (for example depth_anything_v2_metric_vkitti_vitl.pth).

Convert them into a LightlyTrain checkpoint:

python -m lightly_train._task_models.depth_estimation_components.convert_checkpoint_dav2 \
    --model-name dinov2/dav2-metric-large-vkitti \
    --weights path/to/depth_anything_v2_metric_vkitti_vitl.pth \
    --out ckpt/dav2_metric_vkitti_large.pt

Load the converted checkpoint like any other model:

import lightly_train

model = lightly_train.load_model("ckpt/dav2_metric_vkitti_large.pt")
depth_m = model.predict("image.jpg")

Tip

Converting the Apache-2.0 models (dinov2/dav2-relative-small and dinov2/dav2-metric-small-hypersim) is not necessary because they are hosted by LightlyTrain and downloaded automatically. If you convert them anyway, the converter can fetch the official weights from Hugging Face directly, so the --weights argument can be omitted.

Exporting a Checkpoint to ONNX¶

Open Neural Network Exchange (ONNX) is a standard format for representing machine learning models in a framework-independent manner. In particular, it is useful for deploying our models on edge devices where PyTorch is not available.

Note

The ONNX graph contains only the model forward pass. It outputs the raw depth map at the model’s processing resolution (plus a sky map for Depth Anything V3 models, which have a sky head). Preprocessing the input image and postprocessing the output (sky filling, metric scaling, and resizing back to the original image resolution) are not part of the graph and must be applied by the caller.
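
As an illustration of the caller-side preprocessing, the sketch below converts an RGB image into a normalized NCHW tensor. It is an assumption, not the documented input contract: it presumes the exported graph expects float32 input normalized with the ImageNet mean and standard deviation commonly used with DINOv2 backbones, and that the image has already been resized to the export resolution. `preprocess` is a hypothetical helper; verify the expected input against the export_onnx() documentation before relying on it.

```python
import numpy as np

# Assumed normalization constants (ImageNet statistics), not confirmed by the docs.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB array to a (1, 3, H, W) float32 array."""
    if image_rgb.ndim != 3 or image_rgb.shape[2] != 3:
        raise ValueError("expected an (H, W, 3) RGB array")
    x = image_rgb.astype(np.float32) / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD
    # Channels first, then add the batch dimension.
    return np.transpose(x, (2, 0, 1))[None]


batch = preprocess(np.zeros((504, 504, 3), dtype=np.uint8))
```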

Requirements¶

Exporting to ONNX requires some additional packages:

onnx
onnxruntime if verify is set to True.
onnxslim if simplify is set to True.

You can install them with:

pip install "lightly-train[onnx,onnxruntime,onnxslim]"
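
To check which of these optional packages are present in the current environment before exporting, a small standard-library probe is enough. `missing_packages` below is a hypothetical helper and not part of LightlyTrain.

```python
from importlib.util import find_spec


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level packages that cannot be imported (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["onnx", "onnxruntime", "onnxslim"])
if missing:
    print("Missing optional packages:", ", ".join(missing))
```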

The following example shows how to export a model to ONNX.

import lightly_train

# Load a model.
model = lightly_train.load_model("dinov2/dav3-relative-large")

# Export to ONNX.
model.export_onnx(
    out="model.onnx",
    # precision="fp16", # Export model with FP16 weights for smaller size and faster inference.
)

See export_onnx() for all available options when exporting to ONNX.

Exporting a Checkpoint to TensorRT¶

TensorRT engines are built from an ONNX representation of the model. The export_tensorrt method internally exports the model to ONNX (see the ONNX export section above) before building a TensorRT engine for fast GPU inference.

Requirements¶

TensorRT is not part of LightlyTrain’s dependencies and must be installed separately. Installation depends on your OS, Python version, GPU, and NVIDIA driver/CUDA setup. See the TensorRT documentation for more details.

On CUDA 12.x systems you can often install the Python package via:

pip install tensorrt-cu12

The following example shows how to export a model to TensorRT.

import lightly_train

# Load a model.
model = lightly_train.load_model("dinov2/dav3-relative-large")

# Export to TensorRT. The model is exported to ONNX internally first.
model.export_tensorrt(
    out="model.trt", # TensorRT engine destination.
    # precision="fp16", # Export model with FP16 weights for smaller size and faster inference.
)

See export_tensorrt() for all available options when exporting to TensorRT.

Combining Conversion and Export¶

ONNX export and TensorRT engine building can be combined in a single call: building a TensorRT engine requires an ONNX model, and export_tensorrt exports it for you. You can fix the export resolution via onnx_args (height and width, both multiples of the patch size 14) and pick the precision at the same time:

import lightly_train

# A hosted model name, or a path to a converted DAv2 checkpoint.
model = lightly_train.load_model("dinov2/dav3-relative-large")

# Export to ONNX at a fixed 504x504 resolution (504 = 36 * 14) and build an FP16 engine.
model.export_tensorrt(
    out="model.trt",
    onnx_args={"height": 504, "width": 504},
    precision="fp16",
)
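
If your target resolution is not already a multiple of the patch size, you can snap it to the nearest valid value before building onnx_args. This is a minimal sketch: `snap_to_patch` is a hypothetical helper, and only the patch size of 14 comes from the text above.

```python
PATCH_SIZE = 14


def snap_to_patch(size: int, patch_size: int = PATCH_SIZE) -> int:
    """Round a side length to the nearest positive multiple of the patch size."""
    return max(patch_size, round(size / patch_size) * patch_size)


# 500 is not a multiple of 14 and becomes 504; 700 already is one.
onnx_args = {"height": snap_to_patch(500), "width": snap_to_patch(700)}
```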