Depth Estimation (NEW)¶
Note
LightlyTrain supports depth estimation inference with Depth Anything V2 and Depth Anything V3 models. Training support will be released soon!
LightlyTrain ports the Depth Anything V2 (DAv2) and V3 (DAv3) monocular depth estimation models. Both come in two flavors:
Relative depth predicts an unscaled depth map: it captures the ordering of the scene (what is closer and what is farther) but the values have no physical unit.
Metric depth predicts depth in meters, suitable for 3D reconstruction and any application that needs absolute scale.
Warning
The meaning of the predicted values is not the same across models. For DAv2 relative models, larger values are nearer. For DAv3 relative and for all metric models, larger values are farther.
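Code that mixes model families may want to bring relative depth maps into one convention first. The following is a minimal sketch, not part of LightlyTrain: the helper name `to_farther_is_larger` and the min-max rescaling are assumptions for illustration, and it expects a NumPy array (for example from `depth.cpu().numpy()`, assuming the tensor does not track gradients). It only preserves the ordering of the scene; the output has no physical unit.

```python
import numpy as np


def to_farther_is_larger(depth: np.ndarray, larger_is_nearer: bool) -> np.ndarray:
    """Rescale a relative depth map to [0, 1] so that larger values are farther.

    Hypothetical helper for illustration; not part of the LightlyTrain API.
    """
    depth = depth.astype(np.float64)
    lo, hi = depth.min(), depth.max()
    if hi == lo:
        # A constant map carries no ordering information.
        return np.zeros_like(depth)
    scaled = (depth - lo) / (hi - lo)
    # DAv2 relative models use "larger = nearer", so flip the scale for them.
    return 1.0 - scaled if larger_is_nearer else scaled
```

Pass `larger_is_nearer=True` for DAv2 relative models and `False` for DAv3 relative models.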
Models¶
All models use a DINOv2 ViT backbone.
Depth Anything V3¶
| Model | Type | Backbone |
|---|---|---|
| dinov2/dav3-relative-large | Relative | ViT-L/14 |
| dinov2/dav3-metric-large | Metric | ViT-L/14 |
Depth Anything V2¶
| Model | Type | Backbone |
|---|---|---|
| dinov2/dav2-relative-small | Relative | ViT-S/14 |
| | Relative | ViT-B/14 |
| | Relative | ViT-L/14 |
| | Metric | ViT-S/14 |
| | Metric | ViT-B/14 |
| | Metric | ViT-L/14 |
| | Metric | ViT-S/14 |
| | Metric | ViT-B/14 |
| | Metric | ViT-L/14 |
* Not hosted by LightlyTrain. See the note below.
Note
All Depth Anything V3 models are hosted by LightlyTrain and downloaded automatically by
load_model. For Depth Anything V2, only the two small Apache-2.0 models
(dinov2/dav2-relative-small and dinov2/dav2-metric-small-hypersim) are hosted. The
remaining DAv2 models (marked with * above), namely the ViT-B/ViT-L variants and all
VKITTI variants, are released under non-commercial licenses (CC-BY-NC-4.0 for the
relative base/large and the Hypersim metric base/large variants, CC-BY-NC-SA-3.0 for the
VKITTI metric variants), so LightlyTrain does not host them. You can convert them from
the official weights yourself; see
Using Non-Hosted Depth Anything V2 Checkpoints.
Which model should I use?¶
Do you need depth in meters? If yes, pick a metric model. If you only need the relative ordering of the scene (closer vs. farther), pick a relative model; it is simpler to use and needs no camera information.
DAv3 or DAv2? DAv3 is the more recent model and generally the most accurate. Choose DAv2 if you need a smaller and faster ViT-S or ViT-B model, or a model under a permissive license for commercial use: the two hosted DAv2 small models are Apache-2.0.
Which DAv2 metric model? The metric DAv2 models are trained per domain: use a
hypersim model for indoor scenes (depth up to 20 m) and a vkitti model for outdoor driving scenes (depth up to 80 m).
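The guidance above can be written down as a small decision helper. This is only a sketch: `pick_model` and its parameters are hypothetical names, and it returns only hosted model names that appear on this page, so it does not cover the non-hosted DAv2 variants.

```python
def pick_model(need_metric: bool, need_small_permissive: bool = False) -> str:
    """Return a hosted model name following the guidance above.

    Hypothetical helper for illustration; not part of the LightlyTrain API.
    """
    if need_small_permissive:
        # Hosted Apache-2.0 DAv2 ViT-S models. The metric one is trained on
        # Hypersim, so it targets indoor scenes only.
        if need_metric:
            return "dinov2/dav2-metric-small-hypersim"
        return "dinov2/dav2-relative-small"
    # Otherwise default to the DAv3 ViT-L models.
    if need_metric:
        return "dinov2/dav3-metric-large"
    return "dinov2/dav3-relative-large"
```

The returned string can be passed to `lightly_train.load_model`. Keep in mind that the DAv3 metric model additionally needs camera intrinsics at prediction time.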
Quick Start¶
Load a model and call predict on an image. The image can be a file path, a URL, a PIL
image, or a (C, H, W) tensor. The result is a single (H, W) tensor with the same
height and width as the input image.
import lightly_train
# Load a model hosted by LightlyTrain (downloaded and cached automatically).
model = lightly_train.load_model("dinov2/dav3-relative-large")
# Predict a relative-depth map. Returns a (H, W) tensor matching the input resolution.
depth = model.predict("image.jpg")
Tip
By default load_model runs on a GPU ("cuda" or "mps") if one is available and falls
back to CPU otherwise. Pass device= to choose explicitly, e.g.
lightly_train.load_model("dinov2/dav3-relative-large", device="cuda"). The default ViT-L
models are sizable, so a GPU is recommended.
Visualize the Result¶
The depth map is a plain tensor, so you can colorize and display it with matplotlib:
import matplotlib.pyplot as plt
from PIL import Image
import lightly_train
model = lightly_train.load_model("dinov2/dav3-relative-large")
depth = model.predict("image.jpg")
# Plot the input image and the colorized depth map side by side, then save the figure.
image = Image.open("image.jpg")
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].imshow(image)
axes[0].set_title("Input")
axes[0].axis("off")
depth_vis = axes[1].imshow(depth.cpu(), cmap="Spectral_r")
axes[1].set_title("Relative depth (larger = farther)")
axes[1].axis("off")
fig.colorbar(depth_vis, ax=axes[1], fraction=0.046, pad=0.04)
fig.savefig("depth.png")
Predict Metric Depth¶
Metric models return depth in meters, with larger values corresponding to farther scene content.
Depth Anything V3¶
DAv3 metric models require the camera intrinsics of the input image, which set the
absolute scale of the prediction. Pass a (3, 3) intrinsics matrix in the original
image’s pixel coordinates via the intrinsics argument:
import math
import torch
from PIL import Image
import lightly_train
model = lightly_train.load_model("dinov2/dav3-metric-large")
# Approximate intrinsics from an assumed 60° horizontal field of view.
image = Image.open("image.jpg")
width, height = image.size
focal_px = (width / 2) / math.tan(math.radians(60.0) / 2)
intrinsics = torch.tensor(
    [
        [focal_px, 0.0, width / 2],
        [0.0, focal_px, height / 2],
        [0.0, 0.0, 1.0],
    ]
)
depth_m = model.predict("image.jpg", intrinsics=intrinsics)
print(f"depth range: {depth_m.min():.2f} m – {depth_m.max():.2f} m")
If you do not know the true intrinsics, an approximation from the field of view (as above) still gives a reasonable scale.
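With metric depth and intrinsics in hand, a common next step toward 3D reconstruction is to back-project each pixel into camera space. The sketch below is not a LightlyTrain API: `backproject` is a hypothetical helper, and it assumes a pinhole camera and that each depth value is the distance along the optical axis (z-depth), which this page does not specify. It works on NumPy arrays.

```python
import numpy as np


def backproject(depth_m: np.ndarray, intrinsics: np.ndarray) -> np.ndarray:
    """Turn an (H, W) depth map into an (H, W, 3) array of camera-space points.

    Assumes a pinhole camera and z-depth values. Hypothetical helper for
    illustration; not part of the LightlyTrain API.
    """
    height, width = depth_m.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Pixel coordinate grids: u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1)
```

For a prediction, something like `backproject(depth_m.cpu().numpy(), intrinsics.numpy())` would give one 3D point per pixel, in meters. If the intrinsics are only approximated from a field of view, the resulting geometry is approximate as well.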
Depth Anything V2¶
DAv2 metric models are trained for a fixed domain and do not take camera intrinsics. Choose the model that matches your scene:
hypersim models are trained on indoor scenes (depth up to 20 m).
vkitti models are trained on outdoor driving scenes (depth up to 80 m).
import lightly_train
model = lightly_train.load_model("dinov2/dav2-metric-small-hypersim")
depth_m = model.predict("image.jpg") # Metric depth in meters.
Batch Inference¶
Use predict_batch to run inference on several images at once. It returns a list of
(H, W) tensors, one per image, each resized back to its original resolution.
import lightly_train
model = lightly_train.load_model("dinov2/dav3-relative-large")
depths = model.predict_batch(["image1.jpg", "image2.jpg"])
For DAv3 metric models, pass one intrinsics matrix per image:
model = lightly_train.load_model("dinov2/dav3-metric-large")
depths = model.predict_batch(
    ["image1.jpg", "image2.jpg"],
    intrinsics=[intrinsics1, intrinsics2],
)
Note
Images with different aspect ratios are center-cropped to the smallest processed size in
the batch before inference, so their depth maps are slightly stretched when resized back
to the original resolution. For pixel-perfect results on images of different sizes, call
predict on each image individually.
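One way to keep batching while avoiding the crop is to batch only images that share the same size. The sketch below is an illustration under that assumption, not LightlyTrain behavior: `group_by_size` is a hypothetical helper, and the sizes are expected to come from the caller (for example from PIL's `Image.open(path).size`).

```python
from collections import defaultdict


def group_by_size(sizes: dict[str, tuple[int, int]]) -> list[list[str]]:
    """Group image paths so that every group shares one (width, height).

    Hypothetical helper for illustration; not part of the LightlyTrain API.
    """
    groups: dict[tuple[int, int], list[str]] = defaultdict(list)
    for path, size in sizes.items():
        groups[size].append(path)
    # Dicts keep insertion order, so groups appear in first-seen order.
    return list(groups.values())
```

Each group can then be passed to `predict_batch` on its own, and a group with a single path can go through `predict` instead.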
Using Non-Hosted Depth Anything V2 Checkpoints¶
To use a DAv2 model that LightlyTrain does not host, convert the official Depth Anything V2 weights into a LightlyTrain checkpoint. You are responsible for complying with each model’s license terms.
Download the official weights for the model you want from the corresponding Depth Anything V2 Hugging Face repository (for example
depth_anything_v2_metric_vkitti_vitl.pth).
Convert them into a LightlyTrain checkpoint:
python -m lightly_train._task_models.depth_estimation_components.convert_checkpoint_dav2 \
    --model-name dinov2/dav2-metric-large-vkitti \
    --weights path/to/depth_anything_v2_metric_vkitti_vitl.pth \
    --out ckpt/dav2_metric_vkitti_large.pt
Load the converted checkpoint like any other model:
import lightly_train
model = lightly_train.load_model("ckpt/dav2_metric_vkitti_large.pt")
depth_m = model.predict("image.jpg")
Tip
Converting the Apache-2.0 models (dinov2/dav2-relative-small and
dinov2/dav2-metric-small-hypersim) is not necessary because they are hosted by
LightlyTrain and downloaded automatically. If you convert them anyway, the converter can
fetch the official weights from Hugging Face directly, so the --weights argument can be
omitted.
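If you convert several checkpoints, the command line shown above can be assembled from Python. This is a minimal sketch that only reuses the module path and the `--model-name`, `--weights`, and `--out` flags from this page; `build_convert_command` is a hypothetical helper, and running the command requires LightlyTrain to be installed.

```python
import sys
from typing import Optional


def build_convert_command(
    model_name: str, out: str, weights: Optional[str] = None
) -> list[str]:
    """Assemble the converter invocation as an argument list.

    Hypothetical helper for illustration; not part of the LightlyTrain API.
    """
    cmd = [
        sys.executable,
        "-m",
        "lightly_train._task_models.depth_estimation_components.convert_checkpoint_dav2",
        "--model-name",
        model_name,
        "--out",
        out,
    ]
    if weights is not None:
        # Omit --weights only for the Apache-2.0 models, as described above.
        cmd += ["--weights", weights]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`.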