Depth Estimation (NEW)

Note

LightlyTrain supports depth estimation inference with Depth Anything V2 and Depth Anything V3 models. Training support will be released soon!

LightlyTrain ports the Depth Anything V2 (DAv2) and V3 (DAv3) monocular depth estimation models. Both come in two flavors:

  • Relative depth predicts an unscaled depth map: it captures the ordering of the scene (what is closer and what is farther) but the values have no physical unit.

  • Metric depth predicts depth in meters, suitable for 3D reconstruction and any application that needs absolute scale.

Warning

The meaning of the predicted values is not the same across models. For DAv2 relative models, larger values are nearer. For DAv3 relative and for all metric models, larger values are farther.
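
Code that mixes models may want to bring every map to one convention before visualizing or comparing it. The following is a minimal sketch, assuming the prediction has been converted to a NumPy array (for example with `depth.cpu().numpy()`). The helper name `to_larger_is_farther` is hypothetical and not part of LightlyTrain, and the min-max flip only preserves the ordering; it does not produce metric values.

```python
import numpy as np


def to_larger_is_farther(depth: np.ndarray, model_name: str) -> np.ndarray:
    """Return a [0, 1] map in which larger values are farther (hypothetical helper).

    Only the ordering is preserved; the result has no physical unit.
    """
    depth = depth.astype(np.float64)
    span = depth.max() - depth.min()
    if span == 0:
        # Constant map: nothing to order.
        return np.zeros_like(depth)
    normalized = (depth - depth.min()) / span
    # DAv2 relative models use "larger = nearer", so flip them.
    if "dav2-relative" in model_name:
        normalized = 1.0 - normalized
    return normalized
```

This is intended for visualization of relative maps; metric predictions already use "larger = farther" and would lose their unit if normalized this way.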

Models

All models use a DINOv2 ViT backbone.

Depth Anything V3

Model                       Type      Backbone
dinov2/dav3-relative-large  Relative  ViT-L/14
dinov2/dav3-metric-large    Metric    ViT-L/14

Depth Anything V2

Model                               Type      Backbone
dinov2/dav2-relative-small          Relative  ViT-S/14
dinov2/dav2-relative-base*          Relative  ViT-B/14
dinov2/dav2-relative-large*         Relative  ViT-L/14
dinov2/dav2-metric-small-hypersim   Metric    ViT-S/14
dinov2/dav2-metric-base-hypersim*   Metric    ViT-B/14
dinov2/dav2-metric-large-hypersim*  Metric    ViT-L/14
dinov2/dav2-metric-small-vkitti*    Metric    ViT-S/14
dinov2/dav2-metric-base-vkitti*     Metric    ViT-B/14
dinov2/dav2-metric-large-vkitti*    Metric    ViT-L/14

* Not hosted by LightlyTrain. See the note below.

Note

All Depth Anything V3 models are hosted by LightlyTrain and downloaded automatically by load_model. For Depth Anything V2, only the two small Apache-2.0 models (dinov2/dav2-relative-small and dinov2/dav2-metric-small-hypersim) are hosted. The remaining DAv2 models (marked with * above), that is, the ViT-B and ViT-L variants and all VKITTI variants, are released under non-commercial licenses (CC-BY-NC-4.0 for the relative base/large and the Hypersim metric base/large variants, CC-BY-NC-SA-3.0 for the VKITTI metric variants), so LightlyTrain does not host them. You can convert them from the official weights yourself; see Using Non-Hosted Depth Anything V2 Checkpoints.

Which model should I use?

  • Do you need depth in meters? If yes, pick a metric model. If you only need the relative ordering of the scene (closer vs. farther), pick a relative model: it is simpler to use and needs no camera information.

  • DAv3 or DAv2? DAv3 is the more recent model and generally the most accurate. Choose DAv2 if you need a smaller and faster ViT-S or ViT-B model, or a model under a permissive license for commercial use. The two hosted DAv2 small models are Apache-2.0.

  • Which DAv2 metric model? The metric DAv2 models are trained per domain: use a hypersim model for indoor scenes (depth up to 20 m) and a vkitti model for outdoor driving scenes (depth up to 80 m).
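
The questions above can be written down as a small lookup. This is only an illustrative sketch: `suggest_model` and its parameters are hypothetical names, not a LightlyTrain API, and it considers only the DAv3 large models and the DAv2 small models.

```python
def suggest_model(
    need_meters: bool,
    prefer_dav2_small: bool = False,
    outdoor_driving: bool = False,
) -> str:
    """Map the selection questions to a model name (hypothetical helper)."""
    if not prefer_dav2_small:
        # DAv3 is the more recent and generally the most accurate choice.
        return "dinov2/dav3-metric-large" if need_meters else "dinov2/dav3-relative-large"
    if not need_meters:
        return "dinov2/dav2-relative-small"
    if outdoor_driving:
        # Not hosted by LightlyTrain and under a non-commercial license.
        return "dinov2/dav2-metric-small-vkitti"
    # Indoor scenes; hosted and Apache-2.0.
    return "dinov2/dav2-metric-small-hypersim"
```

Note that `outdoor_driving` only matters for DAv2 metric models, because DAv3 metric models derive their scale from the camera intrinsics instead of a training domain.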

Quick Start

Load a model and call predict on an image. The image can be a file path, a URL, a PIL image, or a (C, H, W) tensor. The result is a single (H, W) tensor with the same height and width as the input image.

import lightly_train

# Load a model hosted by LightlyTrain (downloaded and cached automatically).
model = lightly_train.load_model("dinov2/dav3-relative-large")

# Predict a relative-depth map. Returns a (H, W) tensor matching the input resolution.
depth = model.predict("image.jpg")

Tip

By default load_model runs on a GPU ("cuda" or "mps") if one is available and falls back to CPU otherwise. Pass device= to choose explicitly, e.g. lightly_train.load_model("dinov2/dav3-relative-large", device="cuda"). The ViT-L models are large, so a GPU is recommended for them.

Visualize the Result

The depth map is a plain tensor, so you can colorize and display it with matplotlib:

import matplotlib.pyplot as plt
from PIL import Image

import lightly_train

model = lightly_train.load_model("dinov2/dav3-relative-large")
depth = model.predict("image.jpg")

# Plot the input image and the colorized depth map side by side, then save the figure.
image = Image.open("image.jpg")
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].imshow(image)
axes[0].set_title("Input")
axes[0].axis("off")
depth_vis = axes[1].imshow(depth.cpu(), cmap="Spectral_r")
axes[1].set_title("Relative depth (larger = farther)")
axes[1].axis("off")
fig.colorbar(depth_vis, ax=axes[1], fraction=0.046, pad=0.04)
fig.savefig("depth.png")

Predict Metric Depth

Metric models return depth in meters, with larger values corresponding to farther scene content.

Depth Anything V3

DAv3 metric models require the camera intrinsics of the input image, which set the absolute scale of the prediction. Pass a (3, 3) intrinsics matrix in the original image’s pixel coordinates via the intrinsics argument:

import math

import torch
from PIL import Image

import lightly_train

model = lightly_train.load_model("dinov2/dav3-metric-large")

# Approximate intrinsics from an assumed 60° horizontal field of view.
image = Image.open("image.jpg")
width, height = image.size
focal_px = (width / 2) / math.tan(math.radians(60.0) / 2)
intrinsics = torch.tensor(
    [
        [focal_px, 0.0, width / 2],
        [0.0, focal_px, height / 2],
        [0.0, 0.0, 1.0],
    ]
)

depth_m = model.predict("image.jpg", intrinsics=intrinsics)
print(f"depth range: {depth_m.min():.2f} m – {depth_m.max():.2f} m")

If you do not know the true intrinsics, an approximation from the field of view (as above) still gives a reasonable scale.
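
If the field of view is unknown but the image metadata reports a 35 mm-equivalent focal length, that value can serve the same purpose. The sketch below assumes the equivalence is defined on the frame diagonal of a 36 × 24 mm full frame (conventions vary between sources) and a principal point at the image center. `intrinsics_from_35mm` is a hypothetical helper, not a LightlyTrain function.

```python
import math


def intrinsics_from_35mm(
    focal_35mm: float, width: int, height: int
) -> list[list[float]]:
    """Approximate a 3x3 intrinsics matrix from a 35 mm-equivalent focal length."""
    # Scale the focal length by the ratio of the image diagonal (in pixels)
    # to the full-frame diagonal (in millimeters).
    full_frame_diagonal_mm = math.hypot(36.0, 24.0)
    focal_px = focal_35mm * math.hypot(width, height) / full_frame_diagonal_mm
    return [
        [focal_px, 0.0, width / 2],
        [0.0, focal_px, height / 2],
        [0.0, 0.0, 1.0],
    ]
```

Wrap the result in `torch.tensor(...)` before passing it as the intrinsics argument.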

Depth Anything V2

DAv2 metric models are trained for a fixed domain and do not take camera intrinsics. Choose the model that matches your scene:

  • hypersim models are trained on indoor scenes (depth up to 20 m).

  • vkitti models are trained on outdoor driving scenes (depth up to 80 m).

import lightly_train

model = lightly_train.load_model("dinov2/dav2-metric-small-hypersim")
depth_m = model.predict("image.jpg")  # Metric depth in meters.
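
Because metric models return depth in meters, a prediction can be lifted to 3D points with the standard pinhole camera model. This is a minimal sketch, assuming the values are z-depth along the optical axis (not distance along the viewing ray) and that the depth map and intrinsics have been converted to NumPy arrays. `backproject` is a hypothetical helper; for DAv2 models the intrinsics have to come from your camera, since the model does not take them.

```python
import numpy as np


def backproject(depth_m: np.ndarray, intrinsics: np.ndarray) -> np.ndarray:
    """Lift an (H, W) z-depth map to an (H, W, 3) array of camera-frame points."""
    height, width = depth_m.shape
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    # Pixel grid: u runs along columns (x), v runs along rows (y).
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1)
```

For example, `backproject(depth_m.cpu().numpy(), intrinsics.numpy())` would yield one 3D point per pixel, in meters.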

Batch Inference

Use predict_batch to run inference on several images at once. It returns a list of (H, W) tensors, one per image, each resized back to its original resolution.

import lightly_train

model = lightly_train.load_model("dinov2/dav3-relative-large")
depths = model.predict_batch(["image1.jpg", "image2.jpg"])

For DAv3 metric models, pass one intrinsics matrix per image:

model = lightly_train.load_model("dinov2/dav3-metric-large")
depths = model.predict_batch(
    ["image1.jpg", "image2.jpg"],
    intrinsics=[intrinsics1, intrinsics2],
)

Note

Images with different aspect ratios are center-cropped to the smallest processed size in the batch before inference, so their depth maps are slightly stretched when resized back to the original resolution. For pixel-perfect results on images of different sizes, call predict on each image individually.
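
One way to keep batching while avoiding the crop is to batch only images that share the same size. The sketch below assumes that same-sized images end up with the same processed size and are therefore not cropped. `group_by_size` is a hypothetical helper, not part of LightlyTrain.

```python
from collections import defaultdict


def group_by_size(
    sizes: dict[str, tuple[int, int]],
) -> dict[tuple[int, int], list[str]]:
    """Group image paths by their (width, height) so each group can be batched."""
    groups: dict[tuple[int, int], list[str]] = defaultdict(list)
    for path, size in sizes.items():
        groups[size].append(path)
    return dict(groups)
```

The sizes can be read with `Image.open(path).size` from PIL; each resulting list of paths can then be passed to predict_batch separately.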

Using Non-Hosted Depth Anything V2 Checkpoints

To use a DAv2 model that LightlyTrain does not host, convert the official Depth Anything V2 weights into a LightlyTrain checkpoint. You are responsible for complying with each model’s license terms.

  1. Download the official weights for the model you want from the corresponding Depth Anything V2 Hugging Face repository (for example depth_anything_v2_metric_vkitti_vitl.pth).

  2. Convert them into a LightlyTrain checkpoint:

    python -m lightly_train._task_models.depth_estimation_components.convert_checkpoint_dav2 \
        --model-name dinov2/dav2-metric-large-vkitti \
        --weights path/to/depth_anything_v2_metric_vkitti_vitl.pth \
        --out ckpt/dav2_metric_vkitti_large.pt
    
  3. Load the converted checkpoint like any other model:

    import lightly_train
    
    model = lightly_train.load_model("ckpt/dav2_metric_vkitti_large.pt")
    depth_m = model.predict("image.jpg")
    

Tip

Converting the Apache-2.0 models (dinov2/dav2-relative-small and dinov2/dav2-metric-small-hypersim) is not necessary because they are hosted by LightlyTrain and downloaded automatically. If you convert them anyway, the converter can fetch the official weights from Hugging Face directly, so the --weights argument can be omitted.