(depth-estimation-doc)= # Depth Estimation (NEW) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lightly-ai/lightly-train/blob/main/examples/notebooks/depth_estimation.ipynb) ```{note} LightlyTrain supports depth estimation inference with [Depth Anything V2](https://github.com/DepthAnything/Depth-Anything-V2) and [Depth Anything V3](https://github.com/ByteDance-Seed/Depth-Anything-3) models. Training support will be released soon! ``` LightlyTrain ports the Depth Anything V2 (DAv2) and V3 (DAv3) monocular depth estimation models. Both come in two flavors: - **Relative depth** predicts an unscaled depth map: it captures the ordering of the scene (what is closer and what is farther) but the values have no physical unit. - **Metric depth** predicts depth in meters, suitable for 3D reconstruction and any application that needs absolute scale. ```{warning} The meaning of the predicted values is not the same across models. For **DAv2 relative** models, larger values are *nearer*. For **DAv3 relative** and for all **metric** models, larger values are *farther*. ``` (depth-estimation-models)= ## Models All models use a DINOv2 ViT backbone. ### Depth Anything V3 | Model | Type | Backbone | | ---------------------------- | -------- | :------: | | `dinov2/dav3-relative-large` | Relative | ViT-L/14 | | `dinov2/dav3-metric-large` | Metric | ViT-L/14 | ### Depth Anything V2 | Model | Type | Backbone | | ------------------------------------- | -------- | :------: | | `dinov2/dav2-relative-small` | Relative | ViT-S/14 | | `dinov2/dav2-relative-base`\* | Relative | ViT-B/14 | | `dinov2/dav2-relative-large`\* | Relative | ViT-L/14 | | `dinov2/dav2-metric-small-hypersim` | Metric | ViT-S/14 | | `dinov2/dav2-metric-base-hypersim`\* | Metric | ViT-B/14 | | `dinov2/dav2-metric-large-hypersim`\* | Metric | ViT-L/14 | | `dinov2/dav2-metric-small-vkitti`\* | Metric | ViT-S/14 | | `dinov2/dav2-metric-base-vkitti`\* | Metric | ViT-B/14 | | `dinov2/dav2-metric-large-vkitti`\* | Metric | ViT-L/14 | \* Not hosted by LightlyTrain. See the note below. ```{note} All Depth Anything V3 models are hosted by LightlyTrain and downloaded automatically by `load_model`. For Depth Anything V2, only the two small Apache-2.0 models (`dinov2/dav2-relative-small` and `dinov2/dav2-metric-small-hypersim`) are hosted. The remaining DAv2 models (marked with \* above) — the ViT-B/ViT-L variants and all VKITTI variants — are released under non-commercial licenses ([CC-BY-NC-4.0](https://github.com/DepthAnything/Depth-Anything-V2) for the relative base/large and the Hypersim metric variants, [CC-BY-NC-SA-3.0](https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds-vkitti-2/) for the VKITTI metric variants), so LightlyTrain does not host them. You can convert them from the official weights yourself, see [Using Non-Hosted Depth Anything V2 Checkpoints](#depth-estimation-convert). ``` ### Which model should I use? - **Do you need depth in meters?** If yes, pick a **metric** model. If you only need the relative ordering of the scene (closer vs. farther), pick a **relative** model, it is simpler to use and needs no camera information. - **DAv3 or DAv2?** DAv3 is the recent model and generally the most accurate. Choose DAv2 if you need a smaller and faster ViT-S or ViT-B model, or a model under a permissive license for commercial use, the two hosted DAv2 small models are Apache-2.0. - **Which DAv2 metric model?** The metric DAv2 models are trained per domain: use a `hypersim` model for **indoor** scenes (depth up to 20 m) and a `vkitti` model for **outdoor** driving scenes (depth up to 80 m). (depth-estimation-relative)= ## Quick Start Load a model and call `predict` on an image. The image can be a file path, a URL, a PIL image, or a `(C, H, W)` tensor. The result is a single `(H, W)` tensor with the same height and width as the input image. ```python import lightly_train # Load a model hosted by LightlyTrain (downloaded and cached automatically). model = lightly_train.load_model("dinov2/dav3-relative-large") # Predict a relative-depth map. Returns a (H, W) tensor matching the input resolution. depth = model.predict("image.jpg") ``` ```{tip} By default `load_model` runs on a GPU (`"cuda"` or `"mps"`) if one is available and falls back to CPU otherwise. Pass `device=` to choose explicitly, e.g. `lightly_train.load_model("dinov2/dav3-relative-large", device="cuda")`. The default ViT-L models are sizable, so a GPU is recommended. ``` ### Visualize the Result The depth map is a plain tensor, so you can colorize and display it with `matplotlib`: ```python import matplotlib.pyplot as plt from PIL import Image import lightly_train model = lightly_train.load_model("dinov2/dav3-relative-large") depth = model.predict("image.jpg") # Colorize the depth map and save it next to the input image. image = Image.open("image.jpg") fig, axes = plt.subplots(1, 2, figsize=(14, 6)) axes[0].imshow(image) axes[0].set_title("Input") axes[0].axis("off") depth_vis = axes[1].imshow(depth.cpu(), cmap="Spectral_r") axes[1].set_title("Relative depth (larger = farther)") axes[1].axis("off") fig.colorbar(depth_vis, ax=axes[1], fraction=0.046, pad=0.04) fig.savefig("depth.png") ``` (depth-estimation-metric)= ## Predict Metric Depth Metric models return depth in **meters**, with larger values corresponding to farther scene content. ### Depth Anything V3 DAv3 metric models require the camera intrinsics of the input image, which set the absolute scale of the prediction. Pass a `(3, 3)` intrinsics matrix in the original image's pixel coordinates via the `intrinsics` argument: ```python import math import torch from PIL import Image import lightly_train model = lightly_train.load_model("dinov2/dav3-metric-large") # Approximate intrinsics from an assumed 60° horizontal field of view. image = Image.open("image.jpg") width, height = image.size focal_px = (width / 2) / math.tan(math.radians(60.0) / 2) intrinsics = torch.tensor( [ [focal_px, 0.0, width / 2], [0.0, focal_px, height / 2], [0.0, 0.0, 1.0], ] ) depth_m = model.predict("image.jpg", intrinsics=intrinsics) print(f"depth range: {depth_m.min():.2f} m – {depth_m.max():.2f} m") ``` If you do not know the true intrinsics, an approximation from the field of view (as above) still gives a reasonable scale. ### Depth Anything V2 DAv2 metric models are trained for a fixed domain and do **not** take camera intrinsics. Choose the model that matches your scene: - `hypersim` models are trained on indoor scenes (depth up to 20 m). - `vkitti` models are trained on outdoor driving scenes (depth up to 80 m). ```python import lightly_train model = lightly_train.load_model("dinov2/dav2-metric-small-hypersim") depth_m = model.predict("image.jpg") # Metric depth in meters. ``` (depth-estimation-batch)= ## Batch Inference Use `predict_batch` to run inference on several images at once. It returns a list of `(H, W)` tensors, one per image, each resized back to its original resolution. ```python import lightly_train model = lightly_train.load_model("dinov2/dav3-relative-large") depths = model.predict_batch(["image1.jpg", "image2.jpg"]) ``` For DAv3 metric models, pass one intrinsics matrix per image: ```python skip_ruff model = lightly_train.load_model("dinov2/dav3-metric-large") depths = model.predict_batch( ["image1.jpg", "image2.jpg"], intrinsics=[intrinsics1, intrinsics2], ) ``` ```{note} Images with different aspect ratios are center-cropped to the smallest processed size in the batch before inference, so their depth maps are slightly stretched when resized back to the original resolution. For pixel-perfect results on images of different sizes, call `predict` on each image individually. ``` (depth-estimation-convert)= ## Using Non-Hosted Depth Anything V2 Checkpoints To use a DAv2 model that LightlyTrain does not host, convert the official Depth Anything V2 weights into a LightlyTrain checkpoint. You are responsible for complying with each model's license terms. 1. Download the official weights for the model you want from the corresponding [Depth Anything V2 Hugging Face repository](https://huggingface.co/collections/depth-anything/depth-anything-v2-666b22412f18a6dbfde23a93) (for example `depth_anything_v2_metric_vkitti_vitl.pth`). 1. Convert them into a LightlyTrain checkpoint: ```bash python -m lightly_train._task_models.depth_estimation_components.convert_checkpoint_dav2 \ --model-name dinov2/dav2-metric-large-vkitti \ --weights path/to/depth_anything_v2_metric_vkitti_vitl.pth \ --out ckpt/dav2_metric_vkitti_large.pt ``` 1. Load the converted checkpoint like any other model: ```python import lightly_train model = lightly_train.load_model("ckpt/dav2_metric_vkitti_large.pt") depth_m = model.predict("image.jpg") ``` ```{tip} Converting the Apache-2.0 models (`dinov2/dav2-relative-small` and `dinov2/dav2-metric-small-hypersim`) is not necessary, they are hosted by LightlyTrain and downloaded automatically. For these, the converter can fetch the official weights from Hugging Face directly, so the `--weights` argument can be omitted. ```