Brought to you by the LocalAI team, the folks behind LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware, no GPU required.
A from-scratch C++17/ggml port of Depth Anything 3 (ByteDance) for dependency-free monocular metric depth + camera pose inference. One self-contained GGUF file, no Python, no PyTorch, no CUDA toolkit at inference, just a small native library and CLI, and now faster than PyTorch on CPU, bit-exact against the original.
The same photo, depth computed side by side on CPU: identical output, depth-anything.cpp gets there first (full clip).
Given an image it recovers a dense metric depth map, per-pixel confidence, the camera extrinsics (3x4) and intrinsics (3x3), an optional sky mask, a back-projected 3D point cloud, and exports to glb / COLMAP / PLY. Everything is verified numerically equal to the reference DA3 forward (correlation 1.0), component by component.
- Monocular metric depth + camera pose from a single image, plus multi-view depth+pose.
- Full DA3 family. small (ViT-S), base (ViT-B), large (ViT-L), giant (ViT-g), metric-large, mono-large, and the nested giant+large metric model - all driven by metadata baked into the GGUF.
- The whole output surface: depth, confidence, sky mask, extrinsics/intrinsics, ray-based pose, 3D Gaussians / point cloud, and
glb/COLMAP/PLYexport. - Self-contained GGUF. Every dimension, hyperparameter and preprocessing constant lives inside the file. The loader reads them; nothing is hardcoded, no external config or vocab is shipped.
- Quantization to f16 / q8_0 / q6_k / q5_k / q4_k - q4_k is 99 MB (0.25x the f32) and near-lossless.
- CPU-first, GPU-ready. Tuned CPU path (tinyBLAS, Winograd, flash-attention) plus CUDA / Metal / Vulkan ggml backends.
- Flat C API (
include/da_capi.h) - embed from C, C++, Go, or Rust. Powers the LocalAI backend. - Parity-first. Every component is gated against PyTorch-dumped reference tensors; the end-to-end depth matches the real
net()at correlation 1.0.
Convert any of the official Depth-Anything-3 checkpoints to GGUF. All run through the same metadata-driven engine:
| Model | Backbone | Output | Notes |
|---|---|---|---|
DA3-SMALL |
ViT-S | depth + conf + pose | smallest / fastest |
DA3-BASE |
ViT-B | depth + conf + pose | the default anchor |
DA3-LARGE |
ViT-L | depth + conf + pose | higher quality |
DA3-GIANT |
ViT-g | depth + conf + pose + 3D Gaussians | reconstruction |
DA3MONO-LARGE |
ViT-L | depth + sky | monocular DPT head |
DA3METRIC-LARGE |
ViT-L | metric depth + sky | metric branch |
DA3NESTED-GIANT-LARGE |
ViT-g + ViT-L | aligned metric depth + pose | two-branch alignment |
The same engine also runs Depth Anything V2 checkpoints (single-image depth only — no confidence, pose or sky). Relative models emit inverse depth (ReLU); metric models emit depth in metres (Sigmoid × max_depth).
| Model | Backbone | Output | Notes |
|---|---|---|---|
Depth-Anything-V2-Small |
ViT-S | relative depth | smallest / fastest |
Depth-Anything-V2-Base |
ViT-B | relative depth | |
Depth-Anything-V2-Large |
ViT-L | relative depth | higher quality |
Depth-Anything-V2-Metric-Hypersim-{Small,Base,Large} |
ViT-S/B/L | metric depth (m), indoor | max_depth=20 |
Depth-Anything-V2-Metric-VKITTI-{Small,Base,Large} |
ViT-S/B/L | metric depth (m), outdoor | max_depth=80 |
DA2 is depth only (no pose/confidence). The ViT-g (Giant) DA2 checkpoint is not shipped — its
Depth-Anything-V2-GiantHF repo is gated/unreleased.
depth-anything.cpp is now faster than PyTorch on CPU: 1.20x at f32 and 1.31x at q8_0 on the production @504 path, while also running the same model in half the memory, loading ~6.7x faster, shipping as a 99 MB quantized file, and needing no Python / PyTorch / CUDA at inference, all while staying bit-exact (correlation 1.0 vs the reference).
CPU, AMD Ryzen 9 9950X3D (16-core / 32-thread x86), threads=16, 504x336, sustained (repeat=25), PyTorch f32 baseline (full methodology + the CPU-optimization history in benchmarks/BENCHMARK.md):
| engine | quant | model MB | load ms | infer ms | peak RAM MB | vs PyTorch |
|---|---|---|---|---|---|---|
| PyTorch | f32 | 516 | 749 | 416.9 | 1328 | 1.00x |
| C++/ggml | f32 | 393 | 112 | 346.4 | 614 | 1.20x |
| C++/ggml | q8_0 | 142 | 40 | 319.4 | 363 | 1.31x |
| C++/ggml | q4_k | 99 | 25 | 395.2 | 320 | 1.05x |
What flipped it: two positional embeddings (the DPT head's UV embedding ~90 ms, the backbone's bicubic pos-embed ~10 ms) were recomputed every forward with single-threaded scalar sin/cos and bicubic loops, even though they depend only on the input geometry and are identical every call. Caching them removed ~95 ms of per-forward host overhead (PyTorch builds the same embeddings with vectorized ops). These are x86 + oneDNN numbers; see benchmarks/BENCHMARK.md. Camera pose adds a few ms on top of depth at f32. q8_0 is near-lossless; q4_k is the smallest. Bit-exact parity holds at every quant down to f16.
On GPU (NVIDIA GB10, via -DDA_GGML_CUDA=ON) the ggml CUDA path with flash attention ties PyTorch's tuned cuDNN (47.3 vs 47.3 ms/forward, ~47 ms across every quant) and loads 1.75-2.9x faster, so it wins the cold start (~548 vs ~926 ms). Details in benchmarks/BENCHMARK.md.
Real photos through the actual CLI, input next to the colorized depth (turbo):
More plots: load time, model size, quantization tradeoff.
git clone --recursive https://github.com/mudler/depth-anything.cpp
cd depth-anything.cpp
cmake -B build -DDA_BUILD_CLI=ON
cmake --build build -j
# -> build/examples/cli/da3-cli| Option | Default | Effect |
|---|---|---|
DA_BUILD_CLI |
ON | build the da3-cli tool |
DA_BUILD_TESTS |
OFF | build the ctest parity suite |
DA_SHARED |
OFF | build libdepthanything.so (static ggml, PIC) for embedding |
DA_GGML_LLAMAFILE |
ON | tinyBLAS AVX-512/AVX2 matmul kernels (faster CPU) |
DA_GGML_CUDA |
OFF | CUDA backend (-DCMAKE_CUDA_ARCHITECTURES=native auto) |
DA_GGML_METAL |
OFF | Apple Metal backend |
DA_GGML_VULKAN |
OFF | Vulkan backend |
GPU example: cmake -B build -DDA_GGML_CUDA=ON && cmake --build build -j. See docs/GPU.md.
Conversion and parity checks need a Python env; inference does not.
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python scripts/download_model.py --repo depth-anything/DA3-BASE --out models/DA3-BASE# depth + pose models (small/base/large/giant)
python scripts/convert_da3_to_gguf.py --model models/DA3-BASE --output models/depth-anything-base-f32.gguf
# monocular (depth + sky) - DA3MONO-LARGE
python scripts/convert_mono_to_gguf.py --model models/DA3MONO-LARGE --output models/depth-anything-mono-large-f32.gguf
# nested two-branch metric model
python scripts/convert_nested_to_gguf.py --model models/DA3NESTED-GIANT-LARGE --output-prefix models/depth-anything-nested
# Depth Anything V2 — relative (encoder vits/vitb/vitl)
python scripts/convert_da2_to_gguf.py --encoder vitl --ckpt models/depth_anything_v2_vitl.pth \
--output models/depth-anything2-large-f32.gguf --name Depth-Anything-V2-Large
# Depth Anything V2 — metric (add --max-depth: 20 for Hypersim/indoor, 80 for VKITTI/outdoor)
python scripts/convert_da2_to_gguf.py --encoder vits --ckpt models/depth_anything_v2_metric_hypersim_vits.pth \
--output models/depth-anything2-metric-hypersim-small-f32.gguf \
--name Depth-Anything-V2-Metric-Hypersim-Small --max-depth 20
python scripts/convert_da2_to_gguf.py --encoder vitl --ckpt models/depth_anything_v2_metric_vkitti_vitl.pth \
--output models/depth-anything2-metric-vkitti-large-f32.gguf \
--name Depth-Anything-V2-Metric-VKITTI-Large --max-depth 80build/examples/cli/da3-cli quantize models/depth-anything-base-f32.gguf models/depth-anything-base-q4_k.gguf q4_k
# types: f16 | q8_0 | q6_k | q5_k | q4_kCLI=build/examples/cli/da3-cli
M=models/depth-anything-base-f32.gguf
# Depth -> PFM (lossless float) + a colorizable grayscale PNG
$CLI depth --model $M --input photo.jpg --pfm depth.pfm --png depth.png
# Depth + camera pose (extrinsics 3x4 / intrinsics 3x3) as JSON
$CLI depth --model $M --input photo.jpg --pose pose.json
# Ray-based pose (solved from the auxiliary ray field; needs a --with-aux GGUF)
$CLI depth --model models/depth-anything-base-aux-f32.gguf --input photo.jpg --ray-pose --pose pose.json
# Monocular model: depth + sky mask
$CLI depth --model models/depth-anything-mono-large-f32.gguf --input photo.jpg --pfm depth.pfm --sky sky.pfm
# Nested metric-scale depth (two GGUFs)
$CLI depth --model nested-anyview.gguf --metric-model nested-metric.gguf --input photo.jpg --pfm metric.pfm
# Multi-view depth + pose
$CLI depth --model $M --input a.jpg --input b.jpg --out-prefix scene
# 3D export from a single image
$CLI depth --model $M --input photo.jpg --glb scene.glb --colmap colmap_out/
$CLI reconstruct --model models/depth-anything-giant-f32.gguf --input photo.jpg --ply cloud.ply
# Model metadata (arch, dims, head config, quant)
$CLI info --model $MThe .glb (glTF 2.0) and COLMAP cameras/images/points3D writers are dependency-free (no trimesh / pycolmap) and parity-checked against the reference exporters. See docs/EXPORT.md.
depth-anything.cpp ships as a native LocalAI backend (Go gRPC + purego, the .so static-links ggml - no external libggml). It exposes a typed Depth gRPC RPC and a POST /v1/depth REST endpoint returning the full output - per-pixel depth, confidence, sky, extrinsics/intrinsics, the 3D point cloud, and glb/COLMAP exports - not just a normalized PNG.
# once the backend + models are in the gallery:
local-ai run depth-anything-3-base# REST: full depth field + pose + points in one typed response
curl http://localhost:8080/v1/depth -H 'Content-Type: application/json' -d '{
"model": "depth-anything-3-base",
"src": "photo.jpg",
"include_depth": true, "include_pose": true, "include_points": true
}'Gallery entries cover base (q4_k/q8_0/f16/f32), small, large, giant, and mono-large. The LocalAI backend lives in the LocalAI repo under backend/go/depth-anything-cpp/.
A flat C ABI (include/da_capi.h, abi_version 4) over libdepthanything.so:
da_ctx* ctx = da_capi_load("model.gguf", /*threads*/ 8);
// nested metric model (two branches): da_capi_load_nested(anyview, metric, threads)
int h, w, is_metric;
float *depth, *conf, *sky, ext[12], intr[9];
da_capi_depth_dense(ctx, "photo.jpg", &h, &w, &depth, &conf, &sky, ext, intr, &is_metric);
// ... use depth[h*w], conf, pose ...
da_capi_free_floats(depth);
int n; float *xyz; unsigned char *rgb;
da_capi_points(ctx, "photo.jpg", /*conf_thresh*/ 1.0f, &n, &xyz, &rgb); // 3D cloud
da_capi_export_glb(ctx, "photo.jpg", "scene.glb");
da_capi_free(ctx);Opaque handles, C types only, da_capi_last_error for diagnostics. Build it with -DDA_SHARED=ON.
The port is verified numerically equal to the reference, per component, not just end-to-end. The ctest suite (-DDA_BUILD_TESTS=ON, 37 tests) gates the preprocessing, backbone, attention, DPT head, depth, pose, ray head, ray→pose solver, exporters, and the C API against PyTorch-dumped tensors. End-to-end depth correlates 1.0 with the real DA3 forward; the full verification and parity numbers are in docs/VERIFICATION.md.
cmake -B build -DDA_BUILD_TESTS=ON && cmake --build build -j && ctest --test-dir buildDepth Anything 3 is a great model, but running it for inference drags in a heavy Python/PyTorch/CUDA stack. This is a clean C++17/ggml port focused purely on inference:
- No Python at inference. A single
libdepthanything.sobehind a flat C API, easy to embed from C, C++, Go, or Rust. - Faster than PyTorch on CPU (1.20x f32, 1.31x q8_0), in half the memory, with ~6.7x faster load and bit-exact output.
- Small and portable. Self-contained GGUF with f16 / q8_0 / K-quants, on CPU and any ggml GPU backend.
- The whole model. Depth, confidence, pose, sky, ray-pose, 3D point cloud, and glb/COLMAP/PLY export, all parity-checked, across the full DA3 model family.
If you use depth-anything.cpp, please cite this repository and the original model:
@software{depth_anything_cpp,
title = {depth-anything.cpp: a C++/ggml inference engine for Depth Anything 3},
author = {Di Giacinto, Ettore and Palethorpe, Richard},
url = {https://github.com/mudler/depth-anything.cpp},
year = {2026}
}Depth Anything 3 is by ByteDance Seed (bytedance-seed/depth-anything-3).
Ettore Di Giacinto (@mudler).
depth-anything.cpp is released under the MIT License. The Depth Anything 3 model weights are governed by their original license (Apache-2.0) - check each model card on HuggingFace.
Built by the LocalAI team. If you want to run depth (and LLMs, vision, voice, image, and video models) locally on any hardware with an OpenAI-compatible API, give LocalAI a star.





