MODUS: Decoder-only Any-to-Any
Modeling of Diverse Modalities

ICML 2026

¹EPFL    ²Apple    ³University of Copenhagen    ⁴CUHK    ⁵University of Geneva    ⁶Lambda AI
* equal contribution    equal technical advising

TL;DR

MODUS unifies multimodal generation with one decoder, two experts, and zero task heads.

One decoder: one causal transformer trunk shared across every modality. No separate encoder + decoder, no modality-specific weights, no task pipelines.
Two experts: a 1D Expert handles discrete tokens via autoregressive next-token prediction. A 2D Expert handles continuous latents via flow matching. Both attend to the same causal context.
Zero task heads: two summed losses, cross-entropy for 1D and flow matching for 2D. No segmentation, depth, or detection heads. No per-task decoders. Every modality goes through the same trunk.

Any-to-any modeling aims to flexibly relate arbitrary modalities within a single system, a need that arises in multimodal learning and in scientific domains like ecology and astronomy. Existing approaches mostly train from scratch with encoder–decoder or diffusion architectures, which limits performance and forgoes pretrained models.

We investigate decoder-only any-to-any multimodal modeling: one decoder that treats every modality symmetrically, with no modality-specific heads, losses, or task pipelines. The resulting model, MODUS, naturally supports chained generation through intermediate modalities, cross-modal consistency verification, and analysis of visual representations by combining semantic and reconstruction features. Across a range of benchmarks, MODUS performs strongly out of the box and composes modalities flexibly in a single model.

Capabilities

Any-to-Any Generation

Where task-specific systems scale as O(n × n) in the number of modalities n, one system per input-output pair, MODUS scales linearly in the number of modalities. The grid below shows every input modality decoded into every other, all produced by the same model.


Figure 4. Every cell is generated by the same MODUS decoder: rows are the input modality, columns are the target. The 14 modalities cover captions, pixels (RGB), geometry (Depth, Normal), semantics (Segmentation), edges (Canny, SAM‑Edge), masks (SAM‑Seg), detections, and representation spaces (DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local).
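To make the scaling argument concrete, the sketch below counts the ordered input-to-target pairs among the 14 modalities named in the Figure 4 caption. It assumes the task-specific baseline needs one dedicated system per ordered pair; the short modality labels are illustrative names, not identifiers from MODUS.

```python
from itertools import permutations

# Illustrative labels for the 14 modalities named in the Figure 4 caption.
MODALITIES = [
    "caption", "rgb", "depth", "normal", "segmentation", "canny",
    "sam_edge", "sam_seg", "detection", "dinov2", "dinov2_local",
    "clip", "imagebind", "imagebind_local",
]

def directed_pairs(modalities):
    """All ordered (input, target) pairs with input != target."""
    return list(permutations(modalities, 2))

pairs = directed_pairs(MODALITIES)
print(len(MODALITIES), "modalities ->", len(pairs), "directed pairs")
# prints: 14 modalities -> 182 directed pairs
```

Under that assumption, adding a fifteenth modality would add 28 new ordered pairs to a pairwise setup, but only one new modality to a shared decoder.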

Visual Quality

RGB → Any

Interactive comparison: the RGB input on the left, the modality MODUS generates from it (for example Depth) on the right.

Visual Quality

Text → Any

Interactive viewer: a text prompt and the output MODUS generates from it, for example RGB.

Method

One decoder, two experts, a shared causal context

Animated walkthrough of one example sequence, end to end through MODUS.

MODUS adapts the pretrained BAGEL-7B mixture-of-transformers with two experts over a shared causal token sequence: a 1D Expert for discrete sequences (text, grounding boxes, DINOv2 tokens) trained with next-token prediction, and a 2D Expert for continuous spatial latents (RGB, depth, normals, segmentation, Canny edges) trained with flow matching on VAE + ViT features. Both experts attend to the same causal context, so a token produced by either expert conditions every token that follows.
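The two training objectives can be illustrated with a minimal pure-Python sketch. It is not the MODUS training code: the linear noise-to-data path, the convention that t = 0 is pure noise, the function names, and the `zero_velocity` stand-in for the 2D Expert are all assumptions made for illustration.

```python
import math
import random

def cross_entropy(logits, target):
    """Next-token loss: negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def flow_matching_loss(predict_velocity, latent, rng):
    """Single-sample flow-matching loss on an assumed linear noise-to-data path."""
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    t = rng.random()  # timestep in [0, 1); t = 0 is pure noise in this sketch
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, latent)]
    target = [x - n for n, x in zip(noise, latent)]  # velocity of the linear path
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(latent)

def total_loss(logits, token, predict_velocity, latent, rng):
    """The two objectives, summed with equal weight (weighting is an assumption)."""
    return cross_entropy(logits, token) + flow_matching_loss(predict_velocity, latent, rng)

rng = random.Random(0)
zero_velocity = lambda x_t, t: [0.0] * len(x_t)  # hypothetical stand-in for the 2D Expert
loss = total_loss([2.0, 0.5, -1.0], 0, zero_velocity, [0.3, -0.7, 1.2, 0.0], rng)
print(f"total loss: {loss:.3f}")
```

In a real model both terms would be computed from the same forward pass over the shared sequence, with discrete positions contributing to the first term and continuous positions to the second.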

Blog post · 7 steps · ~3 min read
Read the walkthrough: how MODUS works, step by step
A short read-along blog post covering tokenizers, the unified sequence, the two experts, training (AR + ℒ_FM), and inference, all in plain visuals.

Capabilities

Chained Generation Through Intermediates

Because every output joins the shared causal context, MODUS can pipe its own predictions back in as conditioning. The model solves a task by routing through an intermediate modality, with no retraining or architectural change.

Direct vs chained generation
Figure 7. Top row: direct generation. Bottom: chained generation through an intermediate modality.
Pipeline                 Intermediate   NYUv2 Normal ↓
RGB → Normal             (none)         20.02
RGB → Depth → Normal     geometry       20.06
RGB → DINO → Normal      semantics      20.71
RGB → Canny → Normal     layout         19.87

Table 2. Of the three intermediates, only the Canny edge map improves on direct generation (19.87 vs 20.02). Edge maps provide pixel-aligned low-level geometry that complements surface-orientation estimation.
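The chaining mechanism can be sketched as a loop that appends each generated modality to the context before generating the next. The `chain` helper, the `generate(context, target)` signature, and the toy decoder are hypothetical, introduced only to show the control flow.

```python
def chain(generate, source, path):
    """Generate each modality in `path` in order, feeding every output back as context."""
    context = [source]                       # stands in for the shared causal context
    for target in path:
        output = generate(context, target)   # conditions on everything generated so far
        context.append(output)
    return context

# Toy stand-in for the decoder: it only records which modalities it was conditioned on.
def toy_generate(context, target):
    seen = "+".join(name for name, _ in context)
    return (target, f"{target} given {seen}")

direct = chain(toy_generate, ("rgb", "<image>"), ["normal"])
chained = chain(toy_generate, ("rgb", "<image>"), ["canny", "normal"])
print(direct[-1][1])    # normal given rgb
print(chained[-1][1])   # normal given rgb+canny
```

The same loop covers every pipeline in Table 2 by changing `path`, which is the sense in which chaining needs no retraining or architectural change.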

Capabilities

Cross-modal Self-Verification

With a shared decoder, MODUS scores its own outputs. For text-to-image generation, we sample four candidates and pick the one whose grounding boxes or VQA answer best agree with the prompt, without any external verifier or separate reward model.

Self-verification candidates with confidence scores
Figure 6. For each prompt, MODUS samples several candidate generations and scores them with an auxiliary grounding or VQA pass produced by the same decoder. The most consistent candidate is kept.
Verifier                        GenEval ↑
no verifier (baseline)          0.81
Object Grounding (best-of-4)    0.82
VQA (best-of-4)                 0.83

Table 3. Both an auxiliary grounding pass and an auxiliary VQA pass improve image generation. Verifier and generator share weights.

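# MODUS self-verification
# Text2RGB, RGB2Grounding and RGB2VQA are passes of the same decoder;
# agree scores how well the grounding boxes and VQA answer match the prompt.
def self_verify(prompt, n=4):
    candidates = Text2RGB(prompt, n=n)
    scores = []
    for img in candidates:
        bbox = RGB2Grounding(img, prompt)   # same decoder
        answ = RGB2VQA(img, prompt)         # same decoder
        scores.append(agree(bbox, answ, prompt))
    best = max(range(len(scores)), key=scores.__getitem__)   # argmax
    return candidates[best]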

Capabilities

Visual Representation Composition

Each 2D modality is encoded by both a ViT (semantic) and a VAE (reconstruction). Ablating each side reveals a clean dissociation. ViT-only preserves identity but distorts geometry. VAE-only is locally consistent but semantically brittle. Combined, they recover both.

Interactive viewer: predictions from ViT only, VAE only, or ViT + VAE features, overlaid on the RGB input (shown here: ViT only, depth).

ViT only keeps the room's overall identity but warps the geometry of the dark monitor.

Features     NYUv2 Depth ↓   NYUv2 Normal ↓
ViT only     15.1            35.30
VAE only     6.9             19.96
ViT + VAE    6.5             19.92

Table 4. Quantitative ablation. ViT + VAE wins on both depth and normal estimation.

Appendix · cross-model comparison: GPT-4o hallucinates the same way as ViT-only

The hallucination we see with ViT only is not unique to MODUS. Sampling the same input from GPT-4o produces structurally similar hallucinations across samples. Adding VAE features on top of ViT pins the prediction to a single consistent geometry. The shared pattern is suggestive evidence that GPT-4o relies on a similar higher-level feature conditioning that, on its own, does not constrain low-level geometry.

ViT-only and GPT-4o both hallucinate; ViT+VAE pins it down
Figure 15 (appendix). Each scene shows depth (top) and surface normal (bottom). MODUS ViT-only: 5 independent samples. GPT-4o: 2 samples. MODUS ViT + VAE: deterministic output.

Training

Early timesteps choose the modality

In a multi-modality decoder, every modality shares the same noisy source distribution. The model must decide “which modality am I generating?” at the highest noise levels. Logit-normal sampling, which works well for unimodal text-to-image, undersamples exactly those steps, causing depth requests to collapse into normals or RGB. Uniform timestep sampling fixes this without sacrificing image quality.

Logit-normal vs uniform timestep sampling schematic
Figure 3. Logit-normal undersamples the early timesteps where modality identity is decided. Uniform sampling stabilises it.
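The undersampling can be quantified with a short sketch that compares how much probability each scheme puts on the extreme timesteps. The logit-normal parameters (mean 0, standard deviation 1) and the 0.05 threshold are assumptions chosen for illustration, not values taken from the MODUS training setup. With mean 0 the logit-normal density is symmetric around 0.5, so the comparison holds whichever end of [0, 1] is treated as the high-noise end.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logit(p):
    return math.log(p / (1.0 - p))

def tail_mass_logit_normal(eps, mean=0.0, std=1.0):
    """P(t < eps) when t = sigmoid(z) and z ~ Normal(mean, std)."""
    return normal_cdf((logit(eps) - mean) / std)

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw one timestep by squashing a Gaussian sample through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))

eps = 0.05
rng = random.Random(0)
n = 100_000
empirical = sum(sample_logit_normal(rng) < eps for _ in range(n)) / n

print(f"uniform      P(t < {eps}) = {eps:.4f}")
print(f"logit-normal P(t < {eps}) = {tail_mass_logit_normal(eps):.4f} (exact)")
print(f"logit-normal P(t < {eps}) = {empirical:.4f} (from {n} samples)")
```

Under these assumed parameters the logit-normal scheme places well under 1% of its samples below 0.05, against 5% for uniform sampling, so the steps where modality identity is decided receive far fewer training updates.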
Appendix · few-step generation: even one denoising step commits to the right modality

With uniform timestep sampling, MODUS commits to the correct target modality even at a single denoising step. Logit-normal sampling, by contrast, shows modality confusion at low step counts.

1/2/3/5/10/20/50-step generation comparison (Uniform vs Logit-Normal)
Figure 9 (appendix). Per-scene generations at 1, 2, 3, 5, 10, 20, 50 denoising steps.

Results

Zero-shot Benchmarks

MODUS extends decoder-only models from image–text settings to diverse modalities and is evaluated zero-shot. It matches or surpasses multitask baselines on the tasks they support, while also covering tasks they cannot solve at all.

Type        Model          MMMU    GenEval   DIODE Depth   NYUv2 Normal   RefCOCO val   IN-1k T1/T5
Enc-Dec     4M-21          –       0.37      0.331         37.28          –             78.3 / 92.4
Enc-Dec     Unified-IO 2   –       –         0.369         28.55          –             –
Diffusion   OneDiffusion   –       0.65      0.399         –              –             –
Decoder     BAGEL          53.2    0.86      –             –              –             –
Decoder     Kosmos-2       –       –         –             –              52.3          –
Decoder     Janus-Pro      41.0    0.80      –             –              –             –
Decoder     GPT-4o         69.1    0.84      –             –              –             –
Ours        MODUS          51.1    0.81      0.285         19.92          54.5          77.9 / 92.5

Table 1. Zero-shot results, with each column normalised to the best score per task. A missing score means the model does not support the task or the score was not reported. Some baseline scores were reproduced by the authors.

Radar chart: each axis is one task (MMMU, GenEval, DIODE Depth, NYUv2 Normal, RefCOCO val, IN-1k T1), normalised to the best score across all listed models, so 100% is the column best in Table 1. MODUS covers all six axes, touching the rim on 3 of them and competing on the others, while every baseline is restricted to a subset of tasks. BAGEL, for example, covers only MMMU and GenEval.
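The per-task normalisation can be sketched as follows, using two columns of Table 1. The treatment of lower-is-better metrics (best divided by score) is an assumption, since the page does not state how such axes are scaled; the `normalise` helper is likewise hypothetical.

```python
# Scores copied from Table 1 for two tasks.
GENEVAL = {"4M-21": 0.37, "OneDiffusion": 0.65, "BAGEL": 0.86,
           "Janus-Pro": 0.80, "GPT-4o": 0.84, "MODUS": 0.81}        # higher is better
NYUV2_NORMAL = {"4M-21": 37.28, "Unified-IO 2": 28.55, "MODUS": 19.92}  # lower is better

def normalise(scores, lower_is_better=False):
    """Map each score to a fraction of the column best (1.0 = column best)."""
    if lower_is_better:
        best = min(scores.values())
        return {model: best / score for model, score in scores.items()}
    best = max(scores.values())
    return {model: score / best for model, score in scores.items()}

geneval = normalise(GENEVAL)
normal = normalise(NYUV2_NORMAL, lower_is_better=True)
for model in ("BAGEL", "MODUS"):
    print(f"GenEval      {model}: {geneval[model]:.1%}")
for model in ("4M-21", "MODUS"):
    print(f"NYUv2 Normal {model}: {normal[model]:.1%}")
```

With this convention MODUS sits at 100% on NYUv2 Normal, where it has the column-best score, and slightly below the rim on GenEval, where BAGEL is best.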

Dataset

SPECTRUM-25M

We construct SPECTRUM-25M by extending the BLIP-3o image–caption corpus with per-image pseudo-labels for surface normals, monocular depth, segmentation, and Canny edges (via DepthAnything, Marigold, and Grounded-SAM), plus DINOv2 global features as a representational modality. Because every image carries all of these labels, the dataset supports modality transformations that are difficult to study with conventional datasets, such as transforming depth into Canny edges, as well as multi-step chained generation. The full dataset will be released.

SPECTRUM-25M pseudo-label visualisation
Figure 16 (appendix). Pseudo-label visualisations for surface normals, depth, and Canny edges.

BibTeX

@article{ye2026modus,
  title   = {MODUS: Decoder-only Any-to-Any Modeling of Diverse Modalities},
  author  = {Ye, Mingqiao and An, Zhaochong and Gao, Zhitong and Liu, Xian
             and Kar, O\u{g}uzhan Fatih and Allardice, Jesse and Bachmann, Roman
             and Mizrahi, David and Fleuret, Fran\c{c}ois and Li, Chuan
             and Zadeh, Amir and Belongie, Serge and Dehghan, Afshin
             and Zamir, Amir},
  journal = {arXiv preprint},
  year    = {2026},
}