MODUS: Decoder-only Any-to-Any
ICML 2026
MODUS unifies multimodal generation with one decoder, two experts, and zero task heads.
One decoder: one causal transformer trunk shared across every modality. No separate encoder + decoder, no modality-specific weights, no task pipelines.
Two experts: a 1D Expert handles discrete tokens via autoregressive next-token prediction. A 2D Expert handles continuous latents via flow matching. Both attend to the same causal context.
Zero task heads: two losses, summed: cross-entropy for 1D and flow matching for 2D. No segmentation, depth, or detection heads. No per-task decoders. Every modality goes through the same trunk.
Any-to-any modeling aims to flexibly relate arbitrary modalities within a single system, a need that arises in multimodal learning and in scientific domains like ecology and astronomy. Existing approaches mostly train from scratch with encoder–decoder or diffusion architectures, which limits performance and forgoes pretrained models.
We investigate decoder-only any-to-any multimodal modeling: one decoder that treats every modality symmetrically, with no modality-specific heads, losses, or task pipelines. The resulting model, MODUS, naturally supports chained generation through intermediate modalities, cross-modal consistency verification, and analysis of visual representations by combining semantic and reconstruction features. Across a range of benchmarks, MODUS performs strongly out of the box and composes modalities flexibly in a single model.
Capabilities
Where task-specific systems scale O(n × n), MODUS scales linearly in the number of modalities. The grid below shows every input modality decoded into every other, all produced by the same model.
Figure 4. Every cell is generated by the same MODUS decoder: rows are the input modality, columns are the target. The 14 modalities cover captions, pixels (RGB), geometry (Depth, Normal), semantics (Segmentation), edges (Canny, SAM‑Edge), masks (SAM‑Seg), detections, and representation spaces (DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local).
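The scaling claim can be checked with a line of arithmetic. The sketch below assumes that "task-specific systems" means one dedicated model per ordered (input, target) pair of distinct modalities, which is a reading of the text rather than a definition it gives; the function names are illustrative.

```python
def pairwise_systems(n_modalities: int) -> int:
    """Number of directed (input -> target) pairs with distinct endpoints."""
    return n_modalities * (n_modalities - 1)


def shared_decoder_interfaces(n_modalities: int) -> int:
    """One input/output interface per modality, all feeding a single decoder."""
    return n_modalities


if __name__ == "__main__":
    n = 14  # number of modalities listed in the figure caption
    print(pairwise_systems(n), shared_decoder_interfaces(n))
```

Under this reading, the 14 modalities in the grid correspond to 182 directed translation tasks, all served by one model.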
Visual Quality
Side-by-side comparison of the RGB input (left) and the modality MODUS generates from it (right), across target modalities and samples.
Method
Animated walkthrough of one example sequence, end to end through MODUS.
MODUS adapts the pretrained BAGEL-7B mixture-of-transformers with two experts over a shared causal token sequence: a 1D Expert for discrete sequences (text, grounding boxes, DINOv2 tokens) trained with next-token prediction, and a 2D Expert for continuous spatial latents (RGB, depth, normals, segmentation, Canny edges) trained with flow matching on VAE + ViT features. Both experts attend to the same causal context, so a token produced by either expert conditions every token that follows.
The walkthrough covers the training objective (ℒ_AR + ℒ_FM, the summed next-token and flow-matching losses) and inference.
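As a rough illustration of the summed objective, the sketch below computes a next-token cross-entropy term and a flow-matching term and adds them. It is not the authors' implementation: the linear interpolation path, its direction (t = 1 is pure noise), and all function names are assumptions made for the example, since the page only states that cross-entropy and flow matching are summed.

```python
import math
import random


def cross_entropy(logits, target_index):
    """Next-token loss for one position: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def flow_matching_loss(predict_velocity, data, noise, t):
    """MSE between predicted and target velocity on a linear path.

    Assumption: x_t = (1 - t) * data + t * noise, so the target velocity
    is noise - data. The page does not specify the path or its direction.
    """
    x_t = [(1 - t) * d + t * n for d, n in zip(data, noise)]
    target = [n - d for d, n in zip(data, noise)]
    pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def total_loss(ar_loss, fm_loss):
    """The two losses are summed, as stated on the page."""
    return ar_loss + fm_loss


if __name__ == "__main__":
    random.seed(0)
    data = [0.5, -1.0, 2.0]
    noise = [random.gauss(0, 1) for _ in data]

    # An oracle that returns the exact target velocity gives zero FM loss.
    def oracle(x_t, t):
        return [n - d for d, n in zip(data, noise)]

    fm = flow_matching_loss(oracle, data, noise, t=0.3)
    ar = cross_entropy([2.0, 0.5, -1.0], target_index=0)
    print(total_loss(ar, fm))
```

In a real model the 1D Expert would supply the logits and the 2D Expert the velocity prediction; here both are stand-ins so the arithmetic can be run in isolation.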
Capabilities
Because every output joins the shared causal context, MODUS can pipe its own predictions back in as conditioning. The model solves a task by routing through an intermediate modality, with no retraining or architectural change.
| Pipeline | Intermediate | NYUv2 Normal ↓ |
|---|---|---|
| RGB → Normal | — | 20.02 |
| RGB → Depth → Normal | geometry | 20.06 |
| RGB → DINO → Normal | semantics | 20.71 |
| RGB → Canny → Normal | layout | 19.87 |
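Table 2. Routing through Canny edges is the only chain that improves on the direct RGB → Normal pipeline (19.87 vs. 20.02); the depth and DINO intermediates score slightly worse. Edge maps provide pixel-aligned low-level geometry that complements surface-orientation estimation.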
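The chaining idea can be sketched as a loop that appends each generated modality to the shared context before producing the next one. The `generate(context, target)` interface below is hypothetical and stands in for a call to the shared decoder; the stub in the usage example only records what it was conditioned on.

```python
from typing import Callable, Dict, List

# Hypothetical interface: generate(context, target) returns the target modality
# conditioned on everything already present in the shared context.
Generate = Callable[[Dict[str, object], str], object]


def chained_generation(generate: Generate,
                       inputs: Dict[str, object],
                       route: List[str]) -> Dict[str, object]:
    """Generate each modality in `route` in order, feeding outputs back as context."""
    context = dict(inputs)  # copy so the caller's inputs are not mutated
    for target in route:
        context[target] = generate(context, target)
    return context


if __name__ == "__main__":
    # Stub generator that only records which modalities it was conditioned on.
    def stub_generate(context, target):
        return f"{target}<-{'+'.join(context)}"

    out = chained_generation(stub_generate, {"rgb": "image"}, ["canny", "normal"])
    print(out["normal"])  # normal<-rgb+canny
```

Because the intermediate joins the context rather than replacing it, the final prediction is conditioned on both the original RGB input and the Canny map.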
Capabilities
With a shared decoder, MODUS scores its own outputs. For text-to-image generation, we sample four candidates and pick the one whose grounding boxes or VQA answer best agree with the prompt, without any external verifier or separate reward model.
| Verifier | GenEval ↑ |
|---|---|
| no verifier (baseline) | 0.81 |
| Object Grounding (best-of-4) | 0.82 |
| VQA (best-of-4) | 0.83 |
Table 3. Both an auxiliary grounding pass and an auxiliary VQA pass improve image generation. Verifier and generator share weights.
# MODUS self-verification
candidates ← Text2RGB(prompt, n=4)
scores ← []
for img in candidates:
    bbox ← RGB2Grounding(img, prompt)   # same decoder
    answ ← RGB2VQA(img, prompt)         # same decoder
    scores.append(agree(bbox, answ, prompt))
return candidates[argmax(scores)]
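The selection loop in the pseudocode above can be written as a small runnable function. This is a minimal sketch: `sample` and `score` are hypothetical callables standing in for the text-to-image pass and the grounding or VQA agreement check, and the stubs in the usage example replace images with strings.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def best_of_n(prompt: str,
              sample: Callable[[str, int], Sequence[Image]],
              score: Callable[[Image, str], float],
              n: int = 4) -> Image:
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = sample(prompt, n)
    scores: List[float] = [score(img, prompt) for img in candidates]
    # On ties, max() keeps the earliest candidate.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]


if __name__ == "__main__":
    # Stubs: "images" are strings, and the score counts prompt words they contain.
    def stub_sample(prompt, n):
        return ["a red cube", "a blue ball", "a red ball", "a cube"][:n]

    def stub_score(img, prompt):
        return sum(word in img.split() for word in prompt.split())

    print(best_of_n("red ball", stub_sample, stub_score))  # a red ball
```

In MODUS the generator and the verifier would be the same decoder queried for different target modalities; passing them as separate callables here only keeps the sketch self-contained.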
Capabilities
Each 2D modality is encoded by both a ViT (semantic) and a VAE (reconstruction). Ablating each side reveals a clean dissociation. ViT-only preserves identity but distorts geometry. VAE-only is locally consistent but semantically brittle. Combined, they recover both.

ViT only keeps the room's overall identity but warps the geometry of the dark monitor.
| Features | NYUv2 Depth ↓ | NYUv2 Normal ↓ |
|---|---|---|
| ViT only | 15.1 | 35.30 |
| VAE only | 6.9 | 19.96 |
| ViT + VAE | 6.5 | 19.92 |
Table 4. Quantitative ablation. ViT + VAE wins on both depth and normal estimation.
The hallucination we see with ViT only is not unique to MODUS. Sampling the same input from GPT-4o produces structurally similar hallucinations across samples. Adding VAE features on top of ViT pins the prediction to a single consistent geometry. The shared pattern is suggestive evidence that GPT-4o relies on a similar higher-level feature conditioning that, on its own, does not constrain low-level geometry.
Training
In a multi-modality decoder, every modality shares the same noisy source distribution. The model must decide “which modality am I generating?” at the highest noise levels. Logit-normal sampling, which works well for unimodal text-to-image, undersamples exactly those steps, causing depth requests to collapse into normals or RGB. Uniform timestep sampling fixes this without sacrificing image quality.
With uniform timestep sampling, MODUS commits to the correct target modality even at a single denoising step. Logit-normal sampling, by contrast, shows modality confusion at low step counts.
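To make the undersampling concrete, the sketch below compares how often each sampler draws a timestep in the highest-noise region. It rests on assumptions not stated in the text: timesteps lie in [0, 1] with t near 1 meaning mostly noise, the logit-normal uses mean 0 and standard deviation 1, and the 0.95 threshold is an arbitrary choice for illustration.

```python
import math
import random


def sample_uniform(rng: random.Random) -> float:
    """Timestep drawn uniformly from [0, 1)."""
    return rng.random()


def sample_logit_normal(rng: random.Random, mean: float = 0.0, std: float = 1.0) -> float:
    """Timestep t = sigmoid(z) with z ~ Normal(mean, std)."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def high_noise_fraction(sampler, rng: random.Random,
                        n: int = 100_000, threshold: float = 0.95) -> float:
    """Fraction of sampled timesteps above `threshold` (assumed to be the noisiest steps)."""
    return sum(sampler(rng) > threshold for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    print("uniform:     ", high_noise_fraction(sample_uniform, rng))
    print("logit-normal:", high_noise_fraction(sample_logit_normal, rng))
```

Uniform sampling places about 5% of draws above 0.95 by construction, while the logit-normal with these parameters places well under 1% there, so the steps at which the target modality must be decided receive far fewer training updates.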
Results
MODUS extends decoder-only models from image–text settings to diverse modalities and is evaluated zero-shot. It matches or surpasses multitask baselines on the tasks they support, while also covering tasks they cannot solve at all.
Table 1. Each cell shows a score bar normalised to the best score per task, with the column best highlighted. A separate marker flags tasks a model does not support. — = score not reported. † = reproduced by us.
Each axis is one task, normalised to the best score across all listed models (so 100% = column-best in Table 1). The shaded polygon is MODUS: it touches the rim on 3 of 6 axes and competes on the others, while every baseline is restricted to a subset of tasks.
Dataset
We construct SPECTRUM-25M by extending the BLIP-3o image–caption corpus with per-image pseudo-labels for surface normals, monocular depth, segmentation, and Canny edges (via DepthAnything, Marigold, and Grounded-SAM), plus DINOv2 global features as a representational modality. This alignment supports modality transformations that are difficult to study with conventional datasets, such as transforming depth into Canny edges, as well as multi-step chained generation. The full dataset will be released.
@article{ye2026modus,
title = {MODUS: Decoder-only Any-to-Any Modeling of Diverse Modalities},
author = {Ye, Mingqiao and An, Zhaochong and Gao, Zhitong and Liu, Xian
and Kar, O\u{g}uzhan Fatih and Allardice, Jesse and Bachmann, Roman
and Mizrahi, David and Fleuret, Fran\c{c}ois and Li, Chuan
and Zadeh, Amir and Belongie, Serge and Dehghan, Afshin
and Zamir, Amir},
journal = {arXiv preprint},
year = {2026},
}