Can multimodal AI models make exact visual edits?
A foundational benchmark for precise single-answer visual editing.
Deterministically evaluated without bias-prone judge models.
Infinite problem variety via procedural generation.
The highest-performing model today reaches only 17.1%.
Each PaintBench problem gives the model an input image and a precise instruction. There is exactly one ground-truth answer. Evaluation compares each pixel of the model output to the ground-truth answer and the input image using the CIE ΔE76 perceptual color distance, with no human rating and no judge model.
Green pixels are correctly edited and blue pixels are correctly preserved, both within the tolerance. Red pixels are incorrectly edited and orange pixels are incorrectly preserved.
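The per-pixel comparison can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: ΔE76 is the Euclidean distance between two colors in CIELAB, computed here with the standard sRGB (D65) conversion. The `TOLERANCE` value, the function names, and the exact rule that maps a pixel to one of the four outcomes are assumptions made for this example.

```python
import math

# D65 reference white (2-degree observer), standard CIE values.
XN, YN, ZN = 0.95047, 1.0, 1.08883


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB under D65."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / XN), f(y / YN), f(z / ZN)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(rgb_a, rgb_b):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(srgb_to_lab(rgb_a), srgb_to_lab(rgb_b))


TOLERANCE = 2.0  # hypothetical threshold, not taken from the source


def classify_pixel(inp, truth, out, tol=TOLERANCE):
    """Assign one pixel to one of four outcomes (assumed mapping)."""
    should_edit = delta_e76(inp, truth) > tol
    if delta_e76(out, truth) <= tol:
        return "correctly_edited" if should_edit else "correctly_preserved"
    # The output is wrong: either the pixel was left as in the input
    # although it should have changed, or it was changed to a wrong color.
    if should_edit and delta_e76(out, inp) <= tol:
        return "incorrectly_preserved"
    return "incorrectly_edited"
```

For example, `classify_pixel((255, 255, 255), (255, 0, 0), (255, 255, 255))` returns `"incorrectly_preserved"` under this mapping, because the pixel should have turned red but was left white.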
The highest-performing model reaches only 17.1% mean IoU overall. Geometric transformations, most structural manipulations, and formula-based color changes are consistently hard.
The leaderboard lists each model's overall score and its score on 20 tasks in four categories:

| Category | Tasks |
|---|---|
| Geometric Transformation | Translation, Rotation, Reflection, Scaling, Shearing |
| Structural Manipulation | Construction, Removal, Copying, Border, Cropping |
| Color Change | Recolor, Flood Fill, Blending, Gradient, Point Ops |
| Symbolic Reasoning | Comparison, Ordering, Pattern, Counting, Legend |
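The page does not spell out how the mean IoU is computed, so the following is only one plausible reading, sketched under stated assumptions. It treats correctly edited pixels as the intersection, and the union as all pixels that either should have been edited or were edited. The label strings, the function names, and the handling of the empty-union case are hypothetical.

```python
from collections import Counter


def edit_iou(labels):
    """IoU for one problem from per-pixel outcome labels (assumed definition)."""
    counts = Counter(labels)
    intersection = counts["correctly_edited"]
    union = (
        intersection
        + counts["incorrectly_edited"]
        + counts["incorrectly_preserved"]
    )
    if union == 0:
        # Assumption: nothing needed editing and nothing was changed.
        return 1.0
    return intersection / union


def mean_iou(per_problem_labels):
    """Average the per-problem IoU over a set of problems."""
    scores = [edit_iou(labels) for labels in per_problem_labels]
    return sum(scores) / len(scores)
```

Under this definition, correctly preserved pixels do not raise the score, so a model cannot do well simply by returning the input image unchanged.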
Detailed diagnostics reveal both universal failure patterns and model-specific behaviors. Performance is brittle under several scene variations: object count, background complexity, color scheme, and edit-region size.
Do PaintBench scores generalize to applied tasks? To find out, we create TinyGrafixBench, a data-visualization editing eval built on the same approach: seeded procedural generation and pixel-level evaluation. Model scores strongly correlate with those on PaintBench, suggesting generalization to applied precise editing tasks.
Mean IoU scores displayed. Scores strongly correlate with those on PaintBench (linear regression R² = 0.91, p < 0.001).
| Model | Overall | Bar Chart | Scatter | Line Chart | Heatmap | Network |
|---|---|---|---|---|---|---|
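The R² of a simple linear regression between two sets of per-model scores can be computed as in the sketch below. For a one-variable least-squares fit, R² equals the squared Pearson correlation. The function name is hypothetical, and the numbers in the usage note are placeholders chosen for easy arithmetic, not results from the source.

```python
def r_squared(xs, ys):
    """R^2 of a one-variable least-squares fit (squared Pearson correlation)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must not be constant")
    return (sxy * sxy) / (sxx * syy)
```

With the placeholder inputs `r_squared([1, 2, 3, 4], [1, 3, 2, 4])`, the result is approximately 0.64. The sketch does not compute a p-value.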
* Equal contribution · New York University
@article{paintbench2026,
title={PaintBench: Deterministic Evaluation of Precise Visual Editing},
author={Kai Xu and Ellis Brown and Shrikar Madhu and Rob Fergus and He He and Saining Xie},
year={2026},
eprint={2606.00188},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2606.00188},
}