PaintBench

Can multimodal AI models make exact visual edits?

A foundational benchmark for precise single-answer visual editing.
Deterministically evaluated without bias-prone judge models.
Infinite problem variety via procedural generation.
The highest-performing model today reaches only 17.1%.

Kai Xu* · Ellis Brown* · Shrikar Madhu · Rob Fergus · He He · Saining Xie

* Equal contribution · New York University

PaintBench benchmark illustration: geometric shapes with measurement annotations and pixel-level evaluation grid

20 foundational editing operations
∞ seed-generated problem variety
17.1% highest-performing model score (mIoU)

01 / How a Problem Works

One correct answer per problem.
Evaluated at the pixel.

Each PaintBench problem gives the model an input image and a precise instruction. There is exactly one ground-truth answer. Evaluation compares each pixel of the model output against both the ground-truth answer and the input image using the CIE ΔE₇₆ perceptual color distance, with no human rating and no judge model.
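The ΔE₇₆ distance is the Euclidean distance between two colors in CIELAB space. The sketch below shows one way to compute it for a pair of pixels, assuming 8-bit sRGB inputs and a D65 white point. The function names are illustrative and are not taken from the PaintBench code.

```python
import math

# D65 reference white (2-degree observer), the usual pairing with sRGB.
# Assumption: PaintBench may use a different white point or conversion.
_WHITE = (0.95047, 1.0, 1.08883)


def _srgb_to_linear(c: float) -> float:
    """Undo the sRGB transfer curve for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t: float) -> float:
    """Piecewise cube-root function used in the XYZ -> CIELAB conversion."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (L*, a*, b*)."""
    r, g, b = (_srgb_to_linear(v / 255.0) for v in rgb)
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b
    fx, fy, fz = (_lab_f(v / w) for v, w in zip((x, y, z), _WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(rgb1, rgb2) -> float:
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(srgb_to_lab(rgb1), srgb_to_lab(rgb2))
```

Identical colors give a distance of 0, and black versus white gives approximately 100, which is the full range of the L* axis.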

Instruction

Draw a filled olive-colored (#717A1E) circle centered at (66.1%, 56.7%) with a radius of 21.5% image width. Place it underneath any existing shapes. Clip any parts that may extend beyond the image boundary.
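An instruction like this one can be read as a small rasterization routine. The sketch below is a hypothetical reference renderer, not the PaintBench generator. It assumes percentages are relative to image width (x, radius) and height (y), that pixels are sampled at their centers with no anti-aliasing, and that "underneath" means only background pixels are repainted.

```python
def hex_to_rgb(code: str):
    """Parse '#RRGGBB' into an (r, g, b) tuple of ints."""
    return tuple(bytes.fromhex(code.lstrip("#")))


def draw_circle_underneath(img, background, color, cx_pct, cy_pct, r_pct):
    """Paint a filled circle below existing shapes.

    img is a list of rows of (r, g, b) tuples. Any pixel that is not the
    background color is treated as an existing shape and left untouched
    (an assumption made for this sketch). Clipping at the image boundary
    is implicit because only in-bounds pixels are visited.
    """
    h, w = len(img), len(img[0])
    cx, cy = cx_pct / 100 * w, cy_pct / 100 * h
    r = r_pct / 100 * w  # radius is given as a fraction of image width
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            # Test the pixel center against the circle equation.
            inside = (x + 0.5 - cx) ** 2 + (y + 0.5 - cy) ** 2 <= r * r
            if inside and img[y][x] == background:
                out[y][x] = color
    return out
```

For the instruction above, the call would be draw_circle_underneath(img, background, hex_to_rgb("#717A1E"), 66.1, 56.7, 21.5), where img and background are supplied by the caller.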

Panels: Input · Ground truth · Model output (Nano Banana 2) · Error map, with a tolerance slider (ΔE threshold, from strict to lenient). Reported at the selected tolerance: edit accuracy, preservation accuracy, and IoU@tolerance.

Green and blue pixels are correctly edited and correctly preserved, respectively, within the tolerance. Red and orange pixels are incorrectly edited and incorrectly preserved, respectively.
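The four error-map groups can be sketched as a per-pixel classification. This page does not spell out the metric formulas, so everything below is one plausible reading rather than the PaintBench definition: a pixel "should be edited" when the ground truth differs from the input, it is "correct" when the output is within the tolerance of the ground truth, and the IoU treats correctly edited pixels as the intersection. The function names are hypothetical, and plain RGB Euclidean distance stands in for ΔE₇₆.

```python
import math


def classify_pixels(inp, gt, out, tol, dist=math.dist):
    """Count pixels in the four error-map groups.

    inp, gt, out are equal-length flat lists of (r, g, b) tuples.
    dist is any color distance; the default RGB Euclidean distance is a
    stand-in for CIE ΔE76 in this sketch.
    """
    counts = {"edit_ok": 0, "edit_bad": 0, "keep_ok": 0, "keep_bad": 0}
    for p_in, p_gt, p_out in zip(inp, gt, out):
        should_edit = p_gt != p_in           # ground truth changes this pixel
        correct = dist(p_out, p_gt) <= tol   # output matches within tolerance
        key = ("edit" if should_edit else "keep") + ("_ok" if correct else "_bad")
        counts[key] += 1
    return counts


def summarize(counts):
    """Turn group counts into the three reported numbers (assumed formulas)."""
    edit_total = counts["edit_ok"] + counts["edit_bad"]
    keep_total = counts["keep_ok"] + counts["keep_bad"]
    # Assumed union: pixels that should change plus pixels wrongly changed.
    union = edit_total + counts["keep_bad"]
    return {
        "edit_accuracy": counts["edit_ok"] / edit_total if edit_total else 1.0,
        "preservation_accuracy": counts["keep_ok"] / keep_total if keep_total else 1.0,
        "iou": counts["edit_ok"] / union if union else 1.0,
    }
```

Raising the tolerance can only move pixels from the incorrect groups to the correct ones, which matches the strict-to-lenient behavior of the slider.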

02 / The Benchmark

20 tasks across 4 categories.
Infinitely scalable.

Procedurally Generated

Each problem is generated from a unique seed and configurable scene/difficulty parameters.
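Seed-based generation means the same seed and parameters always reproduce the same problem. The sketch below illustrates that property with a hypothetical generator. The shape vocabulary, palette, parameter names, and value ranges are invented for illustration and do not describe the actual PaintBench generator.

```python
import random

SHAPES = ["circle", "heart", "hexagon"]      # hypothetical shape vocabulary
COLORS = ["#717A1E", "#FFD700", "#808080"]   # hypothetical palette


def generate_problem(seed: int, num_shapes: int = 3) -> dict:
    """Build a scene description that depends only on (seed, num_shapes)."""
    rng = random.Random(seed)  # private RNG, independent of global state
    shapes = []
    for _ in range(num_shapes):
        shapes.append({
            "kind": rng.choice(SHAPES),
            "color": rng.choice(COLORS),
            "center_pct": (round(rng.uniform(0, 100), 1),
                           round(rng.uniform(0, 100), 1)),
            "radius_pct": round(rng.uniform(5, 25), 1),
        })
    return {"seed": seed, "shapes": shapes}
```

Calling generate_problem(7) twice returns equal dictionaries, while num_shapes plays the role of a difficulty parameter that scales scene complexity.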

One Correct Answer

Every output is evaluated pixel-by-pixel via CIE ΔE₇₆ color distance. No human ratings, no bias-prone judge models, no perceptual proxies.

Atomic Operations

Tasks target fundamental building-block operations found in many precise real-world visual editing tasks.

03 / Results

The benchmark is far from solved.

The highest-performing model reaches only 17.1% (mean IoU) overall. Geometric transformation, most structural manipulation, and formula-based color changes are consistently hard.

Leaderboard columns: Model, Overall, and one score per task, grouped by category.
Geometric Transformation: Translation, Rotation, Reflection, Scaling, Shearing
Structural Manipulation: Construction, Removal, Copying, Border, Cropping
Color Change: Recolor, Flood Fill, Blending, Gradient, Point Ops
Symbolic Reasoning: Comparison, Ordering, Pattern, Counting, Legend

04 / Room for Model Improvement

Model failure patterns revealed.

Detailed diagnostics reveal both universal failure patterns and model-specific behaviors. We find that performance is brittle under several scene variations: object count, background complexity, color scheme, and edit-region size.

Input · Ground Truth · Model Output

Execution Omission

Qwen-Image-Edit · Cropping

Crop to the interior of the outlined region. Scale to fill the canvas using nearest-neighbor interpolation.
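Nearest-neighbor scaling copies one source pixel per output pixel and never blends colors, which is why a crop-and-scale edit can have a single exact answer. A minimal sketch follows, assuming the crop rectangle is given in pixel coordinates with exclusive right and bottom bounds and that output pixels sample at their centers. The function name and the sampling convention are assumptions, not PaintBench's documented behavior.

```python
def crop_and_scale_nearest(img, left, top, right, bottom, out_w, out_h):
    """Crop img[top:bottom][left:right] and resize it to out_w x out_h.

    img is a list of rows. Each output pixel takes the value of the
    source pixel that contains its center (nearest-neighbor sampling).
    """
    crop_w, crop_h = right - left, bottom - top
    out = []
    for y in range(out_h):
        # Integer form of floor((y + 0.5) * crop_h / out_h).
        src_y = top + ((2 * y + 1) * crop_h) // (2 * out_h)
        row = []
        for x in range(out_w):
            src_x = left + ((2 * x + 1) * crop_w) // (2 * out_w)
            row.append(img[src_y][src_x])
        out.append(row)
    return out
```

Scaling a 2×2 crop to 4×4 turns each source pixel into a 2×2 block, with no new colors introduced.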

Input · Ground Truth · Model Output

Color Imprecision

LongCat-Edit · Removal

Remove all shapes except those that are a yellow heart.

Input · Ground Truth · Model Output

Structural Imprecision

FLUX.1-Kontext · Shearing

Shear the hexagon so its top bounding box edge shifts left by 66% of its bounding box width, keeping the bottom edge fixed.
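This instruction describes a horizontal shear whose offset grows linearly from zero at the bottom of the bounding box to the full shift at the top. The sketch below applies it to polygon vertices, assuming image coordinates where y increases downward, so the top edge is the minimum y. The function name is hypothetical.

```python
def shear_top_edge_left(points, shift_frac):
    """Shear a polygon so the top of its bounding box moves left.

    points is a list of (x, y) vertices in image coordinates (y grows
    downward). shift_frac is the top-edge shift as a fraction of the
    bounding-box width, e.g. 0.66 for 66%. The bottom edge stays fixed.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width = max(xs) - min(xs)
    top, bottom = min(ys), max(ys)
    height = bottom - top
    if height == 0:
        return list(points)  # degenerate shape: nothing to shear
    shift = shift_frac * width
    # Offset is proportional to the height above the bottom edge.
    return [(x - shift * (bottom - y) / height, y) for x, y in points]
```

For a 10×10 square with shift_frac = 0.66, the top vertices move left by 6.6 units, a point at mid-height moves left by 3.3, and the bottom vertices do not move.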

Input · Ground Truth · Model Output

Structural Catastrophe

LongCat-Edit · Reflection

Reflect the gray shape across the top-left to bottom-right diagonal of its bounding box. Place the transformed shape underneath any possible overlapping shapes.

05 / Generalization to Applied Tasks

TinyGrafixBench

Do PaintBench scores generalize to applied tasks? To find out, we create TinyGrafixBench, a data-visualization editing benchmark that adopts the same philosophy of seed-based generation and pixel-level evaluation. Model scores correlate strongly with those on PaintBench, suggesting generalization to applied precise editing tasks.

Mean IoU scores displayed. Scores strongly correlate with those on PaintBench (linear regression R² = 0.91, p < 0.001).

Leaderboard columns: Model, Overall, Bar Chart, Scatter, Line Chart, Heatmap, Network

Citation

@article{paintbench2026,
      title={PaintBench: Deterministic Evaluation of Precise Visual Editing},
      author={Kai Xu and Ellis Brown and Shrikar Madhu and Rob Fergus and He He and Saining Xie},
      year={2026},
      eprint={2606.00188},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2606.00188},
}
