PaintBench

Can multimodal AI models make exact visual edits?

A foundational benchmark for precise single-answer visual editing.
Deterministically evaluated without bias-prone judge models.
Infinite problem variety via procedural generation.
The highest-performing model today reaches only 17.1% mean IoU.

Kai Xu*  ·  Ellis Brown*  ·  Shrikar Madhu  ·  Rob Fergus  ·  He He  ·  Saining Xie
* Equal contribution  ·  New York University
[Figure: PaintBench benchmark illustration showing geometric shapes with measurement annotations and a pixel-level evaluation grid]
20 foundational editing operations
Infinite seed-generated problem variety
17.1% highest-performing model score (mIoU)
01 / How a Problem Works

One correct answer per problem.
Evaluated at the pixel.

Each PaintBench problem gives the model an input image and a precise instruction. There is exactly one ground-truth answer. Evaluation compares each pixel of the model output to the ground-truth answer and the input image using CIE ΔE₇₆ perceptual color distance, with no human rating and no judge model.
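For readers unfamiliar with the metric, CIE ΔE₇₆ is the Euclidean distance between two colors in CIELAB space. The sketch below is a minimal pure-Python illustration, not PaintBench's evaluation code: the function names are hypothetical, and it assumes 8-bit sRGB inputs and a D65 reference white.

```python
import math

# Assumption: D65 reference white (CIE 1931 2-degree observer).
WHITE_D65 = (0.95047, 1.0, 1.08883)


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (hypothetical helper)."""
    # Undo the sRGB transfer curve to get linear-light values in [0, 1].
    linear = []
    for channel in rgb:
        c = channel / 255.0
        linear.append(c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = linear

    # Linear sRGB -> CIE XYZ.
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        # Piecewise cube root used by the CIELAB definition.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = (f(value / white) for value, white in zip((x, y, z), WHITE_D65))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(rgb_a, rgb_b):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(srgb_to_lab(rgb_a), srgb_to_lab(rgb_b))
```

Identical colors give a distance of 0, and pure black against pure white gives a distance of about 100, which is the full lightness range.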

Instruction
Draw a filled olive-colored (#717A1E) circle centered at (66.1%, 56.7%) with a radius of 21.5% image width. Place it underneath any existing shapes. Clip any parts that may extend beyond the image boundary.
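To illustrate why an instruction like this has a single correct answer, here is a rough sketch of how such a ground truth could be rasterized. It is not the benchmark's renderer. The function name is hypothetical, and two conventions are assumed: a pixel is covered when its center lies inside the circle, and any non-background pixel belongs to an existing shape.

```python
def draw_circle_underneath(image, background, color, cx_pct, cy_pct, r_pct):
    """Return a copy of `image` with a filled circle painted beneath existing shapes.

    image: list of rows, each a list of (R, G, B) tuples.
    Assumption: any pixel that differs from `background` is part of an existing shape.
    """
    height, width = len(image), len(image[0])
    cx = cx_pct / 100 * width
    cy = cy_pct / 100 * height
    radius = r_pct / 100 * width  # the radius is given as a percentage of image width
    out = [row[:] for row in image]
    for y in range(height):  # visiting only valid pixels clips the circle to the image
        for x in range(width):
            # Assumption: a pixel is covered when its center lies inside the circle.
            inside = (x + 0.5 - cx) ** 2 + (y + 0.5 - cy) ** 2 <= radius ** 2
            if inside and image[y][x] == background:
                out[y][x] = color  # paint only where no existing shape sits on top
    return out


WHITE = (255, 255, 255)
OLIVE = (0x71, 0x7A, 0x1E)  # #717A1E from the instruction
canvas = [[WHITE] * 100 for _ in range(100)]
canvas[57][66] = (0, 0, 0)  # a one-pixel stand-in for an existing shape
result = draw_circle_underneath(canvas, WHITE, OLIVE, 66.1, 56.7, 21.5)
```

The existing black pixel stays on top of the new circle, and the input canvas itself is left unmodified.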
[Figure: input image, ground-truth answer, model output (Nano Banana 2), and pixel error map. A slider sets the ΔE tolerance from strict (ΔE ≤ 0) to lenient; the panel reports edit accuracy, preservation accuracy, and IoU@tolerance.]

Green pixels are correctly edited and blue pixels are correctly preserved within the tolerance. Red pixels are incorrectly edited and orange pixels are incorrectly preserved.
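The page does not spell out how edit accuracy, preservation accuracy, and IoU@tolerance are computed, so the sketch below shows only one plausible reading; the paper's exact definitions may differ. It assumes that the edit region is every pixel where the ground truth differs from the input, that the intersection is the set of correctly edited pixels, and that the union is the edit region plus any pixel the model moved away from the input. Images are flat lists of CIELAB triples, so ΔE₇₆ is a plain Euclidean distance.

```python
import math


def ratio(numerator, denominator):
    # Assumption: an empty region counts as fully correct.
    return numerator / denominator if denominator else 1.0


def score(input_lab, truth_lab, output_lab, tol):
    """Score one output under the assumed definitions described above."""
    edit_total = edit_ok = keep_total = keep_ok = union = 0
    for src, gt, out in zip(input_lab, truth_lab, output_lab):
        in_edit_region = math.dist(src, gt) > 0    # the ground truth changes this pixel
        matches_truth = math.dist(out, gt) <= tol  # output is within tolerance of the answer
        model_changed = math.dist(out, src) > tol  # the model moved this pixel away from the input
        if in_edit_region:
            edit_total += 1
            edit_ok += matches_truth
        else:
            keep_total += 1
            keep_ok += matches_truth
        union += in_edit_region or model_changed
    return {
        "edit_accuracy": ratio(edit_ok, edit_total),
        "preservation_accuracy": ratio(keep_ok, keep_total),
        "iou": ratio(edit_ok, union),
    }


# Toy four-pixel example (illustrative values, not benchmark data).
inp = [(50, 0, 0)] * 4
truth = [(60, 0, 0), (60, 0, 0), (50, 0, 0), (50, 0, 0)]
output = [(60, 0, 0), (50, 0, 0), (50, 0, 0), (70, 0, 0)]
strict = score(inp, truth, output, tol=0)
lenient = score(inp, truth, output, tol=25)
```

At zero tolerance the toy output gets one of two edits right and disturbs one of two preserved pixels. At a tolerance of 25 every pixel falls within range, which mirrors how the slider moves from strict to lenient.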

02 / The Benchmark

20 tasks across 4 categories.
Infinitely scalable.

Procedurally Generated
Each problem is generated from a unique seed and configurable scene/difficulty parameters.
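As a rough illustration of seed-based generation (not the benchmark's actual generator), the sketch below derives a whole problem from a private random number generator, so the same seed and parameters always reproduce the same problem. The shape list, field names, and instruction template are hypothetical.

```python
import random

# Hypothetical shape vocabulary, for illustration only.
SHAPES = ["circle", "square", "triangle", "hexagon", "heart"]


def generate_problem(seed, num_shapes=3):
    """Build a toy problem description that depends only on the seed and parameters."""
    rng = random.Random(seed)  # private generator: no dependence on global random state
    shapes = [
        {
            "kind": rng.choice(SHAPES),
            "color": "#{:06X}".format(rng.randrange(0x1000000)),
            "center_pct": (round(rng.uniform(10, 90), 1), round(rng.uniform(10, 90), 1)),
        }
        for _ in range(num_shapes)
    ]
    target = rng.randrange(num_shapes)
    shift = round(rng.uniform(5, 20), 1)
    instruction = f"Translate shape {target} right by {shift}% of the image width."
    return {"seed": seed, "shapes": shapes, "instruction": instruction}
```

Here `num_shapes` stands in for a scene or difficulty parameter: raising it produces a more crowded scene while keeping generation reproducible.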
One Correct Answer
Every output is evaluated pixel-by-pixel via CIE ΔE₇₆ color distance. No human ratings, no bias-prone judge models, no perceptual proxies.
Atomic Operations
Tasks target fundamental building-block operations found in many precise real-world visual editing tasks.
[Example viewer: input, ground truth, and model output (GPT-I2), shown with the task instruction]
03 / Results

The benchmark is far from solved.

The highest-performing model reaches only 17.1% (mean IoU) overall. Geometric transformation, most structural manipulation, and formula-based color changes are consistently hard.

Results table columns: Model, Overall, then the 20 tasks grouped by category:
Geometric Transformation: Translation, Rotation, Reflection, Scaling, Shearing
Structural Manipulation: Construction, Removal, Copying, Border, Cropping
Color Change: Recolor, Flood Fill, Blending, Gradient, Point Ops
Symbolic Reasoning: Comparison, Ordering, Pattern, Counting, Legend
04 / Room for Model Improvement

Model failure patterns revealed.

Detailed diagnostics reveal both universal failure patterns and model-specific behaviors. Performance is brittle under several scene variations: object count, background complexity, color scheme, and edit-region size.

Each example below shows the input, the ground truth, and the model output.

Execution Omission
Qwen-Image-Edit · Cropping
Crop to the interior of the outlined region. Scale to fill the canvas using nearest-neighbor interpolation.

Color Imprecision
LongCat-Edit · Removal
Remove all shapes except those that are a yellow heart.

Structural Imprecision
FLUX.1-Kontext · Shearing
Shear the hexagon so its top bounding box edge shifts left by 66% of its bounding box width, keeping the bottom edge fixed.

Structural Catastrophe
LongCat-Edit · Reflection
Reflect the gray shape across the top-left to bottom-right diagonal of its bounding box. Place the transformed shape underneath any possible overlapping shapes.
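To make one of these instructions concrete, the cropping task in the Execution Omission example asks for a crop followed by nearest-neighbor scaling. A minimal sketch of that operation follows. It is not the benchmark's implementation: the function name and the pixel-center sampling convention are assumptions, and the crop box is taken as already known.

```python
def crop_and_scale_nearest(image, left, top, right, bottom, out_w, out_h):
    """Crop rows top..bottom-1 and columns left..right-1, then resize to out_w x out_h.

    image: list of rows of pixel values. Uses nearest-neighbor sampling.
    """
    crop_w, crop_h = right - left, bottom - top
    out = []
    for y in range(out_h):
        # Map the output pixel center into the crop; flooring selects the nearest source pixel.
        src_y = top + min(crop_h - 1, int((y + 0.5) * crop_h / out_h))
        row = []
        for x in range(out_w):
            src_x = left + min(crop_w - 1, int((x + 0.5) * crop_w / out_w))
            row.append(image[src_y][src_x])
        out.append(row)
    return out


# 4x4 toy image whose pixel values record their own (row, column) position.
grid = [[(y, x) for x in range(4)] for y in range(4)]
scaled = crop_and_scale_nearest(grid, 1, 1, 3, 3, 4, 4)
```

Each source pixel in the 2x2 crop becomes a 2x2 block in the output, with no blended colors. That is why nearest-neighbor scaling keeps the answer exact at the pixel level.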
05 / Generalization to Applied Tasks

TinyGrafixBench

Do PaintBench scores generalize to applied tasks? To find out, we create TinyGrafixBench, a data-visualization editing benchmark built on the same seed-generated, pixel-evaluated approach. Model scores correlate strongly with those on PaintBench, suggesting that the results generalize to applied precise editing tasks.

Mean IoU scores displayed. Scores strongly correlate with those on PaintBench (linear regression R² = 0.91, p < 0.001).
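For reference, the R² of a simple linear regression between two sets of per-model scores can be computed as sketched below. The function name is hypothetical, and the numbers in the usage lines are toy values chosen to make the arithmetic easy to check. They are not PaintBench or TinyGrafixBench results.

```python
def r_squared(xs, ys):
    """Coefficient of determination for an ordinary least-squares line fit of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy values only: a perfectly linear pair and a noisier pair.
perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])
noisy = r_squared([1, 2, 3, 4], [1, 3, 2, 4])
```

A value near 1 means one benchmark's scores are almost fully predicted by a linear function of the other's, which is the sense in which the reported R² = 0.91 indicates strong agreement.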

Table columns: Model, Overall, Bar Chart, Scatter, Line Chart, Heatmap, Network

Citation
@article{paintbench2026,
      title={PaintBench: Deterministic Evaluation of Precise Visual Editing},
      author={Kai Xu and Ellis Brown and Shrikar Madhu and Rob Fergus and He He and Saining Xie},
      year={2026},
      eprint={2606.00188},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2606.00188},
}