Can coding agents match the published SOTA of Nature-family papers?
NatureBench is a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, spanning 6 scientific domains, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery. Each task asks an agent to solve a real scientific machine-learning problem and is scored against the source paper's reported state of the art.
NatureBench is built on NatureGym, an automated pipeline that converts a published paper into a containerized task package comprising a task brief, the paper's dataset, a held-out test set with hidden ground truth, and an automated evaluator.
The strongest configuration reaches a 17.8% Surpass-SOTA rate, and success remains uneven across the six scientific domains NatureBench spans.
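How a Surpass-SOTA rate could be tallied from per-task scores is sketched below. This is an illustrative assumption, not NatureBench's actual scoring code: the `TaskResult` fields, the `higher_is_better` flag, and the strict-inequality rule are all hypothetical, and the real evaluator may treat ties, invalid runs, or multi-metric tasks differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    agent_score: float
    sota_score: float              # value reported in the source paper
    higher_is_better: bool = True  # False for error-style metrics such as RMSE


def surpasses_sota(result: TaskResult) -> bool:
    """Return True when the agent strictly beats the paper's reported score."""
    if result.higher_is_better:
        return result.agent_score > result.sota_score
    return result.agent_score < result.sota_score


def surpass_sota_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks on which the agent surpasses the reported SOTA."""
    if not results:
        raise ValueError("no task results to aggregate")
    wins = sum(surpasses_sota(r) for r in results)
    return 100.0 * wins / len(results)


if __name__ == "__main__":
    # Made-up scores for illustration; these are not NatureBench results.
    demo = [
        TaskResult("task_a", agent_score=0.91, sota_score=0.88),
        TaskResult("task_b", agent_score=0.70, sota_score=0.75),
        TaskResult("task_c", agent_score=1.20, sota_score=1.35, higher_is_better=False),
        TaskResult("task_d", agent_score=0.50, sota_score=0.50),  # a tie is not a win here
    ]
    print(f"Surpass-SOTA rate: {surpass_sota_rate(demo):.1f}%")
```

Carrying the metric direction on each record keeps the comparison correct when a benchmark mixes accuracy-style and error-style metrics.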
git clone https://github.com/FrontisAI/NatureBench.git
cd NatureBench
conda env create -f conda_env.yml
conda env create -f conda_env_eval.yml
conda activate naturebench

This creates two environments: naturebench (the main orchestration environment that runs run_naturebench.py, agent adapters, Docker scheduling, and result aggregation) and naturebench-eval (the evaluation service environment that runs scoring logic).
The base Docker image is built automatically on the first run via --ensure-base-image (used in the Quick Start below). To build it manually:
bash scripts/ensure_naturebench_base.sh

Set credentials for your agent. Claude Code is shown here; for Codex, Gemini CLI, the post-hoc judge, and network proxy, see docs/configuration.md.
export ANTHROPIC_API_KEY=...
export ANTHROPIC_BASE_URL=...

Run end-to-end. This single command downloads the dataset, builds the base image if needed, starts the evaluation service, and runs the evaluation:
python run_naturebench.py \
--tasks gpu_low \
--agent claude \
--model <model-name> \
--out-dir ./results/claude_<model-name>_gpu_low \
--gpu-devices 0,1,2,3 \
--max-workers 4 \
--start-eval-services \
--eval-env-mapping ./eval_env_mapping.json \
--ensure-base-image

This lists only the parameters you set explicitly; options with sensible defaults are omitted (see Quick Start defaults for the full list and their values). Adjust --gpu-devices / --max-workers to your hardware, or use --tasks cpu (without the GPU flags) for a GPU-free run. The complete parameter reference is in docs/usage.md.
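If you prefer to launch runs from Python (for example, to sweep over models), the Quick Start command can be assembled programmatically. The sketch below is a convenience wrapper, not part of the repository: `missing_credentials`, `build_command`, and `launch` are hypothetical helpers, it assumes both environment variables shown above are required, it treats --gpu-devices as the only GPU flag to drop for a CPU run, and it extends the output directory naming pattern to other task sets. It uses only flags that appear in the command above.

```python
import os
import shlex
import subprocess
import sys

# Assumption: both variables from the Quick Start must be set for Claude Code.
REQUIRED_ENV = ("ANTHROPIC_API_KEY", "ANTHROPIC_BASE_URL")


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


def build_command(model, tasks="gpu_low", gpu_devices=(0, 1, 2, 3), max_workers=4):
    """Assemble the run_naturebench.py argument list from the Quick Start flags."""
    cmd = [
        sys.executable, "run_naturebench.py",
        "--tasks", tasks,
        "--agent", "claude",
        "--model", model,
        "--out-dir", f"./results/claude_{model}_{tasks}",
        "--max-workers", str(max_workers),
        "--start-eval-services",
        "--eval-env-mapping", "./eval_env_mapping.json",
        "--ensure-base-image",
    ]
    # Pass gpu_devices=() together with tasks="cpu" for a GPU-free run.
    if gpu_devices:
        cmd += ["--gpu-devices", ",".join(str(d) for d in gpu_devices)]
    return cmd


def launch(model, **kwargs):
    """Check credentials, then run the benchmark from the repository root."""
    missing = missing_credentials()
    if missing:
        raise RuntimeError(f"Set these variables first: {', '.join(missing)}")
    subprocess.run(build_command(model, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command rather than running it; call launch() to execute.
    print(shlex.join(build_command("<model-name>")))
```

Building the arguments as a list and passing it to `subprocess.run` without a shell avoids quoting problems in model names and paths.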
The task packages are built by NatureGym, an automated, Skills-based pipeline that turns a published Nature-family paper into a containerized, runnable task. It filters papers, acquires and verifies the data, and assembles the task package (brief, data, evaluator, environment, metadata), while an information firewall removes the source method so that agents must discover solutions rather than reproduce them.
The pipeline runs as a chain of Claude Code skills driven by batch scripts, all under naturegym/. See naturegym/README.md for the stage-by-stage flow, the construction skills, and how to run them.
- run_naturebench.py: one-command entry point that downloads data and launches evaluation
- solve.py: main evaluation orchestrator
- eval_service.py: host-side evaluation service
- judge.py: post-hoc validity judge
- agent/: adapters for Claude Code / Codex CLI / Gemini CLI
- evaluator/: evaluator interface
- docker/Dockerfile.base: NatureBench base Docker image
- scripts/: helper scripts
  - ensure_naturebench_base.sh: build the NatureBench base image if it is missing
  - start_eval_services.sh: start evaluation service from the mapping file
- task-set/: task lists grouped by resource demand
- docs/: detailed configuration, usage, and task-package reference
- naturegym/: NatureGym construction pipeline (skills and batch drivers that build task packages from papers)
- conda_env.yml: main orchestration environment
- conda_env_eval.yml: evaluation service environment
- eval_env_mapping.json: task-to-evaluation-service port mapping
- config.example.yaml: example configuration file
- LICENSE, NOTICE: MIT license for original work; NOTICE defines the scope
| Document | Contents |
|---|---|
| docs/configuration.md | Agent authentication (Claude Code / Codex CLI / Gemini CLI), the post-hoc judge, network proxy, and the evaluation service. |
| docs/usage.md | More run examples (CPU, GPU batch, Codex login, resume), the complete parameter reference, and output formats. |
| docs/task-packages.md | Task package structure and the resource-grouped task lists. |
The top-level LICENSE is the MIT License and applies only to original NatureBench contributions; see NOTICE for the exact scope. Third-party data bundled in each task package is governed by the notices in that task's tasks/<case_id>/licenses/ directory.
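To review the third-party notices before redistributing any task data, you can enumerate the per-task licenses/ directories. The sketch below is a hypothetical helper, not a script shipped with the repository; it assumes the task packages sit under a local tasks/ directory laid out as tasks/<case_id>/licenses/, as described above.

```python
from pathlib import Path


def license_files_by_task(tasks_root):
    """Map each case ID to the sorted file names in its licenses/ directory."""
    root = Path(tasks_root)
    found = {}
    for licenses_dir in sorted(root.glob("*/licenses")):
        if licenses_dir.is_dir():
            found[licenses_dir.parent.name] = sorted(
                p.name for p in licenses_dir.iterdir() if p.is_file()
            )
    return found


if __name__ == "__main__":
    # Assumption: run from the directory that contains tasks/.
    for case_id, names in license_files_by_task("tasks").items():
        print(f"{case_id}: {', '.join(names) or '(no files)'}")
```

Tasks without a licenses/ directory are simply absent from the result, so comparing its keys against the full task list shows which packages still need a notice check.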
If you use NatureBench in your research, please cite our work:
@misc{wang2026naturebench,
title = {NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?},
author = {Yuru Wang and Lejun Cheng and Yuxin Zuo and Sihang Zeng and Bingxiang He and Che Jiang and Junlin Yang and Yuchong Wang and Kaikai Zhao and Weifeng Huang and Kai Tian and Zhenzhao Yuan and Jincheng Zhong and Weizhi Wang and Ning Ding and Bowen Zhou and Kaiyan Zhang},
year = {2026},
eprint = {2606.24530},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2606.24530}
}
