[codex] Fix knowledge scoring and Anthropic Opus 4.7 compatibility by Jameswlepage · Pull Request #21 · WordPress/wp-bench

Jameswlepage · 2026-04-16T15:05:05Z

Summary

This PR fixes two real benchmark harness issues:

The knowledge runner treated every knowledge item as letter-only multiple choice, even though the dataset already contains short-answer questions.
Anthropic's claude-opus-4-7 rejects the temperature parameter, which prevented the harness from benchmarking the model at all.

Root Cause

BenchmarkRunner and SingleModelRunner always used the prompt "Answer with only the letter of the correct choice."
Knowledge scoring used a strict answer.upper().startswith(correct_answer) check for every knowledge record.
The dataset loader and Parquet export path did not preserve short-answer metadata like type and answer_type.
ModelInterface always passed temperature, but Anthropic now returns invalid_request_error for Opus 4.7 when that field is sent.
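The scoring items above can be reproduced in a few lines. The sketch below contrasts the strict check quoted in this list with a more tolerant check. It is an illustration, not the helper module from this PR: `tolerant_score`, the `choices` field, the `"short_answer"` type value, and both example items are assumptions made up for the demonstration.

```python
import re


def legacy_score(answer: str, correct_answer: str) -> bool:
    """The strict check quoted above: the reply must start with the expected string."""
    return answer.upper().startswith(correct_answer)


def _normalize(text: str) -> str:
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9_]+", " ", text.lower()).split())


def _contains_phrase(reply: str, phrase: str) -> bool:
    """Whole-token containment, so 'esc_attr' does not match inside 'esc_attr_e'."""
    return bool(phrase) and f" {phrase} " in f" {reply} "


def tolerant_score(answer: str, item: dict) -> bool:
    """Hypothetical scorer: letter or choice text for multiple choice, canonical text for short answer."""
    reply = _normalize(answer)
    if not reply:
        return False
    if item.get("type") == "short_answer":
        return _contains_phrase(reply, _normalize(item["correct_answer"]))
    # Multiple choice: accept the option letter as the first token ...
    if reply.split()[0] == item["correct_answer"].lower():
        return True
    # ... or the text of the correct choice anywhere in a prose reply.
    choice_text = _normalize(item["choices"][item["correct_answer"]])
    return _contains_phrase(reply, choice_text)


# Invented items for illustration; they are not taken from the wp-core-v1 suite.
mc_item = {
    "type": "multiple_choice",
    "correct_answer": "B",
    "choices": {"A": "esc_html", "B": "esc_attr", "C": "esc_url"},
}
sa_item = {"type": "short_answer", "correct_answer": "register_rest_route"}

for reply, item in [
    ("B. esc_attr", mc_item),
    ("Use esc_attr for attribute values.", mc_item),
    ("register_rest_route()", sa_item),
]:
    print(repr(reply), legacy_score(reply, item["correct_answer"]), tolerant_score(reply, item))
```

The strict check fails on the prose reply because it does not start with the letter, and on the short answer because the uppercased reply is compared against a lowercase canonical value. A containment check like this one is deliberately lenient: it would also accept a reply that names the correct choice while picking a different letter, so a real scorer may need stricter rules.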

What Changed

Added a shared knowledge helper module for prompt rendering and scoring.
Multiple-choice questions still prefer letter answers, but now also accept the correct choice text when the model answers in prose.
Short-answer questions now prompt for the actual WordPress value and score against the canonical answer text.
Preserved type and answer_type in local parsing, HF loading, and Parquet export.
Added focused tests for prompt rendering, scorer behavior, metadata preservation, and Anthropic temperature-retry behavior.
Updated README/docs to describe mixed multiple-choice + short-answer knowledge items.
Added a retry path in the model wrapper: if Anthropic rejects a request because temperature is deprecated, retry once without it.
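The retry path in the last item can be sketched without touching a real SDK. The example below is a minimal illustration under stated assumptions: `InvalidRequestError`, `call_with_temperature_fallback`, `fake_send`, and the error message text are all hypothetical stand-ins, not the harness's ModelInterface or Anthropic's client API.

```python
from typing import Any, Callable


class InvalidRequestError(Exception):
    """Stand-in for a provider's invalid_request_error response (hypothetical class)."""


def call_with_temperature_fallback(send: Callable[..., Any], **params: Any) -> Any:
    """Call send(**params); if temperature is rejected, retry once without it."""
    try:
        return send(**params)
    except InvalidRequestError as exc:
        # Only retry when temperature was actually sent and the error mentions it.
        if "temperature" not in params or "temperature" not in str(exc).lower():
            raise
        retry_params = {k: v for k, v in params.items() if k != "temperature"}
        return send(**retry_params)


calls = []


def fake_send(**params: Any) -> dict:
    """Fake transport that rejects temperature, mimicking the behavior described above."""
    calls.append(sorted(params))
    if "temperature" in params:
        raise InvalidRequestError("temperature is deprecated for this model")
    return {"text": "ok", "params": sorted(params)}


result = call_with_temperature_fallback(
    fake_send, model="anthropic/claude-opus-4-7", prompt="hi", temperature=0.0
)
print(result["params"], len(calls))
```

Retrying only once, and only when the error mentions temperature, keeps unrelated request errors visible instead of masking them behind a second attempt.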

Audit Notes

In this checkout, wp-core-v1 currently contains 42 knowledge tests and 24 execution tests.
Only 2 knowledge items are short-answer in the local suite (k-rest-001, k-security-001).
That means the Anthropic note about 113 of 325 fill-in-the-blank items does not match the current local JSON in this repository. The larger count likely came from a different dataset revision or export.

Validation

ruff check python
pytest python
python datasets/export_dataset.py
wp-bench dry-run --config wp-bench.yaml
wp-bench run --config wp-bench.yaml --model-name claude-sonnet-4-5-20250929
wp-bench run --config wp-bench.yaml --model-name claude-opus-4-5-20251101
wp-bench run --config wp-bench.yaml --model-name anthropic/claude-opus-4-7

Live Benchmark Results

Clean-branch reruns on this patch produced:

Claude Sonnet 4.5 (claude-sonnet-4-5-20250929): knowledge 92.86%, correctness 52.08%, overall 48.69%
Claude Opus 4.5 (claude-opus-4-5-20251101): knowledge 78.57%, correctness 50.00%, overall 43.57%
Claude Opus 4.7 (anthropic/claude-opus-4-7): knowledge 95.24%, correctness 54.86%, overall 50.52%

Re-scoring the Sonnet and Opus 4.5 answer logs with the legacy scorer (startswith / letter-only behavior) yields:

Sonnet knowledge: 85.71% instead of 92.86% (+7.14 points from the fix)
Opus 4.5 knowledge: 73.81% instead of 78.57% (+4.76 points from the fix)

Those deltas came from the two short-answer items plus one multiple-choice response where the model answered correctly in prose instead of starting with the option letter.
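As a sanity check on the percentages, the knowledge scores can be recomputed from item counts. This sketch assumes all 42 knowledge items reported in the Audit Notes are weighted equally; the correct-answer counts (39, 36, 33, 31) are back-calculated from the reported percentages and are not stated anywhere in this PR.

```python
TOTAL_KNOWLEDGE_ITEMS = 42  # from the Audit Notes for wp-core-v1


def pct(correct: int, total: int = TOTAL_KNOWLEDGE_ITEMS) -> float:
    """Percentage of items answered correctly, rounded to two decimals."""
    return round(100 * correct / total, 2)


# (model, correct with the new scorer, correct with the legacy scorer); counts are inferred.
rows = [("Sonnet 4.5", 39, 36), ("Opus 4.5", 33, 31)]

for model, new_correct, legacy_correct in rows:
    delta = round(100 * (new_correct - legacy_correct) / TOTAL_KNOWLEDGE_ITEMS, 2)
    print(f"{model}: new {pct(new_correct)}%, legacy {pct(legacy_correct)}%, delta +{delta} points")
```

Under that assumption the reported figures are reproduced exactly, with the Sonnet delta corresponding to three items and the Opus 4.5 delta to two.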

github-actions · 2026-04-16T18:53:32Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: Jameswlepage <isotropic@git.wordpress.org>
Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

fix: support short-answer knowledge scoring

a871767

Jameswlepage requested a review from JasonTheAdams April 16, 2026 15:05

Jameswlepage changed the title from "[codex] Fix short-answer knowledge scoring" to "Fix short-answer knowledge scoring" Apr 16, 2026

fix: retry Anthropic requests without temperature

f19e8c6

Jameswlepage changed the title from "Fix short-answer knowledge scoring" to "[codex] Fix knowledge scoring and Anthropic Opus 4.7 compatibility" Apr 16, 2026

Jameswlepage marked this pull request as ready for review April 16, 2026 18:53

JasonTheAdams approved these changes Apr 16, 2026

Jameswlepage merged commit f7b9dc9 into trunk Apr 16, 2026
3 checks passed

Jameswlepage deleted the codex/fix-knowledge-scoring branch April 16, 2026 19:10
