
[codex] Fix knowledge scoring and Anthropic Opus 4.7 compatibility #21

Merged. Jameswlepage merged 2 commits into trunk from codex/fix-knowledge-scoring on Apr 16, 2026.

@Jameswlepage commented on Apr 16, 2026

Summary

This PR fixes two real benchmark harness issues:

  • the knowledge runner treated every knowledge item as letter-only multiple choice even though the dataset already contains short-answer questions
  • Anthropic's claude-opus-4-7 rejects the temperature parameter, which prevented the harness from benchmarking the model at all

Root Cause

  • BenchmarkRunner and SingleModelRunner always prompted with "Answer with only the letter of the correct choice."
  • Knowledge scoring used a strict answer.upper().startswith(correct_answer) check for every knowledge record.
  • The dataset loader and Parquet export path did not preserve short-answer metadata like type and answer_type.
  • ModelInterface always passed temperature, but Anthropic now returns invalid_request_error for Opus 4.7 when that field is sent.
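
The scoring problem in the first two bullets can be reproduced in a few lines. This is a minimal sketch, not the harness code: `legacy_is_correct` is a hypothetical name and the sample answers are invented for illustration. Only the `answer.upper().startswith(correct_answer)` expression is taken from the description above.

```python
def legacy_is_correct(answer: str, correct_answer: str) -> bool:
    """Legacy-style check: the upper-cased reply must start with the expected string."""
    return answer.upper().startswith(correct_answer)


# Letter-first replies pass, as intended for multiple choice.
print(legacy_is_correct("B", "B"))                       # True
print(legacy_is_correct("b) register_rest_route", "B"))  # True

# A correct prose reply fails because it does not lead with the option letter.
print(legacy_is_correct("The answer is register_rest_route.", "B"))  # False

# One way a short-answer item can fail: only the reply is upper-cased,
# so a lower-case canonical answer (assumed here) never matches.
print(legacy_is_correct("rest_api_init", "rest_api_init"))  # False
```

The same check can also produce false positives, since any prose reply that happens to begin with the expected letter is accepted.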

What Changed

  • Added a shared knowledge helper module for prompt rendering and scoring.
  • Multiple-choice questions still prefer letter answers, but now also accept the correct choice text when the model answers in prose.
  • Short-answer questions now prompt for the actual WordPress value and score against the canonical answer text.
  • Preserved type and answer_type in local parsing, HF loading, and Parquet export.
  • Added focused tests for prompt rendering, scorer behavior, metadata preservation, and Anthropic temperature-retry behavior.
  • Updated README/docs to describe mixed multiple-choice + short-answer knowledge items.
  • Added a retry path in the model wrapper: if Anthropic rejects a request because temperature is deprecated, retry once without it.
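
The retry path in the last bullet could look roughly like the sketch below. It is deliberately library-agnostic: `send_request`, `RequestRejected`, and the substring test on the error message are assumptions made for illustration, not the Anthropic SDK's API and not the actual ModelInterface code in this PR.

```python
from typing import Any, Callable, Dict, List


class RequestRejected(Exception):
    """Hypothetical stand-in for a provider's invalid_request_error."""


def complete_with_temperature_fallback(
    send_request: Callable[[Dict[str, Any]], str],
    params: Dict[str, Any],
) -> str:
    """Send params; if rejected because of `temperature`, retry once without it."""
    try:
        return send_request(params)
    except RequestRejected as exc:
        # Retry only when temperature was actually sent and the error names it;
        # any other rejection is re-raised unchanged.
        if "temperature" not in params or "temperature" not in str(exc).lower():
            raise
        retry_params = {k: v for k, v in params.items() if k != "temperature"}
        return send_request(retry_params)


# Fake provider that mimics a model which refuses the temperature field.
calls: List[Dict[str, Any]] = []


def fake_send(params: Dict[str, Any]) -> str:
    calls.append(dict(params))
    if "temperature" in params:
        raise RequestRejected("invalid_request_error: temperature is deprecated")
    return "ok"


result = complete_with_temperature_fallback(
    fake_send, {"model": "claude-opus-4-7", "temperature": 0.0}
)
print(result, len(calls))  # ok 2
```

Limiting the fallback to a single retry keeps unrelated failures (bad credentials, rate limits) visible instead of masking them behind repeated attempts.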

Audit Notes

  • In this checkout, wp-core-v1 currently contains 42 knowledge tests and 24 execution tests.
  • Only 2 knowledge items are short-answer in the local suite (k-rest-001, k-security-001).
  • That means the Anthropic note about 113 of 325 fill-in-the-blank items does not match the current local JSON in this repository. The larger count likely came from a different dataset revision or export.

Validation

  • ruff check python
  • pytest python
  • python datasets/export_dataset.py
  • wp-bench dry-run --config wp-bench.yaml
  • wp-bench run --config wp-bench.yaml --model-name claude-sonnet-4-5-20250929
  • wp-bench run --config wp-bench.yaml --model-name claude-opus-4-5-20251101
  • wp-bench run --config wp-bench.yaml --model-name anthropic/claude-opus-4-7

Live Benchmark Results

Clean-branch reruns on this patch produced:

  • Claude Sonnet 4.5 (claude-sonnet-4-5-20250929): knowledge 92.86%, correctness 52.08%, overall 48.69%
  • Claude Opus 4.5 (claude-opus-4-5-20251101): knowledge 78.57%, correctness 50.00%, overall 43.57%
  • Claude Opus 4.7 (anthropic/claude-opus-4-7): knowledge 95.24%, correctness 54.86%, overall 50.52%

Re-scoring the Sonnet and Opus 4.5 answer logs with the legacy scorer (startswith / letter-only behavior) yields:

  • Sonnet knowledge: 85.71% instead of 92.86% (+7.14 points from the fix)
  • Opus 4.5 knowledge: 73.81% instead of 78.57% (+4.76 points from the fix)

Those deltas came from the two short-answer items plus one multiple-choice response where the model answered correctly in prose instead of starting with the option letter.
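
To make the re-scoring comparison concrete, the sketch below scores a small invented answer log with a legacy-style check and with a type-aware check that follows the rules listed under What Changed. All names (`legacy_score`, `tolerant_score`, `normalize`), the `type` values, the record layout, and the three sample records are hypothetical, and the real helper module may normalize answers differently. The final lines check the arithmetic, assuming each percentage is a plain fraction of the 42 knowledge tests mentioned in Audit Notes.

```python
import re


def normalize(text: str) -> str:
    """Lower-case and collapse punctuation and whitespace for lenient matching."""
    return re.sub(r"[^a-z0-9_]+", " ", text.lower()).strip()


def legacy_score(answer: str, correct: str) -> bool:
    """Legacy-style check taken from the Root Cause description."""
    return answer.upper().startswith(correct)


def tolerant_score(record: dict, answer: str) -> bool:
    """Type-aware check: short answers by text, multiple choice by letter or text."""
    if record["type"] == "short_answer":
        return normalize(record["correct"]) in normalize(answer)
    # Multiple choice: prefer a leading option letter such as "B" or "b) ...".
    match = re.match(r"\s*\(?([A-Za-z])\s*(?:[).:]|$)", answer)
    if match:
        return match.group(1).upper() == record["correct"]
    # Otherwise accept the correct choice text appearing in a prose reply.
    return normalize(record["choices"][record["correct"]]) in normalize(answer)


choices = {"A": "add_action", "B": "register_rest_route"}
log = [
    {"type": "multiple_choice", "correct": "B", "choices": choices,
     "answer": "B"},
    {"type": "multiple_choice", "correct": "B", "choices": choices,
     "answer": "The answer is register_rest_route."},
    {"type": "short_answer", "correct": "rest_api_init",
     "answer": "`rest_api_init`"},
]

legacy_hits = sum(legacy_score(r["answer"], r["correct"]) for r in log)
tolerant_hits = sum(tolerant_score(r, r["answer"]) for r in log)
print(legacy_hits, tolerant_hits)  # 1 3

# Each of the 42 knowledge items is worth 100 / 42 points.
points_per_item = 100 / 42
print(round(3 * points_per_item, 2))  # 7.14
print(round(2 * points_per_item, 2))  # 4.76
```

Under that assumption, the reported deltas are consistent with three items flipping to correct for Sonnet and two for Opus 4.5.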

@Jameswlepage changed the title from "[codex] Fix short-answer knowledge scoring" to "Fix short-answer knowledge scoring" on Apr 16, 2026
@Jameswlepage changed the title from "Fix short-answer knowledge scoring" to "[codex] Fix knowledge scoring and Anthropic Opus 4.7 compatibility" on Apr 16, 2026
@Jameswlepage marked this pull request as ready for review on April 16, 2026 18:53
github-actions Bot commented Apr 16, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: Jameswlepage <isotropic@git.wordpress.org>
Co-authored-by: JasonTheAdams <jason_the_adams@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@Jameswlepage merged commit f7b9dc9 into trunk on Apr 16, 2026 (3 checks passed)
@Jameswlepage deleted the codex/fix-knowledge-scoring branch on April 16, 2026 19:10