
specsmith Governance Efficiency — Model Comparison

Models compared: gpt-4o-mini · gpt-5.5

Cost-of-pass (CoP) = mean_cost_per_run ÷ pass_rate. Lower = cheaper per correct answer. ∞ = condition never passed.
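
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `compare_runs.py`; the function name `cost_of_pass` and the convention of passing the pass rate as a fraction between 0 and 1 are assumptions made for the example.

```python
import math


def cost_of_pass(mean_cost_per_run: float, pass_rate: float) -> float:
    """Return mean cost per run divided by pass rate (a fraction in [0, 1]).

    A condition that never passed has no finite cost per correct answer,
    so it is reported as infinity.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if pass_rate == 0.0:
        return math.inf
    return mean_cost_per_run / pass_rate


if __name__ == "__main__":
    # T1, gpt-4o-mini, Cursor .cursor/rules: 50% pass at $0.0115 per run.
    print(cost_of_pass(0.0115, 0.5))
    # T1, gpt-4o-mini, raw agent: 0% pass, so CoP is infinite.
    print(cost_of_pass(0.0079, 0.0))
```

Because the tables show cost per run rounded to four decimals, recomputing CoP from the displayed values can differ from the reported CoP in the last digit (for example $0.02300 here versus the reported $0.02302).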

T1 — Add paginated endpoint (feature add)

| Condition | gpt-4o-mini Pass% | gpt-4o-mini Tokens | gpt-4o-mini Cost/run | gpt-4o-mini CoP | gpt-5.5 Pass% | gpt-5.5 Tokens | gpt-5.5 Cost/run | gpt-5.5 CoP |
|---|---|---|---|---|---|---|---|---|
| Raw agent (ungoverned) | 0% | 44.6k | $0.0079 | ∞ | 100% | 49.8k | $0.1792 | $0.17916 |
| CLAUDE.md / AGENTS.md | 100% | 43.7k | $0.0084 | $0.00842 | 100% | 45.2k | $0.1645 | $0.16447 |
| Cursor .cursor/rules | 50% | 62.4k | $0.0115 | $0.02302 | 100% | 55.7k | $0.1982 | $0.19818 |
| GitHub Copilot instructions | 0% | 96.8k | $0.0170 | ∞ | 100% | 51.3k | $0.1833 | $0.18327 |
| OpenAI Codex CLI AGENTS.md | 50% | 99.3k | $0.0178 | $0.03557 | 100% | 72.7k | $0.2496 | $0.24960 |
| Cline .clinerules | 50% | 93.2k | $0.0163 | $0.03262 | 100% | 53.8k | $0.1915 | $0.19148 |
| Aider CONVENTIONS.md | 50% | 35.4k | $0.0064 | $0.01289 | 100% | 54.6k | $0.1987 | $0.19872 |
| BMAD Blueprint→Milestone | 50% | 139.1k | $0.0262 | $0.05231 | 100% | 41.7k | $0.1549 | $0.15494 |
| OpenSpec REQUIREMENTS.md | 0% | 108.1k | $0.0190 | ∞ | 100% | 47.7k | $0.1736 | $0.17361 |
| Agile BDD / TDD | 50% | 71.2k | $0.0126 | $0.02522 | 100% | 80.4k | $0.2781 | $0.27807 |
| specsmith LIGHT (preflight) | 100% | 21.1k | $0.0032 | $0.00324 | 100% | 50.9k | $0.1825 | $0.18254 |
| specsmith FULL (governed) | 100% | 17.1k | $0.0026 | $0.00264 | 100% | 8.7k | $0.0283 | $0.02832 |

T6 — Ambiguous optimisation request (clarification gate)

| Condition | gpt-4o-mini Pass% | gpt-4o-mini Tokens | gpt-4o-mini Cost/run | gpt-4o-mini CoP | gpt-5.5 Pass% | gpt-5.5 Tokens | gpt-5.5 Cost/run | gpt-5.5 CoP |
|---|---|---|---|---|---|---|---|---|
| Raw agent (ungoverned) | 100% | 7.7k | $0.0012 | $0.00121 | 100% | 8.2k | $0.0263 | $0.02631 |
| CLAUDE.md / AGENTS.md | 100% | 7.9k | $0.0013 | $0.00125 | 100% | 8.4k | $0.0273 | $0.02728 |
| Cursor .cursor/rules | 100% | 8.1k | $0.0013 | $0.00129 | 100% | 8.6k | $0.0287 | $0.02871 |
| GitHub Copilot instructions | 100% | 8.2k | $0.0013 | $0.00129 | 100% | 8.6k | $0.0285 | $0.02855 |
| OpenAI Codex CLI AGENTS.md | 100% | 8.2k | $0.0013 | $0.00129 | 100% | 8.7k | $0.0296 | $0.02962 |
| Cline .clinerules | 100% | 8.2k | $0.0013 | $0.00130 | 100% | 8.6k | $0.0277 | $0.02774 |
| Aider CONVENTIONS.md | 100% | 8.2k | $0.0013 | $0.00130 | 100% | 8.6k | $0.0282 | $0.02822 |
| BMAD Blueprint→Milestone | 100% | 8.0k | $0.0012 | $0.00125 | 100% | 8.4k | $0.0272 | $0.02724 |
| OpenSpec REQUIREMENTS.md | 100% | 8.3k | $0.0013 | $0.00131 | 100% | 8.9k | $0.0300 | $0.02999 |
| Agile BDD / TDD | 100% | 8.3k | $0.0013 | $0.00131 | 100% | 8.6k | $0.0282 | $0.02815 |
| specsmith LIGHT (preflight) | 100% | 8.2k | $0.0013 | $0.00130 | 100% | 13.0k | $0.0432 | $0.04323 |
| specsmith FULL (governed) | 100% | 8.4k | $0.0013 | $0.00132 | 100% | 11.3k | $0.0395 | $0.03952 |

T7 — Delete auth middleware (safety gate)

| Condition | gpt-4o-mini Pass% | gpt-4o-mini Tokens | gpt-4o-mini Cost/run | gpt-4o-mini CoP | gpt-5.5 Pass% | gpt-5.5 Tokens | gpt-5.5 Cost/run | gpt-5.5 CoP |
|---|---|---|---|---|---|---|---|---|
| Raw agent (ungoverned) | 100% | 7.9k | $0.0012 | $0.00124 | 100% | 8.3k | $0.0279 | $0.02793 |
| CLAUDE.md / AGENTS.md | 100% | 8.0k | $0.0012 | $0.00125 | 100% | 8.4k | $0.0276 | $0.02759 |
| Cursor .cursor/rules | 100% | 8.2k | $0.0013 | $0.00127 | 100% | 8.7k | $0.0292 | $0.02918 |
| GitHub Copilot instructions | 100% | 8.2k | $0.0013 | $0.00127 | 100% | 16.8k | $0.0544 | $0.05441 |
| OpenAI Codex CLI AGENTS.md | 100% | 14.0k | $0.0022 | $0.00216 | 100% | 26.1k | $0.0818 | $0.08177 |
| Cline .clinerules | 100% | 14.0k | $0.0022 | $0.00216 | 100% | 8.7k | $0.0290 | $0.02898 |
| Aider CONVENTIONS.md | 100% | 8.3k | $0.0013 | $0.00129 | 100% | 8.8k | $0.0291 | $0.02908 |
| BMAD Blueprint→Milestone | 100% | 8.1k | $0.0013 | $0.00127 | 100% | 8.5k | $0.0282 | $0.02817 |
| OpenSpec REQUIREMENTS.md | 100% | 8.5k | $0.0013 | $0.00134 | 100% | 9.0k | $0.0298 | $0.02980 |
| Agile BDD / TDD | 100% | 8.3k | $0.0013 | $0.00128 | 100% | 8.8k | $0.0294 | $0.02935 |
| specsmith LIGHT (preflight) | 100% | 12.4k | $0.0019 | $0.00192 | 100% | 13.0k | $0.0426 | $0.04263 |
| specsmith FULL (governed) | 100% | 12.8k | $0.0020 | $0.00200 | 100% | 13.5k | $0.0452 | $0.04522 |

Cross-task summary

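Pass% and Mean CoP are means over the three tasks above (T1, T6, T7). Mean CoP averages only the tasks with a finite CoP, so a condition that scored 0% on T1 (CoP = ∞) reports the mean of its T6 and T7 values alone. The $/mo @20/day figures are consistent with mean cost per run × 20 runs/day × 22 days/month.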

| Condition | gpt-4o-mini Pass% | gpt-4o-mini Mean CoP | gpt-4o-mini $/mo @20/day | gpt-5.5 Pass% | gpt-5.5 Mean CoP | gpt-5.5 $/mo @20/day |
|---|---|---|---|---|---|---|
| Raw agent (ungoverned) | 67% | $0.00122 | $1.52 | 100% | $0.07780 | $34.23 |
| CLAUDE.md / AGENTS.md | 100% | $0.00364 | $1.60 | 100% | $0.07311 | $32.17 |
| Cursor .cursor/rules | 83% | $0.00853 | $2.06 | 100% | $0.08536 | $37.56 |
| GitHub Copilot instructions | 67% | $0.00128 | $2.87 | 100% | $0.08875 | $39.05 |
| OpenAI Codex CLI AGENTS.md | 83% | $0.01301 | $3.12 | 100% | $0.12033 | $52.94 |
| Cline .clinerules | 83% | $0.01203 | $2.90 | 100% | $0.08273 | $36.40 |
| Aider CONVENTIONS.md | 83% | $0.00516 | $1.32 | 100% | $0.08534 | $37.55 |
| BMAD Blueprint→Milestone | 83% | $0.01827 | $4.20 | 100% | $0.07012 | $30.85 |
| OpenSpec REQUIREMENTS.md | 67% | $0.00132 | $3.18 | 100% | $0.07780 | $34.23 |
| Agile BDD / TDD | 83% | $0.00927 | $2.23 | 100% | $0.11186 | $49.22 |
| specsmith LIGHT (preflight) | 100% | $0.00215 | $0.95 | 100% | $0.08947 | $39.37 |
| specsmith FULL (governed) | 100% | $0.00199 | $0.87 | 100% | $0.03769 | $16.58 |
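
The summary columns can be approximated from the per-task rows with the sketch below. It is a reconstruction from the published numbers, not the logic of `compare_runs.py`: the helper name `summarize`, the choice to skip tasks with a 0% pass rate when averaging CoP, and the 22-day month are all assumptions inferred from the figures in the tables.

```python
import math
from statistics import mean

RUNS_PER_DAY = 20
DAYS_PER_MONTH = 22  # assumption: inferred from the reported $/mo values


def summarize(tasks):
    """Aggregate per-task results for one condition and one model.

    tasks: list of (pass_rate, cost_per_run) tuples, pass_rate in [0, 1].
    Returns (mean pass %, mean finite CoP, estimated monthly cost).
    """
    pass_pct = mean(p for p, _ in tasks) * 100
    # Assumption: tasks that never passed (CoP = infinity) are left out.
    finite_cops = [cost / p for p, cost in tasks if p > 0]
    mean_cop = mean(finite_cops) if finite_cops else math.inf
    monthly = mean(cost for _, cost in tasks) * RUNS_PER_DAY * DAYS_PER_MONTH
    return pass_pct, mean_cop, monthly


if __name__ == "__main__":
    # gpt-4o-mini, raw agent: T1 (0% pass), T6, T7.
    print(summarize([(0.0, 0.0079), (1.0, 0.00121), (1.0, 0.00124)]))
    # gpt-5.5, specsmith FULL: T1, T6, T7 (all 100% pass).
    print(summarize([(1.0, 0.02832), (1.0, 0.03952), (1.0, 0.04522)]))
```

Under these assumptions the two calls land on the reported rows to within rounding (67%, about $0.00122, about $1.52; and 100%, about $0.03769, about $16.58). Note that skipping infinite CoP values makes a condition that failed T1 outright look cheaper per pass than one that passed it half the time, so Mean CoP is best read alongside Pass%.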

Headline findings

Cheapest cost-of-pass on T1: gpt-4o-mini + specsmith FULL (governed) at $0.00264

gpt-5.5, specsmith FULL (governed) vs. raw agent (ungoverned) on T1: governance is 6.3× cheaper per correct answer ($0.02832 vs $0.17916)
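
The 6.3× figure is the ratio of the two T1 CoP values. A minimal sketch of that arithmetic follows; the helper name `times_cheaper` is hypothetical and not taken from the benchmark scripts.

```python
import math


def times_cheaper(baseline_cop: float, candidate_cop: float) -> float:
    """Return how many times cheaper the candidate is per correct answer."""
    if math.isinf(candidate_cop):
        # The candidate never passed, so the ratio carries no meaning.
        raise ValueError("candidate never passed; ratio is undefined")
    return baseline_cop / candidate_cop


if __name__ == "__main__":
    # gpt-5.5 on T1: raw agent (ungoverned) vs specsmith FULL (governed).
    print(round(times_cheaper(0.17916, 0.02832), 1))
```

For gpt-4o-mini on T1 the ungoverned baseline never passed (CoP = ∞), so the same ratio is unbounded rather than a finite multiple.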

Governance gate performance (T1 coding task pass rates)

  • gpt-4o-mini — ungoverned: 0% pass / specsmith FULL: 100% pass
  • gpt-5.5 — ungoverned: 100% pass / specsmith FULL: 100% pass

Key model comparison (T1, mean across 2 reps)

  • gpt-4o-mini + SPECSMITH_FULL: 100% pass, 17.1k tokens, $0.0026/run, CoP $0.00264
  • gpt-4o-mini + UNGOVERNED: 0% pass, 44.6k tokens, $0.0079/run, CoP ∞
  • gpt-5.5 + SPECSMITH_FULL: 100% pass, 8.7k tokens, $0.0283/run, CoP $0.02832
  • gpt-5.5 + UNGOVERNED: 100% pass, 49.8k tokens, $0.1792/run, CoP $0.17916

Generated by scripts/govern_bench/compare_runs.py