Nicolas Bustamante

Model-Harness-Fit

· 34 min read

Why mixing a frontier model with a foreign harness quietly tanks performance, and what the open source code tells us about why.

I keep three coding agents alive on the same workstation. Claude Code in one terminal. Codex CLI in another. GitHub Copilot CLI in a third. Same files. Same git tree. Same bash. Three different harnesses that look indistinguishable.

A few weeks ago I ran the same prompt through all three and the behavior was visibly different in ways that went well past the surface differences of style and speed that I had expected to see across vendors. The Codex run cited a memory entry I had taught it months ago, applied the rule, and kept going without asking. The Claude Code run flagged the same context but refused to assert it without first verifying that the file path was still valid. The Copilot CLI run produced a longer, more cautious plan and asked me to approve it before taking any side effect on disk.

The hand wave answer is that "models behave differently because they are different models." But Copilot CLI was running Claude Opus, the same family that Claude Code runs by default. Same model family, same prompt, two harnesses, materially different output. The hand wave does not cover it.

Models are post trained against the harness, not just the API. The tool names they expect, the input schemas they emit, the citation tags they wrap around remembered facts, the file structure of skills they invoke, the planning protocol they follow when the harness says "make a plan first": none of these are generic capabilities of the model. They are byte level conventions baked into the post training of one specific model against one specific harness. Pull the model out of its harness and you give up performance you cannot get back without rewriting either side.

This has a direct consequence that anyone who has tried to ship a "model agnostic" agent has run into. You cannot just swap a model. Supporting BYOK and multi model (which is the responsible posture, since relying on a single provider is risky) adds real engineering complexity, and that complexity is worth paying. To swap a model cleanly, you have to swap the harness with it: the tool surface, the schema shapes, the skill bodies that name those tools, the citation contract, the memory ritual, the system prompt structure, sometimes the planning protocol. Everything above the model has to move when the model moves. That is why every agent vendor that supports multiple providers ends up either (a) running a degraded variant of every model they support, or (b) maintaining a separate full stack per model and exposing the choice to the user as "you are picking a product, not just a model." Option (b) is the path that wins on quality, and it is worth the engineering cost to avoid being locked into one lab.

Swapping orchestrators is not a cosmetic change. It is a model swap in disguise. The frontier lab spent the last year shaping the model's instincts to a particular tool surface, a particular memory ritual, a particular skill format. When you mix and match, you throw that work away.

I think this is the single most underrated constraint in agent design today, and it has a clean name. Call it model harness fit.

So how does this actually show up under the hood?

I dug into three open implementations that ship today:

  • Codex CLI (OpenAI). Fully open source at github.com/openai/codex; a Rust workspace of roughly 80 crates.
  • Claude Code (Anthropic). Closed binary, but a Rust port called claw-code at github.com/ultraworkers/claw-code tracks upstream behavior closely enough to read: ~48,600 LOC across 9 crates. Claude Code's own runtime also injects observable <system-reminder> blocks on every turn that confirm or contradict claims from the port.
  • GitHub Copilot CLI. The SDK is fully open source, MIT licensed, at github.com/github/copilot-sdk, with five language bindings (Node.js TypeScript at 5,208 LOC across 8 files, plus Python, Go, .NET, Java), and the JSON RPC wire protocol is documented at sdk-protocol-version.json (currently version 3). The @github/copilot CLI binary that the SDK spawns as the agent runtime server is closed, but the client wrapper, the protocol, the session lifecycle, the system prompt section overrides, and every RPC method are all open source and readable.

Here is what I will cover:

  • The Evidence: Terminal-Bench 2.0 — what the leaderboard actually shows about model harness pairs
  • Three Harnesses, Three Bets — SQ/EQ vs typed conversation loop vs JSON RPC supervisor
  • The Tool Surface — where post training is most visible
  • Skills Carry Tool Specs — why "same SKILL.md format" does not mean "interchangeable"
  • The Memory Layer — synchronous live writes vs deferred batch vs server side, and why the citation tag matters
  • The Citation Discipline — how the model talks back to the harness
  • The System Prompt Skeleton — ten section IDs is a contract
  • The Routing Reality — what GitHub Copilot CLI is actually doing about all this
  • Mid-Chat Model Switching — the cleanest concrete failure mode
  • What the Labs Are Saying — Cursor, Anthropic, and LangChain all converging on the same framing
  • The Identity File Convention — CLAUDE.md, AGENTS.md, SOUL.md, USER.md, and what each one is for
  • What This Means — the model is no longer the moat alone, and the matched pair shifts as the model matures

Companion piece: I covered the memory layer in detail at Agent Memory Engineering. This article is about everything else, with memory revisited only where it intersects orchestration. If you want the bottom up tour of how MEMORY.md indexes, system reminder injection, age in days warnings, and signal gates work, read that one first.


The Evidence: Terminal-Bench 2.0

Before any argument about architecture, look at the leaderboard. Terminal-Bench 2.0 evaluates agents on bash heavy multi step tasks, and it ranks by harness plus model pair, not by model alone. From tbench.ai/leaderboard/terminal-bench/2.0 on April 30, 2026:

TERMINAL-BENCH 2.0 TOP PAIRS         PASS RATE
======================================
Codex + GPT-5.5                       82.0%
ForgeCode + GPT-5.4                   81.8%
TongAgents + Gemini 3.1 Pro           80.2%
ForgeCode + Claude Opus 4.6           79.8%
SageAgent + GPT-5.3-Codex             78.4%
ForgeCode + Gemini 3.1 Pro            78.4%
Droid + GPT-5.3-Codex                 77.3%
Capy + Claude Opus 4.6                75.3%
Simple Codex + GPT-5.3-Codex          75.1%
Terminus-KIRA + Gemini 3.1 Pro        74.8%

Two things jump out. First, Claude Opus 4.6 paired with ForgeCode hits 79.8%, while the same model paired with Capy hits 75.3%. Same weights, different harness, and a 4.5 percentage point spread between them on a benchmark where every entry is fighting for a tenth of a point.
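The spread is easy to recompute from the table itself. The sketch below groups the quoted leaderboard rows by model and reports the best-minus-worst pass rate across harnesses; all numbers come from the table above, and the code is purely illustrative.

```python
# Recompute the model-vs-harness spread from the leaderboard rows quoted above.
LEADERBOARD = [
    ("Codex", "GPT-5.5", 82.0),
    ("ForgeCode", "GPT-5.4", 81.8),
    ("TongAgents", "Gemini 3.1 Pro", 80.2),
    ("ForgeCode", "Claude Opus 4.6", 79.8),
    ("SageAgent", "GPT-5.3-Codex", 78.4),
    ("ForgeCode", "Gemini 3.1 Pro", 78.4),
    ("Droid", "GPT-5.3-Codex", 77.3),
    ("Capy", "Claude Opus 4.6", 75.3),
    ("Simple Codex", "GPT-5.3-Codex", 75.1),
    ("Terminus-KIRA", "Gemini 3.1 Pro", 74.8),
]

def harness_spread(rows):
    """Per model: best pass rate minus worst pass rate across harnesses."""
    by_model = {}
    for harness, model, score in rows:
        by_model.setdefault(model, []).append(score)
    return {m: round(max(s) - min(s), 1) for m, s in by_model.items() if len(s) > 1}

print(harness_spread(LEADERBOARD))
# {'Gemini 3.1 Pro': 5.4, 'Claude Opus 4.6': 4.5, 'GPT-5.3-Codex': 3.3}
```

Every model that appears more than once moves by multiple points depending on the harness it is paired with.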

Second, the upper rankings are not dominated by the labs that trained the models. ForgeCode is a third party harness that lands three of the top six entries by routing across model families. Stanford's IRIS Lab paired Opus 4.6 with an automated harness evolution system called Meta-Harness and pushed the same model to 76.4% on the same benchmark, well past the best baseline they started from. The harness is moving the score by more than the model upgrades are moving it.

Cursor's research team makes the point even sharper. In their April 30 post on harness engineering, they note that they took their own coding agent from "Top 30 to Top 5 on Terminal Bench 2.0 by only changing the harness." Same model. Same benchmark. Different scaffolding. A 25-position jump on a public leaderboard, attributable to the harness alone. That is not a tuning artifact. That is the entire ranking.

LangChain's Vivek Trivedy puts the same observation in one sentence: "Opus 4.6 in Claude Code scores far below Opus 4.6 in other harnesses." Anthropic's flagship model in Anthropic's flagship harness loses to the same weights in third party scaffolding. If you only saw the model name on the spec sheet, you would not predict that.

This is the empirical case for model harness fit. Hold the model fixed and swap the harness, and the pass rate moves by enough to outweigh a model generation upgrade. Anyone shipping a coding agent in 2026 who picks the model first and the harness second is leaving most of the performance on the floor.

The rest of this article is about why. What exactly does the harness do that lets two implementations of the same model produce different scores?


Three Harnesses, Three Bets

Each harness picks a different orchestration protocol. The model was trained on that protocol's exact wire format.

CODEX CLI                       CLAUDE CODE                    GITHUB COPILOT CLI
==========                      ==========                     ==========

Submission Queue / Event Queue  Direct conversation loop       JSON RPC supervisor
in process                      in process                     out of process

  Submission { id, op, trace }    run_turn(input):               session.create
    ↓                              build ApiRequest               ↓
  spawn_task                       stream events                spawn @github/copilot
    ↓                              fold into AssistantMsg         ↓
  RegularTask::run                 push to session              vscode-jsonrpc over stdio
    ↓                              for each ToolUse:              ↓
  emits Event { id, msg }           pre hook                    inbound: tool.call,
    ↓                               permission                  permission.request,
  EventMsg includes                 execute                     hooks.invoke,
  TurnStarted, TurnAborted,         post hook                   userInput.request,
  output deltas, tool calls,        push tool result            systemMessage.transform
  approval requests              loop until no tool calls         ↓
                                   auto compact check           outbound: session.event,
                                                               session.lifecycle

These are not three implementations of the same idea. They are three different contracts between model and runtime.

Codex is a typed asynchronous protocol. The model emits a Submission with an Op and gets back a stream of typed Event messages. The protocol is defined at codex-rs/protocol/src/protocol.rs with explicit #[serde(tag = "type", rename_all = "snake_case")] enums. There is a second protocol layered on top: app-server-protocol/src/protocol/v2.rs is 10,721 lines of JSON RPC for cross process clients (IDE plugin, desktop app), where v1 (245 lines) is frozen and all new RPCs go to v2. Methods are named <resource>/<method> with singular resource names, camelCase wire format. The two protocols stack: agent layer for in process, JSON RPC layer for cross process. The model was trained to emit submissions and consume events.
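What serde's internally tagged convention produces on the wire is worth seeing concretely. The sketch below mimics the `#[serde(tag = "type", rename_all = "snake_case")]` shape in Python: the variant name becomes a `type` field next to the payload. The event names beyond those quoted above (TurnStarted, TurnAborted, the deltas) are illustrative stand-ins, not the real protocol.rs variants.

```python
import json

# Sketch of the serde tagged-enum convention: the variant name becomes a
# "type" field, renamed to snake_case, alongside the variant's payload.
def tag(variant: str, **payload) -> dict:
    return {"type": variant, **payload}

# An outbound submission (field names beyond id/op follow the article,
# not the real protocol.rs).
submission = {"id": "sub-1", "op": tag("user_turn", items=["fix the failing test"])}

# Inbound events arrive as a stream of tagged messages.
events = [
    tag("turn_started"),
    tag("agent_message_delta", delta="Running tests..."),
    tag("turn_aborted", reason="interrupted"),
]

wire = json.dumps(submission)
assert json.loads(wire)["op"]["type"] == "user_turn"

def handle(event: dict) -> str:
    # Dispatch on the tag, the way a typed consumer matches the enum.
    return event["type"]

print([handle(e) for e in events])  # ['turn_started', 'agent_message_delta', 'turn_aborted']
```

The point is that the model emits and consumes this exact tagged shape; a harness expecting a different discriminator field gets valid JSON that dispatches to nothing.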

Claude Code is a direct typed conversation loop. The runtime's ConversationRuntime::run_turn consumes a Vec<AssistantEvent> per turn from ApiClient::stream. AssistantEvent variants are TextDelta(String), ToolUse { id, name, input }, Usage(TokenUsage), PromptCache(PromptCacheEvent), and MessageStop. There is no separate submission queue. The protocol is the Anthropic Messages API plus a tight in process tool dispatcher. The model was trained to emit tool calls inside an assistant message and respond to tool results in the next turn.
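The loop itself is small. Here is a minimal Python sketch of a `run_turn`-style loop: stream assistant events, fold text deltas into one message, dispatch tool calls, and repeat until the model emits a turn with no tool calls. The event kinds mirror the variants named above; the fake stream and tool table are invented for illustration.

```python
def run_turn(stream_fn, tools, history):
    while True:
        text, tool_uses = [], []
        for kind, payload in stream_fn(history):
            if kind == "text_delta":
                text.append(payload)
            elif kind == "tool_use":
                tool_uses.append(payload)
            elif kind == "message_stop":
                break
        history.append({"role": "assistant", "text": "".join(text)})
        if not tool_uses:
            return history  # no tool calls: the turn is over
        for call in tool_uses:
            result = tools[call["name"]](**call["input"])
            history.append({"role": "tool_result", "id": call["id"], "content": result})

def fake_stream(history):
    # Pretend model: first pass reads a file, second pass answers.
    if not any(m["role"] == "tool_result" for m in history):
        yield ("text_delta", "Let me check. ")
        yield ("tool_use", {"id": "t1", "name": "read_file", "input": {"path": "a.txt"}})
        yield ("message_stop", None)
    else:
        yield ("text_delta", "Done.")
        yield ("message_stop", None)

tools = {"read_file": lambda path: f"<contents of {path}>"}
out = run_turn(fake_stream, tools, [{"role": "user", "text": "what is in a.txt?"}])
print(out[-1]["text"])  # Done.
```

Everything — tool dispatch, permission hooks, compaction — hangs off this one loop, which is exactly the "one agent loop" commitment discussed below.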

GitHub Copilot CLI is a supervisor protocol. The host app does not run the agent loop. It spawns the bundled @github/copilot binary as a subprocess, opens a vscode-jsonrpc channel over stdio, and sends session.create with the full configuration: model, system message, tools, MCP servers, custom agents, skill directories, hook flags. The agent loop runs inside the child process. The host gets session.event notifications back. The model was trained to run inside this supervisor and emit JSON RPC events that the supervisor can route.
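The supervisor relationship can be sketched as a JSON RPC host: build a session.create request, then answer whatever the child process asks. The method names follow the article; the payload fields and handler logic are illustrative, not the real SDK.

```python
# Sketch of the host side of the supervisor protocol. The host does not run
# the agent loop; it configures the session and routes inbound methods.
def make_request(rpc_id, method, params):
    return {"jsonrpc": "2.0", "id": rpc_id, "method": method, "params": params}

session_create = make_request(1, "session.create", {
    "model": "claude-opus",                      # illustrative model id
    "systemMessage": {"identity": "You are a coding agent."},
    "tools": ["view", "grep", "write_bash"],
})

# Inbound requests from the agent subprocess, routed by method name.
handlers = {
    "permission.request": lambda p: {"granted": p["tool"] in {"view", "grep"}},
    "userInput.request": lambda p: {"answer": "yes"},
}

def route(message):
    handler = handlers.get(message["method"])
    if handler is None:
        return {"error": f"unhandled method {message['method']}"}
    return handler(message["params"])

inbound = make_request(2, "permission.request", {"tool": "write_bash"})
print(route(inbound))  # {'granted': False}
```

The inversion is the point: the agent loop lives in the child, and the host's only job is to answer permission, input, and hook requests and consume session.event notifications.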

You can see the architectural commitment harden in each design. Codex's AGENTS.md literally polices crate growth: "Resist adding code to codex-core. The largest crate is explicitly off limits for new features." A 500 line soft cap, 800 line hard cap per Rust module. New features pay rent in the form of a new crate. This is a compiler toolchain attitude applied to an agent harness, and the model was trained to operate inside it. Claude Code's port enforces a different rule: "one agent loop, not a fan out of specialized agents," which is why subagents in Claude Code start with a fresh context and cannot recurse. Copilot CLI's supervisor model is what lets a single binary serve three surfaces (terminal, cloud agent, third party hosts). Each surface gets the same model behavior because the model is always running inside the same supervisor.

Now imagine you swap models. Take a model trained to emit Submission { id, op: UserTurn, trace } and feed it Claude Code's Vec<AssistantEvent> stream. The model has been taught one wire shape. The harness expects another. The mismatch shows up not as an outright failure but as a quiet degradation: missed tool calls, wrong reasoning effort levels, inconsistent compaction triggers, citation tags that the harness never parses. The wire format is part of the model.


The Tool Surface

This is where post training is most visible.

Every harness has a tool registry. The names look similar at the top: read, write, bash, grep, glob. But once you go past the first six, the surfaces diverge in ways that the model has been taught to exploit.

Codex tool surface

Codex's tools/src/lib.rs exposes a particular vocabulary:

  • apply_patch — Codex's custom diff format. Two flavors: a freeform Lark grammar at tool_apply_patch.lark and a JSON variant. The model was trained to emit patches in this format. It is not interchangeable with Claude Code's edit_file (which takes old_string / new_string).
  • local_shell — the bash family. Plus exec_command_tool and write_stdin_tool for long lived processes that the model can drive with stdin writes after the fact.
  • update_plan_tool — the plan/todo tool. A model not trained on this tool will use a different convention to track work.
  • request_permissions_tool — model can request expanded permissions mid turn. Codex is the only harness with this exact verb.
  • agent_tool — multi agent orchestration with spawn_agent_v1, spawn_agent_v2, wait_agent_v1, wait_agent_v2, send_message, close_agent_v1, close_agent_v2, resume_agent. Eight verbs. The model knows all eight.
  • tool_search, tool_suggest — tools that find other tools. Codex's answer to deferred tool loading.
  • goal_tool — create_goal, get_goal, update_goal. Tied to migration 0029_thread_goals.sql.

Claude Code tool surface

Claude Code's port enumerates 40 specs in mvp_tool_specs():

  • read_file, write_file, edit_file — lower case names internally, surfaced to the model as CamelCase (Read, Write, Edit). The model was trained on the CamelCase variant.
  • Edit requires old_string, new_string, optional replace_all. Not the same shape as Codex's apply_patch.
  • Bash has the deepest sandbox surface: command, timeout, description, run_in_background, dangerouslyDisableSandbox, namespaceRestrictions, isolateNetwork, filesystemMode. The model knows when to set run_in_background: true and pair it with the Monitor tool.
  • Skill and ToolSearch — the lazy load primitives.
  • Agent — single tool for subagent dispatch. Takes description, prompt, optional subagent_type, optional model. The post training has the model emit short imperative descriptions for these.
  • EnterPlanMode / ExitPlanMode — both WorkspaceWrite permission. Toggles a worktree local override.
  • EnterWorktree / ExitWorktree — wrap git worktree add for subagent isolation.
  • Monitor — streams stdout from a background process. Pairs with Bash run_in_background: true. The model knows this pattern; Codex does not have it.
  • TodoWrite — the workflow scaffolding tool. The model writes {content, activeForm, status} triplets in a particular pattern.
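The background-shell pattern from the list above can be written out as the two tool calls the model emits in sequence. Argument names come from the list; the `bash_id` handle, the id values, and the dev-server command are hypothetical.

```python
# First call: Bash with run_in_background, which detaches the process.
start_server = {
    "name": "Bash",
    "input": {
        "command": "npm run dev",             # illustrative command
        "run_in_background": True,            # detach; output goes to a buffer
        "timeout": 120_000,
        "description": "Start the dev server in the background",
    },
}

# Second call: Monitor streams stdout from that background process.
watch_server = {
    "name": "Monitor",
    "input": {"bash_id": "bash_1"},           # hypothetical process handle
}

assert start_server["input"]["run_in_background"] is True
print([call["name"] for call in (start_server, watch_server)])  # ['Bash', 'Monitor']
```

A model that has not been trained on this pairing will either block on the long-running command or poll with repeated shell calls instead.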

GitHub Copilot CLI tool surface

Copilot CLI bundles a different default, drawn from the public changelog:

  • grep, glob (bundled ripgrep), view, view_range — file reading with explicit range params.
  • web_fetch — built in (v0.0.374). Rejects file:// URLs.
  • read_bash, write_bash, stop_bash — three verb interactive shell control.
  • task — subagent dispatch with depth and concurrency limits.
  • read_agent, write_agent — multi turn subagent control. A different shape from Codex's six verb agent surface.
  • ask_user — interactive clarification.
  • store_memory — persistent memory tied to a remote backend. Memory is not local files here.
  • apply_patch — included specifically when serving Codex models. A different patch toolchain than Codex's own.
  • create_pull_request, sql, exit_plan_mode, list_copilot_spaces, show_file.

SAME CONCEPT, DIFFERENT SHAPE
==============================

                    CODEX               CLAUDE CODE           COPILOT CLI

Edit a file         apply_patch         Edit                  apply_patch (Codex
                    (custom Lark         (old_string,          models only); else
                    grammar OR           new_string,           write_bash for sed
                    JSON)               replace_all)          editing

Run a command       local_shell         Bash                  read_bash /
                    +                   +                     write_bash /
                    exec_command        Monitor               stop_bash
                    +                   (background           (three verb
                    write_stdin         streaming)            interactive)

Subagent dispatch   eight verb          single                two verb
                    (spawn/wait/        Agent tool            (read_agent /
                    send/close/                               write_agent)
                    resume,                                   plus task
                    v1 + v2 each)

Plan mode           update_plan_tool    EnterPlanMode +       exit_plan_mode
                    (model writes       ExitPlanMode          (model writes
                    plan inline)        (worktree state       plan inline,
                                        toggle)               approval dialog)

Memory write        offline,            Write/Edit on         store_memory
                    no live tool        per file .md          (remote backend)

A model trained on Codex's eight verb subagent surface knows how to send a message to a running subagent. A model trained on Claude Code's Agent tool does not have that verb in its instinct set. The harness can paper over this with a router, but the router cannot give the model an instinct it does not have.
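To make the "same edit, three dialects" row of the table concrete, here is one one-line edit expressed as each harness's tool call. The Edit shape (old_string/new_string) and the apply_patch envelope follow the article; the exact patch syntax shown is a simplified stand-in for Codex's Lark grammar, and the file path is invented.

```python
claude_edit = {
    "name": "Edit",
    "input": {
        "file_path": "src/config.py",
        "old_string": "DEBUG = True",
        "new_string": "DEBUG = False",
        "replace_all": False,
    },
}

codex_patch = {
    "name": "apply_patch",
    "input": {
        "patch": (
            "*** Begin Patch\n"
            "*** Update File: src/config.py\n"
            "-DEBUG = True\n"
            "+DEBUG = False\n"
            "*** End Patch"
        ),
    },
}

# Copilot CLI, serving a non-Codex model, falls back to shell editing.
copilot_sed = {
    "name": "write_bash",
    "input": {"command": "sed -i 's/DEBUG = True/DEBUG = False/' src/config.py"},
}

assert "old_string" in claude_edit["input"]
assert codex_patch["input"]["patch"].startswith("*** Begin Patch")
```

Three payloads, one intent. A model post trained on one of these shapes pays a tax every time it is forced to emit a different one.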

Cursor's harness team puts the underlying mechanic plainly. From their April 30 research post: "OpenAI's models are trained to edit files using a patch-based format, while Anthropic's models are trained on string replacement. Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes. So in our harness, we provision each model with the tool format it had during training." This is the single cleanest description of model harness fit I have seen from any vendor, and it is not a hand wave about model preferences but a specific measurable cost in reasoning tokens paired with an observable increase in error rate, recorded at scale across millions of agent turns in production.

This is where model harness fit shows up most visibly. The tool surface is the model's vocabulary for the world. Cross train on a different vocabulary and you lose precision in every interaction.


Skills Carry Tool Specs

Skills look interchangeable on the surface. All three harnesses use a SKILL.md file with YAML frontmatter (name, description, optional metadata). Codex even baked in cross compat: core-skills parses Claude style markdown skills. Copilot CLI explicitly reads .claude/ config. The format is so similar that the same SKILL.md body would parse in all three.

But skills are not just markdown. A skill carries an implicit contract about which tools it expects to call. That contract is not in the frontmatter. It is embedded in the body, in the form of imperative instructions that name specific tools by name, with specific argument shapes, and with specific verbs the model must emit.

Look at what each harness ships as a system skill.

Codex's bootstrap skills, baked in via include_dir! and extracted to $CODEX_HOME/skills/.system on first launch, are five: imagegen, openai-docs, plugin-creator, skill-creator, skill-installer. The skill-installer body invokes list-skills.py and install-skill-from-github.py as scripts (codex-rs/skills/src/assets/samples/skill-installer/SKILL.md). It assumes the model can call bash to run a Python script. It assumes the model knows that scripts in scripts/ of a skill folder are invokable. It assumes a sparse checkout fallback for private repos. None of that is in the frontmatter. All of it is in the body.

Claude Code's skills are different. The superpowers plugin ships superpowers:test-driven-development, superpowers:writing-plans, superpowers:verification-before-completion, superpowers:requesting-code-review, superpowers:dispatching-parallel-agents, plus many more. The bodies invoke Claude's specific tools: Skill to bootstrap into a workflow, TodoWrite to track steps, Agent to dispatch parallel subagents, Read / Edit for file changes, Grep / Glob for search. The skills also encode hard process rules: "Use this BEFORE any creative work," "Use when about to claim work is complete." These rules anchor on the harness's <system-reminder> injection model, which Codex does not have in the same form.

Copilot CLI's skills are part of the plugin marketplace ecosystem, and the changelog reveals a different posture. v1.0.5 added "Embedding based dynamic retrieval of MCP and skill instructions per turn" as experimental. The model was trained to consume skill instructions delivered as a per turn injection chosen by an embedding ranker, rather than as a description match. A skill body that assumes "you will see all skills in the system reminder" does not behave the same way when the harness ranks skills via embedding and only injects the top three.
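A toy sketch of per turn skill injection by embedding similarity: score each skill description against the user's turn and inject only the top-k bodies. The bag-of-words "embedding" here is a stand-in for a real embedding model, and the skill descriptions are paraphrases, not the shipped bodies.

```python
from collections import Counter
import math

def embed(text):
    # Stand-in for a real embedding: token counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

SKILLS = {
    "test-driven-development": "write failing tests first then implement",
    "writing-plans": "draft an implementation plan before coding",
    "requesting-code-review": "ask for review before merging changes",
}

def skills_for_turn(user_turn, k=2):
    q = embed(user_turn)
    ranked = sorted(SKILLS, key=lambda s: cosine(q, embed(SKILLS[s])), reverse=True)
    return ranked[:k]  # only these bodies get injected this turn

print(skills_for_turn("please write tests for the parser first"))
```

The consequence for harness fit: whether a skill is visible at all now depends on the ranker, so a skill body written on the assumption "the model always sees me" behaves differently under this delivery model.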

SAME SKILL.MD, DIFFERENT BEHAVIORS
===================================

A skill body that says:
  "Step 1: call TodoWrite with a list of three subtasks.
   Step 2: dispatch each subtask via Agent with subagent_type='Explore'.
   Step 3: wait for results, then verify with Bash before claiming done."

In Claude Code:
  TodoWrite is a built in. Agent is a built in. subagent_type='Explore'
  has a specific allowed tool set the harness enforces. Bash returns
  truncated output the model knows how to parse.
  ✅ Works as designed.

In Codex:
  TodoWrite does not exist. The closest is update_plan_tool, but the
  schema differs. Agent does not exist with that exact shape.
  Codex has spawn_agent_v1 / v2. The Bash equivalent is local_shell.
  ❌ Skill silently fails or executes a degraded version.

In Copilot CLI:
  TodoWrite as a verb does not exist. Agent dispatch is via task or
  read_agent / write_agent. The Bash equivalent is the three verb
  read_bash / write_bash / stop_bash. Plus skills may not even be
  loaded into the prompt this turn (embedding ranker decided).
  ❌ Skill is invisible or executes against a non existent surface.

This is why "we both use SKILL.md" is misleading. The format is identical; the contract underneath is not. Skills carry tool specs implicitly, and the implicit specs are pinned to the harness that authored them.

The same applies to plugin manifests. Copilot CLI's v1.0.22 explicitly added: "Plugins using .claude-plugin/ or .plugin/ manifest directories now load their MCP and LSP servers correctly." That is GitHub treating Claude Code's plugin format as a substrate to interoperate with at the file level. But the skills inside those plugins still bring assumptions about Claude Code's tool surface. Loading the file does not give the model the right vocabulary.

The lesson generalizes. A skills marketplace that claims to be cross harness is a routing problem, not just a parsing problem. Each skill needs to either declare its target harness explicitly, or get rewritten per harness, or run inside a router that translates tool calls between dialects. None of these are free.
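The router option can be sketched to show exactly where it breaks. The name mappings below follow the comparison table earlier in the article; the argument handling is deliberately partial, because that incompleteness is the point — names translate, schemas and missing verbs do not.

```python
# Sketch of a tool-call router between dialects: map a Claude-dialect call
# onto the nearest Codex verb, or fail loudly instead of silently.
CLAUDE_TO_CODEX = {
    "Bash": "local_shell",
    "TodoWrite": "update_plan_tool",
    "Agent": "spawn_agent_v2",
}

def translate(call):
    target = CLAUDE_TO_CODEX.get(call["name"])
    if target is None:
        raise KeyError(f"no Codex equivalent for {call['name']}")
    # The name maps; the schema may not. A real router must also rewrite
    # arguments (e.g. TodoWrite triplets into update_plan_tool's plan shape).
    return {"name": target, "input": call["input"]}

print(translate({"name": "Bash", "input": {"command": "ls"}})["name"])  # local_shell

try:
    translate({"name": "Monitor", "input": {"bash_id": "b1"}})
except KeyError as e:
    print(e)  # Codex has no background-stream verb to route to
```

A router can rename calls all day; it cannot invent a target verb that the destination harness does not have, and it cannot give the model the instinct to use one.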


The Memory Layer

I covered memory in detail in Agent Memory Engineering, so I will keep this section to the parts that matter for harness fit.

Three memory architectures, three different bets:

HERMES                          CLAUDE CODE                    CODEX
==========                      ==========                     ==========

Synchronous live writes,        Synchronous live writes,       Deferred batch writes,
frozen snapshot at session      one .md file per memory,       two phase pipeline
start, no decay, char cap       always loaded MEMORY.md        triggered by 6+ hour
                                index, body read on            idle, gpt 5.4 mini
                                demand with age in days        extracts, gpt 5.4
                                <system-reminder>              consolidates against
                                                              git baseline


GITHUB COPILOT CLI
==========
Server side memory backend, store_memory tool, per repo,
remote retrieval, hangs the agent if the backend is unavailable
(v1.0.23 specifically fixed: "Agent no longer hangs on the first
turn when the memory backend is unavailable")

The architectural choices already differ. But the harness fit story is sharper than that. Each model was trained to write memory using a specific tool with a specific schema, and to cite memory using a specific tag with a specific format.

Codex's model writes a structured raw memory artifact via Phase 1 extraction with a strict JSON schema:

---
description: concise but information dense description
task: <primary_task_signature>
task_group: <cwd_or_workflow_bucket>
task_outcome: <success|partial|fail|uncertain>
cwd: <single best primary working directory>
keywords: k1, k2, k3
---

The Phase 2 consolidation prompt is 841 lines. additionalProperties: false. Schema validation rejects malformed output at parse time. The model citations are wrapped in <oai-mem-citation> blocks. The harness has a parser at citations.rs:6-43 that increments usage_count in the SQLite state DB whenever a citation arrives. This is the model's memory ritual. Strip the citation tag and the harness loses its decay signal.
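Strict-schema validation in the spirit of `additionalProperties: false` is easy to sketch: reject any raw-memory frontmatter with unknown or missing keys. The required key set comes from the frontmatter template above; the validator itself is illustrative, not Codex's actual code.

```python
REQUIRED = {"description", "task", "task_group", "task_outcome", "cwd", "keywords"}
OUTCOMES = {"success", "partial", "fail", "uncertain"}

def validate_raw_memory(fm: dict) -> list[str]:
    errors = []
    if missing := REQUIRED - fm.keys():
        errors.append(f"missing keys: {sorted(missing)}")
    if extra := fm.keys() - REQUIRED:
        # additionalProperties: false — unknown keys are a hard reject
        errors.append(f"unknown keys rejected: {sorted(extra)}")
    if fm.get("task_outcome") not in OUTCOMES:
        errors.append("task_outcome must be one of " + "|".join(sorted(OUTCOMES)))
    return errors

good = {
    "description": "pytest needs -p no:cacheprovider on this repo",  # invented example
    "task": "run-tests",
    "task_group": "~/work/api",
    "task_outcome": "success",
    "cwd": "~/work/api",
    "keywords": "pytest, cache, flags",
}
assert validate_raw_memory(good) == []
assert validate_raw_memory({**good, "mood": "great"})  # extra key -> rejected
```

A Claude trained model writing free-form **Why:** / **How to apply:** prose into this pipeline fails at parse time, which is the polite version of the mismatch; the impolite version is the silent one described below.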

Claude Code's model writes memory using the standard Write and Edit tools, into one file per memory under ~/.claude/projects/<encoded-cwd>/memory/. There is no separate memory tool. The model picks one of four types (user, feedback, project, reference) by file name prefix. The body uses a **Why:** / **How to apply:** convention for behavioral rules. The harness wraps every body read in a <system-reminder> block with the dynamic age in days and a verification reminder. The model was trained to read memory through that wrapper, weight it accordingly, and skip stale claims.
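The read-side wrapper can be sketched too: a memory body is never shown bare, it arrives inside a `<system-reminder>` carrying the file's age in days plus the staleness warning. The timestamps, file name, and exact wording here are illustrative.

```python
def wrap_memory_read(path: str, body: str, mtime: int, now: int) -> str:
    age_days = int((now - mtime) / 86400)
    return (
        "<system-reminder>\n"
        f"Memory file {path} (written {age_days} days ago). "
        "Records can become stale over time. Verify before recommending.\n"
        f"{body}\n"
        "</system-reminder>"
    )

# Hypothetical memory body in the **Why:** / **How to apply:** convention.
body = "**Why:** CI uses Node 20.\n**How to apply:** Pin engines.node before release."
now = 1_700_000_000
wrapped = wrap_memory_read("project-node-version.md", body, now - 45 * 86400, now)
assert "45 days ago" in wrapped
print(wrapped.splitlines()[1])
```

The model was trained to see memory through exactly this frame, which is why it discounts old entries and re-verifies paths; strip the wrapper and that calibration has nothing to anchor on.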

Copilot CLI's model invokes store_memory as a dedicated tool. The body of the memory goes to a remote backend. Cross session memory was added in v0.0.412 as experimental. The retrieval surface is a server side query, not a local grep. The model expects the backend to be there. When the backend is unavailable (v1.0.23 fix), the agent used to hang on the first turn. That is a load bearing dependency.

Now mix and match. Run a Codex trained model on Claude Code's harness. The model will look for a memory write tool, find Write, and write a file — but it will write a file in Codex's structured format, with task_group: headers and cwd: annotations, into a directory that Claude Code does not auto load on the next session. The harness does not know to inject the index. The next session does not see the memory. And critically, the model will emit <oai-mem-citation> blocks that Claude Code never parses. Memory effectively does not exist on the next turn.

Run a Claude trained model on Codex's harness. The model will not emit citation tags. Codex's usage_count decay signal stops incrementing. Memories that were used silently rank below memories that were not used, because the harness sees zero citations. Within a few weeks, the wrong memories are getting evicted.

Run either on Copilot CLI's harness with the remote backend. The model's local file instincts do not transfer. The store_memory tool is the only path, the schema is different, and the cross session retrieval is keyword search against a server, not the always loaded index plus on demand body read pattern that the model was trained on. The first turns will look fine because the model has memory shaped instincts. The retention will be different.

The memory layer is the densest collision surface for model harness fit. Tools, schemas, citation tags, decay signals, retrieval rituals — all of these are coupled, all of these were learned together during post training, and none of them transfer cleanly when you swap one side.


The Citation Discipline

The <oai-mem-citation> tag is a microcosm of the larger problem.

Codex's model emits a small XML block at the end of an assistant message whenever it pulled in memory:

<oai-mem-citation thread_id="xyz" raw_memory_id="abc">
this entry: <description string>
</oai-mem-citation>

The harness has a parser that strips the block before showing the assistant message to the user, and uses the parsed thread_id to bump usage_count and last_usage columns in stage1_outputs. The parser is at citations.rs:6-43. The SQL is in migration 0016_memory_usage.sql:

ALTER TABLE stage1_outputs ADD COLUMN usage_count INTEGER;
ALTER TABLE stage1_outputs ADD COLUMN last_usage INTEGER;

UPDATE stage1_outputs
SET
    usage_count = COALESCE(usage_count, 0) + 1,
    last_usage = ?
WHERE thread_id = ?

This is the model's contract with the harness. Cite what you used. The harness will reward what you cited by keeping it alive. The Phase 2 consolidator ranks memories by usage_count and decays anything with no citations and no fresh source_updated_at after 30 days.
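The whole contract fits in a few lines: strip `<oai-mem-citation>` blocks from the assistant text before display, and bump usage_count/last_usage per cited thread_id using the columns from the migration above. The regex and table layout are simplified stand-ins for the real citations.rs parser, and the message body is invented.

```python
import re
import sqlite3
import time

CITATION = re.compile(
    r'<oai-mem-citation thread_id="([^"]+)" raw_memory_id="[^"]+">.*?</oai-mem-citation>',
    re.DOTALL,
)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stage1_outputs (thread_id TEXT, usage_count INTEGER, last_usage INTEGER)")
db.execute("INSERT INTO stage1_outputs VALUES ('xyz', NULL, NULL)")

def process_assistant_message(text: str) -> str:
    for thread_id in CITATION.findall(text):
        db.execute(
            "UPDATE stage1_outputs SET usage_count = COALESCE(usage_count, 0) + 1, "
            "last_usage = ? WHERE thread_id = ?",
            (int(time.time()), thread_id),
        )
    return CITATION.sub("", text).strip()  # the user never sees the tag

msg = (
    'Use --frozen-lockfile here.\n'
    '<oai-mem-citation thread_id="xyz" raw_memory_id="abc">\n'
    'this entry: lockfile rule for this repo\n'
    '</oai-mem-citation>'
)
print(process_assistant_message(msg))  # Use --frozen-lockfile here.
count = db.execute("SELECT usage_count FROM stage1_outputs WHERE thread_id='xyz'").fetchone()[0]
assert count == 1
```

Run the same message through a harness with no such parser and both halves fail at once: the XML leaks into the user-visible text and the usage counter never moves.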

Claude Code's model has no equivalent citation tag. The harness does not need one because memory is read via the standard Read tool, and the agent's verification grep is what doubles as the "I used this" signal. The reminder text in front of every body read explicitly tells the model: "Records can become stale over time. Verify before recommending." There is no decay loop because the harness assumes the user will prune or the verification will fail in place.

Copilot CLI's model talks to a remote memory backend. The store, retrieve, and rank logic is server side. The model does not need a citation tag because the backend tracks reads on its own.

Now look at what happens in a cross harness run.

TRAINING DISTRIBUTION VS RUNTIME
================================

Codex model on Codex harness
  post training included 12,000 example turns
  with <oai-mem-citation> blocks
  → runtime parses <oai-mem-citation>, bumps usage_count
  ✅ memory ranked and decayed correctly

Codex model on Claude Code harness
  model emits <oai-mem-citation> in the assistant text
  → runtime ignores the tag, leaves it in the text
    shown to the user
  ❌ user sees raw XML
  ❌ no decay signal reaches Claude Code's memory

Claude trained model on Codex harness
  model uses memory inline, never wraps it in a tag
  → runtime sees no citation tag, usage_count never bumps
  ❌ Codex's decay loop evicts good memories
    because they look unused

A single XML tag becomes the difference between a memory system that improves with use and one that degrades silently.

This is what I mean by "the wire format is part of the model." The citation tag is not a feature on a roadmap. It is a habit the model picked up during post training, and that habit only pays off inside the harness that taught it.
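The runtime side of that habit is small. A minimal sketch, assuming the memory id travels in the tag body (the tag name is from the transcript above; the id-in-body convention and both function names are my assumptions):

```typescript
// Hypothetical runtime-side handling of <oai-mem-citation> blocks:
// collect the cited memory ids, then strip the tags before the text
// reaches the user, so a matched harness never leaks raw XML the way
// a foreign one does.

function extractCitations(assistantText: string): string[] {
  const ids: string[] = [];
  const re = /<oai-mem-citation>([\s\S]*?)<\/oai-mem-citation>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(assistantText)) !== null) {
    ids.push(m[1].trim()); // each id feeds the usage_count bump
  }
  return ids;
}

function stripCitations(assistantText: string): string {
  return assistantText
    .replace(/<oai-mem-citation>[\s\S]*?<\/oai-mem-citation>/g, "")
    .replace(/\s{2,}/g, " ")
    .trim();
}
```

Run the same text through a harness without the extract step and you get both failure modes from the diagram at once: the XML stays visible and the ids never reach the counter.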


The System Prompt Skeleton

The Copilot CLI SDK exposes its system prompt as a structured object with ten section IDs. Hosts can override each section, replace it, or take full control. From the open source TypeScript at github.com/github/copilot-sdk/nodejs/src/types.ts:636:

const SYSTEM_PROMPT_SECTIONS = {
  identity:           "Agent identity preamble and mode statement",
  tone:               "Response style, conciseness rules, output formatting",
  tool_efficiency:    "Tool usage patterns, parallel calling, batching",
  environment_context: "CWD, OS, git root, directory listing, available tools",
  code_change_rules:  "Coding rules, linting/testing, ecosystem tools, style",
  guidelines:         "Tips, behavioral best practices",
  safety:             "Environment limitations, prohibited actions, security",
  tool_instructions:  "Per-tool usage instructions",
  custom_instructions: "Repository and organization custom instructions",
  last_instructions:  "End of prompt: parallel tool calling, persistence"
};

This is not just a documentation surface. It is the public contract of the model's training distribution. Each section has a specific role, and the model was trained to read each section as a particular kind of instruction. The safety section binds harder than guidelines. The tool_instructions section is consulted when the model is mid tool call. The last_instructions section is what the model reads right before emitting a turn.
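A host that overrides sections might assemble the final prompt along these lines. The section ids are the ten from the SDK snippet above; the override mechanics, the function name, and the empty-string-drops-the-section rule are a hypothetical sketch, not the SDK's actual API.

```typescript
// Hypothetical assembly of a system prompt from the ten section ids.
// Hosts supply overrides per section id; anything not overridden falls
// back to the default text. Section order is fixed, because the model
// was trained to read the sections in a fixed order.

type SectionId =
  | "identity" | "tone" | "tool_efficiency" | "environment_context"
  | "code_change_rules" | "guidelines" | "safety" | "tool_instructions"
  | "custom_instructions" | "last_instructions";

const SECTION_ORDER: SectionId[] = [
  "identity", "tone", "tool_efficiency", "environment_context",
  "code_change_rules", "guidelines", "safety", "tool_instructions",
  "custom_instructions", "last_instructions",
];

function assemblePrompt(
  defaults: Record<SectionId, string>,
  overrides: Partial<Record<SectionId, string>>,
): string {
  return SECTION_ORDER
    .map(id => overrides[id] ?? defaults[id])
    .filter(text => text.length > 0) // an empty override drops the section
    .join("\n\n");
}
```

The point of the sketch is the fixed `SECTION_ORDER`: a host can change what a section says, but not where the model expects to find it.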

Codex has an equivalent of its own, though a less explicit one. The developer prompt is assembled in this order:

CODEX DEVELOPER PROMPT
======================
- permission instructions
- base developer instructions
- memory_summary.md (5K tokens, always)
- collaboration mode
- realtime updates
- personality
- apps

Memory comes after policy and identity, before behavioral overrides. The model was trained to read this exact order.
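The fixed ordering with the always-on memory budget can be sketched as below. The section order and the 5K token cap are from the text above; the function shape and the four-characters-per-token estimate are my assumptions.

```typescript
// Hypothetical sketch of Codex's developer-prompt assembly: fixed section
// order, with memory_summary.md truncated to its 5K token budget before
// insertion. The 4-chars-per-token estimate is crude, for the sketch only.

const MEMORY_TOKEN_CAP = 5_000;
const CHARS_PER_TOKEN = 4;

function capMemory(summary: string): string {
  const maxChars = MEMORY_TOKEN_CAP * CHARS_PER_TOKEN;
  return summary.length <= maxChars ? summary : summary.slice(0, maxChars);
}

function buildDeveloperPrompt(parts: {
  permissions: string;
  baseInstructions: string;
  memorySummary: string; // contents of memory_summary.md
  collaborationMode: string;
  realtimeUpdates: string;
  personality: string;
  apps: string;
}): string {
  // Memory comes after policy and identity, before behavioral overrides.
  return [
    parts.permissions,
    parts.baseInstructions,
    capMemory(parts.memorySummary),
    parts.collaborationMode,
    parts.realtimeUpdates,
    parts.personality,
    parts.apps,
  ].join("\n\n");
}
```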

Claude Code's static prefix:

CLAUDE CODE STATIC PREFIX
==========================
<base agent system prompt>
<environment block: cwd, platform, OS>
# claudeMd                    ← project CLAUDE.md content
# auto memory                 ← MEMORY.md index, capped at 200 lines
  <types block describing user/feedback/project/reference>
  <when to save guidance>
  <verification rule before acting on memory>
  <full MEMORY.md contents>
# userEmail
# currentDate

A different shape, a different ordering, and a different set of precedence claims about what the model should treat as binding. The Claude trained model knows that # auto memory instructions "OVERRIDE any default behavior and you MUST follow them exactly as written." That phrase lives inside the harness rather than inside the model itself, but the model has been trained to recognize the heading and treat its contents as binding. A model trained against this prefix will hunt for # auto memory and react accordingly, while a model trained against a different prefix simply will not see the heading the same way and will give it the weight of any other piece of context.

This is the same lesson as the citation tag, scaled up. The system prompt is not generic. It is a structured artifact with section conventions that the model was taught to read in a specific way. Swap harnesses and you keep the model's reading habits but lose the structure they apply to.


The Routing Reality: What GitHub Copilot CLI Is Doing

GitHub Copilot CLI is the most interesting harness in the comparison because it explicitly tries to route across model families. Sonnet is the default. The picker exposes Sonnet, Opus, Haiku, and the GPT 5.x family. v1.0.32 added an auto mode that selects a model per session.

How does Copilot CLI handle the model harness fit problem? Looking at the changelog, the strategy has three legs.

1. Per model tool inclusion

The apply_patch tool is included only when the active model is from the Codex family. v0.0.366: "Codex specific patch toolchain." The harness knows which models were trained on apply_patch and only exposes it to those models. Anthropic models get the Edit and Write shape they were trained on.

This is not a translation layer. It is a per model tool surface. The router does not pretend apply_patch and Edit are the same operation. It serves the right tool to the right model.

2. Tool search per model

v1.0.13: "Tool search for Claude models." The implication: Claude trained models expect a deferred tool loading pattern via ToolSearch. The harness only exposes the discovery loop to those models.

OpenAI trained models do not get the same loop. They get the full tool list up front because that is what they were trained on.

3. Critic agent with complementary model

v1.0.18: "New Critic agent automatically reviews plans and complex implementations using a complementary model to catch errors early (available in experimental mode for Claude models)."

The Critic is a different model than the main agent. Plans get reviewed by the complementary model. This is multi model orchestration baked into the harness, and the routing is explicit.

COPILOT CLI'S MULTI MODEL ROUTING
==================================

User prompt
  ↓
Active model (Sonnet 4.6 or GPT 5.4 or Opus 4.7)
  ↓
For Claude models:                For OpenAI models:
  expose Edit, Write,                expose apply_patch,
  Bash with full sandbox             local_shell, exec_command,
  ToolSearch deferred                full tool list up front
  loading enabled                    
  ↓                                   ↓
  Critic agent (different             No Critic
  model) reviews plan
  ↓                                   ↓
emit assistant message              emit assistant message

This is what a real router looks like. Not "translate everything to a common dialect," but "serve the right dialect to each model." It is more code, more state, more telemetry. It is also the only way to get top performance from each model.

The cost of this approach is honesty. The harness has to admit that "Claude on Copilot CLI" and "GPT on Copilot CLI" are different products. The user picks one or the other and gets different behavior. There is no neutral common denominator.
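The three legs above reduce to a per family tool surface. A minimal sketch, using the tool and feature names from the changelog entries; the family labels and the selection function are illustrative, not Copilot CLI's actual code:

```typescript
// Hypothetical per-model tool routing: each model family gets the tool
// surface it was trained on, rather than a translated common dialect.

type ModelFamily = "claude" | "openai-codex" | "openai";

interface ToolSurface {
  tools: string[];
  deferredToolSearch: boolean; // ToolSearch discovery loop (Claude models)
  criticAgent: boolean;        // complementary-model plan review
}

function toolSurfaceFor(family: ModelFamily): ToolSurface {
  switch (family) {
    case "claude":
      return {
        tools: ["Edit", "Write", "Bash"],
        deferredToolSearch: true, // discover tools lazily via ToolSearch
        criticAgent: true,        // experimental Critic for Claude models
      };
    case "openai-codex":
      return {
        // apply_patch ships only to the models trained on it
        tools: ["apply_patch", "local_shell", "exec_command"],
        deferredToolSearch: false, // full tool list up front
        criticAgent: false,
      };
    case "openai":
      return {
        tools: ["local_shell", "exec_command"],
        deferredToolSearch: false,
        criticAgent: false,
      };
  }
}
```

Note that there is no default branch that merges the surfaces. The router's whole value is refusing to pretend the families are interchangeable.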

This is the right honest answer to model harness fit, and Copilot CLI is the only harness in the open or semi open set that actually ships it. The strategic logic is worth naming clearly. Multi model is the crucial bet for any serious agent platform in 2026, and at GitHub and Microsoft we made that bet deliberately and early. Most customers are running multi model workflows whether their vendor admits it or not, and the only way to give every model its best performance is to build the per model routing surface inside the harness itself. We committed to that answer up front, which is what positions Copilot CLI to keep pace with whatever the labs ship next without having to redo its core architecture each time the leaderboard reshuffles. The matched pair is the unit of analysis, but the matched harness across many models is the unit of platform, and that is the level we are operating at.


Mid-Chat Model Switching: The Cleanest Failure Mode

The single sharpest concrete demonstration of model harness fit comes from what happens when a user switches models mid conversation. Cursor's research team describes this carefully in their April 30 post, and the failure surface is worth walking through because every assumption that breaks here is an assumption a single model harness pair quietly relies on.

Three things break at the moment of a model switch.

First, the conversation history itself is now out of distribution. The previous model produced tool calls in its native vocabulary: apply_patch blocks, <oai-mem-citation> tags, six or eight verb subagent dispatches. The new model was trained against a different vocabulary and now has to reason about a transcript full of tool calls it would not have emitted. Cursor handles this by injecting a custom instruction explicitly telling the model "you are taking over mid chat from another model" plus steering it away from the prior model's tools. That mitigates but does not eliminate the cost. The model is still reading a transcript that does not match its instincts.

Second, the prompt cache breaks. Caches are provider and model specific, which means a switch is a guaranteed cache miss. For a long session, this turns the first turn after the switch into a full price re entry of every byte of system prompt and conversation history. Cursor's mitigation is to summarize the conversation at switch time, which yields a shorter clean transcript that costs less to re cache, at the price of losing details that the summary did not preserve.

Third, the tools themselves change shape. The new model's harness loads its native tool set. If the user was deep into a subagent dispatch flow with one set of verbs, the next turn presents a different set. The model has to figure out whether the prior tools are still valid (they are not) and which of its own tools maps to the user's apparent intent.
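Taken together, the three mitigations amount to a takeover procedure at the switch boundary. This sketch is an illustration of that procedure, not Cursor's implementation; the summarizer is assumed to exist, and every name here is hypothetical.

```typescript
// Hypothetical mid-chat model switch: summarize the transcript (the prompt
// cache is a guaranteed miss anyway), inject a takeover instruction that
// steers the model away from the prior model's tools, and load the
// incoming model's native tool surface from the first turn.

interface Turn { role: "user" | "assistant" | "tool"; content: string; }

interface SwitchResult {
  transcript: Turn[];
  systemInjection: string;
  tools: string[];
}

function switchModel(
  history: Turn[],
  priorTools: string[],
  nextModelTools: string[],
  summarize: (turns: Turn[]) => string, // assumed summarizer
): SwitchResult {
  // A shorter clean transcript is cheaper to re-cache, at the price
  // of losing details the summary did not preserve.
  const summary = summarize(history);
  return {
    transcript: [{ role: "user", content: `Conversation so far: ${summary}` }],
    systemInjection:
      "You are taking over mid chat from another model. Do not call " +
      `these tools from the earlier transcript: ${priorTools.join(", ")}.`,
    tools: nextModelTools, // native surface from the first turn
  };
}
```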

Cursor's recommendation, after building the mitigations, is honest: "we generally recommend staying with one model for the duration of a conversation, unless you have a reason to switch." The cleanest workaround they describe is to spawn a subagent with a different model rather than switch the main conversation. A subagent starts with a fresh context window, no transcript bias, no cache to break, and the new model's native tool surface from the first turn.

Each of these failure modes maps directly back to the thesis. The transcript, the cache prefix, and the tool surface are all parts of the wire format the model was trained against. Change the model and you change the contract on all three sides at once. A model switch is not a model swap. It is a harness swap, a tool swap, and a cache invalidation, all at once.


What the Labs Are Saying

The model harness fit framing is no longer a subterranean observation. Two of the labs publishing the most interesting agent work in 2026 say it openly, and the AI infrastructure community has converged on a clean one line definition.

Cursor's Stefan Heule and Jediah Katz describe their harness work as "obsessively stacking small optimizations" specifically because a step change is rare and the gains compound only inside a matched pair. Their team builds in custom prompting per provider and per model version, citing OpenAI's literal precision versus Claude's tolerance for imprecise instructions as concrete differentiators that flow back into prompt design. They report driving unexpected tool call errors down by an order of magnitude in one focused sprint. Tool call reliability is not a model property. It is a harness property, and one that compounds every turn the agent stays alive.

Anthropic's Prithvi Rajasekaran ran a related experiment in his March 24 post on long running application development. The architecture: a planner, a generator, and an evaluator agent, modeled on Generative Adversarial Networks. The evaluator uses Playwright MCP to actually click through the running application as a user would, then grades against a rubric. Out of the box, Rajasekaran reports, "Claude is a poor QA agent" — it identifies legitimate issues and then talks itself into approving the work anyway. Tuning the evaluator prompt over multiple rounds is what turns it into a reliable judge. The harness creates the judgment surface; the model alone does not.

The deeper lesson from Rajasekaran's work is about how harnesses should evolve as models improve. He built one harness against Claude Sonnet 4.5, which exhibited "context anxiety" strongly enough that compaction alone was not sufficient. The harness needed full context resets between sessions, with structured handoff artifacts to carry state across the boundary. When Opus 4.6 shipped, that behavior was largely gone. Rajasekaran dropped the entire context reset machinery and ran one continuous session for over two hours. Every component in a harness encodes an assumption about what the model cannot do on its own. Those assumptions go stale. The matched pair is not static. It moves as the model matures, and the harness has to retire scaffolding that is no longer load bearing.

LangChain's Vivek Trivedy has the cleanest framing I have seen: "Agent = Model + Harness. If you're not the model, you're the harness." The harness in this view is every piece of code, configuration, and execution logic that is not the weights themselves. System prompts, tool descriptions, bundled infrastructure, orchestration logic, hooks, middleware. Working backwards from the desired agent behavior, every harness primitive earns its place by patching a specific model gap. Filesystems for durable state, bash for arbitrary action, sandboxes for safe execution, memory for continual learning, planning and self verification for long horizons. Each primitive started life as a workaround for a specific deficiency the model had at training time. Some of those primitives will get absorbed back into the model over time. Others will compound.

Trivedy also names the mechanism that makes model harness fit so durable: a co-evolution feedback loop. "Useful primitives are discovered, added to the harness, and then used when training the next generation of models. As this cycle repeats, models become more capable within the harness they were trained in." This is the pipeline that hardens the matched pair over generations. A new harness primitive ships in week one. By month three, it shows up in millions of agent traces. By month six, those traces are training data for the next model. By month twelve, the next model has the primitive baked into its instincts and the harness can lean on it. The loop is what makes "swap to a foreign harness" not just clumsy but compounding clumsy. The model's habits got shaped by the previous generation of its own harness, which itself was shaped by the generation before. Move sideways and you skip every cycle of that compounding.

Trivedy is honest about the cost of this loop, and I want to flag the counter argument cleanly. Quoting him: "A truly intelligent model should have little trouble switching between patch methods, but training with a harness in the loop creates this overfitting." If the model's tool format preference is overfit to its training harness, you could argue that the right long term move is to train against a more diverse set of harnesses so the model generalizes. That argument has merit. The labs that ship one model and one harness as a pair are buying near term performance at the cost of the model's portability. Whether that trade is the right one depends on whether portability is something the customer values, and right now the customer mostly values the leaderboard.

Three independent posts published within weeks of each other, all converging on a single thesis: the model is only half of the system, the harness is the other half, the matched pair is the proper unit of analysis, and the vendors that ship the matched pair as a single product are the ones currently sitting at the top of the leaderboards.


The Identity File Convention

The harness side of the contract has converged on a markdown file per concern, and the file names are now load bearing across the ecosystem. A model trained on one harness recognizes the file names and knows which one carries which kind of authority.

THE MARKDOWN-FILE-PER-CONCERN CONVENTION (April 2026)
=====================================================

CLAUDE.md      Static project instructions for the Claude Code session.
               Anthropic's proprietary format. Loaded at session start.
               The de facto standard since 2024. Closely paired with
               Anthropic post training.

AGENTS.md      Cross tool standard for procedural rules. "What do you
               do and how." Adopted by Codex CLI, OpenClaw, Cursor.
               Claude Code does not natively read AGENTS.md as of
               April 2026 (community feature request open). The
               GitHub Copilot CLI changelog explicitly added support
               for AGENTS.md as a context source.

SOUL.md        Personality, voice, worldview. From github.com/
               aaronjmars/soul.md. Used by OpenClaw and the Aeon
               framework. "An AI that thinks and speaks as you" vs
               "a chatbot that talks about you." Opt in convention,
               not yet baked into the major harnesses.

USER.md        Who the user is. Hermes splits its memory into
               MEMORY.md (what the agent learned) and USER.md (who
               the user is). The split lets the model treat user
               facts as authoritative and project facts as evolving.

MEMORY.md      Auto memory index. Claude Code loads this on every
               turn under # auto memory and treats it as overriding
               default behavior. Codex calls its index memory_summary.md
               with a 5K token cap and lazy loads the bodies.

STYLE.md       Voice and writing patterns (SOUL.md ecosystem).

SKILL.md       Per skill folder, with YAML frontmatter. Cross harness
               adopted: Claude Code, Codex CLI, GitHub Copilot CLI all
               read this format. Body conventions still differ.

IDENTITY.md    Lightweight public facing card with name, role, metadata.
               OpenClaw convention.

The key observation: the file names are now part of the wire format. A model that has been trained to look for a # auto memory block under a MEMORY.md heading will hunt for that exact heading on a turn. A model trained against AGENTS.md will look for AGENTS.md and miss CLAUDE.md. A model trained against SOUL.md will load personality from SOUL.md and ignore the same content if you put it in STYLE.md.

This is why the AGENTS.md feature request against Anthropic's repo matters. It is not a docs migration. It is a request for the model's training distribution to expand its file recognition vocabulary. Until Anthropic post trains Claude to read AGENTS.md, that file is invisible to Claude Code even if it sits next to CLAUDE.md in the repo.

The SOUL.md ecosystem is a stress test of this thesis. SOUL.md is not yet recognized by any major harness's default loader. So the SOUL.md repo's installation instructions are revealing: copy your soul/ directory into the project, then add a few lines to CLAUDE.md pointing the model at it. That is a manual bridge from a non-recognized convention to a recognized one. The SOUL.md authors understand that the bytes do not work unless the model knows where to look, and "where to look" is a habit fixed in post training.

The same routing problem shows up in the open. GitHub Copilot CLI v1.0.4 added: "Read .claude/settings.json and .claude/settings.local.json as additional repo config sources." v1.0.36 walked some of it back: "Custom agents, skills, and commands from ~/.claude/ are no longer loaded by the Copilot CLI." That is a router that tried to be permissive about file names, then narrowed when the user surface got confusing. The lesson sits underneath the changelog: even the harness that runs Claude models cannot treat .claude/ files as authoritative without negotiating with the user about which conventions count.

Pick the convention. Ship the post training to match. Or ship a router that explicitly maps each file to the model that recognizes it. The middle path of "be permissive and load anything that looks plausible" loses every time.
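The explicit-router option reduces to a map from file name to the model families that recognize it, derived from the table above. The family labels and the lookup function are my sketch, not any harness's actual loader:

```typescript
// Hypothetical explicit router: each identity file is loaded only for the
// model families whose post training taught them to recognize it. The
// file-to-family mapping follows the convention table above.

const FILE_RECOGNITION: Record<string, string[]> = {
  "CLAUDE.md":         ["claude"],                     // Anthropic's convention
  "AGENTS.md":         ["codex", "copilot"],           // cross tool standard
  "MEMORY.md":         ["claude"],                     // # auto memory index
  "memory_summary.md": ["codex"],                      // Codex's 5K token index
  "SKILL.md":          ["claude", "codex", "copilot"], // cross harness adopted
};

function contextFilesFor(family: string, repoFiles: string[]): string[] {
  // A file with no entry is invisible to every family, which is exactly
  // the SOUL.md situation described above.
  return repoFiles.filter(f => (FILE_RECOGNITION[f] ?? []).includes(family));
}
```

The permissive alternative would be `repoFiles.filter(f => f.endsWith(".md"))`, and that is the middle path the changelog history shows losing.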


What This Means

After months of running these three harnesses side by side, reading the open source code, and tracking the Terminal-Bench leaderboard:

The harness is no longer a wrapper around the model. The harness is part of the model's effective parameters. The post training process embeds the harness's tool surface, schema shapes, memory rituals, citation contracts, and system prompt structure into the model's instinct set. You can take the weights to a different harness, but you cannot take the instincts. The instincts only fire when the harness presents the world the way the post training presented it.

This has three consequences worth naming.

For agent platform builders: pick a harness, pick a model, ship them as a pair. Do not pretend the model is portable. Do not pretend the harness is neutral. The frontier labs are publishing model harness pairs whether they say so or not, and the per pair performance is the only number that matters. Copilot CLI's "different tools for different models" approach is the honest version of this. The dishonest versions ship a common denominator and underperform on every model they serve.

For model labs: the harness is product strategy, not infrastructure. The harness is where the lab's post training investment compounds. Anthropic's <system-reminder> injection model, the typed memory taxonomy, and the verification on every body read are not infrastructure choices. They are the surface the model was sculpted against, and they are the moat that makes the model less interchangeable than it would otherwise be. Same for Codex's two phase memory pipeline, the citation tag, and the strict JSON schema. Same for Copilot CLI's ten section system prompt skeleton. The harness is where the model becomes irreplaceable.

For users: the cost of switching is higher than it looks, and lower than vendors would like you to think. Higher because the model and the harness fused over months of training and you cannot pull them apart cleanly. Lower because the simple stack underneath is shared, and the conventions on top are documentable. An honest port, one that replicates the tool surface, the citation contract, the system prompt structure, and the memory ritual, would close most of the gap. It just costs as much as the original post training did to set up.

The matched pair is not static. It shifts as the model matures. This is the most useful nuance from Rajasekaran's Anthropic post. A harness component that was load bearing for Sonnet 4.5 (context resets, sprint decomposition, aggressive compaction) became dead weight on Opus 4.6 because the model started doing that work natively. The right harness for a model in March is not the right harness for that model's successor in October. The discipline is to read the traces, identify which components are still earning their place, and retire the ones that are now patches over solved problems. Cursor's blog says the same thing in different words: "Every component in a harness encodes an assumption about what the model cannot do on its own, and those assumptions go stale."

So back to the question I started with. Why does the same prompt produce visibly different output across three harnesses running the same model?

Because the model running on three harnesses is effectively three different models, even though the weights on disk are byte for byte identical. The instincts that fire at runtime are not stored only in the weights; they are conditioned by the harness the weights were trained against, and those instincts turn out to be most of what shows up in the assistant's output on any given turn.

The interesting design move now is not a better model. It is not a better harness either. It is the matched pair, designed end to end, where the post training and the runtime reinforce each other turn after turn until the model becomes legibly better at the things this specific harness rewards.

You can see the major builders converging on this idea from three different starting points. Anthropic shipped Claude Code as the canonical Claude harness, with the post training and the runtime co-designed as a single product. OpenAI shipped Codex CLI as the canonical Codex harness, with the same vertical integration on the OpenAI side of the house. At GitHub and Microsoft we shipped Copilot CLI with explicit per model routing because multi model is crucial: customers run every frontier model they can get their hands on, and our job is to make each one perform at its best inside a harness designed to serve all of them well. The result is the most pragmatically honest harness in the open or semi open set today, and the one positioned to compound across model generations rather than locking to any single lab. Three different theories of what to do about model harness fit, all three coherent, and all three paying a real engineering price for the choice they made.

The frontier work in 2026 is not about new model architectures. It is about new harness primitives. Ralph Loops, where a hook intercepts the model's exit attempt and reinjects the original prompt in a clean context window, forcing the agent to keep grinding against the goal. Just-in-time harness assembly, where the tool surface and the system prompt get composed per task instead of pre-configured per session. Self-tracing agents that read their own logs to find harness-level failure modes and patch them without human intervention. Each one of these is a primitive that some model will eventually be post trained against, and that pairing will show up at the top of the next leaderboard.

The Terminal-Bench leaderboard tells you who is paying the price right now. Look at it again in six months.