For years, the comment rule was simple: do not explain code the code already explains.
That rule still holds. But AI agents changed the reader.

A human reads code with memory. They remember that gift-card refunds stop after 24 hours because the payment processor settles those funds after T+1. They know an expedited refund policy should not change that. The code does not always carry that context.
An agent often sees a slice. A file. A task prompt. A few search results. It can reason well inside that context, but it does not have the same social memory. So the old question changes.
It is no longer only: is this comment useful to a human?
It is also: does this comment carry information the code cannot recover?
I decided to test this.
I ran 42 public trials across four versions of the same gift-card refund task.
The result was not "comments help" or "comments are useless." Both are too broad.
Comments helped when they carried information the code did not and lived near the edit. They did almost nothing when the code already carried the rule.
What changed
The old debate was mostly about noise.
Developers would write this:
// Check whether this order is refundable.
return elapsedHours <= refundWindowHours(order);
Then someone would say: remove the comment. The code already says that.
They were right.
That kind of comment does not add information. It makes the file longer. It creates another thing that can drift. It trains people to ignore comments.
But that is not the kind of comment agents need.
The useful comment is closer to this:
// Gift-card refunds must stay capped at 24 hours.
// The payment processor settles those funds after T+1.
// Expedited refunds must not extend this window.
The code can show that an order has a refund window. It cannot explain why one product type has a stricter window unless that rule is encoded somewhere. It cannot tell the agent which future edit would silently weaken the refund policy.
This is the shift. Comments are less about explaining local mechanics and more about carrying intent, invariants, and non-local constraints.
The mistake is taking "comments help agents" and turning it into "write more comments everywhere."
I wanted to know where the boundary is.
The benchmark
Comment Bench is small on purpose. Each scenario presents an agent with a task and a tempting wrong fix. Each scenario runs across seven treatments. Only the comment payload varies:
- none - no comment
- what_paraphrase - restates what the code does
- human_why_inline - precise intent at point of use
- human_file_header - module-level rule listing
- aidev_anchor - a durable note prefixed with // AIDEV-NOTE: so agents and humans can grep for constraints that must survive refactors
- ai_generated_comment - generic plausible AI-style docstring
- stale_misleading - comment contradicts the code or rule
AIDEV-NOTE is not magic syntax. It is a convention. The point is to mark comments that are written for future AI-assisted edits, not just for the next human reading the file.
Each trial runs in an isolated /tmp workspace. The agent gets one or a few source files plus a task prompt. Tests are hidden. After the agent stops, I copy its output into a clean directory and run hidden invariant tests against it.
The score has two bits:
task_done = true or false
invariant_held = true or false
That second bit is the point. The dangerous failure is not "the agent failed the task." The dangerous failure is that the feature works and the protected rule breaks.
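As a sketch (the type and function names here are mine, not the benchmark's internals), the scoring logic reduces to:

```typescript
// Hypothetical shape of one trial's score (names are illustrative).
type TrialResult = {
  task_done: boolean;      // the hidden feature tests passed
  invariant_held: boolean; // the hidden invariant tests passed
};

// The failure worth worrying about: the feature works, the rule broke.
function isSilentRegression(r: TrialResult): boolean {
  return r.task_done && !r.invariant_held;
}
```

Counting this case separately is what distinguishes "the agent failed" from "the agent quietly weakened the policy."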
The benchmark tests four shapes of that problem. They all use gift-card refunds so the domain stays boring and the mechanism stays visible.
Scenario 1: the code already carries the rule
The first scenario tested structural protection.
I used gift-card refunds because the exception is easy to follow. Most orders can get a longer refund window. Gift cards cannot. They stay capped at 24 hours because that payment flow settles earlier.
This was not really about gift cards. It was about whether comments matter when the code already shows the invariant.
I tried three versions of the same scenario.
In the first version, the protected rule lived in its own branch:
if (order.productType === "gift_card") {
  return elapsedHours <= 24;
}
In the second version, the rule was encoded as a cap:
window = Math.min(window, processorCap);
In the third version, the rule lived in a helper:
return capRefundWindow(order.productType, baseWindow);
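Fleshed out, the second (cap) version might look something like this. This is a sketch, not the benchmark's actual source; the constant names are assumptions:

```typescript
type Order = { productType: string };

const DEFAULT_REFUND_HOURS = 7 * 24;
const GIFT_CARD_PROCESSOR_CAP = 24; // gift-card funds settle after T+1

function refundWindowHours(order: Order): number {
  let window = DEFAULT_REFUND_HOURS;
  const processorCap =
    order.productType === "gift_card" ? GIFT_CARD_PROCESSOR_CAP : Infinity;
  // The invariant lives in the structure: no tier can exceed the cap.
  window = Math.min(window, processorCap);
  return window;
}
```

Whichever of the three encodings is used, the rule is visible at the edit point without any prose.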
The comments did not matter.
No comment, paraphrase comment, human-written invariant, file header, AIDEV-NOTE, vague generated comment, stale misleading comment - the result stayed the same. The agent preserved the invariant because the code structure made the invariant visible.
That is the useful null result.
If the code already carries the rule, prose adds little. A branch, helper, schema, type, or test is still better than a comment. The code did the steering.
Scenario 2: the comment is the only place the rule exists
This scenario removed the structural protection.
The refund code had no branch, no cap, and no helper. It only had the normal refund windows:
const DEFAULT_REFUND_HOURS = 7 * 24;
const EXPEDITED_REFUND_HOURS = 14 * 24;
The task asks the agent to add expedited refunds. The tempting wrong fix is obvious:
return order.expedited ? EXPEDITED_REFUND_HOURS : DEFAULT_REFUND_HOURS;
But the missing rule is not recoverable from the code. Gift-card refunds still need the 24-hour cap. That rule only exists if the comment says it clearly.
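For contrast, the edit that preserves the rule carves out gift cards before any tier logic runs. Something like this (a sketch assembled from the snippets above; GIFT_CARD_CAP_HOURS is an assumed name):

```typescript
type Order = { productType: string; expedited: boolean };

const DEFAULT_REFUND_HOURS = 7 * 24;
const EXPEDITED_REFUND_HOURS = 14 * 24;
const GIFT_CARD_CAP_HOURS = 24; // processor settles gift-card funds after T+1

function refundWindowHours(order: Order): number {
  // Carve out gift cards before applying any tier window.
  if (order.productType === "gift_card") {
    return GIFT_CARD_CAP_HOURS;
  }
  return order.expedited ? EXPEDITED_REFUND_HOURS : DEFAULT_REFUND_HOURS;
}
```

Nothing in the task prompt or the surrounding code forces this shape; only the comment can.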
This is where comments decided the result.
When the invariant was written directly above the function, the agent preserved it. When the same idea was missing, vague, stale, or merely restated the code, the agent followed the task wording and extended the window.
The clean split was this:
| treatment | invariant pass rate |
|---|---|
| human_why_inline | 3/3 |
| aidev_anchor | 3/3 |
| human_file_header | 0/3 |
| none | 0/3 |
| what_paraphrase | 0/3 |
| ai_generated_comment | 0/3 |
| stale_misleading | 0/3 |
The lesson is narrow:
If the invariant is not in the code, a precise comment can become the only thing that prevents the wrong edit.
Scenario 3: placement mattered
The surprising result was human_file_header.
The file header said the same thing. It explicitly said gift-card orders must cap at 24 hours. It said new tiers must carve out gift cards before applying their tier window.
It still failed 0/3.
The point-of-use comments worked. The file header did not.
The AIDEV-NOTE version sat directly above the function the agent had to edit:
// AIDEV-NOTE: Hard processor invariant. Orders with productType ===
// "gift_card" MUST cap at 24 hours regardless of any tier.
// Standard and expedited tiers MUST carve out gift_card before applying
// their window.
export function refundWindowHours(order: Order): number {
  return DEFAULT_REFUND_HOURS;
}
That placement mattered in this run. The file-header version may have read as metadata. Or the task may have narrowed the edit so much that the agent focused on the function body and ignored the top-of-file rule.
Three trials are not enough to make a universal claim about file headers. But they are enough for a practical rule: when a comment is load-bearing, put it near the code it constrains.
What the scenarios showed
The results collapse into three cases.
| scenario shape | what the comment did | result |
|---|---|---|
| Code already encoded the invariant | Repeated or decorated an existing rule | No measurable effect |
| Comment was the only encoding near the edit point | Carried the missing refund cap | Decisive effect |
| Comment was only in a file header | Carried the same rule away from the edit point | Failed in this run |
That is the boundary.
A comment next to a protective branch is usually decoration. A comment near the function that needs the rule can be load-bearing. A comment that says "be careful" is weak. A comment that says "gift-card refunds must cap at 24 hours before expedited windows apply" is useful.
What I would change in a codebase
I would not tell agents to write more comments.
I would tell them to write fewer low-signal comments and preserve the high-signal ones.
Keep comments that encode what code can't. Business rules with no structural guard. Constraints that future code might violate. Rationale for why a non-obvious choice is the way it is. These are decisive when they're the only signal.
Drop paraphrases. `// Check whether this order is refundable` above `return elapsedHours <= refundWindowHours(order)` reads as zero signal. In the benchmark, paraphrase comments behaved like no comment.
Generic AI-style docstrings are inert. The ai_generated_comment arm used plausible-sounding comments like "returns the appropriate window based on the order's properties." It performed like no comment. This matches the Gloaguen et al. 2026 finding that LLM-generated context files tend to be neutral-to-negative.
Put load-bearing comments near the code they constrain. In this run, the module-level file header failed. The point-of-use comment worked.
AIDEV-NOTE-style tags work. `// AIDEV-NOTE:` is useful for marking comments that should not be silently dropped on refactor.
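One nice property of the tag is that it is trivially greppable. A toy scanner (a hypothetical helper, not part of the benchmark) shows the idea:

```typescript
// Toy scanner: list every AIDEV-NOTE line so load-bearing constraints can
// be reviewed before a refactor. Equivalent in spirit to
// `grep -n "AIDEV-NOTE" file`.
function findAnchors(source: string): string[] {
  return source
    .split("\n")
    .map((line, i) => ({ line, n: i + 1 }))
    .filter(({ line }) => line.includes("AIDEV-NOTE"))
    .map(({ line, n }) => `${n}: ${line.trim()}`);
}
```

Running something like this over a diff's touched files is a cheap pre-merge check that no anchored constraint was silently deleted.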
Stale comments waste the slot. They destroy the steering benefit a correct comment would have had. Treat them like bugs.
The policy I want in agent-heavy repos is simple:
Do not comment what the next line does.
Do comment invariants, business rules, security constraints,
non-local coupling, and historical reasons the code cannot express.
Treat generated comments as code review material.
Treat stale comments as bugs.
I turned this into a small comment-policy.md file that can be imported from CLAUDE.md, AGENTS.md, .cursorrules, or any agent-instruction file.
What I'm not saying
This isn't a claim about all comments in all settings. The public run is 42 trials on one current coding agent, using single-shot edits in small TypeScript workspaces. Real editor sessions involve multi-turn interaction, longer context, and different file-reading behavior. Other models may behave differently.
It also isn't a claim about humans. The rule "self-documenting code, no comments" was a fight with humans about what comments are for. In the AI-agent regime, the question is different.
The right question is not "comments or no comments." The right question is: does the code already encode this fact?