
Inside the Article Crew: nine stages, ten specialists, and a 95% claim-coverage gate.

By Timothy Mo · AI & Agents · 6 May 2026 · 12 min

The editorial pipeline we run inside WaterDoctor's backend. How one agent per stage, a verbatim-quote check, a Crossref DOI cross-check, and a live-URL fetch keep fabricated citations out of the prose.

We don’t generate articles. We staff a research desk to write them. The Article Crew is the live one — nine stages, ten specialists, a tiered source registry, a verbatim-quote check, a Crossref DOI cross-check, every URL fetched live before sign-off. It runs inside WaterDoctor’s editorial backend, shipping research-spotlight pieces on aquaculture nitrogen cycling, SND bacteria, IoT water quality.

This is how it’s wired, why each stage exists, and where the gates that hold the line actually sit.

What a single-agent draft fails at

Before the crew, we tried what most teams try: one capable model, one long prompt, “write me a 1500-word article on X with citations.” The output passed casual reading. It failed every audit.

Three failure modes recurred. Fabricated DOIs — confident-looking 10.xxxx/yyyy strings that didn’t resolve at Crossref. Verbatim-quote drift — a sentence framed as a quote that the source never actually said in those words. Unsourced numerics — a “73% of facilities report” with no citation, the model splicing a half-remembered statistic into a new context.

None of these are fixable with better prompting. They’re structural: a single context window holding the topic, the research, the prose, the citations and the editorial voice can’t separate “what the source said” from “what would sound right here.” The boundary between memory and output is the failure surface.

A crew makes the boundary explicit. The drafter never does research. The researcher never drafts. The fact-checker only checks. Each agent’s prompt and memory are scoped to its role. False confidence has no place to hide.

The shape: nine stages, no skips

[FIG. 01 · Nine stages, one gate. 01 Topic (brief) → 02 Brainstorm (scope & angle) → 03 Outline (structure with evidence reqs) → 04 Research (per-section source bundles) → 05 Draft (first-pass prose) → 06 Refine (tighten against rubric) → 07 Humanize (strip AI tone) → 08 Finalize (reconcile references) → 09 Translate (EN ⇌ 简中). Claim-coverage gate: ≥ 95% to advance, the floor.]

Topic to brief. Brainstorm to scope and angle. Outline to a section-by-section structure with explicit evidence requirements per section. Research to per-section bundles with verbatim excerpts. Draft to first-pass prose. Refine to a tightened version against a house rubric. Humanize to strip the AI tone. Finalize to reconcile references and run the claim-coverage gate. Translate to parallel English / 简体中文 reusing the same citation map.
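For the shape of it, here is a minimal sketch in Python of the no-skips ordering. The stage names are the pipeline's own; the code itself is illustrative, not our production orchestrator.

```python
from enum import IntEnum

class Stage(IntEnum):
    """The nine stages, in the only order the pipeline accepts."""
    TOPIC = 1
    BRAINSTORM = 2
    OUTLINE = 3
    RESEARCH = 4
    DRAFT = 5
    REFINE = 6
    HUMANIZE = 7
    FINALIZE = 8
    TRANSLATE = 9

def advance(current: Stage) -> Stage:
    """Move one stage forward. There is deliberately no skip parameter."""
    if current is Stage.TRANSLATE:
        raise ValueError("pipeline complete; nothing to advance to")
    return Stage(current + 1)
```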

The order matters. We have tried collapsing stages — outline plus research, draft plus refine — and watched the failure modes return. Each stage exists because removing it lost something. The drafter without an explicit outline freelances. The drafter without per-section research bundles invents citations to fit prose it has already written. The humanizer without a separate refine stage rewrites instead of polishes.

No skips. Not for short pieces. Not when the deadline is tight. The pipeline is the unit of work.

The agents and their memory

[FIG. 02 · Ten agents, four memory tiers. Agents: writers.agent ×7 (seven stage writers), fact-check.agent (≥ 95% claim coverage to advance), reference-verifier (DOI · live URL · drop on mismatch). Per-agent private tiers: tier-01 core (role charter, pinned for life), tier-02 long-term (house style, prior corrections), tier-03 scratchpad (per-piece, disposable). Tier-04 shared knowledge bank: source-tier registry (peer-reviewed → news), glossary & named-entity list (EN ⇌ 简中), prior-piece citation map (re-cite, don't re-draft), eval rubric (coverage · tier · quote · liveness), refusal list (what we won't publish). Read by every agent; written by curator + human editor only.]

Seven stage writers. One fact-check agent. One reference-verifier. One human editor. Plus a curator we don’t usually bill as an agent because its job is to tend the shared knowledge bank — promote what proves out, demote what stops being true, gate what goes in.

Each agent has three private memory tiers. Tier-01 is the role charter — what this agent is for, what it must never do, what its handoff format looks like. Pinned for life. Tier-02 is long-term: house-style decisions made on prior pieces, named-entity spellings, transitions the editor has called out as “this is how we phrase it here.” Curator-written, never agent-written directly. Tier-03 is the per-piece scratchpad — the agent’s working notes for the piece in front of it. Disposable on completion.

Tier-04 is the shared knowledge bank, read by every agent in the crew. The source-tier registry sits here — eight tiers, peer-reviewed at the top, news at the bottom, Other dropped from the body entirely. The glossary sits here. The eval rubric sits here. The refusal list — “no top-10 listicles without direct project experience, no ‘studies show’ without naming the study” — sits here. Two roles can write to it: the curator and the human editor. No agent edits it without one of those signing off.
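In code the four-tier shape is small. A sketch, with field names assumed for illustration, of what each agent carries and what the shared bank refuses to let anyone but two roles write:

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """The three private tiers each agent carries."""
    charter: str                                          # tier-01: pinned for life
    long_term: list[str] = field(default_factory=list)    # tier-02: curator-written only
    scratchpad: list[str] = field(default_factory=list)   # tier-03: per-piece working notes

    def finish_piece(self) -> None:
        self.scratchpad.clear()  # tier-03 is disposable on completion

@dataclass
class KnowledgeBank:
    """Tier-04: read by every agent, written by two roles only."""
    WRITERS = frozenset({"curator", "human-editor"})

    source_tiers: dict[str, int] = field(default_factory=dict)
    glossary: dict[str, str] = field(default_factory=dict)
    refusal_list: list[str] = field(default_factory=list)

    def write(self, role: str, apply_update) -> None:
        if role not in self.WRITERS:
            raise PermissionError(f"{role} may not write to the shared bank")
        apply_update(self)  # the curated update goes through
```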

The point of the four-tier shape is that drift has nowhere to settle. A correction made on Tuesday’s piece on SND bacteria becomes a tier-02 entry for the drafter, a glossary update in tier-04, and a check the fact-checker runs on Wednesday’s piece. The crew gets sharper with use. We don’t restart from zero each time.

The eval that won’t let bad work through

Four axes, all measured before sign-off:

Claim coverage. Every prose sentence of substance must trace to a citation in the references. The fact-check agent maps each numeric and named claim back. The gate sits at 95%. Below the threshold, the draft goes back to the writer that owns the failing section. Not forward to the editor. The ratio is the floor; the editor is not the appeals court.
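A sketch of the gate logic, assuming the fact-checker emits one record per substantive claim. The record shape is illustrative:

```python
def claim_coverage(claims: list[dict]) -> float:
    """Fraction of substantive claims that trace to a citation."""
    if not claims:
        return 1.0
    cited = sum(1 for c in claims if c["citation_id"] is not None)
    return cited / len(claims)

def gate(claims: list[dict], floor: float = 0.95) -> list[str]:
    """Sections that go back to their writers. Empty means advance.
    Below the floor the draft never reaches the editor."""
    if claim_coverage(claims) >= floor:
        return []
    return sorted({c["section"] for c in claims if c["citation_id"] is None})
```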

Source-tier authority. Academic claims are restricted to peer-reviewed, pre-print, standards or government primary. Industry sources may support a load-bearing claim; they cannot carry one. A piece whose strongest claim rests on a vendor white paper goes back to research, not forward to refine.
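The authority rule reduces to one predicate. The tier numbers below are illustrative stand-ins for the eight-tier registry, not its real values:

```python
# Illustrative numbering: lower is stronger authority.
PEER_REVIEWED, PREPRINT, STANDARDS, GOV_PRIMARY = 1, 2, 3, 4
INDUSTRY, NEWS = 6, 8
ACADEMIC_CARRIERS = {PEER_REVIEWED, PREPRINT, STANDARDS, GOV_PRIMARY}

def may_carry(source_tier: int, academic_claim: bool) -> bool:
    """An industry source may support a load-bearing academic claim;
    it may never be the one carrying it."""
    return source_tier in ACADEMIC_CARRIERS if academic_claim else True
```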

Quote fidelity. Every quoted sentence is matched verbatim against the source it cites. Paraphrases dressed up as quotes are the most common LLM failure mode in our category and the easiest to catch — string-match the candidate quote against the cached source text, accept exact, reject otherwise. We have had zero fabricated quotes ship since this gate went in.
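The check itself is a few lines. The whitespace and quote-mark normalization is our assumption for illustration; the rule of record is exact match or rejection:

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and unify quote marks so the match is on words."""
    text = (text.replace("\u201c", '"').replace("\u201d", '"')
                .replace("\u2018", "'").replace("\u2019", "'"))
    return re.sub(r"\s+", " ", text).strip()

def quote_is_verbatim(candidate: str, cached_source: str) -> bool:
    """Exact substring match against the cached source text. No fuzz."""
    return normalize(candidate) in normalize(cached_source)
```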

Reference liveness. Every DOI is resolved at Crossref. Every URL is fetched live. Mismatched DOIs are dropped from the references; dead URLs block publish. The check runs at finalize, not at draft, because we want the latest state of the world before sign-off, not the state when the researcher pulled the bundle.
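A minimal sketch of the liveness pass using the requests library and the public Crossref REST API, which returns 404 from api.crossref.org/works/{doi} for a DOI it doesn't know. The title cross-check mirrors the phase-4 incident described below; the function names are ours for illustration:

```python
import requests

CROSSREF = "https://api.crossref.org/works/"

def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """Drop the reference if Crossref does not know the DOI."""
    return requests.get(CROSSREF + doi, timeout=timeout).status_code == 200

def doi_matches_title(doi: str, cited_title: str, timeout: float = 10.0) -> bool:
    """Cross-check that the DOI's Crossref record carries the cited title,
    so a real DOI attached to a different paper is caught too."""
    r = requests.get(CROSSREF + doi, timeout=timeout)
    if r.status_code != 200:
        return False
    titles = r.json()["message"].get("title", [])
    return any(cited_title.strip().lower() == t.strip().lower() for t in titles)

def url_is_live(url: str, timeout: float = 10.0) -> bool:
    """Fetch live at finalize; anything but a 2xx blocks publish."""
    try:
        return requests.get(url, timeout=timeout, allow_redirects=True).ok
    except requests.RequestException:
        return False
```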

The four scores are computed and shown to the human editor. The editor can override any of them — the override is on the record, with a reason. Overrides happen. They are rare and we read every one in the quarterly review.

How we built it

We did not build the crew end-to-end and then turn it on. We built it backward from the eval.

Phase 1 — eval first. Before any agent prompt was written, we wrote the rubric. Twenty pieces from the WaterDoctor backlog were graded by hand on the four axes. The results were grim — claim coverage averaged 71%, fabricated DOIs in 8 of 20, quote drift in 11 of 20. That was the baseline. Anything we built had to beat it on every axis or it didn’t ship.

Phase 2 — narrow agents one at a time. We wrote the research agent first, with a tier-01 prompt that said only “given an outline section, find sources, return verbatim excerpts with metadata, never write prose.” We tested it for two weeks on real briefs, fixing only the research agent. Then we added the drafter, scoped to only “given an outline plus the research bundle, write prose, cite inline, never invent a source.” Then the fact-checker. Each agent earned its place by lifting the eval before the next one was added.

Phase 3 — memory tiers. Once the pipeline ran end-to-end, we found the same corrections being made by the editor every week. We added the curator, the four-tier memory shape, and the discipline that nothing writes to tier-02 or tier-04 without a curator pass. The eval moved another six points on claim coverage in the next cycle.

Phase 4 — verifier and refusal list. Reference verification was bolted on after a single incident: a piece passed claim coverage at 96%, the editor approved, and a reader emailed to say a DOI in the references didn’t resolve. It was a real DOI for a different paper — the researcher had pulled metadata from one source and a DOI from another. We added the Crossref cross-check, the live URL fetch, and the refusal list. No more single-source metadata trust.

Phase 5 — bilingual translate. EN ⇌ 简中 was the last stage we added. The translate agent reuses the same citation map — no new claims, no drift, nothing the English version doesn’t say. The reviewer checks both versions, and that pays: we’ve shipped pieces where the English review passed and the Chinese review flagged a paraphrase dressed as a quote that the English carried too. The bilingual check catches issues either review alone would have missed.
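The no-new-claims rule is mechanically checkable when inline citations are ids into the shared map. The bracketed-id format below is an assumption for illustration:

```python
import re

CITE = re.compile(r"\[(src\d+)\]")  # assumed inline form, e.g. [src12]

def translation_adds_nothing(english: str, chinese: str) -> bool:
    """The translated piece may cite only ids the English version cites.
    An id appearing only in the Chinese text means new-claim drift."""
    return set(CITE.findall(chinese)) <= set(CITE.findall(english))
```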

The whole sequence was eleven months from first prompt to current shape. The eval was running from week three.

What I’d change for the next crew

Three things. First, typed handoffs sooner. The first version of the pipeline passed prose between agents — researcher to drafter as a markdown bundle, drafter to fact-checker as a markdown article. We re-parsed the same metadata at every stage. We have since moved to typed handoffs: the researcher emits a structured payload (sources, excerpts, per-section evidence pointers), the drafter consumes the structure and emits inline citations as IDs into the structure, the fact-checker reads both. Tokens dropped by a third; latency dropped more. I’d start with typed handoffs on day one next time.
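A sketch of what a typed handoff means in practice. Field names are illustrative; the point is that source metadata is parsed once and referenced by id from then on:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Excerpt:
    source_id: str   # key into the bundle's source table
    section: str     # outline section this evidence serves
    text: str        # verbatim, as cached at research time

@dataclass(frozen=True)
class ResearchBundle:
    """Emitted by the researcher, consumed by the drafter. The drafter's
    inline citations are source_ids into this structure, so the
    fact-checker never re-parses prose to recover metadata."""
    piece_id: str
    sources: dict[str, dict]       # source_id -> metadata (doi, url, tier, ...)
    excerpts: tuple[Excerpt, ...]
```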

Second, the curator is not optional and not an afterthought. We added the curator in phase 3. The eval result tells us we should have added it in phase 1, even before the drafter. The crew without a curator regresses on the same corrections forever; the crew with a curator improves every cycle.

Third, the refusal list is editorial, not technical. We initially treated “what we won’t publish” as a tier-04 lookup the agents enforced. It isn’t. The refusal list is what the editor uses to send a piece back, and the agents’ job is to surface candidates to the editor — not to pre-filter them. We moved the refusal list out of the gate logic and onto the editor’s review screen. The crew now produces more candidates and the editor kills more of them; both ends of the funnel are healthier.

The pipeline is the unit of work. The eval is the floor. The editor is the only seat that signs. That shape ports — to legal-research desks, to clinical-evidence reviews, to investor-update writing rooms. The Article Crew is the most-instrumented version we have running, and it is the version we recommend any team model its editorial work on.

— Timothy Mo, wGrow