AI & Agents 14 April 2026 · 7 min

Using Agents to Backfill NUnit Tests on a 2012 C# Monolith

By wGrow Project Team · 14 April 2026

The Refactoring Deadlock in Legacy Microsoft Stacks


We inherited a 2012 .NET 4.5 Web Forms portal for an SME logistics client. The portal ran revenue-critical freight billing and lane-pricing workflows for the client — live, in production, with no downtime tolerance. Zero unit tests. The client wanted the Web Forms portal rebuilt as an ASP.NET Core application — freight-billing logic and all — and we refused to touch the architecture until we had a baseline. That refusal cost us a two-week conversation. It saved us a six-month regression nightmare.

Refactoring old code without a test harness is professional negligence. Writing that harness manually across a decade-old monolith is commercially unsustainable. A legacy Microsoft stack is a specific archaeology problem: business logic bleeds into code-behind files, global state gets mutated in a dozen places you haven’t found yet, DbContext instances are newed up directly inside button-click handlers. There is no clean seam to cut.

The standard playbook: spend three sprints reading the codebase, write characterisation tests, then begin the refactor. On a 400-KLOC portal, three sprints stretch fast. Engineers fatigue reading spaghetti they didn’t write. Coverage stays thin. The refactor begins anyway because the client is paying. Regressions appear. Everyone blames the refactor when the real culprit was the untested baseline.

We built a dedicated AI agent crew to read the legacy C# and generate baseline NUnit test coverage before anyone touched the architecture. The rest of this piece explains how that crew is structured, where it works, where it fails, and what we actually measured.

LLMs Make Better Archaeologists Than Architects

Here’s the honest position on current AI coding tools: they tend to struggle with coherent net-new architecture in domains they haven’t seen concretely specified. Ask an LLM to greenfield a complex multi-tenant platform and you’ll likely get something structurally plausible that buckles under the first real domain constraint. Ask the same model to read a 500-line method — eight nested conditionals, twelve if/else branches, side effects on three static fields — and it produces a readable execution trace in seconds.

That asymmetry is the whole point.

For the logistics portal, we pointed agents at specific unmapped assemblies, one at a time. The agent’s job was not to critique the code. It was to trace execution paths and map inputs to outputs — the same work a human engineer would do with a pad and a whiteboard, but without the mounting existential dread, and in a fraction of the time.

The prompting strategy is deliberate and tightly constrained: lock in current behaviour. Bugs included. A baseline characterisation test is not a correctness assertion. It’s a regression tripwire. If the legacy method returns a wrong freight surcharge for a particular lane configuration, the baseline test asserts that wrong answer — but only until the bug is deliberately fixed. When the fix lands, you update or retire that specific characterisation test in the same change. The safety net is for catching accidental drift; it is not a mechanism for freezing known defects indefinitely.
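A minimal sketch of such a tripwire in NUnit follows. The `LegacySurchargeCalculator` class, its truncation defect, and the input values are hypothetical stand-ins chosen for illustration; none of them come from the portal itself.

```csharp
using NUnit.Framework;

// Hypothetical stand-in for a legacy method; not code from the portal.
public static class LegacySurchargeCalculator
{
    // Assumed defect, left in place on purpose: the cast truncates
    // toward zero instead of rounding to the nearest cent.
    public static int SurchargeCents(int baseCents, decimal ratePercent)
    {
        return (int)(baseCents * ratePercent / 100m);
    }
}

[TestFixture]
public class LegacySurchargeCharacterisationTests
{
    [Test]
    public void SurchargeCents_PinsCurrentTruncatingBehaviour()
    {
        // 1999 * 2.5 / 100 = 49.975. Correct rounding would give 50,
        // but the legacy code returns 49, so the baseline pins 49.
        int actual = LegacySurchargeCalculator.SurchargeCents(1999, 2.5m);

        // Update or retire this assertion in the same change that fixes the bug.
        Assert.That(actual, Is.EqualTo(49));
    }
}
```

The test name and the comment both say the asserted value is current behaviour, not intended behaviour, so a later reader does not mistake the pinned number for a business rule.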

This distinction matters more than it sounds. Engineers used to writing meaningful tests instinctively resist asserting incorrect output. The agent has no such instinct — it follows the instruction. For this specific task, that is an advantage.

According to project records from the engagement, first-pass human tracing of roughly 500-line methods in the freight-calculation assemblies took two to four hours per method. The agent produced the initial execution-path map in under a minute per method. Senior-engineer review is excluded from both timings and reported separately. That review still mattered; it moved the engineer’s work from first-pass tracing to verification. Across an assembly with dozens of such methods, that shift in effort profile compounds quickly.

Automating the Tedium of Tightly Coupled Dependencies


Legacy C# Pattern

    public OrderInfo GetOrder(int id)
    {
        using (var db = new LogisticsEntities()) // ①
        {
            return db.Orders.Find(id);
        }
    }

① Direct Entity Framework context isolated by agent using Microsoft Fakes

The primary bottleneck in legacy test generation is not test logic. It’s dependency isolation. Old systems don’t use dependency injection. Dependencies are instantiated inline, resolved through service locators, or pulled from static singletons — never injected through a clean interface. There’s no boundary to mock against.

The logistics portal used Entity Framework 5 with a single massive DbContext subclass: 47 DbSet properties, several hundred navigation properties, direct instantiation inside service methods that were themselves called from code-behind files. No repository layer. No unit-of-work abstraction. One enormous context that knew everything about everything.

Writing a test for any method touching that context required either spinning up a real database — slow, fragile, and not a unit test by any reasonable definition — or building a mock that replicated enough of EF’s behaviour to fool the method under test.

The agent crew handles this in two modes, depending on whether a seam exists or can be introduced without changing runtime behaviour. Where a seam is already present, or a thin interface wrapper can be added without modifying production paths, the agents generate the wrapper where needed and isolate the context with Moq-based DbContext substitutes built on EF5-specific faking patterns. Where the coupling is too severe (production code constructs the context directly with new and no seam exists), they use Microsoft Fakes shims to intercept the constructor call.
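The wrapper-plus-mock mode can be sketched roughly as below. This is an illustration under assumptions: `IOrderStore`, `OrderService`, the 10% levy, and the shape of `OrderInfo` are hypothetical, and the production implementation of the interface (which would wrap the existing inline context construction) is left out so the example stays self-contained. Only standard Moq and NUnit calls are used.

```csharp
using Moq;
using NUnit.Framework;

// Hypothetical DTO; the real OrderInfo shape is not shown in the article.
public class OrderInfo
{
    public int Id { get; set; }
    public decimal FreightCharge { get; set; }
}

// Thin seam (assumed name). A production implementation would wrap the
// existing "using (var db = new LogisticsEntities())" block unchanged.
public interface IOrderStore
{
    OrderInfo Find(int id);
}

// Hypothetical service: the logic under test lives here, not in the mock.
public class OrderService
{
    private readonly IOrderStore _store;

    public OrderService(IOrderStore store)
    {
        _store = store;
    }

    public decimal ChargeWithFuelLevy(int orderId)
    {
        OrderInfo order = _store.Find(orderId);
        return order.FreightCharge * 1.10m; // assumed levy, for illustration only
    }
}

[TestFixture]
public class OrderServiceTests
{
    [Test]
    public void ChargeWithFuelLevy_AppliesLevyToStoredCharge()
    {
        var store = new Mock<IOrderStore>();
        store.Setup(s => s.Find(42))
             .Returns(new OrderInfo { Id = 42, FreightCharge = 200m });

        var service = new OrderService(store.Object);

        // The mock feeds the dependency; the assertion targets production logic.
        Assert.That(service.ChargeWithFuelLevy(42), Is.EqualTo(220m));
        store.Verify(s => s.Find(42), Times.Once());
    }
}
```

The mock stands in for the data access only. The levy calculation still runs through `OrderService`, which is the property a reviewer should check for in every generated test.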

Concretely: the agents generated the full boilerplate mock setup for the FreightDbContext automatically. A test that would have taken an engineer ninety minutes to scaffold — parsing the EF model, wiring up the Moq configuration — was generated in about three minutes. The engineer’s job became verification, not authorship.

The limitation is real, and you need to know it. The agent will occasionally produce a mock setup that violates the constraints of the legacy framework. EF5 mocking patterns differ from EF6 and EF Core in specific, non-obvious ways. A model trained predominantly on modern stack examples will sometimes generate a DbSet mock that compiles but fails at runtime because it doesn’t correctly replicate EF5’s change-tracking behaviour. These failures can produce false confidence: a green test that is not actually exercising the production code path. This is not a failure mode you can afford to leave for QA to catch; it has to be caught at review.

The Mandatory Human Review Gate

[Figure: Verification Flow. Agent crew: test generation (trace paths), then dependency mocking (configure Moq/Fakes). Senior engineer: compilation (pull and build branch), then logic verification (review and observe pass).]

We applied the same agentic pattern to our own internal technical debt. wGrow runs a legacy IIS HTTP module — predating our current tooling by several years — that intercepts and transforms XML-based API calls for a client integration. Measured in CI against the IIS module test project — scoped to the core transformation and routing assemblies — branch coverage moved from 15 percent to 78 percent before the ASP.NET Core middleware migration.

The review gate is non-negotiable. Agents write the test and configure the mock. Agents do not commit to the main branch. Every generated test goes into a feature branch. A human engineer pulls the branch, compiles the test project, and watches it pass locally. No exceptions — including when you’re behind schedule.

On the IIS module work, the engineer estimated that writing the suite manually would take roughly 100 hours. Reviewing and correcting the agent-generated tests took about 30 hours — 70 percent below the manual estimate. The engineer shifted from writing boilerplate to operating as a code reviewer. That’s a different cognitive mode, and a significantly less exhausting one.

Three failure modes that human reviewers must be trained to catch:

Tests that compile but test nothing. The classic empty assert, or an assertion on a value the mock itself returns without the production code being exercised. These pass every time and protect against nothing.

Mocks that bypass the actual business logic. A mock configured to return a specific value for the method under test, rather than for a dependency of that method. The test becomes a tautology.

Assertion granularity too coarse to be meaningful. An agent asserting that a method returns “not null” when the real business rule is a specific freight calculation. Passes. Tests nothing of value.

These are predictable output patterns when agents receive underspecified prompts or insufficiently constrained instructions. The fix is in the prompting and the review checklist — not in the model’s underlying capability.
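For reviewers, the difference between a tautological test and a meaningful one can be shown with a small hedged sketch. `IRateTable`, `LaneQuoter`, the lane code, and the rate are all hypothetical names and values invented for this example.

```csharp
using Moq;
using NUnit.Framework;

// Hypothetical dependency and class under test, for illustration only.
public interface IRateTable
{
    decimal RatePerKm(string lane);
}

public class LaneQuoter
{
    private readonly IRateTable _rates;

    public LaneQuoter(IRateTable rates)
    {
        _rates = rates;
    }

    public decimal Quote(string lane, int km)
    {
        return _rates.RatePerKm(lane) * km;
    }
}

[TestFixture]
public class LaneQuoterReviewExamples
{
    private static Mock<IRateTable> RatesFor(string lane, decimal rate)
    {
        var rates = new Mock<IRateTable>();
        rates.Setup(r => r.RatePerKm(lane)).Returns(rate);
        return rates;
    }

    [Test]
    public void RedFlag_AssertsOnlyWhatTheMockReturns()
    {
        var rates = RatesFor("LANE-A", 1.5m);

        // LaneQuoter is never constructed or called: this checks Moq, not the code.
        Assert.That(rates.Object.RatePerKm("LANE-A"), Is.EqualTo(1.5m));
    }

    [Test]
    public void RedFlag_AssertionTooCoarse()
    {
        var quoter = new LaneQuoter(RatesFor("LANE-A", 1.5m).Object);

        // Passes for almost any implementation, so it protects nothing.
        Assert.That(quoter.Quote("LANE-A", 100), Is.GreaterThan(0m));
    }

    [Test]
    public void Acceptable_PinsTheComputedValue()
    {
        var quoter = new LaneQuoter(RatesFor("LANE-A", 1.5m).Object);

        // Exercises the production method and pins its exact output.
        Assert.That(quoter.Quote("LANE-A", 100), Is.EqualTo(150m));
    }
}
```

All three tests pass, which is the problem: only the last one would fail if `Quote` changed. A review checklist can ask two questions of each generated test: is the class under test actually invoked, and would the assertion fail if the calculation changed?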

Shifting Engineering Cycles to Actual Modernisation


LLMs will not automatically rewrite a legacy enterprise system cleanly. Anyone selling that outcome is running well ahead of the current evidence. What they can do is handle the archaeological grunt work that lets human engineers execute modernisation safely.

The ROI here is not abstract. Fewer billable hours on boilerplate test generation. A meaningful reduction in developer burnout from repetitive scaffolding. A reliable NUnit baseline established faster than purely manual approaches allow. On the logistics portal, we reached baseline coverage on the core freight calculation assemblies in eleven days of elapsed time. The manual estimate for the same scope was five to seven weeks.

That gap is the modernisation window. Senior engineers who would otherwise spend six weeks writing DbContext mocks are instead reviewing agent output and executing the actual .NET Core migration. The refactor begins from a position of evidence, not hope.

Two things have to hold for the approach to deliver at that margin. First, prompting discipline: instruct the agent to characterise, not improve; give it one assembly at a time; require it to enumerate every external dependency before generating a single test. Second, a review gate staffed by someone who knows what an empty assert looks like and has read the framework version constraints.

Neither condition is onerous. Both are necessary.

Stop asking senior developers to reverse-engineer twelve-year-old C# by hand. Deploy an agent crew to build the safety net. Then execute the refactor.
