Building with an agent inside the framework
What actually changes when a coding agent builds inside a contract-enforcing framework instead of against an empty canvas. A development story told through the commits of agentback-demo.
The claim at the center of AgentBack is that encoding best practice in the framework — rather than in a code review comment or a README — lets an AI coding agent produce better-structured code with less drift. That is easy to assert. Here is a concrete account of what it looked like to test it.
The agentback-demo is a real reference application: a weather MCP server that exposes Open-Meteo data as MCP tools, served three ways from a single DI wiring. It was built end-to-end with Claude Code as the coding agent, starting from an empty directory and a single sentence of intent. The result is roughly 600 lines of TypeScript, zero hand-maintained OpenAPI or MCP schema files, three transports (stdio, Streamable HTTP, dev console), API-key auth, per-tool rate limiting, a real test suite, and an architecture diagram. It earned a 93/100 health score from a brooks-lint review on first pass.
What follows is a walk through the development arc — the commits, the moments the framework helped, and the one place it couldn't.
The contract before the implementation
The agent's first meaningful act was not writing a route handler. It
was writing schemas.ts:
export const LocationSelector = z.object({
  city: z.string().optional(),
  latitude: z.number().min(-90).max(90).optional(),
  longitude: z.number().min(-180).max(180).optional(),
});

export const CurrentWeather = z.object({
  city: z.string(),
  temperature: z.number(),
  feelsLike: z.number(),
  humidity: z.number(),
  windSpeed: z.number(),
  description: z.string(),
  time: z.string(),
});

export const ForecastDay = z.object({ /* … */ });
export const Forecast = z.object({ days: z.array(ForecastDay) });
This was the first commit. Nothing ran yet. But once those schemas
existed, everything else had a single source to pull from: the MCP tool
input and output, the TypeScript types across the whole codebase, the
emitted MCP inputSchema that tools/list returns to
clients. The agent didn't have to synchronize any of those separately —
they were all derived.
This is the spec-first moment the framework is designed around. The contract lands before the behavior, enforced by the compiler, and every subsequent step is implementation inside a shape that is already locked.
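To make the "all derived" idea concrete, here is a minimal, dependency-free sketch of one description feeding three artifacts: a static type, a JSON-Schema-like input schema, and a runtime validator. The names `FieldSpec`, `Infer`, `toInputSchema`, and `validate` are hypothetical stand-ins invented for this illustration; the demo itself uses Zod, and how AgentBack emits the real `inputSchema` is not shown in this article.

```typescript
// Hypothetical stand-in for a schema library: one description per field.
type FieldSpec = {
  type: 'string' | 'number';
  min?: number;
  max?: number;
};
type Shape = Record<string, FieldSpec>;

// Derived artifact 1: a static type (every field optional, as in LocationSelector).
type Infer<S extends Shape> = {
  [K in keyof S]?: S[K]['type'] extends 'string' ? string : number;
};

// Derived artifact 2: a JSON-Schema-like object, the kind of thing a tool listing could carry.
function toInputSchema(shape: Shape) {
  const properties: Record<string, object> = {};
  for (const [name, spec] of Object.entries(shape)) {
    properties[name] = {
      type: spec.type,
      ...(spec.min !== undefined ? {minimum: spec.min} : {}),
      ...(spec.max !== undefined ? {maximum: spec.max} : {}),
    };
  }
  return {type: 'object', properties};
}

// Derived artifact 3: a runtime validator that returns a list of problems.
function validate(shape: Shape, value: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const [name, spec] of Object.entries(shape)) {
    const v = value[name];
    if (v === undefined) continue; // all fields are optional in this sketch
    if (typeof v !== spec.type) {
      problems.push(`${name}: expected ${spec.type}`);
    } else if (typeof v === 'number') {
      if (spec.min !== undefined && v < spec.min) problems.push(`${name}: below ${spec.min}`);
      if (spec.max !== undefined && v > spec.max) problems.push(`${name}: above ${spec.max}`);
    }
  }
  return problems;
}

// The single source: change a bound here and all three artifacts follow.
const locationSelector = {
  city: {type: 'string'},
  latitude: {type: 'number', min: -90, max: 90},
  longitude: {type: 'number', min: -180, max: 180},
} as const satisfies Shape;

type LocationSelector = Infer<typeof locationSelector>;
const berlin: LocationSelector = {latitude: 52.52, longitude: 13.405};
```

The point of the sketch is only the direction of the arrows: nothing downstream restates the latitude bounds, so nothing downstream can disagree with them.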
Where the type error caught the agent
In the second commit, the agent wrote the tool class and reached for a return shape that differed slightly from the declared output schema — a field name mismatch that would have been invisible until a client tried to parse the response.
@tool('get_current_weather', {input: LocationSelector, output: CurrentWeather})
async getCurrent(input: z.infer<typeof LocationSelector>) {
  const data = await this.weather.current(input);
  return {
    city: data.city,
    temp: data.temperature, // ← the mismatch: 'temp' is not in CurrentWeather (TS reports it at the @tool line)
    // …
  };
}
The TypeScript error appeared at the @tool decoration
line, naming the property mismatch precisely. The agent corrected the
field name, ran the build again, and moved on. No test needed to catch
it. No runtime surprise. The framework's generic constraint on the
method return type was the signal, and it fired before the code ever
ran.
This is a class of mistake agents make often. Without an output schema on the tool, the error would have surfaced as a behavioral failure in a downstream consumer: a test, a CI run, or a confused MCP client. With it, the signal was a single-line TS error at the definition site. That shrinks the search from every consumer of the tool's output to one line of source.
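The mechanism can be sketched without the framework. The example below is an assumption-laden illustration, not AgentBack's `@tool` implementation: `Schema`, `defineTool`, and the two-field `CurrentWeather` are hypothetical, the handler is synchronous for brevity, and the error is reported at the handler rather than at a decorator line. What it shares with the story above is the idea that declaring the output schema first pins the handler's return type.

```typescript
// Hypothetical stand-in for a schema: it only needs to carry the output type.
interface Schema<T> {
  parse(value: unknown): T;
}

interface CurrentWeather {
  city: string;
  temperature: number;
}

const currentWeather: Schema<CurrentWeather> = {
  parse(value) {
    const v = value as Partial<CurrentWeather>;
    if (typeof v.city !== 'string' || typeof v.temperature !== 'number') {
      throw new Error('not a CurrentWeather');
    }
    return {city: v.city, temperature: v.temperature};
  },
};

// Declaring the output schema first fixes O; the handler must then return O.
function defineTool<O>(name: string, output: Schema<O>) {
  return <I>(handler: (input: I) => O) => ({
    name,
    // Runtime parse as a second line of defense behind the compile-time check.
    call: (input: I): O => output.parse(handler(input)),
  });
}

const getCurrent = defineTool('get_current_weather', currentWeather)(
  (input: {city: string}) => ({city: input.city, temperature: 21.5}),
);

const drifted = defineTool('get_current_weather', currentWeather)(
  // @ts-expect-error: the handler returns 'temp', so it is not assignable to CurrentWeather
  (input: {city: string}) => ({city: input.city, temp: 21.5}),
);
```

Removing the `@ts-expect-error` line turns the drifted handler into a build failure, which is the compile-time signal described above. The runtime `parse` is there only to show what happens if the type check is bypassed.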
Adding a second transport without touching the contract
The initial commit served stdio only — the right starting point for an MCP server you wire into a local AI assistant. The next meaningful capability was remote access: a Streamable HTTP transport so clients on the network could connect without spawning a subprocess.
The agent added this as a new entry point, serve-http.ts,
with API-key authentication and per-tool rate limiting:
const app = buildHttpApp({port: PORT});
installMcpHttp(app, {
  rateLimits: [
    {methods: ['tools/call'], points: 60, duration: 60},
    {methods: ['tools/call'], toolNames: ['get_forecast'], points: 20, duration: 60},
  ],
});
await app.start();
The buildHttpApp helper is just a factory that boots the
same application with the same DI wiring — the same schemas, the same
tool class, the same service — and adds the Streamable HTTP transport
on top. The schemas didn't change. The tool class didn't change. The
MCP tool definitions clients received were identical to the stdio
surface.
The agent had no architectural choices to invent here. The DI container already had the tools registered; HTTP was a new server binding, not a new set of tools. The framework's extension-point model meant the second transport was additive, not structural.
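One plausible reading of the `rateLimits` entries in the HTTP entry point is "60 tool calls per minute in general, 20 per minute for `get_forecast`". The sketch below shows how rules of that shape could be evaluated. The field names mirror the configuration above, but everything else is an assumption: the rule-precedence choice, the fixed-window counter, the per-caller-and-tool key, and the helper names `pickRule` and `allow` are all hypothetical and may differ from what AgentBack actually does.

```typescript
// Field names mirror the rateLimits entries; the semantics below are assumed.
interface RateLimitRule {
  methods: string[];
  toolNames?: string[];
  points: number;   // allowed calls per window
  duration: number; // window length in seconds
}

const rules: RateLimitRule[] = [
  {methods: ['tools/call'], points: 60, duration: 60},
  {methods: ['tools/call'], toolNames: ['get_forecast'], points: 20, duration: 60},
];

// Assumption: a rule that names the tool wins over a method-wide rule.
function pickRule(method: string, toolName: string): RateLimitRule | undefined {
  const matching = rules.filter(
    r => r.methods.includes(method) && (!r.toolNames || r.toolNames.includes(toolName)),
  );
  return matching.find(r => r.toolNames) ?? matching[0];
}

// Fixed-window counter keyed by caller and tool (also an assumption).
const windows = new Map<string, {start: number; used: number}>();

function allow(apiKey: string, method: string, toolName: string, nowMs: number): boolean {
  const rule = pickRule(method, toolName);
  if (!rule) return true; // no matching rule, no limit
  const key = `${apiKey}:${toolName}`;
  const w = windows.get(key);
  if (!w || nowMs - w.start >= rule.duration * 1000) {
    windows.set(key, {start: nowMs, used: 1}); // start a fresh window
    return true;
  }
  if (w.used >= rule.points) return false; // budget spent for this window
  w.used += 1;
  return true;
}
```

Note that the tool handlers never appear in this code, which matches the article's observation that the limits live at one configuration site and the tools are unaware of them.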
A third surface — the dev console, a web UI combining an MCP inspector, an OpenAPI explorer, and a DI container viewer — was added with the same pattern: another entry point, same wiring. Three ways to run the server; one place where the contract lives.
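The "one wiring, several entry points" shape can be illustrated in a few lines. This is a deliberately simplified sketch: the message format is not MCP's actual JSON-RPC framing, and `buildTools`, `dispatch`, `handleStdioLine`, and `handleHttpBody` are hypothetical names rather than AgentBack APIs. It only shows why adding a transport does not touch the tool definitions.

```typescript
type ToolHandler = (args: Record<string, unknown>) => unknown;
type ToolRegistry = Map<string, ToolHandler>;

// The single wiring function: every entry point calls this, and nothing else registers tools.
function buildTools(): ToolRegistry {
  const tools: ToolRegistry = new Map();
  tools.set('get_current_weather', args => ({city: String(args.city), temperature: 21.5}));
  return tools;
}

// Shared dispatch: transports differ only in how a request arrives and leaves.
function dispatch(tools: ToolRegistry, name: string, args: Record<string, unknown>) {
  const handler = tools.get(name);
  if (!handler) return {isError: true, result: `unknown tool: ${name}`};
  return {isError: false, result: handler(args)};
}

// Entry point 1: one JSON message per line, as a stdio loop might read it.
function handleStdioLine(tools: ToolRegistry, line: string): string {
  const {name, args} = JSON.parse(line) as {name: string; args: Record<string, unknown>};
  return JSON.stringify(dispatch(tools, name, args));
}

// Entry point 2: an already-parsed HTTP request body.
function handleHttpBody(
  tools: ToolRegistry,
  body: {name: string; args: Record<string, unknown>},
) {
  return {status: 200, body: dispatch(tools, body.name, body.args)};
}

const tools = buildTools();
```

A third entry point would be another small adapter over `dispatch`, with `buildTools` left unchanged.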
Tests that don't start a process
The test suite was the agent's next step, and it was where
@agentback/testing's createTestApp paid off
most clearly:
await using setup = await createTestApp(buildTestApp);

it('registers the expected tools', async () => {
  const tools = await setup.mcp.listTools();
  expect(tools.map(t => t.name)).toContain('get_current_weather');
  expect(tools.map(t => t.name)).toContain('get_forecast');
});

it('rejects a call without a location', async () => {
  const result = await setup.mcp.callTool('get_current_weather', {});
  expect(result.isError).toBe(true);
  expect(result.content[0].text).toMatch(/city.*latitude/i);
});
setup.mcp is an in-memory MCP client — no process spawn,
no port binding, no stdio pipe. The MCP session runs entirely in
memory. The agent could write and iterate tests without managing
processes or race conditions, and CI could run them without a live
Open-Meteo connection. The await using pattern handles
cleanup automatically.
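The "rejects a call without a location" test implies a rule the schema excerpt does not show: every field of `LocationSelector` is optional, so something must require either a city or a coordinate pair. The article does not say where that rule lives (a schema refinement, the tool method, or the service), so the following is only a sketch of what such a check might look like. The `checkLocation` name and the error wording are hypothetical.

```typescript
interface LocationInput {
  city?: string;
  latitude?: number;
  longitude?: number;
}

type Check = {ok: true} | {ok: false; message: string};

// Accept a non-empty city, or a complete latitude/longitude pair.
function checkLocation(input: LocationInput): Check {
  const hasCity = typeof input.city === 'string' && input.city.trim() !== '';
  const hasCoords = input.latitude !== undefined && input.longitude !== undefined;
  if (hasCity || hasCoords) return {ok: true};
  return {ok: false, message: 'Provide either city, or both latitude and longitude.'};
}
```

A message of this form would satisfy the `/city.*latitude/i` expectation in the test above, and a lone `latitude` without `longitude` would be rejected as well.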
The agent also wrote HTTP auth tests using the supertest bridge
(setup.http), verifying that requests without a valid
x-api-key got a 401. Both test surfaces — MCP and REST —
came from createTestApp without any additional setup.
What the framework didn't enforce
A brooks-lint review of the finished codebase gave it 93/100 and found
exactly one substantive gap: the network and response-mapping logic
inside WeatherService had no test seam.
The root cause was structural. The getJson helper called
the global fetch directly, without a seam for injecting a
fake:
async function getJson<T>(url: string): Promise<T> {
  const res = await fetch(url, {signal: AbortSignal.timeout(15_000)});
  if (!res.ok) throw new WeatherError(`HTTP ${res.status}`);
  return res.json() as T;
}
The WMO-code translation, the unit fallback logic, and the full
response-shaping pipeline in current() and
forecast() were exercised only by the live API. The green
test suite read as "covered" while the part most likely to break on an
upstream change had zero protection.
The fix is mechanical — accept an injectable fetcher via a constructor
parameter or a CoreBindings.FETCH binding, then write unit
tests against a canned Open-Meteo fixture. The framework supports the
pattern directly: CoreBindings.FETCH is the injectable
fetch seam, and createTestApp accepts binding
overrides. The agent just didn't reach for it.
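Here is a minimal sketch of the constructor-parameter variant of that fix. It is not the demo's actual `WeatherService`: the `JsonFetcher` type, the trimmed-down `current()` signature, the three-entry WMO table, and the request URL and fixture fields are assumptions chosen for illustration. The `CoreBindings.FETCH` route is not shown because its API is not described in this article.

```typescript
// Assumption: the service only needs "URL in, parsed JSON out", so the seam is that narrow.
type JsonFetcher = (url: string) => Promise<unknown>;

// Production default: wraps the same global fetch the original getJson called.
const liveFetcher: JsonFetcher = async url => {
  const res = await fetch(url, {signal: AbortSignal.timeout(15_000)});
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
};

// A few WMO weather codes for illustration; the real table is longer.
const WMO: Record<number, string> = {0: 'Clear sky', 3: 'Overcast', 61: 'Slight rain'};

class WeatherService {
  private readonly getJson: JsonFetcher;

  // The fetcher is injectable; callers that pass nothing get the live one.
  constructor(getJson: JsonFetcher = liveFetcher) {
    this.getJson = getJson;
  }

  async current(latitude: number, longitude: number) {
    // Illustrative URL; the demo's real query parameters are not shown in the article.
    const url =
      'https://api.open-meteo.com/v1/forecast' +
      `?latitude=${latitude}&longitude=${longitude}&current=temperature_2m,weather_code`;
    const raw = (await this.getJson(url)) as {
      current: {temperature_2m: number; weather_code: number};
    };
    return {
      temperature: raw.current.temperature_2m,
      description: WMO[raw.current.weather_code] ?? 'Unknown',
    };
  }
}

// The test seam: a canned payload stands in for the network.
const fakeFetcher: JsonFetcher = async () => ({
  current: {temperature_2m: 4.2, weather_code: 61},
});
const service = new WeatherService(fakeFetcher);
```

With the fake in place, the WMO translation and the response shaping can be asserted in a plain unit test, for example that `await service.current(59.91, 10.75)` maps code 61 to the expected description, without any live call.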
This is the honest limit. The framework can enforce schema coherence at compile time and make the right wiring pattern easy; it cannot enforce test coverage or prevent a module-level function from closing over the global. Where the framework's structure doesn't reach, the agent navigates on its own — and navigates no better than it would without the framework.
The framework can make the wrong structure hard to build. It cannot make a missing test visible.
What the commit history shows
The development arc has a shape that is legible in the git log:
- Initial commit — schemas first, then tool class, then service. The contract preceded the implementation.
- HTTP transport — new entry point, same wiring. No changes to schemas or tools.
- Dev console — third entry point, same pattern.
- Auth and rate limiting — one configuration site in the HTTP entry point. The tools were unaware.
- Tests — createTestApp with in-memory MCP and supertest. No process management.
- Component refactor — the agent packaged the weather wiring as an AgentBack Component, a pattern it learned from the framework's DI idioms rather than invented.
- Architecture docs — a layer diagram and a narrative README, generated with the agent's help and committed alongside the code.
Each commit was additive. The schemas written in the first commit were never replaced — only the implementation around them changed. The framework's one-source-of-truth property held across the whole trajectory.
What an AI-native development experience looks like
The agent that built this server was tireless and fast. It also made the usual mistakes: field name mismatches, missing test seams, a duplicated configuration literal across two entry points that the brooks-lint review flagged as a maintenance risk.
The framework changed where those mistakes showed up. The schema mismatches surfaced at the decoration line, not in a test. The duplicated configuration literal was confined to two entry points, not spread across handlers. The missing test seam was isolated in the service layer, not buried in a route handler where it would have been harder to see. In each case the framework's structure meant the error signal was localized and early.
That is the claim in practice: not that the agent makes fewer mistakes, but that the mistakes it makes are smaller, earlier, and more self-describing. The agent still needs a human (or a reviewer like brooks-lint) to see the test seam gap. But the structural mistakes — the drift between surfaces, the mismatched schema, the missing DI registration — those the framework closes.
The bet is that in a world where agents write most of the code, the framework's job is to be the part of the system that never drifts.