Building with an agent inside the framework

What actually changes when a coding agent builds inside a contract-enforcing framework instead of against an empty canvas. A development story told through the commits of agentback-demo.

The claim at the center of AgentBack is that encoding best practice in the framework — rather than in a code review comment or a README — lets an AI coding agent produce better-structured code with less drift. That is easy to assert. Here is a concrete account of what it looked like to test it.

The agentback-demo is a real reference application: a weather MCP server that exposes Open-Meteo data as MCP tools, served three ways from a single DI wiring. It was built end-to-end with Claude Code as the coding agent, starting from an empty directory and a single sentence of intent. The result is roughly 600 lines of TypeScript, zero hand-maintained OpenAPI or MCP schema files, three transports (stdio, Streamable HTTP, dev console), API-key auth, per-tool rate limiting, a real test suite, and an architecture diagram. It earned a 93/100 health score from a brooks-lint review on first pass.

What follows is a walk through the development arc — the commits, the moments the framework helped, and the one place it couldn't.

The contract before the implementation

The agent's first meaningful act was not writing a route handler. It was writing schemas.ts:

export const LocationSelector = z.object({
  city:      z.string().optional(),
  latitude:  z.number().min(-90).max(90).optional(),
  longitude: z.number().min(-180).max(180).optional(),
});

export const CurrentWeather = z.object({
  city:        z.string(),
  temperature: z.number(),
  feelsLike:   z.number(),
  humidity:    z.number(),
  windSpeed:   z.number(),
  description: z.string(),
  time:        z.string(),
});

export const ForecastDay = z.object({ /* … */ });
export const Forecast    = z.object({ days: z.array(ForecastDay) });

This was the first commit. Nothing ran yet. But once those schemas existed, everything else had a single source to pull from: the MCP tool input and output, the TypeScript types across the whole codebase, and the emitted MCP inputSchema that tools/list returns to clients. The agent didn't have to synchronize any of those separately; they were all derived.

This is the spec-first moment the framework is designed around. The contract lands before the behavior, enforced by the compiler, and every subsequent step is implementation inside a shape that is already locked.
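The single-source idea can be sketched without any library. What follows is a minimal illustration under stated assumptions, not how Zod or AgentBack implement it: Shape, Infer, toJsonSchema, parseShape, and the trimmed-down two-field weather shape are all hypothetical names invented for this example. One declaration yields a static type, a runtime validator, and an emitted schema description.

```typescript
// Hypothetical mini schema language; the real project uses Zod for this role.
type FieldKind = 'string' | 'number';
type Shape = Record<string, FieldKind>;

// Static type derived from the shape (the role z.infer plays for Zod schemas).
type Infer<S extends Shape> = {
  [K in keyof S]: S[K] extends 'string' ? string : number;
};

// JSON-Schema-like description derived from the same shape.
function toJsonSchema(shape: Shape) {
  const properties: Record<string, {type: FieldKind}> = {};
  for (const [field, kind] of Object.entries(shape)) {
    properties[field] = {type: kind};
  }
  return {type: 'object' as const, properties, required: Object.keys(shape)};
}

// Runtime validator derived from the same shape.
function parseShape<S extends Shape>(shape: S, value: unknown): Infer<S> {
  if (typeof value !== 'object' || value === null) {
    throw new Error('expected an object');
  }
  const record = value as Record<string, unknown>;
  for (const [field, kind] of Object.entries(shape)) {
    if (typeof record[field] !== kind) {
      throw new Error(`${field}: expected ${kind}`);
    }
  }
  return value as Infer<S>;
}

// One declaration, three derived artifacts.
const weatherShape = {city: 'string', temperature: 'number'} as const;
type WeatherLite = Infer<typeof weatherShape>;

const parsedWeather: WeatherLite = parseShape(weatherShape, {city: 'Oslo', temperature: 4.5});
const emittedSchema = toJsonSchema(weatherShape);
```

Renaming a field in weatherShape changes the static type, the validator, and the emitted schema together, which is the property the paragraph above describes.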

Where the type error caught the agent

In the second commit, the agent wrote the tool class and reached for a return shape that differed slightly from the declared output schema — a field name mismatch that would have been invisible until a client tried to parse the response.

@tool('get_current_weather', {input: LocationSelector, output: CurrentWeather})
async getCurrent(input: z.infer<typeof LocationSelector>) {
  const data = await this.weather.current(input);
  return {
    city:        data.city,
    temp:        data.temperature,  // mismatch: 'temp' is not a key of CurrentWeather; tsc reports it at the @tool line
    // …
  };
}

The TypeScript error appeared at the @tool decoration line, naming the property mismatch precisely. The agent corrected the field name, ran the build again, and moved on. No test needed to catch it. No runtime surprise. The framework's generic constraint on the method return type was the signal, and it fired before the code ever ran.

This is the class of mistake agents make often. Without an output schema on the tool, the error would have surfaced as a behavioral failure in a downstream consumer: a test, a CI run, or a confused MCP client. With it, the signal was a single-line TS error at the definition site. That narrows the search from every consumer of the tool to one line of its definition.
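How a generic constraint can tie a handler's return type to a declared output schema is easier to see in a function-based sketch. This is an assumption-laden illustration, not AgentBack's decorator: Schema, ToolSpec, defineTool, passthrough, and the reduced WeatherOut type are hypothetical, and the stand-in schema does no real validation.

```typescript
// Hypothetical stand-in for a schema: only its static output type matters here.
interface Schema<T> {
  parse(value: unknown): T;
}

interface ToolSpec<I, O> {
  toolName: string;
  input: Schema<I>;
  output: Schema<O>;
}

// The handler's return type is bound to O, the output schema's type.
function defineTool<I, O>(spec: ToolSpec<I, O>, handler: (input: I) => Promise<O>) {
  return {
    toolName: spec.toolName,
    async call(raw: unknown): Promise<O> {
      const result = await handler(spec.input.parse(raw));
      return spec.output.parse(result);
    },
  };
}

// A schema that trusts its input; a real one would validate.
function passthrough<T>(): Schema<T> {
  return {parse: value => value as T};
}

interface WeatherOut {
  city: string;
  temperature: number;
}

const weatherTool = defineTool(
  {
    toolName: 'get_current_weather',
    input: passthrough<{city: string}>(),
    output: passthrough<WeatherOut>(),
  },
  // Returning `temp: 21.5` instead of `temperature: 21.5` here would not
  // compile, because the object would no longer be assignable to WeatherOut.
  async input => ({city: input.city, temperature: 21.5}),
);
```

The compiler rejects the mismatched handler before anything runs, which matches the behavior described above, although the real decorator may report the error at a different location.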

Adding a second transport without touching the contract

The first runnable version served stdio only, the right starting point for an MCP server you wire into a local AI assistant. The next meaningful capability was remote access: a Streamable HTTP transport so clients on the network could connect without spawning a subprocess.

The agent added this as a new entry point, serve-http.ts, with API-key authentication and per-tool rate limiting:

const app = buildHttpApp({port: PORT});
installMcpHttp(app, {
  rateLimits: [
    {methods: ['tools/call'], points: 60, duration: 60},
    {methods: ['tools/call'], toolNames: ['get_forecast'], points: 20, duration: 60},
  ],
});
await app.start();

The buildHttpApp helper is just a factory that boots the same application with the same DI wiring — the same schemas, the same tool class, the same service — and adds the Streamable HTTP transport on top. The schemas didn't change. The tool class didn't change. The MCP tool definitions clients received were identical to the stdio surface.

The agent had no architectural choices to invent here. The DI container already had the tools registered; HTTP was a new server binding, not a new set of tools. The framework's extension-point model meant the second transport was additive, not structural.
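The rateLimits entries above read naturally as "points allowed per duration", with a stricter rule layered on one tool. The sketch below shows one possible interpretation and is not the framework's limiter: it assumes duration is in seconds, that a call must satisfy every rule it matches, and that a fixed window is acceptable. FixedWindowLimiter and its allow method are hypothetical names.

```typescript
interface RateLimitRule {
  methods: string[];
  toolNames?: string[];
  points: number;   // calls allowed per window
  duration: number; // window length in seconds (assumed unit)
}

class FixedWindowLimiter {
  private rules: RateLimitRule[];
  private now: () => number;
  private windows = new Map<string, {start: number; used: number}>();

  constructor(rules: RateLimitRule[], now: () => number = Date.now) {
    this.rules = rules;
    this.now = now;
  }

  // Returns true and consumes one point from every matching rule,
  // or returns false without consuming anything.
  allow(client: string, method: string, toolName?: string): boolean {
    const time = this.now();
    const matched: Array<{rule: RateLimitRule; state: {start: number; used: number}}> = [];
    this.rules.forEach((rule, index) => {
      const methodMatches = rule.methods.includes(method);
      const toolMatches =
        rule.toolNames === undefined ||
        (toolName !== undefined && rule.toolNames.includes(toolName));
      if (!methodMatches || !toolMatches) return;
      const key = `${index}:${client}`;
      let state = this.windows.get(key);
      if (!state || time - state.start >= rule.duration * 1000) {
        state = {start: time, used: 0}; // start a fresh window
        this.windows.set(key, state);
      }
      matched.push({rule, state});
    });
    if (matched.some(({rule, state}) => state.used >= rule.points)) return false;
    for (const {state} of matched) state.used += 1;
    return true;
  }
}

// The two rules from the entry point above, driven by a fake clock.
let fakeTime = 0;
const limiter = new FixedWindowLimiter(
  [
    {methods: ['tools/call'], points: 60, duration: 60},
    {methods: ['tools/call'], toolNames: ['get_forecast'], points: 20, duration: 60},
  ],
  () => fakeTime,
);
```

Under these assumptions a client gets 20 get_forecast calls per minute, and those calls also count against the general budget of 60.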

A third surface — the dev console, a web UI combining an MCP inspector, an OpenAPI explorer, and a DI container viewer — was added with the same pattern: another entry point, same wiring. Three ways to run the server; one place where the contract lives.
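The "another entry point, same wiring" pattern can be sketched as one registry with thin transport bindings around it. This is a simplified illustration: the message format below is not real MCP JSON-RPC framing, and buildRegistry, dispatch, handleLine, and handleHttp are hypothetical names rather than AgentBack APIs.

```typescript
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;
type Registry = Map<string, ToolHandler>;

interface ToolRequest {
  method: 'tools/list' | 'tools/call';
  tool?: string;
  args?: Record<string, unknown>;
}

// The single wiring step: every transport receives this same registry.
function buildRegistry(): Registry {
  const registry: Registry = new Map();
  registry.set('get_current_weather', async args => ({
    city: String(args.city),
    temperature: 21.5, // canned value for the sketch
  }));
  return registry;
}

// Transport-independent dispatch.
async function dispatch(registry: Registry, request: ToolRequest): Promise<unknown> {
  if (request.method === 'tools/list') return Array.from(registry.keys());
  const handler = registry.get(request.tool ?? '');
  if (!handler) throw new Error(`unknown tool: ${request.tool}`);
  return handler(request.args ?? {});
}

// stdio-like binding: one JSON document per line in, one out.
async function handleLine(registry: Registry, line: string): Promise<string> {
  return JSON.stringify(await dispatch(registry, JSON.parse(line) as ToolRequest));
}

// HTTP-like binding: a parsed body in, a status and body out.
async function handleHttp(
  registry: Registry,
  body: ToolRequest,
): Promise<{status: number; body: unknown}> {
  try {
    return {status: 200, body: await dispatch(registry, body)};
  } catch (error) {
    return {status: 400, body: {error: (error as Error).message}};
  }
}

const sharedRegistry = buildRegistry();
```

Adding a transport means adding another function like handleLine or handleHttp; the registry and the handlers it holds stay untouched.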

Tests that don't start a process

The test suite was the agent's next step, and it was where @agentback/testing's createTestApp paid off most clearly:

await using setup = await createTestApp(buildTestApp);

it('registers the expected tools', async () => {
  const tools = await setup.mcp.listTools();
  expect(tools.map(t => t.name)).toContain('get_current_weather');
  expect(tools.map(t => t.name)).toContain('get_forecast');
});

it('rejects a call without a location', async () => {
  const result = await setup.mcp.callTool('get_current_weather', {});
  expect(result.isError).toBe(true);
  expect(result.content[0].text).toMatch(/city.*latitude/i);
});

setup.mcp is an in-memory MCP client — no process spawn, no port binding, no stdio pipe. The MCP session runs entirely in memory. The agent could write and iterate tests without managing processes or race conditions, and CI could run them without a live Open-Meteo connection. The await using pattern handles cleanup automatically.

The agent also wrote HTTP auth tests using the supertest bridge (setup.http), verifying that requests without a valid x-api-key got a 401. Both test surfaces — MCP and REST — came from createTestApp without any additional setup.
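The behavior those auth tests pin down can be stated as a small pure function. This is a hypothetical sketch rather than the framework's middleware: checkApiKey is an invented name, and header names are assumed to arrive lowercased, as Node's HTTP server delivers them.

```typescript
interface AuthResult {
  status: 200 | 401;
}

// Accepts the request only when x-api-key is present and known.
function checkApiKey(
  headers: Record<string, string | undefined>,
  validKeys: ReadonlySet<string>,
): AuthResult {
  const presented = headers['x-api-key'];
  if (presented === undefined || !validKeys.has(presented)) {
    return {status: 401};
  }
  return {status: 200};
}

const demoKeys: ReadonlySet<string> = new Set(['demo-key']);
const missingKey = checkApiKey({}, demoKeys);
const wrongKey = checkApiKey({'x-api-key': 'nope'}, demoKeys);
const goodKey = checkApiKey({'x-api-key': 'demo-key'}, demoKeys);
```

A production check would likely also want a constant-time comparison; the sketch leaves that out to stay dependency-free.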

What the framework didn't enforce

A brooks-lint review of the finished codebase gave it 93/100 and found exactly one substantive gap: the network and response-mapping logic inside WeatherService had no test seam.

The root cause was structural. The getJson helper called the global fetch directly, without a seam for injecting a fake:

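async function getJson<T>(url: string): Promise<T> {
  // Calls the global fetch directly, so tests cannot substitute a fake.
  const res = await fetch(url, {signal: AbortSignal.timeout(15_000)});
  if (!res.ok) throw new WeatherError(`HTTP ${res.status}`);
  // Await the body before asserting its type; json() returns a Promise.
  return (await res.json()) as T;
}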

The WMO-code translation, the unit fallback logic, and the full response-shaping pipeline in current() and forecast() were exercised only by the live API. The green test suite read as "covered" while the part most likely to break on an upstream change had zero protection.

The fix is mechanical — accept an injectable fetcher via a constructor parameter or a CoreBindings.FETCH binding, then write unit tests against a canned Open-Meteo fixture. The framework supports the pattern directly: CoreBindings.FETCH is the injectable fetch seam, and createTestApp accepts binding overrides. The agent just didn't reach for it.
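The constructor-parameter variant of that fix might look like the sketch below. It is an illustration under assumptions, not the demo's code: WeatherServiceSketch, Fetcher, HttpResponse, and describeAt are hypothetical, the WMO table is a partial excerpt, and the fixture mimics only the one Open-Meteo field the sketch reads. How CoreBindings.FETCH or createTestApp overrides would be wired is not shown, since their signatures are not given here.

```typescript
// Minimal structural types so the sketch needs no DOM or Node typings.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type Fetcher = (url: string) => Promise<HttpResponse>;

class WeatherError extends Error {}

// Partial WMO weather-code table; not exhaustive.
const WMO_DESCRIPTIONS: Record<number, string> = {
  0: 'Clear sky',
  3: 'Overcast',
  61: 'Slight rain',
};

class WeatherServiceSketch {
  private fetcher: Fetcher;

  // Production code would pass the global fetch; tests pass a fake.
  constructor(fetcher: Fetcher) {
    this.fetcher = fetcher;
  }

  private async getJson<T>(url: string): Promise<T> {
    const res = await this.fetcher(url);
    if (!res.ok) throw new WeatherError(`HTTP ${res.status}`);
    return (await res.json()) as T;
  }

  // Response mapping that can now be unit-tested offline.
  async describeAt(latitude: number, longitude: number): Promise<string> {
    const url =
      'https://api.open-meteo.com/v1/forecast' +
      `?latitude=${latitude}&longitude=${longitude}&current=weather_code`;
    const data = await this.getJson<{current: {weather_code: number}}>(url);
    return WMO_DESCRIPTIONS[data.current.weather_code] ?? 'Unknown';
  }
}

// Canned fixtures standing in for upstream responses (illustrative shape).
const rainyFetch: Fetcher = async () => ({
  ok: true,
  status: 200,
  json: async () => ({current: {weather_code: 61}}),
});
const failingFetch: Fetcher = async () => ({
  ok: false,
  status: 503,
  json: async () => ({}),
});

const rainyDescription = new WeatherServiceSketch(rainyFetch).describeAt(59.91, 10.75);
const failedDescription = new WeatherServiceSketch(failingFetch).describeAt(59.91, 10.75);
failedDescription.catch(() => undefined); // observed below; avoids an unhandled rejection
```

With the seam in place, the code translation and the error path are both exercised without a network connection.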

This is the honest limit. The framework can enforce schema coherence at compile time and make the right wiring pattern easy; it cannot enforce test coverage or prevent a module-level function from closing over the global. Where the framework's structure doesn't reach, the agent navigates on its own — and navigates no better than it would without the framework.

The framework can make the wrong structure hard to build. It cannot make a missing test visible.

What the commit history shows

The development arc has a shape that is legible in the git log: schemas first, then the tool class, then the additional transports, then the tests. Each commit was additive. The schemas written in the first commit were never replaced; only the implementation around them changed. The framework's one-source-of-truth property held across the whole trajectory.

What an AI-native development experience looks like

The agent that built this server was tireless and fast. It also made the usual mistakes: field name mismatches, missing test seams, a duplicated configuration literal across two entry points that the brooks-lint review flagged as a maintenance risk.

The framework changed where those mistakes showed up. The schema mismatches surfaced at the decoration line, not in a test. The duplicated port logic was contained to two entry points, not spread across handlers. The missing test seam was isolated to the service layer, not to a route handler where it would have been harder to see. In each case the framework's structure meant the error signal was localized and early.

That is the claim in practice: not that the agent makes fewer mistakes, but that the mistakes it makes are smaller, earlier, and more self-describing. The agent still needs a human (or a reviewer like brooks-lint) to see the test seam gap. But the structural mistakes — the drift between surfaces, the mismatched schema, the missing DI registration — those the framework closes.

The bet is that in a world where agents write most of the code, the framework's job is to be the part of the system that never drifts.
