Building the Harness: Memory, Rules, and Skills That Make AI Development Actually Work

By Carl Tierney

You’re not using AI to write code. You’re using AI to write code wrong, consistently, from scratch, every single session — because you never taught it how your system works.

I know this because I lived it. I’ve been using AI coding tools daily for over two years across multiple enterprise projects. For the first year, I treated them like autocomplete with ambition. I’d open a session, describe what I needed, get code that looked plausible, then spend the next hour fixing everything that didn’t fit — wrong JSON casing, swallowed exceptions, React components in a codebase that uses Alpine.js, queries joining on name strings instead of GUIDs.

The AI wasn’t broken. It was uninformed. And every session, it was uninformed in exactly the same ways.

The day-one problem

Here’s what happens when you use an AI coding tool without structure: every session starts from zero. The model doesn’t know that your financial system uses PascalCase JSON because the backend runs Newtonsoft.Json with a DefaultContractResolver, which serializes C# property names unchanged. It doesn’t know that Account 2100 in your chart of accounts is unreliable for capacity calculations. It doesn’t know that your workflow entities use IncidentUser, not User, and BaseRole, not Role. It doesn’t know that every new view needs to go through a UX review gate before it ships.

So it generates code that violates all of these constraints. You catch some. You miss others. The ones you miss become bugs in production. The ones you catch, you fix manually — and then the next session, the same mistakes come back.

This is the experience of every developer using AI tools today. The model is capable. The context is missing.

What a harness is

A harness is the layer between the AI model and your codebase that encodes your organization’s knowledge into AI behavior. It’s not a plugin. It’s not a prompt template. It’s a structured system with four components:

Memory — persistent knowledge about your project, your domain, and the decisions you’ve made. Survives across sessions.

Rules — constraints and standards that are automatically enforced based on what the AI is working on. Not suggestions. Guardrails.

Skills — reusable workflows for complex, multi-step operations that need to be executed the same way every time.

Subagents — specialized workers with defined domains, authority boundaries, and their own context. Not one AI doing everything. Focused agents doing what they’re good at.

Together, these four components turn a general-purpose AI model into something that understands your system.

Rules: constraints that fire automatically

The most immediate win is rules. In the harness I built for a financial application managing consolidation accounting for an international organization, I have eleven rules that auto-load based on file patterns.

When the AI opens a C# file, it automatically knows: never swallow exceptions, because every error must surface to the user. When it opens a .cshtml view, it knows: all interactive UI uses Alpine.js, not React, not Vue, not jQuery. When it touches SQL, it knows: use GUIDs for foreign key joins, not name lookups; wrap statements in BEGIN/END transaction blocks; and avoid reserved words as identifiers.

These aren’t guidelines I hope the AI follows. They’re constraints the harness enforces every time, triggered by the files being edited. The AI doesn’t need to remember that your system uses PascalCase JSON — the rule fires automatically when it touches any JavaScript or view file.
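
A minimal sketch of how file-pattern rule loading could work, written in Python for illustration. The `Rule` class, the rule names, and the `rules_for` and `context_preamble` helpers are hypothetical; they paraphrase the constraints described above and are not the actual harness or the API of any particular tool.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Rule:
    name: str
    patterns: tuple  # filename globs that trigger the rule
    text: str        # instruction injected into the AI's context


# Hypothetical rule set, paraphrasing constraints described in the article.
RULES = [
    Rule("no-swallowed-exceptions", ("*.cs",),
         "Never swallow exceptions; every error must surface to the user."),
    Rule("alpine-only", ("*.cshtml",),
         "All interactive UI uses Alpine.js, not React, Vue, or jQuery."),
    Rule("pascalcase-json", ("*.js", "*.cshtml"),
         "JSON property names are PascalCase to match the backend."),
    Rule("sql-guid-joins", ("*.sql",),
         "Join on GUID foreign keys, never on name strings."),
]


def rules_for(path: str) -> list:
    """Return every rule whose glob matches the file's name."""
    filename = path.replace("\\", "/").rsplit("/", 1)[-1]
    return [r for r in RULES if any(fnmatch(filename, p) for p in r.patterns)]


def context_preamble(path: str) -> str:
    """Build the text block that would be prepended to the AI's context."""
    return "\n".join(f"- [{r.name}] {r.text}" for r in rules_for(path))
```

Calling `context_preamble("Views/Grants/Index.cshtml")` would yield the Alpine.js and PascalCase rules and nothing about SQL, so the model only carries the constraints relevant to the file in front of it.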

The rules exist because each one represents a mistake that happened in production. PascalCase JSON cost me a full day of debugging when the frontend silently failed to bind data because the AI generated camelCase property names. That will never happen again — not because I’ll remember to mention it, but because the harness won’t let it through.
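
Beyond injecting the rule as text, a harness could also check output mechanically. The sketch below is one assumed way to catch the camelCase problem: it walks a parsed JSON payload and reports keys that do not start with an uppercase letter. The function name is hypothetical, and the first-letter test is a deliberate simplification of "PascalCase".

```python
import json
from typing import Any


def non_pascal_keys(value: Any, path: str = "") -> list:
    """Return the dotted paths of object keys that do not start uppercase."""
    found = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            if not key[:1].isupper():
                found.append(child_path)
            # Keep descending so nested violations are reported too.
            found.extend(non_pascal_keys(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(non_pascal_keys(child, f"{path}[{index}]"))
    return found


# Illustrative payload; the field names are made up for this example.
sample = json.loads('{"AccountId": 1, "lines": [{"Amount": 2, "memo": "x"}]}')
violations = non_pascal_keys(sample)  # ["lines", "lines[0].memo"]
```

A check like this could run as part of a verification step, failing the change whenever `violations` is non-empty.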

Memory: the knowledge that persists

Rules handle constraints. Memory handles knowledge.

For the same financial system, I’ve codified domain knowledge that would take a senior developer a week to absorb: a complete 30-account chart of accounts with debit/credit normals and granularity levels; the three distinct granting paths and how they flow through the data model; entity relationships that aren’t obvious from the schema; and known data issues, meaning specific discrepancies that exist in production and shouldn’t be “fixed” by an AI that doesn’t understand they’re expected.

This knowledge is structured in markdown files that load into context when relevant agents start working. The key insight: if a human developer would need tribal knowledge to get it right, your AI needs that same knowledge in memory.

Memory also captures feedback. When I correct the AI — “don’t do that, here’s why” — the harness records the correction with the reasoning. Next session, that correction is already loaded. The harness doesn’t make the same mistake twice, even if the model would.
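
One way to picture the feedback capture is sketched below. It is an assumption-laden illustration: the `Correction` fields, the JSON Lines storage file, and the helper names are all hypothetical (the article's harness stores memory in markdown files). The point is only that a correction is written with its reasoning and reloaded at the start of the next session.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Correction:
    topic: str       # what the AI was working on
    correction: str  # what it should do instead
    reason: str      # why, so the guidance generalizes


def record_correction(store: Path, item: Correction) -> None:
    """Append one correction as a JSON line; the file outlives the session."""
    with store.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(item)) + "\n")


def load_corrections(store: Path) -> list:
    """Read every stored correction; a missing file means no history yet."""
    if not store.exists():
        return []
    lines = store.read_text(encoding="utf-8").splitlines()
    return [Correction(**json.loads(line)) for line in lines if line.strip()]


def as_context(items: list) -> str:
    """Render corrections as bullets for the next session's context."""
    return "\n".join(
        f"- {c.topic}: {c.correction} (why: {c.reason})" for c in items
    )
```

Storing the reason alongside the correction matters: a bare "don't do that" only blocks one case, while the reasoning lets the model avoid neighboring mistakes.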

Subagents: specialization over generality

One of the most important shifts in how I work with AI was moving from a single agent doing everything to specialized agents with defined domains.

For the financial system, I have five agents, each with its own expertise and authority boundaries:

A subledger specialist that understands the consolidation accounting model — the 30-account chart, journal entry patterns, balance validation rules, diagnostic queries. It has thirteen non-negotiable invariants. It can modify stored procedures and run integration tests. It cannot touch frontend code.

A grant data specialist that understands grant operations — three granting paths, package types, allocation flows, and the specific formulas that govern spend capacity. It knows about known data issues and won’t try to “fix” intentional discrepancies.

A verification agent that doesn’t trust any of the other agents. It runs a two-gate verification process: backend gate (build, tests, SQL validation) and frontend gate (browser automation and UX review). Every claim of “it works” must be backed by evidence — test output, query results, screenshots. No exceptions.

A UX reviewer that checks views against the design system — 25 specific checks covering panel structure, table patterns, Alpine.js directives, and visual consistency. It can only read views and report issues. It cannot modify code directly.

An Excel export specialist that knows the reporting library’s patterns — outline grouping, helper methods, number formatting conventions.

Each agent loads its own domain knowledge on startup. The subledger specialist reads the chart of accounts and accounting invariants before touching any code. The UX reviewer reads the design system patterns. They don’t share context they don’t need.
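
The combination of per-agent knowledge and authority boundaries can be sketched as data plus one guard function. Everything named here is hypothetical: the `Agent` class, the memory file paths, and the glob patterns are illustrative stand-ins, not the real configuration.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Agent:
    name: str
    knowledge: tuple  # memory files the agent reads at startup
    may_write: tuple  # path globs the agent is allowed to modify

    def can_modify(self, path: str) -> bool:
        return any(fnmatch(path, pattern) for pattern in self.may_write)


# Hypothetical definitions modeled on two of the agents described above.
SUBLEDGER = Agent(
    name="subledger-specialist",
    knowledge=("memory/chart-of-accounts.md", "memory/accounting-invariants.md"),
    may_write=("sql/*.sql", "tests/integration/*"),
)
UX_REVIEWER = Agent(
    name="ux-reviewer",
    knowledge=("memory/design-system.md",),
    may_write=(),  # read-only: it reports issues but never edits
)


def authorize(agent: Agent, path: str) -> None:
    """Raise before an out-of-bounds edit is attempted."""
    if not agent.can_modify(path):
        raise PermissionError(f"{agent.name} may not modify {path}")
```

With this shape, `authorize(SUBLEDGER, "Views/Ledger/Index.cshtml")` raises `PermissionError`, which mirrors the rule that the subledger specialist cannot touch frontend code.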

This isn’t over-engineering. It’s the same principle behind microservices: bounded contexts produce better results than monoliths. A specialized agent with deep domain knowledge and clear authority boundaries writes better code than a generalist that knows a little about everything.

Verification gates: the harness doesn’t trust itself

This is the part most people skip, and it’s the part that matters most.

The harness includes mandatory verification gates that prevent incomplete or incorrect work from being declared done. Every significant change goes through two gates:

Backend gate: Does it build? Do the tests pass? Do SQL spot-checks confirm the data is correct? These aren’t optional steps the AI decides to run. They’re required by the harness before work can proceed.

Frontend gate: Does the UI render correctly? Does it match the design system? Do interactive elements work? This gate uses browser automation to actually verify the output — not just assert that the code looks right.

The verification agent requires evidence for every check. A screenshot of the rendered page. The actual test output. Query results showing the expected data. “It should work” is not evidence. “Here’s the test output showing it works” is.
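
The "no evidence, no pass" idea can be expressed in a few lines. The following is a sketch under assumptions: the `Check` and `Gate` classes and the `work_is_done` helper are hypothetical names, and real gates would collect evidence from build tools, test runners, and browser automation rather than from hand-written strings.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    name: str
    passed: bool
    evidence: str = ""  # test output, query result, or screenshot path


@dataclass
class Gate:
    name: str
    checks: list = field(default_factory=list)

    def failures(self) -> list:
        """A check fails if it did not pass OR has no evidence attached."""
        problems = []
        for check in self.checks:
            if not check.passed:
                problems.append(f"{self.name}/{check.name}: failed")
            elif not check.evidence.strip():
                problems.append(f"{self.name}/{check.name}: no evidence")
        return problems


def work_is_done(gates: list) -> bool:
    """Done only when there are gates, each has checks, and none fail."""
    return bool(gates) and all(
        bool(gate.checks) and not gate.failures() for gate in gates
    )
```

Note the design choice: a passing check with an empty `evidence` field is treated the same as a failure, and a gate with no checks at all never counts as passed.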

This is defensive engineering applied to AI-assisted development. You don’t trust the AI to be right. You verify that it’s right, with the same rigor you’d apply to a junior developer’s pull request.

The self-improvement loop

The harness gets smarter over time. At the end of each session, or automatically before the context window is compacted, a reflection process reviews what happened: what corrections were made, what mistakes occurred, what knowledge was missing.

From that review, the harness proposes updates: new rules to prevent recurring mistakes, new memory entries to capture domain knowledge that was discovered mid-session, updated agent context to fill gaps. Every correction is an investment in future accuracy.
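
A very small sketch of the proposal step, assuming corrections are logged as plain strings. The `propose_rules` function and its repeat threshold are hypothetical; an actual reflection process would likely use the model itself to summarize a session rather than simple counting.

```python
from collections import Counter


def propose_rules(corrections: list, threshold: int = 2) -> list:
    """Propose a rule for every correction seen at least `threshold` times."""
    # Normalize lightly so trivially different phrasings count together.
    counts = Counter(text.strip().lower() for text in corrections)
    return [
        f"PROPOSED RULE: {text}"
        for text, seen in counts.most_common()
        if seen >= threshold
    ]


# Illustrative session log; the entries are made up for this example.
session_log = [
    "Use PascalCase JSON keys",
    "use pascalcase json keys",
    "Join on GUIDs",
]
proposals = propose_rules(session_log)
```

The output is a proposal, not an automatic change: a person still decides whether a repeated correction deserves to become a permanent rule.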

This is the compounding advantage of a well-built harness. Session one is rough. Session ten is noticeably better. Session fifty, and the AI is operating with the institutional knowledge of a tenured team member — because the harness has captured that knowledge systematically over time.

The organizational case

If you’re an engineering leader, this pattern should look familiar. Memory is institutional knowledge. Rules are policies. Subagents are specialized teams. Verification gates are compliance checks. The self-improvement loop is continuous improvement.

It’s the same architecture as enterprise AI governance — encoding organizational knowledge into AI behavior so the technology operates within your constraints rather than despite them.

The uncomfortable truth: many organizations are investing in AI governance frameworks for their products while their own development teams use AI tools with zero governance. No rules. No persistent memory. No verification. Every developer’s AI starts from zero every morning and repeats the same mistakes in every session.

“We bought Copilot licenses” is not an AI development strategy. Building the harness is.

The tools don’t matter

Everything I’ve described works because of the architecture, not the specific tool. I use Claude Code. The same pattern applies to Cursor, Copilot, Windsurf, or whatever ships next quarter. The model provides capability. The harness provides context, constraints, and verification.

If you’re getting inconsistent results from AI coding tools, the fix isn’t a better model. It’s a better harness. Start with three rules that prevent your most common AI-generated bugs. Add a memory file with the domain knowledge your system requires. Build from there.

The developers who figure this out first will have a compounding advantage that grows with every session. The ones who don’t will keep fixing the same AI-generated mistakes, every day, forever.

The harness is the product. The model is just the engine.
