How to Build an Agentic Development Team
By Carl Tierney
A previous piece in this series argued that the new technical interview should assess whether a candidate can design an agentic harness — whether they can answer four questions: what’s a rule, what’s a subagent, what’s a skill, and how do you manage memory. The implicit follow-up is obvious: if that’s the assessment, how do you actually build the thing?
The answer isn’t a list of tools. It starts with understanding what a harness is encoding. A well-built agentic development environment is a structured representation of three things that every software project requires but almost never makes explicit:
- Process — how work moves through your development lifecycle, what the gates are, who reviews what and when
- Domain model — the problem space: PRDs, requirements, feature areas, workflows, the relationships between business concepts
- Technical architecture — the platform, the patterns, the constraints of your specific implementation environment
These three don’t exist independently. Process tells you how to work. Domain model tells you what you’re building. Technical architecture tells you the constraints of how you’re building it. An AI operating in your codebase without all three is working with partial information — and partial information produces code that’s technically executable but wrong in ways that take days to find.
The harness components — rules, subagents, skills, memory, and the knowledge graph — are how you encode these three inputs into AI behavior.
Rules: the technical invariants
Rules are the non-negotiable constraints of your technical architecture and development process. They’re not preferences. They’re not suggestions. They’re the things that, when violated, produce bugs, security issues, or architectural drift that compounds over time.
Rules load automatically based on file glob patterns: when the AI opens a file matching **/*.cs, the C# rules fire; when it touches a .sql file, the SQL rules fire. The developer doesn’t have to remind the AI of constraints that apply to every file in a given layer. The harness enforces them.
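Here is a minimal Python sketch of glob-triggered loading. It rests on stated assumptions. The rule-to-glob table is hypothetical, with `.cshtml` standing in for “view files.” The small glob translator was written for this example and is not taken from any particular agent tool, because tools differ in how they declare and match globs. The sketch shows only the mechanism: the file path alone decides which rules enter the context.

```python
import re
from dataclasses import dataclass


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob to a regex: '**/' spans zero or more directories,
    '*' stays inside one path segment, '?' matches one non-slash character."""
    parts, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


@dataclass(frozen=True)
class Rule:
    name: str
    globs: tuple[str, ...]

    def applies_to(self, path: str) -> bool:
        return any(glob_to_regex(g).fullmatch(path) for g in self.globs)


# Hypothetical glob table; ".cshtml" stands in for "view files".
RULES = [
    Rule("json-pascalcase.md", ("**/*.js", "**/*.cshtml")),
    Rule("alpine-js-frontend.md", ("**/*.cshtml",)),
    Rule("transaction-integrity.md", ("**/*.cs",)),
    Rule("sql-use-guids-not-names.md", ("**/*.sql",)),
]


def rules_for(path: str) -> list[str]:
    """Names of the rules that load automatically when the agent opens `path`."""
    return [r.name for r in RULES if r.applies_to(path)]
```

Under these assumptions, `rules_for("Views/Batch/Index.cshtml")` returns the PascalCase and Alpine.js rules. A `.cs` service file pulls in only the transaction rule.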
In a financial consolidation platform I’ve been building, the rules directory has eleven files covering three categories:
Language and framework conventions. json-pascalcase.md fires on all JavaScript and view files: always serialize JSON in PascalCase, because the .NET backend uses DefaultContractResolver, which keeps C# property names exactly as declared. The constraint was discovered the hard way, when a frontend silently failed to bind data from camelCase properties. alpine-js-frontend.md fires on all view files: all interactive UI uses Alpine.js, never React or jQuery. Conventions like these exist because someone learned them in production. The rule encodes the lesson so the AI never has to learn it again.
Design pattern enforcement. transaction-integrity.md fires on all C# files: all writes for a business operation must complete within a single transaction. Never commit batch status, package records, and journal entries separately, because if any step fails, everything must roll back. Never use the caching mechanism for consolidation adjustments: it holds static in-memory state outside EF transaction boundaries, so a rollback can’t undo it. sql-use-guids-not-names.md fires on all SQL files: join on GUIDs, never on name strings, because names can repeat or be renamed while GUIDs stay stable. These aren’t style preferences. They’re architectural decisions made at the system level that must hold across every feature, every sprint, every AI session.
Quality gates. verification-gates.md requires evidence before any work is declared complete — not “it should work” but test output, query results, screenshots. ux-review-new-views.md requires all new views to pass a design system check before they ship.
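The transaction-integrity rule is easiest to see in miniature. This sketch uses Python’s built-in `sqlite3` in place of EF. The three tables, the second account number, and the balance check are hypothetical. Using the connection as a context manager wraps the block in one transaction. It commits if the block finishes and rolls back every write if anything raises, so a failed journal step also undoes the status change and the package rows.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """In-memory database with a hypothetical three-table layout and one open batch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE batch (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE package (id TEXT PRIMARY KEY, batch_id TEXT NOT NULL);
        CREATE TABLE journal_entry (
            id INTEGER PRIMARY KEY,
            batch_id TEXT NOT NULL,
            account INTEGER NOT NULL,
            amount_cents INTEGER NOT NULL
        );
    """)
    with conn:
        conn.execute("INSERT INTO batch VALUES ('B1', 'Open')")
    return conn


def post_batch(conn, batch_id, package_ids, entries):
    """Write status, packages, and journal entries as ONE unit of work.

    `with conn` commits if the block completes and rolls back every
    write made inside it if any statement or check raises.
    """
    with conn:
        conn.execute("UPDATE batch SET status = 'Posted' WHERE id = ?", (batch_id,))
        conn.executemany(
            "INSERT INTO package (id, batch_id) VALUES (?, ?)",
            [(p, batch_id) for p in package_ids],
        )
        conn.executemany(
            "INSERT INTO journal_entry (batch_id, account, amount_cents) VALUES (?, ?, ?)",
            [(batch_id, acct, amt) for acct, amt in entries],
        )
        # Fail in the last step on purpose: the earlier writes must be undone too.
        if sum(amt for _, amt in entries) != 0:
            raise ValueError("journal entries do not balance")


def snapshot(conn, batch_id):
    """(status, package count, journal entry count) for one batch."""
    status = conn.execute("SELECT status FROM batch WHERE id = ?", (batch_id,)).fetchone()[0]
    packages = conn.execute(
        "SELECT COUNT(*) FROM package WHERE batch_id = ?", (batch_id,)).fetchone()[0]
    entries = conn.execute(
        "SELECT COUNT(*) FROM journal_entry WHERE batch_id = ?", (batch_id,)).fetchone()[0]
    return status, packages, entries
```

An unbalanced call leaves batch `B1` as `('Open', 0, 0)`, and a balanced retry then posts everything at once. If each step committed in its own transaction, the same failure would instead leave a `Posted` batch with packages but no journal entries.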
The right rule set isn’t comprehensive — it’s targeted at the mistakes that actually happen. Every rule in that directory exists because AI-generated code violated it in a way that cost real debugging time. Start with three rules that prevent your most recurring AI-generated bugs. You’ll discover the next three within a few sessions.
Subagents: a team of domain experts
The most important conceptual shift in building an agentic development team is understanding that subagents are not just workers — they’re reviewers. You’re not building a pipeline of AI processes. You’re building a team of experts who each understand a specific dimension of your system deeply enough to evaluate work from that perspective.
The organizing principle is architectural boundaries. In the same financial consolidation platform, the agents directory contains five specialists:
subledger-specialist owns the consolidation accounting model — the 30-account chart with debit/credit normals, journal entry patterns, transaction grouping rules, 19 integrity invariants, and known data exceptions. It loads the full subledger domain memory before it touches any stored procedure. It can review any change that affects the subledger and tell you whether it respects the accounting model. It cannot touch the frontend.
grant-data-specialist owns granting operations — three distinct granting paths, package types, allocation flows, spend capacity formulas, and the specific data patterns that look like bugs but aren’t. It knows that WMN granting lives at base level with ActivityId = Guid.Empty and that the init SP deliberately excludes it from activity-level granting. Correcting that “bug” would break the model.
ux-reviewer owns the view layer — 25 specific checks covering panel structure, Alpine.js directive patterns, table formatting, and design system consistency. It can only read views and report issues. It cannot modify data models or business logic.
verification-agent trusts nothing. It runs a two-gate process — backend gate (build, tests, SQL spot-checks with actual query results) and frontend gate (browser automation, rendered screenshots) — before any work is declared complete. Every claim of “it works” requires evidence.
epplus-excel-specialist owns report generation — the library’s outline grouping patterns, helper methods, number formatting conventions. Narrow scope, deep expertise.
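The verification agent requires evidence for every claim of “it works.” That requirement can be sketched as a report that refuses to close until both gates have artifacts attached. This is a hypothetical sketch. The gate names follow the description above, but the evidence kinds and the report structure are assumptions. Collecting the artifacts (running the build, the tests, the SQL spot-checks, and the browser automation) is left to the agent’s tools.

```python
from dataclasses import dataclass, field

# Hypothetical evidence kinds each gate requires before work counts as done.
REQUIRED_EVIDENCE = {
    "backend": {"build", "tests", "sql_spot_check"},
    "frontend": {"browser_run", "screenshot"},
}


@dataclass
class Evidence:
    kind: str       # one of the kinds listed in REQUIRED_EVIDENCE
    artifact: str   # captured output, query result, or screenshot path


@dataclass
class VerificationReport:
    evidence: list[Evidence] = field(default_factory=list)

    def attach(self, kind: str, artifact: str) -> None:
        """Record evidence; an empty artifact is an assertion, not evidence."""
        if not artifact.strip():
            raise ValueError(f"{kind}: no artifact attached")
        self.evidence.append(Evidence(kind, artifact))

    def missing(self) -> dict[str, set[str]]:
        """Evidence kinds still outstanding, per gate (empty dict means both gates pass)."""
        have = {e.kind for e in self.evidence}
        gaps = {gate: need - have for gate, need in REQUIRED_EVIDENCE.items()}
        return {gate: gap for gate, gap in gaps.items() if gap}

    def declare_complete(self) -> str:
        gaps = self.missing()
        if gaps:
            raise RuntimeError(f"not complete, missing evidence: {gaps}")
        return "complete"
```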
Each agent loads only the context it needs. The subledger specialist reads the chart of accounts and accounting invariants on startup. The UX reviewer reads the design system patterns. Irrelevant context is noise that degrades output quality — a subledger specialist that also knows about Alpine.js directives is a worse subledger specialist.
The critical design principle: each subagent has defined authority limits. The UX reviewer can flag a view that violates the design system. It cannot modify the data model that feeds the view. The subledger specialist can review a stored procedure. It cannot touch the UI. Bounded authority prevents the class of failures where an agent “fixes” a problem in the domain it understands by breaking something in a domain it doesn’t.
This is the same principle behind good team design. You don’t ask your database architect to redesign the UI. You ask them to review the data layer and report what they find.
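Scoped context and bounded authority can both be declared in the agent definition. This is a minimal sketch under assumptions. The memory file names come from the platform’s memory directory described in the next section. The path patterns, the `AgentSpec` type, and the `authorize` guard are hypothetical. Reviewers get read-only authority over their own paths, so a request to review or change anything else is refused rather than attempted.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class AgentSpec:
    name: str
    startup_memory: tuple[str, ...]  # memory files loaded before any work, and nothing else
    scope: tuple[str, ...]           # path patterns the agent has authority over
    can_modify: bool = False         # reviewers report findings; they do not edit

    def in_scope(self, path: str) -> bool:
        # fnmatchcase is case-sensitive, and its "*" also matches "/",
        # so "Views/*" covers every nested view folder.
        return any(fnmatchcase(path, pattern) for pattern in self.scope)


# Hypothetical specs: memory file names from the platform, path patterns invented.
AGENTS = {
    spec.name: spec
    for spec in (
        AgentSpec("subledger-specialist", ("subledger-domain.md",), ("Sql/*", "*/Subledger/*")),
        AgentSpec("ux-reviewer", ("zinnia-ux-patterns.md",), ("Views/*",)),
    )
}


def authorize(agent_name: str, path: str, action: str = "review") -> AgentSpec:
    """Return the agent's spec if it may perform `action` on `path`; otherwise raise."""
    agent = AGENTS[agent_name]
    if not agent.in_scope(path):
        raise PermissionError(f"{agent_name} has no authority over {path}")
    if action == "modify" and not agent.can_modify:
        raise PermissionError(f"{agent_name} may report on {path} but not change it")
    return agent
```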
Memory: what survives across sessions
Memory is the knowledge layer that persists across sessions. Not everything belongs there. The question to ask about any piece of knowledge is: is this an invariant that will still be true next month? If yes, it belongs in memory. If it changes with the implementation, it belongs somewhere else.
The financial consolidation platform’s memory directory has five files:
subledger-domain.md is the canonical example of what belongs in memory. It contains the complete 30-account chart with debit/credit normals and granularity levels, the 19 integrity rules that must hold across every pipeline run, the transaction grouping logic, and the known data patterns that look like bugs but aren’t. Fix reservations on Account 6000 legitimately go negative when batches consume more than was reserved — that’s expected behavior, not a defect. WMN granting lives at base level with a sentinel ActivityId = Guid.Empty. The init SP deliberately excludes batch adjustments from activity-level granting to prevent double-counting. These aren’t derivable from the schema. They’re business rules that took months to encode and would take a new developer weeks to learn.
grant-data-domain.md captures the three distinct granting paths — direct allocation, proportional distribution, and the themed package path — and the rules governing how each flows through the data model.
project-conventions.md holds the architectural decisions: the unit of work pattern, the identity generator requirement, the EF transaction boundaries, the test organization conventions. The same constraints that exist in the rules layer, documented here as narrative so agents understand why the rules exist, not just that they do.
zinnia-ux-patterns.md holds the design system — component patterns, Alpine.js conventions, panel structure rules. The UX reviewer loads this on startup.
epplus-excel-patterns.md holds the report generation conventions specific to the library’s API — the patterns that aren’t in the documentation but are load-bearing for correct output.
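Memory pays off when checks consult it. The hypothetical sketch below is an integrity check that flags out-of-balance transaction groups and balances below their normal side. Before flagging, it looks up documented exceptions, so the Account 6000 fix-reservation pattern is reported as expected behavior instead of a defect. Only Account 6000 and its exception come from the text. The other accounts, every normal side (including 6000’s), the sign convention, and the single example invariant are assumptions; they are not the platform’s 19 rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalLine:
    group: str          # transaction group id
    account: int
    amount_cents: int   # assumed convention: debit positive, credit negative


# Hypothetical subset of the chart: account number -> normal side.
NORMAL_SIDE = {1000: "debit", 2000: "credit", 6000: "debit"}

# Documented patterns that look like bugs but aren't, as they might be lifted
# from subledger-domain.md (the record format is an assumption).
KNOWN_EXCEPTIONS = {
    6000: "fix reservations go negative when batches consume more than was reserved",
}


def check_subledger(lines):
    """Return (violations, expected_anomalies) for a list of JournalLine."""
    violations, expected = [], []

    # Example invariant: each transaction group nets to zero.
    net_by_group = defaultdict(int)
    for line in lines:
        net_by_group[line.group] += line.amount_cents
    for group, net in sorted(net_by_group.items()):
        if net != 0:
            violations.append(f"group {group} out of balance by {net}")

    # A balance below zero on its normal side is a defect unless memory says otherwise.
    raw = defaultdict(int)
    for line in lines:
        raw[line.account] += line.amount_cents
    for account, total in sorted(raw.items()):
        # KeyError here means an account missing from the chart, which should fail loudly.
        balance = total if NORMAL_SIDE[account] == "debit" else -total
        if balance < 0:
            if account in KNOWN_EXCEPTIONS:
                expected.append(f"account {account}: {KNOWN_EXCEPTIONS[account]}")
            else:
                violations.append(f"account {account} below its normal side: {balance}")
    return violations, expected
```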
What doesn’t belong in memory: anything derivable from the current codebase. If an agent can read the code and figure it out, don’t put it in memory — you’ll end up with memory entries that contradict the code when someone refactors without updating the memory files. The code is the source of truth for implementation. Memory is the source of truth for business rules, design intent, and the decisions that look wrong but have reasons.
The knowledge graph: where relationships live at scale
Flat memory files have a ceiling. For a small project with a few dozen domain concepts, well-organized markdown files work fine. For a complex system — a financial platform with hundreds of entities, a healthcare system with layered clinical rules, a logistics platform with intricate constraint relationships — flat files become unwieldy and miss the most important category of knowledge: how code assets relate to domain concepts.
This is where a graph database with a domain-specific ontology becomes the right answer.
The knowledge graph encodes what flat memory cannot: the relationships between code assets as they are actually implemented and the domain rules they’re supposed to enforce. Which classes implement the unit of work pattern? Which services own the granting logic? Which stored procedures touch the subledger? Which test files cover which feature areas? Flat memory says “always use the unit of work pattern.” The knowledge graph says “here are the 47 classes that implement it, here’s the one that doesn’t, and here’s the test coverage for each.”
The architecture has three layers working together:
The ontology defines the domain concepts and their relationships at the schema level. For a financial system: Entity, Aggregate, ValueObject, Repository, Service, Rule, Constraint — and the relationships between them (implements, enforces, depends-on, violates). This isn’t a general-purpose graph schema. It’s specific to your domain and your architectural patterns. Building the ontology is a design act, not a configuration act.
The vector index enables semantic retrieval. When an agent needs to find everything related to “identity generation for new domain objects,” it doesn’t need to know the exact class names. A vector search over the knowledge graph returns the relevant nodes based on semantic similarity. This is what makes the graph useful at scale — you’re not maintaining an exhaustive keyword index, you’re letting embeddings do the navigation.
RAG integration means agents query the knowledge graph as part of their context assembly. Before a domain agent starts working, it retrieves the relevant subgraph — the domain concepts in scope, the rules that apply, the code assets involved, the known constraint violations. It doesn’t load the entire graph. It retrieves the slice that’s relevant to the current task.
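The three layers can be sketched together with toy data. All of the following are hypothetical assumptions: a four-triple ontology, hand-made three-dimensional embeddings in place of an embedding model and vector index, and a query vector standing in for the embedding of “identity generation for new domain objects.” Retrieval ranks nodes by cosine similarity and keeps the top k. It then returns that slice plus its one-hop neighbors, not the whole graph.

```python
import math

# Ontology: allowed (subject type, relation, object type) triples.
ONTOLOGY = {
    ("Service", "enforces", "Rule"),
    ("Service", "violates", "Rule"),
    ("Aggregate", "depends-on", "Service"),
    ("Service", "depends-on", "Aggregate"),
}

# node -> (ontology type, toy embedding); real embeddings would come from a model.
NODES = {
    "IdentityGenerator": ("Service", (0.9, 0.1, 0.0)),
    "Rule:NewObjectsGetGeneratedIds": ("Rule", (0.85, 0.15, 0.05)),
    "PackageAggregate": ("Aggregate", (0.3, 0.7, 0.1)),
    "LegacyImportService": ("Service", (0.2, 0.3, 0.9)),
    "GrantAllocationService": ("Service", (0.1, 0.9, 0.2)),
}

EDGES = [
    ("IdentityGenerator", "enforces", "Rule:NewObjectsGetGeneratedIds"),
    ("PackageAggregate", "depends-on", "IdentityGenerator"),
    ("LegacyImportService", "violates", "Rule:NewObjectsGetGeneratedIds"),
    ("GrantAllocationService", "depends-on", "PackageAggregate"),
]


def validate(edges):
    """Reject any edge whose endpoint types the ontology does not allow."""
    for s, r, o in edges:
        if (NODES[s][0], r, NODES[o][0]) not in ONTOLOGY:
            raise ValueError(f"ontology forbids {s} -{r}-> {o}")


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))


def query_domain_context(query_vec, k=2):
    """Top-k semantically similar nodes plus their one-hop neighborhood."""
    ranked = sorted(NODES, key=lambda n: cosine(query_vec, NODES[n][1]), reverse=True)
    seeds = ranked[:k]
    edges = [e for e in EDGES if e[0] in seeds or e[2] in seeds]
    nodes = set(seeds) | {e[0] for e in edges} | {e[2] for e in edges}
    return {"seeds": seeds, "nodes": sorted(nodes), "edges": edges}
```

For the query vector `(1.0, 0.0, 0.0)`, the seeds are `IdentityGenerator` and the identity rule. The slice adds `PackageAggregate` and the violating `LegacyImportService`, and it leaves out the unrelated `GrantAllocationService`.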
For MCP (Model Context Protocol) integration: the knowledge graph runs as an MCP server that agents query via tools. A Neo4j instance with a custom MCP server exposing tools such as query_domain_context, find_implementations, and check_constraint_violations is a practical implementation. The agents treat the graph as a domain expert that knows where everything is and how it relates, which is exactly what it is.
MCP servers: the connective tissue
The harness components don’t run in isolation. MCP servers are how agents connect to the systems they need to work with. The right MCP configuration depends on what your agents are doing, but the pattern is consistent: every external system an agent needs should be accessible through an MCP server with a well-defined tool surface.
Source control and code review — an Azure DevOps or GitHub MCP server lets agents read the codebase, understand recent changes, review pull requests against architectural standards, and flag violations before they merge. The code review agent uses this to evaluate PRs against the rules layer without manual intervention.
The knowledge graph — as described above, a graph database MCP server is how agents query domain relationships at runtime. Neo4j’s native MCP support or a custom server over your graph API both work.
Browser automation — a Playwright MCP server gives frontend agents the ability to actually render and interact with views rather than just reading the code. A UI agent that can take a screenshot and verify that the rendered output matches the design system is dramatically more reliable than one that reasons about code without seeing the result.
Database access — a read-only database MCP tool lets domain agents run verification queries against real data rather than inferring state from code. The subledger agent that can run a balance validation query against the actual database catches classes of errors that code review alone misses.
Domain documentation — a documentation MCP server that exposes PRDs, architecture decision records, and feature specifications as queryable resources gives agents access to the intent behind the implementation, not just the implementation itself. Context7 for library documentation. A custom server for internal documents.
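A read-only verification tool can be sketched with SQLite. The sketch assumes that `PRAGMA query_only` stands in for a read-only database login, and its tables and balance query are hypothetical. The query groups on the entity GUID. Grouping on the name instead would merge the two entities called “Acme” into one row netting 250, which would hide which entity is out of balance.

```python
import sqlite3
import uuid


def build_demo_db():
    """Hypothetical tables in which two entities share the display name 'Acme'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entity (id TEXT PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE journal_entry (entity_id TEXT NOT NULL, amount_cents INTEGER NOT NULL);
    """)
    a, b = str(uuid.uuid4()), str(uuid.uuid4())
    with conn:
        conn.executemany("INSERT INTO entity VALUES (?, ?)", [(a, "Acme"), (b, "Acme")])
        conn.executemany(
            "INSERT INTO journal_entry VALUES (?, ?)", [(a, 100), (a, -100), (b, 250)]
        )
    return conn, a, b


def make_read_only(conn):
    # SQLite's query_only pragma makes every write on this connection fail;
    # a production setup would use a read-only database login instead.
    conn.execute("PRAGMA query_only = ON")
    return conn


# Balance validation: entities whose entries do not net to zero.
# Joined and grouped on the GUID, never the name, so the two 'Acme's stay apart.
UNBALANCED_ENTITIES = """
    SELECT e.id, e.name, SUM(j.amount_cents) AS net
    FROM journal_entry AS j
    JOIN entity AS e ON e.id = j.entity_id
    GROUP BY e.id, e.name
    HAVING SUM(j.amount_cents) <> 0
"""


def run_verification_query(conn, sql, params=()):
    """Run a verification query and return all rows."""
    return conn.execute(sql, params).fetchall()
```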
The principle is selective exposure: each agent has access to the MCP servers relevant to its domain and no more. The frontend agent doesn’t need database access. The workflow agent doesn’t need browser automation. Scope the connections to the scope of the agent.
Where to start
The most common mistake is trying to build everything at once. The harness compounds in value over time — each session makes it better. The goal at the start is a foundation that’s useful immediately and improvable systematically.
Start with rules. Identify the three technical constraints that AI-generated code violates most often in your codebase. Transaction boundaries. Naming conventions. Test organization. Write them as rules that fire on the relevant file patterns. You’ll see the improvement immediately, and you’ll discover the next three rules within a few sessions.
Build one domain agent. Pick the highest-risk area of your system — the part where an error is most expensive to fix — and build a specialized agent for it. Give it the domain knowledge it needs: the invariants, the relationship rules, the known exceptions. Use it to review every change in that area. Measure how many issues it catches.
Seed memory with what you know. Document the structural relationships and business rules that a new team member would need to work effectively in your domain. Not the code — the intent. The rules that govern entity relationships. The invariants that must hold at the domain level. The decisions that look wrong but have reasons.
Add the knowledge graph when scale demands it. For smaller projects, well-organized memory files are sufficient. When your domain model exceeds what you can hold in a few dozen markdown files, when you need to query implementation-to-rule relationships at scale, that’s when you build the graph.
The team you’re actually building
The harness isn’t a tool configuration. It’s the institutional knowledge of your engineering organization, encoded in a form that AI can use.
Rules are your technical standards — the things every engineer is supposed to know and apply consistently. Memory is your domain knowledge — the business rules and architectural decisions that take months to absorb. Subagents are your domain experts — the specialized reviewers who catch violations that general-purpose code review misses. The knowledge graph is your institutional memory at scale — the relationships between everything, queryable in real time.
The engineer who can design this environment — who understands what belongs in rules versus memory, how to draw the boundaries between agents, when the knowledge graph is the right answer — is the engineer who multiplies AI’s output rather than just using it.
That’s what you were assessing for in the interview. This is how you build it.