Enforcing Architecture in an Agent-Driven Codebase

At Phoebe, we use LLM-based coding tools (Claude Code, Codex, Cursor) every day; they’re fast, cheap, and pretty good at autonomously producing entire end‑to‑end features when pointed in the right direction with clear specs. Prompted well, they can even handle greenfield work that demands a bit of design thinking. Practically, this means the bottleneck of software engineering is shifting from actually producing code towards reviewing it; by “reviewing”, we don’t just mean a single, final PR review before merge, but the continuous sanity‑checking and course-correcting that happens while an agent and a human co‑author a change.1

When building rapid prototypes, where code quality and scalability matter less, the ability of LLM tools to churn out code at an incredible rate is an undeniable benefit. But once your codebase and product need to scale, letting output volume climb while review attention erodes quickly degrades quality.

Earlier in our company’s life, when speed‑of‑prototype mattered more than longevity, we deliberately optimized for iteration speed by keeping constraints loose and taking on intentional technical debt. But as we’ve grown our team, customer base, and traffic, and developed stronger needs for a scalable codebase and product, we’ve had to spend several cycles paying down the high volume of resulting tech debt. With AI tools, it’s easy to get short‑term speed while falling into a slow‑motion collapse as the codebase turns into a ball of mud. This mirrors the core difficulty of using LLMs in an application: they’re powerful, but they need carefully built systems in place to constrain them.

There are some obvious ways to add guardrails to LLM coding agents: linting (we use a lot of custom lint rules written for our codebase), and unit tests. But linting only enforces local correctness and style, and tests only enforce code behavior.

They do little to enforce architectural patterns that will allow a codebase to gracefully scale without becoming incomprehensible spaghetti code: decoupling, modularization, sane dependency flow, strict layering.

Finding the Right Solution

To address this gap, we experimented with building custom static analysis tooling that constructed the dependency graph of our codebase by walking imports and asserted rules on top of it. It sort of worked, but it was fragile, drifted as the actual codebase structure evolved, and never had a strong model of the transitive dependency graph.

We realized we were effectively trying to rebuild a worse version of what Bazel provides natively.

Bazel and Agents

The core principle of Bazel is that every build has a dependency graph that is explicit and strictly enforced. The typical driver for adopting a build system like Bazel is slow CI at a larger company: an enforced dependency graph enables incremental builds (i.e. only build and test what has changed), which keeps build times from growing quadratically over time.2 But given that our company and codebase are still relatively small, CI time was not an immediate pain point (although Bazel did, of course, make our builds faster).

Bazel gives you a build graph that is explicit and enforceable, and, crucially, a place to encode architectural rules that fail deterministically.

A build can only see what it explicitly declares as a dependency. Package visibility turns module boundaries into an API you have to opt into. Circular dependencies fail the build early. And Bazel’s build graph gives you the first-class primitives to express and enforce policies like “X must not depend on Y.”
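
To make this concrete, here’s a minimal sketch of what explicit dependencies and visibility look like in a BUILD file (the labels and attributes are illustrative, not lifted from our codebase):

load("@rules_python//python:defs.bzl", "py_library")

py_library(
    name = "shift_confirmations",
    srcs = ["shift_confirmations.py"],
    # Dependencies are explicit: anything this library uses must be declared
    # here, or it simply isn't available in the sandbox at build and test time.
    deps = ["//libraries/database_models"],
    # Visibility is the opt-in API: only packages matched here may depend on
    # this target; anything else fails at analysis time.
    visibility = ["//services:__subpackages__"],
)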

We also decided to run as much as possible through Bazel (local services, scripts, tests, image builds) so that there’s one pathway for humans and agents alike,3 reducing the “works on my machine” gaps and moving policy enforcement earlier in the loop.
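
In practice, that one pathway is just Bazel invocations (the exact target labels below are illustrative):

bazel run //services/voice          # run a service locally
bazel test //...                    # run every test in the repo
bazel build //services/voice:image  # build the service's container image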

To illustrate with a concrete example: we have a fleet of “voice workers”, which execute streaming LLM-powered voice calls. By design, this service should not perform business logic or talk to the database. Keeping that boundary clean makes the system scalable and simple to reason about: the voice fleet can scale independently with its CPU‑heavy workload and won’t be coupled to application or database load.4 We encode this as a single test in the service’s BUILD file:5

dependency_enforcement_test(
    name = "voice_does_not_depend_on_database",
    target = ":voice",
    forbidden = [
        "//libraries/database_engine",
        "//libraries/database_models",
    ],
)

If someone (human or LLM) introduces a transitive import chain that brings database code into the voice service, CI fails during analysis with a readable path they can act on:

Forbidden dependency detected!
  Target: //services/voice:voice
  Forbidden: //libraries/database_models:database_models
  Path: //services/voice:voice -> //libraries/shift_confirmations:shift_confirmations -> //libraries/database_models:database_models

Reviewers no longer have to dig through imports to see whether a boundary was crossed; if CI is green, the invariant held.

We use the same pattern for other constraints that matter to us. For example, we don’t want a worker service to pull in a web framework, which would bloat images, slow cold starts, and tend to invite the wrong abstractions. In a codebase with several services, and libraries of shared code, it’s easy to inadvertently bring in a heavy dependency via a single bad import in a shared library. With Bazel, this is just another invariant we can encode as a test.
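
As a sketch (with illustrative labels, assuming a hypothetical //services/worker target and a first-party //libraries/web wrapper around the web framework), it looks just like the voice example above:

dependency_enforcement_test(
    name = "worker_does_not_depend_on_web_framework",
    target = ":worker",
    forbidden = [
        "//libraries/web",
    ],
)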

Bazel’s built-in visibility tools also help enforce coarse boundaries. For example, for service entrypoints, we restrict visibility to their own package, to prevent cross-service imports.
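
A sketch of what that looks like for a service entrypoint (names are illustrative):

load("@rules_python//python:defs.bzl", "py_binary")

py_binary(
    name = "voice",
    srcs = ["main.py"],
    deps = [":voice_lib"],
    # Private visibility: only targets in this package may depend on the
    # entrypoint, so other services can't import it.
    visibility = ["//visibility:private"],
)

Private is already Bazel’s default visibility, but stating it explicitly documents the boundary in the BUILD file and makes any loosening show up as a diff.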

When these rules fail, the message includes the full path, which is crucial. Humans and agents can fix the actual problem: adjust imports, split a library, or add an explicit dependency where appropriate, instead of doing the naive thing and weakening a boundary just to get a green checkmark. It’s a small shift with a large effect on long‑term maintainability.

If agents try to “fix” the build by loosening a constraint (agents sometimes do this if not prompted well), this is now an obvious, explicit diff to the invariant that stands out in code review, as opposed to a subtle bad import that must be spotted by a shrewd reviewer.

All in all, migrating all our backend services to Bazel was carried out mostly by one developer in under two weeks, while we continued to ship product features in parallel. We made sure to keep the developer experience smooth:

  • ibazel for locally running services with live reloading.
  • Gazelle for auto-generation of BUILD files.
  • Shell script shims that forward arguments to Bazel, keeping common developer commands familiar while making runs reproducible.

Beyond enforcement, Bazel improved maintainability and developer speed:

  • Hand-rolled Dockerfiles are swapped for rules_oci images built from the same Bazel targets we run locally (see the sketch after this list). That alone eliminated a class of deployment failures (missing COPY statements, drifting dependency installs), because Bazel already knows exactly which sources and third‑party dependencies to include.
  • Bazel CI runs on BuildBuddy, giving us remote caching and warmed runners.6 Pre-Bazel, running all tests took several minutes; Bazel CI now takes anywhere from ~5 seconds to ~1 minute, depending on how much of the build graph is invalidated. The feedback loop ("write → run → fix") is tight and deterministic, which is great for humans and agents alike.
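
Here’s a rough sketch of the rules_oci side, assuming rules_pkg for layering and a hypothetical base-image repository (our actual setup differs in the details):

load("@rules_oci//oci:defs.bzl", "oci_image")
load("@rules_pkg//pkg:tar.bzl", "pkg_tar")

# Layer the same binary target we run locally and test in CI.
pkg_tar(
    name = "voice_layer",
    srcs = [":voice"],
    include_runfiles = True,
)

oci_image(
    name = "voice_image",
    base = "@python_base",    # hypothetical base image repository
    tars = [":voice_layer"],
    entrypoint = ["/voice"],
)

Because the image is assembled from the same target we run and test everywhere else, there’s no separate Dockerfile to drift out of sync.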

The main tradeoff is Bazel’s learning curve. For example, you can no longer just arbitrarily read files from the file system in Bazel builds; Bazel’s hermetic sandboxed builds require using data dependencies and runfiles. But in our experience, the small up-front cost of developer education is outweighed by the compounding benefits of enforced architecture and codebase scalability.
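
For example, a file read at runtime has to be declared as a data dependency so that Bazel stages it into the sandbox and the binary’s runfiles tree (a minimal sketch with illustrative names):

load("@rules_python//python:defs.bzl", "py_binary")

py_binary(
    name = "voice",
    srcs = ["main.py"],
    # Declared data files are staged into the sandbox and runfiles tree;
    # undeclared files simply don't exist from the build's point of view.
    data = ["voice_config.yaml"],
    # rules_python's runfiles library resolves the file's real location at runtime.
    deps = ["@rules_python//python/runfiles"],
)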

The point here isn’t that Bazel is clever or that we discovered a new trick. It’s that the combination of LLM‑assisted coding and a strict, explicit build graph changes the tradeoffs. You can let agents write more code, faster, without accepting erosion as the price of speed. Encode invariants in the build, surface violations deterministically, and you can elevate output without raising entropy.

Interested in working with us? We’re hiring!


Footnotes

  1. This blog post from Mitchell Hashimoto (HashiCorp founder) is a great, pragmatic walkthrough of LLM-assisted coding. Give it a read if you’re an AI skeptic or don’t have much experience with coding agents.

  2. Assuming the number of engineers grows linearly over time and the number of tests written per unit of time per engineer is constant, the total number of tests (all of which need to be run on every build) grows O(N²), since it’s the accumulation of a linearly growing rate. Total CI runtime (build time × number of builds per unit of time) then grows O(N³): build time scales with the O(N²) test count, and the O(N) engineers trigger O(N) builds per unit of time!

  3. Why Bazel and not something else? Nx/Turborepo are great for TypeScript monorepos, and could be used to enforce module boundaries, but we wanted one system that supports a multilingual codebase and hermetic container images built directly from the build graph. Pants and Buck2 are conceptually similar, but don’t have the same extensive ecosystem of external rules.

  4. In case you’re thinking “well just use separate repos”: we still want to be able to share some code between services, and have unified CI/CD with all services built/deployed atomically from the same commit.

  5. Here’s the basic implementation of this rule. Astute readers will notice that this is technically not much of a test: if it fails, it fails at analysis time; the test that Bazel actually executes is a no-op bash script that always passes. We implemented it this way for developer ergonomics:

    • Conceptually, from the point of view of the developer, this is a test.
    • It gets automatically picked up by bazel test //..., which is the main test workflow that CI, developers, and coding agents will run.
    • The “test” target can be neatly colocated with the target it’s acting on, as opposed to some centrally defined registry of dependency rules.
    load(":aspects.bzl", "DependencyClosureInfo", "dependency_closure_aspect")
    
    def _dependency_enforcement_test_impl(ctx):
        # The aspect attached to `target` records, for each transitive dependency,
        # a path from the target to that dependency.
        if DependencyClosureInfo in ctx.attr.target:
            info = ctx.attr.target[DependencyClosureInfo]
            for forbidden_dep in ctx.attr.forbidden:
                forbidden_label = forbidden_dep.label
                if forbidden_label in info.paths:
                    # Fail at analysis time with a readable dependency path.
                    path = info.paths[forbidden_label]
                    pretty_path = " -> ".join([str(x) for x in path])
                    fail(
                        "\n\nForbidden dependency detected!\n" +
                        "  Target: %s\n" % ctx.attr.target.label +
                        "  Forbidden: %s\n" % forbidden_label +
                        "  Path: %s\n" % pretty_path,
                    )
        # If no forbidden dependency was found, the "test" Bazel executes is a
        # no-op script that always passes.
        script = ctx.actions.declare_file(ctx.label.name + ".sh")
        ctx.actions.write(script, content = "#!/usr/bin/env bash\necho OK\n", is_executable = True)
        return [DefaultInfo(executable = script)]
    
    dependency_enforcement_test = rule(
        implementation = _dependency_enforcement_test_impl,
        attrs = {
            "target": attr.label(
                mandatory = True,
                aspects = [dependency_closure_aspect],
            ),
            "forbidden": attr.label_list(mandatory = True),
        },
        test = True,
    )
    
  6. Bazel can be painfully slow on normal ephemeral CI runners, since every run starts from a cold cache. The primary obstacle to adopting Bazel is usually the complexity of setting up CI infrastructure for warmed runners, remote caching, remote execution, etc. As first-time BuildBuddy users, we were pleasantly surprised that it provides all of this with almost no setup or configuration.