AI handles the code.
Who makes it production-ready?

Code generation is the easy 30%. Review, testing, security, deploy, and monitoring are where sprints disappear. This is the open playbook on closing that gap, distilled from how Stripe, Uber, Ramp, Coinbase, Meta, and Anthropic actually ship AI in production. Not a vendor pitch. A map.

Delivery Run · live #847

INFRASTRUCTURE (spec-driven)
01 spec.yaml loaded → 12 constraints
02 sandbox provisioned 240ms
03 full-stack parity
ready → agent

AGENT CORE (harness-bound)
04 harness: 147 rules active
05 generating 3 files
06 sub-agent → tests
2 agents active

OBSERVABILITY (closed-loop)
07 CI pipeline passed
08 post-deploy verify
09 drift → new rule
2/3 checks

COMPOUND LAYER (delivery equity)
+2 rules this run
S1 · S3 · S6 · now
Sprint 5 was 23% faster than S1

01 The Insight · the 30/70 problem

AI covers 30% of delivery.
We cover 100.

30% Code gen
Review
Testing
Security
Deploy
Monitor
Code Review: 39 dev-years saved per year (Uber · uReview). 65K diffs reviewed weekly. 75% rated useful.
Testing: 1.7x more bugs from AI code (CodeRabbit · Stack Overflow). Speed without verification creates debt.
Security: 0% of tasks fully delegated (Anthropic · 2026 Report). 60% AI-assisted. The gap is the system.
Deployment: 1,000+ agent PRs per week (Stripe · Minions). 500 curated tools. Selective CI.
Full Lifecycle: 50%+ of merged PRs from agents (Ramp · Inspect). Organic adoption. Growing every sprint.

Your team adopted Copilot six months ago. PRs still take the same time. The gap isn't the model. It's the methodology. AI covers the easy 30%. The other 70% (review, testing, security, deployment, monitoring) is where delivery breaks down. Every quarter you wait is another board meeting where you can't show the lift. Stripe, Spotify, Uber, Meta, Coinbase, and Ramp all proved methodology beats raw tooling.

02 Environment over model

Stop debugging code. Start debugging the system that produces it.

One model. Two harnesses. 42% → 78%. When an agent fails, the question isn't "better prompt?" It's "what's missing from the environment?" Princeton's CORE-Bench held the model fixed and changed only the harness (42% → 78%); a separate Princeton/Stanford finding measured a 64% lift from environment design alone. Conversely, METR's RCT found experienced devs 19% slower with AI when the harness wasn't built for agents. The model is a commodity. The harness is the game.

Fork

Proven base exists close to your needs.

Stripe forked Block's Goose.

Highest control, highest maintenance.

Compose

Good open-source base, need to move fast.

Ramp composed on OpenCode.

Lower maintenance, some upstream coupling.

Build

Unique constraints: security, compliance, integration.

Coinbase built Forge custom.

Highest cost. Only when Fork/Compose can't meet requirements.

Agent Framework

Open SWE (LangChain) · Tier 1
LangGraph-based. Slack/Linear/GitHub integrations out of the box. Captures Stripe/Ramp/Coinbase convergent patterns. MIT licensed.

mini-SWE-agent · Tier 1
Princeton/Stanford. >74% SWE-bench. ~100 lines Python. Used by Meta, NVIDIA, IBM. Best when simplicity is paramount.

Custom Harness · Tier 1
Build only the harness layer on top of the base agent. For unique constraints: security, compliance, deep integration. Coinbase Forge is the reference.

Agent Observability

LangSmith Enterprise · Tier 1
Managed LLM tracing and evaluation. Coinbase adopted company-wide. Every tool call, retrieval, and decision is traced.

Langfuse · Tier 1
Self-hosted, open source. Best when client data sensitivity requires on-prem. Full control, no data leaves the environment.

Grafana Stack · Tier 2
LogQL/PromQL/TraceQL for agent observability. Best when client already has Grafana/Prometheus infrastructure.

AI Gateway

LiteLLM Proxy · Tier 1
PII stripping, per-client routing, audit logging, spend tracking. Non-negotiable for multi-client engagements.

Context Engineering

CLAUDE.md / context.md · Tier 1
Progressive disclosure: ~100-line map pointing to docs/ directory. Per-engagement context.md with arch, domain vocab, forbidden patterns.

Browser Verification

Playwright MCP · Tier 1
Mandatory. Anthropic found agents consistently mark features 'complete' that don't work. Agents must verify end-to-end in real browsers.

Adoption

Slack Invocation Surface · Tier 2
Ramp pattern: Slack as agent invocation surface. Results visible in shared channels. Track 'humans prompting' as a metric.

03 Delivery Equity

Every sprint makes the next one faster.

Technical debt taxes every release. Delivery Equity does the opposite. Every sprint adds permanent intelligence: harness rules, spec patterns, optimization data scored against real production outcomes. By sprint 6, your team is shipping features the harness already knows how to test, secure, and deploy. The system gets sharper every sprint, not longer.

Day 1
0
rules

Empty. Same mistakes repeated.

Day 30

Patterns absorbed. New engineers productive immediately.

Day 90

Cross-team. Dramatically better output.

Continuous
self-evolving

Prompts and architectures evolve from data.

Everyone has the same models. Nobody has your Delivery Equity.

04 The System

The architecture behind agents that ship.

Three pillars. One compounding system. This is the architecture under the harness, and the reason the methodology works on any stack, any model, any team size.

sandbox:
  snapshot: every 30m
  warm_pool: true
  startup: <2s
  parallel_runs: 10
  state: persistent
  env: full-stack parity
  isolation: per-session
Pillar 1: Infrastructure

Isolated. Parallel. Always Warm.

Ephemeral sandboxes with full-stack parity. Run 10+ versions in parallel. Decoupled from the laptop, decoupled from the bottleneck.

client     status     agent
slack      connected  main
browser    connected  main
web IDE    connected  main
research   scanning   sub-agent
review     queued     sub-agent
deploy     ready      main
Pillar 2: Agent Core

Server-First. Multi-Client. Self-Aware.

A server, not a plugin. Reachable from Slack, browser, or IDE. Agents spawn sub-agents. The system reads its own source to prevent hallucination.

  • CI/CD pipeline: passed
  • Integration tests: 18 passed
  • Visual verification: DOM match
  • Telemetry check: nominal
  • Error tracking: 0 new
  • Feature flags: synced
  • Environment parity: verified
Pillar 3: Observability + Validation

Verify After Deploy, Not Just Before.

Telemetry, error tracking, visual verification. A PR is a hypothesis. A passing test in production parity is the proof.

The compound layer

Errors Are the Fuel

Every failure is a training signal, not a ticket. The playbook inspects what broke, scores what worked, and prunes what didn't. It gets sharper every sprint, not longer.

failures inspected: 31
strategies scored: 24
low-signal pruned: 9
merge rate lift: +37%
pruned: Retrieval strategy #4 added 380 tok, zero accuracy gain. Removed.
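
The pruning step above can be sketched in a few lines of Python. This is a minimal illustration, not the playbook's actual scoring code: the `StrategyScore` fields, the `prune_low_signal` helper, and the sample numbers are all hypothetical, chosen only to mirror the "380 tok, zero accuracy gain" entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyScore:
    """Measured effect of one context strategy (all fields are hypothetical)."""
    name: str
    added_tokens: int      # extra prompt tokens the strategy costs per run
    accuracy_gain: float   # change in task success rate versus the baseline


def prune_low_signal(scores, min_gain=0.0):
    """Split strategies into (kept, pruned).

    A strategy is pruned when it costs tokens but its measured gain
    does not exceed min_gain.
    """
    kept, pruned = [], []
    for s in scores:
        if s.added_tokens > 0 and s.accuracy_gain <= min_gain:
            pruned.append(s)
        else:
            kept.append(s)
    return kept, pruned


# Illustrative numbers only; they are not measurements from any real run.
scores = [
    StrategyScore("retrieval-3", added_tokens=210, accuracy_gain=0.04),
    StrategyScore("retrieval-4", added_tokens=380, accuracy_gain=0.0),
]
kept, pruned = prune_low_signal(scores)
print([s.name for s in pruned])
```

The threshold is the design choice that matters: raising `min_gain` prunes more aggressively, which keeps the playbook short at the risk of dropping strategies whose benefit is small but real.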

05 The 8 Harness Patterns

Non-negotiable, regardless of stack.

Eight patterns repeat across every production agent deployment. Stripe, Coinbase, Ramp, Anthropic, and OpenAI converged on them independently. This is the spine of any harness worth owning.


Pattern 01

Progressive Disclosure

Map first, depth on demand.

Don't give agents everything upfront. The root file is a ~100-line map; depth lives in nested docs. AGENTS.md is now the cross-tool standard (60k+ repos, Linux Foundation, read by Claude Code, Cursor, and Codex), and Agent Skills productize the same idea: composable SKILL.md files that load only when needed.

Sources: Anthropic Agent Skills, AGENTS.md standard, mini-SWE-agent

Used by: Stripe, Spotify, Anthropic

Pattern 02

Git Worktree Isolation

One agent, one worktree, always.

Parallel agents WILL conflict without filesystem isolation. Each gets its own branch, directory, and environment, validated before merge. Worktrees are the floor, not the ceiling: production climbs an isolation ladder from OS sandboxes to microVMs to bound the blast radius. Stripe, Ramp, and Cognition converged here independently.

Sources: Anthropic, Stripe Minions, Cognition Devin, Ramp

Used by: Stripe, Ramp, Coinbase
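
A minimal sketch of the one-agent-one-worktree rule, assuming a plain git checkout. The `worktree_command` helper, the `agent/<session>` branch naming, and the sibling `worktrees/` directory are illustrative choices, not how any of the companies above lay out their repos. The function only builds the `git worktree add` command; running it is left to the caller.

```python
import shlex
from pathlib import PurePosixPath


def worktree_command(repo_root, session_id, base_ref="main"):
    """Build the `git worktree add` command for one agent session.

    Nothing is executed here; the caller decides how and where to run it.
    """
    branch = f"agent/{session_id}"                      # hypothetical naming scheme
    path = PurePosixPath(repo_root).parent / "worktrees" / session_id
    return [
        "git", "-C", str(repo_root),
        "worktree", "add", "-b", branch, str(path), base_ref,
    ]


cmd = worktree_command("/srv/repos/app", "run-847")
print(shlex.join(cmd))
```

Each session gets its own branch and directory, so two agents never write to the same files. Stronger isolation (OS sandboxes, microVMs) would wrap around this, not replace it.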

Pattern 03

Context Pre-hydration

The agent should never search for context it needs.

Before an agent run starts, the orchestrator pulls ALL relevant context: Jira/Linear tickets, linked docs, code search results, Slack thread context, PR history. Stripe Minions: orchestrator scans the invocation thread for links and pre-fetches everything. Ramp: Linear tickets as structured context source.

Sources: Stripe Minions, Ramp Inspect

Used by: Stripe, Spotify, Meta
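
As a rough sketch of pre-hydration, the orchestrator can scan the invocation thread for links and fetch them before the agent starts. The `extract_links` and `prehydrate` helpers, the example URLs, and the injected `fetch` callable are all assumptions made for illustration; a real orchestrator would plug in authenticated clients for its ticketing and docs systems.

```python
import re

# Matches http(s) URLs, stopping at whitespace and at the <, >, | delimiters
# that some chat tools wrap around links.
URL_PATTERN = re.compile(r"https?://[^\s<>|]+")


def extract_links(messages):
    """Return unique URLs from thread messages, in first-seen order."""
    seen = []
    for text in messages:
        for url in URL_PATTERN.findall(text):
            url = url.rstrip(".,;:!?)")   # drop trailing sentence punctuation
            if url not in seen:
                seen.append(url)
    return seen


def prehydrate(messages, fetch):
    """Fetch every linked resource before the agent starts.

    `fetch` is any callable that maps a URL to its text content.
    """
    return {url: fetch(url) for url in extract_links(messages)}


thread = [
    "Checkout fails on retry, see https://example.com/tickets/123.",
    "Design doc: https://example.com/docs/retries (same ticket: https://example.com/tickets/123)",
]
context = prehydrate(thread, fetch=lambda url: f"<contents of {url}>")
print(list(context))
```

Passing `fetch` in as a parameter keeps the link discovery deterministic and testable while the actual retrieval stays swappable.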

Pattern 04

Blueprints

Deterministic + agentic nodes. Guardrails where it matters.

A state machine alternating between deterministic nodes (git clone, lint, format, test: unit-tested, reliable) and agentic subtask nodes (LLM-driven code generation, refactoring). Deterministic nodes provide guardrails; agentic nodes provide flexibility. Coinbase: separate data nodes from LLM nodes.

Sources: Stripe Minions, Coinbase Forge, LangChain

Used by: Coinbase, Ramp, Anthropic
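
One way to picture a blueprint is as an ordered list of nodes, each tagged deterministic or agentic, with a guardrail able to stop the run. The sketch below is an assumption-heavy toy: `Node`, `run_blueprint`, and the stand-in `fake_codegen` are hypothetical names, and the "LLM" node returns a fixed string so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Node:
    name: str
    kind: str                      # "deterministic" or "agentic"
    run: Callable[[dict], dict]    # takes the run state, returns the updated state


def run_blueprint(nodes, state):
    """Run nodes in order, recording which kind of node handled each step."""
    trace = []
    for node in nodes:
        state = node.run(state)
        trace.append((node.name, node.kind))
        if state.get("failed"):
            break   # a failed guardrail stops the run before later nodes execute
    return state, trace


def fake_codegen(state):
    # Stand-in for an LLM call; a real agentic node would invoke a model here.
    return {**state, "patch": "def add(a, b):\n    return a + b\n"}


def lint(state):
    # Deterministic guardrail: reject patches that contain tab characters.
    return {**state, "failed": "\t" in state["patch"]}


blueprint = [
    Node("checkout", "deterministic", lambda s: {**s, "workdir": "/tmp/run"}),
    Node("generate", "agentic", fake_codegen),
    Node("lint", "deterministic", lint),
]
final_state, trace = run_blueprint(blueprint, {})
print(trace, final_state["failed"])
```

Because the deterministic nodes are ordinary functions, they can be unit-tested on their own, which is where the reliability in this pattern comes from.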

Pattern 05

Spec First

If agents can't read it, it doesn't exist.

Agents are blind to Slack, Docs, and knowledge in people's heads. Specs and constraints belong in the repo as machine-readable files. Spec-driven tools (Spec Kit, Kiro) go further: spec as source of truth, code as build artifact. But write specs where they earn their keep; don't recreate waterfall in markdown. DORA 2025 ranks AI-accessible internal data a top-7 capability.

Sources: DORA 2025, GitHub Spec Kit, AWS Kiro

Used by: Stripe, Uber, GitHub

Pattern 06

Mechanical Architecture Enforcement

Linters replace human review at scale.

Custom linters + structural tests + CI replace human review. Enforce invariants (dependency directions, boundaries, data validation), not implementations. Linter errors include remediation instructions formatted for agent context injection. At agent throughput, corrections are cheap and waiting is expensive.

Sources: Stripe Minions, Open SWE, Netflix Paved Roads

Used by: Uber, Meta, Stripe
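
To make "enforce invariants, with remediation in the error" concrete, here is a small sketch of a dependency-direction check built on Python's standard `ast` module. The three-layer `ALLOWED` map and the wording of the remediation message are hypothetical; a real codebase would define its own layers and would likely run this as a CI step or editor hook.

```python
import ast

# Hypothetical layering rule: a layer may import only from the layers listed for it.
ALLOWED = {
    "domain": set(),
    "service": {"domain"},
    "api": {"service", "domain"},
}


def check_imports(layer, source):
    """Return one remediation message per import that violates ALLOWED."""
    allowed = ", ".join(sorted(ALLOWED[layer])) or "no other layer"
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        else:
            continue
        for target in targets:
            top = target.split(".")[0]
            # Only imports of other known layers are checked; stdlib and
            # third-party imports pass through untouched.
            if top in ALLOWED and top != layer and top not in ALLOWED[layer]:
                problems.append(
                    f"line {node.lineno}: '{layer}' must not import '{top}'. "
                    f"Allowed imports for '{layer}': {allowed}. "
                    f"Fix: move the shared code into an allowed layer or invert the dependency."
                )
    return problems


source = "from api.handlers import render\nimport json\n"
for message in check_imports("domain", source):
    print(message)
```

The message names the line, the rule, and a fix, so it can be fed straight back into the agent's context instead of waiting for a human reviewer to explain the boundary.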

Pattern 07

Integrated Feedback Loops

Quality bounded by feedback quality.

Close the loop as tightly as possible. Linter fires at edit time, not after CI. Playwright MCP for real browser verification: Anthropic found agents mark features 'complete' that don't work in the browser. Full observability exposed TO agents, not just humans. The tighter the loop, the higher the autonomous PR acceptance rate.

Sources: Anthropic, Ramp Inspect, Open SWE

Used by: Spotify, Ramp, Meta, Coinbase

Pattern 08

Agent Governance

Control who does what, with what permissions.

Control which agents spawn others, with what permissions and audit trail. Spawning limits prevent agent explosions. Scoped credentials stop a code-writing agent from deploying to prod, and every commit traces to its session log. But governance is only half the job: the lethal trifecta (private data, untrusted input, an exfiltration path) turns prompt injection into data loss. Defend with architectural boundaries, not just audit trails.

Sources: AWS Frontier Agents, GitHub Actions 2026, OpenAI Agent Monitoring, Snyk, Google Threat Intelligence

Used by: AWS, GitHub, OpenAI, Snyk
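
The governance half (spawning limits, scoped credentials, audit trail) can be sketched as a small session object. Everything here is illustrative: `AgentSession`, the scope strings such as `deploy:prod`, and the default limit of two children are assumptions, and the sketch does not address the lethal trifecta, which needs architectural boundaries outside the agent process.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSession:
    session_id: str
    scopes: frozenset              # actions this agent may perform
    max_children: int = 2          # spawning limit for this session
    children: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def authorize(self, action):
        """Record the attempt, then allow it only if the scope was granted."""
        allowed = action in self.scopes
        self.audit_log.append(
            (self.session_id, action, "allowed" if allowed else "denied")
        )
        return allowed

    def spawn(self, child_id, scopes):
        """Create a sub-agent whose scopes are a subset of the parent's."""
        if len(self.children) >= self.max_children:
            raise RuntimeError(f"{self.session_id}: spawn limit reached")
        # Intersecting with the parent's scopes means a child can never
        # hold a permission its parent lacks.
        child = AgentSession(child_id, frozenset(scopes) & self.scopes, max_children=0)
        self.children.append(child)
        self.audit_log.append((self.session_id, f"spawn:{child_id}", "allowed"))
        return child


coder = AgentSession("coder-1", frozenset({"repo:write", "ci:run"}))
tester = coder.spawn("tester-1", {"ci:run", "deploy:prod"})
print(tester.authorize("ci:run"), tester.authorize("deploy:prod"))
```

The sub-agent asked for `deploy:prod` but its parent never had it, so the request is silently narrowed and the later attempt is denied and logged.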
Emerging

Three more are moving fast from frontier labs to standard practice. Watch these. They're next.

Subagent Orchestration

One agent plans. A fleet executes.

Used by: Anthropic, Cognition, Stripe

Context Engineering

Attention is a budget. Spend it well.

Used by: Anthropic, Ramp, Spotify

Evals & Verification

Verification is the new bottleneck.

Used by: Stripe, Meta, Anthropic

06 The evidence base

Who proved it.
And what they found.

These aren't projections. Real companies, real numbers, real production systems. The pattern is consistent: methodology beats raw tooling every time.

Stripe 1,000+ agent PRs per week

No human-written code. 500 curated tools. Selective CI.

Ramp 50%+ merged PRs from agents

Organic adoption. Nobody forced it. Growing every sprint.

Uber 39 dev-years saved per year

65K diffs reviewed per week. 75% rated useful.

Anthropic 0% of tasks fully delegated

60% AI-assisted. 0% autonomous. The gap is the system.

CodeRabbit 1.7x more bugs from AI code

Speed without verification creates debt, not value.

07 The convergence map

Who's doing what. And where.

Six companies. Six SDLC phases. One clear convergence pattern. Green means production-deployed. Yellow means active or partial. The gaps are where SPEQD installs next.

Phases: Code Generation · Code Review · Testing · Security · Deploy & CI/CD · Incident Response

Stripe: Minions · Agent review · Selective CI · Compliance · Auto-merge
Spotify: Honk · LLM Judge · Feedback Loops
Uber: uSpec · uReview
Meta: RADAR · Diff Risk · JiT Tests · Mutation
Coinbase: Forge · Agent review · Compliance · Risk-merge
Ramp: Inspect · Closed-loop · Self-verify

Legend: Production · Active / Partial · Not disclosed

Data compiled from 79 engineering blogs, case studies, and production reports. Last updated March 2026.

Walk away with a plan. Whether you use us or not.

A working session: we map your stack, find the highest-leverage gap in your 70%, and hand you an implementation plan you can run tomorrow.