AI handles the code.
Who makes it production-ready?

Code generation is the easy 30%. Review, testing, security, deploy, and monitoring are where sprints disappear. This is the open playbook on closing that gap, distilled from how Stripe, Uber, Ramp, Coinbase, Meta, and Anthropic actually ship AI in production. Not a vendor pitch. A map.

Book a 30-min intro

Delivery Run

live #847

INFRASTRUCTURE

spec-driven

01  spec.yaml loaded → 12 constraints

02  sandbox provisioned 240ms

03  full-stack parity ✓

ready → agent

AGENT CORE

harness-bound

04  harness: 147 rules active

05  generating 3 files ⟳

06  sub-agent → tests ⟳

2 agents active

OBSERVABILITY

closed-loop

07  CI pipeline passed

08  post-deploy verify ✓

09  drift → new rule

2/3 checks

COMPOUND LAYER · delivery equity

+2 rules this run

now

0 rules · 0 teams · Sprint 5 was 23% faster than Sprint 1

01 The Insight · the 30/70 problem

AI covers 30% of delivery.
We cover 100.

30% Code gen

Review

Testing

Security

Deploy

Monitor

Code Review: 39 dev-years saved per year (Uber · uReview). 65K diffs reviewed weekly. 75% rated useful.

Testing: 1.7x more bugs from AI code (CodeRabbit · Stack Overflow). Speed without verification creates debt.

Security: 0% of tasks fully delegated (Anthropic · 2026 Report). 60% AI-assisted. The gap is the system.

Deployment: 1,000+ agent PRs per week (Stripe · Minions). 500 curated tools. Selective CI.

Full Lifecycle: 50%+ of merged PRs from agents (Ramp · Inspect). Organic adoption. Growing every sprint.

Your team adopted Copilot six months ago. PRs still take the same time. The gap isn't the model. It's the methodology. AI covers the easy 30%. The other 70% (review, testing, security, deployment, monitoring) is where delivery breaks down. Every quarter you wait is another board meeting where you can't show the lift. Stripe, Spotify, Uber, Meta, Coinbase, and Ramp have all shown that methodology beats raw tooling.

02 Environment over model

Stop debugging code. Start debugging the system that produces it.

One model. Two harnesses. 42% → 78%. When an agent fails, the question isn't "better prompt?" It's "what's missing from the environment?" Princeton's CORE-Bench held the model fixed and changed only the harness (42% → 78%); a separate Princeton/Stanford finding measured a 64% lift from environment design alone. Conversely, METR's RCT found experienced devs 19% slower with AI when the harness wasn't built for agents. The model is a commodity. The harness is the game.

Fork

Proven base exists close to your needs.

Stripe forked Block's Goose.

Highest control, highest maintenance.

Compose

Good open-source base, need to move fast.

Ramp composed on OpenCode.

Lower maintenance, some upstream coupling.

Build

Unique constraints: security, compliance, integration.

Coinbase built Forge custom.

Highest cost. Only when Fork/Compose can't meet requirements.

Agent Framework

Open SWE (LangChain): LangGraph-based. Slack/Linear/GitHub integrations out of the box. Captures Stripe/Ramp/Coinbase convergent patterns. MIT licensed.

Tier 1 ✓

mini-SWE-agent: Princeton/Stanford. >74% SWE-bench. ~100 lines of Python. Used by Meta, NVIDIA, IBM. Best when simplicity is paramount.

Tier 1 ✓

Custom Harness: Build only the harness layer on top of the base agent. For unique constraints: security, compliance, deep integration. Coinbase Forge is the reference.

Tier 1

Agent Observability

LangSmith Enterprise: Managed LLM tracing and evaluation. Coinbase adopted company-wide. Every tool call, retrieval, and decision is traced.

Tier 1 ✓

Langfuse: Self-hosted, open source. Best when client data sensitivity requires on-prem. Full control, no data leaves the environment.

Tier 1 ✓

Grafana Stack: LogQL/PromQL/TraceQL for agent observability. Best when the client already has Grafana/Prometheus infrastructure.

Tier 2 ✓

AI Gateway

LiteLLM Proxy: PII stripping, per-client routing, audit logging, spend tracking. Non-negotiable for multi-client engagements.

Tier 1 ✓

Context Engineering

CLAUDE.md / context.md: Progressive disclosure via a ~100-line map pointing to the docs/ directory. Per-engagement context.md with architecture, domain vocabulary, and forbidden patterns.

Tier 1 ✓

Browser Verification

Playwright MCP: Mandatory. Anthropic found agents consistently mark features 'complete' that don't work. Agents must verify end-to-end in real browsers.

Tier 1 ✓

Adoption

Slack Invocation Surface: The Ramp pattern, with Slack as the agent invocation surface. Results visible in shared channels. Track 'humans prompting' as a metric.

Tier 2 ✓

03 Delivery Equity

Every sprint makes the next one faster.

Technical debt taxes every release. Delivery Equity does the opposite. Every sprint adds permanent intelligence: harness rules, spec patterns, optimization data scored against real production outcomes. By sprint 6, your team is shipping features the harness already knows how to test, secure, and deploy. The system gets sharper every sprint, not longer.

Day 1

0

rules

Empty. Same mistakes repeated.

Day 30

0

rules

Patterns absorbed. New engineers productive immediately.

Day 90

0

rules

Cross-team. Dramatically better output.

Continuous

∞

self-evolving

Prompts and architectures evolve from data.

Everyone has the same models. Nobody has your Delivery Equity.

04 The System

The architecture behind agents that ship.

Three pillars. One compounding system. This is the architecture under the harness, and the reason the methodology works on any stack, any model, any team size.

sandbox:
  snapshot: every 30m
  warm_pool: true
  startup: <2s
  parallel_runs: 10
  state: persistent
  env: full-stack parity
  isolation: per-session

Pillar 1: Infrastructure

Isolated. Parallel. Always Warm.

Ephemeral sandboxes with full-stack parity. Run 10+ versions in parallel. Decoupled from the laptop, decoupled from the bottleneck.

client	status	agent
slack	connected	main
browser	connected	main
web IDE	connected	main
research	scanning	sub-agent
review	queued	sub-agent
deploy	ready	main

Pillar 2: Agent Core

Server-First. Multi-Client. Self-Aware.

A server, not a plugin. Reachable from Slack, browser, or IDE. Agents spawn sub-agents. The system reads its own source to prevent hallucination.

CI/CD pipeline passed
Integration tests 18 passed
Visual verification DOM match
Telemetry check nominal
Error tracking 0 new
Feature flags synced
Environment parity verified

Pillar 3: Observability + Validation

Verify After Deploy, Not Just Before.

Telemetry, error tracking, visual verification. A PR is a hypothesis. A passing test in production parity is the proof.

The compound layer

Errors Are the Fuel

Every failure is a training signal, not a ticket. The playbook inspects what broke, scores what worked, and prunes what didn't. It gets sharper every sprint, not longer.

failures inspected: 31

strategies scored: 24

low-signal pruned: 9

merge rate lift: +37%

Pruned: retrieval strategy #4 added 380 tokens, zero accuracy gain. Removed.
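
The pruning step above can be pictured as a small scoring pass. This is a minimal sketch, assuming each strategy is logged with its token cost and a measured accuracy gain. The `Strategy` record, its field names, the `min_gain` threshold, and the sample numbers for `retrieval-3` are all hypothetical; only the "380 tokens, zero gain" case mirrors the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    added_tokens: int      # extra context tokens the strategy costs per run
    accuracy_gain: float   # measured lift against production outcomes


def prune_low_signal(strategies, min_gain=0.0):
    """Split strategies into (kept, pruned).

    A strategy is pruned when it costs tokens but its measured gain
    does not exceed min_gain.
    """
    kept, pruned = [], []
    for s in strategies:
        if s.added_tokens > 0 and s.accuracy_gain <= min_gain:
            pruned.append(s)
        else:
            kept.append(s)
    return kept, pruned


# Illustrative data only; not taken from a real run.
strategies = [
    Strategy("retrieval-3", added_tokens=120, accuracy_gain=0.04),
    Strategy("retrieval-4", added_tokens=380, accuracy_gain=0.0),
]
kept, pruned = prune_low_signal(strategies)
print([s.name for s in pruned])  # ['retrieval-4']
```

Raising `min_gain` makes the pass stricter; where to set it depends on how noisy the accuracy measurement is.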

05 The 8 Harness Patterns

Non-negotiable, regardless of stack.

Eight patterns repeat across every production agent deployment. Stripe, Coinbase, Ramp, Anthropic, and OpenAI converged on them independently. This is the spine of any harness worth owning.

Pattern 01

Progressive Disclosure

Map first, depth on demand.

Don't give agents everything upfront. The root file is a ~100-line map; depth lives in nested docs. AGENTS.md is now the cross-tool standard (60k+ repos, Linux Foundation, read by Claude Code, Cursor, and Codex), and Agent Skills productize the same idea: composable SKILL.md files that load only when needed.

Sources: Anthropic Agent Skills, AGENTS.md standard, mini-SWE-agent

Used by: Stripe, Spotify, Anthropic
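
One way to picture "map first, depth on demand" is a loader that always includes the root map and pulls nested docs only for the topics a task touches. This is a rough sketch under stated assumptions: the `topic: path` map format, the function names, and the in-memory `DOCS` store are hypothetical and not part of the AGENTS.md standard or Agent Skills.

```python
def parse_map(map_text):
    """Parse 'topic: path' lines from the root map into a dict."""
    index = {}
    for line in map_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        topic, path = line.split(":", 1)
        index[topic.strip()] = path.strip()
    return index


def build_context(map_text, read_doc, topics):
    """Always include the map; load nested docs only for requested topics."""
    index = parse_map(map_text)
    parts = [map_text]
    for topic in topics:
        if topic in index:
            parts.append(read_doc(index[topic]))
    return "\n\n".join(parts)


# Hypothetical repo contents, held in memory for the example.
ROOT_MAP = "# Map\nauth: docs/auth.md\nbilling: docs/billing.md"
DOCS = {
    "docs/auth.md": "Auth uses short-lived tokens.",
    "docs/billing.md": "Billing runs nightly.",
}

# A task about auth loads the map plus docs/auth.md, and nothing else.
context = build_context(ROOT_MAP, DOCS.__getitem__, ["auth"])
print(context)
```

In a real repo, `read_doc` would read from disk; passing it in keeps the loader testable without a filesystem.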

Pattern 02

Git Worktree Isolation

One agent, one worktree, always.

Parallel agents WILL conflict without filesystem isolation. Each gets its own branch, directory, and environment, validated before merge. Worktrees are the floor, not the ceiling: production climbs an isolation ladder from OS sandboxes to microVMs to bound the blast radius. Stripe, Ramp, and Cognition converged here independently.

Sources: Anthropic, Stripe Minions, Cognition Devin, Ramp

Used by: Stripe, Ramp, Coinbase
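
The "one agent, one worktree" rule can be sketched as a helper that derives a dedicated branch and directory per agent and builds the matching `git worktree` commands. The `agent/<id>` branch naming, the sibling `worktrees/` directory, and the function name are assumptions for illustration. The sketch only constructs the commands; it does not run git.

```python
import shlex
from pathlib import PurePosixPath


def worktree_commands(repo_root, agent_id, base="main"):
    """Return the git commands that give one agent its own branch and directory."""
    branch = f"agent/{agent_id}"  # hypothetical naming scheme
    path = PurePosixPath(repo_root).parent / "worktrees" / agent_id
    add = ["git", "-C", str(repo_root), "worktree", "add",
           "-b", branch, str(path), base]
    remove = ["git", "-C", str(repo_root), "worktree", "remove", str(path)]
    return {"branch": branch, "path": str(path), "add": add, "remove": remove}


plan = worktree_commands("/srv/repo", "a1")
print(shlex.join(plan["add"]))
# git -C /srv/repo worktree add -b agent/a1 /srv/worktrees/a1 main
```

Each command list can be passed to `subprocess.run(..., check=True)` by an orchestrator. Because branch and path are derived from the agent id, two agents never share a working directory.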

Pattern 03

Context Pre-hydration

The agent should never search for context it needs.

Before an agent run starts, the orchestrator pulls ALL relevant context: Jira/Linear tickets, linked docs, code search results, Slack thread context, PR history. In Stripe Minions, the orchestrator scans the invocation thread for links and pre-fetches everything. Ramp uses Linear tickets as a structured context source.

Sources: Stripe Minions, Ramp Inspect

Used by: Stripe, Spotify, Meta
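
The link-scanning step described above might look like the following. This is a minimal sketch, not Stripe's or Ramp's implementation: the URL pattern, the function names, and the injected `fetch` callable are hypothetical, and the example URLs are placeholders.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s<>|)]+")


def extract_links(thread_messages):
    """Collect unique URLs from a thread, preserving first-seen order."""
    seen = {}
    for message in thread_messages:
        for url in URL_PATTERN.findall(message):
            # Drop sentence punctuation that trails the link.
            seen.setdefault(url.rstrip(".,;"), None)
    return list(seen)


def prehydrate(thread_messages, fetch):
    """Fetch every linked resource up front and return a url -> content map."""
    context = {}
    for url in extract_links(thread_messages):
        try:
            context[url] = fetch(url)
        except Exception as exc:  # record the gap instead of failing the run
            context[url] = f"[unavailable: {exc}]"
    return context


def fake_fetch(url):
    """Stand-in for a ticket, doc, or code-search client."""
    return f"contents of {url}"


thread = [
    "Bug in checkout, see https://tracker.example/T-12.",
    "Design doc: https://docs.example/checkout and again https://tracker.example/T-12",
]
context = prehydrate(thread, fake_fetch)
print(list(context))
```

The agent then starts with `context` already in its prompt, so it spends no tool calls rediscovering what the thread linked to.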

Pattern 04

Blueprints

Deterministic + agentic nodes. Guardrails where it matters.

A state machine alternating between deterministic nodes (git clone, lint, format, test: unit-tested, reliable) and agentic subtask nodes (LLM-driven code generation, refactoring). Deterministic nodes provide guardrails; agentic nodes provide flexibility. Coinbase: separate data nodes from LLM nodes.

Sources: Stripe Minions, Coinbase Forge, LangChain

Used by: Coinbase, Ramp, Anthropic
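
A blueprint of this shape can be sketched as an ordered list of nodes where a failing deterministic check sends control back to the agentic node before it. This is an illustration under stated assumptions, not the Stripe or Coinbase design: the `Node` type, the `ok` flag convention, and the retry policy are hypothetical, and `generate` is a stand-in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    kind: str                      # "deterministic" or "agentic"
    run: Callable[[dict], dict]    # takes state, returns updated state


def run_blueprint(nodes, state, max_retries=2):
    """Run nodes in order; when a deterministic check fails, retry the
    preceding agentic node up to max_retries times before giving up."""
    i, retries = 0, 0
    while i < len(nodes):
        node = nodes[i]
        state = node.run(state)
        state.setdefault("trace", []).append(node.name)
        failed = node.kind == "deterministic" and state.get("ok") is False
        if failed:
            if i > 0 and nodes[i - 1].kind == "agentic" and retries < max_retries:
                retries += 1
                i -= 1
                continue
            raise RuntimeError(f"{node.name} failed after {retries} retries")
        i += 1
    return state


def generate(state):
    """Stand-in for an LLM call: the first attempt is broken on purpose."""
    attempt = state.get("attempt", 0) + 1
    code = "x = 1" if attempt > 1 else "x = "
    return {**state, "attempt": attempt, "code": code}


def lint(state):
    """Deterministic guardrail: does the generated code compile?"""
    try:
        compile(state["code"], "<generated>", "exec")
        return {**state, "ok": True}
    except SyntaxError:
        return {**state, "ok": False}


result = run_blueprint(
    [Node("generate", "agentic", generate), Node("lint", "deterministic", lint)],
    {},
)
print(result["trace"])  # ['generate', 'lint', 'generate', 'lint']
```

The deterministic node is ordinary code that can be unit-tested on its own, which is what lets it act as a guardrail around the less predictable agentic step.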

Pattern 05

Spec First

If agents can't read it, it doesn't exist.

Agents are blind to Slack, Docs, and knowledge in people's heads. Specs and constraints belong in the repo as machine-readable files. Spec-driven tools (Spec Kit, Kiro) go further: spec as source of truth, code as build artifact. But write specs where they earn their keep; don't rebuild waterfall in markdown. DORA 2025 lists AI-accessible internal data among its seven key AI capabilities.

Sources: DORA 2025, GitHub Spec Kit, AWS Kiro

Used by: Stripe, Uber, GitHub

Pattern 06

Mechanical Architecture Enforcement

Linters replace human review at scale.

Custom linters + structural tests + CI replace human review. Enforce invariants (dependency directions, boundaries, data validation), not implementations. Linter errors include remediation instructions formatted for agent context injection. At agent throughput, corrections are cheap and waiting is expensive.

Sources: Stripe Minions, Open SWE, Netflix Paved Roads

Used by: Uber, Meta, Stripe
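
As an illustration of enforcing an invariant rather than an implementation, the sketch below checks dependency direction between layers and attaches a remediation hint to each error so it can be injected into an agent's context. The layer names, the `ALLOWED` table, and the message format are hypothetical; the parsing uses Python's standard `ast` module.

```python
import ast

# Hypothetical layering rule: a layer may import only the layers listed for it.
ALLOWED = {"ui": {"service", "domain"}, "service": {"domain"}, "domain": set()}


def check_imports(layer, source, filename="<module>"):
    """Return lint errors for imports that break the layer direction,
    each carrying a remediation hint an agent can act on."""
    errors = []
    for node in ast.walk(ast.parse(source, filename)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            target = module.split(".")[0]
            if target in ALLOWED and target != layer and target not in ALLOWED[layer]:
                allowed = ", ".join(sorted(ALLOWED[layer])) or "no other layer"
                errors.append(
                    f"{filename}:{node.lineno}: '{layer}' must not import '{target}'. "
                    f"Remediation: '{layer}' may import {allowed}; "
                    f"move the shared code down a layer or invert the dependency."
                )
    return errors


errors = check_imports(
    "domain",
    "import os\nfrom ui.widgets import Button\n",
    "domain/order.py",
)
print(errors[0])
```

Imports of modules outside the table (such as `os`) are ignored, so the check constrains only the architectural boundaries it knows about.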

Pattern 07

Integrated Feedback Loops

Quality bounded by feedback quality.

Close the loop as tightly as possible. Linter fires at edit time, not after CI. Playwright MCP for real browser verification: Anthropic found agents mark features 'complete' that don't work in the browser. Full observability exposed TO agents, not just humans. The tighter the loop, the higher the autonomous PR acceptance rate.

Sources: Anthropic, Ramp Inspect, Open SWE

Used by: Spotify, Ramp, Meta, Coinbase

Pattern 08

Agent Governance

Control who does what, with what permissions.

Control which agents spawn others, with what permissions and audit trail. Spawning limits prevent agent explosions. Scoped credentials stop a code-writing agent from deploying to prod, and every commit traces to its session log. But governance is only half the job: the lethal trifecta (private data, untrusted input, an exfiltration path) turns prompt injection into data loss. Defend with architectural boundaries, not just audit trails.

Sources: AWS Frontier Agents, GitHub Actions 2026, OpenAI Agent Monitoring, Snyk, Google Threat Intelligence

Used by: AWS, GitHub, OpenAI, Snyk
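
The controls described above (spawning limits, scoped credentials, an audit trail, and a lethal-trifecta check) can be sketched in a few functions. This is a minimal sketch under stated assumptions: the scope strings, the `MAX_DEPTH` limit, and the session fields are hypothetical and do not come from any of the cited vendors.

```python
from dataclasses import dataclass, field


@dataclass
class AgentSession:
    name: str
    scopes: frozenset                  # hypothetical scope names, e.g. "repo:write"
    depth: int = 0
    reads_private_data: bool = False
    sees_untrusted_input: bool = False
    can_send_externally: bool = False
    audit_log: list = field(default_factory=list)


MAX_DEPTH = 2  # assumed spawning limit


def spawn(parent, name, scopes):
    """Spawn a sub-agent whose scopes are no broader than its parent's."""
    if parent.depth + 1 > MAX_DEPTH:
        raise PermissionError("spawn depth limit reached")
    if not set(scopes) <= parent.scopes:
        raise PermissionError("child scopes exceed parent scopes")
    parent.audit_log.append(f"spawn:{name}")
    # Children share the parent's log so the whole tree traces to one session.
    return AgentSession(name, frozenset(scopes), parent.depth + 1,
                        audit_log=parent.audit_log)


def authorize(session, action):
    """Allow an action only if the session holds the matching scope."""
    allowed = action in session.scopes
    verdict = "allow" if allowed else "deny"
    session.audit_log.append(f"{session.name}:{action}:{verdict}")
    return allowed


def has_lethal_trifecta(session):
    """True when private data, untrusted input, and an exfiltration path
    all coexist in one session."""
    return (session.reads_private_data
            and session.sees_untrusted_input
            and session.can_send_externally)


root = AgentSession("planner", frozenset({"repo:write", "ci:run"}))
coder = spawn(root, "coder", {"repo:write"})
print(authorize(coder, "prod:deploy"))  # False: a code-writing agent cannot deploy
```

Checking `has_lethal_trifecta` before a session starts is the architectural boundary: the session is refused or split rather than merely logged.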

Emerging

Three more are moving fast from frontier labs to standard practice. Watch these. They're next.

Subagent Orchestration

One agent plans. A fleet executes.

Used by: Anthropic, Cognition, Stripe

Context Engineering

Attention is a budget. Spend it well.

Used by: Anthropic, Ramp, Spotify

Evals & Verification

Verification is the new bottleneck.

Used by: Stripe, Meta, Anthropic

06 The evidence base

Who proved it.
And what they found.

These aren't projections. Real companies, real numbers, real production systems. The pattern is consistent: methodology beats raw tooling every time.

Stripe 1,000+ agent PRs per week

No human-written code. 500 curated tools. Selective CI.

Ramp 50%+ merged PRs from agents

Organic adoption. Nobody forced it. Growing every sprint.

Uber 39 dev-years saved per year

65K diffs reviewed per week. 75% rated useful.

Anthropic 0% of tasks fully delegated

60% AI-assisted. 0% autonomous. The gap is the system.

CodeRabbit 1.7x more bugs from AI code

Speed without verification creates debt, not value.

06 The convergence map

Who's doing what. And where.

Six companies. Six SDLC phases. One clear convergence pattern. Green means production-deployed. Yellow means active or partial. The gaps are where SPEQD installs next.

Company	Code Generation	Code Review	Testing	Security	Deploy & CI/CD
Stripe	Minions	Agent review	Selective CI	Compliance	Auto-merge
Spotify	Honk	LLM Judge	Feedback Loops
Uber	uSpec	uReview
Meta	RADAR	Diff Risk	JiT Tests	Mutation
Coinbase	Forge	Agent review		Compliance	Risk-merge
Ramp	Inspect	Closed-loop	Self-verify

Legend: Production · Active / Partial · Not disclosed

Data compiled from 79 engineering blogs, case studies, and production reports. Last updated March 2026.

07 For engineering leaders

Live in production in 4 weeks. Independent in 8.

One use case. All three pillars. Then the harness compounds across teams, and we step back. Most engagements end at Phase 2 because the team owns it.

Phase 1

Install

4 weeks

One use case. All three pillars. Live in production.

Harness architecture + agent framework selection
Sandboxes configured for your stack
Closed-loop verification + security gates
First agent PRs merged to production

Phase 2

Optimize

4 weeks

The compound layer kicks in.

Feedback loops tuned from real production data
Coverage expanded to additional teams
Observability dashboards live
Champions trained to own the system

Phase 3

Compound

Ongoing

The system runs itself. We step back.

Cross-team knowledge reuse
New models integrated as they ship
Quarterly architecture reviews
Most teams go independent after Phase 2

For CTOs & VPs of Engineering

Your team adopted Copilot six months ago. PRs still take the same time. The gap isn't the model. It's the methodology. A production playbook closes the 70% your tools don't touch.

For platform & DevEx leaders

AI amplifies what already exists, dysfunction included. So we start with a readiness diagnostic, not a rollout. Build the harness on solid ground, and every team you serve gets faster. Build it on sand, and you scale the chaos.

For tech leads & senior engineers

Stop reviewing AI-generated code that breaks at every integration. Get a system that handles testing, security, and deployment. You focus on architecture. The playbook handles the rest.

CXO enablement

SPEQD embeds as an Applied-AI Enablement leader. Deep enough with the engineers to ship the harness, fluent enough with the C-suite to connect it to the P&L. We codify the methodology as a spec so it survives after we leave. The methodology is the moat.

Walk away with a plan. Whether you use us or not.

A working session: we map your stack, find the highest-leverage gap in your 70%, and hand you an implementation plan you can run tomorrow.

Book a 30-min intro
