
ADR-061 — Council Multi-Stage Review System

Status: Accepted (S1+S2 live, S3 implementation in progress — CAB-2046)
Date: 2026-04-11
Author: Christophe (CAB Ingenierie)
Parent Tickets: CAB-2046 [MEGA] Council Stage 3 — Automated Code Review; CAB-2047 (CLI); CAB-2048 (pre-push hook); CAB-2049 (CI workflow); CAB-2050 (history rotation); CAB-2051 (shadow mode)


Context

STOA's AI Factory runs up to five parallel Claude Code instances (backend, frontend, auth, mcp, qa) that generate, review, and merge code largely without human intervention. At scale, three classes of failure slip through a purely test-based gate:

  1. Wrong work — an instance picks a ticket that is too vague, mis-scoped, or already obsolete, and spends hours building something that nobody needs.
  2. Wrong plan — a plan is technically valid but conflicts with a recent architectural decision, misreads the blast radius, or ignores a compliance constraint.
  3. Wrong code — the implementation passes lint, tests, and typecheck, but introduces subtle debt, widens the attack surface, breaks a cross-component contract, or violates a convention that no automated linter catches.

Lint, tests, and CodeQL are necessary but not sufficient. They detect syntactic and known-pattern issues; they do not evaluate judgment.

A naïve answer is "have a human review every PR." That does not scale when 5 parallel instances push 15-30 PRs per day, and it wastes the most expensive resource (Christophe) on PRs that an LLM could validate correctly 95% of the time.

What we need: a validation system that injects LLM judgment at the three decision points above — before work starts, before code is written, and before code is merged — at a bounded cost, with a clear kill-switch, and with reasoning visible to humans for audit.


Decision

Implement a 3-stage Council system where each stage gates the next. Stages 1 and 2 evaluate prose artifacts (tickets, plans) through a 4-persona LLM jury. Stage 3 evaluates code diffs through a 4-axis parallel LLM review via a CLI tool (scripts/council-review.sh).

┌─────────────┐   ┌─────────────┐   ┌───────────┐   ┌─────────────────┐
│ S1: ticket  │ → │ S2: plan    │ → │ implement │ → │ S3: code review │ → merge
│ pertinence  │   │ validation  │   │           │   │ (4 axes, diff)  │
│ (prose,     │   │ (prose,     │   │           │   │ (CLI,           │
│ 4 personas) │   │ 4 personas) │   │           │   │ parallel API)   │
└─────────────┘   └─────────────┘   └───────────┘   └─────────────────┘

Each stage emits one of three verdicts: Go (proceed), Fix (apply adjustments, re-run), or Redo (reject, rewrite from scratch). A global score of ≥ 8.0/10 is required to Go.

Stage 1 — Ticket Pertinence (live since C11)

Invoked by the /council skill on a feature/ADR description before any work begins. Four personas evaluate the ticket itself:

| Persona | Role | Focus |
| --- | --- | --- |
| Chucky (Devil's Advocate) | Challenges assumptions | Risk, edge cases, failure modes, hidden complexity |
| OSS Killer (VC Skeptic) | Market viability | Competitive moat, user value, ROI, me-too risk |
| Archi 50x50 (Veteran Architect) | Technical quality | Coupling, scalability, pattern fit, tech debt |
| Better Call Saul (Legal/IP) | Legal/compliance | License, GDPR/DORA, competitive claims, disclaimers |

Each persona returns a score out of 10, a verdict, and specific adjustments to apply. The persona average is combined with the Context Compiler Impact Score modifier (LOW 0, MEDIUM 0, HIGH −0.5, CRITICAL −1.0) to produce the final score. Threshold: ≥ 8.0 → Go.
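As an illustration of the arithmetic only — the function names below are ours, not the /council skill's API:

```shell
# Sketch of the S1/S2 scoring rule: mean of the four persona scores,
# plus the Impact Score modifier (0, 0, -0.5 or -1.0), gated at 8.0.
council_final_score() {
  local p1=$1 p2=$2 p3=$3 p4=$4 modifier=$5
  awk -v a="$p1" -v b="$p2" -v c="$p3" -v d="$p4" -v m="$modifier" \
    'BEGIN { printf "%.2f", (a + b + c + d) / 4 + m }'
}

council_verdict() {
  # Per the shared rubric: >= 8.0 Go, 6.0-7.9 Fix, below 6.0 Redo
  awk -v s="$1" 'BEGIN { print (s >= 8.0) ? "Go" : (s >= 6.0) ? "Fix" : "Redo" }'
}
```

For example, persona scores of 9/8/8/8 on a HIGH-impact ticket average 8.25, land at 7.75 after the −0.5 modifier, and come back Fix rather than Go.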

Stage 1 catches: mis-scoped work, features with no measurable value, scope creep, unverified competitive claims, overly vague acceptance criteria. It runs in ~30 seconds (single Sonnet call, Council skill).

Stage 2 — Plan Validation (live since C11)

Invoked by the same /council skill on a completed implementation plan (files to touch, steps, expected LOC, alternatives considered). Same 4-persona jury, same scoring, same adjustment loop — but the questions now focus on technical correctness of the plan, not pertinence of the work.

Stage 2 catches: files in the wrong component, steps with hidden ordering dependencies, LOC estimates off by 2-5×, alternatives that weren't considered, security holes in the proposed approach, breaking contract changes masquerading as refactors.

Plans scoring < 8.0 receive numbered adjustments that the plan author applies, then Stage 2 re-runs on the revised plan. Tickets labeled ship-fast-path (≤ 5 pts Ship-mode) skip Stage 2 entirely — the expected value of plan validation is lower than its cost at that size.

Stage 3 — Automated Code Review (implementation in progress — CAB-2046)

Invoked after implementation on the actual git diff. Unlike S1/S2, Stage 3 is a CLI tool (scripts/council-review.sh), not a Claude skill. It runs four Anthropic API calls in parallel isolated subshells, one per axis, and aggregates their scores into a single verdict.

| Axis | Evaluates | Context Injected |
| --- | --- | --- |
| conformance | Coding conventions, commit format, naming, dead code, TODO hygiene | Linear ticket title + description (read-only via issueSearch) |
| debt | Tech debt, duplication, dead code, complexity, borderline abstractions | Diff only |
| attack_surface | New endpoints, secret handling, SSRF, input validation, auth bypass | Trivy JSON report (best-effort) |
| contract_impact | Cross-component contracts, breaking API changes, schema drift | docs/stoa-impact.db extract — affected components + contracts |

Each axis is strictly isolated: a separate subshell, a separate tempfile, a separate system prompt loaded from scripts/council-prompts/<axis>.md. PIDs are captured incrementally (one assignment immediately after each &, not a single trailing $! — bash only returns the most recent background PID). A wait loop tallies failures and routes results to a pure aggregate_scores function that emits the verdict JSON.
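The fan-out/fan-in pattern above can be sketched as follows. review_axis is a stand-in for the real per-axis API call, and the tempdir layout is illustrative:

```shell
set -euo pipefail

tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

review_axis() {  # placeholder for the per-axis API call + tempfile write
  sleep 0.1
  printf '{"axis":"%s","score":8}\n' "$1" > "$tmpdir/$1.json"
}

pids=()
for axis in conformance debt attack_surface contract_impact; do
  review_axis "$axis" &
  pids+=($!)   # capture immediately: $! only ever holds the *latest* PID
done

# Fan-in: tally failures without aborting on the first one
failures=0
for pid in "${pids[@]}"; do
  wait "$pid" || failures=$((failures + 1))
done
echo "axes failed: $failures"
```

Appending `$!` once per `&` is the load-bearing detail: a single trailing `$!` after the loop would wait on only the last subshell and silently drop the other three.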

Stage 3 is wired to two enforcement points:

  1. Pre-push hook (CAB-2048) — runs on every git push. Blocks push if verdict is REWORK or error. Can be disabled locally via DISABLE_COUNCIL_GATE=1 for emergencies.
  2. CI workflow (council-gate.yml, CAB-2049) — runs on every PR. Status check, required for merge. Feature-flagged on vars.COUNCIL_S3_ENABLED for phased rollout (shadow mode first — CAB-2051).
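A minimal sketch of the pre-push gate's control flow, assuming the exit-code contract this ADR describes (exit 0 = approved or soft skip, non-zero = REWORK or blocking error). The COUNCIL_CMD indirection is ours, added so the logic reads in isolation; the real hook invokes scripts/council-review.sh directly:

```shell
council_prepush_gate() {
  # Local emergency kill-switch, per CAB-2048
  if [ "${DISABLE_COUNCIL_GATE:-0}" = "1" ]; then
    echo "council: gate disabled locally (DISABLE_COUNCIL_GATE=1)"
    return 0
  fi
  # Assumed contract: 0 = APPROVED or soft skip (cap/dedup),
  # non-zero = REWORK or blocking error
  if ! ${COUNCIL_CMD:-scripts/council-review.sh}; then
    echo "council: REWORK or error — push blocked" >&2
    return 1
  fi
}
```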

Hard guardrails (applied before any API call)

All three stages share five guardrails enforced by the CLI:

  1. COUNCIL_DISABLE=1 kill-switch — fast-path exit 0, no review. Redundant with CI feature flag and pre-push DISABLE_COUNCIL_GATE, but covers manual invocation.
  2. Daily cost cap — COUNCIL_DAILY_CAP_EUR (default €5). Before each run, the CLI reads council-history.jsonl, sums today's cost_eur, and exits 0 (not 2) if the total is at cap. The intent is "skip review today", not "fail the PR".
  3. SHA dedup — SHA1 of the full diff. Identical diffs in the same day (CI retries, cherry-picks, empty pushes) skip the review. Bypass: COUNCIL_FORCE_DEDUP=0.
  4. Gitleaks pre-flight (BLOCKING) — the diff is scanned by gitleaks with --exit-code 1 before any API call. Any leak → exit 2, no API traffic. This is the cheapest possible guard against exfiltrating secrets to Anthropic.
  5. Diff truncation — 10 000-line hard cap on the diff sent to the model. Truncation is logged and recorded in the JSONL entry (diff_truncated: true).
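Guardrail 2 can be sketched as a small predicate over the ledger. The file and field names come from this ADR; the function name and jq phrasing are ours:

```shell
council_cap_reached() {
  local ledger=${1:-council-history.jsonl}
  local cap=${COUNCIL_DAILY_CAP_EUR:-5}
  local today spent
  today=$(date -u +%Y-%m-%d)
  # Sum cost_eur over today's entries; missing/empty ledger counts as 0
  spent=$(jq -s --arg d "$today" \
    '[.[] | select(.timestamp | startswith($d)) | .cost_eur] | add // 0' \
    "$ledger" 2>/dev/null || echo 0)
  # Succeeds (0) when spent >= cap — the caller treats that as a
  # soft skip ("no review today"), never as a failed review
  awk -v s="$spent" -v c="$cap" 'BEGIN { exit !(s >= c) }'
}
```

A caller would do `council_cap_reached && exit 0` before any API traffic, which is exactly why the cap exits 0 rather than 2.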

Every run, whether APPROVED/REWORK/error/BYPASSED, appends one JSONL entry to council-history.jsonl:

{
  "timestamp": "2026-04-11T17:39:01Z",
  "ticket": "CAB-2047",
  "status": "APPROVED",
  "global_score": 8.50,
  "axes_evaluated": 4,
  "db_fresh": true,
  "model": "claude-sonnet-4-5",
  "tokens": 3200,
  "input_tokens": 2800,
  "output_tokens": 400,
  "cost_eur": 0.014,
  "diff_lines": 142,
  "diff_truncated": false,
  "duration_ms": 18500,
  "diff_sha": "ab2a795c7ca0924f21cac2d606fde5666de17bb1"
}

This file is the authoritative cost ledger for the daily cap, the dedup cache, and the audit trail for security/compliance. It is gitignored (machine-local) and rotated at 500 lines by CAB-2050.
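As a usage illustration, a hypothetical jq summary over the ledger (field names taken from the sample entry above; the function name is ours, and the real weekly audit is a CAB-2050 deliverable):

```shell
council_ledger_summary() {  # $1 = path to council-history.jsonl
  # Slurp the JSONL into one array, then report spend, run count,
  # and verdict distribution in a single pass
  jq -s '
    { total_cost_eur: ([.[].cost_eur] | add),
      runs: length,
      verdicts: (group_by(.status) | map({(.[0].status): length}) | add) }
  ' "$1"
}
```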

Language choice: bash, not Python (Adj #12)

CAB-2047 Council S2 flagged the ~400-LOC size of the CLI as a bash red flag and asked explicitly whether Python would be better. The answer is bash, for four reasons:

  1. Consistency with scripts/ai-ops/ai-factory-notify.sh (232 LOC) and scripts/ai-ops/model-router.sh (120 LOC) — all CI tooling is already bash.
  2. Zero new CI toolchain — GitHub Actions runners have bash + jq + curl + gitleaks preinstalled. Python would add venv/pip/requirements to every CI run that invokes Council S3.
  3. Testable pure function — the core logic (aggregate_scores) is a pure function over a tempdir of JSON files. It's trivially testable via bats-core against fabricated fixtures, without shell-injection risk.
  4. Cheap adversarial hardening — shellcheck -S style catches most bash footguns. set -euo pipefail, explicit variable quoting, and whitelisted component-id regex (^[a-zA-Z0-9_-]+$) before any SQL interpolation cover the remaining cases.
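Point 4's whitelist is small enough to show whole. A sketch (the function name is ours; the regex is the one quoted above):

```shell
# Reject any component id containing characters outside the whitelist,
# before it is ever interpolated into SQL against stoa-impact.db
validate_component_id() {
  [[ $1 =~ ^[a-zA-Z0-9_-]+$ ]]
}
```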

Fragility risk is mitigated by the guardrails above, by shellcheck in CI, and by the MOCK_API fixture suite (MOCK_API=1 short-circuits every API call to a local JSON fixture — enables fast, hermetic, zero-cost tests).
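For illustration, a minimal sketch of the shape aggregate_scores could take — a pure function over a tempdir of per-axis JSON results, which is what makes it testable against fabricated fixtures. The real function's output schema and Redo handling are richer than this:

```shell
aggregate_scores() {
  local dir=$1
  # Slurp every per-axis result, average the scores, gate at 8.0
  jq -s '
    { axes_evaluated: length,
      global_score: (([.[].score] | add) / length),
      status: (if (([.[].score] | add) / length) >= 8.0
               then "APPROVED" else "REWORK" end) }
  ' "$dir"/*.json
}
```

Because it reads only files and writes only stdout, a bats (or plain-shell) test just drops fixture JSONs in a tempdir and asserts on the emitted verdict.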

Scoring rubric

All three stages use the same rubric. Each axis/persona returns an integer 1-10:

| Score | Verdict |
| --- | --- |
| 9-10 | Perfect or near-perfect. No blockers. |
| 8 | Acceptable. Minor non-blocking comments. Minimum to Go. |
| 6-7 | Fix required. Specific adjustments must be applied, then re-run. |
| 1-5 | Redo. The underlying approach is wrong; start over. |

The global score is the arithmetic average of the valid axes (S3) or the 4 personas (S1/S2), optionally adjusted by the Context Compiler Impact modifier. A stage Goes if the global score is ≥ 8.0 and no axis/persona individually returned a Redo.

Cost model

| Stage | Trigger | Model | Typical cost | Frequency |
| --- | --- | --- | --- | --- |
| S1 | /council skill on ticket description | Sonnet 4.5 | ~€0.005 per run | 5-15/day |
| S2 | /council skill on plan | Sonnet 4.5 | ~€0.008 per run | 3-10/day |
| S3 | pre-push hook + CI on every diff | Sonnet 4.5 × 4 parallel | ~€0.03-0.06 per run | 15-30/day |

Expected steady-state daily cost: €1-2/day (well under the €5 cap). The cap exists as a circuit breaker for pathological loops (e.g., a CI retry storm), not as a budget target.


Alternatives Considered

A. Human-only review

Rejected. With 5 parallel instances pushing 15-30 PRs/day, this puts Christophe on a 2-hour review backlog and blocks the entire factory on one person's availability. It also wastes human judgment on PRs where an LLM would correctly approve 95% of them.

B. Lint + tests only (no Council)

Rejected. This is the pre-Council baseline. It misses the three failure classes Council targets: wrong work, wrong plan, subtle code issues that syntactic tools cannot see. CAB-1944 shipped 5 regressions at mocked test boundaries that 5 700 passing tests did not catch — a Stage 3 check on the diff would have flagged the boundary mocking pattern.

C. Single-stage Council (code review only)

Rejected. S3 alone catches bad code, not bad work. An instance that spends 6 hours implementing the wrong ticket will pass S3 easily (the code is clean) and waste the time. S1 and S2 catch failures hours to days earlier, when the cost of reversal is minimal.

D. Local open-source LLM (no API calls)

Rejected for now. The quality gap between Sonnet 4.5 and self-hosted 8B-70B models on structured code review is still large (verified April 2026 — local models hallucinate blockers or miss real ones). API cost at €1-2/day is cheap compared to the engineering time lost to a bad merge. Revisit when local models close the quality gap or when API cost drives >€20/day.

E. Python CLI for S3 (instead of bash)

Rejected. See "Language choice" above. Python was the default assumption coming out of Council S2, but the actual analysis favored bash on three of four dimensions (consistency, CI toolchain, hardening) and tied on the fourth (testability — bats-core matches pytest for pure-function tests).

F. Single-axis S3 (conformance only)

Rejected as the final target, but accepted as the MVP path. Steps 1-3b of CAB-2047 shipped conformance-only as an intermediate gate while the orchestration was wired up. Step 3c completed the transition to all 4 axes. The single-axis fallback is preserved as a debug option via COUNCIL_AXES_ONLY=conformance.


Consequences

Positive

  • Three independent safety nets against AI Factory regressions, each catching a different failure class.
  • Reasoning is visible. Every Go/Fix/Redo comes with per-axis feedback and a list of adjustments. Humans auditing a merge can read the Council reasoning, not just the score.
  • Cost is bounded and observable. council-history.jsonl gives exact cost-per-review, and the daily cap is a hard circuit breaker.
  • Kill-switches at every layer. COUNCIL_DISABLE, DISABLE_COUNCIL_GATE, vars.COUNCIL_S3_ENABLED, COUNCIL_FORCE_DEDUP=0 — any instance, CI workflow, or operator can disable any piece independently.
  • Shadow-mode rollout (CAB-2051) de-risks production enforcement: S3 runs on every PR but its verdict is non-blocking for 2-3 weeks. We calibrate thresholds against real diffs before flipping the required-check flag.
  • Gitleaks pre-flight prevents secret exfiltration to Anthropic, addressing the most acute privacy concern about LLM-based code review.

Negative

  • LLM judgment is not deterministic. The same diff reviewed twice can score 8.1 and 7.9. We accept this; the threshold is an average over 4 axes, which reduces per-run variance significantly.
  • Prompt calibration is iterative. Steps 4 and 5 of CAB-2047 will tune per-axis prompts against real diffs before S3 is enforced. Expect 2-3 rounds of prompt iteration in the first month.
  • Additional API dependency. S3 depends on the Anthropic API being reachable from CI. Outages block merges. Mitigations: the daily cap is a soft skip (exit 0), the 30s timeout + 2 retries covers transient failures, and COUNCIL_DISABLE=1 is always available.
  • Cost drift risk. A runaway loop or a model price change could blow through the cap. The cap is the first line of defense; the second is a weekly audit of council-history.jsonl.

Neutral

  • Bash maintenance cost. The CLI is ~1 300 LOC of bash. Shellcheck and bats-core cover most of the risk, but a large refactor would still be more effort than in Python. We accept this trade because it's CI tooling that changes rarely.
  • MOCK_API fixtures drift. Fixture JSONs must be updated when we change the record_review tool schema. Mitigations: schema is simple (score, feedback, blockers) and changes are rare; any drift shows up immediately in the MOCK_API test runs.

Rollout Plan (CAB-2046 phases)

| Phase | Ticket | Description | Status |
| --- | --- | --- | --- |
| 1 | CAB-2047 | scripts/council-review.sh CLI + 4 axes + bats tests + doc | In progress (Steps 1-3c merged, Step 4 next) |
| 2 | CAB-2048 | Pre-push hook extension — invoke council-review.sh locally | Blocked by CAB-2047 |
| 3 | CAB-2049 | council-gate.yml CI workflow + feature flag | Blocked by CAB-2047 |
| 4 | CAB-2050 | Rotation of council-history.jsonl (500-line rolling window) | Blocked by CAB-2047 |
| 5 | CAB-2051 | Shadow mode observation (2-3 weeks non-blocking) → flip to required | Blocked by CAB-2048 + CAB-2049 + CAB-2050 |

Enforcement becomes mandatory only after Phase 5 completes and shadow-mode data confirms calibration (false-positive rate < 10%, false-negative rate < 5% against a manual-review ground truth).


Verification

  • S1/S2 coverage: every new ticket with estimate ≥ 5 pts OR label council-review. Skill: /council <description>.
  • S3 coverage: every git push (pre-push hook, local) AND every PR (CI, remote). Fail-closed once shadow mode ends.
  • Audit trail: council-history.jsonl (local), Linear comments with council:ticket-go / council:plan-go labels for S1/S2, GitHub status check history for S3.
  • Cost observability: weekly audit via scripts/ai-ops/council-audit.sh (CAB-2050 deliverable) — reports total runs, cost, verdict distribution, calibration drift.

References

  • CAB-2046 — [MEGA] Council Stage 3 — Automated Code Review (parent)
  • CAB-2047 — council-review.sh CLI (13 pts, in progress)
  • CAB-2048 — pre-push hook extension
  • CAB-2049 — council-gate.yml CI workflow
  • CAB-2050 — council-history.jsonl rotation
  • CAB-2051 — Shadow mode observation
  • .claude/skills/council/SKILL.md — S1/S2 runbook
  • .claude/rules/council-s3.md — S3 runbook (CAB-2047 Step 4 deliverable)
  • scripts/council-review.sh — S3 CLI implementation
  • scripts/council-prompts/{conformance,debt,attack_surface,contract_impact}.md — per-axis system prompts (externalized in CAB-2047 Step 3a)
  • ADR-050 — Guardrails, Token Budget, State Machine (Council first used here, score 7.25/10 Fix)
  • ADR-056 — FAPI 2.0 Architecture (Council 8.13/10 Go)
  • ADR-059 — Simplified Deployment SSE (Council 8.00/10 Go, CRITICAL impact)
  • ADR-060 — AI-Verified UI Testing (complementary — Council validates judgment, ADR-060 validates visual output)

Reviewers

| Role | Status |
| --- | --- |
| Council S1 | 8.125/10 Go (on CAB-2046) |
| Council S2 | 8.5/10 Go (on CAB-2046, 13 adjustments applied) |
| Architect | Christophe (CAB Ingenierie) |
| Impact Score | HIGH (20) — cross-component (CI, pre-push hook, scripts/), feature-flagged rollout, kill-switches at every layer |