Orchestration Design

Overview

Farmerspec is a multi-agent development framework that orchestrates AI agents to develop features using spec-driven, test-first development. This document defines the orchestration layer, phase transitions, artifacts, and validation gates.


Workflow Phases (13-Phase TDD Architecture)

The workflow follows a TDD (Test-Driven Development) sequence where tests are written BEFORE implementation.

| Phase | Type | Agent | Description |
|-------|------|-------|-------------|
| 01 | Planning | Baron (PM) | WRD Intake - Validate WRD, extract requirements |
| 02 | Planning | Duc (Architecture) | Blueprint Creation - Create technical blueprint |
| 03 | Review | Baron (PM) | Blueprint Review - Approve or request changes |
| 04 | Planning | Marie (Testing) | Test Planning - Plan E2E tests (UI + Backend) |
| 05 | Implementation | Marie (Testing) | Test Implementation - Write E2E tests (TDD red) |
| 06 | Planning | Dede (Backend) | Backend Planning - Plan backend tasks |
| 07 | Implementation | Dede (Backend) | Backend Implementation - Implement against tests |
| 08 | Planning | Dali (Frontend) | Frontend Planning - Plan frontend tasks |
| 09 | Implementation | Dali (Frontend) | Frontend Implementation - Implement UI |
| 10 | Planning | Maigret (SRE) | SRE Planning - Plan observability tasks |
| 11 | Implementation | Maigret (SRE) | SRE Implementation - Implement monitoring |
| 12 | Planning | Gustave (DevOps) | DevOps Planning - Plan infrastructure tasks |
| 13 | Implementation | Gustave (DevOps) | DevOps Implementation - Deploy and configure |

Core Principles

TDD Sequence

Tests are written BEFORE implementation:

1. Marie writes E2E tests (phases 4-5)
2. Backend/Frontend implement to make tests pass (phases 6-9)
3. SRE/DevOps handle observability and deployment (phases 10-13)

Stateless Agents

All agents are stateless: each invocation is a fresh instance with no memory of previous runs. This means:

- Agents cannot learn from experience directly
- All context must be provided in the prompt + input files
- Improvement happens through RL updates to prompts and knowledge docs
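Because nothing persists between invocations, the orchestrator must assemble the complete context on every call. A minimal sketch of that assembly step (the function and parameter names here are illustrative, not the framework's actual API):

```python
def build_agent_prompt(system_prompt: str, guardrails: str,
                       knowledge_docs: list[str], phase_input: str) -> str:
    """Assemble one self-contained prompt for a fresh, stateless invocation.

    Every piece of context the agent needs must appear here, since the
    agent retains nothing from previous runs.
    """
    sections = [system_prompt, guardrails, *knowledge_docs, phase_input]
    return "\n\n---\n\n".join(sections)
```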

Dual-Agent Validation

Each phase uses a dual-agent model:

┌─────────────────────────────────────────────────────────────────┐
│  EXECUTOR AGENT (Duc, Marie, Dede, etc.)                        │
│  • Receives: input files + prompt + knowledge + guardrails      │
│  • Produces: output.json + source claims                        │
│  • Does NOT self-score                                          │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  BARON (Evaluator - fresh instance)                             │
│  • Receives: input files + output.json + claims                 │
│  • Verifies: "You claimed line 45 of WRD - let me check"       │
│  • Produces: feedback.json with validated scores                │
│  • Verdict: pass | flag | escalate                             │
└─────────────────────────────────────────────────────────────────┘

Why dual-agent?

- The executor can't accurately self-report where its output came from
- The evaluator can verify claims against the actual documents
- Separation of concerns: execute vs. validate
- A fresh instance means no bias carryover

Escalation = Abort

When Baron's validation escalates, the run is aborted. Human intervention is required before re-running.

| Verdict | Action |
|---------|--------|
| pass | Continue to next phase |
| flag | Log warning, continue |
| escalate | Abort run, human must review |
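The verdict-to-action mapping above can be sketched as a small dispatcher (the exception and logger names are illustrative):

```python
import logging

logger = logging.getLogger("orchestrator")


class RunAborted(Exception):
    """Raised on escalate: the run stops until a human reviews it."""


def apply_verdict(verdict: str, phase: str) -> None:
    """Apply Baron's verdict per the table above."""
    if verdict == "pass":
        return                                    # continue to next phase
    if verdict == "flag":
        logger.warning("Phase %s flagged; continuing", phase)
        return                                    # log warning, continue
    if verdict == "escalate":
        raise RunAborted(f"Phase {phase}: human review required")
    raise ValueError(f"Unknown verdict: {verdict!r}")
```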

Agents

Executor Agents

| Agent | Role | Phases |
|-------|------|--------|
| Baron | PM / Orchestrator | 1 (WRD Intake), 3 (Blueprint Review) |
| Duc | Architect | 2 (Blueprint Creation) |
| Marie | QA Engineer | 4-5 (Test Planning + Implementation) |
| Dede | Backend Developer | 6-7 (Backend Planning + Implementation) |
| Dali | Frontend Developer | 8-9 (Frontend Planning + Implementation) |
| Maigret | SRE | 10-11 (SRE Planning + Implementation) |
| Gustave | DevOps / GitOps | 12-13 (DevOps Planning + Implementation) |

Validation & RL Agents

| Agent | Role | Scope | When |
|-------|------|-------|------|
| Baron | In-Flight Validator | Single phase, single run | During workflow |
| Socrate | Retrospective Analyst | Cross-run, cross-agent | After runs complete |

Baron: In-Flight Validator

Baron validates ALL phases during workflow execution—including phases where Baron is also the executor.

Responsibilities

| Task | Description |
|------|-------------|
| Verify claims | Check that executor's source claims match actual documents |
| Compute feedback | Produce validated sourceAttribution scores |
| Quality gate | Pass / flag / escalate based on output quality |
| Traceability | Ensure outputs can be traced to inputs |

What Baron Checks

```python
def get_baron_evaluation_prompt(self) -> str:
    return """
    Validate the executor's output:
    1. CLAIM VERIFICATION: Do source claims match actual documents?
    2. COMPLETENESS: Are all required fields present and accurate?
    3. TRACEABILITY: Can each output be traced to an input source?
    4. QUALITY: Is the output sufficient for the next phase?

    Verify each claim by checking the referenced document.

    Return: pass | flag | escalate
    """
```

Baron Validation Result

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class BaronValidationResult:
    verdict: Literal["pass", "flag", "escalate"]
    confidence: float  # 0.0-1.0
    feedback: ValidatedFeedback  # Verified source attribution
    issues: list[ValidationIssue]
    escalation_reason: str | None  # Why human needed (if escalate)
```
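How Baron turns its list of issues into a verdict is not specified above; one plausible policy, using a stand-in `ValidationIssue` type with assumed severity levels, might look like:

```python
from dataclasses import dataclass


@dataclass
class ValidationIssue:
    severity: str  # assumed levels: "minor" | "major" | "critical"
    message: str


def decide_verdict(issues: list[ValidationIssue]) -> str:
    """Illustrative policy: any critical issue escalates, majors flag."""
    severities = {i.severity for i in issues}
    if "critical" in severities:
        return "escalate"
    if "major" in severities:
        return "flag"
    return "pass"
```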

Socrate: Retrospective Analyst

Socrate analyzes completed runs to improve agents. This is a separate process from live workflow validation.

Responsibilities

| Task | Description |
|------|-------------|
| Pattern detection | Find recurring gaps across runs |
| Cross-agent analysis | Compare performance across agents |
| Correlation analysis | Link low inputClarity to high ungrounded |
| Improvement suggestions | Concrete changes to prompts/knowledge |

What Socrate Analyzes

```sql
-- Example: Find agents with high invention rates
SELECT agent_code, AVG(invented) as avg_invented
FROM phase_scores
GROUP BY agent_code
ORDER BY avg_invented DESC;

-- Example: Find recurring knowledge gaps
SELECT target_file, description, COUNT(*) as occurrences
FROM gaps
WHERE gap_type = 'knowledge'
GROUP BY target_file, description
ORDER BY occurrences DESC;
```

Socrate Output

```json
{
  "analysis_period": "2026-01-01 to 2026-01-17",
  "runs_analyzed": 47,
  "insights": [
    {
      "type": "pattern",
      "severity": "high",
      "finding": "Backend agent invents timeout values in 40% of runs",
      "recommendation": {
        "artifact": "knowledge",
        "file": "agents/backend/knowledge/api-patterns.md",
        "section": "Timeouts",
        "content": "Add default timeout guidance: 30s for API calls, 5s for health checks"
      }
    }
  ]
}
```

Baron vs Socrate Comparison

| Aspect | Baron (In-Flight) | Socrate (Retrospective) |
|--------|-------------------|-------------------------|
| When | During phase execution | After runs complete |
| Scope | Single phase, single run | Cross-run, cross-agent |
| Purpose | Quality gate | Continuous improvement |
| Action | Pass/flag/escalate | Suggest prompt/knowledge updates |
| Blocking | Yes (escalate aborts) | No (async analysis) |
| Output | Verified feedback per phase | Improvement recommendations |

Phase Execution Flow

WRD Document
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 1: WRD Intake                                            │
│  Executor: Baron  │  Evaluator: Baron (fresh instance)          │
│  • Extract requirements, classify, validate completeness        │
│  • Deterministic validation: Schema, required fields            │
│  • Baron evaluation: Verify claims, compute feedback            │
└─────────────────────────────────────────────────────────────────┘
    │ (pass/flag → continue, escalate → ABORT)
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 2: Blueprint Creation                                    │
│  Executor: Duc  │  Evaluator: Baron                             │
│  • Create technical blueprint from requirements                 │
│  • Define components, files, APIs, data models                 │
│  • Executor makes claims, Baron verifies                        │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASE 3: Blueprint Review                                      │
│  Executor: Baron  │  Evaluator: Baron (fresh instance)          │
│  • Review blueprint for feasibility and completeness           │
│  • Approve, request revision, or reject                        │
│  • Self-evaluation with fresh context                          │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASES 4-5: Test Planning + Implementation                     │
│  Executor: Marie  │  Evaluator: Baron                           │
│  • Plan E2E tests covering acceptance criteria                 │
│  • Write tests BEFORE implementation (TDD red phase)           │
│  • Tests should FAIL initially (no implementation yet)         │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASES 6-7: Backend Planning + Implementation                  │
│  Executor: Dede  │  Evaluator: Baron                            │
│  • Plan backend tasks from blueprint                           │
│  • Implement to make tests pass (TDD green phase)              │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASES 8-9: Frontend Planning + Implementation                 │
│  Executor: Dali  │  Evaluator: Baron                            │
│  • Plan frontend tasks from blueprint                          │
│  • Implement UI components                                     │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASES 10-11: SRE Planning + Implementation                    │
│  Executor: Maigret  │  Evaluator: Baron                         │
│  • Plan monitoring, alerting, observability                   │
│  • Implement dashboards, alerts, SLOs                         │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  PHASES 12-13: DevOps Planning + Implementation                 │
│  Executor: Gustave  │  Evaluator: Baron                         │
│  • Plan infrastructure, CI/CD, deployment                     │
│  • Implement and deploy                                        │
└─────────────────────────────────────────────────────────────────┘
    │
    ▼
  RUN COMPLETE
    │
    ▼
┌─────────────────────────────────────────────────────────────────┐
│  SOCRATE (Async, Post-Workflow)                                 │
│  • Aggregates feedback from completed run                      │
│  • Analyzes patterns across run history                        │
│  • Produces improvement suggestions for prompts/knowledge      │
└─────────────────────────────────────────────────────────────────┘

Validation System

Two-Layer Validation

Each phase has two validation layers:

| Layer | Type | When | Purpose |
|-------|------|------|---------|
| Deterministic | Code-based | Before & after LLM | Schema, required fields, format |
| Baron LLM | AI-reviewed | After deterministic | Verify claims, semantic quality |

Deterministic Validators

Fast, code-based checks that run BEFORE Baron evaluation:

| Validator | Purpose |
|-----------|---------|
| SchemaValidator | Output matches expected Pydantic schema |
| RequiredFieldsValidator | Required fields present and non-empty |
| FileExistsValidator | Referenced file paths exist |
| CrossReferenceValidator | IDs reference valid entities from previous phases |
| ValueInSetValidator | Field values in allowed set |
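As an illustration, a RequiredFieldsValidator could be as small as the following; the real interface in the codebase may differ:

```python
from typing import Any


class RequiredFieldsValidator:
    """Deterministic check: required fields are present and non-empty."""

    def __init__(self, required: list[str]):
        self.required = required

    def validate(self, output: dict[str, Any]) -> list[str]:
        """Return one error message per missing or empty required field."""
        errors = []
        for field in self.required:
            value = output.get(field)
            if value is None or value == "" or value == [] or value == {}:
                errors.append(f"Required field missing or empty: {field}")
        return errors
```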

Phase Execution Steps

Every phase follows this 5-step pattern:

| Step | Name | On Failure |
|------|------|------------|
| 1 | Input files | - |
| 2 | Validate input (deterministic) | FAIL |
| 3 | Execute phase (Executor LLM) | FAIL |
| 4 | Validate output (deterministic) | FAIL |
| 5 | Evaluate output (Baron LLM) | ABORT (if escalate) |
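The five steps can be sketched as one driver function (the method names and exceptions here are illustrative, not the actual BasePhase API):

```python
class PhaseFailed(Exception):
    """Deterministic validation failed (steps 2 or 4)."""


class RunAborted(Exception):
    """Baron escalated (step 5); a human must review."""


def run_phase(phase, context):
    input_data = phase.build_input(context)         # 1. Load input files
    if errors := phase.validate_input(input_data):  # 2. Deterministic input check
        raise PhaseFailed(errors)
    output = phase.execute(input_data)              # 3. Executor LLM runs
    if errors := phase.validate_output(output):     # 4. Deterministic output check
        raise PhaseFailed(errors)
    verdict = phase.evaluate(output)                # 5. Baron evaluation
    if verdict == "escalate":
        raise RunAborted("human review required")
    return output, verdict
```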

Executor Contract

Every executor agent MUST return output with source claims:

```json
{
  "status": "success | blocked | needs_clarification",
  "result": { /* phase-specific output */ },
  "claims": [
    {
      "outputPath": "result.components[0].name",
      "source": "input | code | knowledge | guardrails",
      "ref": "wrd.yaml#scope.backend.services[0]",
      "quote": "Payment service module"
    }
  ],
  "blockers": []
}
```

Claim Structure

| Field | Description |
|-------|-------------|
| outputPath | JSON path to the output field |
| source | Priority level: input > code > knowledge > guardrails |
| ref | Document path + section reference |
| quote | Relevant excerpt from source |

Claims enable Baron to verify: "Did the executor actually get this from where they say?"
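A sketch of the per-claim check: does the quoted text actually appear in the referenced document? (The `documents` mapping and the exact-substring matching rule are assumptions; a real implementation would also resolve the section anchor after `#`.)

```python
def verify_claim(claim: dict, documents: dict[str, str]) -> bool:
    """True if the claim's quote appears in the referenced document.

    documents maps a path (the part of `ref` before '#') to raw text.
    """
    doc_path = claim["ref"].split("#", 1)[0]
    return claim["quote"] in documents.get(doc_path, "")
```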


Feedback Structure

Baron produces validated feedback for each phase:

```json
{
  "feedback": {
    "scores": {
      "sourceAttribution": {
        "overall": 0.82,
        "breakdown": {
          "fromInput": 0.55,
          "fromCode": 0.20,
          "fromKnowledge": 0.15,
          "fromGuardrails": 0.05,
          "ungrounded": 0.05
        }
      },
      "guardrailsCompliance": { "overall": 0.90 },
      "inputClarity": { "overall": 0.65 },
      "outputConfidence": { "overall": 0.72 }
    },
    "traceability": [ /* verified source traces */ ],
    "gaps": [ /* identified documentation gaps */ ],
    "suggestions": [ /* improvement recommendations */ ]
  }
}
```

See agents/pm/knowledge/feedback-production.md for complete structure.
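One property worth checking deterministically (an assumption on our part, not stated in the spec): the five sourceAttribution fractions partition the output, so they should sum to 1.0.

```python
import math


def breakdown_is_consistent(breakdown: dict[str, float]) -> bool:
    """Check that the source-attribution fractions sum to 1.0."""
    return math.isclose(sum(breakdown.values()), 1.0, abs_tol=1e-6)
```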


RL Loop

The RL loop uses feedback to improve agents:

┌────────────────────────────────────────────────────────────────┐
│  WHAT RL CAN IMPROVE                                           │
├────────────────────────────────────────────────────────────────┤
│  ✅ System prompts      agents/{code}/prompt.md                │
│  ✅ Mission prompts     agents/{code}/missions/{mission}.md    │
│  ✅ Knowledge docs      agents/{code}/knowledge/*.md           │
│  ✅ Guardrails          docs/constitution/guardrails/*.md      │
│  ✅ Templates           templates/*.md                         │
├────────────────────────────────────────────────────────────────┤
│  WHAT RL CANNOT IMPROVE                                        │
├────────────────────────────────────────────────────────────────┤
│  ❌ Model weights       (we use Claude as-is)                  │
│  ❌ Agent memory         (stateless)                           │
│  ❌ Training data        (baked into model)                    │
└────────────────────────────────────────────────────────────────┘

Every piece of feedback MUST point to an improvable artifact:

```json
{
  "suggestion": {
    "target": {
      "artifact": "knowledge | prompt | guardrails | template",
      "file": "agents/backend/knowledge/api-patterns.md",
      "section": "Timeouts"
    },
    "problem": "No guidance on default timeout values",
    "recommendation": "Add section: 30s for API calls, 5s for health checks"
  }
}
```
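Applying such a suggestion could be as simple as appending to the targeted file; this is an illustrative sketch, not the framework's actual update mechanism:

```python
from pathlib import Path


def apply_suggestion(suggestion: dict, repo_root: Path) -> Path:
    """Append a suggestion's recommendation to its target artifact file."""
    target = suggestion["target"]
    path = repo_root / target["file"]
    path.parent.mkdir(parents=True, exist_ok=True)
    note = (f"\n\n<!-- RL update: {target.get('section', 'general')} -->\n"
            f"{suggestion['recommendation']}\n")
    with path.open("a", encoding="utf-8") as f:
        f.write(note)
    return path
```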

Phase Architecture

Each phase is a self-contained module:

```
backend/app/services/phases/
├── __init__.py              # Phase registry
├── base.py                  # BasePhase, PhaseContext
├── wrd_intake.py            # Phase 1 (Baron → Baron)
├── blueprint_creation.py    # Phase 2 (Duc → Baron)
├── blueprint_review.py      # Phase 3 (Baron → Baron)
├── test_planning.py         # Phase 4 (Marie → Baron)
├── test_implementation.py   # Phase 5 (Marie → Baron)
├── backend_planning.py      # Phase 6 (Dede → Baron)
├── backend_implementation.py # Phase 7 (Dede → Baron)
├── frontend_planning.py     # Phase 8 (Dali → Baron)
├── frontend_implementation.py # Phase 9 (Dali → Baron)
├── sre_planning.py          # Phase 10 (Maigret → Baron)
├── sre_implementation.py    # Phase 11 (Maigret → Baron)
├── devops_planning.py       # Phase 12 (Gustave → Baron)
└── devops_implementation.py # Phase 13 (Gustave → Baron)
```

Phase Base Class

```python
from abc import ABC, abstractmethod
from typing import Any

from pydantic import BaseModel

# PhaseType, AgentCode, PhaseContext, PhaseResult, and DeterministicValidator
# are project-internal types (see backend/app/services/phases/base.py).


class BasePhase(ABC):
    phase_type: PhaseType
    executor_agent: AgentCode      # Who executes
    evaluator_agent: AgentCode     # Who evaluates (always Baron)
    input_schema: type[BaseModel]
    output_schema: type[BaseModel]

    @abstractmethod
    async def build_input(self, context: PhaseContext) -> dict[str, Any]:
        """Build phase-specific input from run context."""
        pass

    @abstractmethod
    async def execute(self, input_data: dict[str, Any]) -> PhaseResult:
        """Execute the phase (call executor agent)."""
        pass

    @abstractmethod
    def get_deterministic_validators(self) -> list[DeterministicValidator]:
        """Return list of deterministic checks for this phase's output."""
        pass

    @abstractmethod
    def get_baron_evaluation_prompt(self) -> str:
        """Return Baron evaluation prompt for claim verification."""
        pass
```

Document Flow

| Document | Format | Naming |
|----------|--------|--------|
| WRD (input) | YAML | wrd-{type}-{slug}.yaml |
| Phase Output | JSON | data/wrds/{wrd_id}/runs/{run_id}/phases/{phase}.json |
| Feedback | JSON | Embedded in phase output |
| Socrate Analysis | JSON | data/analysis/{date}/socrate_report.json |
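The naming scheme in the table maps to paths mechanically; a tiny illustrative helper:

```python
def phase_output_path(wrd_id: str, run_id: str, phase: str) -> str:
    """Build a phase-output location from the naming convention above."""
    return f"data/wrds/{wrd_id}/runs/{run_id}/phases/{phase}.json"
```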

Commands

```bash
# Start a new run from WRD
farmerspec.run --wrd=./wrd-feature-oauth.yaml

# Check run status
farmerspec.status --run={run_id}

# List phases for a run
farmerspec.phases --run={run_id}

# View phase output
farmerspec.phase --run={run_id} --phase=wrd_intake

# Re-run from specific phase (after human fixes)
farmerspec.rerun --run={run_id} --from=blueprint_creation

# Run Socrate analysis
farmerspec.socrate --since=2026-01-01
```

Version History

| Version | Date | Changes |
|---------|------|---------|
| 0.1 | 2026-01-13 | Initial design |
| 0.5 | 2026-01-14 | WRD naming, Phase 00 (WRD Discovery) |
| 1.0 | 2026-01-17 | 13-phase TDD architecture with validation system |
| 1.1 | 2026-01-17 | Dual-agent model: Executor + Baron evaluator |
| 1.2 | 2026-01-17 | Baron/Socrate split: in-flight vs retrospective |