Orchestration Design
Overview
Farmerspec is a multi-agent development framework that orchestrates AI agents to develop features using spec-driven, test-first development. This document defines the orchestration layer, phase transitions, artifacts, and validation gates.
Workflow Phases (13-Phase TDD Architecture)
The workflow follows a TDD (Test-Driven Development) sequence where tests are written BEFORE implementation.
| Phase | Type | Agent | Description |
|---|---|---|---|
| 01 | Planning | Baron (PM) | WRD Intake - Validate WRD, extract requirements |
| 02 | Planning | Duc (Architecture) | Blueprint Creation - Create technical blueprint |
| 03 | Review | Baron (PM) | Blueprint Review - Approve or request changes |
| 04 | Planning | Marie (Testing) | Test Planning - Plan E2E tests (UI + Backend) |
| 05 | Implementation | Marie (Testing) | Test Implementation - Write E2E tests (TDD red) |
| 06 | Planning | Dede (Backend) | Backend Planning - Plan backend tasks |
| 07 | Implementation | Dede (Backend) | Backend Implementation - Implement against tests |
| 08 | Planning | Dali (Frontend) | Frontend Planning - Plan frontend tasks |
| 09 | Implementation | Dali (Frontend) | Frontend Implementation - Implement UI |
| 10 | Planning | Maigret (SRE) | SRE Planning - Plan observability tasks |
| 11 | Implementation | Maigret (SRE) | SRE Implementation - Implement monitoring |
| 12 | Planning | Gustave (DevOps) | DevOps Planning - Plan infrastructure tasks |
| 13 | Implementation | Gustave (DevOps) | DevOps Implementation - Deploy and configure |
Core Principles
TDD Sequence
Tests are written BEFORE implementation:

1. Marie writes E2E tests (phases 4-5)
2. Backend/Frontend implement to make tests pass (phases 6-9)
3. SRE/DevOps handle observability and deployment (phases 10-13)
Stateless Agents
All agents are stateless—each invocation is a fresh instance with no memory of previous runs. This means:

- Agents cannot learn from experience directly
- All context must be provided in the prompt + input files
- Improvement happens through RL updates to prompts and knowledge docs
Dual-Agent Validation
Each phase uses a dual-agent model:
┌─────────────────────────────────────────────────────────────────┐
│ EXECUTOR AGENT (Duc, Marie, Dede, etc.) │
│ • Receives: input files + prompt + knowledge + guardrails │
│ • Produces: output.json + source claims │
│ • Does NOT self-score │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ BARON (Evaluator - fresh instance) │
│ • Receives: input files + output.json + claims │
│ • Verifies: "You claimed line 45 of WRD - let me check" │
│ • Produces: feedback.json with validated scores │
│ • Verdict: pass | flag | escalate │
└─────────────────────────────────────────────────────────────────┘
Why dual-agent?

- Executor can't accurately self-report where output came from
- Evaluator can verify claims against actual documents
- Separation of concerns: execute vs validate
- Fresh instance = no bias carryover
Escalation = Abort
When Baron's validation escalates, the run is aborted. Human intervention is required before re-running.
| Verdict | Action |
|---|---|
| `pass` | Continue to next phase |
| `flag` | Log warning, continue |
| `escalate` | Abort run, human must review |
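The verdict-to-action mapping can be sketched as a small helper. This is illustrative only; `Verdict` and `may_continue` are names invented for this example, not framework API:

```python
from enum import Enum

class Verdict(str, Enum):
    PASS = "pass"
    FLAG = "flag"
    ESCALATE = "escalate"

def may_continue(verdict: Verdict) -> bool:
    """Return True if the run proceeds to the next phase.
    escalate aborts the run; flag logs a warning but continues."""
    return verdict is not Verdict.ESCALATE
```

Because `Verdict` subclasses `str`, Baron's raw string verdict can be parsed with `Verdict("escalate")` before dispatch.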
Agents
Executor Agents
| Agent | Role | Phases |
|---|---|---|
| Baron | PM / Orchestrator | 1 (WRD Intake), 3 (Blueprint Review) |
| Duc | Architect | 2 (Blueprint Creation) |
| Marie | QA Engineer | 4-5 (Test Planning + Implementation) |
| Dede | Backend Developer | 6-7 (Backend Planning + Implementation) |
| Dali | Frontend Developer | 8-9 (Frontend Planning + Implementation) |
| Maigret | SRE | 10-11 (SRE Planning + Implementation) |
| Gustave | DevOps / GitOps | 12-13 (DevOps Planning + Implementation) |
Validation & RL Agents
| Agent | Role | Scope | When |
|---|---|---|---|
| Baron | In-Flight Validator | Single phase, single run | During workflow |
| Socrate | Retrospective Analyst | Cross-run, cross-agent | After runs complete |
Baron: In-Flight Validator
Baron validates ALL phases during workflow execution—including phases where Baron is also the executor.
Responsibilities
| Task | Description |
|---|---|
| Verify claims | Check that executor's source claims match actual documents |
| Compute feedback | Produce validated sourceAttribution scores |
| Quality gate | Pass / flag / escalate based on output quality |
| Traceability | Ensure outputs can be traced to inputs |
What Baron Checks
def get_baron_validation_prompt(self) -> str:
    return """
    Validate the executor's output:
    1. CLAIM VERIFICATION: Do source claims match actual documents?
    2. COMPLETENESS: Are all required fields present and accurate?
    3. TRACEABILITY: Can each output be traced to an input source?
    4. QUALITY: Is the output sufficient for the next phase?

    Verify each claim by checking the referenced document.
    Return: pass | flag | escalate
    """
Baron Validation Result
@dataclass
class BaronValidationResult:
    verdict: Literal["pass", "flag", "escalate"]
    confidence: float                  # 0.0-1.0
    feedback: ValidatedFeedback        # Verified source attribution
    issues: list[ValidationIssue]
    escalation_reason: str | None      # Why human needed (if escalate)
Socrate: Retrospective Analyst
Socrate analyzes completed runs to improve agents. This is a separate process from live workflow validation.
Responsibilities
| Task | Description |
|---|---|
| Pattern detection | Find recurring gaps across runs |
| Cross-agent analysis | Compare performance across agents |
| Correlation analysis | Correlate low inputClarity scores with high ungrounded rates |
| Improvement suggestions | Concrete changes to prompts/knowledge |
What Socrate Analyzes
-- Example: Find agents with high invention rates
SELECT agent_code, AVG(invented) as avg_invented
FROM phase_scores
GROUP BY agent_code
ORDER BY avg_invented DESC;
-- Example: Find recurring knowledge gaps
SELECT target_file, description, COUNT(*) as occurrences
FROM gaps
WHERE gap_type = 'knowledge'
GROUP BY target_file, description
ORDER BY occurrences DESC;
Socrate Output
{
"analysis_period": "2026-01-01 to 2026-01-17",
"runs_analyzed": 47,
"insights": [
{
"type": "pattern",
"severity": "high",
"finding": "Backend agent invents timeout values in 40% of runs",
"recommendation": {
"artifact": "knowledge",
"file": "agents/backend/knowledge/api-patterns.md",
"section": "Timeouts",
"content": "Add default timeout guidance: 30s for API calls, 5s for health checks"
}
}
]
}
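A consumer of the Socrate report might filter it down to actionable items. A minimal sketch, assuming the field names shown in the example above (`high_severity_recommendations` is a hypothetical helper, not framework API):

```python
import json

def high_severity_recommendations(report_json: str) -> list[dict]:
    """Pull the recommendation objects out of a Socrate report,
    keeping only high-severity insights."""
    report = json.loads(report_json)
    return [
        insight["recommendation"]
        for insight in report.get("insights", [])
        if insight.get("severity") == "high"
    ]
```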
Baron vs Socrate Comparison
| Aspect | Baron (In-Flight) | Socrate (Retrospective) |
|---|---|---|
| When | During phase execution | After runs complete |
| Scope | Single phase, single run | Cross-run, cross-agent |
| Purpose | Quality gate | Continuous improvement |
| Action | Pass/flag/escalate | Suggest prompt/knowledge updates |
| Blocking | Yes (escalate aborts) | No (async analysis) |
| Output | Verified feedback per phase | Improvement recommendations |
Phase Execution Flow
WRD Document
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1: WRD Intake │
│ Executor: Baron │ Evaluator: Baron (fresh instance) │
│ • Extract requirements, classify, validate completeness │
│ • Deterministic validation: Schema, required fields │
│ • Baron evaluation: Verify claims, compute feedback │
└─────────────────────────────────────────────────────────────────┘
│ (pass/flag → continue, escalate → ABORT)
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 2: Blueprint Creation │
│ Executor: Duc │ Evaluator: Baron │
│ • Create technical blueprint from requirements │
│ • Define components, files, APIs, data models │
│ • Executor makes claims, Baron verifies │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 3: Blueprint Review │
│ Executor: Baron │ Evaluator: Baron (fresh instance) │
│ • Review blueprint for feasibility and completeness │
│ • Approve, request revision, or reject │
│ • Self-evaluation with fresh context │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 4-5: Test Planning + Implementation │
│ Executor: Marie │ Evaluator: Baron │
│ • Plan E2E tests covering acceptance criteria │
│ • Write tests BEFORE implementation (TDD red phase) │
│ • Tests should FAIL initially (no implementation yet) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 6-7: Backend Planning + Implementation │
│ Executor: Dede │ Evaluator: Baron │
│ • Plan backend tasks from blueprint │
│ • Implement to make tests pass (TDD green phase) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 8-9: Frontend Planning + Implementation │
│ Executor: Dali │ Evaluator: Baron │
│ • Plan frontend tasks from blueprint │
│ • Implement UI components │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 10-11: SRE Planning + Implementation │
│ Executor: Maigret │ Evaluator: Baron │
│ • Plan monitoring, alerting, observability │
│ • Implement dashboards, alerts, SLOs │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 12-13: DevOps Planning + Implementation │
│ Executor: Gustave │ Evaluator: Baron │
│ • Plan infrastructure, CI/CD, deployment │
│ • Implement and deploy │
└─────────────────────────────────────────────────────────────────┘
│
▼
RUN COMPLETE
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SOCRATE (Async, Post-Workflow) │
│ • Aggregates feedback from completed run │
│ • Analyzes patterns across run history │
│ • Produces improvement suggestions for prompts/knowledge │
└─────────────────────────────────────────────────────────────────┘
Validation System
Two-Layer Validation
Each phase has two validation layers:
| Layer | Type | When | Purpose |
|---|---|---|---|
| Deterministic | Code-based | Before & after LLM | Schema, required fields, format |
| Baron LLM | AI-reviewed | After deterministic | Verify claims, semantic quality |
Deterministic Validators
Fast, code-based checks that run BEFORE Baron evaluation:
| Validator | Purpose |
|---|---|
| `SchemaValidator` | Output matches expected Pydantic schema |
| `RequiredFieldsValidator` | Required fields present and non-empty |
| `FileExistsValidator` | Referenced file paths exist |
| `CrossReferenceValidator` | IDs reference valid entities from previous phases |
| `ValueInSetValidator` | Field values in allowed set |
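A deterministic validator can be a plain object returning a list of errors (empty list = pass). A sketch of one such validator, assuming a simple `validate(output) -> errors` contract; the `ValidationError` shape here is illustrative, not the framework's actual type:

```python
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class ValidationError:
    path: str
    message: str

class DeterministicValidator(Protocol):
    def validate(self, output: dict[str, Any]) -> list[ValidationError]: ...

class RequiredFieldsValidator:
    """Checks that required top-level fields are present and non-empty."""

    def __init__(self, required: list[str]) -> None:
        self.required = required

    def validate(self, output: dict[str, Any]) -> list[ValidationError]:
        return [
            ValidationError(path=name, message="missing or empty")
            for name in self.required
            if not output.get(name)
        ]
```

Because these checks are pure code, they run in milliseconds and catch structural problems before any LLM tokens are spent on Baron evaluation.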
Phase Execution Steps
Every phase follows this 5-step pattern:
| Step | Name | On Failure |
|---|---|---|
| 1 | Input files | - |
| 2 | Validate input (deterministic) | FAIL |
| 3 | Execute phase (Executor LLM) | FAIL |
| 4 | Validate output (deterministic) | FAIL |
| 5 | Evaluate output (Baron LLM) | ABORT (if escalate) |
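The 5-step pattern can be sketched as a driver function. The callables here stand in for the framework hooks (`build_input`, validators, executor call, Baron evaluation); `PhaseError` and `EscalationError` are hypothetical exception names used for illustration:

```python
from typing import Callable

class PhaseError(Exception):
    """Deterministic validation failed (steps 2 or 4)."""

class EscalationError(Exception):
    """Baron escalated; the run must abort (step 5)."""

def run_phase(
    build_input: Callable[[], dict],
    validate_input: Callable[[dict], list[str]],
    execute: Callable[[dict], dict],
    validate_output: Callable[[dict], list[str]],
    evaluate: Callable[[dict], str],
) -> dict:
    input_data = build_input()                     # 1. Input files
    if validate_input(input_data):                 # 2. Validate input (deterministic)
        raise PhaseError("input validation failed")
    result = execute(input_data)                   # 3. Execute phase (Executor LLM)
    if validate_output(result):                    # 4. Validate output (deterministic)
        raise PhaseError("output validation failed")
    if evaluate(result) == "escalate":             # 5. Evaluate output (Baron LLM)
        raise EscalationError("human review required")
    return result
```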
Executor Contract
Every executor agent MUST return output with source claims:
{
"status": "success | blocked | needs_clarification",
"result": { /* phase-specific output */ },
"claims": [
{
"outputPath": "result.components[0].name",
"source": "input | code | knowledge | guardrails",
"ref": "wrd.yaml#scope.backend.services[0]",
"quote": "Payment service module"
}
],
"blockers": []
}
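The contract above can be modeled with typed structures. A minimal sketch using stdlib dataclasses (the framework itself is described as using Pydantic schemas; the simplified `result`/`blockers` types here are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Claim:
    outputPath: str   # JSON path into the result
    source: Literal["input", "code", "knowledge", "guardrails"]
    ref: str          # document path + section reference
    quote: str        # relevant excerpt from the source

@dataclass
class ExecutorOutput:
    status: Literal["success", "blocked", "needs_clarification"]
    result: dict
    claims: list[Claim] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
```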
Claim Structure
| Field | Description |
|---|---|
| `outputPath` | JSON path to the output field |
| `source` | Priority level: input > code > knowledge > guardrails |
| `ref` | Document path + section reference |
| `quote` | Relevant excerpt from source |
Claims enable Baron to verify: "Did the executor actually get this from where they say?"
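The core of that check can be sketched as quote lookup against the referenced document. A simplification for illustration: only the path part of the `path#section` ref is resolved, and verification is plain substring matching rather than Baron's semantic review:

```python
def verify_claim(claim: dict, documents: dict[str, str]) -> bool:
    """Check one claim: the referenced document must exist and
    contain the quoted excerpt. `documents` maps paths to contents."""
    path = claim["ref"].split("#", 1)[0]
    doc = documents.get(path)
    return doc is not None and claim["quote"] in doc
```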
Feedback Structure
Baron produces validated feedback for each phase:
{
"feedback": {
"scores": {
"sourceAttribution": {
"overall": 0.82,
"breakdown": {
"fromInput": 0.55,
"fromCode": 0.20,
"fromKnowledge": 0.15,
"fromGuardrails": 0.05,
"ungrounded": 0.05
}
},
"guardrailsCompliance": { "overall": 0.90 },
"inputClarity": { "overall": 0.65 },
"outputConfidence": { "overall": 0.72 }
},
"traceability": [ /* verified source traces */ ],
"gaps": [ /* identified documentation gaps */ ],
"suggestions": [ /* improvement recommendations */ ]
}
}
See agents/pm/knowledge/feedback-production.md for complete structure.
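One structural invariant worth enforcing on this feedback: the five `sourceAttribution` breakdown fractions should partition the output, i.e. sum to 1.0. That invariant is an assumption about the scoring model (the schema does not state it explicitly), so the check below is a hedged sanity test rather than a framework rule:

```python
def breakdown_is_consistent(breakdown: dict[str, float], tol: float = 1e-6) -> bool:
    """Check that the sourceAttribution fractions sum to 1.0
    within a floating-point tolerance."""
    keys = ["fromInput", "fromCode", "fromKnowledge", "fromGuardrails", "ungrounded"]
    return abs(sum(breakdown.get(k, 0.0) for k in keys) - 1.0) <= tol
```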
RL Loop
The RL loop uses feedback to improve agents:
┌────────────────────────────────────────────────────────────────┐
│ WHAT RL CAN IMPROVE │
├────────────────────────────────────────────────────────────────┤
│ ✅ System prompts agents/{code}/prompt.md │
│ ✅ Mission prompts agents/{code}/missions/{mission}.md │
│ ✅ Knowledge docs agents/{code}/knowledge/*.md │
│ ✅ Guardrails docs/constitution/guardrails/*.md │
│ ✅ Templates templates/*.md │
├────────────────────────────────────────────────────────────────┤
│ WHAT RL CANNOT IMPROVE │
├────────────────────────────────────────────────────────────────┤
│ ❌ Model weights (we use Claude as-is) │
│ ❌ Agent memory (stateless) │
│ ❌ Training data (baked into model) │
└────────────────────────────────────────────────────────────────┘
Every piece of feedback MUST point to an improvable artifact:
{
"suggestion": {
"target": {
"artifact": "knowledge | prompt | guardrails | template",
"file": "agents/backend/knowledge/api-patterns.md",
"section": "Timeouts"
},
"problem": "No guidance on default timeout values",
"recommendation": "Add section: 30s for API calls, 5s for health checks"
}
}
Phase Architecture
Each phase is a self-contained module:
backend/app/services/phases/
├── __init__.py # Phase registry
├── base.py # BasePhase, PhaseContext
├── wrd_intake.py # Phase 1 (Baron → Baron)
├── blueprint_creation.py # Phase 2 (Duc → Baron)
├── blueprint_review.py # Phase 3 (Baron → Baron)
├── test_planning.py # Phase 4 (Marie → Baron)
├── test_implementation.py # Phase 5 (Marie → Baron)
├── backend_planning.py # Phase 6 (Dede → Baron)
├── backend_implementation.py # Phase 7 (Dede → Baron)
├── frontend_planning.py # Phase 8 (Dali → Baron)
├── frontend_implementation.py # Phase 9 (Dali → Baron)
├── sre_planning.py # Phase 10 (Maigret → Baron)
├── sre_implementation.py # Phase 11 (Maigret → Baron)
├── devops_planning.py # Phase 12 (Gustave → Baron)
└── devops_implementation.py # Phase 13 (Gustave → Baron)
Phase Base Class
class BasePhase(ABC):
    phase_type: PhaseType
    executor_agent: AgentCode          # Who executes
    evaluator_agent: AgentCode         # Who evaluates (always Baron)
    input_schema: type[BaseModel]
    output_schema: type[BaseModel]

    @abstractmethod
    async def build_input(self, context: PhaseContext) -> dict[str, Any]:
        """Build phase-specific input from run context."""
        pass

    @abstractmethod
    async def execute(self, input_data: dict[str, Any]) -> PhaseResult:
        """Execute the phase (call executor agent)."""
        pass

    @abstractmethod
    def get_deterministic_validators(self) -> list[DeterministicValidator]:
        """Return list of deterministic checks for this phase's output."""
        pass

    @abstractmethod
    def get_baron_evaluation_prompt(self) -> str:
        """Return Baron evaluation prompt for claim verification."""
        pass
Document Flow
| Document | Format | Naming |
|---|---|---|
| WRD (input) | YAML | wrd-{type}-{slug}.yaml |
| Phase Output | JSON | data/wrds/{wrd_id}/runs/{run_id}/phases/{phase}.json |
| Feedback | JSON | Embedded in phase output |
| Socrate Analysis | JSON | data/analysis/{date}/socrate_report.json |
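The phase-output naming scheme above is mechanical enough to capture in a helper. A sketch (`phase_output_path` is a hypothetical function name, not part of the framework):

```python
def phase_output_path(wrd_id: str, run_id: str, phase: str) -> str:
    """Build the phase-output location following the naming scheme above."""
    return f"data/wrds/{wrd_id}/runs/{run_id}/phases/{phase}.json"
```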
Commands
# Start a new run from WRD
farmerspec.run --wrd=./wrd-feature-oauth.yaml
# Check run status
farmerspec.status --run={run_id}
# List phases for a run
farmerspec.phases --run={run_id}
# View phase output
farmerspec.phase --run={run_id} --phase=wrd_intake
# Re-run from specific phase (after human fixes)
farmerspec.rerun --run={run_id} --from=blueprint_creation
# Run Socrate analysis
farmerspec.socrate --since=2026-01-01
Version History
| Version | Date | Changes |
|---|---|---|
| 0.1 | 2026-01-13 | Initial design |
| 0.5 | 2026-01-14 | WRD naming, Phase 00 (WRD Discovery) |
| 1.0 | 2026-01-17 | 13-phase TDD architecture with validation system |
| 1.1 | 2026-01-17 | Dual-agent model: Executor + Baron evaluator |
| 1.2 | 2026-01-17 | Baron/Socrate split: in-flight vs retrospective |