Orchestration Design
Overview
Farmerspec is a multi-agent development framework that orchestrates AI agents to develop features using spec-driven, test-first development. This document defines the orchestration layer, phase transitions, artifacts, and validation gates.
Workflow Phases (13-Phase TDD Architecture)
The workflow follows a TDD (Test-Driven Development) sequence where tests are written BEFORE implementation.
| Phase | Type | Agent | Description |
|---|---|---|---|
| 01 | Planning | Baron (PM) | WRD Intake - Validate WRD, extract requirements |
| 02 | Planning | Duc (Architecture) | Blueprint Creation - Create technical blueprint |
| 03 | Review | Baron (PM) | Blueprint Review - Approve or request changes |
| 04 | Planning | Marie (Testing) | Test Planning - Plan E2E tests (UI + Backend) |
| 05 | Implementation | Marie (Testing) | Test Implementation - Write E2E tests (TDD red) |
| 06 | Planning | Dede (Backend) | Backend Planning - Plan backend tasks |
| 07 | Implementation | Dede (Backend) | Backend Implementation - Implement against tests |
| 08 | Planning | Dali (Frontend) | Frontend Planning - Plan frontend tasks |
| 09 | Implementation | Dali (Frontend) | Frontend Implementation - Implement UI |
| 10 | Planning | Maigret (SRE) | SRE Planning - Plan observability tasks |
| 11 | Implementation | Maigret (SRE) | SRE Implementation - Implement monitoring |
| 12 | Planning | Gustave (DevOps) | DevOps Planning - Plan infrastructure tasks |
| 13 | Implementation | Gustave (DevOps) | DevOps Implementation - Deploy and configure |
Core Principles
TDD Sequence
Tests are written BEFORE implementation:

1. Marie writes E2E tests (phases 4-5)
2. Backend/Frontend implement to make tests pass (phases 6-9)
3. SRE/DevOps handle observability and deployment (phases 10-13)
Stateless Agents
All agents are stateless—each invocation is a fresh instance with no memory of previous runs. This means:

- Agents cannot learn from experience directly
- All context must be provided in the prompt + input files
- Improvement happens through RL updates to prompts and knowledge docs
Dual-Agent Validation
Each phase uses a dual-agent model:
┌─────────────────────────────────────────────────────────────────┐
│ EXECUTOR AGENT (Duc, Marie, Dede, etc.) │
│ • Receives: input files + prompt + knowledge + guardrails │
│ • Produces: output.json + source claims │
│ • Does NOT self-score │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ BARON (Evaluator - fresh instance) │
│ • Receives: input files + output.json + claims │
│ • Verifies: "You claimed line 45 of WRD - let me check" │
│ • Produces: feedback.json with validated scores │
│ • Verdict: pass | flag | escalate │
└─────────────────────────────────────────────────────────────────┘
Why dual-agent?

- Executor can't accurately self-report where output came from
- Evaluator can verify claims against actual documents
- Separation of concerns: execute vs validate
- Fresh instance = no bias carryover
Escalation = Abort
When Baron's validation escalates, the run is aborted. Human intervention is required before re-running.
| Verdict | Action |
|---|---|
| `pass` | Continue to next phase |
| `flag` | Log warning, continue |
| `escalate` | Abort run, human must review |
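The verdict-to-action mapping can be sketched as a small helper. This is illustrative only; `Verdict` and `may_continue` are names invented for this example, not framework API:

```python
from enum import Enum

class Verdict(str, Enum):
    PASS = "pass"
    FLAG = "flag"
    ESCALATE = "escalate"

def may_continue(verdict: Verdict) -> bool:
    """Return True if the run proceeds to the next phase.
    escalate aborts the run; flag logs a warning but continues."""
    return verdict is not Verdict.ESCALATE
```

Because `Verdict` subclasses `str`, Baron's raw string verdict can be parsed with `Verdict("escalate")` before dispatch.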
Agents
Executor Agents
| Agent | Role | Phases |
|---|---|---|
| Baron | PM / Orchestrator | 1 (WRD Intake), 3 (Blueprint Review) |
| Duc | Architect | 2 (Blueprint Creation) |
| Marie | QA Engineer | 4-5 (Test Planning + Implementation) |
| Dede | Backend Developer | 6-7 (Backend Planning + Implementation) |
| Dali | Frontend Developer | 8-9 (Frontend Planning + Implementation) |
| Maigret | SRE | 10-11 (SRE Planning + Implementation) |
| Gustave | DevOps / GitOps | 12-13 (DevOps Planning + Implementation) |
Validation & RL Agents
| Agent | Role | Scope | When |
|---|---|---|---|
| Baron | In-Flight Validator | Single phase, single run | During workflow |
| Socrate | Retrospective Analyst | Cross-run, cross-agent | After runs complete |
Baron: In-Flight Validator
Baron validates ALL phases during workflow execution—including phases where Baron is also the executor.
Responsibilities
| Task | Description |
|---|---|
| Verify claims | Check that executor's source claims match actual documents |
| Compute feedback | Produce validated sourceAttribution scores |
| Quality gate | Pass / flag / escalate based on output quality |
| Traceability | Ensure outputs can be traced to inputs |
What Baron Checks
def get_baron_validation_prompt(self) -> str:
    return """
    Validate the executor's output:
    1. CLAIM VERIFICATION: Do source claims match actual documents?
    2. COMPLETENESS: Are all required fields present and accurate?
    3. TRACEABILITY: Can each output be traced to an input source?
    4. QUALITY: Is the output sufficient for the next phase?

    Verify each claim by checking the referenced document.
    Return: pass | flag | escalate
    """
Baron Validation Result
@dataclass
class BaronValidationResult:
    verdict: Literal["pass", "flag", "escalate"]
    confidence: float                  # 0.0-1.0
    feedback: ValidatedFeedback        # Verified source attribution
    issues: list[ValidationIssue]
    escalation_reason: str | None      # Why human needed (if escalate)
Socrate: Retrospective Analyst
Socrate analyzes completed runs to improve agents. This is a separate process from live workflow validation.
Responsibilities
| Task | Description |
|---|---|
| Pattern detection | Find recurring gaps across runs |
| Cross-agent analysis | Compare performance across agents |
| Correlation analysis | Correlate low inputClarity scores with high ungrounded rates |
| Improvement suggestions | Concrete changes to prompts/knowledge |
What Socrate Analyzes
-- Example: Find agents with high invention rates
SELECT agent_code, AVG(invented) as avg_invented
FROM phase_scores
GROUP BY agent_code
ORDER BY avg_invented DESC;
-- Example: Find recurring knowledge gaps
SELECT target_file, description, COUNT(*) as occurrences
FROM gaps
WHERE gap_type = 'knowledge'
GROUP BY target_file, description
ORDER BY occurrences DESC;
Socrate Output
{
"analysis_period": "2026-01-01 to 2026-01-17",
"runs_analyzed": 47,
"insights": [
{
"type": "pattern",
"severity": "high",
"finding": "Backend agent invents timeout values in 40% of runs",
"recommendation": {
"artifact": "knowledge",
"file": "agents/backend/knowledge/api-patterns.md",
"section": "Timeouts",
"content": "Add default timeout guidance: 30s for API calls, 5s for health checks"
}
}
]
}
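A consumer of the Socrate report might filter it down to actionable items. A minimal sketch, assuming the field names shown in the example above (`high_severity_recommendations` is a hypothetical helper, not framework API):

```python
import json

def high_severity_recommendations(report_json: str) -> list[dict]:
    """Pull the recommendation objects out of a Socrate report,
    keeping only high-severity insights."""
    report = json.loads(report_json)
    return [
        insight["recommendation"]
        for insight in report.get("insights", [])
        if insight.get("severity") == "high"
    ]
```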
Baron vs Socrate Comparison
| Aspect | Baron (In-Flight) | Socrate (Retrospective) |
|---|---|---|
| When | During phase execution | After runs complete |
| Scope | Single phase, single run | Cross-run, cross-agent |
| Purpose | Quality gate | Continuous improvement |
| Action | Pass/flag/escalate | Suggest prompt/knowledge updates |
| Blocking | Yes (escalate aborts) | No (async analysis) |
| Output | Verified feedback per phase | Improvement recommendations |
Phase Execution Flow
WRD Document
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 1: WRD Intake │
│ Executor: Baron │ Evaluator: Baron (fresh instance) │
│ • Extract requirements, classify, validate completeness │
│ • Deterministic validation: Schema, required fields │
│ • Baron evaluation: Verify claims, compute feedback │
└─────────────────────────────────────────────────────────────────┘
│ (pass/flag → continue, escalate → ABORT)
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 2: Blueprint Creation │
│ Executor: Duc │ Evaluator: Baron │
│ • Create technical blueprint from requirements │
│ • Define components, files, APIs, data models │
│ • Executor makes claims, Baron verifies │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASE 3: Blueprint Review │
│ Executor: Baron │ Evaluator: Baron (fresh instance) │
│ • Review blueprint for feasibility and completeness │
│ • Approve, request revision, or reject │
│ • Self-evaluation with fresh context │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 4-5: Test Planning + Implementation │
│ Executor: Marie │ Evaluator: Baron │
│ • Plan E2E tests covering acceptance criteria │
│ • Write tests BEFORE implementation (TDD red phase) │
│ • Tests should FAIL initially (no implementation yet) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 6-7: Backend Planning + Implementation │
│ Executor: Dede │ Evaluator: Baron │
│ • Plan backend tasks from blueprint │
│ • Implement to make tests pass (TDD green phase) │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 8-9: Frontend Planning + Implementation │
│ Executor: Dali │ Evaluator: Baron │
│ • Plan frontend tasks from blueprint │
│ • Implement UI components │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 10-11: SRE Planning + Implementation │
│ Executor: Maigret │ Evaluator: Baron │
│ • Plan monitoring, alerting, observability │
│ • Implement dashboards, alerts, SLOs │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PHASES 12-13: DevOps Planning + Implementation │
│ Executor: Gustave │ Evaluator: Baron │
│ • Plan infrastructure, CI/CD, deployment │
│ • Implement and deploy │
└─────────────────────────────────────────────────────────────────┘
│
▼
RUN COMPLETE
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ SOCRATE (Async, Post-Workflow) │
│ • Aggregates feedback from completed run │
│ • Analyzes patterns across run history │
│ • Produces improvement suggestions for prompts/knowledge │
└─────────────────────────────────────────────────────────────────┘
Validation System
Two-Layer Validation
Each phase has two validation layers:
| Layer | Type | When | Purpose |
|---|---|---|---|
| Deterministic | Code-based | Before & after LLM | Schema, required fields, format |
| Baron LLM | AI-reviewed | After deterministic | Verify claims, semantic quality |
Deterministic Validators
Fast, code-based checks that run BEFORE Baron evaluation:
| Validator | Purpose |
|---|---|
| `SchemaValidator` | Output matches expected Pydantic schema |
| `RequiredFieldsValidator` | Required fields present and non-empty |
| `FileExistsValidator` | Referenced file paths exist |
| `CrossReferenceValidator` | IDs reference valid entities from previous phases |
| `ValueInSetValidator` | Field values in allowed set |
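A deterministic validator can be a plain object returning a list of errors (empty list = pass). A sketch of one such validator, assuming a simple `validate(output) -> errors` contract; the `ValidationError` shape here is illustrative, not the framework's actual type:

```python
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class ValidationError:
    path: str
    message: str

class DeterministicValidator(Protocol):
    def validate(self, output: dict[str, Any]) -> list[ValidationError]: ...

class RequiredFieldsValidator:
    """Checks that required top-level fields are present and non-empty."""

    def __init__(self, required: list[str]) -> None:
        self.required = required

    def validate(self, output: dict[str, Any]) -> list[ValidationError]:
        return [
            ValidationError(path=name, message="missing or empty")
            for name in self.required
            if not output.get(name)
        ]
```

Because these checks are pure code, they run in milliseconds and catch structural problems before any LLM tokens are spent on Baron evaluation.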
Phase Execution Steps
Every phase follows this 5-step pattern:
| Step | Name | On Failure |
|---|---|---|
| 1 | Input files | - |
| 2 | Validate input (deterministic) | FAIL |
| 3 | Execute phase (Executor LLM) | FAIL |
| 4 | Validate output (deterministic) | FAIL |
| 5 | Evaluate output (Baron LLM) | ABORT (if escalate) |
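The 5-step pattern can be sketched as a driver function. The callables here stand in for the framework hooks (`build_input`, validators, executor call, Baron evaluation); `PhaseError` and `EscalationError` are hypothetical exception names used for illustration:

```python
from typing import Callable

class PhaseError(Exception):
    """Deterministic validation failed (steps 2 or 4)."""

class EscalationError(Exception):
    """Baron escalated; the run must abort (step 5)."""

def run_phase(
    build_input: Callable[[], dict],
    validate_input: Callable[[dict], list[str]],
    execute: Callable[[dict], dict],
    validate_output: Callable[[dict], list[str]],
    evaluate: Callable[[dict], str],
) -> dict:
    input_data = build_input()                     # 1. Input files
    if validate_input(input_data):                 # 2. Validate input (deterministic)
        raise PhaseError("input validation failed")
    result = execute(input_data)                   # 3. Execute phase (Executor LLM)
    if validate_output(result):                    # 4. Validate output (deterministic)
        raise PhaseError("output validation failed")
    if evaluate(result) == "escalate":             # 5. Evaluate output (Baron LLM)
        raise EscalationError("human review required")
    return result
```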
Executor Contract
Every executor agent MUST return output with source claims:
{
"status": "success | blocked | needs_clarification",
"result": { /* phase-specific output */ },
"claims": [
{
"outputPath": "result.components[0].name",
"source": "input | code | knowledge | guardrails",
"ref": "wrd.yaml#scope.backend.services[0]",
"quote": "Payment service module"
}
],
"blockers": []
}
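The contract above can be modeled with typed structures. A minimal sketch using stdlib dataclasses (the framework itself is described as using Pydantic schemas; the simplified `result`/`blockers` types here are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Claim:
    outputPath: str   # JSON path into the result
    source: Literal["input", "code", "knowledge", "guardrails"]
    ref: str          # document path + section reference
    quote: str        # relevant excerpt from the source

@dataclass
class ExecutorOutput:
    status: Literal["success", "blocked", "needs_clarification"]
    result: dict
    claims: list[Claim] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)
```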
Claim Structure
| Field | Description |
|---|---|
| `outputPath` | JSON path to the output field |
| `source` | Priority level: input > code > knowledge > guardrails |
| `ref` | Document path + section reference |
| `quote` | Relevant excerpt from source |
Claims enable Baron to verify: "Did the executor actually get this from where they say?"
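The core of that check can be sketched as quote lookup against the referenced document. A simplification for illustration: only the path part of the `path#section` ref is resolved, and verification is plain substring matching rather than Baron's semantic review:

```python
def verify_claim(claim: dict, documents: dict[str, str]) -> bool:
    """Check one claim: the referenced document must exist and
    contain the quoted excerpt. `documents` maps paths to contents."""
    path = claim["ref"].split("#", 1)[0]
    doc = documents.get(path)
    return doc is not None and claim["quote"] in doc
```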
Feedback Structure
Baron produces validated feedback for each phase:
{
"feedback": {
"scores": {
"sourceAttribution": {
"overall": 0.82,
"breakdown": {
"fromInput": 0.55,
"fromCode": 0.20,
"fromKnowledge": 0.15,
"fromGuardrails": 0.05,
"ungrounded": 0.05
}
},
"guardrailsCompliance": { "overall": 0.90 },
"inputClarity": { "overall": 0.65 },
"outputConfidence": { "overall": 0.72 }
},
"traceability": [ /* verified source traces */ ],
"gaps": [ /* identified documentation gaps */ ],
"suggestions": [ /* improvement recommendations */ ]
}
}
See agents/pm/knowledge/feedback-production.md for complete structure.
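One structural invariant worth enforcing on this feedback: the five `sourceAttribution` breakdown fractions should partition the output, i.e. sum to 1.0. That invariant is an assumption about the scoring model (the schema does not state it explicitly), so the check below is a hedged sanity test rather than a framework rule:

```python
def breakdown_is_consistent(breakdown: dict[str, float], tol: float = 1e-6) -> bool:
    """Check that the sourceAttribution fractions sum to 1.0
    within a floating-point tolerance."""
    keys = ["fromInput", "fromCode", "fromKnowledge", "fromGuardrails", "ungrounded"]
    return abs(sum(breakdown.get(k, 0.0) for k in keys) - 1.0) <= tol
```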
RL Loop
The RL loop uses feedback to improve agents:
┌────────────────────────────────────────────────────────────────┐
│ WHAT RL CAN IMPROVE │
├────────────────────────────────────────────────────────────────┤
│ ✅ System prompts agents/{code}/prompt.md │
│ ✅ Mission prompts agents/{code}/missions/{mission}.md │
│ ✅ Knowledge docs agents/{code}/knowledge/*.md │
│ ✅ Guardrails docs/constitution/guardrails/*.md │
│ ✅ Templates templates/*.md │
├────────────────────────────────────────────────────────────────┤
│ WHAT RL CANNOT IMPROVE │
├────────────────────────────────────────────────────────────────┤
│ ❌ Model weights (we use Claude as-is) │
│ ❌ Agent memory (stateless) │
│ ❌ Training data (baked into model) │
└────────────────────────────────────────────────────────────────┘
Every piece of feedback MUST point to an improvable artifact:
{
"suggestion": {
"target": {
"artifact": "knowledge | prompt | guardrails | template",
"file": "agents/backend/knowledge/api-patterns.md",
"section": "Timeouts"
},
"problem": "No guidance on default timeout values",
"recommendation": "Add section: 30s for API calls, 5s for health checks"
}
}
Phase Architecture
Each phase is a self-contained module:
backend/app/services/phases/
├── __init__.py # Phase registry
├── base.py # BasePhase, PhaseContext
├── wrd_intake.py # Phase 1 (Baron → Baron)
├── blueprint_creation.py # Phase 2 (Duc → Baron)
├── blueprint_review.py # Phase 3 (Baron → Baron)
├── test_planning.py # Phase 4 (Marie → Baron)
├── test_implementation.py # Phase 5 (Marie → Baron)
├── backend_planning.py # Phase 6 (Dede → Baron)
├── backend_implementation.py # Phase 7 (Dede → Baron)
├── frontend_planning.py # Phase 8 (Dali → Baron)
├── frontend_implementation.py # Phase 9 (Dali → Baron)
├── sre_planning.py # Phase 10 (Maigret → Baron)
├── sre_implementation.py # Phase 11 (Maigret → Baron)
├── devops_planning.py # Phase 12 (Gustave → Baron)
└── devops_implementation.py # Phase 13 (Gustave → Baron)
Phase Base Class
class BasePhase(ABC):
    phase_type: PhaseType
    executor_agent: AgentCode          # Who executes
    evaluator_agent: AgentCode         # Who evaluates (always Baron)
    input_schema: type[BaseModel]
    output_schema: type[BaseModel]

    @abstractmethod
    async def build_input(self, context: PhaseContext) -> dict[str, Any]:
        """Build phase-specific input from run context."""
        pass

    @abstractmethod
    async def execute(self, input_data: dict[str, Any]) -> PhaseResult:
        """Execute the phase (call executor agent)."""
        pass

    @abstractmethod
    def get_deterministic_validators(self) -> list[DeterministicValidator]:
        """Return list of deterministic checks for this phase's output."""
        pass

    @abstractmethod
    def get_baron_evaluation_prompt(self) -> str:
        """Return Baron evaluation prompt for claim verification."""
        pass
Document Flow
| Document | Format | Naming |
|---|---|---|
| WRD (input) | YAML | wrd-{type}-{slug}.yaml |
| Phase Output | JSON | data/wrds/{wrd_id}/runs/{run_id}/phases/{phase}.json |
| Feedback | JSON | Embedded in phase output |
| Socrate Analysis | JSON | data/analysis/{date}/socrate_report.json |
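The phase-output naming scheme above is mechanical enough to capture in a helper. A sketch (`phase_output_path` is a hypothetical function name, not part of the framework):

```python
def phase_output_path(wrd_id: str, run_id: str, phase: str) -> str:
    """Build the phase-output location following the naming scheme above."""
    return f"data/wrds/{wrd_id}/runs/{run_id}/phases/{phase}.json"
```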
Commands
# Start a new run from WRD
farmerspec.run --wrd=./wrd-feature-oauth.yaml
# Check run status
farmerspec.status --run={run_id}
# List phases for a run
farmerspec.phases --run={run_id}
# View phase output
farmerspec.phase --run={run_id} --phase=wrd_intake
# Re-run from specific phase (after human fixes)
farmerspec.rerun --run={run_id} --from=blueprint_creation
# Run Socrate analysis
farmerspec.socrate --since=2026-01-01
Version History
| Version | Date | Changes |
|---|---|---|
| 0.1 | 2026-01-13 | Initial design |
| 0.5 | 2026-01-14 | WRD naming, Phase 00 (WRD Discovery) |
| 1.0 | 2026-01-17 | 13-phase TDD architecture with validation system |
| 1.1 | 2026-01-17 | Dual-agent model: Executor + Baron evaluator |
| 1.2 | 2026-01-17 | Baron/Socrate split: in-flight vs retrospective |