RFC-006: Fault Isolation & Recovery Domains
Status: Draft
Authors: RCT Kernel Team
Created: 2026-02-26
Layer: OS Primitive (Reliability)
Implements: core/kernel/fault_isolation.py (Plan 30)
Test Evidence: tests/kernel/test_fault_isolation.py (26 tests)
Depends on: RFC-002 (ProcessTable — state transitions for recovery)
Abstract
This RFC formalizes the Fault Isolation Manager — the kernel component that ensures a single agent crash never kills the simulation tick. It classifies exceptions into three severity levels (RECOVERABLE, QUARANTINE, EXILE), applies automatic recovery policies, and maintains structured fault logs.
The core invariant: process isolation guarantees that any individual agent failure is contained and does not propagate to other agents or the kernel.
Motivation
In a multi-agent simulation, exceptions are inevitable:
- Division by zero in FDIA scoring
- Key lookup failures on shared resources
- Timeout in LLM integration calls
- Stack overflow from recursive agent strategies
Without fault isolation, any exception crashes the entire simulation tick, affecting all agents. The Fault Isolation Manager provides per-agent containment with graduated recovery policies.
Specification
1. FaultSeverity Classification
Three severity levels with distinct recovery policies:
| Severity | Recovery Policy | Agent State After |
|---|---|---|
| RECOVERABLE | Skip action this tick, resume next tick | READY (unchanged) |
| QUARANTINE | Block agent for N ticks (default: 3) | BLOCKED (timed) |
| EXILE | Immediately terminate agent | TERMINATED |
2. Exception-to-Severity Mapping
Default classification by Python exception type:
| Severity | Exception Types |
|---|---|
| RECOVERABLE | ValueError, KeyError, IndexError, TypeError, AttributeError, ArithmeticError |
| QUARANTINE | RuntimeError, TimeoutError, OSError |
| EXILE | RecursionError, MemoryError, SystemError |
Classification Algorithm
- Exact type match: check type(exception) against the fault map
- Inheritance check: check isinstance() in severity order (EXILE first, then QUARANTINE, then RECOVERABLE)
- Fallback: unknown exception types default to QUARANTINE
Classification is deterministic and does not depend on exception message content.
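The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in core/kernel/fault_isolation.py: the enum values, the DEFAULT_FAULT_MAP constant, and the classify() helper are names introduced here for the example.

```python
from enum import Enum
from typing import Dict, Type


class FaultSeverity(Enum):
    # Only the three member names come from the RFC; the values are placeholders.
    RECOVERABLE = "recoverable"
    QUARANTINE = "quarantine"
    EXILE = "exile"


# Default mapping from section 2 (the constant name is hypothetical).
DEFAULT_FAULT_MAP: Dict[Type[BaseException], FaultSeverity] = {
    **dict.fromkeys(
        (ValueError, KeyError, IndexError, TypeError, AttributeError, ArithmeticError),
        FaultSeverity.RECOVERABLE,
    ),
    **dict.fromkeys((RuntimeError, TimeoutError, OSError), FaultSeverity.QUARANTINE),
    **dict.fromkeys((RecursionError, MemoryError, SystemError), FaultSeverity.EXILE),
}

# Most severe level first, as required by the inheritance check.
SEVERITY_ORDER = (
    FaultSeverity.EXILE,
    FaultSeverity.QUARANTINE,
    FaultSeverity.RECOVERABLE,
)


def classify(
    exc: BaseException,
    fault_map: Dict[Type[BaseException], FaultSeverity] = DEFAULT_FAULT_MAP,
) -> FaultSeverity:
    """Classify an exception by type only; the message is never inspected."""
    # Step 1: exact type match.
    exact = fault_map.get(type(exc))
    if exact is not None:
        return exact
    # Step 2: inheritance check, scanning EXILE entries before the others.
    for severity in SEVERITY_ORDER:
        for exc_type, mapped in fault_map.items():
            if mapped is severity and isinstance(exc, exc_type):
                return severity
    # Step 3: fallback for unknown exception types.
    return FaultSeverity.QUARANTINE
```

The ordering matters because RecursionError is itself a subclass of RuntimeError. A subclass of RecursionError has no exact entry and matches both an EXILE type and a QUARANTINE type; scanning EXILE first resolves it to the stricter level.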
3. Recovery Policies
RECOVERABLE

    if agent.state == RUNNING:
        process_table.set_ready(pid)  # RUNNING → READY
    # Agent remains READY, will be scheduled next tick
    # Action for THIS tick is skipped (no side effects)

Rationale: Transient errors (key not found, bad value) are likely to resolve next tick when world state changes.
QUARANTINE

    until_tick = current_tick + quarantine_ticks  # default: 3
    if agent.state == RUNNING:
        process_table.block(pid, until_tick=until_tick)
    elif agent.state == READY:
        process_table.transition_state(pid, RUNNING)
        process_table.block(pid, until_tick=until_tick)

Rationale: Non-transient errors (runtime failures, timeouts) need a cooling
period. The agent is blocked for quarantine_ticks before re-entering the ready
queue, and is auto-unblocked by ProcessTable.unblock_expired() during the
scheduler's build_schedule() (RFC-003).
EXILE
Rationale: Catastrophic errors (stack overflow, memory exhaustion) indicate an agent that cannot safely continue. Immediate termination prevents cascading failures.
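To show how the three policies differ in effect, here is a rough sketch against a toy process table. Everything in it is an assumption made for illustration: ToyProcessTable stands in for the RFC-002 ProcessTable, terminate() is a hypothetical method name, the returned action strings are invented, and the rule that an agent unblocks once the current tick reaches until_tick is a guess at the boundary behaviour.

```python
from enum import Enum
from typing import Dict


class State(Enum):
    READY = "READY"
    RUNNING = "RUNNING"
    BLOCKED = "BLOCKED"
    TERMINATED = "TERMINATED"


class ToyProcessTable:
    """Stand-in for the RFC-002 ProcessTable; not the real implementation."""

    def __init__(self) -> None:
        self.state: Dict[int, State] = {}
        self.blocked_until: Dict[int, int] = {}

    def spawn(self, pid: int) -> None:
        self.state[pid] = State.READY

    def set_ready(self, pid: int) -> None:
        self.state[pid] = State.READY

    def block(self, pid: int, until_tick: int) -> None:
        self.state[pid] = State.BLOCKED
        self.blocked_until[pid] = until_tick

    def terminate(self, pid: int) -> None:  # hypothetical method name
        self.state[pid] = State.TERMINATED
        self.blocked_until.pop(pid, None)

    def unblock_expired(self, current_tick: int) -> None:
        # Assumption: an agent becomes READY once current_tick reaches until_tick.
        for pid, until_tick in list(self.blocked_until.items()):
            if current_tick >= until_tick:
                del self.blocked_until[pid]
                self.state[pid] = State.READY


def apply_recovery(
    table: ToyProcessTable,
    pid: int,
    severity: str,
    current_tick: int,
    quarantine_ticks: int = 3,
) -> str:
    """Apply one recovery policy and return an (invented) action label."""
    if severity == "RECOVERABLE":
        if table.state[pid] is State.RUNNING:
            table.set_ready(pid)  # RUNNING -> READY; this tick's action is skipped
        return "skip_tick"
    if severity == "QUARANTINE":
        table.block(pid, until_tick=current_tick + quarantine_ticks)
        return "quarantine"
    # EXILE: the agent is removed from scheduling for good.
    table.terminate(pid)
    return "terminate"
```

With quarantine_ticks=3, an agent quarantined at tick 10 stays BLOCKED through tick 12 in this sketch and returns to READY when unblock_expired(13) runs; an exiled agent is never rescheduled.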
4. FaultRecord (Structured Fault Log)

    from dataclasses import dataclass

    @dataclass
    class FaultRecord:
        pid: int                 # Faulted agent's PID
        agent_id: str            # Application-level agent name
        tick: int                # Tick at which fault occurred
        phase: str               # Simulation phase (OBSERVE, DECIDE, ACT)
        exception_type: str      # e.g. "ValueError", "RuntimeError"
        message: str             # First 200 chars of exception message
        traceback_hash: str      # MD5 hash of traceback (12-char hex)
        severity: FaultSeverity  # Classified severity
        recovery_action: str     # What recovery was applied

Traceback hashing enables grouping duplicate faults across agents and ticks without storing full tracebacks (privacy + space efficiency).
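One possible way to derive such a hash is sketched below. The RFC only states that the field is an MD5-based 12-character hex string; which parts of the traceback are hashed is an assumption here (exception type name plus formatted stack frames, excluding the message), and traceback_hash() and _lookup() are hypothetical helper names.

```python
import hashlib
import traceback


def traceback_hash(exc: BaseException, length: int = 12) -> str:
    """Return a short hex fingerprint of where `exc` was raised."""
    # Hash the exception type and the formatted stack frames, but not the
    # message, so the same failure with different messages groups together.
    frames = "".join(traceback.format_tb(exc.__traceback__))
    payload = f"{type(exc).__name__}\n{frames}".encode("utf-8")
    # MD5 serves as a compact fingerprint here, not as a security measure.
    return hashlib.md5(payload).hexdigest()[:length]


def _lookup(key: str) -> int:
    return {}[key]  # always raises KeyError


# Two faults raised from the same code path with different messages.
hashes = []
for key in ("alpha", "beta"):
    try:
        _lookup(key)
    except KeyError as exc:
        hashes.append(traceback_hash(exc))
```

Both entries in hashes are identical because the stack frames match, even though the KeyError messages differ. Only the fingerprint needs to be stored in the FaultRecord, not the full traceback text.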
5. FaultIsolationManager API

    from typing import Dict, Optional, Type

    class FaultIsolationManager:
        def __init__(
            self,
            process_table: ProcessTable,                            # For state transitions
            quarantine_ticks: int = 3,                              # QUARANTINE duration
            fault_map: Optional[Dict[Type, FaultSeverity]] = None,  # Custom mapping (optional)
        ) -> None: ...

| Method | Description |
|---|---|
| on_agent_fault(pid, agent_id, tick, phase, exception) | Classify + recover + log |
| get_fault_log(pid=None) | Structured fault records (filterable by PID) |
| get_fault_count(pid) | Total faults for one agent |
| get_total_faults() | Total faults across all agents |
| get_fault_summary() | Count by severity (RECOVERABLE/QUARANTINE/EXILE) |
| get_healthy_ratio() | Fraction of alive agents with zero faults |
6. Integration with SimulationEngine

    # Inside SimulationEngine._run_tick():
    for pid in execution_order:
        try:
            # ... agent OBSERVE → DECIDE → ACT ...
            pass
        except Exception as e:
            fault_manager.on_agent_fault(pid, agent_id, tick, phase, e)
            # Simulation continues with next agent
            # No tick-level crash

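The containment pattern can be exercised in isolation with a small, self-contained sketch. It is not the SimulationEngine: run_tick(), the agents mapping of PID to a per-phase callable, and the "agent-<pid>" naming are assumptions made for the example. Only the on_agent_fault argument order follows the API table in section 5.

```python
from typing import Callable, Dict, List

PHASES = ("OBSERVE", "DECIDE", "ACT")


def run_tick(
    agents: Dict[int, Callable[[str], None]],
    on_agent_fault: Callable[[int, str, int, str, BaseException], None],
    tick: int,
) -> List[int]:
    """Run every agent through all phases; return PIDs that finished cleanly."""
    completed = []
    for pid, step in agents.items():
        phase = PHASES[0]
        try:
            for phase in PHASES:
                step(phase)
            completed.append(pid)
        except Exception as exc:
            # The fault is reported with the phase that raised it, and the
            # loop moves on to the next agent instead of aborting the tick.
            on_agent_fault(pid, f"agent-{pid}", tick, phase, exc)
    return completed


# Usage: agent 2 fails during DECIDE; agents 1 and 3 are unaffected.
faults = []


def record(pid, agent_id, tick, phase, exc):
    faults.append((pid, phase, type(exc).__name__))


def healthy(phase: str) -> None:
    pass


def broken(phase: str) -> None:
    if phase == "DECIDE":
        1 / 0


done = run_tick({1: healthy, 2: broken, 3: healthy}, record, tick=7)
```

After the call, done holds the PIDs of the two healthy agents and faults holds a single entry for agent 2 in the DECIDE phase, which is the behaviour the core invariant requires.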
7. Health Metrics
get_healthy_ratio() computes the fraction of alive agents with zero recorded faults: (alive agents with zero faults) / (all alive agents).
This metric is included in SimulationResult for post-run analysis:
- 1.0 = all alive agents are fault-free
- < 0.5 = majority of agents have experienced faults (potential systemic issue)
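As a worked illustration of the ratio, the sketch below computes it from a list of alive PIDs and a per-agent fault count. The function name, its parameters, and the choice to return 1.0 when no agents are alive are assumptions; the RFC does not say how the empty case is handled.

```python
from typing import Dict, Iterable


def healthy_ratio(alive_pids: Iterable[int], fault_counts: Dict[int, int]) -> float:
    """Fraction of alive agents that have zero recorded faults."""
    alive = list(alive_pids)
    if not alive:
        # Assumption: with no alive agents there is nothing unhealthy to report.
        return 1.0
    healthy = sum(1 for pid in alive if fault_counts.get(pid, 0) == 0)
    return healthy / len(alive)


# Four alive agents, one of which has faulted once: 3 / 4 = 0.75.
ratio = healthy_ratio([1, 2, 3, 4], {2: 1})
```

Faults recorded for agents that are no longer alive do not enter the calculation, since only PIDs in alive_pids are counted.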
8. Custom Fault Maps
The default fault map can be overridden for domain-specific exception handling:

    custom_map = {
        MyDomainError: FaultSeverity.RECOVERABLE,
        ExternalAPIError: FaultSeverity.QUARANTINE,
        CriticalStateCorruption: FaultSeverity.EXILE,
    }
    manager = FaultIsolationManager(process_table, fault_map=custom_map)

Custom maps are merged with defaults — explicit entries override defaults for the same exception type.
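The merge semantics can be pictured as a dictionary update in which custom entries win. The snippet below is a sketch only: merge_fault_maps() is a hypothetical helper, severities are shown as plain strings to keep it self-contained, and DEFAULTS is a shortened stand-in for the full default map.

```python
from typing import Dict, Optional, Type

# Shortened stand-in for the default fault map (severities as plain strings).
DEFAULTS: Dict[Type[BaseException], str] = {
    ValueError: "RECOVERABLE",
    RuntimeError: "QUARANTINE",
    MemoryError: "EXILE",
}


class ExternalAPIError(Exception):
    """Example domain-specific exception."""


def merge_fault_maps(
    defaults: Dict[Type[BaseException], str],
    custom: Optional[Dict[Type[BaseException], str]] = None,
) -> Dict[Type[BaseException], str]:
    """Return defaults overlaid with custom entries, leaving defaults untouched."""
    merged = dict(defaults)
    merged.update(custom or {})
    return merged


merged = merge_fault_maps(
    DEFAULTS,
    {ExternalAPIError: "QUARANTINE", ValueError: "EXILE"},  # ValueError is overridden
)
```

Here ValueError is escalated to EXILE by the custom entry, RuntimeError and MemoryError keep their defaults, and ExternalAPIError is added as a new mapping.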
Test Coverage
26 tests in tests/kernel/test_fault_isolation.py covering:
- Classification of all default exception types
- RECOVERABLE recovery (agent stays READY)
- QUARANTINE recovery (agent BLOCKED for N ticks)
- EXILE recovery (agent TERMINATED)
- Fault record structure and serialization
- Traceback hash determinism (same exception → same hash)
- Fault count per agent tracking
- Total fault summary by severity
- Healthy ratio calculation (0 faults, some faults, all faulted)
- Custom fault map override
- Unknown exception defaults to QUARANTINE
- Multiple faults for same agent (escalation tracking)
Interaction with Other RFCs
| RFC | Relationship |
|---|---|
| RFC-002 | Uses ProcessTable for state transitions (READY/BLOCKED/TERMINATED) |
| RFC-003 | Blocked agents excluded from next scheduler pass; unblocked by scheduler |
| RFC-004 | Exiled agents' mailboxes should be deregistered (future integration) |
| RFC-005 | Fault Isolation may revoke capabilities for faulted agents (future) |
Backward Compatibility
- Full backward compatibility. FaultIsolationManager wraps existing try/except patterns — no changes to exception handling semantics.
- Simulations that do not use FaultIsolationManager continue to propagate exceptions normally (existing behavior preserved).
Core Invariant
A single agent crash NEVER kills the simulation tick.
This is the defining property of the Fault Isolation layer. It is validated by test cases that inject exceptions of every severity level and verify that:
1. The faulted agent is correctly recovered/quarantined/exiled
2. All other agents in the same tick execute normally
3. SimulationEngine completes the tick without raising