# Design Document - Hybrid Auto-Heal (Fiche #22)

## Overview

The Hybrid Auto-Heal system balances service continuity against safety. It uses a state machine to manage transitions between execution modes, circuit breakers to prevent infinite loops, and a versioning system that allows learning to be rolled back.

The architecture builds on existing systems (Fiche #19 for failure capture, Fiche #18 for persistent learning, Fiche #16 for reports) and adds an intelligent supervision and protection layer on top of them.

## Architecture

```mermaid
graph TB
    subgraph "Auto-Heal Hybrid System"
        AHM[Auto Heal Manager]
        CB[Circuit Breaker]
        VS[Versioned Store]
        PC[Policy Config]
    end

    subgraph "Execution Layer"
        EL[Execution Loop]
        AE[Action Executor]
        TR[Target Resolver]
    end

    subgraph "Learning Layer"
        TMS[Target Memory Store]
        FAISS[FAISS Index]
        PROTO[Prototypes]
    end

    subgraph "Integration Layer"
        FCR[Failure Case Recorder]
        SR[Simulation Report]
        PM[Precision Metrics]
    end

    EL --> AHM
    AHM --> CB
    AHM --> VS
    AHM --> PC
    AHM --> FCR
    AHM --> SR
    AHM --> PM
    VS --> TMS
    VS --> FAISS
    VS --> PROTO
    AHM --> AE
    AE --> TR
```

## Components and Interfaces

### 1. Auto Heal Manager (core/system/auto_heal_manager.py)

**Responsibility:** Central manager for execution states and safety policies.
```python
from __future__ import annotations  # ExecutionResult comes from the execution layer

from enum import Enum
from pathlib import Path
from typing import Any, Dict, Tuple


class ExecutionState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"
    ROLLBACK = "rollback"
    PAUSED = "paused"


class AutoHealManager:
    # Interface only: method bodies are omitted.
    def __init__(self, policy_path: Path = Path("data/config/auto_heal_policy.json")): ...
    def should_execute_step(self, workflow_id: str, step_id: str) -> Tuple[bool, str]: ...
    def on_step_result(self, workflow_id: str, step_id: str, result: ExecutionResult) -> None: ...
    def get_mode(self, workflow_id: str) -> ExecutionState: ...
    def force_transition(self, workflow_id: str, new_state: ExecutionState, reason: str) -> None: ...
    def get_status_report(self) -> Dict[str, Any]: ...
```

**Integration with the execution loop:**

```python
# In execution_loop.py or action_executor.py
allowed, reason = auto_heal_manager.should_execute_step(workflow_id, step_id)
if not allowed:
    return ExecutionResult(status=ExecutionStatus.BLOCKED, message=reason)

# Execute the action...
result = execute_action(...)

# After execution
auto_heal_manager.on_step_result(workflow_id, step_id, result)
```

### 2. Circuit Breaker (core/system/circuit_breaker.py)

**Responsibility:** Anti-loop mechanism based on sliding windows.

```python
from typing import Any, Dict


class CircuitBreaker:
    # Interface only: method bodies are omitted.
    def __init__(self, policy: Dict[str, Any]): ...
    def record_failure(self, workflow_id: str, step_id: str, failure_type: str) -> None: ...
    def record_success(self, workflow_id: str, step_id: str) -> None: ...
    def should_trigger_degraded(self, workflow_id: str, step_id: str) -> bool: ...
    def should_trigger_quarantine(self, workflow_id: str) -> bool: ...
    def should_trigger_global_pause(self) -> bool: ...
    def get_failure_counts(self, workflow_id: str) -> Dict[str, int]: ...
```

**Sliding windows:**

- Step level: 3 consecutive failures → DEGRADED
- Workflow level: 10 failures in 10 minutes → QUARANTINED
- Global level: 30 failures in 10 minutes → PAUSED (optional)

### 3. Versioned Store (core/learning/versioned_store.py)

**Responsibility:** Versioning system that makes learning reversible.
```python
from __future__ import annotations  # VersionInfo is defined under Data Models

from pathlib import Path
from typing import List, Optional


class VersionedStore:
    # Interface only: method bodies are omitted.
    def __init__(self, base_path: Path = Path("data")): ...
    def snapshot_version(self, workflow_id: str) -> str: ...
    def rollback_to_previous(self, workflow_id: str, version: Optional[str] = None) -> bool: ...
    def list_versions(self, workflow_id: str) -> List[VersionInfo]: ...
    def cleanup_old_versions(self, workflow_id: str, keep_count: int = 5) -> None: ...

    # Component versioning
    def version_prototypes(self, workflow_id: str, version: str) -> None: ...
    def version_faiss_index(self, workflow_id: str, version: str) -> None: ...
    def version_target_memory(self, workflow_id: str, version: str) -> None: ...
```

**Versioning layout:**

```
data/
├── learning/
│   └── prototypes/
│       ├── v001/              # Version snapshots
│       └── v002/
├── faiss_index/
│   └── workflow_/
│       ├── v001/              # Versioned indices
│       └── v002/
└── target_memory_snapshots/
    ├── v001.db                # SQLite snapshots
    └── v002.db
```

### 4. Policy Configuration (data/config/auto_heal_policy.json)

**Configuration structure:**

```json
{
  "mode": "hybrid",
  "step_fail_streak_to_degraded": 3,
  "workflow_fail_window_s": 600,
  "workflow_fail_max_in_window": 10,
  "global_fail_max_in_window": 30,
  "min_confidence_normal": 0.72,
  "min_confidence_degraded": 0.82,
  "min_margin_top1_top2_degraded": 0.08,
  "disable_learning_in_degraded": true,
  "rollback_on_regression": true,
  "regression_window_steps": 50,
  "regression_fail_ratio": 0.20,
  "quarantine_duration_s": 1800,
  "max_versions_to_keep": 5
}
```

## Data Models

### ExecutionStateInfo

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ExecutionStateInfo:
    workflow_id: str
    current_state: ExecutionState
    state_since: datetime
    failure_count: int
    last_failure: Optional[datetime]
    confidence_threshold: float
    learning_enabled: bool
    quarantine_until: Optional[datetime]
```

### FailureWindow

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class FailureWindow:
    window_start: datetime
    window_duration_s: int
    failures: List[FailureEvent]

    # Interface only: method bodies are omitted.
    def add_failure(self, failure: FailureEvent) -> None: ...
    def get_failure_count(self) -> int: ...
    def cleanup_expired(self) -> None: ...
```

### VersionInfo

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class VersionInfo:
    version_id: str
    created_at: datetime
    workflow_id: str
    success_rate_before: float
    success_rate_after: Optional[float]
    components_versioned: List[str]  # ["prototypes", "faiss", "memory"]
```

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: State Transition Consistency

*For any* workflow execution state, transitions should follow valid state machine rules and maintain consistency across all system components.
**Validates: Requirements 1.1, 1.2, 1.3, 1.4, 1.5, 1.6**

### Property 2: Circuit Breaker Threshold Enforcement

*For any* sequence of step failures, when thresholds are exceeded, the circuit breaker should trigger appropriate state transitions within the configured time windows.
**Validates: Requirements 2.1, 2.2, 2.3**

### Property 3: Degraded Mode Safety

*For any* workflow in DEGRADED state, all execution decisions should use increased confidence thresholds and learning updates should be disabled.
**Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6**

### Property 4: Rollback Consistency

*For any* rollback operation, all versioned components (prototypes, FAISS indices, target memory) should be restored to the same consistent version point.
**Validates: Requirements 4.1, 4.2, 4.3, 4.5, 4.6**

### Property 5: Hybrid Storage Integrity

*For any* execution decision, audit records should always be written to JSONL, and SQLite records should only be written for validated successes when not in DEGRADED mode.
**Validates: Requirements 5.1, 5.2, 5.3, 5.4**

### Property 6: Configuration Consistency

*For any* configuration change, all system components should apply the new settings consistently without requiring restart.
**Validates: Requirements 6.1, 6.2, 6.3, 6.4, 6.5**

### Property 7: Integration Compatibility

*For any* existing system integration point, the auto-healing system should maintain backward compatibility and enhance functionality without breaking existing workflows.
**Validates: Requirements 7.1, 7.2, 7.3, 7.4, 7.5**

## Error Handling

### Failure Classification

1. **TARGET_NOT_FOUND**: UI element not found
2. **POSTCONDITION_FAILED**: Post-conditions not satisfied
3. **WATCHDOG_TIMEOUT**: Watchdog timeout
4. **LOW_CONFIDENCE**: FAISS confidence too low
5. **RUNTIME_DRIFT**: Resolution or scale change

### Recovery Strategies

1. **Immediate**: Retry with normal parameters
2. **Degraded**: Retry with raised thresholds
3. **Quarantine**: Stop the workflow and capture the failure
4. **Rollback**: Restore the previous version
5. **Manual**: Human intervention required

### Error Propagation

- Step errors propagate up to the workflow level
- Workflow errors can trigger global actions
- Every error generates a FailureCase (Fiche #19)
- Critical errors generate reports (Fiche #16)

## Testing Strategy

### Unit Tests

- Individual state transitions
- Circuit breaker thresholds
- Versioning operations
- Policy configuration

### Property Tests

- State consistency properties
- Threshold invariants
- Rollback integrity
- Hybrid storage consistency

### Integration Tests

- Existing systems (Fiche #19, #18, #16)
- Execution hooks
- Degradation scenarios
- Full rollbacks

### Scenario Tests

- Simulate 3 consecutive failures → DEGRADED
- Simulate 10 failures in 10 minutes → QUARANTINED
- Simulate a learning regression → ROLLBACK
- Recovery after quarantine

The testing strategy combines unit tests for specific cases with property tests that validate universal invariants. Integration tests verify compatibility with the existing systems, while scenario tests validate end-to-end behavior.
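
The threshold scenarios listed in the testing strategy (3 consecutive failures, 10 failures in 10 minutes) can be tried out against a small stand-in for the circuit breaker. The sketch below is illustrative only and is not the actual `CircuitBreaker` implementation: the class name `SlidingWindowBreaker`, the explicit `now` parameter (used instead of a real clock so that tests stay deterministic), and the choice to trigger as soon as a count reaches its threshold are all assumptions. Only the policy keys and their values come from the configuration shown earlier.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

# Thresholds copied from the policy configuration in this document.
POLICY: Dict[str, int] = {
    "step_fail_streak_to_degraded": 3,
    "workflow_fail_window_s": 600,
    "workflow_fail_max_in_window": 10,
}


class SlidingWindowBreaker:
    """Hypothetical, simplified stand-in for CircuitBreaker.

    Timestamps are passed in as seconds so that tests do not depend on a clock.
    """

    def __init__(self, policy: Dict[str, int]) -> None:
        self.policy = policy
        # Consecutive-failure streak per (workflow_id, step_id).
        self.streaks: Dict[Tuple[str, str], int] = defaultdict(int)
        # Failure timestamps per workflow, oldest first.
        self.failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, workflow_id: str, step_id: str, now: float) -> None:
        self.streaks[(workflow_id, step_id)] += 1
        self.failures[workflow_id].append(now)

    def record_success(self, workflow_id: str, step_id: str) -> None:
        # A success breaks the consecutive-failure streak for that step.
        self.streaks[(workflow_id, step_id)] = 0

    def should_trigger_degraded(self, workflow_id: str, step_id: str) -> bool:
        # Assumption: trigger as soon as the streak reaches the threshold.
        streak = self.streaks[(workflow_id, step_id)]
        return streak >= self.policy["step_fail_streak_to_degraded"]

    def should_trigger_quarantine(self, workflow_id: str, now: float) -> bool:
        window = self.failures[workflow_id]
        cutoff = now - self.policy["workflow_fail_window_s"]
        # Drop failures that have fallen out of the sliding window.
        while window and window[0] < cutoff:
            window.popleft()
        return len(window) >= self.policy["workflow_fail_max_in_window"]


# Scenario: three consecutive failures on one step.
breaker = SlidingWindowBreaker(POLICY)
for t in (0.0, 1.0, 2.0):
    breaker.record_failure("wf-demo", "step-1", now=t)
degraded = breaker.should_trigger_degraded("wf-demo", "step-1")  # True
```

Passing the timestamp explicitly keeps the scenario tests fast: a 10-minute window can be simulated with plain numbers instead of sleeping or patching the system clock.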