# Implementation Plan: Hybrid Auto-Heal (Fiche #22)

## Overview

Implementation of the hybrid auto-healing system, which balances continuity and safety. The plan follows an incremental approach: configuration and base models, circuit breaker, versioning system, main manager, integration with the execution loop, and full test coverage.

## Tasks

- [ ] 1. Set up configuration and base models
  - Create policy configuration structure
  - Define execution state enums and data models
  - Set up base directory structure for versioning
  - _Requirements: 6.1, 6.2, 6.3_

  - [x] 1.1 Create policy configuration system
    - Implement auto_heal_policy.json structure
    - Create PolicyConfig class with validation
    - Add configuration loading and hot-reload capability
    - _Requirements: 6.1, 6.4, 6.5_

  - [ ]* 1.2 Write property test for configuration consistency
    - **Property 6: Configuration Consistency**
    - **Validates: Requirements 6.1, 6.2, 6.3, 6.4, 6.5**

  - [x] 1.3 Implement base data models
    - Create ExecutionStateInfo, FailureWindow, VersionInfo dataclasses
    - Implement ExecutionState enum with validation
    - Add serialization/deserialization methods
    - _Requirements: 1.1, 2.1, 4.1_

  - [ ]* 1.4 Write unit tests for data models
    - Test state transitions and validation
    - Test failure window operations
    - Test version info management
    - _Requirements: 1.1, 2.1, 4.1_

- [ ] 2. Implement Circuit Breaker system

  - [ ] 2.1 Create CircuitBreaker class with sliding windows
    - Implement failure counting with time windows
    - Add step-level, workflow-level, and global-level tracking
    - Create threshold checking methods
    - _Requirements: 2.1, 2.2, 2.3_

  - [ ]* 2.2 Write property test for circuit breaker thresholds
    - **Property 2: Circuit Breaker Threshold Enforcement**
    - **Validates: Requirements 2.1, 2.2, 2.3**

  - [ ] 2.3 Implement failure window management
    - Create sliding window data structure
    - Add automatic cleanup of expired failures
    - Implement efficient failure counting
    - _Requirements: 2.1, 2.2, 2.3_

  - [ ]* 2.4 Write unit tests for circuit breaker
    - Test sliding window behavior
    - Test threshold triggering
    - Test failure cleanup
    - _Requirements: 2.1, 2.2, 2.3_

- [ ] 3. Create Versioned Store system

  - [ ] 3.1 Implement VersionedStore class
    - Create version snapshot functionality
    - Implement rollback operations
    - Add version listing and cleanup
    - _Requirements: 4.1, 4.2, 4.3, 4.6_

  - [ ] 3.2 Implement component versioning
    - Add prototypes versioning (data/learning/prototypes/vNN/)
    - Add FAISS index versioning (data/faiss_index/workflow_/vNN/)
    - Add target memory versioning (SQLite snapshots)
    - _Requirements: 4.1, 4.2, 4.3_

  - [ ]* 3.3 Write property test for rollback consistency
    - **Property 4: Rollback Consistency**
    - **Validates: Requirements 4.1, 4.2, 4.3, 4.5, 4.6**

  - [ ]* 3.4 Write unit tests for versioned store
    - Test snapshot creation
    - Test rollback operations
    - Test version cleanup
    - _Requirements: 4.1, 4.2, 4.3, 4.6_

- [ ] 4. Checkpoint - Ensure core components work
  - Ensure all tests pass, ask the user if questions arise.

- [ ] 5. Implement Auto Heal Manager

  - [ ] 5.1 Create AutoHealManager class
    - Implement state machine for execution states
    - Add should_execute_step() and on_step_result() methods
    - Create state transition logic
    - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6_

  - [ ] 5.2 Implement degraded mode logic
    - Add confidence threshold adjustment
    - Implement learning disable functionality
    - Add strict validation for ambiguous targets
    - _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6_

  - [ ]* 5.3 Write property test for state transitions
    - **Property 1: State Transition Consistency**
    - **Validates: Requirements 1.1, 1.2, 1.3, 1.4, 1.5, 1.6**

  - [ ]* 5.4 Write property test for degraded mode safety
    - **Property 3: Degraded Mode Safety**
    - **Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6**

  - [ ] 5.5 Implement hybrid storage system
    - Add JSONL audit logging for all decisions
    - Add SQLite logging only for validated successes
    - Implement storage filtering based on execution state
    - _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5_

  - [ ]* 5.6 Write property test for hybrid storage integrity
    - **Property 5: Hybrid Storage Integrity**
    - **Validates: Requirements 5.1, 5.2, 5.3, 5.4**

- [ ] 6. Integrate with existing systems

  - [ ] 6.1 Add Fiche #19 FailureCase integration
    - Integrate with failure_case_recorder.py
    - Add automatic capture on quarantine events
    - Include auto-heal context in failure records
    - _Requirements: 7.1, 2.4_

  - [ ] 6.2 Add Fiche #16 simulation report integration
    - Generate mini reports on circuit breaker triggers
    - Include scenario data when available
    - Add auto-heal specific report sections
    - _Requirements: 7.2, 2.5_

  - [ ] 6.3 Add Fiche #18 persistent learning integration
    - Integrate with target_memory_store.py for rollback decisions
    - Add learning success rate monitoring
    - Implement regression detection
    - _Requirements: 7.3, 4.4_

  - [ ] 6.4 Add Fiche #10 precision metrics integration
    - Use precision metrics for confidence scoring
    - Integrate with metrics_engine.py
    - Add auto-heal metrics to precision dashboard
    - _Requirements: 7.4_

  - [ ]* 6.5 Write property test for integration compatibility
    - **Property 7: Integration Compatibility**
    - **Validates: Requirements 7.1, 7.2, 7.3, 7.4, 7.5**

- [ ] 7. Add execution loop hooks

  - [ ] 7.1 Modify action_executor.py integration
    - Add should_execute_step() call before action execution
    - Add on_step_result() call after action completion
    - Implement execution blocking for quarantined workflows
    - _Requirements: 7.5, 1.4_

  - [ ] 7.2 Implement trigger detection
    - Add TARGET_NOT_FOUND detection and counting
    - Add POSTCONDITION_FAILED detection and counting
    - Add WATCHDOG_TIMEOUT detection
    - Add low confidence detection from FAISS matches
    - _Requirements: 9.1, 9.2, 9.3, 9.4_

  - [ ]* 7.3 Write integration tests for execution hooks
    - Test execution flow with auto-heal manager
    - Test blocking of quarantined workflows
    - Test state transitions during execution
    - _Requirements: 7.5, 1.4, 1.5_

- [ ] 8. Implement monitoring and metrics

  - [ ] 8.1 Add status reporting
    - Implement get_status_report() method
    - Add real-time state monitoring
    - Create health endpoint integration
    - _Requirements: 8.3, 8.4_

  - [ ] 8.2 Add metrics collection
    - Track state transition history and durations
    - Maintain sliding window counters
    - Add performance metrics for auto-heal operations
    - _Requirements: 8.1, 8.2, 8.5_

  - [ ]* 8.3 Write unit tests for monitoring
    - Test status reporting accuracy
    - Test metrics collection
    - Test alert generation
    - _Requirements: 8.1, 8.2, 8.3, 8.4_

- [ ] 9. Create test scenarios and validation

  - [ ] 9.1 Implement forced failure scenarios
    - Create test workflow with non-existent targets
    - Add configuration for test mode
    - Implement failure injection capabilities
    - _Requirements: 10.1, 10.2_

  - [ ] 9.2 Create degraded mode test
    - Force 3 consecutive TARGET_NOT_FOUND failures
    - Verify transition to DEGRADED state
    - Validate increased confidence thresholds
    - _Requirements: 10.2, 3.1, 3.2_

  - [ ] 9.3 Create quarantine test
    - Force 10 failures in a 10-minute window
    - Verify transition to QUARANTINED state
    - Validate FailureCase creation
    - _Requirements: 10.3, 2.2, 2.4_

  - [ ] 9.4 Create rollback test
    - Intentionally degrade prototype quality
    - Trigger automatic rollback on performance drop
    - Verify restoration of previous versions
    - _Requirements: 10.4, 4.4, 4.5_

  - [ ]* 9.5 Write comprehensive integration tests
    - Test complete auto-heal workflow scenarios
    - Test interaction with all integrated systems
    - Test configuration changes and hot-reload
    - _Requirements: 10.1, 10.2, 10.3, 10.4, 10.5_

- [ ] 10. Final checkpoint and documentation

  - [ ] 10.1 Create demo script
    - Implement demo_auto_heal_hybrid.py
    - Show all state transitions and features
    - Include performance metrics and monitoring
    - _Requirements: 10.1, 10.2, 10.3, 10.4, 10.5_

  - [ ] 10.2 Update system integration
    - Update run.sh and launch scripts
    - Add auto-heal configuration to deployment
    - Update monitoring dashboard integration
    - _Requirements: 7.5, 8.5_

  - [ ] 10.3 Final validation
    - Run complete test suite
    - Validate all integration points
    - Test production deployment scenario
    - _Requirements: 10.5_

- [ ] 11. Final checkpoint - Ensure all tests pass
  - Ensure all tests pass, ask the user if questions arise.

## Notes

- Tasks marked with `*` are optional and can be skipped for a faster MVP
- Each task references specific requirements for traceability
- Checkpoints ensure incremental validation
- Property tests validate universal correctness properties
- Unit tests validate specific examples and edge cases
- Integration tests verify compatibility with existing systems (Fiche #19, #18, #16, #10)
- The implementation follows the hybrid approach: continue when safe, degrade when uncertain, stop when dangerous
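
The "continue, degrade, stop" behaviour can be illustrated with the thresholds named in tasks 9.2 and 9.3 (3 consecutive failures to degrade, 10 failures in a 10-minute window to quarantine) and the two hooks from task 5.1. What follows is a minimal sketch under stated assumptions, not the planned implementation: the class name `SlidingWindowBreaker`, the `NORMAL` state, the constructor parameters, and the choice to fold the circuit breaker and the state machine into one class are all hypothetical. The plan itself names only `DEGRADED`, `QUARANTINED`, `should_execute_step()` and `on_step_result()`. Recovery transitions are not described in the plan, so the sketch leaves them out.

```python
import time
from collections import deque
from enum import Enum


class ExecutionState(Enum):
    NORMAL = "normal"            # assumed name; the plan does not name the healthy state
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"


class SlidingWindowBreaker:
    """Hypothetical single-workflow breaker: counts failures in a time window."""

    def __init__(self, degrade_after=3, quarantine_after=10, window_seconds=600.0):
        # Defaults mirror tasks 9.2 and 9.3; real values would come from
        # auto_heal_policy.json, whose schema is not shown in the plan.
        self.degrade_after = degrade_after
        self.quarantine_after = quarantine_after
        self.window_seconds = window_seconds
        self._failures = deque()     # timestamps of failures still inside the window
        self._consecutive = 0        # failures since the last success
        self.state = ExecutionState.NORMAL

    def _prune(self, now):
        # Drop failures that have aged out of the sliding window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()

    def should_execute_step(self):
        # Quarantined workflows are blocked; degraded ones keep running.
        return self.state is not ExecutionState.QUARANTINED

    def on_step_result(self, success, now=None):
        # `now` is injectable so tests do not depend on the wall clock.
        now = time.monotonic() if now is None else now
        self._prune(now)
        if success:
            self._consecutive = 0
        else:
            # Any failure type counts here; the plan distinguishes
            # TARGET_NOT_FOUND, POSTCONDITION_FAILED and WATCHDOG_TIMEOUT.
            self._consecutive += 1
            self._failures.append(now)

        if len(self._failures) >= self.quarantine_after:
            self.state = ExecutionState.QUARANTINED
        elif (self.state is ExecutionState.NORMAL
              and self._consecutive >= self.degrade_after):
            self.state = ExecutionState.DEGRADED
        return self.state
```

In an execution loop, the caller would check `should_execute_step()` before each action and report the outcome through `on_step_result()` afterwards, which matches the hook placement described in task 7.1. Using a `deque` keeps expired-failure cleanup cheap because the oldest timestamps are always at the left end.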