# Implementation Plan: Hybrid Auto-Heal (Fiche #22)

## Overview

Implementation of the hybrid auto-healing system, which balances continuity and safety. The plan is incremental: configuration and base models, circuit breaker, versioning system, main manager, integration with the execution loop, and comprehensive tests.

## Tasks

- [ ] 1. Setup configuration and base models
  - Create policy configuration structure
  - Define execution state enums and data models
  - Set up base directory structure for versioning
  - _Requirements: 6.1, 6.2, 6.3_

  - [x] 1.1 Create policy configuration system
    - Implement auto_heal_policy.json structure
    - Create PolicyConfig class with validation
    - Add configuration loading and hot-reload capability
    - _Requirements: 6.1, 6.4, 6.5_

  - [ ]* 1.2 Write property test for configuration consistency
    - **Property 6: Configuration Consistency**
    - **Validates: Requirements 6.1, 6.2, 6.3, 6.4, 6.5**

  - [x] 1.3 Implement base data models
    - Create ExecutionStateInfo, FailureWindow, VersionInfo dataclasses
    - Implement ExecutionState enum with validation
    - Add serialization/deserialization methods
    - _Requirements: 1.1, 2.1, 4.1_

  - [ ]* 1.4 Write unit tests for data models
    - Test state transitions and validation
    - Test failure window operations
    - Test version info management
    - _Requirements: 1.1, 2.1, 4.1_
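
As an illustration of the validation asked for in 1.1 and the enum in 1.3, here is a minimal sketch. The field names, defaults, and the `NORMAL` state are assumptions made for the example; this plan only names `DEGRADED` and `QUARANTINED`, and it does not define the schema of auto_heal_policy.json. Hot-reload is not covered.

```python
import json
from dataclasses import dataclass, fields
from enum import Enum


class ExecutionState(Enum):
    NORMAL = "normal"            # assumed name, not taken from the plan
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"


@dataclass(frozen=True)
class PolicyConfig:
    # Hypothetical fields and defaults; the real policy schema may differ.
    degraded_after_failures: int = 3
    quarantine_after_failures: int = 10
    window_seconds: int = 600

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{f.name} must be a positive integer, got {value!r}")
        if self.degraded_after_failures > self.quarantine_after_failures:
            raise ValueError("degraded threshold must not exceed quarantine threshold")

    @classmethod
    def from_json(cls, text: str) -> "PolicyConfig":
        """Parse a JSON object, rejecting keys the dataclass does not know."""
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("policy must be a JSON object")
        unknown = set(data) - {f.name for f in fields(cls)}
        if unknown:
            raise ValueError(f"unknown policy keys: {sorted(unknown)}")
        return cls(**data)
```

Validating in `__post_init__` means an invalid policy can never be constructed, whether it comes from a file or from test code, which is one way to keep a later hot-reload from swapping in a broken configuration.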

- [ ] 2. Implement Circuit Breaker system
  - [ ] 2.1 Create CircuitBreaker class with sliding windows
    - Implement failure counting with time windows
    - Add step-level, workflow-level, and global-level tracking
    - Create threshold checking methods
    - _Requirements: 2.1, 2.2, 2.3_

  - [ ]* 2.2 Write property test for circuit breaker thresholds
    - **Property 2: Circuit Breaker Threshold Enforcement**
    - **Validates: Requirements 2.1, 2.2, 2.3**

  - [ ] 2.3 Implement failure window management
    - Create sliding window data structure
    - Add automatic cleanup of expired failures
    - Implement efficient failure counting
    - _Requirements: 2.1, 2.2, 2.3_

  - [ ]* 2.4 Write unit tests for circuit breaker
    - Test sliding window behavior
    - Test threshold triggering
    - Test failure cleanup
    - _Requirements: 2.1, 2.2, 2.3_
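
The sliding-window counting described in 2.1 and 2.3 could be sketched as below. The class name, the string keys used to separate step, workflow, and global scopes, and the injectable `clock` parameter are all assumptions for illustration, not the planned CircuitBreaker API.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowBreaker:
    """Counts failures per key inside a fixed time window (illustrative only)."""

    def __init__(self, threshold: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def _prune(self, key: str) -> None:
        # Timestamps are appended in order, so expired ones sit at the left.
        cutoff = self._clock() - self.window_seconds
        window = self._failures[key]
        while window and window[0] <= cutoff:
            window.popleft()

    def record_failure(self, key: str) -> None:
        self._prune(key)
        self._failures[key].append(self._clock())

    def failure_count(self, key: str) -> int:
        self._prune(key)
        return len(self._failures[key])

    def is_tripped(self, key: str) -> bool:
        return self.failure_count(key) >= self.threshold
```

A caller might record the same failure under several keys, for example `"step:<id>"`, `"workflow:<id>"`, and `"global"`, to get the three tracking levels from one structure. Because a deque pops from the left in constant time, cleanup cost is proportional to the number of expired entries rather than to the window size.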

- [ ] 3. Create Versioned Store system
  - [ ] 3.1 Implement VersionedStore class
    - Create version snapshot functionality
    - Implement rollback operations
    - Add version listing and cleanup
    - _Requirements: 4.1, 4.2, 4.3, 4.6_

  - [ ] 3.2 Implement component versioning
    - Add prototypes versioning (data/learning/prototypes/vNN/)
    - Add FAISS index versioning (data/faiss_index/workflow_<id>/vNN/)
    - Add target memory versioning (SQLite snapshots)
    - _Requirements: 4.1, 4.2, 4.3_

  - [ ]* 3.3 Write property test for rollback consistency
    - **Property 4: Rollback Consistency**
    - **Validates: Requirements 4.1, 4.2, 4.3, 4.5, 4.6**

  - [ ]* 3.4 Write unit tests for versioned store
    - Test snapshot creation
    - Test rollback operations
    - Test version cleanup
    - _Requirements: 4.1, 4.2, 4.3, 4.6_
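
For directory-backed components such as prototypes or FAISS indexes, the snapshot, rollback, and cleanup operations in 3.1 might look like the sketch below. It assumes a `live` working directory next to the `vNN` snapshots and two-digit zero padding; neither detail is specified in this plan, and SQLite snapshots for target memory would need a different mechanism.

```python
import shutil
from pathlib import Path
from typing import List


class DirectoryVersionedStore:
    """Copies a 'live' directory into sibling vNN snapshot directories."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.live = self.root / "live"  # assumed name for the working copy
        self.live.mkdir(parents=True, exist_ok=True)

    def _version_dir(self, number: int) -> Path:
        return self.root / f"v{number:02d}"

    def list_versions(self) -> List[int]:
        versions = []
        for path in self.root.iterdir():
            suffix = path.name[1:]
            if path.is_dir() and path.name.startswith("v") and suffix.isdecimal():
                versions.append(int(suffix))
        return sorted(versions)

    def snapshot(self) -> int:
        """Copy the live directory to the next free version number."""
        versions = self.list_versions()
        number = versions[-1] + 1 if versions else 1
        shutil.copytree(self.live, self._version_dir(number))
        return number

    def rollback(self, number: int) -> None:
        """Replace the live directory with the contents of snapshot `number`."""
        source = self._version_dir(number)
        if not source.is_dir():
            raise FileNotFoundError(f"no snapshot {source.name}")
        shutil.rmtree(self.live)
        shutil.copytree(source, self.live)

    def cleanup(self, keep: int) -> None:
        """Delete all but the `keep` most recent snapshots."""
        if keep < 1:
            raise ValueError("keep must be at least 1")
        for number in self.list_versions()[:-keep]:
            shutil.rmtree(self._version_dir(number))
```

This rollback is not atomic: a crash between `rmtree` and `copytree` leaves no live directory. A production version would likely copy to a temporary directory first and then rename it into place.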

- [ ] 4. Checkpoint - Ensure core components work
  - Ensure all tests pass; ask the user if questions arise.

- [ ] 5. Implement Auto Heal Manager
  - [ ] 5.1 Create AutoHealManager class
    - Implement state machine for execution states
    - Add should_execute_step() and on_step_result() methods
    - Create state transition logic
    - _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6_

  - [ ] 5.2 Implement degraded mode logic
    - Add confidence threshold adjustment
    - Implement learning disable functionality
    - Add strict validation for ambiguous targets
    - _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6_

  - [ ]* 5.3 Write property test for state transitions
    - **Property 1: State Transition Consistency**
    - **Validates: Requirements 1.1, 1.2, 1.3, 1.4, 1.5, 1.6**

  - [ ]* 5.4 Write property test for degraded mode safety
    - **Property 3: Degraded Mode Safety**
    - **Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6**

  - [ ] 5.5 Implement hybrid storage system
    - Add JSONL audit logging for all decisions
    - Add SQLite logging only for validated successes
    - Implement storage filtering based on execution state
    - _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5_

  - [ ]* 5.6 Write property test for hybrid storage integrity
    - **Property 5: Hybrid Storage Integrity**
    - **Validates: Requirements 5.1, 5.2, 5.3, 5.4**
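
The split described in 5.5, where every decision goes to a JSONL audit trail and only validated successes reach SQLite, could be sketched as follows. The class name, table layout, record fields, and the rule that only successes in a `"NORMAL"` state count as validated are assumptions for the example; the actual filtering criteria are defined by Requirements 5.1 to 5.5.

```python
import json
import sqlite3
from pathlib import Path


class HybridDecisionLog:
    """Appends every decision to JSONL; stores only validated successes in SQLite."""

    def __init__(self, jsonl_path: Path, db: sqlite3.Connection) -> None:
        self.jsonl_path = Path(jsonl_path)
        self.db = db
        # Hypothetical table; the real target memory schema is not shown here.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS validated_successes "
            "(workflow_id TEXT, step_id TEXT, confidence REAL)"
        )

    def record(self, workflow_id: str, step_id: str, state: str,
               success: bool, confidence: float) -> bool:
        """Log one decision. Returns True if it was also persisted to SQLite."""
        entry = {
            "workflow_id": workflow_id,
            "step_id": step_id,
            "state": state,
            "success": success,
            "confidence": confidence,
        }
        # The audit trail is unconditional: failures and degraded runs included.
        with self.jsonl_path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(entry) + "\n")

        # Assumed filter: persist only successes observed in the normal state.
        persist = bool(success) and state == "NORMAL"
        if persist:
            with self.db:  # commits on success, rolls back on error
                self.db.execute(
                    "INSERT INTO validated_successes VALUES (?, ?, ?)",
                    (workflow_id, step_id, confidence),
                )
        return persist
```

Keeping the audit write first means a decision is still traceable if the SQLite insert later fails.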

- [ ] 6. Integrate with existing systems
  - [ ] 6.1 Add Fiche #19 FailureCase integration
    - Integrate with failure_case_recorder.py
    - Add automatic capture on quarantine events
    - Include auto-heal context in failure records
    - _Requirements: 7.1, 2.4_

  - [ ] 6.2 Add Fiche #16 simulation report integration
    - Generate mini reports on circuit breaker triggers
    - Include scenario data when available
    - Add auto-heal specific report sections
    - _Requirements: 7.2, 2.5_

  - [ ] 6.3 Add Fiche #18 persistent learning integration
    - Integrate with target_memory_store.py for rollback decisions
    - Add learning success rate monitoring
    - Implement regression detection
    - _Requirements: 7.3, 4.4_

  - [ ] 6.4 Add Fiche #10 precision metrics integration
    - Use precision metrics for confidence scoring
    - Integrate with metrics_engine.py
    - Add auto-heal metrics to precision dashboard
    - _Requirements: 7.4_

  - [ ]* 6.5 Write property test for integration compatibility
    - **Property 7: Integration Compatibility**
    - **Validates: Requirements 7.1, 7.2, 7.3, 7.4, 7.5**

- [ ] 7. Add execution loop hooks
  - [ ] 7.1 Modify action_executor.py integration
    - Add should_execute_step() call before action execution
    - Add on_step_result() call after action completion
    - Implement execution blocking for quarantined workflows
    - _Requirements: 7.5, 1.4_

  - [ ] 7.2 Implement trigger detection
    - Add TARGET_NOT_FOUND detection and counting
    - Add POSTCONDITION_FAILED detection and counting
    - Add WATCHDOG_TIMEOUT detection
    - Add low-confidence detection from FAISS matches
    - _Requirements: 9.1, 9.2, 9.3, 9.4_

  - [ ]* 7.3 Write integration tests for execution hooks
    - Test execution flow with auto-heal manager
    - Test blocking of quarantined workflows
    - Test state transitions during execution
    - _Requirements: 7.5, 1.4, 1.5_
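
The hook placement in 7.1 can be sketched with a deliberately small stand-in for the manager. Everything here is an assumption for illustration: the method signatures, the `NORMAL` state name, the recovery rule on success, and the use of consecutive failures instead of the time-windowed counts the real circuit breaker is meant to apply.

```python
from enum import Enum
from typing import Callable


class ExecutionState(Enum):
    NORMAL = "normal"            # assumed name, not taken from the plan
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"


class MiniAutoHealManager:
    """Toy stand-in: tracks consecutive failures only, with no time window."""

    def __init__(self, degrade_after: int = 3, quarantine_after: int = 10) -> None:
        self.degrade_after = degrade_after
        self.quarantine_after = quarantine_after
        self.state = ExecutionState.NORMAL
        self.consecutive_failures = 0

    def should_execute_step(self) -> bool:
        return self.state is not ExecutionState.QUARANTINED

    def on_step_result(self, success: bool) -> None:
        if self.state is ExecutionState.QUARANTINED:
            return  # assumed: quarantine is only lifted by an explicit reset
        if success:
            # Assumed recovery rule: one success returns to NORMAL.
            self.consecutive_failures = 0
            self.state = ExecutionState.NORMAL
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.quarantine_after:
            self.state = ExecutionState.QUARANTINED
        elif self.consecutive_failures >= self.degrade_after:
            self.state = ExecutionState.DEGRADED


def run_step(manager: MiniAutoHealManager, action: Callable[[], bool]) -> str:
    """Wrap one action with the pre-check and the post-result hook."""
    if not manager.should_execute_step():
        return "blocked"
    try:
        success = action()
    except Exception:
        success = False  # an exception is treated as a failed step
    manager.on_step_result(success)
    return "ok" if success else "failed"
```

Routing exceptions through `on_step_result` as failures keeps the manager's counters accurate even when an action crashes instead of returning a result.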

- [ ] 8. Implement monitoring and metrics
  - [ ] 8.1 Add status reporting
    - Implement get_status_report() method
    - Add real-time state monitoring
    - Create health endpoint integration
    - _Requirements: 8.3, 8.4_

  - [ ] 8.2 Add metrics collection
    - Track state transition history and durations
    - Maintain sliding window counters
    - Add performance metrics for auto-heal operations
    - _Requirements: 8.1, 8.2, 8.5_

  - [ ]* 8.3 Write unit tests for monitoring
    - Test status reporting accuracy
    - Test metrics collection
    - Test alert generation
    - _Requirements: 8.1, 8.2, 8.3, 8.4_

- [ ] 9. Create test scenarios and validation
  - [ ] 9.1 Implement forced failure scenarios
    - Create test workflow with non-existent targets
    - Add configuration for test mode
    - Implement failure injection capabilities
    - _Requirements: 10.1, 10.2_

  - [ ] 9.2 Create degraded mode test
    - Force 3 consecutive TARGET_NOT_FOUND failures
    - Verify transition to DEGRADED state
    - Validate increased confidence thresholds
    - _Requirements: 10.2, 3.1, 3.2_

  - [ ] 9.3 Create quarantine test
    - Force 10 failures in a 10-minute window
    - Verify transition to QUARANTINED state
    - Validate FailureCase creation
    - _Requirements: 10.3, 2.2, 2.4_

  - [ ] 9.4 Create rollback test
    - Intentionally degrade prototype quality
    - Trigger automatic rollback on performance drop
    - Verify restoration of previous versions
    - _Requirements: 10.4, 4.4, 4.5_

  - [ ]* 9.5 Write comprehensive integration tests
    - Test complete auto-heal workflow scenarios
    - Test interaction with all integrated systems
    - Test configuration changes and hot-reload
    - _Requirements: 10.1, 10.2, 10.3, 10.4, 10.5_

- [ ] 10. Final checkpoint and documentation
  - [ ] 10.1 Create demo script
    - Implement demo_auto_heal_hybrid.py
    - Show all state transitions and features
    - Include performance metrics and monitoring
    - _Requirements: 10.1, 10.2, 10.3, 10.4, 10.5_

  - [ ] 10.2 Update system integration
    - Update run.sh and launch scripts
    - Add auto-heal configuration to deployment
    - Update monitoring dashboard integration
    - _Requirements: 7.5, 8.5_

  - [ ] 10.3 Final validation
    - Run complete test suite
    - Validate all integration points
    - Test production deployment scenario
    - _Requirements: 10.5_

- [ ] 11. Final checkpoint - Ensure all tests pass
  - Ensure all tests pass; ask the user if questions arise.

## Notes

- Tasks marked with `*` are optional and can be skipped for a faster MVP
- Each task references specific requirements for traceability
- Checkpoints ensure incremental validation
- Property tests validate universal correctness properties
- Unit tests validate specific examples and edge cases
- Integration tests verify compatibility with existing systems (Fiche #19, #18, #16, #10)
- The implementation follows the hybrid approach: continue when safe, degrade when uncertain, stop when dangerous