rpa_vision_v3/.kiro/specs/auto-heal-hybrid/tasks.md
Dom a7de6a488b feat: working E2E replay: 25/25 actions, 0 retries, SomEngine via server
Validated on a Windows PC (DESKTOP-58D5CAC, 2560x1600):
- 8 clicks resolved visually (1 anchor_template, 1 som_text_match, 6 som_vlm)
- Average score 0.75, average time 1.6s
- Text typed correctly (bonjour, test word, date, email)
- 0 retries, 2 unverified actions (OK)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:04:41 +02:00


Implementation Plan: Hybrid Auto-Heal (Fiche #22)

Overview

Implementation of the hybrid auto-healing system, which balances continuity and safety. The plan follows an incremental approach: configuration and base models, circuit breaker, main manager, versioning system, integration with the execution loop, and complete tests.

Tasks

  • 1. Setup configuration and base models

    • Create policy configuration structure
    • Define execution state enums and data models
    • Set up base directory structure for versioning
    • Requirements: 6.1, 6.2, 6.3
  • 1.1 Create policy configuration system

    • Implement auto_heal_policy.json structure
    • Create PolicyConfig class with validation
    • Add configuration loading and hot-reload capability
    • Requirements: 6.1, 6.4, 6.5
  • * 1.2 Write property test for configuration consistency

    • Property 6: Configuration Consistency
    • Validates: Requirements 6.1, 6.2, 6.3, 6.4, 6.5
  • 1.3 Implement base data models

    • Create ExecutionStateInfo, FailureWindow, VersionInfo dataclasses
    • Implement ExecutionState enum with validation
    • Add serialization/deserialization methods
    • Requirements: 1.1, 2.1, 4.1
  • * 1.4 Write unit tests for data models

    • Test state transitions and validation
    • Test failure window operations
    • Test version info management
    • Requirements: 1.1, 2.1, 4.1
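
As a rough illustration of tasks 1.1 and 1.3, the sketch below shows one way a validated policy object and the state enum could look. The field names, the `RUNNING` state, and the default values are assumptions made for this example (the defaults echo the thresholds used in tasks 9.2 and 9.3); the actual `auto_heal_policy.json` schema is not specified in this plan.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class ExecutionState(Enum):
    # DEGRADED and QUARANTINED are named in this plan; RUNNING is an assumed
    # name for the normal state.
    RUNNING = "running"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"


@dataclass(frozen=True)
class PolicyConfig:
    # Hypothetical field names; defaults mirror the scenarios in tasks 9.2/9.3.
    degraded_after_consecutive_failures: int = 3
    quarantine_after_failures: int = 10
    quarantine_window_seconds: int = 600

    def __post_init__(self):
        # Reject non-integers (including JSON booleans) and non-positive values.
        for name, value in asdict(self).items():
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")
        if self.degraded_after_consecutive_failures > self.quarantine_after_failures:
            raise ValueError("degraded threshold must not exceed quarantine threshold")

    @classmethod
    def from_json(cls, text: str) -> "PolicyConfig":
        raw = json.loads(text)
        unknown = set(raw) - set(cls.__dataclass_fields__)
        if unknown:
            raise ValueError(f"unknown policy keys: {sorted(unknown)}")
        return cls(**raw)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Because the object is immutable and validated on construction, a hot-reload could build a new `PolicyConfig` from the changed file and swap it in only if validation succeeds, keeping the previous policy otherwise.
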
  • 2. Implement Circuit Breaker system

    • 2.1 Create CircuitBreaker class with sliding windows

      • Implement failure counting with time windows
      • Add step-level, workflow-level, and global-level tracking
      • Create threshold checking methods
      • Requirements: 2.1, 2.2, 2.3
    • * 2.2 Write property test for circuit breaker thresholds

      • Property 2: Circuit Breaker Threshold Enforcement
      • Validates: Requirements 2.1, 2.2, 2.3
    • 2.3 Implement failure window management

      • Create sliding window data structure
      • Add automatic cleanup of expired failures
      • Implement efficient failure counting
      • Requirements: 2.1, 2.2, 2.3
    • * 2.4 Write unit tests for circuit breaker

      • Test sliding window behavior
      • Test threshold triggering
      • Test failure cleanup
      • Requirements: 2.1, 2.2, 2.3
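
The sliding-window counting described in tasks 2.1 and 2.3 could be sketched as follows. This is a minimal version under stated assumptions: timestamps are passed in explicitly and are non-decreasing, one threshold is shared by all three levels, and the key layout (`"step"`, `"workflow"`, `"global"`) is invented for the example.

```python
from collections import deque


class FailureWindow:
    """Failure timestamps (in seconds) kept inside a sliding time window."""

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def _evict(self, now: float) -> None:
        # Drop failures that are window_seconds old or older.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()

    def record(self, now: float) -> None:
        self._evict(now)
        self._timestamps.append(now)

    def count(self, now: float) -> int:
        self._evict(now)
        return len(self._timestamps)

    def tripped(self, now: float) -> bool:
        return self.count(now) >= self.threshold


class CircuitBreaker:
    """Tracks failures per step, per workflow, and globally."""

    def __init__(self, window_seconds: float = 600.0, threshold: int = 10):
        self._window_seconds = window_seconds
        self._threshold = threshold
        self._windows = {}

    @staticmethod
    def _keys(workflow_id: str, step_id: str):
        return (("step", workflow_id, step_id), ("workflow", workflow_id), ("global",))

    def _window(self, key) -> FailureWindow:
        if key not in self._windows:
            self._windows[key] = FailureWindow(self._window_seconds, self._threshold)
        return self._windows[key]

    def record_failure(self, workflow_id: str, step_id: str, now: float) -> None:
        for key in self._keys(workflow_id, step_id):
            self._window(key).record(now)

    def is_open(self, workflow_id: str, step_id: str, now: float) -> bool:
        return any(self._window(k).tripped(now) for k in self._keys(workflow_id, step_id))
```

Expired entries are evicted lazily on each read or write, so counting stays proportional to the number of failures that actually leave the window rather than to its total size.
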
  • 3. Create Versioned Store system

    • 3.1 Implement VersionedStore class

      • Create version snapshot functionality
      • Implement rollback operations
      • Add version listing and cleanup
      • Requirements: 4.1, 4.2, 4.3, 4.6
    • 3.2 Implement component versioning

      • Add prototypes versioning (data/learning/prototypes/vNN/)
      • Add FAISS index versioning (data/faiss_index/workflow_/vNN/)
      • Add target memory versioning (SQLite snapshots)
      • Requirements: 4.1, 4.2, 4.3
    • * 3.3 Write property test for rollback consistency

      • Property 4: Rollback Consistency
      • Validates: Requirements 4.1, 4.2, 4.3, 4.5, 4.6
    • * 3.4 Write unit tests for versioned store

      • Test snapshot creation
      • Test rollback operations
      • Test version cleanup
      • Requirements: 4.1, 4.2, 4.3, 4.6
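
To make the snapshot and rollback operations of task 3 concrete, here is a small directory-based sketch. It assumes a layout with a live `current/` directory next to `vNN/` snapshot directories; the `current` name and the method names are hypothetical, and the rollback shown is not atomic.

```python
import shutil
from pathlib import Path


class VersionedStore:
    """Snapshots of a 'current' directory stored as v01, v02, ... siblings."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.current = self.root / "current"
        self.current.mkdir(parents=True, exist_ok=True)

    def list_versions(self) -> list:
        names = [
            p.name
            for p in self.root.iterdir()
            if p.is_dir() and p.name.startswith("v") and p.name[1:].isdigit()
        ]
        return sorted(names, key=lambda n: int(n[1:]))

    def snapshot(self) -> str:
        # Next version number is one past the highest existing snapshot.
        versions = self.list_versions()
        next_number = int(versions[-1][1:]) + 1 if versions else 1
        name = f"v{next_number:02d}"
        shutil.copytree(self.current, self.root / name)
        return name

    def rollback(self, name: str) -> None:
        if name not in self.list_versions():
            raise KeyError(name)
        # Replace the live directory with a copy of the chosen snapshot.
        shutil.rmtree(self.current)
        shutil.copytree(self.root / name, self.current)

    def cleanup(self, keep: int) -> list:
        # Delete the oldest snapshots, keeping the `keep` most recent ones.
        versions = self.list_versions()
        doomed = versions[:-keep] if keep > 0 else versions
        for name in doomed:
            shutil.rmtree(self.root / name)
        return doomed
```

A production version would likely copy into a temporary directory and rename it into place, so that a crash during rollback cannot leave `current/` missing.
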
  • 4. Checkpoint - Ensure core components work

    • Ensure all tests pass; ask the user if questions arise.
  • 5. Implement Auto Heal Manager

    • 5.1 Create AutoHealManager class

      • Implement state machine for execution states
      • Add should_execute_step() and on_step_result() methods
      • Create state transition logic
      • Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
    • 5.2 Implement degraded mode logic

      • Add confidence threshold adjustment
      • Implement learning disable functionality
      • Add strict validation for ambiguous targets
      • Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6
    • * 5.3 Write property test for state transitions

      • Property 1: State Transition Consistency
      • Validates: Requirements 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
    • * 5.4 Write property test for degraded mode safety

      • Property 3: Degraded Mode Safety
      • Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6
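
The state machine and degraded-mode behaviour from tasks 5.1 and 5.2 might look roughly like the sketch below. Only the method names `should_execute_step()` and `on_step_result()` and the `DEGRADED` and `QUARANTINED` states come from this plan; their signatures, the `RUNNING` state, the confidence values, the use of a plain failure counter instead of a sliding window, and the rule that one success leaves degraded mode are all assumptions.

```python
from enum import Enum


class ExecutionState(Enum):
    RUNNING = "running"  # assumed name for the normal state
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"


class AutoHealManager:
    def __init__(self, degrade_after=3, quarantine_after=10,
                 normal_confidence=0.6, degraded_confidence=0.8):
        self.degrade_after = degrade_after
        self.quarantine_after = quarantine_after
        self.normal_confidence = normal_confidence
        self.degraded_confidence = degraded_confidence
        self.state = ExecutionState.RUNNING
        self.consecutive_failures = 0
        self.total_failures = 0  # stand-in for a sliding-window count

    @property
    def learning_enabled(self) -> bool:
        # Learning is switched off as soon as the workflow leaves RUNNING.
        return self.state is ExecutionState.RUNNING

    @property
    def confidence_threshold(self) -> float:
        if self.state is ExecutionState.DEGRADED:
            return self.degraded_confidence
        return self.normal_confidence

    def should_execute_step(self, match_confidence: float) -> bool:
        if self.state is ExecutionState.QUARANTINED:
            return False
        return match_confidence >= self.confidence_threshold

    def on_step_result(self, success: bool) -> ExecutionState:
        if self.state is ExecutionState.QUARANTINED:
            return self.state  # sticky until an operator intervenes
        if success:
            self.consecutive_failures = 0
            self.state = ExecutionState.RUNNING
            return self.state
        self.consecutive_failures += 1
        self.total_failures += 1
        if self.total_failures >= self.quarantine_after:
            self.state = ExecutionState.QUARANTINED
        elif self.consecutive_failures >= self.degrade_after:
            self.state = ExecutionState.DEGRADED
        return self.state
```

Keeping the threshold and the learning flag as properties derived from the state means there is a single source of truth: a transition cannot leave the manager degraded but still learning.
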
    • 5.5 Implement hybrid storage system

      • Add JSONL audit logging for all decisions
      • Add SQLite logging only for validated successes
      • Implement storage filtering based on execution state
      • Requirements: 5.1, 5.2, 5.3, 5.4, 5.5
    • * 5.6 Write property test for hybrid storage integrity

      • Property 5: Hybrid Storage Integrity
      • Validates: Requirements 5.1, 5.2, 5.3, 5.4
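
The split described in task 5.5 (audit everything, persist only validated successes) can be illustrated with the following sketch. The table name, the record fields, and the rule that SQLite writes happen only in the `"running"` state are assumptions for the example; the plan states only that storage is filtered by execution state.

```python
import json
import sqlite3
from pathlib import Path


class HybridStorage:
    """JSONL audit trail for every decision, SQLite for validated successes."""

    def __init__(self, audit_path, db_path):
        self.audit_path = Path(audit_path)
        self.db = sqlite3.connect(str(db_path))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS validated_success ("
            "step_id TEXT NOT NULL, confidence REAL NOT NULL)"
        )
        self.db.commit()

    def record(self, step_id: str, state: str, success: bool,
               validated: bool, confidence: float) -> bool:
        # 1. Always append to the audit log, whatever the outcome.
        entry = {
            "step_id": step_id,
            "state": state,
            "success": success,
            "validated": validated,
            "confidence": confidence,
        }
        with self.audit_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")

        # 2. Persist to SQLite only for validated successes in normal mode.
        persist = bool(success and validated and state == "running")
        if persist:
            self.db.execute(
                "INSERT INTO validated_success (step_id, confidence) VALUES (?, ?)",
                (step_id, confidence),
            )
            self.db.commit()
        return persist
```

With this separation, the audit file remains a complete record for post-mortems, while the database that feeds learning never sees results produced under degraded conditions.
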
  • 6. Integrate with existing systems

    • 6.1 Add Fiche #19 FailureCase integration

      • Integrate with failure_case_recorder.py
      • Add automatic capture on quarantine events
      • Include auto-heal context in failure records
      • Requirements: 7.1, 2.4
    • 6.2 Add Fiche #16 simulation report integration

      • Generate mini reports on circuit breaker triggers
      • Include scenario data when available
      • Add auto-heal specific report sections
      • Requirements: 7.2, 2.5
    • 6.3 Add Fiche #18 persistent learning integration

      • Integrate with target_memory_store.py for rollback decisions
      • Add learning success rate monitoring
      • Implement regression detection
      • Requirements: 7.3, 4.4
    • 6.4 Add Fiche #10 precision metrics integration

      • Use precision metrics for confidence scoring
      • Integrate with metrics_engine.py
      • Add auto-heal metrics to precision dashboard
      • Requirements: 7.4
    • * 6.5 Write property test for integration compatibility

      • Property 7: Integration Compatibility
      • Validates: Requirements 7.1, 7.2, 7.3, 7.4, 7.5
  • 7. Add execution loop hooks

    • 7.1 Modify action_executor.py integration

      • Add should_execute_step() call before action execution
      • Add on_step_result() call after action completion
      • Implement execution blocking for quarantined workflows
      • Requirements: 7.5, 1.4
    • 7.2 Implement trigger detection

      • Add TARGET_NOT_FOUND detection and counting
      • Add POSTCONDITION_FAILED detection and counting
      • Add WATCHDOG_TIMEOUT detection
      • Add low confidence detection from FAISS matches
      • Requirements: 9.1, 9.2, 9.3, 9.4
    • * 7.3 Write integration tests for execution hooks

      • Test execution flow with auto-heal manager
      • Test blocking of quarantined workflows
      • Test state transitions during execution
      • Requirements: 7.5, 1.4, 1.5
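
A possible shape for the hooks in tasks 7.1 and 7.2 is sketched below. The trigger names come from this plan, but the exception classes, the hook signatures, the `execute_with_hooks` helper, and the convention that an action returns its match confidence are hypothetical; the real `action_executor.py` interface is not shown in this document.

```python
from enum import Enum
from typing import Callable, Optional, Protocol


class Trigger(Enum):
    TARGET_NOT_FOUND = "TARGET_NOT_FOUND"
    POSTCONDITION_FAILED = "POSTCONDITION_FAILED"
    WATCHDOG_TIMEOUT = "WATCHDOG_TIMEOUT"
    LOW_CONFIDENCE = "LOW_CONFIDENCE"


class TargetNotFound(Exception):
    """Hypothetical error raised when no UI target matches."""


class PostconditionFailed(Exception):
    """Hypothetical error raised when the expected result is not observed."""


class HealHooks(Protocol):
    def should_execute_step(self, step_id: str) -> bool: ...
    def on_step_result(self, step_id: str, trigger: Optional[Trigger]) -> None: ...


def execute_with_hooks(hooks: HealHooks, step_id: str,
                       action: Callable[[], float],
                       min_confidence: float = 0.8) -> bool:
    """Run `action` between the two hooks; return True on a clean success."""
    if not hooks.should_execute_step(step_id):
        return False  # blocked (e.g. quarantined): the action never runs

    trigger: Optional[Trigger]
    try:
        confidence = action()  # assumed to return the match confidence
    except TargetNotFound:
        trigger = Trigger.TARGET_NOT_FOUND
    except PostconditionFailed:
        trigger = Trigger.POSTCONDITION_FAILED
    except TimeoutError:
        trigger = Trigger.WATCHDOG_TIMEOUT
    else:
        trigger = Trigger.LOW_CONFIDENCE if confidence < min_confidence else None

    hooks.on_step_result(step_id, trigger)
    return trigger is None
```

Routing every outcome through a single `on_step_result()` call keeps the counting logic in one place, so the executor itself does not need to know about thresholds or states.
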
  • 8. Implement monitoring and metrics

    • 8.1 Add status reporting

      • Implement get_status_report() method
      • Add real-time state monitoring
      • Create health endpoint integration
      • Requirements: 8.3, 8.4
    • 8.2 Add metrics collection

      • Track state transition history and durations
      • Maintain sliding window counters
      • Add performance metrics for auto-heal operations
      • Requirements: 8.1, 8.2, 8.5
    • * 8.3 Write unit tests for monitoring

      • Test status reporting accuracy
      • Test metrics collection
      • Test alert generation
      • Requirements: 8.1, 8.2, 8.3, 8.4
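
For the status reporting in tasks 8.1 and 8.2, one simple approach is to record each transition with its duration and aggregate on demand. In the sketch below only `get_status_report()` is named in this plan; the `StateMonitor` class, the report keys, and the explicit `now` arguments are assumptions chosen to keep the example deterministic.

```python
from collections import Counter


class StateMonitor:
    """Keeps transition history and time spent in each state."""

    def __init__(self, initial_state: str, now: float):
        self._state = initial_state
        self._since = now
        self._history = []

    def transition(self, new_state: str, now: float) -> None:
        # Close the current state's interval before switching.
        self._history.append(
            {"from": self._state, "to": new_state, "duration_s": now - self._since}
        )
        self._state = new_state
        self._since = now

    def get_status_report(self, now: float) -> dict:
        time_per_state = Counter()
        for item in self._history:
            time_per_state[item["from"]] += item["duration_s"]
        # Include the still-open interval of the current state.
        time_per_state[self._state] += now - self._since
        return {
            "state": self._state,
            "seconds_in_state": now - self._since,
            "transitions": len(self._history),
            "time_per_state": dict(time_per_state),
        }
```

A health endpoint could serialize the returned dictionary directly, since it contains only strings and numbers.
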
  • 9. Create test scenarios and validation

    • 9.1 Implement forced failure scenarios

      • Create test workflow with non-existent targets
      • Add configuration for test mode
      • Implement failure injection capabilities
      • Requirements: 10.1, 10.2
    • 9.2 Create degraded mode test

      • Force 3 consecutive TARGET_NOT_FOUND failures
      • Verify transition to DEGRADED state
      • Validate increased confidence thresholds
      • Requirements: 10.2, 3.1, 3.2
    • 9.3 Create quarantine test

      • Force 10 failures in 10-minute window
      • Verify transition to QUARANTINED state
      • Validate FailureCase creation
      • Requirements: 10.3, 2.2, 2.4
    • 9.4 Create rollback test

      • Intentionally degrade prototype quality
      • Trigger automatic rollback on performance drop
      • Verify restoration of previous versions
      • Requirements: 10.4, 4.4, 4.5
    • * 9.5 Write comprehensive integration tests

      • Test complete auto-heal workflow scenarios
      • Test interaction with all integrated systems
      • Test configuration changes and hot-reload
      • Requirements: 10.1, 10.2, 10.3, 10.4, 10.5
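
The failure injection mentioned in task 9.1 could be as small as a wrapper that makes the first N calls of an action fail. The `FailureInjector` name and its parameters are hypothetical, and `LookupError` stands in here for whatever error the real executor raises on TARGET_NOT_FOUND.

```python
from typing import Callable


class FailureInjector:
    """Wraps an action so that its first `fail_first` calls raise an error."""

    def __init__(self, action: Callable, fail_first: int,
                 error_factory: Callable[[], Exception]):
        self.action = action
        self.fail_first = fail_first
        self.error_factory = error_factory
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls <= self.fail_first:
            raise self.error_factory()
        return self.action(*args, **kwargs)


def run_scenario(action: Callable, attempts: int) -> list:
    """Call `action` repeatedly and collect results or error messages."""
    outcomes = []
    for _ in range(attempts):
        try:
            outcomes.append(action())
        except LookupError as exc:
            outcomes.append(str(exc))
    return outcomes
```

For the degraded-mode test in task 9.2, wrapping a click action with `fail_first=3` would produce the three consecutive failures the scenario calls for, followed by normal behaviour.
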
  • 10. Final checkpoint and documentation

    • 10.1 Create demo script

      • Implement demo_auto_heal_hybrid.py
      • Show all state transitions and features
      • Include performance metrics and monitoring
      • Requirements: 10.1, 10.2, 10.3, 10.4, 10.5
    • 10.2 Update system integration

      • Update run.sh and launch scripts
      • Add auto-heal configuration to deployment
      • Update monitoring dashboard integration
      • Requirements: 7.5, 8.5
    • 10.3 Final validation

      • Run complete test suite
      • Validate all integration points
      • Test production deployment scenario
      • Requirements: 10.5
  • 11. Final checkpoint - Ensure all tests pass

    • Ensure all tests pass; ask the user if questions arise.

Notes

  • Tasks marked with * are optional and can be skipped for a faster MVP
  • Each task references specific requirements for traceability
  • Checkpoints ensure incremental validation
  • Property tests validate universal correctness properties
  • Unit tests validate specific examples and edge cases
  • Integration tests verify compatibility with existing systems (Fiche #19, #18, #16, #10)
  • The implementation follows the hybrid approach: continue when safe, degrade when uncertain, stop when dangerous