Design Document - Hybrid Auto-Heal (Fiche #22)

Overview

The Hybrid Auto-Heal system balances service continuity against safety. It uses a state machine to manage transitions between execution modes, circuit breakers to prevent infinite loops, and a versioning system so that learning can be rolled back.

The architecture builds on the existing systems (Fiche #19 for failure capture, Fiche #18 for persistent learning, Fiche #16 for reports) and adds an intelligent supervision and protection layer on top of them.

Architecture

graph TB
    subgraph "Auto-Heal Hybrid System"
        AHM[Auto Heal Manager]
        CB[Circuit Breaker]
        VS[Versioned Store]
        PC[Policy Config]
    end
    
    subgraph "Execution Layer"
        EL[Execution Loop]
        AE[Action Executor]
        TR[Target Resolver]
    end
    
    subgraph "Learning Layer"
        TMS[Target Memory Store]
        FAISS[FAISS Index]
        PROTO[Prototypes]
    end
    
    subgraph "Integration Layer"
        FCR[Failure Case Recorder]
        SR[Simulation Report]
        PM[Precision Metrics]
    end
    
    EL --> AHM
    AHM --> CB
    AHM --> VS
    AHM --> PC
    
    AHM --> FCR
    AHM --> SR
    AHM --> PM
    
    VS --> TMS
    VS --> FAISS
    VS --> PROTO
    
    AHM --> AE
    AE --> TR

Components and Interfaces

1. Auto Heal Manager (core/system/auto_heal_manager.py)

Responsibility: Central manager for execution states and safety policies.

from enum import Enum
from pathlib import Path
from typing import Any, Dict, Tuple

class ExecutionState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"
    ROLLBACK = "rollback"
    PAUSED = "paused"

class AutoHealManager:
    """Interface only; method bodies are intentionally omitted."""
    # ExecutionResult is defined elsewhere in the codebase, hence the quoted annotation.
    def __init__(self, policy_path: Path = Path("data/config/auto_heal_policy.json")) -> None: ...
    def should_execute_step(self, workflow_id: str, step_id: str) -> Tuple[bool, str]: ...
    def on_step_result(self, workflow_id: str, step_id: str, result: "ExecutionResult") -> None: ...
    def get_mode(self, workflow_id: str) -> ExecutionState: ...
    def force_transition(self, workflow_id: str, new_state: ExecutionState, reason: str) -> None: ...
    def get_status_report(self) -> Dict[str, Any]: ...

Integration with the execution loop:

# In execution_loop.py or action_executor.py
allowed, reason = auto_heal_manager.should_execute_step(workflow_id, step_id)
if not allowed:
    return ExecutionResult(status=ExecutionStatus.BLOCKED, message=reason)

# Execute the action...
result = execute_action(...)

# After execution
auto_heal_manager.on_step_result(workflow_id, step_id, result)
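The hook sequence can be wrapped in a single helper so that every step goes through the same gate. The following is a minimal sketch, not the project's implementation: `StepResult`, `RecordingManager`, and `run_guarded_step` are hypothetical names introduced for illustration, and treating an exception as a failed step is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class StepResult:
    """Hypothetical stand-in for the project's ExecutionResult."""
    status: str            # "success", "failed", or "blocked"
    message: str = ""


class RecordingManager:
    """Hypothetical stand-in exposing the two hooks used by the loop."""

    def __init__(self, blocked_workflows: Set[str]):
        self.blocked = blocked_workflows
        self.seen: List[Tuple[str, str, str]] = []

    def should_execute_step(self, workflow_id: str, step_id: str) -> Tuple[bool, str]:
        if workflow_id in self.blocked:
            return False, f"workflow {workflow_id} is quarantined"
        return True, "ok"

    def on_step_result(self, workflow_id: str, step_id: str, result: StepResult) -> None:
        self.seen.append((workflow_id, step_id, result.status))


def run_guarded_step(manager, workflow_id: str, step_id: str,
                     action: Callable[[], StepResult]) -> StepResult:
    """Ask the manager first, run the action, then always report the outcome."""
    allowed, reason = manager.should_execute_step(workflow_id, step_id)
    if not allowed:
        return StepResult(status="blocked", message=reason)
    try:
        result = action()
    except Exception as exc:  # assumption: an exception counts as a failed step
        result = StepResult(status="failed", message=str(exc))
    manager.on_step_result(workflow_id, step_id, result)
    return result
```

Blocked steps are not reported back to the manager in this sketch, so a quarantined workflow does not accumulate further failures while it is blocked.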

2. Circuit Breaker (core/system/circuit_breaker.py)

Responsibility: Anti-loop mechanism based on sliding windows.

from typing import Any, Dict

class CircuitBreaker:
    """Interface only; method bodies are intentionally omitted."""
    def __init__(self, policy: Dict[str, Any]) -> None: ...
    def record_failure(self, workflow_id: str, step_id: str, failure_type: str) -> None: ...
    def record_success(self, workflow_id: str, step_id: str) -> None: ...
    def should_trigger_degraded(self, workflow_id: str, step_id: str) -> bool: ...
    def should_trigger_quarantine(self, workflow_id: str) -> bool: ...
    def should_trigger_global_pause(self) -> bool: ...
    def get_failure_counts(self, workflow_id: str) -> Dict[str, int]: ...

Sliding windows:

  • Step level: 3 consecutive failures → DEGRADED
  • Workflow level: 10 failures within 10 minutes → QUARANTINED
  • Global level: 30 failures within 10 minutes → PAUSED (optional)
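The step-level and workflow-level rules could be sketched as follows. This is an illustration under assumptions, not the actual `circuit_breaker.py`: the class name `SlidingCircuitBreaker`, the explicit `now` parameter (added to make the behaviour testable), and the choice to trigger when the count reaches the limit rather than exceeds it are all hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class SlidingCircuitBreaker:
    """Sketch: per-step failure streaks plus a per-workflow sliding time window."""

    def __init__(self, streak_to_degraded: int = 3,
                 window_s: float = 600.0, max_in_window: int = 10):
        self.streak_to_degraded = streak_to_degraded
        self.window_s = window_s
        self.max_in_window = max_in_window
        self._streaks: Dict[Tuple[str, str], int] = defaultdict(int)
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, workflow_id: str, step_id: str,
                       now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._streaks[(workflow_id, step_id)] += 1
        self._failures[workflow_id].append(now)

    def record_success(self, workflow_id: str, step_id: str) -> None:
        # A success resets the consecutive-failure streak for that step only;
        # the workflow-level window keeps its timestamps.
        self._streaks[(workflow_id, step_id)] = 0

    def should_trigger_degraded(self, workflow_id: str, step_id: str) -> bool:
        return self._streaks[(workflow_id, step_id)] >= self.streak_to_degraded

    def should_trigger_quarantine(self, workflow_id: str,
                                  now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        window = self._failures[workflow_id]
        # Drop timestamps that have slid out of the window.
        while window and now - window[0] > self.window_s:
            window.popleft()
        return len(window) >= self.max_in_window
```

A deque of timestamps keeps pruning cheap because expired entries are always at the left end. The global level would follow the same pattern with a single shared deque.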

3. Versioned Store (core/learning/versioned_store.py)

Responsibility: Versioning system that makes learning reversible.

from pathlib import Path
from typing import List, Optional

class VersionedStore:
    """Interface only; method bodies are intentionally omitted."""
    def __init__(self, base_path: Path = Path("data")) -> None: ...
    def snapshot_version(self, workflow_id: str) -> str: ...
    def rollback_to_previous(self, workflow_id: str, version: Optional[str] = None) -> bool: ...
    def list_versions(self, workflow_id: str) -> List["VersionInfo"]: ...
    def cleanup_old_versions(self, workflow_id: str, keep_count: int = 5) -> None: ...

    # Component versioning
    def version_prototypes(self, workflow_id: str, version: str) -> None: ...
    def version_faiss_index(self, workflow_id: str, version: str) -> None: ...
    def version_target_memory(self, workflow_id: str, version: str) -> None: ...

Versioning layout:

data/
├── learning/
│   └── prototypes/
│       ├── v001/  # Version snapshots
│       └── v002/
├── faiss_index/
│   └── workflow_<id>/
│       ├── v001/  # Versioned indices
│       └── v002/
└── target_memory_snapshots/
    ├── v001.db   # SQLite snapshots
    └── v002.db
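For the directory-based components, a snapshot under this layout can be as simple as copying the live data into the next `vNNN` folder. The sketch below assumes three-digit, monotonically increasing ids as shown in the tree; `next_version_id` and `snapshot_dir` are hypothetical helpers, not part of `VersionedStore`.

```python
import re
import shutil
from pathlib import Path

_VERSION_RE = re.compile(r"^v(\d{3,})$")


def next_version_id(component_dir: Path) -> str:
    """Return the id that follows the highest existing 'vNNN' snapshot."""
    highest = 0
    if component_dir.exists():
        for entry in component_dir.iterdir():
            match = _VERSION_RE.match(entry.name)
            if entry.is_dir() and match:
                highest = max(highest, int(match.group(1)))
    return f"v{highest + 1:03d}"


def snapshot_dir(live_dir: Path, component_dir: Path) -> str:
    """Copy the live data into a new versioned directory and return its id."""
    component_dir.mkdir(parents=True, exist_ok=True)
    version = next_version_id(component_dir)
    shutil.copytree(live_dir, component_dir / version)
    return version
```

The SQLite snapshots need more care than a file copy: copying a database file while it is being written can capture an inconsistent state, and `sqlite3.Connection.backup` in the standard library is the usual way to take a consistent copy.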

4. Policy Configuration (data/config/auto_heal_policy.json)

Configuration structure:

{
  "mode": "hybrid",
  "step_fail_streak_to_degraded": 3,
  "workflow_fail_window_s": 600,
  "workflow_fail_max_in_window": 10,
  "global_fail_max_in_window": 30,
  
  "min_confidence_normal": 0.72,
  "min_confidence_degraded": 0.82,
  "min_margin_top1_top2_degraded": 0.08,
  
  "disable_learning_in_degraded": true,
  "rollback_on_regression": true,
  "regression_window_steps": 50,
  "regression_fail_ratio": 0.20,
  
  "quarantine_duration_s": 1800,
  "max_versions_to_keep": 5
}
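The regression keys are not explained further in this document. One plausible reading, shown here purely as an assumption, is that a rollback is signalled when at least `regression_fail_ratio` of the last `regression_window_steps` steps failed. `RegressionMonitor` is a hypothetical name.

```python
from collections import deque
from typing import Deque


class RegressionMonitor:
    """Tracks the last N step outcomes and flags a high failure ratio."""

    def __init__(self, window_steps: int = 50, fail_ratio: float = 0.20):
        self.window_steps = window_steps
        self.fail_ratio = fail_ratio
        # True = success; the deque discards the oldest outcome automatically.
        self._outcomes: Deque[bool] = deque(maxlen=window_steps)

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def regressed(self) -> bool:
        # Assumption: wait for a full window before judging, so that a few
        # early failures do not trigger a rollback on their own.
        if len(self._outcomes) < self.window_steps:
            return False
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes) >= self.fail_ratio
```

With the defaults above, 10 failures among the last 50 steps reach the 0.20 ratio, while 9 do not.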

Data Models

ExecutionStateInfo

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class ExecutionStateInfo:
    workflow_id: str
    current_state: ExecutionState
    state_since: datetime
    failure_count: int
    last_failure: Optional[datetime]
    confidence_threshold: float
    learning_enabled: bool
    quarantine_until: Optional[datetime]

FailureWindow

@dataclass
class FailureWindow:
    window_start: datetime
    window_duration_s: int
    failures: List[FailureEvent]

    # Interface only; method bodies are intentionally omitted.
    def add_failure(self, failure: FailureEvent) -> None: ...
    def get_failure_count(self) -> int: ...
    def cleanup_expired(self) -> None: ...

VersionInfo

@dataclass
class VersionInfo:
    version_id: str
    created_at: datetime
    workflow_id: str
    success_rate_before: float
    success_rate_after: Optional[float]
    components_versioned: List[str]  # ["prototypes", "faiss", "memory"]

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system; in essence, it is a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: State Transition Consistency

For any workflow execution state, transitions should follow valid state machine rules and maintain consistency across all system components. Validates: Requirements 1.1, 1.2, 1.3, 1.4, 1.5, 1.6
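One way to make this property checkable is an explicit transition table that every state change must pass through. The table below is an assumption made for illustration; the authoritative rules are those of Requirements 1.1 to 1.6, which are not reproduced in this document.

```python
from enum import Enum
from typing import Dict, FrozenSet


class ExecutionState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"
    ROLLBACK = "rollback"
    PAUSED = "paused"


S = ExecutionState

# Assumed transition table (hypothetical, not taken from the requirements).
ALLOWED: Dict[ExecutionState, FrozenSet[ExecutionState]] = {
    S.RUNNING: frozenset({S.DEGRADED, S.QUARANTINED, S.ROLLBACK, S.PAUSED}),
    S.DEGRADED: frozenset({S.RUNNING, S.QUARANTINED, S.ROLLBACK, S.PAUSED}),
    S.QUARANTINED: frozenset({S.RUNNING, S.ROLLBACK, S.PAUSED}),
    S.ROLLBACK: frozenset({S.RUNNING, S.DEGRADED, S.PAUSED}),
    S.PAUSED: frozenset({S.RUNNING}),
}


def transition(current: ExecutionState, new: ExecutionState) -> ExecutionState:
    """Return the new state, or raise if the move is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

A property test can then generate random sequences of requested transitions and assert that the resulting state is always reachable through `ALLOWED`.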

Property 2: Circuit Breaker Threshold Enforcement

For any sequence of step failures, when thresholds are exceeded, the circuit breaker should trigger appropriate state transitions within the configured time windows. Validates: Requirements 2.1, 2.2, 2.3

Property 3: Degraded Mode Safety

For any workflow in DEGRADED state, all execution decisions should use increased confidence thresholds and learning updates should be disabled. Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6
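The stricter decision rule in DEGRADED mode can be illustrated with the three thresholds from the policy file (`min_confidence_normal`, `min_confidence_degraded`, `min_margin_top1_top2_degraded`). `accept_match` is a hypothetical helper, and treating a lone candidate as having a runner-up score of 0.0 is an assumption.

```python
from typing import Sequence


def accept_match(scores: Sequence[float], degraded: bool,
                 min_conf_normal: float = 0.72,
                 min_conf_degraded: float = 0.82,
                 min_margin_degraded: float = 0.08) -> bool:
    """Decide whether the best candidate is trustworthy enough to act on.

    `scores` are similarity scores of the candidate targets, in any order.
    """
    if not scores:
        return False
    ranked = sorted(scores, reverse=True)
    top1 = ranked[0]
    if not degraded:
        return top1 >= min_conf_normal
    # Degraded: higher confidence floor, plus a clear gap to the runner-up.
    top2 = ranked[1] if len(ranked) > 1 else 0.0
    return top1 >= min_conf_degraded and (top1 - top2) >= min_margin_degraded
```

The margin check rejects ambiguous matches: two candidates scoring 0.85 and 0.80 both clear the confidence floor, but the 0.05 gap between them is below the 0.08 margin, so the step is not executed in DEGRADED mode.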

Property 4: Rollback Consistency

For any rollback operation, all versioned components (prototypes, FAISS indices, target memory) should be restored to the same consistent version point. Validates: Requirements 4.1, 4.2, 4.3, 4.5, 4.6
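A simple guard for this property is to choose the rollback target only among versions that every component actually has a snapshot for. The helper below is a sketch; `pick_rollback_version` and the component keys are hypothetical.

```python
from typing import Dict, List, Optional


def _num(version: str) -> int:
    """'v007' -> 7, so ordering stays correct beyond three digits."""
    return int(version.lstrip("v"))


def pick_rollback_version(available: Dict[str, List[str]],
                          current: str) -> Optional[str]:
    """Return the newest version older than `current` shared by all components.

    `available` maps a component name to its snapshot ids, for example
    {"prototypes": ["v001", "v002"], "faiss": ["v001"], "memory": ["v001", "v002"]}.
    """
    if not available:
        return None
    common = set.intersection(*(set(ids) for ids in available.values()))
    older = [v for v in common if _num(v) < _num(current)]
    return max(older, key=_num, default=None)
```

Returning `None` when no common version exists lets the caller refuse the rollback instead of restoring the components to mismatched versions.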

Property 5: Hybrid Storage Integrity

For any execution decision, audit records should always be written to JSONL, and SQLite records should only be written for validated successes when not in DEGRADED mode. Validates: Requirements 5.1, 5.2, 5.3, 5.4
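The write rule can be sketched as one function that always appends to the audit log and only conditionally touches the database. The table name `learned_targets`, its columns, and the function name `record_decision` are hypothetical; only the JSONL/SQLite split comes from the property above.

```python
import json
import sqlite3
from pathlib import Path


def record_decision(audit_path: Path, db: sqlite3.Connection, decision: dict,
                    validated_success: bool, degraded: bool) -> bool:
    """Append to the JSONL audit log; persist to SQLite only when safe.

    Returns True if the decision was also written to SQLite.
    """
    # The audit trail is unconditional: one JSON object per line.
    with audit_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(decision, ensure_ascii=False) + "\n")

    if not validated_success or degraded:
        return False
    with db:  # commits on success, rolls back if the insert raises
        db.execute(
            "INSERT INTO learned_targets (workflow_id, step_id, confidence) "
            "VALUES (?, ?, ?)",
            (decision["workflow_id"], decision["step_id"], decision["confidence"]),
        )
    return True
```

Writing the audit line first means a database error still leaves a trace of the decision.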

Property 6: Configuration Consistency

For any configuration change, all system components should apply the new settings consistently without requiring restart. Validates: Requirements 6.1, 6.2, 6.3, 6.4, 6.5
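Applying changes without a restart implies that components read the policy through something that notices edits. A minimal sketch, assuming all components share one accessor and that an unparsable file should never replace a working policy (`ReloadingPolicy` is a hypothetical name):

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict


class ReloadingPolicy:
    """Re-parses the policy file whenever its content changes."""

    def __init__(self, path: Path):
        self.path = path
        self._digest = ""
        self._policy: Dict[str, Any] = {}

    def get(self) -> Dict[str, Any]:
        raw = self.path.read_bytes()
        digest = hashlib.sha256(raw).hexdigest()
        if digest != self._digest:
            try:
                self._policy = json.loads(raw)
            except json.JSONDecodeError:
                # Half-written or invalid file: keep the last good policy.
                return self._policy
            self._digest = digest
        return self._policy
```

Comparing a content hash rather than the modification time avoids missed updates on filesystems with coarse timestamps, at the cost of reading the file on every call.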

Property 7: Integration Compatibility

For any existing system integration point, the auto-healing system should maintain backward compatibility and enhance functionality without breaking existing workflows. Validates: Requirements 7.1, 7.2, 7.3, 7.4, 7.5

Error Handling

Failure Classification

  1. TARGET_NOT_FOUND: UI element not found
  2. POSTCONDITION_FAILED: Post-conditions not satisfied
  3. WATCHDOG_TIMEOUT: Watchdog timeout
  4. LOW_CONFIDENCE: Insufficient FAISS confidence
  5. RUNTIME_DRIFT: Change in resolution or display scale

Recovery Strategies

  1. Immediate: Retry with normal parameters
  2. Degraded: Retry with raised thresholds
  3. Quarantine: Stop the workflow, with failure capture
  4. Rollback: Restore the previous version
  5. Manual: Human intervention required

Error Propagation

  • Step errors propagate up to the workflow level
  • Workflow errors can trigger global actions
  • Every error generates a FailureCase (Fiche #19)
  • Critical errors generate reports (Fiche #16)

Testing Strategy

Unit Tests

  • Test individual state transitions
  • Test circuit breaker thresholds
  • Test versioning operations
  • Test the policy configuration

Property Tests

  • Test state-consistency properties
  • Test threshold invariants
  • Test rollback integrity
  • Test hybrid storage consistency

Integration Tests

  • Test against the existing systems (Fiche #19, #18, #16)
  • Test the execution hooks
  • Test degradation scenarios
  • Test full rollbacks

Scenario Tests

  • Simulate 3 consecutive failures → DEGRADED
  • Simulate 10 failures within 10 min → QUARANTINED
  • Simulate learning degradation → ROLLBACK
  • Test recovery after quarantine
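As an illustration of the last scenario, a recovery-after-quarantine test might look like the sketch below. It assumes that a quarantine simply expires after `quarantine_duration_s` (1800 seconds in the policy file); `QuarantineGate` is a hypothetical stand-in for the real manager, and timestamps are passed in explicitly so the test does not have to sleep.

```python
from datetime import datetime, timedelta
from typing import Optional


class QuarantineGate:
    """Hypothetical minimal gate: blocks a workflow until its quarantine expires."""

    def __init__(self, quarantine_duration_s: int = 1800):
        self.duration = timedelta(seconds=quarantine_duration_s)
        self.quarantine_until: Optional[datetime] = None

    def quarantine(self, now: datetime) -> None:
        self.quarantine_until = now + self.duration

    def should_execute(self, now: datetime) -> bool:
        if self.quarantine_until is not None and now < self.quarantine_until:
            return False
        self.quarantine_until = None  # quarantine elapsed, or never set
        return True


def test_recovery_after_quarantine() -> None:
    gate = QuarantineGate(quarantine_duration_s=1800)
    t0 = datetime(2026, 1, 1, 12, 0, 0)
    assert gate.should_execute(t0)
    gate.quarantine(t0)
    assert not gate.should_execute(t0 + timedelta(minutes=29))
    assert gate.should_execute(t0 + timedelta(minutes=30))
```

Injecting the clock keeps the scenario deterministic, which matters for time-window logic that would otherwise need 30 minutes of wall time to exercise.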

The testing strategy combines unit tests for specific cases with property tests that validate universal invariants. Integration tests verify compatibility with the existing systems, while scenario tests validate end-to-end behavior.