rpa_vision_v3/.kiro/specs/auto-heal-hybrid/design.md

# Design Document - Hybrid Auto-Heal (Fiche #22)

## Overview

The Hybrid Auto-Heal system balances service continuity against safety. It uses a state machine to manage transitions between execution modes, circuit breakers to prevent infinite loops, and a versioning system so that learning can be rolled back.

The architecture builds on the existing systems (Fiche #19 for failure capture, Fiche #18 for persistent learning, Fiche #16 for reports) and adds a supervision and protection layer on top of them.

## Architecture

```mermaid
graph TB
    subgraph "Auto-Heal Hybrid System"
        AHM[Auto Heal Manager]
        CB[Circuit Breaker]
        VS[Versioned Store]
        PC[Policy Config]
    end

    subgraph "Execution Layer"
        EL[Execution Loop]
        AE[Action Executor]
        TR[Target Resolver]
    end

    subgraph "Learning Layer"
        TMS[Target Memory Store]
        FAISS[FAISS Index]
        PROTO[Prototypes]
    end

    subgraph "Integration Layer"
        FCR[Failure Case Recorder]
        SR[Simulation Report]
        PM[Precision Metrics]
    end

    EL --> AHM
    AHM --> CB
    AHM --> VS
    AHM --> PC

    AHM --> FCR
    AHM --> SR
    AHM --> PM

    VS --> TMS
    VS --> FAISS
    VS --> PROTO

    AHM --> AE
    AE --> TR
```

## Components and Interfaces

### 1. Auto Heal Manager (core/system/auto_heal_manager.py)

**Responsibility:** Central manager for execution states and safety policies.

```python
from __future__ import annotations

from enum import Enum
from pathlib import Path
from typing import Any, Dict, Tuple


class ExecutionState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"
    ROLLBACK = "rollback"
    PAUSED = "paused"


class AutoHealManager:
    # Interface only: bodies are stubs. ExecutionResult comes from the execution layer.
    def __init__(self, policy_path: Path = Path("data/config/auto_heal_policy.json")) -> None: ...
    def should_execute_step(self, workflow_id: str, step_id: str) -> Tuple[bool, str]: ...
    def on_step_result(self, workflow_id: str, step_id: str, result: ExecutionResult) -> None: ...
    def get_mode(self, workflow_id: str) -> ExecutionState: ...
    def force_transition(self, workflow_id: str, new_state: ExecutionState, reason: str) -> None: ...
    def get_status_report(self) -> Dict[str, Any]: ...
```

**Integration with the execution loop:**
```python
# In execution_loop.py or action_executor.py
allowed, reason = auto_heal_manager.should_execute_step(workflow_id, step_id)
if not allowed:
    return ExecutionResult(status=ExecutionStatus.BLOCKED, message=reason)

# Execute the action...
result = execute_action(...)

# After execution
auto_heal_manager.on_step_result(workflow_id, step_id, result)
```

### 2. Circuit Breaker (core/system/circuit_breaker.py)

**Responsibility:** Anti-loop mechanism based on sliding windows.

```python
from typing import Any, Dict


class CircuitBreaker:
    # Interface only: method bodies are intentionally left as stubs.
    def __init__(self, policy: Dict[str, Any]) -> None: ...
    def record_failure(self, workflow_id: str, step_id: str, failure_type: str) -> None: ...
    def record_success(self, workflow_id: str, step_id: str) -> None: ...
    def should_trigger_degraded(self, workflow_id: str, step_id: str) -> bool: ...
    def should_trigger_quarantine(self, workflow_id: str) -> bool: ...
    def should_trigger_global_pause(self) -> bool: ...
    def get_failure_counts(self, workflow_id: str) -> Dict[str, int]: ...
```

**Sliding windows:**
- Step level: 3 consecutive failures → DEGRADED
- Workflow level: 10 failures within 10 minutes → QUARANTINED
- Global level: 30 failures within 10 minutes → PAUSED (optional)
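
The step-level and workflow-level thresholds above can be sketched as follows. This is a minimal illustration, not the project's `CircuitBreaker`: the class name `SlidingWindowBreaker`, the explicit `now` parameter, and the in-memory `deque` storage are assumptions made for the example. The default values mirror the thresholds listed above.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class SlidingWindowBreaker:
    """Hypothetical sketch: per-step failure streaks plus a per-workflow sliding window."""

    def __init__(self, streak_limit: int = 3, window_s: float = 600.0, window_max: int = 10) -> None:
        self.streak_limit = streak_limit
        self.window_s = window_s
        self.window_max = window_max
        self._streaks: Dict[Tuple[str, str], int] = defaultdict(int)
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, workflow_id: str, step_id: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._streaks[(workflow_id, step_id)] += 1
        self._failures[workflow_id].append(now)

    def record_success(self, workflow_id: str, step_id: str) -> None:
        # A success resets the consecutive-failure streak of that step only;
        # failures already recorded in the workflow window keep counting.
        self._streaks[(workflow_id, step_id)] = 0

    def should_trigger_degraded(self, workflow_id: str, step_id: str) -> bool:
        return self._streaks[(workflow_id, step_id)] >= self.streak_limit

    def should_trigger_quarantine(self, workflow_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        window = self._failures[workflow_id]
        # Drop failures that have slid out of the window before counting.
        while window and now - window[0] > self.window_s:
            window.popleft()
        return len(window) >= self.window_max
```

Passing `now` explicitly keeps the window logic deterministic in tests; in production the monotonic clock default avoids surprises from wall-clock adjustments.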

### 3. Versioned Store (core/learning/versioned_store.py)

**Responsibility:** Versioning system that makes learning reversible.

```python
from __future__ import annotations

from pathlib import Path
from typing import List, Optional


class VersionedStore:
    # Interface only: method bodies are intentionally left as stubs.
    def __init__(self, base_path: Path = Path("data")) -> None: ...
    def snapshot_version(self, workflow_id: str) -> str: ...
    def rollback_to_previous(self, workflow_id: str, version: Optional[str] = None) -> bool: ...
    def list_versions(self, workflow_id: str) -> List[VersionInfo]: ...
    def cleanup_old_versions(self, workflow_id: str, keep_count: int = 5) -> None: ...

    # Per-component versioning
    def version_prototypes(self, workflow_id: str, version: str) -> None: ...
    def version_faiss_index(self, workflow_id: str, version: str) -> None: ...
    def version_target_memory(self, workflow_id: str, version: str) -> None: ...
```

**Versioning layout:**
```
data/
├── learning/
│   └── prototypes/
│       ├── v001/  # Version snapshots
│       └── v002/
├── faiss_index/
│   └── workflow_<id>/
│       ├── v001/  # Versioned indices
│       └── v002/
└── target_memory_snapshots/
    ├── v001.db   # SQLite snapshots
    └── v002.db
```
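
One way to produce the `vNNN` directories shown above is sketched below. The helper names `next_version_id` and `snapshot_directory` are hypothetical, and copying the whole live directory with `shutil.copytree` is an assumption about how a snapshot might be taken; the design does not prescribe the mechanism.

```python
import re
import shutil
from pathlib import Path

_VERSION_RE = re.compile(r"^v(\d{3,})$")


def next_version_id(versions_dir: Path) -> str:
    """Return the next vNNN identifier based on the sub-directories already present."""
    highest = 0
    if versions_dir.exists():
        for child in versions_dir.iterdir():
            match = _VERSION_RE.match(child.name)
            if child.is_dir() and match:
                highest = max(highest, int(match.group(1)))
    return f"v{highest + 1:03d}"


def snapshot_directory(live_dir: Path, versions_dir: Path) -> str:
    """Copy live_dir into a new versioned sub-directory and return its identifier."""
    version_id = next_version_id(versions_dir)
    # copytree creates missing parent directories and fails if the target exists,
    # so an existing snapshot is never silently overwritten.
    shutil.copytree(live_dir, versions_dir / version_id)
    return version_id
```

A rollback would then restore the same `vNNN` identifier for every component, which is what keeps prototypes, FAISS indices, and target memory aligned.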

### 4. Policy Configuration (data/config/auto_heal_policy.json)

**Configuration structure:**
```json
{
  "mode": "hybrid",
  "step_fail_streak_to_degraded": 3,
  "workflow_fail_window_s": 600,
  "workflow_fail_max_in_window": 10,
  "global_fail_max_in_window": 30,

  "min_confidence_normal": 0.72,
  "min_confidence_degraded": 0.82,
  "min_margin_top1_top2_degraded": 0.08,

  "disable_learning_in_degraded": true,
  "rollback_on_regression": true,
  "regression_window_steps": 50,
  "regression_fail_ratio": 0.20,

  "quarantine_duration_s": 1800,
  "max_versions_to_keep": 5
}
```
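
The three confidence keys could be applied as in the sketch below. It assumes that `min_margin_top1_top2_degraded` is the minimum gap between the best and second-best candidate scores, which the configuration does not state explicitly. The helpers `load_policy` and `accept_match` are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Defaults copied from the policy file above (confidence-related keys only).
DEFAULT_POLICY: Dict[str, Any] = {
    "min_confidence_normal": 0.72,
    "min_confidence_degraded": 0.82,
    "min_margin_top1_top2_degraded": 0.08,
}


def load_policy(path: Path) -> Dict[str, Any]:
    """Overlay the JSON file on the defaults; a missing file yields the defaults."""
    policy = dict(DEFAULT_POLICY)
    if path.exists():
        policy.update(json.loads(path.read_text(encoding="utf-8")))
    return policy


def accept_match(policy: Dict[str, Any], top1: float, top2: float, degraded: bool) -> bool:
    """Decide whether the best candidate is trustworthy enough to act on."""
    if not degraded:
        return top1 >= policy["min_confidence_normal"]
    # ASSUMPTION: in degraded mode the best score must also clearly beat the runner-up.
    margin_ok = (top1 - top2) >= policy["min_margin_top1_top2_degraded"]
    return top1 >= policy["min_confidence_degraded"] and margin_ok
```

Under this reading, a candidate scoring 0.75 is accepted in normal mode but rejected in degraded mode, and a 0.90 candidate is still rejected in degraded mode if the runner-up scores 0.85.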

## Data Models

### ExecutionStateInfo
```python
@dataclass
class ExecutionStateInfo:
    workflow_id: str
    current_state: ExecutionState
    state_since: datetime
    failure_count: int
    last_failure: Optional[datetime]
    confidence_threshold: float
    learning_enabled: bool
    quarantine_until: Optional[datetime]
```

### FailureWindow
```python
@dataclass
class FailureWindow:
    window_start: datetime
    window_duration_s: int
    failures: List[FailureEvent]

    # Interface only: method bodies are intentionally left as stubs.
    def add_failure(self, failure: FailureEvent) -> None: ...
    def get_failure_count(self) -> int: ...
    def cleanup_expired(self) -> None: ...
```

### VersionInfo
```python
@dataclass
class VersionInfo:
    version_id: str
    created_at: datetime
    workflow_id: str
    success_rate_before: float
    success_rate_after: Optional[float]
    components_versioned: List[str]  # ["prototypes", "faiss", "memory"]
```

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: State Transition Consistency
*For any* workflow execution state, transitions should follow valid state machine rules and maintain consistency across all system components.
**Validates: Requirements 1.1, 1.2, 1.3, 1.4, 1.5, 1.6**
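
A transition table is one common way to make "valid state machine rules" checkable. The table below is purely illustrative: the allowed edges are assumptions, since the authoritative rules live in Requirements 1.1 to 1.6, and `ALLOWED_TRANSITIONS` and `apply_transition` are hypothetical names.

```python
from enum import Enum
from typing import Dict, FrozenSet


class ExecutionState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"
    ROLLBACK = "rollback"
    PAUSED = "paused"


S = ExecutionState  # short alias for the table below

# ASSUMPTION: these edges are an example, not the specified state machine.
ALLOWED_TRANSITIONS: Dict[ExecutionState, FrozenSet[ExecutionState]] = {
    S.RUNNING: frozenset({S.DEGRADED, S.QUARANTINED, S.PAUSED}),
    S.DEGRADED: frozenset({S.RUNNING, S.QUARANTINED, S.ROLLBACK, S.PAUSED}),
    S.QUARANTINED: frozenset({S.RUNNING, S.ROLLBACK, S.PAUSED}),
    S.ROLLBACK: frozenset({S.RUNNING, S.QUARANTINED}),
    S.PAUSED: frozenset({S.RUNNING}),
}


def apply_transition(current: ExecutionState, new: ExecutionState) -> ExecutionState:
    """Return the new state, or raise ValueError if the edge is not in the table."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.value} -> {new.value}")
    return new
```

Keeping the rules in a single table lets a property test enumerate every pair of states and compare the manager's behavior against it.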

### Property 2: Circuit Breaker Threshold Enforcement
*For any* sequence of step failures, when thresholds are exceeded, the circuit breaker should trigger appropriate state transitions within the configured time windows.
**Validates: Requirements 2.1, 2.2, 2.3**

### Property 3: Degraded Mode Safety
*For any* workflow in DEGRADED state, all execution decisions should use increased confidence thresholds and learning updates should be disabled.
**Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6**

### Property 4: Rollback Consistency
*For any* rollback operation, all versioned components (prototypes, FAISS indices, target memory) should be restored to the same consistent version point.
**Validates: Requirements 4.1, 4.2, 4.3, 4.5, 4.6**

### Property 5: Hybrid Storage Integrity
*For any* execution decision, audit records should always be written to JSONL, and SQLite records should only be written for validated successes when not in DEGRADED mode.
**Validates: Requirements 5.1, 5.2, 5.3, 5.4**
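
The write rule in Property 5 can be sketched as a single function. The function name `record_decision`, the `learned_targets` table, and its columns are hypothetical; only the rule itself (always append to JSONL, write to SQLite only for validated successes outside DEGRADED mode) comes from the property above.

```python
import json
import sqlite3
from pathlib import Path
from typing import Any, Dict


def record_decision(
    audit_path: Path,
    db: sqlite3.Connection,
    decision: Dict[str, Any],
    *,
    validated_success: bool,
    degraded: bool,
) -> bool:
    """Append to the JSONL audit log unconditionally; return True if SQLite was also written."""
    with audit_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(decision, ensure_ascii=False) + "\n")

    if not validated_success or degraded:
        return False

    # Using the connection as a context manager commits on success
    # and rolls back if the INSERT raises.
    with db:
        db.execute(
            "INSERT INTO learned_targets (workflow_id, step_id, confidence) VALUES (?, ?, ?)",
            (decision["workflow_id"], decision["step_id"], decision["confidence"]),
        )
    return True
```

Writing the audit line first means a SQLite failure can never hide a decision from the audit trail.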

### Property 6: Configuration Consistency
*For any* configuration change, all system components should apply the new settings consistently without requiring restart.
**Validates: Requirements 6.1, 6.2, 6.3, 6.4, 6.5**

### Property 7: Integration Compatibility
*For any* existing system integration point, the auto-healing system should maintain backward compatibility and enhance functionality without breaking existing workflows.
**Validates: Requirements 7.1, 7.2, 7.3, 7.4, 7.5**

## Error Handling

### Failure Classification
1. **TARGET_NOT_FOUND**: UI element not found
2. **POSTCONDITION_FAILED**: Post-conditions not satisfied
3. **WATCHDOG_TIMEOUT**: Watchdog timeout
4. **LOW_CONFIDENCE**: Insufficient FAISS confidence
5. **RUNTIME_DRIFT**: Resolution or scale change

### Recovery Strategies
1. **Immediate**: Retry with normal parameters
2. **Degraded**: Retry with raised thresholds
3. **Quarantine**: Stop the workflow and capture the failure
4. **Rollback**: Restore the previous version
5. **Manual**: Human intervention required

### Error Propagation
- Step errors propagate up to the workflow level
- Workflow errors can trigger global actions
- Every error generates a FailureCase (Fiche #19)
- Critical errors generate reports (Fiche #16)

## Testing Strategy

### Unit Tests
- Test individual state transitions
- Test circuit breaker thresholds
- Test versioning operations
- Test the policy configuration

### Property Tests
- Test state consistency properties
- Test threshold invariants
- Test rollback integrity
- Test hybrid storage consistency

### Integration Tests
- Test against the existing systems (Fiche #19, #18, #16)
- Test execution hooks
- Test degradation scenarios
- Test full rollbacks

### Scenario Tests
- Simulate 3 consecutive failures → DEGRADED
- Simulate 10 failures within 10 minutes → QUARANTINED
- Simulate a learning regression → ROLLBACK
- Test recovery after quarantine

The testing strategy combines unit tests for specific cases with property tests that validate universal invariants. Integration tests verify compatibility with the existing systems, while scenario tests validate end-to-end behavior.