# Design Document - Hybrid Auto-Heal (Fiche #22)

## Overview

The Hybrid Auto-Heal system balances service continuity against safety. It uses a state machine to manage transitions between execution modes, circuit breakers to prevent infinite loops, and a versioning system that allows learning to be rolled back.

The architecture builds on existing systems (Fiche #19 for failure capture, Fiche #18 for persistent learning, Fiche #16 for reports) and adds an intelligent supervision and protection layer on top.

## Architecture

```mermaid
graph TB
    subgraph "Auto-Heal Hybrid System"
        AHM[Auto Heal Manager]
        CB[Circuit Breaker]
        VS[Versioned Store]
        PC[Policy Config]
    end

    subgraph "Execution Layer"
        EL[Execution Loop]
        AE[Action Executor]
        TR[Target Resolver]
    end

    subgraph "Learning Layer"
        TMS[Target Memory Store]
        FAISS[FAISS Index]
        PROTO[Prototypes]
    end

    subgraph "Integration Layer"
        FCR[Failure Case Recorder]
        SR[Simulation Report]
        PM[Precision Metrics]
    end

    EL --> AHM
    AHM --> CB
    AHM --> VS
    AHM --> PC

    AHM --> FCR
    AHM --> SR
    AHM --> PM

    VS --> TMS
    VS --> FAISS
    VS --> PROTO

    AHM --> AE
    AE --> TR
```

## Components and Interfaces

### 1. Auto Heal Manager (core/system/auto_heal_manager.py)

**Responsibility:** Central manager for execution states and safety policies.

```python
from __future__ import annotations  # ExecutionResult is defined elsewhere in the codebase

from enum import Enum
from pathlib import Path
from typing import Any, Dict, Tuple


class ExecutionState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"
    ROLLBACK = "rollback"
    PAUSED = "paused"


class AutoHealManager:
    # Interface only: method bodies are intentionally omitted.
    def __init__(self, policy_path: Path = Path("data/config/auto_heal_policy.json")) -> None: ...
    def should_execute_step(self, workflow_id: str, step_id: str) -> Tuple[bool, str]: ...
    def on_step_result(self, workflow_id: str, step_id: str, result: ExecutionResult) -> None: ...
    def get_mode(self, workflow_id: str) -> ExecutionState: ...
    def force_transition(self, workflow_id: str, new_state: ExecutionState, reason: str) -> None: ...
    def get_status_report(self) -> Dict[str, Any]: ...
```

**Integration with the execution loop:**
```python
# In execution_loop.py or action_executor.py
def run_step(workflow_id, step_id):  # wrapper name is illustrative
    allowed, reason = auto_heal_manager.should_execute_step(workflow_id, step_id)
    if not allowed:
        return ExecutionResult(status=ExecutionStatus.BLOCKED, message=reason)

    # Execute the action...
    result = execute_action(...)

    # After execution
    auto_heal_manager.on_step_result(workflow_id, step_id, result)
    return result
```

### 2. Circuit Breaker (core/system/circuit_breaker.py)

**Responsibility:** Anti-loop mechanism based on sliding windows.

```python
from typing import Any, Dict


class CircuitBreaker:
    # Interface only: method bodies are intentionally omitted.
    def __init__(self, policy: Dict[str, Any]) -> None: ...
    def record_failure(self, workflow_id: str, step_id: str, failure_type: str) -> None: ...
    def record_success(self, workflow_id: str, step_id: str) -> None: ...
    def should_trigger_degraded(self, workflow_id: str, step_id: str) -> bool: ...
    def should_trigger_quarantine(self, workflow_id: str) -> bool: ...
    def should_trigger_global_pause(self) -> bool: ...
    def get_failure_counts(self, workflow_id: str) -> Dict[str, int]: ...
```

**Sliding windows:**
- Step level: 3 consecutive failures → DEGRADED
- Workflow level: 10 failures in 10 minutes → QUARANTINED
- Global level: 30 failures in 10 minutes → PAUSED (optional)

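The step-level streak and the workflow-level window can be sketched as follows. This is a minimal illustration, not the project's `CircuitBreaker`: the class name `SlidingFailureCounter`, the explicit `now` parameter (added so the behavior can be exercised without waiting), and the choice to trigger when a threshold is reached rather than strictly exceeded are all assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Tuple


class SlidingFailureCounter:
    """Hypothetical helper: per-step failure streak plus per-workflow sliding window."""

    def __init__(self, streak_to_degraded: int = 3, window_s: float = 600.0,
                 max_in_window: int = 10) -> None:
        self.streak_to_degraded = streak_to_degraded
        self.window_s = window_s
        self.max_in_window = max_in_window
        self._streaks: Dict[Tuple[str, str], int] = defaultdict(int)
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, workflow_id: str, step_id: str,
                       now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._streaks[(workflow_id, step_id)] += 1
        self._events[workflow_id].append(now)

    def record_success(self, workflow_id: str, step_id: str) -> None:
        # A success resets the consecutive-failure streak for that step only.
        self._streaks[(workflow_id, step_id)] = 0

    def should_trigger_degraded(self, workflow_id: str, step_id: str) -> bool:
        return self._streaks[(workflow_id, step_id)] >= self.streak_to_degraded

    def should_trigger_quarantine(self, workflow_id: str,
                                  now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        events = self._events[workflow_id]
        # Drop failures that have aged out of the window before counting.
        while events and now - events[0] > self.window_s:
            events.popleft()
        return len(events) >= self.max_in_window
```

Keeping timestamps in a `deque` makes expiry cheap, since the oldest failures are always at the left end. A success clears the step streak but deliberately leaves the workflow window untouched, so a flaky workflow can still reach quarantine.
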
### 3. Versioned Store (core/learning/versioned_store.py)

**Responsibility:** Versioning system that makes learning reversible.

```python
from __future__ import annotations  # VersionInfo is defined under Data Models

from pathlib import Path
from typing import List, Optional


class VersionedStore:
    # Interface only: method bodies are intentionally omitted.
    def __init__(self, base_path: Path = Path("data")) -> None: ...
    def snapshot_version(self, workflow_id: str) -> str: ...
    def rollback_to_previous(self, workflow_id: str, version: Optional[str] = None) -> bool: ...
    def list_versions(self, workflow_id: str) -> List[VersionInfo]: ...
    def cleanup_old_versions(self, workflow_id: str, keep_count: int = 5) -> None: ...

    # Component versioning
    def version_prototypes(self, workflow_id: str, version: str) -> None: ...
    def version_faiss_index(self, workflow_id: str, version: str) -> None: ...
    def version_target_memory(self, workflow_id: str, version: str) -> None: ...
```

**Versioning layout:**
```
data/
├── learning/
│   └── prototypes/
│       ├── v001/              # Version snapshots
│       └── v002/
├── faiss_index/
│   └── workflow_<id>/
│       ├── v001/              # Versioned indices
│       └── v002/
└── target_memory_snapshots/
    ├── v001.db                # SQLite snapshots
    └── v002.db
```

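To make the `vNNN` layout concrete, here is a minimal sketch of snapshot and rollback for a single directory-based component such as prototypes. The function names, the copy-the-whole-directory approach, and the demo file `prototypes.txt` are assumptions for illustration; the real `VersionedStore` may use a different strategy, and the SQLite snapshots are not covered here.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def list_versions(root: Path) -> List[str]:
    """Return version directory names (v001, v002, ...) in ascending order."""
    if not root.exists():
        return []
    return sorted(p.name for p in root.iterdir()
                  if p.is_dir() and p.name.startswith("v") and p.name[1:].isdigit())


def snapshot(live_dir: Path, versions_root: Path) -> str:
    """Copy the live directory into the next vNNN folder and return its id."""
    existing = list_versions(versions_root)
    next_num = int(existing[-1][1:]) + 1 if existing else 1
    version_id = f"v{next_num:03d}"
    shutil.copytree(live_dir, versions_root / version_id)
    return version_id


def rollback(live_dir: Path, versions_root: Path, version_id: str) -> bool:
    """Replace the live directory with the contents of a stored version."""
    source = versions_root / version_id
    if not source.is_dir():
        return False
    shutil.rmtree(live_dir, ignore_errors=True)
    shutil.copytree(source, live_dir)
    return True


# Usage on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "live"
    versions = Path(tmp) / "versions"
    live.mkdir()
    (live / "prototypes.txt").write_text("a")
    first = snapshot(live, versions)
    (live / "prototypes.txt").write_text("b")
    second = snapshot(live, versions)
    rolled_back = rollback(live, versions, first)
    restored = (live / "prototypes.txt").read_text()
```

For the rollback to be consistent across components, the same version id would need to be applied to prototypes, FAISS indices, and target memory together.
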
### 4. Policy Configuration (data/config/auto_heal_policy.json)

**Configuration structure:**
```json
{
  "mode": "hybrid",
  "step_fail_streak_to_degraded": 3,
  "workflow_fail_window_s": 600,
  "workflow_fail_max_in_window": 10,
  "global_fail_max_in_window": 30,

  "min_confidence_normal": 0.72,
  "min_confidence_degraded": 0.82,
  "min_margin_top1_top2_degraded": 0.08,

  "disable_learning_in_degraded": true,
  "rollback_on_regression": true,
  "regression_window_steps": 50,
  "regression_fail_ratio": 0.20,

  "quarantine_duration_s": 1800,
  "max_versions_to_keep": 5
}
```

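One plausible reading of the three confidence keys is sketched below. The meaning of each key is inferred from its name and is an assumption: `top1` and `top2` stand for the scores of the best and second-best candidate targets, and `accept_match` is a hypothetical helper, not part of the documented interfaces.

```python
import json
from typing import Any, Dict

# Subset of the policy file, inlined so the sketch runs on its own.
POLICY_JSON = """
{
  "min_confidence_normal": 0.72,
  "min_confidence_degraded": 0.82,
  "min_margin_top1_top2_degraded": 0.08
}
"""


def accept_match(policy: Dict[str, Any], degraded: bool,
                 top1: float, top2: float) -> bool:
    """Decide whether the best candidate is trustworthy enough to act on."""
    if not degraded:
        return top1 >= policy["min_confidence_normal"]
    # DEGRADED: higher confidence floor, plus a clear lead over the runner-up.
    margin = top1 - top2
    return (top1 >= policy["min_confidence_degraded"]
            and margin >= policy["min_margin_top1_top2_degraded"])


policy = json.loads(POLICY_JSON)
```

Under this reading, a score of 0.75 is accepted in normal mode but rejected in DEGRADED mode, and a score of 0.90 is still rejected in DEGRADED mode if the runner-up scores 0.85, because the two candidates are too close to tell apart.
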
## Data Models

### ExecutionStateInfo
```python
from __future__ import annotations  # ExecutionState is the enum from auto_heal_manager.py

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ExecutionStateInfo:
    workflow_id: str
    current_state: ExecutionState
    state_since: datetime
    failure_count: int
    last_failure: Optional[datetime]
    confidence_threshold: float
    learning_enabled: bool
    quarantine_until: Optional[datetime]
```

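The `quarantine_until` field suggests how a quarantine might expire on its own. The sketch below assumes it is set to the entry time plus `quarantine_duration_s` and checked before each step; `quarantine_gate` is a hypothetical helper that returns the same `(allowed, reason)` shape as `should_execute_step`, and the timestamps are arbitrary.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def quarantine_gate(quarantine_until: Optional[datetime],
                    now: datetime) -> Tuple[bool, str]:
    """Return (allowed, reason); blocks while the quarantine deadline is in the future."""
    if quarantine_until is not None and now < quarantine_until:
        remaining = int((quarantine_until - now).total_seconds())
        return False, f"workflow quarantined for another {remaining}s"
    return True, "ok"


# Entering quarantine with a quarantine_duration_s of 1800 seconds.
entered_at = datetime(2024, 1, 1, 12, 0, 0)
until = entered_at + timedelta(seconds=1800)
blocked = quarantine_gate(until, entered_at + timedelta(minutes=10))
released = quarantine_gate(until, entered_at + timedelta(minutes=31))
```

Passing `now` explicitly keeps the check deterministic, which helps when testing recovery after quarantine.
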
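### FailureWindow
```python
from __future__ import annotations  # FailureEvent is defined elsewhere

from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class FailureWindow:
    window_start: datetime
    window_duration_s: int
    failures: List[FailureEvent]

    # Interface only: method bodies are intentionally omitted.
    def add_failure(self, failure: FailureEvent) -> None: ...
    def get_failure_count(self) -> int: ...
    def cleanup_expired(self) -> None: ...
```
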
### VersionInfo
```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class VersionInfo:
    version_id: str
    created_at: datetime
    workflow_id: str
    success_rate_before: float
    success_rate_after: Optional[float]
    components_versioned: List[str]  # ["prototypes", "faiss", "memory"]
```

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system; essentially, it is a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: State Transition Consistency
*For any* workflow execution state, transitions should follow valid state machine rules and maintain consistency across all system components.
**Validates: Requirements 1.1, 1.2, 1.3, 1.4, 1.5, 1.6**

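Property 1 becomes checkable once the valid transitions are written down as data. The design does not enumerate the edges of the state machine, so the `ALLOWED` table below is purely an assumption for illustration, as is the `transition` helper; only the `ExecutionState` enum comes from the document.

```python
from enum import Enum
from typing import Dict, FrozenSet


class ExecutionState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"
    QUARANTINED = "quarantined"
    ROLLBACK = "rollback"
    PAUSED = "paused"


S = ExecutionState

# ASSUMPTION: illustrative edges only; the design document does not list them.
ALLOWED: Dict[ExecutionState, FrozenSet[ExecutionState]] = {
    S.RUNNING: frozenset({S.DEGRADED, S.QUARANTINED, S.ROLLBACK, S.PAUSED}),
    S.DEGRADED: frozenset({S.RUNNING, S.QUARANTINED, S.ROLLBACK, S.PAUSED}),
    S.QUARANTINED: frozenset({S.RUNNING, S.ROLLBACK, S.PAUSED}),
    S.ROLLBACK: frozenset({S.RUNNING, S.DEGRADED, S.PAUSED}),
    S.PAUSED: frozenset({S.RUNNING}),
}


def transition(current: ExecutionState, new: ExecutionState) -> ExecutionState:
    """Return the new state, or raise if the edge is not in the table."""
    if new not in ALLOWED[current]:
        raise ValueError(f"invalid transition {current.value} -> {new.value}")
    return new
```

With the table in one place, a property test can generate arbitrary sequences of requested states and assert that every accepted step is an edge in `ALLOWED`.
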
### Property 2: Circuit Breaker Threshold Enforcement
*For any* sequence of step failures, when thresholds are exceeded, the circuit breaker should trigger appropriate state transitions within the configured time windows.
**Validates: Requirements 2.1, 2.2, 2.3**

### Property 3: Degraded Mode Safety
*For any* workflow in DEGRADED state, all execution decisions should use increased confidence thresholds and learning updates should be disabled.
**Validates: Requirements 3.1, 3.2, 3.3, 3.4, 3.5, 3.6**

### Property 4: Rollback Consistency
*For any* rollback operation, all versioned components (prototypes, FAISS indices, target memory) should be restored to the same consistent version point.
**Validates: Requirements 4.1, 4.2, 4.3, 4.5, 4.6**

### Property 5: Hybrid Storage Integrity
*For any* execution decision, audit records should always be written to JSONL, and SQLite records should only be written for validated successes when not in DEGRADED mode.
**Validates: Requirements 5.1, 5.2, 5.3, 5.4**

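The write rule in Property 5 can be sketched in a few lines. The table name `learned`, its columns, the decision fields, and the `record_decision` helper are hypothetical; only the rule itself (JSONL always, SQLite only for validated successes outside DEGRADED mode) comes from the property.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def record_decision(audit_path: Path, db: sqlite3.Connection, decision: dict,
                    validated_success: bool, degraded: bool) -> bool:
    """Append to the JSONL audit log; write to SQLite only when allowed.

    Returns True if the SQLite row was written.
    """
    with audit_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(decision) + "\n")  # audit trail is unconditional
    if validated_success and not degraded:
        db.execute("INSERT INTO learned (step_id, target) VALUES (?, ?)",
                   (decision["step_id"], decision["target"]))
        db.commit()
        return True
    return False


# Usage with an in-memory database and a temporary audit file.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE learned (step_id TEXT, target TEXT)")
with tempfile.TemporaryDirectory() as tmp:
    audit = Path(tmp) / "audit.jsonl"
    record_decision(audit, db, {"step_id": "s1", "target": "ok_button"}, True, False)
    record_decision(audit, db, {"step_id": "s2", "target": "ok_button"}, True, True)
    record_decision(audit, db, {"step_id": "s3", "target": "ok_button"}, False, False)
    audit_lines = len(audit.read_text(encoding="utf-8").splitlines())
learned_rows = db.execute("SELECT COUNT(*) FROM learned").fetchone()[0]
```

Three decisions produce three audit lines but a single learned row, since the second ran in DEGRADED mode and the third was not a validated success.
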
### Property 6: Configuration Consistency
*For any* configuration change, all system components should apply the new settings consistently without requiring restart.
**Validates: Requirements 6.1, 6.2, 6.3, 6.4, 6.5**

### Property 7: Integration Compatibility
*For any* existing system integration point, the auto-healing system should maintain backward compatibility and enhance functionality without breaking existing workflows.
**Validates: Requirements 7.1, 7.2, 7.3, 7.4, 7.5**

## Error Handling

### Failure Classification
1. **TARGET_NOT_FOUND**: UI element not found
2. **POSTCONDITION_FAILED**: Post-conditions not satisfied
3. **WATCHDOG_TIMEOUT**: Watchdog timeout
4. **LOW_CONFIDENCE**: Insufficient FAISS confidence
5. **RUNTIME_DRIFT**: Change in resolution or scale

### Recovery Strategies
1. **Immediate**: Retry with normal parameters
2. **Degraded**: Retry with raised thresholds
3. **Quarantine**: Stop the workflow and capture the failure
4. **Rollback**: Restore the previous version
5. **Manual**: Human intervention required

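The Rollback strategy is tied in the policy file to `rollback_on_regression`, `regression_window_steps` (50), and `regression_fail_ratio` (0.20). One way to read those keys is a rolling failure ratio over the most recent steps, sketched below. The `RegressionMonitor` class, the strict `>` comparison, and the choice to wait for a full window are assumptions, not documented behavior.

```python
from collections import deque
from typing import Deque


class RegressionMonitor:
    """Hypothetical helper: rolling failure ratio over the last N step outcomes."""

    def __init__(self, window_steps: int = 50, fail_ratio: float = 0.20) -> None:
        self.window_steps = window_steps
        self.fail_ratio = fail_ratio
        # maxlen makes the deque discard the oldest outcome automatically.
        self._outcomes: Deque[bool] = deque(maxlen=window_steps)

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def regressed(self) -> bool:
        # Wait for a full window so a single early failure cannot trigger a rollback.
        if len(self._outcomes) < self.window_steps:
            return False
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes) > self.fail_ratio
```

With the defaults, 10 failures in the last 50 steps sits exactly at the ratio and does not trigger, while 11 does. Whether the boundary should be inclusive is a policy decision the document leaves open.
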
### Error Propagation
- Step errors propagate up to the workflow level
- Workflow errors can trigger global actions
- Each error generates a FailureCase (Fiche #19)
- Critical errors generate reports (Fiche #16)

## Testing Strategy

### Unit Tests
- Individual state transitions
- Circuit breaker thresholds
- Versioning operations
- Policy configuration

### Property Tests
- State consistency properties
- Threshold invariants
- Rollback integrity
- Hybrid storage consistency

### Integration Tests
- Existing systems (Fiche #19, #18, #16)
- Execution hooks
- Degradation scenarios
- Full rollbacks

### Scenario Tests
- Simulate 3 consecutive failures → DEGRADED
- Simulate 10 failures in 10 minutes → QUARANTINED
- Simulate a learning regression → ROLLBACK
- Recovery after quarantine

The testing strategy combines unit tests for specific cases with property tests that validate universal invariants. Integration tests verify compatibility with existing systems, while scenario tests validate end-to-end behavior.