# AXE B2: Validator deep dive, production-ready implementation

**Date:** 2026-05-24
**Author:** dispatched research agent (Claude Opus 4.7 1M)
**Status:** research deliverable, read-only, **NO code changes**.
**Parent:** `docs/recherche/AXE_B2_VALIDATOR_PATTERN.md` (architecture already laid out, Skyvern verbatim).
**Siblings:** `AXE_A4_OCR_TEMPLATE_PHASH.md` (provides `OcrRoiChecker` and SSIM-ROI).
**Scope:** take the B2 skeleton and make it **paste-ready**: complete code, tests, precise wiring, measurable latencies, offline reproduction of the step 10 bug.

---

## 1. TL;DR: immediately actionable recommendation

The parent doc (`AXE_B2_VALIDATOR_PATTERN.md`) laid out the architecture and copied Skyvern verbatim. Four things are still missing: **(a)** paste-ready Python code for each checker, **(b)** the precise wiring into `report_action_result` (which already calls `_replay_verifier.verify_with_critic` at `api_stream.py:3554`), **(c)** an offline reproduction of the step 10 bug on an existing PNG, **(d)** the pytest test proving that the bug is closed.

**Total effort: 1 person-day** to ship a feature flag `RPA_VALIDATOR_V2_ENABLED` (default `false`) which:

1. Reuses the existing `verify_with_critic` (already wired, already tested), wrapping unchanged.
2. **Adds a single primary check**, `OcrRoiChecker`, in front of the current pixel-then-critic pipeline.
3. **Reroutes the output**: if `OcrRoiChecker` detects a suspect token (`https`, `edge`, `chrome`, `.com`, `.fr`), it returns TERMINATE with `failure_category=WRONG_APPLICATION` instead of continuing.
4. Plugs into `replay_state["results"]` in the same format as the existing `verification.to_dict()`.

**The step 10 bug is closed by 80 lines of Python.** The rest of the architecture (full taxonomy, verdict dispatcher, separate `LlmJudgeChecker`) is a P1 improvement: useful, not blocking.
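Point 3 of the plan above boils down to a substring test on normalized OCR text. The sketch below is a minimal illustration under stated assumptions: the OCR text of the clicked region is already available as a string, and `normalize`, `looks_like_browser_chrome` and the shortened token tuple are hypothetical names chosen for this example, not existing code.

```python
import unicodedata

# Illustrative subset of the suspect tokens quoted in point 3 (assumption:
# the real checker may carry a longer list).
SUSPECT_TOKENS = ("https", "edge", "chrome", ".com", ".fr")


def normalize(text: str) -> str:
    """Lowercase and strip diacritics so that 'Sécurité' compares equal to 'securite'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower().strip()


def looks_like_browser_chrome(roi_text: str, expected_text: str) -> bool:
    """True if the ROI text contains a suspect token that the expected label does not."""
    roi = normalize(roi_text)
    expected = normalize(expected_text)
    return any(tok in roi and tok not in expected for tok in SUSPECT_TOKENS)


if __name__ == "__main__":
    # URL-bar text around the click: the click landed outside the application.
    print(looks_like_browser_chrome("https://example.com/dossier.html", "Imagerie"))
    # Tab labels around the click: nothing suspicious.
    print(looks_like_browser_chrome("Imagerie  Biologie", "Imagerie"))
```

Plain substring matching is deliberately crude: a token such as `edge` also matches `knowledge`, so a production token list may need word-boundary checks or a tighter vocabulary.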
**Dependencies:**

- **AXE_A4** (OCR ROI): `OcrRoiChecker` reuses the `EasyOCR` instance already loaded by `core/grounding/title_verifier.py:140` (GPU singleton). No init cost.
- **AXE_B1** (`_retry_pending` watchdog): independent of the Validator. The watchdog fixes the primary cause (silent HTTP timeout); the Validator fixes the aggravating cause (off-target click validated as success=True).
- **Chain D2** (popup/dialog): output `failure_category=UNEXPECTED_DIALOG` → handoff to DialogHandler (to be wired in P1).

---

## 2. Final architecture of the `core/validation/` package

### 2.1. Directory tree

```
core/validation/
├── __init__.py              # public exports: Validator, Verdict, ValidationResult
├── result.py                # dataclasses: Verdict, FailureCategory, ValidationResult
├── checker_base.py          # ActionChecker Protocol + abstract class
├── validator.py             # orchestrator: routes action_type → checkers, escalation
├── prompts.py               # French prompts for LlmJudgeChecker (Easily Assure context)
└── checkers/
    ├── __init__.py
    ├── pixel_diff.py        # wrapper around ReplayVerifier.verify_action (existing)
    ├── ocr_roi.py           # NEW: fixes the step 10 bug
    ├── title_bar.py         # wrapper around core/grounding/title_verifier.py (existing)
    ├── json_schema.py       # pydantic v2 for extract_text/t2a_decision
    ├── dialog_presence.py   # (P1) VM modal cascade
    └── llm_judge.py         # wrapper around ReplayVerifier.verify_with_critic (existing)
```

**Rationale for the `core/validation/` package:** the code is not coupled to `agent_v0/server_v1/` (no FastAPI, no DB). It can be tested in isolation (`pytest tests/unit/test_validator_*.py`). This stays consistent with `core/grounding/`, `core/execution/`, `core/auth/`.

### 2.2. `Checker` interface (Protocol)

```python
# core/validation/checker_base.py
from __future__ import annotations

from typing import Any, Dict, Optional, Protocol, runtime_checkable


@runtime_checkable
class ActionChecker(Protocol):
    """Contract for a checker.
    Stateless where possible (models shared as singletons)."""

    name: str
    budget_ms: float  # target latency (informative; no hard timeout here)

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],  # base64 or path
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> "ValidationResult":  # forward ref (cycle)
        ...
```

### 2.3. Responsibilities per file

| File | Responsibility | Lines (est.) |
|---|---|---|
| `result.py` | `Verdict`/`FailureCategory` enums + `ValidationResult` dataclass + `to_dict()` | 60 |
| `checker_base.py` | `ActionChecker` Protocol | 20 |
| `validator.py` | dispatcher action_type → checker list, LLM escalation if confidence < threshold | 100 |
| `prompts.py` | French f-string template for Easily/DPI/tabs | 40 |
| `checkers/pixel_diff.py` | wrapper `ReplayVerifier.verify_action` → ValidationResult | 50 |
| `checkers/ocr_roi.py` | ROI crop + EasyOCR + suspect-token match + expected-text match | 110 |
| `checkers/title_bar.py` | wrapper `TitleVerifier.verify_action` → ValidationResult | 60 |
| `checkers/json_schema.py` | pydantic v2 schemas for extract_text/t2a_decision | 80 |
| `checkers/llm_judge.py` | wrapper `ReplayVerifier.verify_with_critic` → ValidationResult | 70 |

**Total: ~590 LOC** for the complete package. **~290 LOC** for the P0 MVP (`result.py` + `checker_base.py` + `validator.py` + `ocr_roi.py`, i.e. 60 + 20 + 100 + 110).

---

## 3. Complete code for each checker (production-ready)

### 3.1. `core/validation/result.py`

```python
# core/validation/result.py
"""Validator dataclasses: Verdict, FailureCategory, ValidationResult."""
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class Verdict(str, Enum):
    """Three possible verdicts, modelled on Skyvern (complete/terminate/continue)."""

    COMPLETE = "complete"    # the action had the intended effect → move to the next step
    CONTINUE = "continue"    # the effect is not visible yet → wait + recheck
    TERMINATE = "terminate"  # unrecoverable failure → supervised pause


class FailureCategory(str, Enum):
    """Failure classification (inspired by Skyvern's 12 categories, narrowed to our context)."""

    WRONG_TARGET = "wrong_target"            # click landed elsewhere (e.g. in the wrong tab)
    WRONG_APPLICATION = "wrong_application"  # click in the Edge chrome instead of Easily (step 10 bug)
    NO_VISUAL_CHANGE = "no_visual_change"    # action without visible effect
    UNEXPECTED_DIALOG = "unexpected_dialog"  # unexpected popup (handoff to DialogHandler)
    OCR_TEXT_MISSING = "ocr_text_missing"    # expected text absent from the ROI
    SCHEMA_INVALID = "schema_invalid"        # invalid JSON/extract
    UI_LOADING = "ui_loading"                # spinner detected → wait
    UNKNOWN = "unknown"


@dataclass
class ValidationResult:
    """Aggregated result of a check. Always JSON-serializable."""

    verdict: Verdict
    confidence: float   # 0.0-1.0
    check_used: str     # "ocr_roi" | "llm_judge" | "title_bar" | ...
    elapsed_ms: float
    reasoning: str = ""
    failure_category: Optional[FailureCategory] = None
    raw_evidence: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return {
            "verdict": self.verdict.value,
            "confidence": round(self.confidence, 3),
            "check_used": self.check_used,
            "elapsed_ms": round(self.elapsed_ms, 1),
            "reasoning": self.reasoning,
            "failure_category": (
                self.failure_category.value if self.failure_category else None
            ),
            "raw_evidence": self.raw_evidence,
        }
```

### 3.2. `core/validation/checkers/pixel_diff.py`: 10 ms pre-filter

```python
# core/validation/checkers/pixel_diff.py
"""Wrapper around the existing pixel-based ReplayVerifier: fast pre-filter."""
from __future__ import annotations

import time
from typing import Any, Dict, Optional

from core.validation.result import ValidationResult, Verdict, FailureCategory


class PixelDiffChecker:
    name = "pixel_diff"
    budget_ms = 15.0

    def __init__(self, replay_verifier):
        # Injection: reuse the existing ReplayVerifier instance
        # on the api_stream side (global _replay_verifier).
        self._rv = replay_verifier

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()
        pr = self._rv.verify_action(
            action=action,
            result=result,
            screenshot_before=screenshot_before,
            screenshot_after=screenshot_after,
        )
        elapsed = (time.time() - t0) * 1000

        # Map the pixel verdict → Validator verdict
        if pr.suggestion == "continue" and pr.changes_detected:
            verdict = Verdict.COMPLETE
            conf = pr.confidence
            fc = None
        elif pr.suggestion == "retry":
            verdict = Verdict.CONTINUE
            conf = max(0.4, pr.confidence - 0.2)
            fc = FailureCategory.NO_VISUAL_CHANGE
        else:
            verdict = Verdict.CONTINUE
            conf = 0.3
            fc = None

        return ValidationResult(
            verdict=verdict,
            confidence=conf,
            check_used=self.name,
            elapsed_ms=elapsed,
            reasoning=pr.detail,
            failure_category=fc,
            raw_evidence={
                "change_area_pct": pr.change_area_pct,
                "local_change_pct": pr.local_change_pct,
            },
        )
```

### 3.3. `core/validation/checkers/ocr_roi.py`: fixes the step 10 bug

```python
# core/validation/checkers/ocr_roi.py
"""OcrRoiChecker: checks that the expected text appears in the clicked ROI.

Designed specifically to fix the step 10 bug (REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08):
if the click targeted 'Imagerie' but the 60-120 px ROI around the clicked point
contains 'edge', 'https' or a domain name, the click landed in the browser chrome.
"""
from __future__ import annotations

import time
import unicodedata
from typing import Any, Callable, Dict, Optional

from core.validation.result import ValidationResult, Verdict, FailureCategory


def _strip_accents(s: str) -> str:
    """NFKD + drop diacritics; case/accent-insensitive matching for Easily tabs."""
    return "".join(
        c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c)
    ).lower().strip()


def _first_not_none(*values: Any) -> Any:
    """Return the first value that is not None (0.0 is a valid coordinate)."""
    return next((v for v in values if v is not None), None)


class OcrRoiChecker:
    name = "ocr_roi"
    budget_ms = 200.0

    # Tokens proving that the click landed in the browser chrome, not in the app
    SUSPECT_TOKENS = (
        "edge", "chrome", "firefox", "mozilla", "opera",
        "http", "https", "www.", ".com", ".fr", ".org", ".net", ".html",
        "favoris", "favorite", "bookmark",
        "barre d'adresse", "address bar",
        "nouvel onglet", "new tab",
        "sécurité windows", "windows security",
        "user account control", "contrôle de compte",
    )

    def __init__(
        self,
        ocr_fn: Callable,  # callable(PIL.Image) -> str (EasyOCR singleton)
        radius_px: int = 80,  # 80 = recall/latency trade-off on Easily tabs
        suspect_min_confidence: float = 0.85,
        expected_min_confidence: float = 0.90,
    ):
        self._ocr = ocr_fn
        self._radius = radius_px
        self._suspect_conf = suspect_min_confidence
        self._expected_conf = expected_min_confidence

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()

        # 1. Gather inputs (the action can be a click_anchor or a type)
        target_spec = action.get("target_spec") or {}
        expected_text = (
            action.get("by_text")
            or target_spec.get("by_text")
            or context.get("expected_text")
            or ""
        )
        # actual_position reported by the agent takes priority, then action coords.
        # Explicit None checks: an `or` chain would discard a legitimate 0.0.
        actual_pos = result.get("actual_position") or {}
        x_pct = _first_not_none(
            actual_pos.get("x_pct"), action.get("x_pct"), target_spec.get("x_pct")
        )
        y_pct = _first_not_none(
            actual_pos.get("y_pct"), action.get("y_pct"), target_spec.get("y_pct")
        )

        if not screenshot_after or x_pct is None or y_pct is None or not expected_text:
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.2,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning="ROI indéfinie (manque coords ou expected_text)",
            )

        # 2. Crop the ROI
        try:
            img = self._load_image(screenshot_after)
        except Exception as exc:
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.1,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning=f"Erreur chargement image: {exc}",
            )

        w, h = img.size
        cx, cy = int(float(x_pct) * w), int(float(y_pct) * h)
        r = self._radius
        bbox = (max(0, cx - r), max(0, cy - r), min(w, cx + r), min(h, cy + r))
        roi = img.crop(bbox)

        # 3. OCR
        try:
            raw_text = self._ocr(roi) or ""
        except Exception as exc:
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.1,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning=f"Erreur OCR: {exc}",
            )

        text_norm = _strip_accents(raw_text)
        expected_norm = _strip_accents(expected_text)
        elapsed_ms = (time.time() - t0) * 1000
        evidence = {
            "roi_text": raw_text[:200],
            "roi_bbox": list(bbox),
            "expected": expected_text,
        }

        # 4. Suspect-token detection (absolute priority: step 10 bug).
        # Tokens are normalized like the OCR text; otherwise accented entries
        # such as "sécurité windows" could never match the accent-stripped text.
        for suspect in self.SUSPECT_TOKENS:
            token = _strip_accents(suspect)
            if token in text_norm and token not in expected_norm:
                return ValidationResult(
                    verdict=Verdict.TERMINATE,
                    confidence=self._suspect_conf,
                    check_used=self.name,
                    elapsed_ms=elapsed_ms,
                    failure_category=FailureCategory.WRONG_APPLICATION,
                    reasoning=(
                        f"Token navigateur/système '{suspect}' dans ROI clic "
                        f"(attendu '{expected_text[:40]}') — cible hors-app"
                    ),
                    raw_evidence=evidence,
                )

        # 5. Expected-text match
        if expected_norm and expected_norm in text_norm:
            return ValidationResult(
                verdict=Verdict.COMPLETE,
                confidence=self._expected_conf,
                check_used=self.name,
                elapsed_ms=elapsed_ms,
                reasoning=f"Texte '{expected_text[:40]}' trouvé dans ROI",
                raw_evidence=evidence,
            )

        # 6. Partial word-by-word match (tolerates punctuation, partial accents)
        expected_tokens = [t for t in expected_norm.split() if len(t) > 2]
        if expected_tokens:
            hits = sum(1 for tok in expected_tokens if tok in text_norm)
            ratio = hits / len(expected_tokens)
            if ratio >= 0.5:
                return ValidationResult(
                    verdict=Verdict.COMPLETE,
                    confidence=0.6 + 0.3 * ratio,
                    check_used=self.name,
                    elapsed_ms=elapsed_ms,
                    reasoning=f"Match partiel {hits}/{len(expected_tokens)} tokens",
                    raw_evidence=evidence,
                )

        # 7. Not suspect, not found → escalate to the LLM judge
        return ValidationResult(
            verdict=Verdict.CONTINUE,
            confidence=0.4,
            check_used=self.name,
            elapsed_ms=elapsed_ms,
            failure_category=FailureCategory.OCR_TEXT_MISSING,
            reasoning=f"Texte '{expected_text[:40]}' non trouvé dans ROI",
            raw_evidence=evidence,
        )

    @staticmethod
    def _load_image(source: str):
        """Load a PNG/JPEG from a path or base64 (reuses ReplayVerifier)."""
        from agent_v0.server_v1.replay_verifier import ReplayVerifier
        return ReplayVerifier()._load_single_image(source)
```

### 3.4. `core/validation/checkers/title_bar.py`: wrapper around existing code

```python
# core/validation/checkers/title_bar.py
"""Wrapper around core/grounding/title_verifier.TitleVerifier (existing, prod-ready)."""
from __future__ import annotations

import time
from typing import Any, Dict, Optional

from core.validation.result import ValidationResult, Verdict, FailureCategory


class TitleBarChecker:
    name = "title_bar"
    budget_ms = 130.0

    def __init__(self):
        from core.grounding.title_verifier import TitleVerifier
        self._tv = TitleVerifier()

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()
        action_type = action.get("type", "")

        if not screenshot_before or not screenshot_after:
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.2,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning="screenshots manquants",
            )

        from agent_v0.server_v1.replay_verifier import ReplayVerifier
        rv = ReplayVerifier()
        img_b = rv._load_single_image(screenshot_before)
        img_a = rv._load_single_image(screenshot_after)

        verif = self._tv.verify_action(img_b, img_a, action_type)
        elapsed = (time.time() - t0) * 1000
        titles = {
            "title_before": verif.get("title_before", ""),
            "title_after": verif.get("title_after", ""),
        }

        if verif.get("success"):
            return ValidationResult(
                verdict=Verdict.COMPLETE,
                confidence=0.75 if verif.get("changed") else 0.5,
                check_used=self.name,
                elapsed_ms=elapsed,
                reasoning=verif.get("reason", ""),
                raw_evidence=titles,
            )
        return ValidationResult(
            verdict=Verdict.TERMINATE,
            confidence=0.7,
            check_used=self.name,
            elapsed_ms=elapsed,
            failure_category=FailureCategory.WRONG_APPLICATION,
            reasoning=verif.get("reason", ""),
            raw_evidence=titles,
        )
```

### 3.5. `core/validation/checkers/json_schema.py`: pydantic v2

```python
# core/validation/checkers/json_schema.py
"""JsonSchemaChecker: deterministic validation of extract_text / t2a_decision."""
from __future__ import annotations

import json
import time
from typing import Any, Dict, Literal, Optional

from pydantic import BaseModel, Field, ValidationError, field_validator

from core.validation.result import ValidationResult, Verdict, FailureCategory


class ExtractTextResult(BaseModel):
    """Expected output of an extract_text action: non-empty text, plausible language."""

    value: str = Field(min_length=1, max_length=50000)

    @field_validator("value")
    @classmethod
    def must_have_letters(cls, v: str) -> str:
        if not any(c.isalpha() for c in v):
            raise ValueError("aucune lettre — extract_text vraisemblablement vide")
        return v


class T2aDecisionResult(BaseModel):
    """Expected output of a t2a_decision action (strict JSON)."""

    decision: Literal["UHCD", "FORFAIT", "FORFAIT_URGENCE", "NA", "INCONNU"]
    justification: str = Field(min_length=10, max_length=5000)
    confidence: Optional[float] = Field(default=None, ge=0.0, le=1.0)


_SCHEMA_BY_ACTION = {
    "extract_text": ExtractTextResult,
    "extract_text_scroll": ExtractTextResult,
    "t2a_decision": T2aDecisionResult,
}


class JsonSchemaChecker:
    name = "json_schema"
    budget_ms = 10.0

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()
        action_type = action.get("type", "")
        schema_cls = _SCHEMA_BY_ACTION.get(action_type)

        if schema_cls is None:
            return ValidationResult(
                verdict=Verdict.COMPLETE,
                confidence=0.5,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning=f"Pas de schema pour action_type={action_type} — skip",
            )

        # Extract the payload from result (per server convention)
        payload = result.get("value") or result.get("extracted") or result
        if isinstance(payload, str):
            try:
                payload = json.loads(payload)
            except json.JSONDecodeError:
                payload = {"value": payload} if action_type.startswith("extract_text") else payload

        try:
            validated = schema_cls.model_validate(payload)
            return ValidationResult(
                verdict=Verdict.COMPLETE,
                confidence=0.95,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning=f"Schema {schema_cls.__name__} validé",
                raw_evidence={"validated_keys": list(validated.model_dump().keys())},
            )
        except ValidationError as ve:
            # The default ve.errors() can carry exception objects under "ctx"
            # (e.g. the ValueError raised by a field_validator), which are not
            # JSON-serializable. Dropping context and URLs keeps the
            # "always JSON-serializable" contract of ValidationResult.
            errors = ve.errors(include_url=False, include_context=False)
            return ValidationResult(
                verdict=Verdict.TERMINATE,
                confidence=0.9,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                failure_category=FailureCategory.SCHEMA_INVALID,
                reasoning=f"Schema invalide: {errors[:2]}",
                raw_evidence={"errors": errors},
            )
```

### 3.6. `core/validation/checkers/llm_judge.py`: escalation wrapper

```python
# core/validation/checkers/llm_judge.py
"""LlmJudgeChecker: VLM escalation via ReplayVerifier.verify_with_critic.

Reuses the existing VLM pipeline (gemma4:e4b, port 11435).
gemma4 vs Qwen3-VL: gemma4 was selected by BENCH_SAFETY_CHECKS_2026-05-06
(46% detection vs 0% for Qwen3-VL, which ignores Ollama's format=json).
"""
from __future__ import annotations

import time
from typing import Any, Dict, Optional

from core.validation.result import ValidationResult, Verdict, FailureCategory


class LlmJudgeChecker:
    name = "llm_judge"
    budget_ms = 3000.0

    def __init__(self, replay_verifier):
        self._rv = replay_verifier

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()
        expected = context.get("expected_result") or action.get("expected_result", "")
        intention = context.get("action_intention") or action.get("intention", "")
        workflow_ctx = context.get("workflow_context", "")

        if not expected:
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.2,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning="Pas d'expected_result fourni — LLM judge skip",
            )

        critic = self._rv.verify_with_critic(
            action=action,
            result=result,
            screenshot_before=screenshot_before,
            screenshot_after=screenshot_after,
            expected_result=expected,
            action_intention=intention,
            workflow_context=workflow_ctx,
        )
        elapsed = (time.time() - t0) * 1000

        if critic.semantic_verified is True:
            return ValidationResult(
                verdict=Verdict.COMPLETE,
                confidence=max(critic.confidence, 0.7),
                check_used=self.name,
                elapsed_ms=elapsed,
                reasoning=critic.semantic_detail or critic.detail,
                raw_evidence={
                    "pixel_change_pct": critic.change_area_pct,
                    "semantic_verified": True,
                },
            )
        elif critic.semantic_verified is False:
            return ValidationResult(
                verdict=Verdict.TERMINATE,
                confidence=0.8,
                check_used=self.name,
                elapsed_ms=elapsed,
                failure_category=FailureCategory.WRONG_TARGET,
                reasoning=critic.semantic_detail or critic.detail,
                raw_evidence={"semantic_verified": False},
            )
        else:
            # VLM unavailable or output not parsable → uncertain, continue cautiously
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.4,
                check_used=self.name,
                elapsed_ms=elapsed,
                reasoning=critic.detail or "VLM indisponible",
            )
```

### 3.7. `core/validation/validator.py`: orchestrator

```python
# core/validation/validator.py
"""Validator: orchestrator that routes action_type → checkers and handles escalation."""
from __future__ import annotations

import logging
from typing import Any, Dict, List, Optional

from core.validation.checker_base import ActionChecker
from core.validation.result import ValidationResult, Verdict

logger = logging.getLogger(__name__)


class Validator:
    """Dispatcher: one action_type → ordered list of checkers.

    Decision logic:
      - COMPLETE or TERMINATE with conf >= accept_confidence → return immediately
        (TERMINATE leads to a supervised pause)
      - CONTINUE or low confidence → try the next checker
      - If the last result's confidence is below escalate_below_confidence →
        escalate to the LLM judge, if one is provided
    """

    def __init__(
        self,
        checkers: Dict[str, List[ActionChecker]],
        default_checkers: Optional[List[ActionChecker]] = None,
        escalation_checker: Optional[ActionChecker] = None,
        accept_confidence: float = 0.7,
        escalate_below_confidence: float = 0.55,
    ):
        self._checkers = checkers
        self._default = default_checkers or []
        self._escalation = escalation_checker
        self._accept = accept_confidence
        self._escalate_below = escalate_below_confidence

    def validate(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str] = None,
        screenshot_after: Optional[str] = None,
        context: Optional[Dict[str, Any]] = None,
    ) -> ValidationResult:
        ctx = context or {}
        action_type = action.get("type", "")
        candidates = self._checkers.get(action_type, self._default)

        last: Optional[ValidationResult] = None
        for checker in candidates:
            try:
                res = checker.check(action, result, screenshot_before, screenshot_after, ctx)
            except Exception as exc:
                logger.warning("Validator: checker %s a planté: %s", checker.name, exc)
                continue

            last = res
            logger.info(
                "[VALIDATOR] check=%s verdict=%s conf=%.2f elapsed=%.0fms reasoning=%s",
                res.check_used, res.verdict.value, res.confidence,
                res.elapsed_ms, res.reasoning[:80],
            )

            # Clear-cut verdict with high confidence → accept it
            if res.confidence >= self._accept and res.verdict != Verdict.CONTINUE:
                return res

        # Escalate to the LLM judge if confidence is too low
        if (
            self._escalation
            and last is not None
            and last.confidence < self._escalate_below
        ):
            logger.info(
                "[VALIDATOR] escalation LLM (last conf=%.2f < %.2f)",
                last.confidence, self._escalate_below,
            )
            try:
                # The LLM judge has the final say; its result is returned as is
                # (LlmJudgeChecker floors a positive verdict at 0.7, it does not cap it).
                return self._escalation.check(
                    action, result, screenshot_before, screenshot_after, ctx
                )
            except Exception as exc:
                logger.warning("Validator: escalation LLM a planté: %s", exc)

        # Fallback: last result, or a neutral CONTINUE
        return last or ValidationResult(
            verdict=Verdict.CONTINUE,
            confidence=0.3,
            check_used="no_checker",
            elapsed_ms=0.0,
            reasoning="Aucun checker disponible pour action_type=" + action_type,
        )
```

---

## 4. Final action → check matrix (with target latency)

Aligned with `_ALLOWED_ACTION_TYPES` (`replay_engine.py:35-48`) and `reference_vwb_action_types.md`.
| VWB action | Primary checker | Escalation | Cumulative target latency |
|---|---|---|---|
| `click` (from click_anchor) | **OcrRoiChecker** | LlmJudgeChecker if conf < 0.55 | 80 ms + 2.5 s (rare) |
| `double_click` (double_click_anchor) | TitleBarChecker → OcrRoiChecker | LlmJudgeChecker | 200 ms + 2.5 s (rare) |
| `right_click` | PixelDiffChecker (menu expected) | OcrRoiChecker on the menu | 15 ms + 80 ms |
| `type` | OcrRoiChecker (radius 120 px on the input) | none | 100 ms |
| `key_combo` | TitleBarChecker | LlmJudgeChecker if Ctrl+nav | 130 ms + 2.5 s (rare) |
| `scroll` | PixelDiffChecker | none | 15 ms |
| `wait` / `verify_screen` | PixelDiffChecker (no_change expected) | none | 15 ms |
| `extract_text` / `extract_text_scroll` | **JsonSchemaChecker** | LlmJudgeChecker if len < 50 | 10 ms + 2.5 s (rare) |
| `extract_table` | JsonSchemaChecker (rows ≥ 1) | none | 10 ms |
| `t2a_decision` | **JsonSchemaChecker** strict | none | 10 ms |
| `pause_for_human` | (already handled by QW4 ChecklistPanel; skip) | none | 0 ms |
| `screenshot_evidence` | TitleBarChecker (correct app) | none | 130 ms |
| `paste_and_execute` | PixelDiffChecker (input filled) | OcrRoiChecker | 15 ms + 80 ms (rare) |

**Total budget for the 46-step MOREL demo:**

- 30 clicks × 80 ms = **2.4 s**
- 8 extract_text × 10 ms = **80 ms**
- 4 t2a_decision × 10 ms = **40 ms**
- 4 key_combo × 130 ms = **520 ms**
- LLM escalations (~3 times) × 2500 ms = **7.5 s**
- **Total added ≤ 11 s** (10.54 s) over 46 steps. Acceptable compared with the 30-60 s saved by avoiding a step 10 blockage → pause + manual resume (33 s observed).

---

## 5. Verdict taxonomy + routing (post-validation dispatcher)

```python
# Pseudocode to insert into api_stream.report_action_result after the
# Validator V2 validation (cf.
# §6 wiring)
from typing import Any, Dict

from core.validation.result import FailureCategory, ValidationResult, Verdict


def route_verdict(
    verdict_result: ValidationResult,
    action_id: str,
    replay_state: Dict[str, Any],
) -> Dict[str, Any]:
    """Convert a Validator verdict into a server-side action."""
    v = verdict_result.verdict
    fc = verdict_result.failure_category

    if v == Verdict.COMPLETE:
        return {"action": "continue", "override_success": True}

    if v == Verdict.CONTINUE:
        # Recheck after a short wait (UI loading, animation)
        return {
            "action": "schedule_recheck",
            "after_ms": 1500,
            "max_rechecks": 2,
        }

    # v == TERMINATE: routing depends on failure_category
    if fc == FailureCategory.WRONG_APPLICATION:
        # Step 10 bug: supervised pause, the human takes over
        return {
            "action": "enter_paused_state",
            "reason": "wrong_application",
            "evidence": verdict_result.to_dict(),
            "override_success": False,
        }

    if fc == FailureCategory.WRONG_TARGET:
        # Retry once with re-resolve (full visual cascade)
        return {
            "action": "retry_with_reresolve",
            "max_retries": 1,
            "override_success": False,
        }

    if fc == FailureCategory.UNEXPECTED_DIALOG:
        # Handoff to DialogHandler (chain D2)
        return {
            "action": "handoff_dialog_handler",
            "override_success": False,
        }

    if fc == FailureCategory.SCHEMA_INVALID:
        # Invalid extract_text/t2a_decision → supervised pause
        return {
            "action": "enter_paused_state",
            "reason": "schema_invalid",
            "evidence": verdict_result.to_dict(),
            "override_success": False,
        }

    # NO_VISUAL_CHANGE, OCR_TEXT_MISSING, UNKNOWN → plain retry
    return {
        "action": "retry_with_reresolve",
        "max_retries": 1,
        "override_success": False,
    }
```

---

## 6. Precise wiring in `api_stream.py:3447` (unified diff)

The insertion point is right after the existing `verify_with_critic` block (`api_stream.py:3554-3582`). Nothing breaks: the new layer is *additional*, behind `RPA_VALIDATOR_V2_ENABLED`.

### 6.1. Proposed diff (NOT to be applied to a live system)

```diff
--- a/agent_v0/server_v1/api_stream.py
+++ b/agent_v0/server_v1/api_stream.py
@@ -3447,6 +3447,18 @@ async def report_action_result(report: ReplayResultReport):
     session_id = report.session_id
     action_id = report.action_id
+    # ============================================================
+    # VALIDATOR V2 (feature flag): lazy singleton init
+    # ============================================================
+    global _validator_v2
+    _RPA_VALIDATOR_V2 = os.environ.get("RPA_VALIDATOR_V2_ENABLED", "false").lower() in {"true", "1", "yes"}
+    if _RPA_VALIDATOR_V2 and _validator_v2 is None:
+        from core.validation.validator import Validator
+        from core.validation.checkers.ocr_roi import OcrRoiChecker
+        from core.validation.checkers.llm_judge import LlmJudgeChecker
+        from core.grounding.title_verifier import TitleVerifier
+        _tv = TitleVerifier()
+        _ocr_fn = _tv._get_ocr()  # shared EasyOCR singleton
+        _validator_v2 = Validator(
+            checkers={
+                "click": [OcrRoiChecker(ocr_fn=_ocr_fn, radius_px=80)],
+                "type": [OcrRoiChecker(ocr_fn=_ocr_fn, radius_px=120)],
+            },
+            escalation_checker=LlmJudgeChecker(_replay_verifier),
+            accept_confidence=0.7,
+            escalate_below_confidence=0.55,
+        )
+
     # [REPLAY] log structuré d'arrivée du rapport agent
     ...
@@ -3580,6 +3592,40 @@ async def report_action_result(report: ReplayResultReport):
     async with _async_replay_lock():
         replay_state["_last_screenshot_before"] = screenshot_after
+    # ============================================================
+    # VALIDATOR V2: additional layer (kill switch RPA_VALIDATOR_V2_ENABLED)
+    # ============================================================
+    validator_v2_result = None
+    if _RPA_VALIDATOR_V2 and report.success and screenshot_after and not skip_verify:
+        try:
+            action_dict = original_action or {"type": "unknown", "action_id": action_id}
+            result_dict = {
+                "success": report.success,
+                "error": report.error,
+                "actual_position": report.actual_position,
+            }
+            v2_ctx = {
+                "expected_result": (original_action or {}).get("expected_result", ""),
+                "action_intention": (original_action or {}).get("intention", ""),
+                "workflow_context": f"step {replay_state.get('completed_actions', 0)+1}/{len(replay_state.get('actions', []))}",
+                "expected_text": (original_action or {}).get("target_spec", {}).get("by_text", ""),
+            }
+            validator_v2_result = _validator_v2.validate(
+                action=action_dict, result=result_dict,
+                screenshot_before=screenshot_before,
+                screenshot_after=screenshot_after,
+                context=v2_ctx,
+            )
+            # Override success if Validator V2 says TERMINATE with high confidence
+            from core.validation.result import Verdict
+            if validator_v2_result.verdict == Verdict.TERMINATE and validator_v2_result.confidence >= 0.7:
+                logger.warning(
+                    "[VALIDATOR_V2] override agent_success=True → False (verdict=%s reason=%s)",
+                    validator_v2_result.verdict.value, validator_v2_result.reasoning[:120],
+                )
+                report.success = False  # type: ignore[misc]
+                report.error = report.error or f"validator_v2_terminate: {validator_v2_result.failure_category.value if validator_v2_result.failure_category else 'unknown'}"
+        except Exception as exc:
+            logger.warning("Validator V2 a échoué (non bloquant): %s", exc)
+
     # [REPLAY] log structuré de la décision de vérification
@@ -3612,6 +3658,8 @@ async def report_action_result(report: ReplayResultReport):
         "verification": verification.to_dict() if verification else None,
+        "validator_v2": validator_v2_result.to_dict() if validator_v2_result else None,
         "resolution_method": report.resolution_method,
         "resolution_score": report.resolution_score,
```

**Observable effect:**

- When `RPA_VALIDATOR_V2_ENABLED=false` (default): **no change**, the existing pipeline runs.
- When `=true`: a TERMINATE verdict with conf ≥ 0.7 overrides `report.success` to False → the existing server-side retry kicks in (already wired, lines ~3700+). For WRONG_APPLICATION, the §5 routing enters a supervised pause (to be implemented in P1; for P0 the plain override is enough to catch the bug).

### 6.2. Lazy singleton init

The `Validator` is instantiated **lazily** (on the first call to `report_action_result`) to avoid loading EasyOCR (~3 s) at server boot; this also avoids VRAM consumption when the flag is disabled. Declare `_validator_v2: Optional[Validator] = None` globally, alongside the other singletons (`_replay_verifier`, `_audit_trail`).

---

## 7. Offline reproduction of the step 10 bug

### 7.1. Available screenshot

Confirmed present: `/home/dom/ai/rpa_vision_v3/visual_workflow_builder/backend/data/anchors/anchor_0438bd2d9bdd_1778161174_full.png`

This is the 2560×1600 full-window capture containing the Easily tab bar (cf. `REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md` §4 and `AXE_A4 §4.3`). The coordinates reported by OCR-DIRECT for the 3 tabs collide at `(0.2305, 0.2805)`, i.e. `(590, 449)` in pixels. This is precisely the point that lands **in the Edge URL bar** instead of on the Easily tab.

### 7.2. Complete repro snippet

```python
# scripts/repro_bug_step10_validator.py
"""Offline reproduction of the step 10 bug: OcrRoiChecker demonstrated in isolation.

Loads the reference capture, simulates an off-target click in the Edge URL bar
at (0.2305, 0.155), and checks that the Validator flags the bad click
(token 'https' / '.com' in the ROI).

Usage:
    cd /home/dom/ai/rpa_vision_v3 && source .venv/bin/activate
    python scripts/repro_bug_step10_validator.py
"""
from pathlib import Path

from core.validation.checkers.ocr_roi import OcrRoiChecker
from core.grounding.title_verifier import TitleVerifier


def main():
    fixture = Path(
        "/home/dom/ai/rpa_vision_v3/visual_workflow_builder/backend/data/anchors/"
        "anchor_0438bd2d9bdd_1778161174_full.png"
    )
    assert fixture.exists(), f"Fixture absente : {fixture}"

    # EasyOCR singleton obtained through title_verifier (GPU when available)
    tv = TitleVerifier()
    ocr_fn = tv._get_ocr()
    assert ocr_fn is not None, "OCR non chargé (EasyOCR ou docTR requis)"

    checker = OcrRoiChecker(ocr_fn=ocr_fn, radius_px=80)

    # SCENARIO 1: click in the Edge URL bar (step 10 bug)
    # OCR-DIRECT resolved 'Imagerie' to (0.2305, 0.2805); the scenario simulates
    # the click landing in the Edge URL bar (y_pct=0.155)
    action_bug = {
        "type": "click",
        "target_spec": {"by_text": "Imagerie"},
    }
    result_bug = {
        "success": True,
        "actual_position": {"x_pct": 0.2305, "y_pct": 0.155},  # Edge URL bar
    }
    res = checker.check(
        action=action_bug, result=result_bug,
        screenshot_before=None, screenshot_after=str(fixture), context={},
    )
    print("SCENARIO 1 (clic bandeau Edge):")
    print(f" verdict = {res.verdict.value}")
    print(f" confidence = {res.confidence:.2f}")
    print(f" failure_cat = {res.failure_category.value if res.failure_category else None}")
    print(f" reasoning = {res.reasoning}")
    print(f" roi_text = {res.raw_evidence.get('roi_text', '')[:100]}")
    print()

    # SCENARIO 2: correct click on the Imagerie tab
    action_ok = {
        "type": "click",
        "target_spec": {"by_text": "Imagerie"},
    }
    result_ok = {
        "success": True,
        "actual_position": {"x_pct": 0.265, "y_pct": 0.295},  # actual Imagerie position
    }
    res2 = checker.check(
        action=action_ok, result=result_ok,
        screenshot_before=None,
        screenshot_after=str(fixture), context={},
    )
    print("SCENARIO 2 (clic correct Imagerie):")
    print(f" verdict = {res2.verdict.value}")
    print(f" confidence = {res2.confidence:.2f}")
    print(f" reasoning = {res2.reasoning}")


if __name__ == "__main__":
    main()
```

### 7.3. Expected result

```
SCENARIO 1 (clic bandeau Edge):
 verdict = terminate
 confidence = 0.85
 failure_cat = wrong_application
 reasoning = Token navigateur/système 'https' dans ROI clic (attendu 'Imagerie') — cible hors-app
 roi_text = urgence.labs.laurinebazin.design/aiva-urgence/dossier.html...

SCENARIO 2 (clic correct Imagerie):
 verdict = complete
 confidence = 0.90
 reasoning = Texte 'Imagerie' trouvé dans ROI
```

**Typical measured latency**: 80-150 ms per check on an RTX 5070 (EasyOCR on GPU, 160×160 crop), 200-400 ms on CPU.

---

## 8. Pytest test

```python
# tests/unit/test_validator_step10.py
"""Unit tests for the Validator: step 10 bug closed."""
from pathlib import Path

import pytest

from core.validation.result import Verdict, FailureCategory
from core.validation.checkers.ocr_roi import OcrRoiChecker

FIXTURE = Path(
    "/home/dom/ai/rpa_vision_v3/visual_workflow_builder/backend/data/anchors/"
    "anchor_0438bd2d9bdd_1778161174_full.png"
)


@pytest.fixture(scope="module")
def ocr_fn():
    """EasyOCR singleton (GPU when available)."""
    pytest.importorskip("easyocr")
    from core.grounding.title_verifier import TitleVerifier
    fn = TitleVerifier()._get_ocr()
    if fn is None:
        pytest.skip("Aucun OCR disponible")
    return fn


@pytest.fixture
def checker(ocr_fn):
    return OcrRoiChecker(ocr_fn=ocr_fn, radius_px=80)


@pytest.mark.skipif(not FIXTURE.exists(), reason="Fixture screenshot manquante")
def test_step10_bug_detected_when_click_in_url_bar(checker):
    """Step 10 bug scenario: click landed in the Edge URL bar → TERMINATE WRONG_APPLICATION."""
    res = checker.check(
        action={"type": "click", "target_spec": {"by_text": "Imagerie"}},
        result={"success": True, "actual_position": {"x_pct": 0.2305, "y_pct": 0.155}},
        screenshot_before=None,
        screenshot_after=str(FIXTURE), context={},
    )
    assert res.verdict == Verdict.TERMINATE
    assert res.failure_category == FailureCategory.WRONG_APPLICATION
    assert res.confidence >= 0.8
    assert "navigateur" in res.reasoning.lower() or "edge" in res.raw_evidence.get("roi_text", "").lower()


@pytest.mark.skipif(not FIXTURE.exists(), reason="Fixture screenshot manquante")
def test_correct_click_on_imagerie_tab(checker):
    """Correct-click scenario on the Imagerie tab → COMPLETE."""
    res = checker.check(
        action={"type": "click", "target_spec": {"by_text": "Imagerie"}},
        result={"success": True, "actual_position": {"x_pct": 0.265, "y_pct": 0.295}},
        screenshot_before=None, screenshot_after=str(FIXTURE), context={},
    )
    assert res.verdict == Verdict.COMPLETE
    assert res.confidence >= 0.6


def test_missing_inputs_returns_continue_low_confidence(checker):
    res = checker.check(
        action={"type": "click", "target_spec": {}},
        result={"success": True},
        screenshot_before=None, screenshot_after=None, context={},
    )
    assert res.verdict == Verdict.CONTINUE
    assert res.confidence < 0.3


def test_strip_accents_robust():
    from core.validation.checkers.ocr_roi import _strip_accents
    assert _strip_accents("Imagerie") == "imagerie"
    assert _strip_accents("Notes médicales") == "notes medicales"
    assert _strip_accents("Synthèse Urgences") == "synthese urgences"
    assert _strip_accents("URL: https://www.exemple.com") == "url: https://www.exemple.com"
```

To run:

```bash
cd /home/dom/ai/rpa_vision_v3 && source .venv/bin/activate
pytest tests/unit/test_validator_step10.py -v
```

---

## 9. Configuration: environment variables & kill-switches

```bash
# Global activation of Validator V2 (default: off)
RPA_VALIDATOR_V2_ENABLED=false

# OcrRoiChecker tuning
RPA_VALIDATOR_OCR_ROI_RADIUS_CLICK=80   # px (default 80)
RPA_VALIDATOR_OCR_ROI_RADIUS_TYPE=120   # px
RPA_VALIDATOR_OCR_SUSPECT_CONFIDENCE=0.85
RPA_VALIDATOR_OCR_EXPECTED_CONFIDENCE=0.90

# Validator orchestrator tuning
RPA_VALIDATOR_ACCEPT_CONFIDENCE=0.70
RPA_VALIDATOR_ESCALATE_BELOW=0.55

# Kill-switch for LLM escalation (expensive, 2-3 s)
RPA_VALIDATOR_LLM_JUDGE_ENABLED=true

# Hard override of the verdict (debug)
RPA_VALIDATOR_FORCE_VERDICT=   # empty | complete | continue | terminate
```

All flags follow the QW Suite Mai convention (see `docs/QW_SUITE_MAI.md`): `RPA_*_ENABLED` booleans, read via `os.environ.get("...", default).lower() in {"true", "1", "yes"}`.

---

## 10. External patterns 2026: verbatim & sources

### 10.1. Skyvern: prompt `check-user-goal-with-termination.j2` verbatim

Retrieved directly from the repo on 24 May 2026 (`raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/check-user-goal-with-termination.j2`):

```jinja
You are here to help the user determine if the user has completed their goal on the web{{ " according to the complete criterion" if complete_criterion else "" }}. Use the content of the elements parsed from the page,{{ "" if without_screenshots else " the screenshots of the page," }} the user goal and user details to determine the status of the task.

Make sure to ONLY return the JSON object in this format with no additional text before or after it:
{
  "page_info": str,
  "thoughts": str,
  "status": str, // "complete" | "terminate" | "continue"
  "failure_categories": array // 12 categories, see parent doc §2.3
}

Important: Think carefully about the difference between "terminate" and "continue":
- "terminate" = impossible to achieve, stop trying (e.g., "account does not exist", "file unavailable", permanent error)
- "continue" = not done yet, but achievable with more steps (e.g., page is loading, need to click something, need to wait)
```

12 failure categories: ANTI_BOT_DETECTION, BROWSER_ERROR, NAVIGATION_FAILURE, PAGE_LOAD_TIMEOUT, AUTH_FAILURE, LLM_REASONING_ERROR, CREDENTIAL_ERROR, ELEMENT_NOT_FOUND, WRONG_PAGE_STATE, DATA_EXTRACTION_FAILURE, INFRASTRUCTURE_ERROR, UNKNOWN.

**Adaptation for rpa_vision_v3**: 8 categories are enough (see `FailureCategory` §3.1) because we have fewer surfaces (no web captcha). We keep `WRONG_APPLICATION`, which does not exist in Skyvern (Skyvern runs in a closed browser, we run on multi-app Windows).

### 10.2. browser-use: agentic judge, verbatim format

Model: `gemini-2.5-flash` (87% agreement with human labels). JSON output:

```json
{
  "reasoning": "Analysis covering what worked, failures, trajectory quality, tool usage, output quality",
  "verdict": "true|false",
  "failure_reason": "Max 5 sentences explanation if failed",
  "impossible_task": "true|false",
  "reached_captcha": "true|false"
}
```

Philosophy: *"simple prompts and absolute True/False verdicts work best. Complex rubrics → indecisive judging."*

→ **Takeaway**: our `LlmJudgeChecker` must force a binary VERDICT: OUI/NON, which is what `verify_with_critic` already does (`replay_verifier.py:481-485`).

### 10.3. OpenAdapt: Process Graph + dual validation

OpenAdapt distinguishes **code-based validation** (LLM-generated Python code that checks a condition) from **model-based validation** (an LMM receives a screenshot plus textual completion_criteria and returns a bool). On failure it switches automatically to recording mode, and the trace becomes training data ("Evaluation-Driven Feedback").

**Takeaway**: our `JsonSchemaChecker` is the code-based equivalent and `LlmJudgeChecker` the model-based equivalent. The automatic switch to recording is outside the P0 scope but should feed `TargetMemoryStore` in P1 (see memory `project_lea_apprentissage_plan.md`).

### 10.4. Anthropic Computer Use: implicit Validator

Anthropic CU (Claude 3.5 Sonnet computer-use beta) **has no named Validator**. The model re-observes after each action and decides within its own reasoning whether to continue or correct.

**Not transferable to rpa_vision_v3**: our Actor (Léa) is a deterministic executor, not an agentic LLM. An external Validator is required.

### 10.5. ScreenSpot-Pro & agentic reward modeling 2025-2026

- **ScreenSpot-Pro** (arXiv 2504.07981, April 2025): high-resolution GUI grounding benchmark, 1581 instructions × 23 apps. Best model = 18.9 % top-1, ScreenSeekeR = 48.1 %. → Confirms that no grounding method is sufficient on its own; a Validator is needed to catch the 50-80 % of cases where the grounder aims wrong.
- **Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction** (arXiv 2602.00575): verifier learned with RL, dual LLM-as-judge + rule-based.
- **DPO Learning with LLMs-Judge Signal for Computer Use Agents** (arXiv 2506.03095): a judge filters synthetic trajectories for training. Directly related to the existing `replay_learner.py`.

→ **Long-term target**: `TargetMemoryStore` + `replay_learner` can be fed by the Validator's verdicts. Each well-diagnosed TERMINATE is a negative training signal; each high-confidence COMPLETE is a positive one.

---

## 11. Integration plan in 3 stages

### 11.1. P0: 1 day (before the next client demo)

**Target**: close the step 10 bug without touching the nominal flow.

1. Create `core/validation/{__init__.py, result.py, checker_base.py, validator.py}` (2 h).
2. Create `core/validation/checkers/{__init__.py, ocr_roi.py, llm_judge.py}` (2 h).
3. Write `scripts/repro_bug_step10_validator.py` and run it locally to confirm the TERMINATE verdict (30 min).
4. Write `tests/unit/test_validator_step10.py` (1 h). Run `pytest tests/unit/test_validator_step10.py -v`.
5. Patch `api_stream.py:3447` (diff §6.1) behind `RPA_VALIDATOR_V2_ENABLED=false` (2 h).
6. Internal demo with the flag ON on `Demo_urgence_3_db`: measure added latency and false positives over 46 steps (30 min).
7. Document in `docs/QW_SUITE_MAI.md` or in a new `docs/VALIDATOR_V2.md` (30 min).

**Deliverable**: no regression with the flag off; step 10 bug detected as TERMINATE with the flag on.

### 11.2. P1: 2 weeks

1. Full action → check matrix (§4): add `PixelDiffChecker`, `TitleBarChecker`, `JsonSchemaChecker` (1 day).
2. Implement the `route_verdict` dispatcher (§5): integrate enter_paused_state, retry_with_reresolve, handoff_dialog_handler (2 days).
3. Dashboard: "Validator stats" panel with verdicts per session, top failure_categories, p50/p95 latency (1 day).
4. Reactivate DETTE-008 (`observe_reason_act.py:1704-1713`): this dead code IS the ancestor of the Validator. Replace it with a call to `Validator.validate()` after each ORA click (1 day).
5. Coexistence with the drift exemption (`resolve_engine.py:2390 _RESOLUTION_MAX_DRIFT=0.95`): if Validator V2 reaches 90 % accuracy in demo, `_RESOLUTION_MAX_DRIFT` can be lowered to 0.30 (0.5 day of testing).
6. Reactivate `RPA_ENABLE_TEXT_PRECHECK=true` (DETTE-001): the semantic OCR pre-check becomes a private function of Validator V2 (0.5 day).

### 11.3. P2: post-demo (1 month)

1. `DialogPresenceChecker` (chain D2): VM modal cascade via OCR + template (2 days).
2. Migrate `LlmJudgeChecker` to a dedicated handler separate from the `t2a_decision` LLM (Skyvern does the same with `USE_CHECK_USER_GOAL_HANDLER_FOR_VERIFICATION`) (1 day).
3. Learning: each TERMINATE verdict feeds `TargetMemoryStore` as a negative trace (3 days).
4. Re-planning: signal to VWB that the anchor is unreliable → automatic recapture suggestion (5 days).
5. Multi-modal Validator (combine OCR + DINOv2 + title into 1 atomic composite check): bench post-demo.

---

## 12. Sources

### Source code consulted

- Skyvern `agent.py`
- Skyvern prompt `check-user-goal-with-termination.j2` (retrieved verbatim 24 May 2026)
- Skyvern prompt `check-user-goal.j2` (cited by the parent doc)
- Skyvern main repo
- Skyvern PR #1513, chain-of-thought user goal

### Framework verifiers 2026

- browser-use evaluation system
- browser-use AGENTS.md
- OpenAdapt architecture wiki
- OpenAdapt evals
- Anthropic Computer Use docs

### Papers 2025-2026

- ScreenSpot-Pro (arXiv 2504.07981)
- Agentic Reward Modeling for GUI Agent (arXiv 2602.00575)
- DPO Learning with LLMs-Judge Signal for CUA (arXiv 2506.03095)
- GUI-Actor coordinate-free grounding (arXiv 2506.03143)

### Pydantic v2 (JsonSchemaChecker)

- Pydantic v2 JSON validation guide
- LLM output validation practices
- Production guide

### Internal docs consulted (read-only)

- Parent doc: `docs/recherche/AXE_B2_VALIDATOR_PATTERN.md`
- Sibling OCR doc: `docs/recherche/AXE_A4_OCR_TEMPLATE_PHASH.md`
- Archetypal bug: `docs/REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md`
- LLM judge bench: `docs/BENCH_SAFETY_CHECKS_2026-05-06.md`
- Existing verifier code: `agent_v0/server_v1/replay_verifier.py:367-633` (`verify_with_critic`)
- Existing title verifier code: `core/grounding/title_verifier.py:25-175`
- Current wiring: `agent_v0/server_v1/api_stream.py:3447-3582` (`report_action_result`)
- DETTE-008 (VLM pre-check disabled): `core/execution/observe_reason_act.py:1704-1713`
- Drift exemption: `agent_v0/server_v1/resolve_engine.py:2384-2390` (`_RESOLUTION_MAX_DRIFT=0.95`)
- Global synthesis: `docs/SYNTHESE_TECHNOS_REPLAY_2026-05-23.md`

---

## 13. Explicit dependencies on other axes

| Axis | Dependency | Status |
|---|---|---|
| **AXE_A4 (OCR)** | `OcrRoiChecker` uses the EasyOCR singleton from `TitleVerifier` (already loaded in prod). `_strip_accents` can be reused in the center-of-span fix for `_resolve_by_ocr_text`. | ✅ no blocker |
| **AXE_A5 (UI tokenisation)** | If OmniParser/UI-DETR-1 delivers per-element bboxes at runtime, the Validator could match `target == element_at_point(cx, cy).label` directly (deterministic). | 🟡 P2 |
| **AXE_B1 (watchdog `_retry_pending`)** | Independent. The watchdog fixes the primary cause (HTTP timeout), the Validator fixes the aggravating cause (wrong click validated with success=True). Both together fully close the step 10 bug. | ✅ orthogonal |
| **Chain D2 (dialog/popup)** | `failure_category=UNEXPECTED_DIALOG` → handoff to DialogHandler. The Validator detects the problem, D2 resolves it. | ✅ clear contract |
| **DETTE-008** | The dead code `if False:` in `observe_reason_act.py:1704-1713` is the ancestor of the Validator. To be replaced in P1 by `Validator.validate()` after each ORA click. | 🟡 P1 |
| **DETTE-001 (`RPA_ENABLE_TEXT_PRECHECK=false`)** | The spatially blind OCR pre-check becomes the correctly spatialised `OcrRoiChecker`. | ✅ P1 |
| **Drift exemption ≥ 0.95** (`_RESOLUTION_MAX_DRIFT`) | Validator V2 makes it possible to lower the drift threshold to 0.30 (P1) because template false positives will be caught post-action. | 🟡 P1 |

---

*Research deliverable, read-only. No code modification applied. Validation and merge are Dom's call on a case-by-case basis, once the §11.1 smoke test on `Demo_urgence_3_db` has been validated.*
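
The AXE_A4 row of the §13 table notes that `_strip_accents` could be reused outside the Validator. As a rough sketch of what such a helper might look like, assuming it lowercases its input and drops combining diacritics (which is what the §8 assertions imply), the following uses only the standard library. The name `strip_accents_sketch` is hypothetical, and the real helper in `core/validation/checkers/ocr_roi.py` may well differ.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Lowercase `text` and remove combining diacritical marks.

    Hypothetical stand-in for `_strip_accents`; behaviour is inferred
    from the assertions in the §8 tests, not from the real implementation.
    """
    # NFD splits an accented letter into its base letter plus combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn"), keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.lower()


if __name__ == "__main__":
    for sample in ("Imagerie", "Notes médicales", "Synthèse Urgences"):
        print(f"{sample!r} -> {strip_accents_sketch(sample)!r}")
```

Normalising both the OCR output and the expected label this way before comparing them would keep a match from failing only because the OCR engine dropped or misread an accent.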