docs: add POC specs, handoffs, and research notes

2026-06-02 16:28:34 +02:00
parent 18ed6cb751
commit f2e9aac6b7
86 changed files with 27615 additions and 25 deletions
--- a/docs/recherche/AXE_B2_VALIDATOR_PATTERN.md
+++ b/docs/recherche/AXE_B2_VALIDATOR_PATTERN.md
@@ -0,0 +1,817 @@
+# AXE B2: Planner-Actor-Validator pattern & post-action semantic validation
+
+**Date:** 2026-05-23
+**Author:** dispatched research agent (Claude Opus 4.7 1M)
+**Status:** research deliverable, read-only, NO code changes
+**Dependencies:** AXE_A4 (OCR), AXE_A5 (screen tokenization, already written), AXE_B4 (ORA observe_reason_act)
+
+---
+
+## 1. TL;DR + recommendation
+
+**Finding.** Skyvern (12k stars, SOTA 85.85% on WebVoyager) formalizes the **Validator** as a full-fledged agent, separate from the Planner and the Actor. Its role: after each step, take a fresh capture, ask an LLM (with image + pruned DOM) whether the current objective has been reached, and otherwise return `continue` / `terminate`. This is exactly what rpa_vision_v3 lacks: VWB = static Planner, Léa = Actor, and `replay_verifier.py` is a global pixel diff with no notion of **semantics** ("is the Imagerie tab of the Easily app now active?").
+
+The archetypal bug, step 10 of the GHT demo ("Imagerie" clicked in the Edge browser bar, REPORT success=True), comes down **entirely** to this gap: the global pHash sees movement and concludes OK. A per-step visual Validator would catch it in 1-3 s.
+
+**Design recommendation for rpa_vision_v3** (justified in §6, §9):
+
+1. **Keep** `replay_verifier.verify_action` (pixel) as a 10 ms pre-filter.
+2. **Re-enable and extend** the already-wired `verify_with_critic` (§6) by passing it an `expected_result` **typed** per action.
+3. **Add a pluggable `Validator`** on the server side that picks the check strategy according to `action_type` (matrix in §5). Python implementation: ~250 LOC.
+4. **For the step 10 bug specifically**: `click_anchor` must trigger an OCR-ROI check **around the clicked point** (60 px radius) AND a title-bar check (already done by `core/grounding/title_verifier.py`). If the ROI contains the word Edge, the word URL, or a `.com` domain, the click landed in the wrong place: retry rather than continue.
+5. **Target latency**: pixel 10 ms, OCR-ROI 100 ms, LLM-judge 2-3 s. Run the LLM-judge only if the pixel check **OR** the OCR-ROI check looks suspect.
+
+The Skyvern pattern can be adopted directly. The Skyvern code (Python, AGPL-3.0) shows that the Validator amounts to **5 Jinja2 prompts + 1 `complete_verify` method + 1 `CompleteVerifyResult` dataclass**. Nothing more.
+
+---
+
+## 2. Skyvern Validator in detail (source code as of May 23, 2026)
+
+### 2.1. The `complete_verify` method (excerpted and condensed from `skyvern/forge/agent.py:2609-2730`)
+
+Source : <https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/forge/agent.py#L2609>
+
+Skyvern's Validator is not an exotic subprocess: it is **an LLM coroutine** called after the Actor, at every step that does not already carry a `DecisiveAction` (a terminal action emitted by the Actor itself).
+
+```python
+# skyvern/forge/agent.py (condensed summary of the flow)
+async def complete_verify(
+    self, page: Page, scraped_page: ScrapedPage, task: Task, step: Step
+) -> CompleteVerifyResult:
+    # 1. RE-SCRAPE the page (pruned DOM + screenshots), not the version used by the Actor
+    scraped_page_refreshed = await scraped_page.refresh(draw_boxes=False, scroll=scroll)
+
+    # 2. Build the prompt from: navigation_goal, payload, complete_criterion,
+    #    action_history, parsed elements, datetime
+    template_name = "check-user-goal-with-termination" if use_termination_prompt else "check-user-goal"
+    verification_prompt = load_prompt_with_elements(
+        element_tree_builder=scraped_page_refreshed,
+        template_name=template_name,
+        navigation_goal=task.navigation_goal,
+        navigation_payload=task.navigation_payload,
+        complete_criterion=task.complete_criterion,
+        terminate_criterion=task.terminate_criterion,
+        action_history=actions_and_results_str,
+        local_datetime=...,
+    )
+
+    # 3. LLM call with screenshots; a dedicated LLM handler can be selected
+    #    via the PostHog flag USE_CHECK_USER_GOAL_HANDLER_FOR_VERIFICATION
+    verification_result = await llm_api_handler(
+        prompt=verification_prompt,
+        step=step,
+        screenshots=scraped_page_refreshed.screenshots,
+        prompt_name=prompt_name,
+    )
+
+    # 4. Strict JSON parse → 3 possible verdicts
+    result = CompleteVerifyResult.model_validate(verification_result)
+    if result.is_complete:
+        verification_status = VerificationStatus.complete
+    elif result.is_terminate:
+        verification_status = VerificationStatus.terminate
+    else:
+        verification_status = VerificationStatus.continue_step
+
+    # 5. OTEL trace: verification.status, verification.template, verification.reasoning_kind
+    span.set_attribute("verification.status", verification_status.value)
+    record_verification_span_attrs(span, result.thoughts)
+    return result
+```
+
+**Only three verdicts**: `complete` / `terminate` / `continue_step`. No `success_partial`, no `retry_silent`. This is deliberate: the decision is forced into a small closed set with no intermediate outcomes.
+
+`check_user_goal_complete` (lines 2736+) wraps `complete_verify` and converts its result into a `CompleteAction` or `TerminateAction` for the orchestrator.
+
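The verdict precedence in step 4 of the excerpt can be sketched in isolation. This is a minimal illustration, not Skyvern's code: `VerifyResult` is a hypothetical stand-in for `CompleteVerifyResult`, and the string values of the enum are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class VerificationStatus(str, Enum):
    # Member names follow the excerpt; the string values are assumed.
    complete = "complete"
    terminate = "terminate"
    continue_step = "continue"


@dataclass
class VerifyResult:
    # Hypothetical stand-in for CompleteVerifyResult (only the two flags used here).
    is_complete: bool
    is_terminate: bool = False


def to_status(result: VerifyResult) -> VerificationStatus:
    """Same precedence as the excerpt: complete wins over terminate, else continue."""
    if result.is_complete:
        return VerificationStatus.complete
    if result.is_terminate:
        return VerificationStatus.terminate
    return VerificationStatus.continue_step
```

Because `is_complete` is tested first, a reply that sets both flags resolves to `complete`; anything that sets neither falls through to `continue_step`.
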
+### 2.2. The `check-user-goal.j2` prompt (verbatim, fetched directly from the repo)
+
+Source : <https://raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/check-user-goal.j2>
+
+```jinja
+Your are here to help the user determine if the user has completed their goal on the web{{ " according to the complete criterion" if complete_criterion else "" }}. Use the content of the elements parsed from the page,{{ "" if without_screenshots else " the screenshots of the page," }} the user goal and user details to determine whether the {{ "complete criterion has been met" if complete_criterion else "user goal has been completed" }} or not.
+
+Make sure to ONLY return the JSON object in this format with no additional text before or after it:
+{
+  "page_info": str, // Think step by step. Describe all the useful information in the page related to the user goal.
+  "thoughts": str, // Think step by step. What information makes you believe whether user goal has completed or not. Use information you see on the site to explain.
+  "user_goal_achieved": bool // True if the user goal has been completed, false otherwise.
+}
+
+User Goal:
+{{ navigation_goal }}
+
+User Details:
+{{ navigation_payload }}
+
+Action History:
+{{ action_history }}
+
+Elements on the page:
+{{ elements }}
+
+Current datetime, ISO format:
+{{ local_datetime }}
+```
+
+**Key points**:
+- Strict JSON output, parsed by Pydantic `CompleteVerifyResult.model_validate`.
+- Three inputs are given to the model: (a) screenshots, (b) elements parsed from the DOM, (c) a textual action_history. Multi-modal.
+- The `page_info` → `thoughts` → `user_goal_achieved` ordering imposes a structured chain of thought. That is what makes errors diagnosable.
+
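To make the "strict JSON" point concrete, here is a minimal sketch of a parser that rejects any reply deviating from the three-field format of the prompt. It uses only the standard library rather than Pydantic, and `GoalCheck` and `parse_goal_check` are hypothetical names, not part of Skyvern.

```python
import json
from dataclasses import dataclass


@dataclass
class GoalCheck:
    page_info: str
    thoughts: str
    user_goal_achieved: bool


def parse_goal_check(raw: str) -> GoalCheck:
    """Parse the judge reply; raise ValueError on anything but the expected object."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = {"page_info", "thoughts", "user_goal_achieved"} - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["user_goal_achieved"], bool):
        # Refuse "true"/"false" strings so the verdict is never guessed.
        raise ValueError("user_goal_achieved must be a JSON boolean")
    return GoalCheck(str(data["page_info"]), str(data["thoughts"]), data["user_goal_achieved"])
```

Failing loudly on a malformed reply keeps a parsing problem from being silently read as "goal not achieved".
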
+### 2.3. The `check-user-goal-with-termination.j2` prompt (experimental, verbatim)
+
+Source : <https://raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/check-user-goal-with-termination.j2>
+
+Adds an explicit third status, `terminate`, plus a **failure classification** into 12 categories:
+
+```jinja
+"status": str, // Must be one of three values: "complete", "terminate", or "continue".
+"failure_categories": array // Only populate when status is "terminate". Classify the root cause.
+[{
+    "category": str, // ANTI_BOT_DETECTION | BROWSER_ERROR | NAVIGATION_FAILURE |
+                    // PAGE_LOAD_TIMEOUT | AUTH_FAILURE | LLM_REASONING_ERROR |
+                    // CREDENTIAL_ERROR | ELEMENT_NOT_FOUND | WRONG_PAGE_STATE |
+                    // DATA_EXTRACTION_FAILURE | INFRASTRUCTURE_ERROR | UNKNOWN
+    "confidence_float": float,
+    "reasoning": str
+}]
+
+Important: Think carefully about the difference between "terminate" and "continue":
+- "terminate" = impossible to achieve, stop trying
+- "continue" = not done yet, but achievable with more steps
+```
+
+**Takeaway**: Skyvern is very conservative about `terminate` ("only when CLEAR, EXPLICIT, UNAMBIGUOUS evidence"). This matches Dom's `feedback_failure_is_learning.md` feedback: a failure is not a stop with an error, it is a supervised pause.
+
+### 2.4. When the Validator fires
+
+Excerpt from `agent.py:1929-1971`:
+
+```python
+enable_parallel_verification = False
+if (
+    not has_decisive_action          # the Actor has not already emitted a COMPLETE
+    and not task_completes_on_download
+    and not isinstance(task_block, ActionBlock)
+    and complete_verification        # global flag, can be enabled per task
+    and (task.navigation_goal or task.complete_criterion)
+):
+    # Controlled by a PostHog feature flag
+    disable_user_goal_check = await app.EXPERIMENTATION_PROVIDER.is_feature_enabled_cached(
+        "DISABLE_USER_GOAL_CHECK",
+        task.task_id,
+        ...
+    )
+    enable_parallel_verification = not disable_user_goal_check
+```
+
+→ **The Validator runs at EVERY step** by default ("deferred to handle_completed_step"). It can be disabled per task or globally, but the default state is ON. Skyvern accepts the per-step LLM cost because a false success makes the agent unusable.
+
+### 2.5. Data contract
+
+```python
+# skyvern/forge/sdk/schemas/tasks.py (inferred from the agent.py code)
+class CompleteVerifyResult(BaseModel):
+    page_info: str
+    thoughts: str
+    is_complete: bool
+    is_terminate: bool = False
+    status: str | None = None       # "complete" | "terminate" | "continue"
+    failure_categories: list[FailureCategory] = []
+```
+
+### 2.6. Latency and cost
+
+According to the Skyvern 2.0 post (<https://www.skyvern.com/blog/skyvern-2-0-state-of-the-art-web-navigation-with-85-8-on-webvoyager-eval/>):
+
+- An average step takes 2-10 s.
+- Validator = a separate LLM call (often GPT-4o-mini or Claude Haiku), 1-3 s.
+- Payoff: without the Validator, WebVoyager accuracy is 68.7%; with it, **85.85%**. That gain of about 17 points (85.85 − 68.7 = 17.15) more than justifies the latency.
+
+Source: <https://browser-use.com/posts/our-browser-agent-evaluation-system> (browser-use likewise reports +17 points: 45 → 68.7 → 85.85 depending on Planner/Validator).
+
+---
+
+## 3. Survey of the Validator in 5 other frameworks
+
+### 3.1. OpenAdapt — Evaluation-Driven Feedback
+
+Source : <https://github.com/OpenAdaptAI/OpenAdapt/wiki/OpenAdapt-Architecture-(draft)>, <https://github.com/OpenAdaptAI/openadapt-evals>.
+
+OpenAdapt formalizes the concept at the **Process Graph** level (a graph of steps whose edges are completion criteria):
+
+- **Code-based validation**: an LLM generates Python that checks a state condition (presence of a confirmation message, state of a button, etc.). The code is stored and re-executed on every replay.
+- **Model-based validation**: an LMM (Large Multimodal Model) receives the current screenshot + `completion_criteria` phrased in natural language → bool.
+
+Distinctive feature: if validation fails, OpenAdapt automatically **switches to recording mode** → the user demonstrates what comes next → the trace becomes training data. This is "Evaluation-Driven Feedback". The `openadapt-evals` sub-package exposes `evaluate_agent_on_benchmark`.
+
+### 3.2. browser-use — agentic judge
+
+Source : <https://browser-use.com/posts/our-browser-agent-evaluation-system>, <https://github.com/browser-use/browser-use>.
+
+- LLM judge built into the agent code; it **runs after `done`** AND "can also double as a real-time validation layer during regular use".
+- Model: `gemini-2.5-flash`. Judge accuracy against human labels: 87%.
+- Strict JSON output:
+
+```json
+{
+  "reasoning": "Analysis covering what worked, failures, trajectory quality, tool usage, output quality",
+  "verdict": "true|false",
+  "failure_reason": "Max 5 sentences explanation if failed",
+  "impossible_task": "true|false",
+  "reached_captcha": "true|false"
+}
+```
+
+- Philosophy: **simple prompts and absolute True/False verdicts work best**. Complex rubrics → indecisive judging.
+
+### 3.3. Anthropic Computer Use
+
+Source : <https://docs.anthropic.com/en/docs/build-with-claude/computer-use>.
+
+Anthropic CU has no named Validator. It is a minimal loop: `screenshot → action → screenshot → ...` until Claude itself decides it is done. **Validation = implicit self-reflection by the model within its reasoning**.
+
+→ Acceptable because Claude is capable. **Not applicable to rpa_vision_v3**, where the Actor is not an agentic LLM but a deterministic executor (Léa). An external Validator is required.
+
+### 3.4. OpenAI Operator / CUA
+
+Source : <https://openai.com/index/operator-system-card/>.
+
+Same as Anthropic CU: no separate Validator. The CUA model loops through perception → reasoning → action. According to the system card: "If it encounters challenges or makes mistakes, Operator can leverage its reasoning capabilities to self-correct". Not formalized.
+
+OpenCUA (open-source, <https://opencua.xlang.ai/>) trains with "reflective Chain-of-Thought reasoning" but has no external check.
+
+### 3.5. Cradle (BAAI, Kunlun Tech) — Self-Reflection module
+
+Source : <https://github.com/BAAI-Agents/Cradle>, <https://arxiv.org/pdf/2403.03186>.
+
+Cradle explicitly decomposes the agent into 6 modules, including **Self-Reflection**:
+> « Through this module, the agent assesses previous actions to understand their outcomes, evaluate successes or failures, and adjust behavior accordingly. »
+
+Measured gain: +20.41 points on "professional domain" tasks vs baselines. But it is a game/application agent, not declarative RPA, so it transfers less directly.
+
+### 3.6. Summary table
+
+| Framework | Named Validator? | Modality | Model | Latency | Verdict format |
+|---|---|---|---|---|---|
+| Skyvern 2.0 | **Yes** (`complete_verify`) | VLM + pruned DOM | GPT-4o or dedicated handler | 1-3 s | JSON `is_complete/is_terminate/status` |
+| OpenAdapt | Yes (Process Graph) | LMM or generated Python | Configurable | n/a | bool + falls back to recording |
+| browser-use | Yes (agentic judge) | VLM + DOM | gemini-2.5-flash | 1-2 s | JSON `verdict/failure_reason` |
+| Anthropic CU | No (implicit) | Self-reflection | Claude itself | included | free-form continuation |
+| OpenAI Operator | No (implicit) | Self-reflection | CUA | included | free-form continuation |
+| Cradle | Yes (Self-Reflection) | LMM | GPT-4V | 2-5 s | text reasoning |
+
+**Strong convergence**: the 3 mature RPA frameworks (Skyvern, OpenAdapt, browser-use) have an **explicit, JSON-strict, multi-modal (VLM + DOM structure)** Validator. The generalist agents (CU, Operator) delegate to the agentic LLM. rpa_vision_v3, with its deterministic Actor, belongs in the Skyvern camp.
+
+---
+
+## 4. Taxonomy of post-action validation approaches
+
+| Approach | Cost | Accuracy | False positives | When to use |
+|---|---|---|---|---|
+| **A. LLM-as-judge (full VLM)** | 1-5 s | Very high (semantic) | Low | Final validation of a step / ambiguous cases |
+| **B. OCR ROI** (expected text around the click) | 80-200 ms | High if the text is known | Sensitive to OCR errors | Tabs, buttons, labels |
+| **C. OCR title-bar** (window title) | ~120 ms (already wired) | Medium | OCR noise on small crops | Window navigation / app launch |
+| **D. Visual diff pHash global** | 10 ms | Very low (only "something moved") | Very high | `nothing-happened` pre-filter |
+| **E. Visual diff pHash ROI** | 20 ms | Medium | Medium | Tab focus detection (underline change) |
+| **F. CLIP features cos-sim** | 50-200 ms | Medium | Confuses visually similar screens | Recognizing a known screen |
+| **G. DINOv2 features** | 100-300 ms | High (self-supervised, more robust than CLIP) | Low | Comparing precise patches |
+| **H. LPIPS** | 100 ms | High (perceptual) | Medium | Checks after animations / transitions |
+| **I. Window-focus check** (win32 API or OCR titlebar) | <50 ms | Very high | Almost none | Check that the right app is in front |
+| **J. Dialog presence detect** | OCR + template | Very high | Low | Detecting blocking popups |
+| **K. JSON schema validation** (extraction) | <10 ms | Deterministic | None | `extract_text`, `t2a_decision` |
+
+**Visual diff source**: <https://wopee.io/blog/screenshot-comparison-algorithms-visual-testing/>. There, pHash is positioned as "pre-filter, not a comparator", and VLMs as "triage layer on top of pixel diffs, not as the comparator itself". This is exactly the pixel→semantic design already wired into `replay_verifier.verify_with_critic`.
+
+**For DINOv2 / LPIPS / CLIP**: sources <https://github.com/facebookresearch/dinov2>, <https://medium.com/aimonks/clip-vs-dinov2-in-image-similarity-6fa5aa7ed8c6>. DINOv2 produces more discriminative visual features than CLIP for comparing two UI crops (CLIP is trained for text↔image alignment, not for pixel-level fidelity).
+
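The "pre-filter, not a comparator" role of approach D can be illustrated with a tiny hash comparison. The sketch below swaps in an average hash (aHash), a simpler relative of pHash, so that it needs no image library; it assumes both frames have already been downscaled to the same small grayscale grid, and all names are hypothetical.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[int]]  # rows of 0-255 luminance values, e.g. an 8x8 thumbnail


def average_hash(pixels: Gray) -> List[int]:
    """One bit per pixel: 1 if brighter than the mean of the thumbnail."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits between two equal-length hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nothing_happened(before: Gray, after: Gray, max_bits: int = 0) -> bool:
    """Pre-filter only: True means the screen looks unchanged, so the action likely had no effect."""
    return hamming(average_hash(before), average_hash(after)) <= max_bits
```

A `False` result only says that something moved on screen. It says nothing about whether the right thing moved, which is why the table rates this approach as a pre-filter and leaves the semantic decision to the other checks.
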
+---
+
+## 5. Matrix: action type → recommended check for rpa_vision_v3
+
+Aligned with `reference_vwb_action_types.md` (memory) and `_ALLOWED_ACTION_TYPES` in `replay_engine.py`.
+
+| VWB action (Léa) | Primary check | Secondary check (if primary is ambiguous) | Latency budget |
+|---|---|---|---|
+| `click_anchor` → `click` | **B. OCR ROI** (60 px radius) + **I. Window focus** | A. LLM-as-judge if OCR does not find the label | 100 ms + 2 s on escalation |
+| `double_click_anchor` → `click button="double"` | **C. OCR title-bar** (already wired) + **B. OCR ROI** | A. LLM-as-judge | 200 ms + 2 s |
+| `right_click_anchor` → `click button="right"` | **J. Dialog presence** (context menu expected) | B. OCR ROI on the menu | 150 ms |
+| `type_text` → `type` | **B. OCR ROI**: is the typed text visible in the ROI? | A. LLM-as-judge if the text is truncated | 100 ms |
+| `type_secret` | **E. pHash ROI** (check that an input was filled, not its content) | n/a | 20 ms |
+| `keyboard_shortcut` → `key_combo` | **C. OCR title-bar** OR **J. Dialog presence** depending on the shortcut | A. LLM-as-judge when in doubt | 200 ms |
+| `scroll_to_anchor` → `scroll` | **F. CLIP cos-sim** before/after, target ROI visible | D. pHash global change ≠ 0 | 100 ms |
+| `wait_for_anchor` → `wait` | **B. OCR ROI**: is the anchor visible? | A. LLM-as-judge | 100 ms |
+| `extract_text` | **K. JSON schema**: type str, length > 0, French-language ratio | A. LLM-as-judge on content plausibility | 10 ms + 2 s if plausibility is required |
+| `extract_text_scroll` | K + **A. LLM-as-judge** if several pages | n/a | 10 ms + 2 s |
+| `extract_table` | **K. JSON schema**: ≥ 1 row, expected headers if provided | A. LLM-as-judge | 10 ms |
+| `screenshot_evidence` | n/a (passive action) | I. Window focus | <50 ms |
+| `t2a_decision` | **K. JSON schema** strict (decision ∈ {UHCD, FORFAIT, NA}, parseable JSON) | n/a | 10 ms |
+| `pause_for_human` | **QW4 checklist** (already done, `SafetyChecksProvider`) | n/a | n/a |
+| `db_save_data` | **K. Schema of the saved row** (SELECT verify) | n/a | <50 ms |
+| `import_excel`, `db_read_data` | **K. Schema rows** | n/a | <50 ms |
+| `visual_condition` | **A. LLM-as-judge** on the stated condition | n/a | 2 s |
+| `ai_ocr`, `ai_summarize`, etc. | **K. JSON schema** + **A. plausibility** | n/a | 10 ms + 2 s |
+
+**Guiding principle**: most actions have a cheap check (OCR ROI, JSON) that suffices in 90% of cases. The LLM-as-judge (2 s) fires only on escalation, or on high-risk actions (`click_anchor` on ambiguous targets, `t2a_decision`, `visual_condition`).
+
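The guiding principle (cheap checks first, LLM-as-judge only on escalation) can be sketched as a small cascade. This is a simplified, self-contained illustration under assumed conventions: each check is a zero-argument callable returning a `(verdict, confidence)` pair, and the 0.5 threshold is only an example value.

```python
from typing import Callable, List, Optional, Tuple

Check = Callable[[], Tuple[str, float]]  # returns (verdict, confidence)


def run_cascade(
    cheap_checks: List[Check],
    judge: Optional[Check],
    threshold: float = 0.5,
) -> Tuple[str, float, bool]:
    """Return (verdict, confidence, escalated)."""
    last: Tuple[str, float] = ("continue", 0.0)
    for check in cheap_checks:
        last = check()
        verdict, confidence = last
        # A confident, non-"continue" verdict ends the cascade early.
        if confidence >= threshold and verdict != "continue":
            return verdict, confidence, False
    # Pay for the expensive judge only when the cheap checks stayed below the threshold.
    if judge is not None and last[1] < threshold:
        verdict, confidence = judge()
        return verdict, confidence, True
    return last[0], last[1], False
```

With this shape, the multi-second cost of the judge is paid only on the minority of steps where the millisecond-scale checks cannot decide.
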
+---
+
+## 6. Design of a pluggable Validator: copy-paste-ready code
+
+### 6.1. Interface
+
+To be placed in `agent_v0/server_v1/validator.py` (new file, complements the existing `replay_verifier.py`):
+
+```python
+# agent_v0/server_v1/validator.py
+"""
+Validator: pluggable post-action semantic verification.
+
+Inspired by Skyvern (Planner-Actor-Validator). Combines the existing pixel diff
+(replay_verifier.py) with a semantic layer typed by action_type.
+
+Three possible verdicts, modeled on Skyvern:
+- COMPLETE  → the action had the intended effect, move to the next step
+- CONTINUE  → the effect is not visible yet, re-check after a wait
+- TERMINATE → the action failed unrecoverably (supervised pause)
+"""
+from __future__ import annotations
+import logging
+from dataclasses import dataclass, field
+from enum import Enum
+from typing import Any, Callable, Dict, Optional, Protocol
+
+logger = logging.getLogger(__name__)
+
+
+class Verdict(str, Enum):
+    COMPLETE = "complete"
+    CONTINUE = "continue"
+    TERMINATE = "terminate"
+
+
+class FailureCategory(str, Enum):
+    WRONG_TARGET = "wrong_target"           # clicked elsewhere (e.g. step 10 bug)
+    NO_VISUAL_CHANGE = "no_visual_change"   # action had no effect
+    UNEXPECTED_DIALOG = "unexpected_dialog" # blocking popup
+    WRONG_APPLICATION = "wrong_application" # focus on the wrong app (Edge vs Easily)
+    OCR_TEXT_MISSING = "ocr_text_missing"   # expected text absent
+    SCHEMA_INVALID = "schema_invalid"       # invalid JSON/extract
+    UNKNOWN = "unknown"
+
+
+@dataclass
+class ValidationResult:
+    verdict: Verdict
+    confidence: float                       # 0.0-1.0
+    check_used: str                         # "ocr_roi" | "llm_judge" | "title_bar" | ...
+    elapsed_ms: float
+    reasoning: str = ""
+    failure_category: Optional[FailureCategory] = None
+    raw_evidence: Dict[str, Any] = field(default_factory=dict)
+
+    def to_dict(self) -> Dict[str, Any]:
+        return {
+            "verdict": self.verdict.value,
+            "confidence": round(self.confidence, 3),
+            "check_used": self.check_used,
+            "elapsed_ms": round(self.elapsed_ms, 1),
+            "reasoning": self.reasoning,
+            "failure_category": self.failure_category.value if self.failure_category else None,
+            "raw_evidence": self.raw_evidence,
+        }
+
+
+class ActionChecker(Protocol):
+    """Contract for a checker dedicated to one action_type."""
+    name: str
+    budget_ms: float
+
+    def check(
+        self,
+        action: Dict[str, Any],
+        result: Dict[str, Any],
+        screenshot_before: Optional[str],
+        screenshot_after: Optional[str],
+        context: Dict[str, Any],
+    ) -> ValidationResult: ...
+
+
+class Validator:
+    """Orchestrator: routes action_type → checker and handles escalation."""
+
+    def __init__(
+        self,
+        checkers: Dict[str, list[ActionChecker]],
+        default_checker: ActionChecker,
+        escalation_checker: Optional[ActionChecker] = None,
+        escalation_threshold: float = 0.5,
+    ):
+        """
+        checkers: mapping action_type → list of checkers to try in order.
+        default_checker: fallback when action_type is not in the mapping.
+        escalation_checker: typically an LLM-as-judge, run when confidence < threshold.
+        """
+        self._checkers = checkers
+        self._default = default_checker
+        self._escalation = escalation_checker
+        self._escalation_threshold = escalation_threshold
+
+    def validate(
+        self,
+        action: Dict[str, Any],
+        result: Dict[str, Any],
+        screenshot_before: Optional[str] = None,
+        screenshot_after: Optional[str] = None,
+        context: Optional[Dict[str, Any]] = None,
+    ) -> ValidationResult:
+        context = context or {}
+        action_type = action.get("type", "")
+
+        candidates = self._checkers.get(action_type, [self._default])
+
+        last_result: Optional[ValidationResult] = None
+        for checker in candidates:
+            res = checker.check(action, result, screenshot_before, screenshot_after, context)
+            last_result = res
+            # Clear verdict + high confidence → return immediately
+            if res.confidence >= self._escalation_threshold and res.verdict != Verdict.CONTINUE:
+                return res
+
+        # Escalate to the LLM-as-judge if one is provided
+        if self._escalation and last_result and last_result.confidence < self._escalation_threshold:
+            logger.info(
+                "Validator escalation LLM-judge (last_conf=%.2f, check=%s)",
+                last_result.confidence, last_result.check_used,
+            )
+            esc = self._escalation.check(action, result, screenshot_before, screenshot_after, context)
+            # No merging here: the LLM-judge verdict simply replaces the cheap checker's result
+            return esc
+
+        return last_result or ValidationResult(
+            verdict=Verdict.CONTINUE,
+            confidence=0.3,
+            check_used="no_checker",
+            elapsed_ms=0.0,
+            reasoning="Aucun checker n'a produit de verdict",
+        )
+```
+
+### 6.2. Example checker: `OcrRoiChecker` (for click_anchor)
+
+```python
+# agent_v0/server_v1/checkers/ocr_roi.py
+import time
+from typing import Any, Dict, Optional
+from PIL import Image
+
+from agent_v0.server_v1.validator import (
+    ActionChecker, ValidationResult, Verdict, FailureCategory,
+)
+
+
+class OcrRoiChecker:
+    """Checks that the expected text appears in the ROI around the click.
+
+    Designed specifically to resolve the step 10 bug:
+    if 'Imagerie' was clicked, the 60 px ROI must contain 'Imagerie'.
+    If it contains 'Edge' or 'urgence.labs.laurinebazin.design',
+    the click landed in the browser bar → failure.
+    """
+    name = "ocr_roi"
+    budget_ms = 200.0
+
+    # Suspect tokens = the click landed outside the app.
+    # ".design" and ".html" are added so that the demo URL named in the docstring is
+    # actually caught (assumption: this list is tuned per deployment).
+    SUSPECT_TOKENS = {"edge", "chrome", "firefox", "http", "https", ".com", ".fr",
+                      ".design", ".html",
+                      "favoris", "favorite", "onglet", "tab "}
+
+    def __init__(self, ocr_fn, radius_px: int = 60):
+        self._ocr = ocr_fn  # callable(PIL.Image) -> str
+        self._radius = radius_px
+
+    def check(self, action, result, screenshot_before, screenshot_after, context) -> ValidationResult:
+        t0 = time.time()
+        expected_text = action.get("by_text") or context.get("expected_text", "")
+        x_pct = action.get("x_pct")
+        y_pct = action.get("y_pct")
+
+        if not screenshot_after or x_pct is None or y_pct is None:
+            return ValidationResult(
+                verdict=Verdict.CONTINUE, confidence=0.2,
+                check_used=self.name, elapsed_ms=(time.time() - t0) * 1000,
+                reasoning="ROI indéfinie (pas de coords ou pas de screenshot)",
+            )
+
+        img = self._load_image(screenshot_after)
+        w, h = img.size
+        cx, cy = int(x_pct * w), int(y_pct * h)
+        r = self._radius
+        roi = img.crop((max(0, cx - r), max(0, cy - r), min(w, cx + r), min(h, cy + r)))
+
+        text = (self._ocr(roi) or "").lower()
+        expected_lower = expected_text.lower().strip()
+
+        elapsed_ms = (time.time() - t0) * 1000
+
+        # 1) Check: a suspect (browser) token in the ROI → misplaced click
+        for suspect in self.SUSPECT_TOKENS:
+            if suspect in text and suspect not in expected_lower:
+                return ValidationResult(
+                    verdict=Verdict.TERMINATE, confidence=0.85,
+                    check_used=self.name, elapsed_ms=elapsed_ms,
+                    failure_category=FailureCategory.WRONG_APPLICATION,
+                    reasoning=f"Token navigateur '{suspect}' dans ROI clic — cible probablement hors-app",
+                    raw_evidence={"roi_text": text[:200], "expected": expected_lower},
+                )
+
+        # 2) Check: is the expected text in the ROI?
+        if expected_lower and expected_lower in text:
+            return ValidationResult(
+                verdict=Verdict.COMPLETE, confidence=0.9,
+                check_used=self.name, elapsed_ms=elapsed_ms,
+                reasoning=f"Texte '{expected_lower[:40]}' trouvé dans ROI",
+                raw_evidence={"roi_text": text[:200]},
+            )
+
+        # 3) Not found but not suspect either → low confidence, escalation
+        return ValidationResult(
+            verdict=Verdict.CONTINUE, confidence=0.4,
+            check_used=self.name, elapsed_ms=elapsed_ms,
+            failure_category=FailureCategory.OCR_TEXT_MISSING,
+            reasoning=f"Texte '{expected_lower[:40]}' non trouvé dans ROI",
+            raw_evidence={"roi_text": text[:200]},
+        )
+
+    @staticmethod
+    def _load_image(source: str) -> Image.Image:
+        # Delegated to replay_verifier._load_single_image, or an equivalent copy-paste
+        from agent_v0.server_v1.replay_verifier import ReplayVerifier
+        return ReplayVerifier()._load_single_image(source)
+```
+
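The ROI geometry used by `OcrRoiChecker.check` is easy to get wrong near screen edges, so it is worth isolating. The helper below is a hypothetical extraction of the same arithmetic (percent coordinates to a clamped pixel box); the 1920×1080 screen size in the usage note is an assumption.

```python
from typing import Tuple


def roi_box(
    x_pct: float, y_pct: float, width: int, height: int, radius: int = 60
) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the square ROI, clamped to the image."""
    if not (0.0 <= x_pct <= 1.0 and 0.0 <= y_pct <= 1.0):
        raise ValueError("x_pct and y_pct must be within [0, 1]")
    cx, cy = int(x_pct * width), int(y_pct * height)
    return (
        max(0, cx - radius),
        max(0, cy - radius),
        min(width, cx + radius),
        min(height, cy + radius),
    )
```

On an assumed 1920×1080 capture, `roi_box(0.23, 0.28, 1920, 1080)` gives `(381, 242, 501, 362)`, a 120×120 px crop. A click in a corner produces a smaller crop because the box is clamped rather than shifted.
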
+### 6.3. Integration with the existing `replay_verifier.py`
+
+`replay_verifier.verify_with_critic` already covers 80% of the LLM-as-judge need (VLM semantic stage). All that remains is to:
+
+1. Wrap it in an `LlmJudgeChecker` that implements `ActionChecker`.
+2. Use it as the `escalation_checker` of the `Validator`.
+
+```python
+# agent_v0/server_v1/checkers/llm_judge.py
+import time
+from agent_v0.server_v1.replay_verifier import ReplayVerifier
+from agent_v0.server_v1.validator import (
+    ActionChecker, ValidationResult, Verdict, FailureCategory,
+)
+
+class LlmJudgeChecker:
+    """Wrapper around ReplayVerifier.verify_with_critic (VLM gemma4)."""
+    name = "llm_judge"
+    budget_ms = 3000.0
+
+    def __init__(self, verifier: ReplayVerifier):
+        self._verifier = verifier
+
+    def check(self, action, result, screenshot_before, screenshot_after, context) -> ValidationResult:
+        t0 = time.time()
+        expected = context.get("expected_result", "")
+        intention = context.get("action_intention", "")
+        workflow_ctx = context.get("workflow_context", "")
+
+        critic = self._verifier.verify_with_critic(
+            action=action, result=result,
+            screenshot_before=screenshot_before,
+            screenshot_after=screenshot_after,
+            expected_result=expected,
+            action_intention=intention,
+            workflow_context=workflow_ctx,
+        )
+        elapsed_ms = (time.time() - t0) * 1000
+
+        if critic.semantic_verified is True:
+            verdict = Verdict.COMPLETE
+            conf = max(critic.confidence, 0.7)
+        elif critic.semantic_verified is False:
+            verdict = Verdict.TERMINATE
+            conf = 0.8
+        else:
+            verdict = Verdict.CONTINUE
+            conf = 0.4
+
+        return ValidationResult(
+            verdict=verdict, confidence=conf,
+            check_used=self.name, elapsed_ms=elapsed_ms,
+            reasoning=critic.semantic_detail or critic.detail,
+            raw_evidence={"pixel_change_pct": critic.change_area_pct,
+                          "semantic_verified": critic.semantic_verified},
+        )
+```
+
+### 6.4. Wiring into `api_stream.py` (post-action)
+
+Pseudo-diff (DO NOT apply, shown only to indicate the insertion point):
+
+```python
+# agent_v0/server_v1/api_stream.py (REPORT handler)
+from agent_v0.server_v1.validator import Validator, Verdict
+from agent_v0.server_v1.checkers.ocr_roi import OcrRoiChecker
+from agent_v0.server_v1.checkers.llm_judge import LlmJudgeChecker
+
+# Init at boot
+_validator = Validator(
+    checkers={
+        "click": [OcrRoiChecker(ocr_fn=_easyocr_fn)],
+        "type": [OcrRoiChecker(ocr_fn=_easyocr_fn)],
+        "key_combo": [TitleBarChecker()],  # see core/grounding/title_verifier.py
+        # ...
+    },
+    default_checker=PixelDiffChecker(),  # wrapper around ReplayVerifier.verify_action
+    escalation_checker=LlmJudgeChecker(ReplayVerifier()),
+    escalation_threshold=0.55,
+)
+
+# In report_action_result, after the current pixel diff
+async def report_action_result(payload):
+    ...
+    if RPA_VALIDATOR_ENABLED:  # kill switch (env var)
+        val = _validator.validate(
+            action=action, result=result,
+            screenshot_before=before, screenshot_after=after,
+            context={"expected_text": action.get("by_text"),
+                     "expected_result": step.get("expected_result", ""),
+                     "action_intention": step.get("label", ""),
+                     "workflow_context": f"step {step_idx}/{total_steps}"},
+        )
+        if val.verdict == Verdict.TERMINATE:
+            # Pause supervisée, pas stop avec error (cf. feedback_failure_is_learning)
+            _enter_paused_state(reason=val.reasoning, evidence=val.to_dict())
+        elif val.verdict == Verdict.CONTINUE:
+            # Re-vérifier après wait, ou retry
+            _schedule_recheck(action_id, after_ms=1500)
+        # COMPLETE → continue normalement
+```
+
+---
+
+## 7. Application to the step 10 bug of the GHT demo
+
+**Reminder of the bug** (see `REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md`): step 10 is "click the Imagerie tab". OCR-DIRECT returns the centre of the tab row, so the click lands **in the Edge URL bar** (just above). The global pHash sees a change, so REPORT returns success=True. The error then cascades.
+
+**With the proposed Validator**:
+
+1. Action `click_anchor` (`by_text="Imagerie"`, `x_pct=0.23`, `y_pct=0.28`).
+2. Léa reports success after the mouse click. `screenshot_after` is captured.
+3. `Validator.validate(action_type="click", ...)` routes to `OcrRoiChecker`.
+4. The 60 px ROI around (0.23, 0.28) actually covers the URL bar.
+5. EasyOCR on the crop returns text such as `urgence.labs.laurinebazin.design/aiva-urgence/dossier.html#imagerie`.
+6. A browser token is detected → **`Verdict.TERMINATE`** with `FailureCategory.WRONG_APPLICATION`. The sample text above contains neither `.com` nor `https`, so the suspect-token list must also match URL-shaped strings (a domain followed by a path, `.html`). This test should take precedence over the expected-text match, because the fragment `#imagerie` would otherwise satisfy `by_text="Imagerie"`.
+7. Reasoning: "Browser token (URL) in click ROI: target probably outside the app".
+8. `api_stream` enters supervised pause with `evidence={roi_text, expected}`. Dom sees in the dashboard what went wrong. No chaining to step 11.
+
+**Added latency**: 100-200 ms (EasyOCR on a 120×120 px crop). **Negligible** compared with the 6 s spent chaining 5 wrong steps and entering supervised pause 33 s later.
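+
+The 120×120 px crop follows directly from the click coordinates and the 60 px radius. The sketch below shows one way to compute it; `roi_box` is a hypothetical helper (not an existing rpa_vision_v3 function) and the 1920×1080 screen size is an assumption used only for illustration.
+
+```python
+def roi_box(x_pct, y_pct, width, height, radius_px=60):
+    """Return (left, top, right, bottom) of a square ROI clamped to the screen."""
+    cx = round(x_pct * width)   # click centre in pixels
+    cy = round(y_pct * height)
+    left = max(0, cx - radius_px)
+    top = max(0, cy - radius_px)
+    right = min(width, cx + radius_px)
+    bottom = min(height, cy + radius_px)
+    return left, top, right, bottom
+
+
+# Step 10 coordinates on an assumed 1920x1080 screen
+box = roi_box(0.23, 0.28, 1920, 1080)
+print(box, box[2] - box[0], box[3] - box[1])
+```
+
+The tuple uses the same (left, upper, right, lower) order as Pillow's `Image.crop`, so it can be passed to it unchanged before running OCR on the crop.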
+
+**Beneficial side effect**: the same mechanism catches:
+- Clicks on Windows popups (Hello / UAC): the ROI contains "Sécurité Windows" → TERMINATE.
+- Clicks on the Start menu or the taskbar.
+- Any click that lands in an unexpected system area.
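+
+The ROI text classification described in this section can be sketched as follows. This is a minimal illustration, not the `OcrRoiChecker` of §6.2: `classify_roi_text`, the pattern list, and the plain-string verdicts (standing in for the `Verdict` enum) are all assumptions. Suspect patterns are tested before the expected text, since the URL fragment `#imagerie` contains the expected word.
+
+```python
+import re
+
+# Hypothetical suspect patterns: URL-shaped strings and system UI labels.
+SUSPECT_PATTERNS = [
+    re.compile(r"https?://", re.IGNORECASE),
+    # A domain (dot-separated alphabetic labels) followed by a path
+    re.compile(r"\b[\w-]+(?:\.[a-z-]{2,})+/\S*", re.IGNORECASE),
+    re.compile(r"sécurité windows", re.IGNORECASE),
+]
+
+
+def classify_roi_text(roi_text, expected_text):
+    """Return "TERMINATE", "COMPLETE" or "CONTINUE" for the OCR text of a click ROI."""
+    # Suspect tokens win over any expected-text match.
+    for pattern in SUSPECT_PATTERNS:
+        if pattern.search(roi_text):
+            return "TERMINATE"
+    if expected_text and expected_text.lower() in roi_text.lower():
+        return "COMPLETE"
+    return "CONTINUE"
+
+
+url_text = "urgence.labs.laurinebazin.design/aiva-urgence/dossier.html#imagerie"
+print(classify_roi_text(url_text, "Imagerie"))
+print(classify_roi_text("Biologie Imagerie Courriers", "Imagerie"))
+```
+
+The domain pattern requires alphabetic labels of at least two characters so that numeric values such as `12.5/20` are not flagged; the real list would need tuning against GHT fixtures.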
+
+---
+
+## 8. Latency budget per check: what is acceptable in a demo?
+
+GHT demo assumption (40 steps, target pipeline of 2 min):
+
+| Check | Latency | × 40 steps | Acceptable for demo? |
+|---|---|---|---|
+| Global pixel diff (existing) | 10 ms | 0.4 s | ✅ ON by default |
+| OCR ROI EasyOCR | 100-200 ms | 4-8 s | ✅ ON for `click`, `type` |
+| OCR title-bar (existing) | 120 ms | 4.8 s | ✅ ON for navigation |
+| Schema validation (JSON) | <10 ms | <0.4 s | ✅ ON for `extract_*`, `t2a_decision` |
+| LLM-judge gemma4 critic | 2-3 s | 80-120 s | ⚠️ ONLY on escalation |
+| LLM-judge cloud (Claude Haiku) | 1-2 s | 40-80 s | ⚠️ ONLY on escalation |
+| DINOv2 features ROI | 150 ms | 6 s | ❓ not needed for the demo |
+
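+**Budget recommendation**:
+
+- Demo: pixel + OCR ROI + title-bar + schema ≈ 10 s of cumulative latency over 40 steps. The column totals add up to 9.6-13.6 s if every check ran on every step; in practice each check runs only on its own action types. Acceptable.
+- LLM-judge escalation triggered at most ~5 times per demo = 10-15 s added with the local gemma4 critic (2-3 s per call). Tolerable if reserved for risky steps (ambiguous clicks on tabs).
+- DINOv2 is out of scope for the demo. To be benchmarked post-demo.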
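+
+The cumulative figures can be checked with a few lines of arithmetic. The sketch below takes the per-check latencies from the table, treats "<10 ms" as 10 ms, and assumes the worst case where every check runs on all 40 steps; the names are illustrative only.
+
+```python
+STEPS = 40
+
+# (min_ms, max_ms) per always-on check, taken from the table above
+CHECKS_MS = {
+    "pixel_diff": (10, 10),
+    "ocr_roi": (100, 200),
+    "title_bar": (120, 120),
+    "schema": (10, 10),  # "<10 ms" rounded up to its bound
+}
+
+
+def cumulative_seconds(checks, steps):
+    """Sum the per-step latency bounds and scale to the whole run, in seconds."""
+    low = sum(lo for lo, _ in checks.values()) * steps / 1000
+    high = sum(hi for _, hi in checks.values()) * steps / 1000
+    return low, high
+
+
+low, high = cumulative_seconds(CHECKS_MS, STEPS)
+# Five escalations to the local LLM judge at 2-3 s each
+escalation = (5 * 2, 5 * 3)
+print(f"always-on checks: {low:.1f}-{high:.1f} s, escalations: {escalation[0]}-{escalation[1]} s")
+```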
+
+**Kill switch** mandatory (see QW Suite Mai, Dom's conventions):
+```bash
+RPA_VALIDATOR_ENABLED=true                # enables the whole layer
+RPA_VALIDATOR_LLM_JUDGE_ENABLED=true      # enables LLM escalation (expensive)
+RPA_VALIDATOR_OCR_ROI_RADIUS=60           # tunable, in px
+RPA_VALIDATOR_ESCALATION_THRESHOLD=0.55
+```
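+
+On the server side these variables could be read once at boot into a typed configuration object. The sketch below is one possible shape: `ValidatorConfig` and `load_validator_config` are hypothetical names, and the default of `false` for the LLM judge is an assumption (only `RPA_VALIDATOR_ENABLED=false` by default is stated in §9.1).
+
+```python
+import os
+from dataclasses import dataclass
+
+
+def _env_bool(env, name, default):
+    """Parse a boolean environment variable, accepting common truthy spellings."""
+    return env.get(name, str(default)).strip().lower() in ("1", "true", "yes", "on")
+
+
+@dataclass(frozen=True)
+class ValidatorConfig:
+    enabled: bool
+    llm_judge_enabled: bool
+    ocr_roi_radius: int
+    escalation_threshold: float
+
+
+def load_validator_config(env=None):
+    """Build the config from a mapping (defaults to os.environ)."""
+    env = os.environ if env is None else env
+    return ValidatorConfig(
+        enabled=_env_bool(env, "RPA_VALIDATOR_ENABLED", False),
+        llm_judge_enabled=_env_bool(env, "RPA_VALIDATOR_LLM_JUDGE_ENABLED", False),
+        ocr_roi_radius=int(env.get("RPA_VALIDATOR_OCR_ROI_RADIUS", "60")),
+        escalation_threshold=float(env.get("RPA_VALIDATOR_ESCALATION_THRESHOLD", "0.55")),
+    )
+
+
+# With no variables set, the whole layer stays off.
+print(load_validator_config({}))
+```
+
+Passing the mapping explicitly keeps the loader testable without touching the process environment.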
+
+---
+
+## 9. Phased integration plan
+
+### 9.1. Short term: 1 day, feasible before the next demo (P0)
+
+**Goal**: eliminate the "click outside the app silently reported as success=True" class of bug.
+
+1. Create `agent_v0/server_v1/validator.py` (skeleton in §6.1): 1 h.
+2. Create `OcrRoiChecker` (§6.2): 2 h.
+3. Wrap `LlmJudgeChecker` around the existing `verify_with_critic` (§6.3): 30 min.
+4. Add the hook in `api_stream.report_action_result` behind `RPA_VALIDATOR_ENABLED`, `false` by default: 2 h.
+5. Tests:
+   - Unit: ROI text matching, suspect tokens, escalation logic: 2 h.
+   - Integration: replay step 10 on a screenshot fixture: 1 h.
+6. Internal demo with `RPA_VALIDATOR_ENABLED=true` on Demo_urgence_3_db, measuring latency and false positives: 1 h.
+
+**Deliverable**: no demo regression when the flag is off; when it is on, the step 10 bug is caught as TERMINATE → supervised pause.
+
+### 9.2. Medium term: 1-2 weeks (P1)
+
+**Goal**: complete action → check matrix (§5).
+
+1. `TitleBarChecker` adapted from the existing `core/grounding/title_verifier.py`: 2 h.
+2. `JsonSchemaChecker` for `extract_text`, `t2a_decision`, `extract_table`: 4 h.
+3. `DialogPresenceChecker` reusing the VM modal cascade (`feedback_phash_vs_dialog_in_vm.md`): 4 h.
+4. `PixelDiffChecker` (wrapper around the existing code) with its verdict mapped to the Verdict contract: 2 h.
+5. Wiring of the full matrix per §5: 4 h.
+6. Dashboard: "Validator stats" panel per session (percentage of COMPLETE / CONTINUE / TERMINATE, top failure_categories): 1 day.
+
+### 9.3. Long term: post-demo (P2)
+
+1. Evaluate **DINOv2** vs OCR ROI on GHT fixtures: which gives the better signal to distinguish "tab activated" from "tab hovered"? Benchmark on 100 steps.
+2. Migrate the LLM judge from gemma4:e4b (local) to a dedicated handler, separating the "T2A decision LLM" from the "judge LLM". Skyvern exposes `USE_CHECK_USER_GOAL_HANDLER_FOR_VERIFICATION`, which already makes this separation.
+3. Learning: record each TERMINATE verdict in `TargetMemoryStore` to produce training data (OpenAdapt pattern: "success traces become new training data").
+4. Re-planning: on repeated TERMINATE, send the information back to the Planner so it can adjust the workflow (cf. Skyvern: "reporting any errors / tweaks back to the Planner so it can make adjustments in real-time"). For rpa_vision_v3: signal to VWB that the anchor is unreliable → suggest a recapture.
+
+---
+
+## 10. Sources with links
+
+### Skyvern (Planner-Actor-Validator)
+
+- Skyvern 2.0 blog: <https://www.skyvern.com/blog/skyvern-2-0-state-of-the-art-web-navigation-with-85-8-on-webvoyager-eval/> (announcement of the Planner-Actor-Validator architecture, WebVoyager score 85.85 %)
+- GitHub repo — <https://github.com/Skyvern-AI/skyvern>
+- `agent.py` (`complete_verify` method): <https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/forge/agent.py>, line 2609 (as of 23 May 2026)
+- Prompt `check-user-goal.j2` — <https://raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/check-user-goal.j2>
+- Prompt `check-user-goal-with-termination.j2` — <https://raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/check-user-goal-with-termination.j2>
+- Prompt `decisive-criterion-validate.j2` — <https://raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/decisive-criterion-validate.j2>
+- Hacker News show — <https://news.ycombinator.com/item?id=42724616>
+
+### browser-use (agentic judge)
+
+- Blog "Our browser agent evaluation system": <https://browser-use.com/posts/our-browser-agent-evaluation-system>
+- AGENTS.md — <https://github.com/browser-use/browser-use/blob/main/AGENTS.md>
+
+### OpenAdapt (Evaluation-Driven Feedback)
+
+- GitHub OpenAdapt — <https://github.com/OpenAdaptAI/OpenAdapt>
+- GitHub openadapt-evals — <https://github.com/OpenAdaptAI/openadapt-evals>
+- Wiki architecture — <https://github.com/OpenAdaptAI/OpenAdapt/wiki/OpenAdapt-Architecture-(draft)>
+
+### Anthropic Computer Use & OpenAI Operator
+
+- Operator system card — <https://openai.com/index/operator-system-card/>
+- OpenCUA (open foundations CUA, xLANG / HKU) — <https://opencua.xlang.ai/>
+- Computer Use 2026 review — <https://tech-insider.org/anthropic-claude-computer-use-agent-2026/>
+
+### Cradle (BAAI)
+
+- Paper arXiv 2403.03186 — <https://arxiv.org/pdf/2403.03186>
+- GitHub — <https://github.com/BAAI-Agents/Cradle>
+- Project page — <https://baai-agents.github.io/Cradle/>
+
+### Visual diff / VLM-as-judge / LLM-as-judge
+
+- "Screenshot Comparison Algorithms": <https://wopee.io/blog/screenshot-comparison-algorithms-visual-testing/> (pHash positioned as a pre-filter, VLM as a triage layer)
+- DINOv2 (Meta) — <https://github.com/facebookresearch/dinov2>
+- CLIP vs DINOv2 image similarity — <https://medium.com/aimonks/clip-vs-dinov2-in-image-similarity-6fa5aa7ed8c6>
+- "Aha Moment Revisited: Are VLMs Truly Capable of Self Verification" (arXiv 2506.17417): <https://arxiv.org/pdf/2506.17417>
+- Vision-Language Model Verifier (review) — <https://www.emergentmind.com/topics/vision-language-model-vlm-verifier>
+- LLM-as-a-Judge guide 2026 — <https://labelyourdata.com/articles/llm-as-a-judge>
+- "Why Success is Lying to You: The 2026 Agent Eval Stack": <https://micheallanham.substack.com/p/why-success-is-lying-to-you-the-2026>
+
+### EDDOps (Evaluation-Driven Development & Operations)
+
+- Paper arXiv 2411.13768 (v3, 2026) — <https://arxiv.org/html/2411.13768v3>
+
+### Internal rpa_vision_v3 docs (referenced)
+
+- `docs/INSPIRATION_FRAMEWORKS_2026-05-10.md` §3.1: Planner-Actor-Validator
+- `docs/REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md`: archetypal step 10 bug
+- `docs/BUG_PRECHECK_SPATIAL_BLINDNESS_2026-05-08.md`: DETTE-001
+- `agent_v0/server_v1/replay_verifier.py`: `verify_with_critic` already wired
+- `core/grounding/title_verifier.py`: TitleVerifier already wired
+- Memory `reference_vwb_action_types.md`: VWB action_types matrix
+
+---
+
+## 11. Dependencies on other axes
+
+- **AXE_A4 (OCR)**: `OcrRoiChecker` relies on fast EasyOCR/docTR. If AXE_A4 delivers an ROI OCR under 100 ms calibrated on small crops, the primary check becomes highly reliable. **Blocking**: OCR quality on 120×120 px crops.
+- **AXE_A5 (screen tokenisation)**: with a UI parser such as OmniParser that returns a list of elements with bbox + label, the ROI check becomes deterministic (match `target == element_at_point(cx, cy).label`). **Strong synergy**: a Validator plus a tokenizer puts us in Skyvern 2.0 territory.
+- **AXE_B4 (ORA)**: ORA can consume the Validator's `ValidationResult` as an observation signal. On TERMINATE, ORA re-observes and proposes a new action. The Validator becomes the Actor's eye.
+- **DETTE-008** (per-click VLM pre-check disabled by `if False:`): this Validator is its cleanly rebuilt version. The current deactivation is justified, but the need remains, and this deliverable addresses it.
+- **DETTE-001** (spatially blind OCR pre-check): `OcrRoiChecker` with `radius_px=60` is exactly Option B mentioned in Dom's note. Reducing the radius and using individual bboxes goes in the same direction.
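+
+The deterministic check mentioned for AXE_A5 can be illustrated with a small sketch. It assumes a parser that returns labelled bounding boxes; `UiElement`, `element_at_point` and `click_hits_target` are hypothetical names, and the coordinates are invented for the example rather than taken from a real GHT capture.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class UiElement:
+    label: str
+    left: int
+    top: int
+    right: int
+    bottom: int
+
+
+def element_at_point(elements, cx, cy):
+    """Return the smallest element whose bbox contains (cx, cy), or None."""
+    hits = [e for e in elements
+            if e.left <= cx < e.right and e.top <= cy < e.bottom]
+    if not hits:
+        return None
+    # Prefer the innermost element when bboxes are nested.
+    return min(hits, key=lambda e: (e.right - e.left) * (e.bottom - e.top))
+
+
+def click_hits_target(elements, cx, cy, target):
+    """True only if the element under the click carries the expected label."""
+    hit = element_at_point(elements, cx, cy)
+    return hit is not None and hit.label == target
+
+
+# Invented layout: a URL bar sitting just above a tab labelled "Imagerie"
+elements = [
+    UiElement("URL bar", 0, 280, 1920, 320),
+    UiElement("Imagerie", 400, 340, 500, 380),
+]
+print(click_hits_target(elements, 442, 302, "Imagerie"))
+print(click_hits_target(elements, 442, 360, "Imagerie"))
+```
+
+With such a lookup the verdict no longer depends on OCR quality over a small crop, only on the accuracy of the parser's bounding boxes.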
+
+---
+
+*Research document, read-only. No implementation decision is taken by this axis; the decision rests with Dom and a reintegration schedule coordinated with AXE_A4, A5, B4.*