AXE B2: Validator deep dive, production-ready implementation
Date: 2026-05-24
Author: dispatched research agent (Claude Opus 4.7 1M)
Status: research deliverable, read-only, NO code changes.
Parent: docs/recherche/AXE_B2_VALIDATOR_PATTERN.md (architecture already laid out, Skyvern verbatim).
Siblings: AXE_A4_OCR_TEMPLATE_PHASH.md (provides OcrRoiChecker and SSIM-ROI).
Scope: take the B2 skeleton and make it paste-ready: complete code, tests, precise wiring, measurable latencies, offline reproduction of the step 10 bug.
1. TL;DR: immediately actionable recommendation
The parent doc (AXE_B2_VALIDATOR_PATTERN.md) laid out the architecture and copied Skyvern verbatim. What is missing: (a) paste-ready Python code for each Checker, (b) the precise wiring into report_action_result (which already calls _replay_verifier.verify_with_critic at api_stream.py:3554), (c) the offline repro of the step 10 bug on an existing PNG, (d) the pytest test proving the bug is closed.
Total effort: 1 person-day to ship the change behind a feature flag, RPA_VALIDATOR_V2_ENABLED=false by default, which:
- Reuses the existing verify_with_critic (already wired, already tested), wrapping unchanged.
- Adds a single primary check, OcrRoiChecker, in front of the current pixel-then-critic pipeline.
- Reroutes the output: if OcrRoiChecker detects a suspect token (https, edge, chrome, .com, .fr), it returns TERMINATE with failure_category=WRONG_APPLICATION instead of continuing.
- Plugs into replay_state["results"] in the same format as the existing verification.to_dict().
The step 10 bug is closed by 80 lines of Python. The rest of the architecture (full taxonomy, verdict dispatcher, separate LlmJudgeChecker) is a P1 improvement: useful, not blocking.
Dependencies:
- AXE_A4 (OCR ROI): OcrRoiChecker reuses the EasyOCR instance already loaded by core/grounding/title_verifier.py:140 (GPU singleton). No init cost.
- AXE_B1 (watchdog _retry_pending): independent of the Validator. The watchdog fixes the primary cause (silent HTTP timeout); the Validator fixes the aggravating cause (off-target click validated with success=True).
- Chain D2 (popup/dialog): output failure_category=UNEXPECTED_DIALOG → handoff to DialogHandler (to be wired in P1).
2. Final architecture of the core/validation/ package
2.1. Directory tree
core/validation/
├── __init__.py              # public exports: Validator, Verdict, ValidationResult
├── result.py                # dataclasses: Verdict, FailureCategory, ValidationResult
├── checker_base.py          # ActionChecker Protocol + abstract class
├── validator.py             # orchestrator: routes action_type → checkers, escalation
├── prompts.py               # French prompts for LlmJudgeChecker (Easily Assure context)
└── checkers/
    ├── __init__.py
    ├── pixel_diff.py        # wrapper around ReplayVerifier.verify_action (existing)
    ├── ocr_roi.py           # NEW: fixes the step 10 bug
    ├── title_bar.py         # wrapper around core/grounding/title_verifier.py (existing)
    ├── json_schema.py       # pydantic v2 for extract_text/t2a_decision
    ├── dialog_presence.py   # (P1) VM modal cascade
    └── llm_judge.py         # wrapper around ReplayVerifier.verify_with_critic (existing)
Rationale for the core/validation/ package: the code is not coupled to agent_v0/server_v1/ (no FastAPI, no DB). It can be tested in isolation (pytest tests/unit/test_validator_*.py). This stays consistent with core/grounding/, core/execution/ and core/auth/.
2.2. Checker interface (Protocol)
# core/validation/checker_base.py
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Dict, Optional, Protocol, runtime_checkable

if TYPE_CHECKING:
    # Imported for type checkers only, so the forward reference below resolves.
    from core.validation.result import ValidationResult


@runtime_checkable
class ActionChecker(Protocol):
    """Contract for a checker. Stateless where possible (models shared as singletons)."""

    name: str
    budget_ms: float  # target latency (informative; no hard timeout here)

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],  # base64 or path
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> "ValidationResult":  # forward reference
        ...
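To illustrate how a class satisfies this contract without inheriting from it, here is a minimal sketch. It redefines a reduced stand-in for ValidationResult so the block runs on its own; AlwaysCompleteChecker and NotAChecker are hypothetical names used only for illustration, not part of the proposed package.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Dict, Optional, Protocol, runtime_checkable


@dataclass
class ValidationResult:
    """Reduced stand-in for core.validation.result.ValidationResult (assumption)."""
    verdict: str
    confidence: float
    check_used: str


@runtime_checkable
class ActionChecker(Protocol):
    name: str
    budget_ms: float

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        ...


class AlwaysCompleteChecker:
    """Hypothetical checker: no inheritance, it only matches the Protocol's shape."""
    name = "always_complete"
    budget_ms = 1.0

    def check(self, action, result, screenshot_before, screenshot_after, context):
        return ValidationResult("complete", 1.0, self.name)


class NotAChecker:
    """Lacks budget_ms and check(), so it does not satisfy the Protocol."""
    name = "broken"


checker = AlwaysCompleteChecker()
is_checker = isinstance(checker, ActionChecker)          # structural check
is_not_checker = isinstance(NotAChecker(), ActionChecker)
outcome = checker.check({"type": "click"}, {"success": True}, None, None, {})
```

Note that a runtime isinstance check against a Protocol only verifies that the members exist; it does not verify the signature of check(), so a static type checker remains the real guard.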
2.3. Responsibilities per file
| File | Responsibility | Lines (est.) |
|---|---|---|
| result.py | Verdict/FailureCategory enums + ValidationResult dataclass + to_dict() | 60 |
| checker_base.py | ActionChecker Protocol | 20 |
| validator.py | dispatcher action_type → checker list, LLM escalation if confidence < threshold | 100 |
| prompts.py | French f-string template (Easily/DPI/tabs) | 40 |
| checkers/pixel_diff.py | wrapper ReplayVerifier.verify_action → ValidationResult | 50 |
| checkers/ocr_roi.py | ROI crop + EasyOCR + suspect-token match + expected-text match | 110 |
| checkers/title_bar.py | wrapper TitleVerifier.verify_action → ValidationResult | 60 |
| checkers/json_schema.py | pydantic v2 schemas for extract_text/t2a_decision | 80 |
| checkers/llm_judge.py | wrapper ReplayVerifier.verify_with_critic → ValidationResult | 70 |
Total: ~590 LOC for the full package. ~290 LOC for the P0 MVP (result.py + checker_base.py + validator.py + ocr_roi.py, i.e. 60 + 20 + 100 + 110).
3. Complete code for each Checker (production-ready)
3.1. core/validation/result.py
# core/validation/result.py
"""Validator dataclasses: Verdict, FailureCategory, ValidationResult."""
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class Verdict(str, Enum):
    """Three possible verdicts, modeled on Skyvern (complete/terminate/continue)."""

    COMPLETE = "complete"    # the action had the intended effect → move to the next step
    CONTINUE = "continue"    # the effect is not visible yet → wait + recheck
    TERMINATE = "terminate"  # unrecoverable failure → supervised pause


class FailureCategory(str, Enum):
    """Failure classification (inspired by Skyvern's 12 categories, narrowed to our context)."""

    WRONG_TARGET = "wrong_target"            # click landed elsewhere (e.g. wrong tab)
    WRONG_APPLICATION = "wrong_application"  # click in the Edge chrome instead of Easily (step 10 bug)
    NO_VISUAL_CHANGE = "no_visual_change"    # action with no visible effect
    UNEXPECTED_DIALOG = "unexpected_dialog"  # unexpected popup (handoff to DialogHandler)
    OCR_TEXT_MISSING = "ocr_text_missing"    # expected text absent from the ROI
    SCHEMA_INVALID = "schema_invalid"        # invalid JSON/extract
    UI_LOADING = "ui_loading"                # spinner detected → wait
    UNKNOWN = "unknown"


@dataclass
class ValidationResult:
    """Aggregated result of one check. Always JSON-serializable."""

    verdict: Verdict
    confidence: float  # 0.0-1.0
    check_used: str    # "ocr_roi" | "llm_judge" | "title_bar" | ...
    elapsed_ms: float
    reasoning: str = ""
    failure_category: Optional[FailureCategory] = None
    raw_evidence: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return {
            "verdict": self.verdict.value,
            "confidence": round(self.confidence, 3),
            "check_used": self.check_used,
            "elapsed_ms": round(self.elapsed_ms, 1),
            "reasoning": self.reasoning,
            "failure_category": (
                self.failure_category.value if self.failure_category else None
            ),
            "raw_evidence": self.raw_evidence,
        }
3.2. core/validation/checkers/pixel_diff.py: pre-filter (~10 ms, 15 ms budget)
# core/validation/checkers/pixel_diff.py
"""Wrapper around the existing pixel-based ReplayVerifier: fast pre-filter."""
from __future__ import annotations

import time
from typing import Any, Dict, Optional

from core.validation.result import ValidationResult, Verdict, FailureCategory


class PixelDiffChecker:
    name = "pixel_diff"
    budget_ms = 15.0

    def __init__(self, replay_verifier):
        # Injection: reuse the existing ReplayVerifier instance
        # on the api_stream side (global _replay_verifier).
        self._rv = replay_verifier

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()
        pr = self._rv.verify_action(
            action=action,
            result=result,
            screenshot_before=screenshot_before,
            screenshot_after=screenshot_after,
        )
        elapsed = (time.time() - t0) * 1000

        # Map the pixel verdict → Validator verdict
        if pr.suggestion == "continue" and pr.changes_detected:
            verdict = Verdict.COMPLETE
            conf = pr.confidence
            fc = None
        elif pr.suggestion == "retry":
            verdict = Verdict.CONTINUE
            conf = max(0.4, pr.confidence - 0.2)
            fc = FailureCategory.NO_VISUAL_CHANGE
        else:
            # Any other suggestion: inconclusive, let the next checker decide
            verdict = Verdict.CONTINUE
            conf = 0.3
            fc = None

        return ValidationResult(
            verdict=verdict,
            confidence=conf,
            check_used=self.name,
            elapsed_ms=elapsed,
            reasoning=pr.detail,
            failure_category=fc,
            raw_evidence={
                "change_area_pct": pr.change_area_pct,
                "local_change_pct": pr.local_change_pct,
            },
        )
3.3. core/validation/checkers/ocr_roi.py: fixes the step 10 bug
# core/validation/checkers/ocr_roi.py
"""OcrRoiChecker: verifies that the expected text appears in the clicked ROI.

Designed specifically to fix the step 10 bug (REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08):
if we clicked for 'Imagerie' but the 60-120 px ROI around the clicked point
contains 'edge', 'https' or a domain, the click landed in the browser chrome.
"""
from __future__ import annotations

import time
import unicodedata
from typing import Any, Callable, Dict, Optional

from core.validation.result import ValidationResult, Verdict, FailureCategory


def _strip_accents(s: str) -> str:
    """NFKD + drop diacritics + lowercase: case/accent-robust matching for Easily tabs."""
    return "".join(
        c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c)
    ).lower().strip()


def _first_not_none(*values: Any) -> Any:
    """Return the first value that is not None, so a 0.0 coordinate is kept."""
    return next((v for v in values if v is not None), None)


class OcrRoiChecker:
    name = "ocr_roi"
    budget_ms = 200.0

    # Tokens proving the click landed in the browser/system chrome, not in the app
    SUSPECT_TOKENS = (
        "edge", "chrome", "firefox", "mozilla", "opera",
        "http", "https", "www.",
        ".com", ".fr", ".org", ".net", ".html",
        "favoris", "favorite", "bookmark",
        "barre d'adresse", "address bar",
        "nouvel onglet", "new tab",
        "sécurité windows", "windows security",
        "user account control", "contrôle de compte",
    )

    def __init__(
        self,
        ocr_fn: Callable,  # callable(PIL.Image) -> str (EasyOCR singleton)
        radius_px: int = 80,  # 80 = recall/latency trade-off on Easily tabs
        suspect_min_confidence: float = 0.85,
        expected_min_confidence: float = 0.90,
    ):
        self._ocr = ocr_fn
        self._radius = radius_px
        self._suspect_conf = suspect_min_confidence
        self._expected_conf = expected_min_confidence

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()

        # 1. Gather inputs (the action can be a click_anchor or a type)
        target_spec = action.get("target_spec") or {}
        expected_text = (
            action.get("by_text")
            or target_spec.get("by_text")
            or context.get("expected_text")
            or ""
        )
        # actual_position reported by the agent (priority), else the action coords.
        # Explicit None checks: chaining with `or` would discard a valid 0.0.
        actual_pos = result.get("actual_position") or {}
        x_pct = _first_not_none(
            actual_pos.get("x_pct"), action.get("x_pct"), target_spec.get("x_pct")
        )
        y_pct = _first_not_none(
            actual_pos.get("y_pct"), action.get("y_pct"), target_spec.get("y_pct")
        )
        if not screenshot_after or x_pct is None or y_pct is None or not expected_text:
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.2,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning="ROI indéfinie (manque coords ou expected_text)",
            )

        # 2. Crop the ROI
        try:
            img = self._load_image(screenshot_after)
        except Exception as exc:
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.1,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning=f"Erreur chargement image: {exc}",
            )
        w, h = img.size
        cx, cy = int(float(x_pct) * w), int(float(y_pct) * h)
        r = self._radius
        bbox = (max(0, cx - r), max(0, cy - r), min(w, cx + r), min(h, cy + r))
        roi = img.crop(bbox)

        # 3. OCR
        try:
            raw_text = self._ocr(roi) or ""
        except Exception as exc:
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.1,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning=f"Erreur OCR: {exc}",
            )
        text_norm = _strip_accents(raw_text)
        expected_norm = _strip_accents(expected_text)
        elapsed_ms = (time.time() - t0) * 1000
        evidence = {
            "roi_text": raw_text[:200],
            "roi_bbox": list(bbox),
            "expected": expected_text,
        }

        # 4. Suspect-token detection (absolute priority: step 10 bug)
        for suspect in self.SUSPECT_TOKENS:
            # Normalize the token like the OCR text; otherwise accented tokens
            # ("sécurité windows") can never match the accent-stripped text.
            suspect_norm = _strip_accents(suspect)
            if suspect_norm in text_norm and suspect_norm not in expected_norm:
                return ValidationResult(
                    verdict=Verdict.TERMINATE,
                    confidence=self._suspect_conf,
                    check_used=self.name,
                    elapsed_ms=elapsed_ms,
                    failure_category=FailureCategory.WRONG_APPLICATION,
                    reasoning=(
                        f"Token navigateur/système '{suspect}' dans ROI clic "
                        f"(attendu '{expected_text[:40]}') — cible hors-app"
                    ),
                    raw_evidence=evidence,
                )

        # 5. Expected-text match
        if expected_norm and expected_norm in text_norm:
            return ValidationResult(
                verdict=Verdict.COMPLETE,
                confidence=self._expected_conf,
                check_used=self.name,
                elapsed_ms=elapsed_ms,
                reasoning=f"Texte '{expected_text[:40]}' trouvé dans ROI",
                raw_evidence=evidence,
            )

        # 6. Word-by-word partial match (tolerates punctuation, partial accents)
        expected_tokens = [t for t in expected_norm.split() if len(t) > 2]
        if expected_tokens:
            hits = sum(1 for tok in expected_tokens if tok in text_norm)
            ratio = hits / len(expected_tokens)
            if ratio >= 0.5:
                return ValidationResult(
                    verdict=Verdict.COMPLETE,
                    confidence=0.6 + 0.3 * ratio,
                    check_used=self.name,
                    elapsed_ms=elapsed_ms,
                    reasoning=f"Match partiel {hits}/{len(expected_tokens)} tokens",
                    raw_evidence=evidence,
                )

        # 7. Not suspect, not found → escalate to the LLM judge
        return ValidationResult(
            verdict=Verdict.CONTINUE,
            confidence=0.4,
            check_used=self.name,
            elapsed_ms=elapsed_ms,
            failure_category=FailureCategory.OCR_TEXT_MISSING,
            reasoning=f"Texte '{expected_text[:40]}' non trouvé dans ROI",
            raw_evidence=evidence,
        )

    @staticmethod
    def _load_image(source: str):
        """Load a PNG/JPEG from a path or base64 (reuses ReplayVerifier)."""
        # Note: this import ties the checker to agent_v0.server_v1 at call time.
        from agent_v0.server_v1.replay_verifier import ReplayVerifier
        return ReplayVerifier()._load_single_image(source)
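OcrRoiChecker expects ocr_fn to be a callable taking an image and returning a plain string. The text does not show what TitleVerifier._get_ocr() actually returns, so the following is only a sketch of an adapter, assuming the object is (or behaves like) an easyocr.Reader, whose readtext() accepts image bytes and returns a list of strings when called with detail=0. make_ocr_fn, _FakeReader and _FakeImage are hypothetical names; the fakes exist only to make the block runnable without a GPU or the real libraries.

```python
import io
from typing import Any, Callable


def make_ocr_fn(reader: Any) -> Callable[[Any], str]:
    """Adapt an EasyOCR-like reader to the callable(image) -> str shape.

    Assumption: `reader.readtext(image_bytes, detail=0)` returns a list of
    recognized strings, as easyocr.Reader does.
    """
    def ocr_fn(img: Any) -> str:
        buf = io.BytesIO()
        img.save(buf, format="PNG")  # PIL.Image.save accepts a file-like object
        lines = reader.readtext(buf.getvalue(), detail=0)
        return " ".join(str(line) for line in lines)

    return ocr_fn


class _FakeReader:
    """Test double standing in for easyocr.Reader (illustration only)."""

    def readtext(self, image, detail=1):
        assert isinstance(image, bytes)
        return ["https://", "easily.example"]


class _FakeImage:
    """Test double standing in for a PIL.Image crop (illustration only)."""

    def save(self, fp, format=None):
        fp.write(b"not-a-real-png")


ocr_fn = make_ocr_fn(_FakeReader())
text = ocr_fn(_FakeImage())
```

Joining the fragments with a space keeps substring matching on tokens such as "https" intact while avoiding accidental concatenation of neighboring words.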
3.4. core/validation/checkers/title_bar.py: existing wrapper
# core/validation/checkers/title_bar.py
"""Wrapper around core/grounding/title_verifier.TitleVerifier (existing, prod-ready)."""
from __future__ import annotations

import time
from typing import Any, Dict, Optional

from core.validation.result import ValidationResult, Verdict, FailureCategory


class TitleBarChecker:
    name = "title_bar"
    budget_ms = 130.0

    def __init__(self):
        from core.grounding.title_verifier import TitleVerifier
        self._tv = TitleVerifier()

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()
        action_type = action.get("type", "")
        if not screenshot_before or not screenshot_after:
            return ValidationResult(
                verdict=Verdict.CONTINUE, confidence=0.2,
                check_used=self.name, elapsed_ms=(time.time() - t0) * 1000,
                reasoning="screenshots manquants",
            )

        # Reuse the ReplayVerifier image loader (path or base64)
        from agent_v0.server_v1.replay_verifier import ReplayVerifier
        rv = ReplayVerifier()
        img_b = rv._load_single_image(screenshot_before)
        img_a = rv._load_single_image(screenshot_after)
        verif = self._tv.verify_action(img_b, img_a, action_type)
        elapsed = (time.time() - t0) * 1000

        # Same evidence payload for both outcomes
        evidence = {
            "title_before": verif.get("title_before", ""),
            "title_after": verif.get("title_after", ""),
        }
        if verif.get("success"):
            return ValidationResult(
                verdict=Verdict.COMPLETE,
                confidence=0.75 if verif.get("changed") else 0.5,
                check_used=self.name, elapsed_ms=elapsed,
                reasoning=verif.get("reason", ""),
                raw_evidence=evidence,
            )
        return ValidationResult(
            verdict=Verdict.TERMINATE, confidence=0.7,
            check_used=self.name, elapsed_ms=elapsed,
            failure_category=FailureCategory.WRONG_APPLICATION,
            reasoning=verif.get("reason", ""),
            raw_evidence=evidence,
        )
3.5. core/validation/checkers/json_schema.py: pydantic v2
# core/validation/checkers/json_schema.py
"""JsonSchemaChecker: deterministic validation of extract_text / t2a_decision."""
from __future__ import annotations

import json
import time
from typing import Any, Dict, Literal, Optional

from pydantic import BaseModel, Field, ValidationError, field_validator

from core.validation.result import ValidationResult, Verdict, FailureCategory


class ExtractTextResult(BaseModel):
    """Expected output of an extract_text action: non-empty text, plausible language."""

    value: str = Field(min_length=1, max_length=50000)

    @field_validator("value")
    @classmethod
    def must_have_letters(cls, v: str) -> str:
        if not any(c.isalpha() for c in v):
            raise ValueError("aucune lettre — extract_text vraisemblablement vide")
        return v


class T2aDecisionResult(BaseModel):
    """Expected output of a t2a_decision action (strict JSON)."""

    decision: Literal["UHCD", "FORFAIT", "FORFAIT_URGENCE", "NA", "INCONNU"]
    justification: str = Field(min_length=10, max_length=5000)
    confidence: Optional[float] = Field(default=None, ge=0.0, le=1.0)


_SCHEMA_BY_ACTION = {
    "extract_text": ExtractTextResult,
    "extract_text_scroll": ExtractTextResult,
    "t2a_decision": T2aDecisionResult,
}


class JsonSchemaChecker:
    name = "json_schema"
    budget_ms = 10.0

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()
        action_type = action.get("type", "")
        schema_cls = _SCHEMA_BY_ACTION.get(action_type)
        if schema_cls is None:
            return ValidationResult(
                verdict=Verdict.COMPLETE, confidence=0.5,
                check_used=self.name, elapsed_ms=(time.time() - t0) * 1000,
                reasoning=f"Pas de schema pour action_type={action_type} — skip",
            )

        # Extract the payload from result (per server convention)
        payload = result.get("value") or result.get("extracted") or result
        if isinstance(payload, str):
            raw = payload
            try:
                payload = json.loads(raw)
            except json.JSONDecodeError:
                payload = raw
            # extract_text: wrap plain text, including text that happens to
            # parse as a JSON scalar or list (e.g. "42"), as {"value": ...}
            if not isinstance(payload, dict) and action_type.startswith("extract_text"):
                payload = {"value": raw}

        try:
            validated = schema_cls.model_validate(payload)
            return ValidationResult(
                verdict=Verdict.COMPLETE, confidence=0.95,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                reasoning=f"Schema {schema_cls.__name__} validé",
                raw_evidence={"validated_keys": list(validated.model_dump().keys())},
            )
        except ValidationError as ve:
            # ve.errors() can embed exception objects in "ctx" (custom validators),
            # which are not JSON-serializable; round-trip through ve.json() instead.
            errors = json.loads(ve.json())
            return ValidationResult(
                verdict=Verdict.TERMINATE, confidence=0.9,
                check_used=self.name,
                elapsed_ms=(time.time() - t0) * 1000,
                failure_category=FailureCategory.SCHEMA_INVALID,
                reasoning=f"Schema invalide: {errors[:2]}",
                raw_evidence={"errors": errors},
            )
3.6. core/validation/checkers/llm_judge.py: escalation wrapper
# core/validation/checkers/llm_judge.py
"""LlmJudgeChecker: VLM escalation via ReplayVerifier.verify_with_critic.

Reuses the existing VLM pipeline (gemma4:e4b, port 11435).
gemma4 vs Qwen3-VL: gemma4 was selected by BENCH_SAFETY_CHECKS_2026-05-06
(46% detection vs 0% for Qwen3-VL, which ignores Ollama's format=json).
"""
from __future__ import annotations

import time
from typing import Any, Dict, Optional

from core.validation.result import ValidationResult, Verdict, FailureCategory


class LlmJudgeChecker:
    name = "llm_judge"
    budget_ms = 3000.0

    def __init__(self, replay_verifier):
        self._rv = replay_verifier

    def check(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str],
        screenshot_after: Optional[str],
        context: Dict[str, Any],
    ) -> ValidationResult:
        t0 = time.time()
        expected = context.get("expected_result") or action.get("expected_result", "")
        intention = context.get("action_intention") or action.get("intention", "")
        workflow_ctx = context.get("workflow_context", "")
        if not expected:
            return ValidationResult(
                verdict=Verdict.CONTINUE, confidence=0.2,
                check_used=self.name, elapsed_ms=(time.time() - t0) * 1000,
                reasoning="Pas d'expected_result fourni — LLM judge skip",
            )

        critic = self._rv.verify_with_critic(
            action=action,
            result=result,
            screenshot_before=screenshot_before,
            screenshot_after=screenshot_after,
            expected_result=expected,
            action_intention=intention,
            workflow_context=workflow_ctx,
        )
        elapsed = (time.time() - t0) * 1000

        if critic.semantic_verified is True:
            return ValidationResult(
                verdict=Verdict.COMPLETE,
                confidence=max(critic.confidence, 0.7),  # floor at 0.7 once the VLM confirms
                check_used=self.name, elapsed_ms=elapsed,
                reasoning=critic.semantic_detail or critic.detail,
                raw_evidence={
                    "pixel_change_pct": critic.change_area_pct,
                    "semantic_verified": True,
                },
            )
        elif critic.semantic_verified is False:
            return ValidationResult(
                verdict=Verdict.TERMINATE,
                confidence=0.8,
                check_used=self.name, elapsed_ms=elapsed,
                failure_category=FailureCategory.WRONG_TARGET,
                reasoning=critic.semantic_detail or critic.detail,
                raw_evidence={"semantic_verified": False},
            )
        else:
            # VLM unavailable or unparsable → uncertain, continue cautiously
            return ValidationResult(
                verdict=Verdict.CONTINUE,
                confidence=0.4,
                check_used=self.name, elapsed_ms=elapsed,
                reasoning=critic.detail or "VLM indisponible",
            )
3.7. core/validation/validator.py: orchestrator
# core/validation/validator.py
"""Validator: orchestrator that routes action_type → checkers and handles escalation."""
from __future__ import annotations

import logging
from typing import Any, Dict, List, Optional

from core.validation.checker_base import ActionChecker
from core.validation.result import ValidationResult, Verdict

logger = logging.getLogger(__name__)


class Validator:
    """Dispatcher: one action_type → ordered list of checkers.

    Decision logic:
    - A checker returns COMPLETE or TERMINATE with conf >= accept_confidence → return it
      (TERMINATE leads to a supervised pause downstream)
    - CONTINUE or low confidence → try the next checker
    - No checker was conclusive and the last confidence is below
      escalate_below_confidence → escalate to the LLM judge, if provided
    - Otherwise → return the last result as-is
    """

    def __init__(
        self,
        checkers: Dict[str, List[ActionChecker]],
        default_checkers: Optional[List[ActionChecker]] = None,
        escalation_checker: Optional[ActionChecker] = None,
        accept_confidence: float = 0.7,
        escalate_below_confidence: float = 0.55,
    ):
        self._checkers = checkers
        self._default = default_checkers or []
        self._escalation = escalation_checker
        self._accept = accept_confidence
        self._escalate_below = escalate_below_confidence

    def validate(
        self,
        action: Dict[str, Any],
        result: Dict[str, Any],
        screenshot_before: Optional[str] = None,
        screenshot_after: Optional[str] = None,
        context: Optional[Dict[str, Any]] = None,
    ) -> ValidationResult:
        ctx = context or {}
        action_type = action.get("type", "")
        candidates = self._checkers.get(action_type, self._default)
        last: Optional[ValidationResult] = None

        for checker in candidates:
            try:
                res = checker.check(action, result, screenshot_before, screenshot_after, ctx)
            except Exception as exc:
                # A crashing checker is skipped; it never blocks the replay
                logger.warning("Validator: checker %s a planté: %s", checker.name, exc)
                continue
            last = res
            logger.info(
                "[VALIDATOR] check=%s verdict=%s conf=%.2f elapsed=%.0fms reasoning=%s",
                res.check_used, res.verdict.value, res.confidence,
                res.elapsed_ms, res.reasoning[:80],
            )
            # Clear verdict + high confidence → take it
            if res.confidence >= self._accept and res.verdict != Verdict.CONTINUE:
                return res

        # Escalate to the LLM judge if confidence is too low
        if (
            self._escalation
            and last is not None
            and last.confidence < self._escalate_below
        ):
            logger.info(
                "[VALIDATOR] escalation LLM (last conf=%.2f < %.2f)",
                last.confidence, self._escalate_below,
            )
            try:
                esc = self._escalation.check(
                    action, result, screenshot_before, screenshot_after, ctx
                )
                # The LLM judge decides; its result is returned as-is. Nothing here
                # enforces the 0.9 confidence cap, which is assumed to come from upstream.
                return esc
            except Exception as exc:
                logger.warning("Validator: escalation LLM a planté: %s", exc)

        # Fallback: last result, or a neutral CONTINUE
        return last or ValidationResult(
            verdict=Verdict.CONTINUE,
            confidence=0.3,
            check_used="no_checker",
            elapsed_ms=0.0,
            reasoning="Aucun checker disponible pour action_type=" + action_type,
        )
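The interplay of the two thresholds is easier to see on toy inputs. The sketch below mirrors the decision flow of Validator.validate with (verdict, confidence) pairs in place of real checkers; decide is a hypothetical helper written for illustration, using the default thresholds 0.7 and 0.55 from the constructor above.

```python
from typing import List, Optional, Tuple

Result = Tuple[str, float]  # (verdict, confidence)


def decide(
    results: List[Result],
    accept: float = 0.7,
    escalate_below: float = 0.55,
    has_escalation: bool = True,
) -> str:
    """Mirror of the Validator decision flow on pre-computed checker results."""
    last: Optional[Result] = None
    for verdict, conf in results:
        last = (verdict, conf)
        # Clear verdict with high confidence: returned immediately
        if conf >= accept and verdict != "continue":
            return f"return:{verdict}"
    # Nothing conclusive: escalate only when the last confidence is low enough
    if has_escalation and last is not None and last[1] < escalate_below:
        return "escalate"
    # Otherwise the last result (or a neutral CONTINUE) is returned unchanged
    return "fallback:" + (last[0] if last else "continue")


ocr_text_missing = decide([("continue", 0.4)])    # OcrRoiChecker step 7
suspect_token = decide([("terminate", 0.85)])     # OcrRoiChecker step 4
grey_zone = decide([("complete", 0.6)])           # between 0.55 and 0.7
no_checker = decide([])                           # unknown action_type
no_judge = decide([("continue", 0.4)], has_escalation=False)
```

One consequence worth noting: a result with confidence in the 0.55 to 0.7 band is neither accepted early nor escalated; it is returned as the fallback, whatever its verdict.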
4. Final action → check matrix (with target latency)
Aligned with _ALLOWED_ACTION_TYPES (replay_engine.py:35-48) and reference_vwb_action_types.md.
| VWB action | Primary checker | Escalation | Cumulative target latency |
|---|---|---|---|
| click (from click_anchor) | OcrRoiChecker | LlmJudgeChecker if conf<0.55 | 80 ms + 2.5 s rare |
| double_click (double_click_anchor) | TitleBarChecker → OcrRoiChecker | LlmJudgeChecker | 200 ms + 2.5 s rare |
| right_click | PixelDiffChecker (menu expected) | OcrRoiChecker on the menu | 15 ms + 80 ms |
| type | OcrRoiChecker (120 px radius on the input) | none | 100 ms |
| key_combo | TitleBarChecker | LlmJudgeChecker if Ctrl+nav | 130 ms + 2.5 s rare |
| scroll | PixelDiffChecker | none | 15 ms |
| wait / verify_screen | PixelDiffChecker (no_change expected) | none | 15 ms |
| extract_text / extract_text_scroll | JsonSchemaChecker | LlmJudgeChecker if len<50 | 10 ms + 2.5 s rare |
| extract_table | JsonSchemaChecker (rows ≥ 1) | none | 10 ms |
| t2a_decision | JsonSchemaChecker strict | none | 10 ms |
| pause_for_human | (already QW4 ChecklistPanel: skip) | none | 0 ms |
| screenshot_evidence | TitleBarChecker (correct app) | none | 130 ms |
| paste_and_execute | PixelDiffChecker (input filled) | OcrRoiChecker | 15 ms + 80 ms rare |
Total budget for the 46-step MOREL demo:
- 30 clicks × 80 ms = 2.4 s
- 8 extract_text × 10 ms = 80 ms
- 4 t2a_decision × 10 ms = 40 ms
- 4 key_combo × 130 ms = 520 ms
- LLM escalations (~3 times) × 2500 ms = 7.5 s
- Total added ≈ 10.5 s (≤ 11 s) over 46 steps. Acceptable against the 30-60 s saved by avoiding a step 10 blockage → pause + manual resume (33 s observed).
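The budget arithmetic above can be re-derived in a few lines, which also makes it easy to re-run if the step mix changes. The figures are the ones listed; the variable names are illustrative only.

```python
# (action, number of steps, target latency per step in ms), as listed above
PER_STEP_BUDGET = [
    ("click", 30, 80),
    ("extract_text", 8, 10),
    ("t2a_decision", 4, 10),
    ("key_combo", 4, 130),
]
LLM_ESCALATIONS = 3        # approximate count of rare escalations
LLM_ESCALATION_MS = 2500   # per escalation

total_steps = sum(count for _, count, _ in PER_STEP_BUDGET)
checker_ms = sum(count * ms for _, count, ms in PER_STEP_BUDGET)
escalation_ms = LLM_ESCALATIONS * LLM_ESCALATION_MS
total_ms = checker_ms + escalation_ms

print(f"{total_steps} steps: {checker_ms} ms checkers + "
      f"{escalation_ms} ms escalations = {total_ms / 1000:.2f} s")
```

The LLM escalations account for roughly 70% of the added time, so the escalation rate is the number to watch in production logs.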
5. Verdict taxonomy + routing (post-validation dispatcher)
# Pseudocode to insert into api_stream.report_action_result after the
# Validator V2 validation (see §6 wiring)
from typing import Any, Dict

from core.validation.result import FailureCategory, ValidationResult, Verdict


def route_verdict(
    verdict_result: ValidationResult,
    action_id: str,
    replay_state: Dict[str, Any],
) -> Dict[str, Any]:
    """Convert a Validator verdict into a server action."""
    v = verdict_result.verdict
    fc = verdict_result.failure_category

    if v == Verdict.COMPLETE:
        return {"action": "continue", "override_success": True}

    if v == Verdict.CONTINUE:
        # Recheck after a short wait (UI loading, animation)
        return {
            "action": "schedule_recheck",
            "after_ms": 1500,
            "max_rechecks": 2,
        }

    # v == TERMINATE: routing by failure_category
    if fc == FailureCategory.WRONG_APPLICATION:
        # Step 10 bug: supervised pause, the human takes over
        return {
            "action": "enter_paused_state",
            "reason": "wrong_application",
            "evidence": verdict_result.to_dict(),
            "override_success": False,
        }
    if fc == FailureCategory.WRONG_TARGET:
        # Retry once with re-resolve (full visual cascade)
        return {
            "action": "retry_with_reresolve",
            "max_retries": 1,
            "override_success": False,
        }
    if fc == FailureCategory.UNEXPECTED_DIALOG:
        # Handoff to DialogHandler (chain D2)
        return {
            "action": "handoff_dialog_handler",
            "override_success": False,
        }
    if fc == FailureCategory.SCHEMA_INVALID:
        # Invalid extract_text/t2a_decision → supervised pause
        return {
            "action": "enter_paused_state",
            "reason": "schema_invalid",
            "evidence": verdict_result.to_dict(),
            "override_success": False,
        }
    # NO_VISUAL_CHANGE, OCR_TEXT_MISSING, UI_LOADING, UNKNOWN → simple retry
    return {
        "action": "retry_with_reresolve",
        "max_retries": 1,
        "override_success": False,
    }
6. Precise wiring in api_stream.py:3447 (unified diff)
The insertion point is right after the existing verify_with_critic block (api_stream.py:3554-3582). Nothing breaks: the new layer is additive, behind RPA_VALIDATOR_V2_ENABLED.
6.1. Proposed diff (do NOT hot-apply)
--- a/agent_v0/server_v1/api_stream.py
+++ b/agent_v0/server_v1/api_stream.py
@@ -3447,6 +3447,18 @@ async def report_action_result(report: ReplayResultReport):
     session_id = report.session_id
     action_id = report.action_id
+    # ============================================================
+    # VALIDATOR V2 (feature flag): lazy singleton init
+    # ============================================================
+    global _validator_v2
+    _RPA_VALIDATOR_V2 = os.environ.get("RPA_VALIDATOR_V2_ENABLED", "false").lower() in {"true", "1", "yes"}
+    if _RPA_VALIDATOR_V2 and _validator_v2 is None:
+        from core.validation.validator import Validator
+        from core.validation.checkers.ocr_roi import OcrRoiChecker
+        from core.validation.checkers.llm_judge import LlmJudgeChecker
+        from core.grounding.title_verifier import TitleVerifier
+        _tv = TitleVerifier()
+        _ocr_fn = _tv._get_ocr()  # shared EasyOCR singleton
+        _validator_v2 = Validator(
+            checkers={
+                "click": [OcrRoiChecker(ocr_fn=_ocr_fn, radius_px=80)],
+                "type": [OcrRoiChecker(ocr_fn=_ocr_fn, radius_px=120)],
+            },
+            escalation_checker=LlmJudgeChecker(_replay_verifier),
+            accept_confidence=0.7,
+            escalate_below_confidence=0.55,
+        )
+
     # [REPLAY] structured log on arrival of the agent report
     ...
@@ -3580,6 +3592,40 @@ async def report_action_result(report: ReplayResultReport):
     async with _async_replay_lock():
         replay_state["_last_screenshot_before"] = screenshot_after
+    # ============================================================
+    # VALIDATOR V2: additional layer (kill switch RPA_VALIDATOR_V2_ENABLED)
+    # ============================================================
+    validator_v2_result = None
+    if _RPA_VALIDATOR_V2 and report.success and screenshot_after and not skip_verify:
+        try:
+            action_dict = original_action or {"type": "unknown", "action_id": action_id}
+            result_dict = {
+                "success": report.success,
+                "error": report.error,
+                "actual_position": report.actual_position,
+            }
+            v2_ctx = {
+                "expected_result": (original_action or {}).get("expected_result", ""),
+                "action_intention": (original_action or {}).get("intention", ""),
+                "workflow_context": f"step {replay_state.get('completed_actions', 0)+1}/{len(replay_state.get('actions', []))}",
+                # `or {}` also covers an explicit target_spec=None
+                "expected_text": ((original_action or {}).get("target_spec") or {}).get("by_text", ""),
+            }
+            validator_v2_result = _validator_v2.validate(
+                action=action_dict, result=result_dict,
+                screenshot_before=screenshot_before,
+                screenshot_after=screenshot_after,
+                context=v2_ctx,
+            )
+            # Override success if Validator V2 says TERMINATE with high confidence
+            from core.validation.result import Verdict
+            if validator_v2_result.verdict == Verdict.TERMINATE and validator_v2_result.confidence >= 0.7:
+                logger.warning(
+                    "[VALIDATOR_V2] override agent_success=True → False (verdict=%s reason=%s)",
+                    validator_v2_result.verdict.value, validator_v2_result.reasoning[:120],
+                )
+                report.success = False  # type: ignore[misc]
+                report.error = report.error or f"validator_v2_terminate: {validator_v2_result.failure_category.value if validator_v2_result.failure_category else 'unknown'}"
+        except Exception as exc:
+            logger.warning("Validator V2 a échoué (non bloquant): %s", exc)
+
     # [REPLAY] structured log of the verification decision
@@ -3612,6 +3658,8 @@ async def report_action_result(report: ReplayResultReport):
         "verification": verification.to_dict() if verification else None,
+        "validator_v2": validator_v2_result.to_dict() if validator_v2_result else None,
         "resolution_method": report.resolution_method,
         "resolution_score": report.resolution_score,
Observable effect:
- When RPA_VALIDATOR_V2_ENABLED=false (default): no change, the existing pipeline runs.
- When =true: a TERMINATE verdict with conf≥0.7 overrides report.success to False → the existing server retry kicks in (already wired, lines ~3700+). For WRONG_APPLICATION, the §5 routing enters a supervised pause (to be implemented in P1; for P0 the simple override is enough to catch the bug).
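The two conditions described above (flag parsing and the override rule) are small enough to isolate as pure functions, which makes them unit-testable without a server. This is a sketch: validator_v2_enabled and should_override_success are hypothetical helper names, and the strip() call is an addition not present in the diff (it tolerates stray whitespace in the environment value).

```python
import os
from typing import Mapping

_TRUTHY = {"true", "1", "yes"}


def validator_v2_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True only when RPA_VALIDATOR_V2_ENABLED is set to a truthy value."""
    raw = env.get("RPA_VALIDATOR_V2_ENABLED", "false")
    return raw.strip().lower() in _TRUTHY  # strip() is an assumption, see lead-in


def should_override_success(verdict: str, confidence: float, threshold: float = 0.7) -> bool:
    """Override the agent's success=True only on a confident TERMINATE."""
    return verdict == "terminate" and confidence >= threshold


flag_default = validator_v2_enabled({})
flag_on = validator_v2_enabled({"RPA_VALIDATOR_V2_ENABLED": "TRUE"})
flag_off = validator_v2_enabled({"RPA_VALIDATOR_V2_ENABLED": "0"})
override_bug = should_override_success("terminate", 0.85)   # suspect token found
override_weak = should_override_success("terminate", 0.6)   # below threshold
override_ok = should_override_success("complete", 0.95)     # never overridden
```

Passing the environment as a mapping keeps the helper deterministic in tests while defaulting to the live os.environ in production.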
6.2. Lazy singleton init
The Validator is instantiated lazily (on the first call to report_action_result) to avoid loading EasyOCR (~3 s) at server boot. This also avoids consuming VRAM when the flag is disabled.
_validator_v2: Optional[Validator] = None must be declared at module level alongside the other singletons (_replay_verifier, _audit_trail).
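If report_action_result can ever run concurrently on several threads (an assumption; the text does not say how the handler is scheduled), the bare "is None" check could build the Validator twice. A minimal sketch of a lock-protected lazy holder follows; LazySingleton is a hypothetical helper, not an existing class in the codebase.

```python
import threading
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazySingleton(Generic[T]):
    """Build the instance on first get(), at most once, even across threads."""

    def __init__(self, factory: Callable[[], T]) -> None:
        self._factory = factory
        self._lock = threading.Lock()
        self._instance: Optional[T] = None

    def get(self) -> T:
        if self._instance is None:            # fast path, no lock once built
            with self._lock:
                if self._instance is None:    # re-check under the lock
                    self._instance = self._factory()
        return self._instance


build_calls: List[int] = []


def _expensive_factory() -> object:
    """Stand-in for the Validator construction (EasyOCR load, etc.)."""
    build_calls.append(1)
    return object()


holder = LazySingleton(_expensive_factory)
calls_before_first_get = len(build_calls)   # nothing built at declaration time
first = holder.get()
second = holder.get()
calls_after_two_gets = len(build_calls)
```

With this shape, the factory would wrap the Validator(...) construction from the diff, and the handler would call holder.get() only when the flag is enabled.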
7. Offline reproduction of the step 10 bug
7.1. Available screenshot
Confirmed present: /home/dom/ai/rpa_vision_v3/visual_workflow_builder/backend/data/anchors/anchor_0438bd2d9bdd_1778161174_full.png
This is the full-window 2560×1600 capture containing the Easily tab bar (see REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md §4 and AXE_A4 §4.3).
The coordinates reported by OCR-DIRECT for the 3 tabs collide at (0.2305, 0.2805) → in pixels = (590, 449). This is precisely the point that lands in the Edge URL bar instead of the Easily tab.
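The conversion from fractional coordinates to pixels, and the resulting ROI, can be checked by hand with the sketch below. pct_to_px and roi_bbox are hypothetical helpers that reproduce the arithmetic of OcrRoiChecker (truncation with int() and clamping to the image bounds), using the 2560×1600 capture size and the 80 px radius quoted in this document.

```python
from typing import Tuple


def pct_to_px(x_pct: float, y_pct: float, width: int, height: int,
              truncate: bool = True) -> Tuple[int, int]:
    """Fractional coords → pixels; truncate=True matches int(), as in OcrRoiChecker."""
    fx, fy = x_pct * width, y_pct * height
    return (int(fx), int(fy)) if truncate else (round(fx), round(fy))


def roi_bbox(cx: int, cy: int, radius: int, width: int, height: int) -> Tuple[int, int, int, int]:
    """Square ROI around (cx, cy), clamped to the image bounds (left, top, right, bottom)."""
    return (max(0, cx - radius), max(0, cy - radius),
            min(width, cx + radius), min(height, cy + radius))


W, H = 2560, 1600
rounded = pct_to_px(0.2305, 0.2805, W, H, truncate=False)  # matches the (590, 449) quoted above
truncated = pct_to_px(0.2305, 0.2805, W, H)                # what int() produces in the checker
bbox = roi_bbox(truncated[0], truncated[1], 80, W, H)
corner_bbox = roi_bbox(0, 0, 80, W, H)                     # clamping at the top-left corner
```

The one-pixel difference between rounding (449) and truncation (448) is negligible against an 80 px radius, but it explains why evidence logged by the checker may not match the figure quoted in the text exactly.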
7.2. Full repro snippet
# scripts/repro_bug_step10_validator.py
"""Offline reproduction of the step 10 bug: OcrRoiChecker demonstrated in isolation.

Loads the reference capture, simulates an off-target click at (0.23, 0.28),
and checks that the Validator detects the false click (token 'https' / '.com' in the ROI).

Usage:
    cd /home/dom/ai/rpa_vision_v3 && source .venv/bin/activate
    python scripts/repro_bug_step10_validator.py
"""
from pathlib import Path

from core.validation.checkers.ocr_roi import OcrRoiChecker
from core.grounding.title_verifier import TitleVerifier


def main():
    fixture = Path(
        "/home/dom/ai/rpa_vision_v3/visual_workflow_builder/backend/data/anchors/"
        "anchor_0438bd2d9bdd_1778161174_full.png"
    )
    assert fixture.exists(), f"Fixture absente : {fixture}"

    # EasyOCR singleton obtained through title_verifier (GPU if available)
    tv = TitleVerifier()
    ocr_fn = tv._get_ocr()
    assert ocr_fn is not None, "OCR non chargé (EasyOCR ou docTR requis)"

    checker = OcrRoiChecker(ocr_fn=ocr_fn, radius_px=80)

    # SCENARIO 1: click in the Edge URL bar (step 10 bug)
    # coords resolved by OCR-DIRECT for 'Imagerie' = (0.2305, 0.2805),
    # but these coords land in the Edge URL bar
    action_bug = {
        "type": "click",
        "target_spec": {"by_text": "Imagerie"},
    }
    result_bug = {
        "success": True,
        "actual_position": {"x_pct": 0.2305, "y_pct": 0.155},  # Edge URL bar
    }
    res = checker.check(
        action=action_bug,
        result=result_bug,
        screenshot_before=None,
        screenshot_after=str(fixture),
        context={},
    )
    print("SCENARIO 1 (clic bandeau Edge):")
    print(f"  verdict = {res.verdict.value}")
    print(f"  confidence = {res.confidence:.2f}")
    print(f"  failure_cat = {res.failure_category.value if res.failure_category else None}")
    print(f"  reasoning = {res.reasoning}")
    print(f"  roi_text = {res.raw_evidence.get('roi_text', '')[:100]}")
    print()

    # SCENARIO 2: correct click on the Imagerie tab
    action_ok = {
        "type": "click",
        "target_spec": {"by_text": "Imagerie"},
    }
    result_ok = {
        "success": True,
        "actual_position": {"x_pct": 0.265, "y_pct": 0.295},  # true Imagerie position
    }
    res2 = checker.check(
        action=action_ok,
        result=result_ok,
        screenshot_before=None,
        screenshot_after=str(fixture),
        context={},
    )
    print("SCENARIO 2 (clic correct Imagerie):")
    print(f"  verdict = {res2.verdict.value}")
    print(f"  confidence = {res2.confidence:.2f}")
    print(f"  reasoning = {res2.reasoning}")


if __name__ == "__main__":
    main()
7.3. Expected result
SCENARIO 1 (clic bandeau Edge):
verdict = terminate
confidence = 0.85
failure_cat = wrong_application
reasoning = Token navigateur/système 'https' dans ROI clic (attendu 'Imagerie') — cible hors-app
roi_text = urgence.labs.laurinebazin.design/aiva-urgence/dossier.html...
SCENARIO 2 (clic correct Imagerie):
verdict = complete
confidence = 0.90
reasoning = Texte 'Imagerie' trouvé dans ROI
Typical measured latency: 80-150 ms per check on an RTX 5070 (EasyOCR on GPU, 160×160 crop), 200-400 ms on CPU.
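To reproduce such latency figures locally, one option is a small timing harness like the sketch below. The names measure_latency_ms and summarize are hypothetical, and the callable passed in is assumed to wrap a single checker.check(...) call; warm-up runs are discarded so that one-off model loading does not skew the percentiles.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(fn: Callable[[], object], runs: int = 20, warmup: int = 2) -> List[float]:
    """Call fn repeatedly and return per-call wall-clock latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # discarded: absorbs lazy initialisation and cache warm-up
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Median, 95th percentile and maximum of a list of latencies (needs >= 2 samples)."""
    ordered = sorted(samples)
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")  # 99 cut points
    return {
        "p50": statistics.median(ordered),
        "p95": cuts[94],  # 95th percentile
        "max": ordered[-1],
    }
```

A typical use would be summarize(measure_latency_ms(lambda: checker.check(...))), run once on GPU and once on CPU.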
8. Pytest tests
# tests/unit/test_validator_step10.py
"""Unit tests for the Validator: step 10 bug closed."""
from pathlib import Path

import pytest

from core.validation.result import Verdict, FailureCategory
from core.validation.checkers.ocr_roi import OcrRoiChecker

FIXTURE = Path(
    "/home/dom/ai/rpa_vision_v3/visual_workflow_builder/backend/data/anchors/"
    "anchor_0438bd2d9bdd_1778161174_full.png"
)


@pytest.fixture(scope="module")
def ocr_fn():
    """EasyOCR singleton (GPU if available)."""
    pytest.importorskip("easyocr")
    from core.grounding.title_verifier import TitleVerifier
    fn = TitleVerifier()._get_ocr()
    if fn is None:
        pytest.skip("Aucun OCR disponible")
    return fn


@pytest.fixture
def checker(ocr_fn):
    return OcrRoiChecker(ocr_fn=ocr_fn, radius_px=80)


@pytest.mark.skipif(not FIXTURE.exists(), reason="Fixture screenshot manquante")
def test_step10_bug_detected_when_click_in_url_bar(checker):
    """Step 10 bug scenario: click landed in the Edge URL bar → TERMINATE WRONG_APPLICATION."""
    res = checker.check(
        action={"type": "click", "target_spec": {"by_text": "Imagerie"}},
        result={"success": True, "actual_position": {"x_pct": 0.2305, "y_pct": 0.155}},
        screenshot_before=None,
        screenshot_after=str(FIXTURE),
        context={},
    )
    assert res.verdict == Verdict.TERMINATE
    assert res.failure_category == FailureCategory.WRONG_APPLICATION
    assert res.confidence >= 0.8
    assert "navigateur" in res.reasoning.lower() or "edge" in res.raw_evidence.get("roi_text", "").lower()


@pytest.mark.skipif(not FIXTURE.exists(), reason="Fixture screenshot manquante")
def test_correct_click_on_imagerie_tab(checker):
    """Correct click on the Imagerie tab → COMPLETE."""
    res = checker.check(
        action={"type": "click", "target_spec": {"by_text": "Imagerie"}},
        result={"success": True, "actual_position": {"x_pct": 0.265, "y_pct": 0.295}},
        screenshot_before=None,
        screenshot_after=str(FIXTURE),
        context={},
    )
    assert res.verdict == Verdict.COMPLETE
    assert res.confidence >= 0.6


def test_missing_inputs_returns_continue_low_confidence(checker):
    """No target text and no screenshot → CONTINUE with low confidence."""
    res = checker.check(
        action={"type": "click", "target_spec": {}},
        result={"success": True},
        screenshot_before=None,
        screenshot_after=None,
        context={},
    )
    assert res.verdict == Verdict.CONTINUE
    assert res.confidence < 0.3


def test_strip_accents_robust():
    """_strip_accents lowercases and removes diacritics, leaving URLs intact."""
    from core.validation.checkers.ocr_roi import _strip_accents
    assert _strip_accents("Imagerie") == "imagerie"
    assert _strip_accents("Notes médicales") == "notes medicales"
    assert _strip_accents("Synthèse Urgences") == "synthese urgences"
    assert _strip_accents("URL: https://www.exemple.com") == "url: https://www.exemple.com"
Running the tests:
cd /home/dom/ai/rpa_vision_v3 && source .venv/bin/activate
pytest tests/unit/test_validator_step10.py -v
9. Configuration: environment variables and kill-switches
# Global activation of Validator V2 (default: off)
RPA_VALIDATOR_V2_ENABLED=false
# OcrRoiChecker tuning
RPA_VALIDATOR_OCR_ROI_RADIUS_CLICK=80 # px (default 80)
RPA_VALIDATOR_OCR_ROI_RADIUS_TYPE=120 # px
RPA_VALIDATOR_OCR_SUSPECT_CONFIDENCE=0.85
RPA_VALIDATOR_OCR_EXPECTED_CONFIDENCE=0.90
# Validator orchestrator tuning
RPA_VALIDATOR_ACCEPT_CONFIDENCE=0.70
RPA_VALIDATOR_ESCALATE_BELOW=0.55
# Kill-switch for LLM escalation (expensive, 2-3 s)
RPA_VALIDATOR_LLM_JUDGE_ENABLED=true
# Hard override of the verdict (debug)
RPA_VALIDATOR_FORCE_VERDICT= # empty | complete | continue | terminate
All flags follow the QW Suite Mai convention (see docs/QW_SUITE_MAI.md): RPA_*_ENABLED booleans, read via os.environ.get("...", default).lower() in {"true", "1", "yes"}.
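The reading convention above could be wrapped in two small helpers, sketched here. env_flag and env_float are hypothetical names, not existing project functions; the extra .strip() and the fallback to the default on an unparsable number are assumptions added for robustness.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"true", "1", "yes"}


def env_flag(name: str, default: str = "false", env: Optional[Mapping[str, str]] = None) -> bool:
    """Read a boolean RPA_*_ENABLED style flag."""
    source = os.environ if env is None else env
    return source.get(name, default).strip().lower() in _TRUTHY


def env_float(name: str, default: float, env: Optional[Mapping[str, str]] = None) -> float:
    """Read a numeric tuning value, falling back to the default if unset or invalid."""
    source = os.environ if env is None else env
    raw = source.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return float(raw)
    except ValueError:
        return default  # assumption: a bad value should not crash the server


enabled = env_flag("RPA_VALIDATOR_V2_ENABLED")
accept_confidence = env_float("RPA_VALIDATOR_ACCEPT_CONFIDENCE", 0.70)
```

Passing an explicit env mapping keeps the helpers easy to unit-test without mutating os.environ.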
10. External patterns 2026: verbatim and sources
10.1. Skyvern: check-user-goal-with-termination.j2 prompt, verbatim
Retrieved directly from the repo on 24 May 2026 (raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/check-user-goal-with-termination.j2):
You are here to help the user determine if the user has completed their goal on the web{{ " according to the complete criterion" if complete_criterion else "" }}. Use the content of the elements parsed from the page,{{ "" if without_screenshots else " the screenshots of the page," }} the user goal and user details to determine the status of the task.
Make sure to ONLY return the JSON object in this format with no additional text before or after it:
{
"page_info": str,
"thoughts": str,
"status": str, // "complete" | "terminate" | "continue"
"failure_categories": array // 12 categories, see parent doc §2.3
}
Important: Think carefully about the difference between "terminate" and "continue":
- "terminate" = impossible to achieve, stop trying (e.g., "account does not exist", "file unavailable", permanent error)
- "continue" = not done yet, but achievable with more steps (e.g., page is loading, need to click something, need to wait)
The 12 failure categories: ANTI_BOT_DETECTION, BROWSER_ERROR, NAVIGATION_FAILURE, PAGE_LOAD_TIMEOUT, AUTH_FAILURE, LLM_REASONING_ERROR, CREDENTIAL_ERROR, ELEMENT_NOT_FOUND, WRONG_PAGE_STATE, DATA_EXTRACTION_FAILURE, INFRASTRUCTURE_ERROR, UNKNOWN.
Adaptation for rpa_vision_v3: 8 categories are enough (see FailureCategory §3.1), since we have fewer surfaces (no web captcha). We keep WRONG_APPLICATION, which does not exist in Skyvern (Skyvern operates inside a closed browser, whereas we run on multi-app Windows).
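As an illustration of consuming a Skyvern-style response, the sketch below maps the "status" field to a three-valued verdict. The Verdict enum here is a stand-in for core.validation.result.Verdict, and falling back to CONTINUE on malformed output is an assumption (a design choice for this sketch, not documented Skyvern behaviour).

```python
import json
from enum import Enum


class Verdict(str, Enum):
    """Stand-in for core.validation.result.Verdict (values assumed)."""
    COMPLETE = "complete"
    CONTINUE = "continue"
    TERMINATE = "terminate"


def parse_status(raw: str) -> Verdict:
    """Map the "status" field of a Skyvern-style JSON answer to a Verdict.

    Anything unparsable or unexpected yields CONTINUE, so that a malformed
    LLM answer never terminates a replay on its own.
    """
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return Verdict.CONTINUE
    if not isinstance(payload, dict):
        return Verdict.CONTINUE
    status = str(payload.get("status", "")).strip().lower()
    try:
        return Verdict(status)
    except ValueError:
        return Verdict.CONTINUE


print(parse_status('{"status": "terminate", "thoughts": "account does not exist"}').value)
# terminate
```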
10.2. browser-use: agentic judge, verbatim format
Source: https://browser-use.com/posts/our-browser-agent-evaluation-system
Model: gemini-2.5-flash (87% agreement with human labels). JSON output:
{
"reasoning": "Analysis covering what worked, failures, trajectory quality, tool usage, output quality",
"verdict": "true|false",
"failure_reason": "Max 5 sentences explanation if failed",
"impossible_task": "true|false",
"reached_captcha": "true|false"
}
Philosophy: "simple prompts and absolute True/False verdicts work best. Complex rubrics → indecisive judging." Takeaway: our LlmJudgeChecker must force a binary VERDICT: OUI/NON, which is what verify_with_critic already does (replay_verifier.py:481-485).
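A binary verdict of this kind can be extracted from free-form LLM output with a strict pattern, as in the sketch below. parse_binary_verdict is a hypothetical helper; how replay_verifier.py actually parses the critic output is not shown in this document, so the regex is an assumption.

```python
import re
from typing import Optional

# Accepts "VERDICT: OUI", "verdict : non", etc. Anything else is treated as no verdict.
_VERDICT_RE = re.compile(r"VERDICT\s*:\s*(OUI|NON)\b", re.IGNORECASE)


def parse_binary_verdict(text: str) -> Optional[bool]:
    """Return True for OUI, False for NON, None when no verdict line is found."""
    match = _VERDICT_RE.search(text)
    if match is None:
        return None
    return match.group(1).upper() == "OUI"


print(parse_binary_verdict("L'onglet attendu est actif.\nVERDICT: OUI"))  # True
print(parse_binary_verdict("Je ne sais pas."))                            # None
```

Returning None rather than guessing keeps an indecisive answer distinguishable from an explicit NON, so the caller can decide whether to escalate.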
10.3. OpenAdapt: Process Graph + dual validation
Source: https://github.com/OpenAdaptAI/OpenAdapt/wiki/OpenAdapt-Architecture-(draft)
OpenAdapt distinguishes code-based validation (LLM-generated Python code that checks a condition) from model-based validation (an LMM receives a screenshot plus textual completion_criteria and returns a bool). On failure, it switches automatically to recording mode, and the trace becomes training data ("Evaluation-Driven Feedback").
Takeaway: our JsonSchemaChecker is the code-based equivalent, and LlmJudgeChecker is the model-based equivalent. The automatic switch to recording is out of P0 scope but should feed TargetMemoryStore in P1 (see memory project_lea_apprentissage_plan.md).
10.4. Anthropic Computer Use: implicit Validator
Anthropic CU (Claude 3.5 Sonnet computer-use beta) has no named Validator. The model re-observes after each action and decides, within its own reasoning, whether to continue or correct. Source: https://docs.anthropic.com/en/docs/build-with-claude/computer-use.
Not transposable to rpa_vision_v3: our Actor (Léa) is a deterministic executor, not an agentic LLM. An external Validator is required.
10.5. ScreenSpot-Pro & agentic reward modeling 2025-2026
- ScreenSpot-Pro (arXiv 2504.07981, April 2025): high-resolution GUI grounding benchmark, 1581 instructions across 23 apps. Best model = 18.9% top-1, ScreenSeekeR = 48.1%. This confirms that grounding alone is not sufficient: a Validator is needed to catch the 50-80% of cases where the grounder aims at the wrong place.
- Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction (arXiv 2602.00575): verifier learned with RL, combining LLM-as-judge and rule-based checks.
- DPO Learning with LLMs-Judge Signal for Computer Use Agents (arXiv 2506.03095): the judge filters synthetic trajectories for training. Directly related to the existing replay_learner.py.
→ Long-term target: TargetMemoryStore + replay_learner can be fed by the Validator's verdicts. Every well-diagnosed TERMINATE is a negative training signal; every high-confidence COMPLETE is a positive one.
11. Integration plan in 3 stages
11.1. P0: 1 day (before the next client demo)
Goal: close the step 10 bug without touching the nominal flow.
- Create core/validation/{__init__.py, result.py, checker_base.py, validator.py}: 2 h.
- Create core/validation/checkers/{__init__.py, ocr_roi.py, llm_judge.py}: 2 h.
- Write scripts/repro_bug_step10_validator.py and run it locally to confirm the TERMINATE verdict: 30 min.
- Write tests/unit/test_validator_step10.py: 1 h. Run pytest tests/unit/test_validator_step10.py -v.
- Patch api_stream.py:3447 (diff §6.1) behind RPA_VALIDATOR_V2_ENABLED=false: 2 h.
- Internal demo with the flag ON on Demo_urgence_3_db: measure added latency and false positives over 46 steps: 30 min.
- Document in docs/QW_SUITE_MAI.md or in a new docs/VALIDATOR_V2.md: 30 min.
Deliverable: no regression with the flag off; step 10 bug detected as TERMINATE with the flag on.
11.2. P1: 2 weeks
- Full action → check matrix (§4): add PixelDiffChecker, TitleBarChecker, JsonSchemaChecker: 1 day.
- Implement the route_verdict dispatcher (§5): integrate enter_paused_state, retry_with_reresolve, handoff_dialog_handler: 2 days.
- Dashboard: "Validator stats" panel with verdicts per session, top failure_categories, p50/p95 latency: 1 day.
- Reactivate DETTE-008 (observe_reason_act.py:1704-1713): this dead code IS the ancestor of the Validator. Replace it with a call to Validator.validate() after every ORA click: 1 day.
- Coexistence with the drift exemption (resolve_engine.py:2390, _RESOLUTION_MAX_DRIFT=0.95): if Validator V2 reaches 90% accuracy in the demo, _RESOLUTION_MAX_DRIFT can be lowered to 0.30: 0.5 day of testing.
- Reactivate RPA_ENABLE_TEXT_PRECHECK=true (DETTE-001): the semantic OCR pre-check becomes a private function of Validator V2: 0.5 day.
11.3. P2: post-demo (1 month)
- DialogPresenceChecker (D2 chain): cascade of VM modals via OCR + template: 2 days.
- Migrate LlmJudgeChecker to a dedicated handler, separate from the t2a_decision LLM (Skyvern does the same with USE_CHECK_USER_GOAL_HANDLER_FOR_VERIFICATION): 1 day.
- Learning: every TERMINATE verdict feeds TargetMemoryStore as a negative trace: 3 days.
- Re-planning: signal to VWB that the anchor is unreliable, leading to an automatic recapture suggestion: 5 days.
- Multi-modal Validator (combine OCR + DINOv2 + title into one atomic composite check): benchmark post-demo.
12. Sources with clickable links
Source code consulted
- Skyvern agent.py: https://github.com/Skyvern-AI/skyvern/blob/main/skyvern/forge/agent.py
- Skyvern prompt check-user-goal-with-termination.j2 (retrieved verbatim 24 May 2026): https://raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/check-user-goal-with-termination.j2
- Skyvern prompt check-user-goal.j2 (cited by the parent doc): https://raw.githubusercontent.com/Skyvern-AI/skyvern/main/skyvern/forge/prompts/skyvern/check-user-goal.j2
- Skyvern main repo: https://github.com/Skyvern-AI/skyvern
- Skyvern PR #1513, chain-of-thought user goal: https://github.com/Skyvern-AI/skyvern/pull/1513
Verifier frameworks 2026
- browser-use evaluation system: https://browser-use.com/posts/our-browser-agent-evaluation-system
- browser-use AGENTS.md: https://github.com/browser-use/browser-use/blob/main/AGENTS.md
- OpenAdapt architecture wiki: https://github.com/OpenAdaptAI/OpenAdapt/wiki/OpenAdapt-Architecture-(draft)
- OpenAdapt evals: https://github.com/OpenAdaptAI/openadapt-evals
- Anthropic Computer Use docs: https://docs.anthropic.com/en/docs/build-with-claude/computer-use
Papers 2025-2026
- ScreenSpot-Pro (arXiv 2504.07981): https://arxiv.org/abs/2504.07981
- Agentic Reward Modeling for GUI Agent (arXiv 2602.00575): https://arxiv.org/html/2602.00575v1
- DPO Learning with LLMs-Judge Signal for CUA (arXiv 2506.03095): https://arxiv.org/pdf/2506.03095
- GUI-Actor coordinate-free grounding (arXiv 2506.03143): https://arxiv.org/pdf/2506.03143
Pydantic v2 (JsonSchemaChecker)
- Pydantic v2 JSON validation guide: https://docs.pydantic.dev/latest/concepts/json/
- LLM output validation practices: https://pydantic.dev/articles/llm-intro
- Production guide: https://superjson.ai/blog/2025-08-24-json-schema-validation-python-pydantic-guide/
Internal docs consulted (read-only)
- Parent doc: docs/recherche/AXE_B2_VALIDATOR_PATTERN.md
- Sibling OCR doc: docs/recherche/AXE_A4_OCR_TEMPLATE_PHASH.md
- Archetype bug: docs/REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md
- LLM judge bench: docs/BENCH_SAFETY_CHECKS_2026-05-06.md
- Existing verifier code: agent_v0/server_v1/replay_verifier.py:367-633 (verify_with_critic)
- Existing title verifier code: core/grounding/title_verifier.py:25-175
- Current wiring: agent_v0/server_v1/api_stream.py:3447-3582 (report_action_result)
- DETTE-008 (VLM pre-check disabled): core/execution/observe_reason_act.py:1704-1713
- Drift exemption: agent_v0/server_v1/resolve_engine.py:2384-2390 (_RESOLUTION_MAX_DRIFT=0.95)
- Global synthesis: docs/SYNTHESE_TECHNOS_REPLAY_2026-05-23.md
13. Explicit dependencies on other axes
| Axis | Dependency | Status |
|---|---|---|
| AXE_A4 (OCR) | OcrRoiChecker uses the TitleVerifier's EasyOCR singleton (already loaded in prod). _strip_accents is reusable in the _resolve_by_ocr_text center-of-span fix. | ✅ no blocker |
| AXE_A5 (UI tokenization) | If OmniParser/UI-DETR-1 delivers per-element bboxes at runtime, the Validator could match target == element_at_point(cx, cy).label directly (deterministic). | 🟡 P2 |
| AXE_B1 (watchdog _retry_pending) | Independent. The watchdog fixes the primary cause (HTTP timeout), the Validator fixes the aggravating cause (wrong click validated with success=True). Together they fully close the step 10 bug. | ✅ orthogonal |
| D2 chain (dialog/popup) | failure_category=UNEXPECTED_DIALOG → handoff to DialogHandler. The Validator detects the problem, D2 resolves it. | ✅ clear contract |
| DETTE-008 | The dead code if False: in observe_reason_act.py:1704-1713 is the ancestor of the Validator. To be replaced in P1 by Validator.validate() after every ORA click. | 🟡 P1 |
| DETTE-001 (RPA_ENABLE_TEXT_PRECHECK=false) | The spatially blind OCR pre-check becomes the properly spatialized OcrRoiChecker. | ✅ P1 |
| Drift exemption ≥ 0.95 (_RESOLUTION_MAX_DRIFT) | Validator V2 makes it possible to lower the drift threshold to 0.30 (P1), because template false positives will be caught post-action. | 🟡 P1 |
Research deliverable, read-only. No code modification applied. Validation and merge are up to Dom on a case-by-case basis, once the §11.1 smoke test has been validated on Demo_urgence_3_db.