# AXE A4 - OCR, template matching, pHash: 2026 review + `_resolve_by_ocr_text` fix

**Date:** 2026-05-23
**Author:** Claude (research dispatch)
**Scope:** 2025-2026 literature/ecosystem review for the `OCR → template → VLM` UI cascade, plus alternatives to `pHash` for LoopDetector and VERIFY. Targeted patch for the *center-of-line* bug in `_resolve_by_ocr_text` (`agent_v0/server_v1/resolve_engine.py:1447-1527`).
**Prerequisite reading:** `docs/SYNTHESE_TECHNOS_REPLAY_2026-05-23.md` §2, §4; `docs/REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md` §1.2 and §5; `docs/BUG_PRECHECK_SPATIAL_BLINDNESS_2026-05-08.md` (DETTE-001).
**Status:** research + proposals. **No code changes.** Any application must be validated by Dom.

---

## 0. TL;DR

1. **The primary `center-of-line` bug can be fixed without changing OCR engines.** docTR exposes `geometry` at the `Word` level, normalized in the **same coordinate frame as the line**. The quick fix from §5 of the 8 May diagnostic (see §4 below for copy-paste-ready code) removes the Imagerie/Notes/Synthèse collision while staying 100% iso-stack.
2. **OCR: keep docTR as the OCR-DIRECT engine** (strict mode + cascade): together with Tesseract and PaddleOCR `return_word_box=True`, it is one of the only engines that expose **token-level bboxes in the same coordinate frame as the line**. EasyOCR returns bboxes merged at line/segment level by default and **is not suited** to resolving a multi-token tab on a bar. Surya OCR is line-level only and is ruled out for this need. RapidOCR (PaddleOCR repackaged for ONNX) is a 2026 candidate for a *lightweight OCR-DIRECT without the Paddle dependency*, to be validated on accented French.
3. **Template matching: replace multi-scale `cv2.matchTemplate` with SuperPoint+LightGlue (ONNX, ~50 ms per pair on RTX 5070).** This is the clean way out of the current `≥ 0.95` drift exemption, which is a false positive in disguise (high score on a different region). LightGlue is robust to offset and scale changes and to moderate rotation; combined with a RANSAC homography check it yields a *geometric-consistency score*, which removes the "0.95 on the wrong zone" false positives. To be wrapped behind `_resolve_by_template` without breaking the cascade.
4. **pHash: move away from the global hash. Two complementary modes:**
   - **LoopDetector (QW2)** → DINOv2 features on the whole screen; cos-sim < 0.99 = the screen changed. More robust than a 64-bit pHash to a blinking cursor or caret.
   - **VERIFY post-action** → **SSIM per ROI** (skimage `structural_similarity`, ~5-10 ms on a 400×200 crop), with ROI = bbox of the clicked target + 50 px halo. This is the *spatialized* version, which also resolves DETTE-001 (BUG_PRECHECK_SPATIAL_BLINDNESS).
5. **Dependencies**: this work is **blocking** for AXE_A5 (UI tokenization: OmniParser and UI-DETR-1 ultimately rely on OCR + icon detection, so the OCR engine must be chosen before tokenization). It **feeds** AXE_B2 (Validator), which will consume SSIM-ROI as the semantic VERIFY signal.

---

## 1. Sub-axis 1: OCR for grounding

### 1.1. Central question: token-level bboxes in the same coordinate frame as the line

The `center-of-line` bug appears because `_resolve_by_ocr_text` (resolve_engine.py:1486-1519) computes `cx, cy` from `line_obj.geometry` (the bbox of the whole line) while `target_text` is only a fragment of that line. To fix it **without changing OCR engines**, the OCR only needs to expose, in the same normalized frame as the line, the bboxes of the **words** that make up the line. That is the discriminating criterion.

### 1.2. Comparison table (May 2026)

| OCR | Bbox granularity | Coordinate frame | French/accents | Latency (2560×1600) | Stack | License | Latest major release |
|---|---|---|---|---|---|---|---|
| **docTR** (`python-doctr`) | **word + line + block** | normalized `((xmin, ymin), (xmax, ymax))` ∈ [0,1]², **shared by line and word** | good (French `crnn_vgg16_bn` model) | ~800 ms CPU, ~150 ms GPU | PyTorch + TF, optional ONNX | Apache 2.0 | v0.10 (2026-04, `python-doctr` on PyPI) |
| **EasyOCR** | merged line/segment by default; merging tunable via `ycenter_ths`/`width_ths` | absolute pixels | good | ~1.2 s CPU, ~200 ms GPU | PyTorch, CRNN | Apache 2.0 | v1.7.x (2024) |
| **RapidOCR** (`rapidocr`) | line | absolute pixels | good (PP-OCRv4 fr model) | ~200 ms ONNX-CPU, ~80 ms GPU | ONNXRuntime / OpenVINO / MNN (PaddlePaddle backend optional, **no hard Paddle dependency**) | Apache 2.0 | v3.x (2026-04-11) |
| **PaddleOCR / PP-StructureV3** | line by default; **`return_word_box=True`** as an option | absolute pixels | good | ~250 ms GPU (PP-OCRv4) | PaddlePaddle (heavy) | Apache 2.0 | v3.0 (2025-07) |
| **Surya OCR** (`surya-ocr`) | **line only** | absolute pixels | good (90+ languages) | ~400 ms GPU (5070-class) | PyTorch | GPL-3.0 (restrictive for commercial use) | v0.17.x (2025) |
| **Tesseract** (via `pytesseract`) | **word + line** via `image_to_data`; **char** via `image_to_boxes` / hOCR | absolute pixels | fair to good (`fra` model) | 100-500 ms CPU | C++ LSTM | Apache 2.0 | v5.4 (2024) |

**Main sources:** [docTR Word/Line geometry, Discussion #570](https://github.com/mindee/doctr/discussions/570), [PaddleOCR return_word_box, Issue #15760](https://github.com/PaddlePaddle/PaddleOCR/issues/15760), [Surya line-level, repo datalab-to/surya](https://github.com/datalab-to/surya), [EasyOCR character bbox limitation, Issue #631](https://github.com/JaidedAI/EasyOCR/issues/631), [RapidOCR releases](https://github.com/RapidAI/RapidOCR/releases), [pytesseract image_to_data, PyPI](https://pypi.org/project/pytesseract/), [Codesota benchmark PaddleOCR vs EasyOCR 2025](https://www.codesota.com/ocr/paddleocr-vs-easyocr), [Buttondown benchmark PaddleOCR vs EasyOCR vs Doctr](https://buttondown.com/ckae930413/archive/paddleocr-vs-easyocr-vs-doctr-memory-latency-test/).

### 1.3. Analysis of the center-of-line bug

**The bug can be fixed natively with docTR.** The API exposes `line_obj.words` (`List[Word]`), and each `Word.geometry` uses the same normalized `((xmin,ymin),(xmax,ymax))` format as `line_obj.geometry`. No change of coordinate frame is needed: both are page-relative in [0,1]². See the [docTR I/O modules doc](https://mindee.github.io/doctr/modules/io.html).

EasyOCR has the **wrong default granularity**: it merges detections into segments via `ycenter_ths=0.5` and `width_ths=0.5`, so a tightly packed row of tabs comes back as a single box, **with no access to the individual words**. Explicitly requesting `width_ths=0.0` would disable the merging, but it would also split genuinely long texts ("Justification de la décision"). **EasyOCR alone does not fix the bug.**

Surya OCR is explicitly announced as line-only: "Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level" (see the [datalab-to/surya README](https://github.com/datalab-to/surya)). **Ruled out** for this need.

PaddleOCR `return_word_box=True` is available in v3.0 but requires the PaddlePaddle dependency (~700 MB) and a ~8-12 s init on CPU.

RapidOCR repackages the PaddleOCR models as ONNX (80 MB install, init < 2 s); **it remains to be verified (May 2026) whether `return_word_box` is exposed by the `rapidocr.RapidOCR` call layer or only by `paddleocr.PaddleOCR`**. To date, the public RapidOCR documentation does not explicitly mention a word-bbox mode.

### 1.4. Python snippets: retrieving word-level bboxes

**docTR (already used in production)**

```python
from doctr.models import ocr_predictor
from doctr.io import DocumentFile

predictor = ocr_predictor(pretrained=True)
doc = DocumentFile.from_images("/path/screenshot.png")
result = predictor(doc)

# Hierarchical navigation: pages -> blocks -> lines -> words
page = result.pages[0]
H, W = page.dimensions  # (height, width) in pixels

for block in page.blocks:
    for line in block.lines:
        # line.geometry == ((xmin, ymin), (xmax, ymax)), normalized to [0,1]²
        # line.words == List[Word]; each Word.geometry uses the same format
        for word in line.words:
            (xmin_n, ymin_n), (xmax_n, ymax_n) = word.geometry
            # Convert all four normalized coordinates to absolute pixels
            xmin_px, xmax_px = xmin_n * W, xmax_n * W
            ymin_px, ymax_px = ymin_n * H, ymax_n * H
            print(f"{word.value!r}  bbox=({xmin_px:.0f},{ymin_px:.0f})-({xmax_px:.0f},{ymax_px:.0f})")
```

**Tesseract (lightweight alternative, CPU fallback)**

```python
import pytesseract
from PIL import Image

img = Image.open("/path/screenshot.png")
data = pytesseract.image_to_data(img, lang="fra", output_type=pytesseract.Output.DICT)

# data == dict with keys 'text','left','top','width','height','conf','line_num','word_num','block_num'
n = len(data['text'])
for i in range(n):
    # float() accepts both integer and decimal confidence values
    if data['text'][i].strip() and float(data['conf'][i]) > 50:
        x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
        # Already in absolute pixels
        print(f"{data['text'][i]!r}  bbox=({x},{y})-({x+w},{y+h})  line={data['line_num'][i]}")
```

**RapidOCR (migration candidate, lightweight ONNX)**

```python
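from rapidocr import RapidOCR

engine = RapidOCR()
result = engine("/path/screenshot.png")
# Assumption to verify against the installed version: the `rapidocr` package (v2+)
# returns an output object exposing result.boxes (4-point quads, absolute pixels),
# result.txts and result.scores. The older `rapidocr_onnxruntime` package returned
# a `(result, elapsed)` tuple of [box, text, score] entries instead.
if result.boxes is not None:
    for box, text, score in zip(result.boxes, result.txts, result.scores):
        print(f"{text!r}  score={score:.2f}  quad={box}")
# ⚠ Line level by default; word-level availability still to be validated (May 2026)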
```

### 1.5. Recommendation

**Keep docTR for OCR-DIRECT** (strict mode + resolve_engine cascade). It is the OCR that already meets the constraints of the bug. The quick fix in §4 (recomputing `cx, cy` from `line.words`) requires neither a migration nor an API change.

**Do NOT hot-migrate to EasyOCR or Surya**: EasyOCR loses the individual words, and Surya is line-only by design.

**Parallel evaluation** (post-demo, AXE_A5):
- RapidOCR on 10 French Easily captures. Potential gain: init 2 s vs 5-8 s for docTR, install 80 MB vs 500 MB + PyTorch.
- Tesseract `image_to_data` with lang `fra`: can serve as a **second, verifying OCR engine** (two-engine OCR vote) for DETTE-001.

---

## 2. Sub-axis 2: Template matching (cascade stage 2)

### 2.1. Central question: robustness to offset/scale + eliminating the 0.95 false positives

Multi-scale `cv2.matchTemplate` (range 0.25→2.0, `resolve_engine.py:130`) computes a pixel-to-pixel NCC correlation score. Known limitations:
- **No rotation invariance.** Easily/Edge have a fixed orientation, so this is acceptable here.
- **Sensitive to anti-aliasing**: the same button scaled 0.95× vs 1.0× can lose 0.10 of score.
- **A high score does not guarantee the right region**: the match can reach 0.95 on a visually similar patch (another button of the same bar, the same close icon, etc.). This is exactly the mechanism that today makes the `drift exemption ≥ 0.95` (`resolve_engine.py:2367-2390`) a workaround: high score, wrong place.
- See [PyImageSearch multi-scale template matching](https://pyimagesearch.com/2015/01/26/multi-scale-template-matching-using-python-opencv/) and [Medium: Template Matching Beyond Basics](https://medium.com/@coders.stop/template-matching-beyond-basics-rotation-and-scale-invariant-detection-2ae78d8fa190).

### 2.2. Comparison table

| Method | Invariance | Geometric score | Pair latency (RTX 5070, 800×500 vs 2560×1600) | 0.95 false positive? | License | Maturity (2026) |
|---|---|---|---|---|---|---|
| **`cv2.matchTemplate` NCC multi-scale** (current) | scale only, by brute-force sweep (0.25→2.0 per §2.1) | no (pixel score) | ~50-200 ms CPU (multi-scale loop) | **yes** (drift-exemption workaround) | Apache 2.0 (BSD before OpenCV 4.5) | mature |
| **SIFT / AKAZE / ORB (cv2)** | scale + rotation + offset | no native score; RANSAC inliers | ~30 ms CPU | filtered by RANSAC, but sensitive to low-texture UIs | Apache 2.0 (BSD before OpenCV 4.5) | mature |
| **SuperPoint + LightGlue (ONNX)** | scale + rotation + offset + photometric changes | **yes**: match scores + inliers | **~44 ms (22 FPS) full pair, RTX-class** | **no**, provided both `len(matches) > threshold` AND homographic consistency are required | LightGlue: Apache 2.0; SuperPoint pretrained weights: separate restrictive license ⚠; ONNX export: [fabio-sim/LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX) | very mature 2024-2026 |
| **LoFTR / Efficient LoFTR** | same | same + dense | ~80 ms pair, RTX-class | no | Apache 2.0 | mature; +1-2 pp AUC vs LightGlue but 2× slower |
| **DINOv2 patch features + kNN match** | same + semantic | patch cosine sim | ~150 ms (DINOv2 ViT-L extraction) | rare (semantic > pixel) | Apache 2.0 (relicensed in 2023; originally CC-BY-NC 4.0) | very mature 2024-2026 |
| **RoMa / RoMa v2** | same + dense, sub-pixel | warp + certainties | ~200 ms RTX-class (v2 = 1.7× v1) | no | non-commercial | CVPR 2024, v2 late 2025 |
| **MASt3R-SfM** | same + 3D | grid match | very heavy (~1 s+ per pair) | no | non-commercial | research, 2024 |
| **CLIP visual similarity** (global embedding) | same + semantic | global cos-sim | ~30 ms ViT-B/32 | fails: too global, does not localize | MIT | mature |

**Sources :** [LightGlue ICCV 2023 paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Lindenberger_LightGlue_Local_Feature_Matching_at_Light_Speed_ICCV_2023_paper.pdf), [Efficient LoFTR arXiv 2403.04765](https://arxiv.org/pdf/2403.04765), [RoMa v2 emergent mind](https://www.emergentmind.com/papers/2511.15706), [DINOv2 features](https://www.emergentmind.com/topics/dinov2-features), [Image Matching Challenge 2025 — DINO-RotateMatch arXiv 2512.03715](https://arxiv.org/pdf/2512.03715), [LightGlue ONNX](https://github.com/fabio-sim/LightGlue-ONNX), [LightGlue HF Transformers](https://huggingface.co/docs/transformers/model_doc/lightglue).

### 2.3. The 0.95 false-positive criterion

This is the discriminating criterion for getting rid of the drift-exemption workaround. **SuperPoint + LightGlue** provides two separable signals:
1. `n_matches`: number of matched keypoints (typically 50-200 for a visible widget).
2. **Geometric consistency**: compute the homography with `cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)` over the matches and keep the inliers / total ratio. A 0.95 false positive on a different region is expected to show `n_matches < 10` or an inlier ratio < 0.5; both thresholds remain to be calibrated on real captures.

This eliminates the "score 0.95, wrong button" class of bugs without needing a low threshold that would let the false positive through.

### 2.4. Recommendation

**Phase 1 (short term, post-demo)**: keep `cv2.matchTemplate` but **add a LightGlue+SuperPoint geometric check as ratification** when the score is ∈ [0.80, 0.95] (the zone that is ambiguous today). If LightGlue confirms homographic consistency → keep the match. Otherwise → fall back to the VLM. This lowers the drift exemption from 0.95 towards 0.80.

**Phase 2 (medium term)**: replace the multi-scale `cv2.matchTemplate` loop with LightGlue+SuperPoint as the primary stage-2 method. Keep an NCC fallback for very uniform or weakly textured widgets (flat monochrome icons where LightGlue lacks keypoints).

**Snippet: LightGlue integration compatible with the current cascade**

```python
# Pseudo-code to plug into resolve_engine._resolve_by_template
# Do NOT apply as-is: syntax-checked only, never executed.

from lightglue import LightGlue, SuperPoint
from lightglue.utils import rbd
import cv2, torch

_LG_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
_LG_EXTRACTOR = SuperPoint(max_num_keypoints=2048).eval().to(_LG_DEVICE)
_LG_MATCHER = LightGlue(features="superpoint").eval().to(_LG_DEVICE)

def _verify_template_match_with_lightglue(
    screenshot_bgr,
    template_bgr,
    candidate_xy,              # (cx, cy) in pixels, from the cv2.matchTemplate stage
    inlier_ratio_threshold=0.5,
    min_matches=10,
):
    """Geometrically confirm a cv2.matchTemplate match.

    Returns:
        dict(confirmed=bool, n_matches=int, inlier_ratio=float)
    """
    rejected = {"confirmed": False, "n_matches": 0, "inlier_ratio": 0.0}

    # Crop around the candidate (template size + halo); ints are required for slicing
    th, tw = template_bgr.shape[:2]
    cx, cy = int(candidate_xy[0]), int(candidate_xy[1])
    x0 = max(0, cx - tw)
    y0 = max(0, cy - th)
    x1 = min(screenshot_bgr.shape[1], cx + tw)
    y1 = min(screenshot_bgr.shape[0], cy + th)
    crop = screenshot_bgr[y0:y1, x0:x1]
    if crop.size == 0:
        return rejected

    # LightGlue tensors: (1, 1, H, W), float in [0,1]
    crop_t = torch.from_numpy(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)).float()[None, None] / 255.0
    tpl_t  = torch.from_numpy(cv2.cvtColor(template_bgr, cv2.COLOR_BGR2GRAY)).float()[None, None] / 255.0

    with torch.no_grad():
        feats0 = _LG_EXTRACTOR.extract(tpl_t.to(_LG_DEVICE))
        feats1 = _LG_EXTRACTOR.extract(crop_t.to(_LG_DEVICE))
        matches01 = _LG_MATCHER({"image0": feats0, "image1": feats1})

    # Remove the batch dimension
    feats0, feats1, matches01 = (rbd(x) for x in [feats0, feats1, matches01])
    matches = matches01["matches"]  # (M, 2)

    n_matches = int(matches.shape[0])
    # cv2.findHomography needs at least 4 correspondences
    if n_matches < max(min_matches, 4):
        return {**rejected, "n_matches": n_matches}

    pts0 = feats0["keypoints"][matches[..., 0]].cpu().numpy()
    pts1 = feats1["keypoints"][matches[..., 1]].cpu().numpy()
    H_, mask = cv2.findHomography(pts0, pts1, cv2.RANSAC, 5.0)
    if H_ is None or mask is None:
        return {**rejected, "n_matches": n_matches}
    inlier_ratio = float(mask.sum()) / n_matches

    return {
        "confirmed": inlier_ratio >= inlier_ratio_threshold,
        "n_matches": n_matches,
        "inlier_ratio": inlier_ratio,
    }
```

To be plugged in as a **post-process** of `cv2.matchTemplate`: if the score is ∈ [0.80, 0.95], call LightGlue. If confirmed → keep the match; otherwise → fall back to the VLM. This turns the drift-exemption workaround into a *ratified verification*.
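The Phase 1 routing can be summarized as a small decision function. This is a minimal sketch rather than the project's implementation: `ratify_template_match`, `TemplateDecision` and the `verify` callable are hypothetical names, the ambiguous band is treated as half-open `[0.80, 0.95)`, and scores at or above 0.95 are assumed to keep today's exemption behavior.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TemplateDecision:
    accept: bool   # True: keep the template match; False: fall back to the VLM
    reason: str


def ratify_template_match(
    ncc_score: float,
    verify: Optional[Callable[[], dict]] = None,  # hypothetical: wraps the LightGlue check
    low: float = 0.80,
    high: float = 0.95,
) -> TemplateDecision:
    """Route a cv2.matchTemplate score through the Phase 1 policy."""
    if ncc_score >= high:
        # Assumption: the existing >= 0.95 exemption is left unchanged in Phase 1
        return TemplateDecision(True, "ncc_high")
    if ncc_score < low:
        return TemplateDecision(False, "ncc_low")
    if verify is None:
        # Ambiguous band without a geometric check: do not trust the match
        return TemplateDecision(False, "unverified")
    result = verify()
    if result.get("confirmed"):
        return TemplateDecision(True, "lightglue_confirmed")
    return TemplateDecision(False, "lightglue_rejected")
```

Passing the check as a callable keeps the LightGlue cost out of the two unambiguous cases; only scores inside the band trigger the extra inference.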

---

## 3. Sub-axis 3: pHash → 2026 alternatives

### 3.1. Current uses and limitations

| Use | Current implementation | Documented limitation |
|---|---|---|
| **LoopDetector QW2** | global pHash (`screen_static` ≥ threshold) + `action_repeat` + `retry_threshold` | a blinking caret or a cursor over a loading bar makes the hash vary → false negative (the detector concludes "the screen changed" although nothing changed functionally) |
| **VERIFY post-action** | global pHash before/after click | a local click on a tab changes ~5% of the image (the tab strip + the tab content), which the global hash can absorb → false negative (the click appears to have done nothing visible). Conversely, a background popup or the mouse cursor can be mistaken for a change. |

Main diagnostic: `feedback_phash_vs_dialog_in_vm.md` (memory): the global pHash is too coarse for the VM cascade. DETTE-001 (BUG_PRECHECK_SPATIAL_BLINDNESS) shows the same **spatial blindness** on the OCR side: `_text_match_fuzzy` validates the OCR pre-check at the wrong place because the 280 px radius spans several tabs.
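To make the "too coarse" point concrete, the sketch below localizes a change per tile instead of summarizing the whole frame. It is an illustration only: it substitutes a plain mean-intensity signature for pHash, works on grayscale images given as lists of rows, and `tile_means` / `changed_tiles` are hypothetical helper names.

```python
def tile_means(img, grid=4):
    """Mean intensity of each tile of a grid x grid partition (row-major order)."""
    h, w = len(img), len(img[0])
    means = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            total = sum(sum(row[x0:x1]) for row in img[y0:y1])
            means.append(total / ((y1 - y0) * (x1 - x0)))
    return means


def changed_tiles(before, after, grid=4, tol=1.0):
    """Indices of the tiles whose mean intensity moved by more than `tol`."""
    pairs = zip(tile_means(before, grid), tile_means(after, grid))
    return [i for i, (b, a) in enumerate(pairs) if abs(a - b) > tol]


# Synthetic 16x16 frame: only the second tile of the top row is repainted
before = [[0] * 16 for _ in range(16)]
after = [row[:] for row in before]
for y in range(0, 4):
    for x in range(4, 8):
        after[y][x] = 255

print(changed_tiles(before, after))
```

A per-tile signature reports where the change happened, which is the information a single global hash discards.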

### 3.2. Comparison table: 2026 alternatives

| Method | Mode | Latency (400×200 crop) | ROI mode? | Robustness to caret/cursor | Distinguishes local change | Library |
|---|---|---|---|---|---|---|
| **Global 64-bit pHash** | current | <5 ms | no | poor | no | `imagehash` |
| **pHash per ROI (rolling)** | simple extension | ~5 ms × N regions | yes (by tiles) | OK | yes | `imagehash` |
| **SSIM** (skimage) | classic | 5-10 ms CPU | **yes (native)** | good | yes | `skimage.metrics.structural_similarity` |
| **MS-SSIM** | multi-scale | 15-30 ms | yes | better | yes | `pytorch-msssim` |
| **LPIPS** (AlexNet/VGG) | deep | 30-80 ms | yes, via crop | excellent (semantic) | yes | `lpips` |
| **DINOv2 patch features cos-sim** | deep semantic | 100-200 ms (ViT-S/14) | yes (patches) | excellent | yes | `torch.hub` (`dinov2_vits14`) or `transformers` (`facebook/dinov2-small`) |
| **CLIP image embedding cos-sim** | global semantic | ~30 ms | no (loses spatial information) | good, but not local | no | `open_clip` |

**Sources :** [Eureka — SSIM vs LPIPS](https://eureka.patsnap.com/article/ssim-vs-lpips-which-metric-should-you-trust-for-image-quality-evaluation), [SSIM scikit-image doc](https://scikit-image.org/docs/dev/auto_examples/transform/plot_ssim.html), [Wopee — Screenshot Comparison Algorithms](https://wopee.io/blog/screenshot-comparison-algorithms-visual-testing/), [Medium CLIP vs DINOv2 image similarity](https://medium.com/aimonks/clip-vs-dinov2-in-image-similarity-6fa5aa7ed8c6), [DinoHash arXiv 2503.11195](https://arxiv.org/pdf/2503.11195).

### 3.3. Recommendation per use

**LoopDetector QW2 (static screen → loop)**
- **Adopt**: DINOv2 features cos-sim on the full frame (downscaled to 224×224 first). A cos-sim below 0.99 counts as a real change. Robust to caret blinking, scroll-bar position and mouse movement.
- **Cost**: ~100 ms per frame on RTX 5070. Acceptable for a trigger called once per second.
- **Degraded alternative**: pHash per ROI (4×4 tile grid); reuses the current `imagehash` dependency and needs no GPU.
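The static-screen test then reduces to a cosine-similarity threshold on two embedding vectors. The sketch below assumes the embeddings have already been extracted (for example a DINOv2 global feature per frame); `cosine_similarity` and `screen_is_static` are hypothetical helper names, and the 0.99 threshold is the one quoted above, still to be calibrated.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is null)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def screen_is_static(prev_emb, curr_emb, threshold=0.99):
    """True when two consecutive frame embeddings are similar enough to count as unchanged."""
    return cosine_similarity(prev_emb, curr_emb) >= threshold
```

With real feature tensors the same test is usually done with the framework's own cosine-similarity routine; the pure-Python version only shows the decision rule.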

**VERIFY post-action (was the click useful?)**
- **Adopt SSIM per ROI**:
  - ROI = bbox of the resolved target + 50 px halo (or the zone expected to change, when it is known: for example, the tab content for a click on a tab).
  - `structural_similarity(roi_before, roi_after, channel_axis=-1)` on color crops (the former `multichannel=True` argument is deprecated since scikit-image 0.19), or grayscale crops without `channel_axis`.
  - Empirical threshold to be calibrated (0.85 = noticeable change, 0.95 = nothing changed).
- **Cost**: ~5 ms CPU on a 400×200 crop, negligible.
- **Cross-cutting benefit**: also resolves DETTE-001. Instead of checking that `target_text` is present in an OCR crop around the click, we check that the **zone** itself changed (a click that really took effect triggers a local repaint).

**Snippet: SSIM ROI VERIFY (drop-in for `replay_verifier.py`)**

```python
from skimage.metrics import structural_similarity as ssim
import cv2, numpy as np

def verify_click_changed_roi(
    screenshot_before_path: str,
    screenshot_after_path: str,
    cx_px: int,
    cy_px: int,
    roi_w: int = 400,
    roi_h: int = 200,
    threshold: float = 0.95,
) -> dict:
    """Check that a click actually modified the target ROI.

    Returns:
        dict(changed=bool, ssim=float, roi_bbox=(x0,y0,x1,y1))
    """
    before = cv2.imread(screenshot_before_path)
    after  = cv2.imread(screenshot_after_path)
    if before is None or after is None or before.shape != after.shape:
        return {"changed": False, "ssim": 0.0, "roi_bbox": (0, 0, 0, 0)}

    H, W = before.shape[:2]
    x0 = max(0, cx_px - roi_w // 2)
    y0 = max(0, cy_px - roi_h // 2)
    x1 = min(W, cx_px + roi_w // 2)
    y1 = min(H, cy_px + roi_h // 2)
    # SSIM's default 7x7 window needs a crop of at least 7 px per side
    if x1 - x0 < 7 or y1 - y0 < 7:
        return {"changed": False, "ssim": 0.0, "roi_bbox": (x0, y0, x1, y1)}

    crop_b = cv2.cvtColor(before[y0:y1, x0:x1], cv2.COLOR_BGR2GRAY)
    crop_a = cv2.cvtColor(after[y0:y1, x0:x1],  cv2.COLOR_BGR2GRAY)

    score = float(ssim(crop_b, crop_a, data_range=255))
    return {
        "changed": score < threshold,
        "ssim": score,
        "roi_bbox": (x0, y0, x1, y1),
    }
```
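The snippet above centers a fixed-size window on the click, whereas the recommendation describes the ROI as the target bbox plus a 50 px halo. A minimal sketch of that variant follows, assuming the bbox comes in docTR's normalized `((xmin, ymin), (xmax, ymax))` format; `roi_from_target_bbox` is a hypothetical helper name.

```python
import math


def roi_from_target_bbox(bbox_norm, img_w, img_h, halo_px=50):
    """Convert a normalized bbox to a pixel ROI expanded by `halo_px`, clamped to the image."""
    (xmin, ymin), (xmax, ymax) = bbox_norm
    x0 = max(0, math.floor(xmin * img_w) - halo_px)
    y0 = max(0, math.floor(ymin * img_h) - halo_px)
    x1 = min(img_w, math.ceil(xmax * img_w) + halo_px)
    y1 = min(img_h, math.ceil(ymax * img_h) + halo_px)
    return x0, y0, x1, y1


# Example on a 2000x1000 frame
print(roi_from_target_bbox(((0.25, 0.5), (0.5, 0.75)), 2000, 1000))
```

The returned tuple can replace the `x0, y0, x1, y1` computation in the verifier, so the SSIM window follows the size of the widget instead of a fixed 400×200 box.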

---

## 4. Targeted patch: center-of-line bug in `_resolve_by_ocr_text`

### 4.1. Exact target

File: `agent_v0/server_v1/resolve_engine.py`, function `_resolve_by_ocr_text`, lines **1486-1519** (referenced in `REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md` §1.2).

Current block, reconstructed from §1.2 of the 8 May diagnostic:

```python
# resolve_engine.py:1486-1519 (state as of 8 May 2026)
# Exact match > contains > word by word
score = 0.0
if target_lower == line_lower:
    score = 1.0
elif target_lower in line_lower:
    score = 0.8
elif any(target_lower == w.value.lower() for w in line_obj.words):
    score = 0.9

if score > best_score:
    box = line_obj.geometry          # ⚠ bbox of the WHOLE LINE
    cx = (box[0][0] + box[1][0]) / 2
    cy = (box[0][1] + box[1][1]) / 2
    best_score = score
    best_match = {"cx": cx, "cy": cy, "score": score, "line": line_obj.value}
```

### 4.2. Proposed patch: center-of-span from `line.words`

**Principle**: for scores 0.8 (substring) and 0.9 (exact word), recompute `cx, cy` from the bboxes of the `words` that cover `target_text`, **not** from the whole line.

Copy-paste-ready code (syntax-checked only, **not executed**):

```python
# resolve_engine.py:1486-1519 (proposal)
# Exact match > contains > word by word
score = 0.0
matched_words = []   # subset of line_obj.words covering target_text

target_lower = target_text.lower().strip()
line_lower   = line_obj.value.lower().strip()

# 1) Exact match on the whole line
if target_lower == line_lower:
    score = 1.0
    matched_words = list(line_obj.words)

# 2) Substring match (possibly multi-word)
elif target_lower in line_lower:
    score = 0.8
    # Rebuild the span of words covering target_lower by sequential scan
    target_tokens = target_lower.split()
    line_words_lower = [w.value.lower() for w in line_obj.words]
    # Look for a contiguous window matching all target_tokens in order
    for start in range(len(line_words_lower) - len(target_tokens) + 1):
        window = line_words_lower[start:start + len(target_tokens)]
        # Tolerant comparison: equality, or substring containment in either direction
        if all(t == w or t in w or w in t for t, w in zip(target_tokens, window)):
            matched_words = line_obj.words[start:start + len(target_tokens)]
            break
    if not matched_words:
        # Fallback: every word containing a target token
        matched_words = [w for w in line_obj.words if any(t in w.value.lower() for t in target_tokens)]

# 3) Exact-word match within the line (single token)
elif any(target_lower == w.value.lower() for w in line_obj.words):
    score = 0.9
    matched_words = [w for w in line_obj.words if w.value.lower() == target_lower]

if score > best_score:
    if matched_words:
        # ✅ Center of the matched SPAN, not of the whole line
        xs = []
        ys = []
        for w in matched_words:
            (xmin, ymin), (xmax, ymax) = w.geometry
            xs.extend([xmin, xmax])
            ys.extend([ymin, ymax])
        cx = (min(xs) + max(xs)) / 2
        cy = (min(ys) + max(ys)) / 2
    else:
        # Safety fallback: center of the line (current behavior)
        box = line_obj.geometry
        cx = (box[0][0] + box[1][0]) / 2
        cy = (box[0][1] + box[1][1]) / 2

    best_score = score
    best_match = {
        "cx": cx,
        "cy": cy,
        "score": score,
        "line": line_obj.value,
        "matched_span": " ".join(w.value for w in matched_words) if matched_words else None,
    }
```

### 4.3. Rationale, risks, tests to run before merge

**Why this fixes the bug**: for `target='Imagerie'` in the line `"Motif d'admission Examens cliniques Imagerie Notes médicales Synthèse Urgences Codage >"`, `matched_words` captures only the `Word` `"Imagerie"` (local geometry), not every word of the line, so `cx, cy` land at the exact center of that word. The same holds for `'Notes médicales'` (2 contiguous words) and `'Synthèse Urgences'` (2 contiguous words). The (0.23, 0.28) collision disappears.

**Same coordinate frame**: `Word.geometry` is in the **same normalized frame** as `line_obj.geometry` (checked against the docTR documentation, see [Discussion #570](https://github.com/mindee/doctr/discussions/570) and [io modules](https://mindee.github.io/doctr/modules/io.html)). No scale conversion is required.
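The effect can be reproduced on a synthetic tab bar without running any OCR. In the sketch below, `Word` is a stand-in namedtuple rather than the docTR class, the geometries are invented for illustration, and `find_span` uses strict token equality, which is a simplification of the tolerant comparison in the patch.

```python
from collections import namedtuple

# Stand-in for a docTR word: value + ((xmin, ymin), (xmax, ymax)) normalized geometry
Word = namedtuple("Word", ["value", "geometry"])


def find_span(words, target):
    """First contiguous run of words whose lowercase values equal the target tokens."""
    tokens = target.lower().split()
    values = [w.value.lower() for w in words]
    for start in range(len(values) - len(tokens) + 1):
        if values[start:start + len(tokens)] == tokens:
            return words[start:start + len(tokens)]
    return []


def box_center(boxes):
    """Center of the bounding box enclosing all the given boxes."""
    xs = [x for (x0, _), (x1, _) in boxes for x in (x0, x1)]
    ys = [y for (_, y0), (_, y1) in boxes for y in (y0, y1)]
    return (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2


# Invented layout: five words evenly spread along one OCR line
LABELS = ["Imagerie", "Notes", "médicales", "Synthèse", "Urgences"]
WORDS = [Word(v, ((0.10 + 0.08 * i, 0.27), (0.17 + 0.08 * i, 0.29))) for i, v in enumerate(LABELS)]
TARGETS = ["Imagerie", "Notes médicales", "Synthèse Urgences"]

line_cx, _ = box_center([w.geometry for w in WORDS])
for target in TARGETS:
    span = find_span(WORDS, target)
    span_cx, _ = box_center([w.geometry for w in span])
    print(f"{target:20s} line cx={line_cx:.3f}  span cx={span_cx:.3f}")
```

Every target shares the same line center, while the span centers differ per tab, which is the separation the patch relies on.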

**Residual risks**:
1. **Case/accents**: both the `target_lower in line_lower` test and the `t == w or t in w or w in t` comparison are accent-sensitive. `target='Notes médicales'` matches `Word='médicales'`, but `target='Notes medicales'` (accent missing in the workflow JSON) can miss. Mitigation: normalize accents on both sides before comparing (Unicode decomposition + diacritic stripping), e.g. `unicodedata.normalize('NFKD', s).encode('ascii','ignore').decode()`.
2. **docTR tokenization ≠ whitespace split**: docTR typically splits on spaces but may split or group hyphens and apostrophes differently. The fallback `matched_words = [w for w in line_obj.words if any(t in w.value.lower() for t in target_tokens)]` covers this case but can over-match.
3. **Performance**: O(n_words × n_target_tokens), negligible (typically n_words < 50).
4. **Cosmetic regressions**: `pre_check_text_match` (DETTE-001) is currently OFF; to be re-tested with this fix active.
5. **Branch order**: if `line_obj.value` is the space-joined text of `line_obj.words`, any target equal to a single word is also a substring of the line, so the 0.8 branch fires first and the 0.9 exact-word branch is never reached (in both the current block and the patch). The click position is unaffected, but the resulting score is 0.8 rather than 0.9; to be confirmed against the real code before merge.
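For risk 1, a possible normalization helper is sketched below. `fold_text` is a hypothetical name; it drops combining marks after NFKD decomposition instead of encoding to ASCII, a variant of the mitigation above that keeps non-ASCII characters without a decomposition (such as `œ`) rather than deleting them.

```python
import unicodedata


def fold_text(s: str) -> str:
    """Lowercase, trim and strip diacritics so that 'médicales' and 'medicales' compare equal."""
    decomposed = unicodedata.normalize("NFKD", s)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold().strip()


print(fold_text("Notes médicales") == fold_text("notes medicales"))
```

Applied to both `target_text` and each `Word.value` before the comparisons, this keeps the matching logic of the patch unchanged while making it accent-insensitive.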

**Minimal tests before merge (10 min)**:
```bash
cd /home/dom/ai/rpa_vision_v3 && source .venv/bin/activate
python -c "
from agent_v0.server_v1.resolve_engine import _resolve_by_ocr_text
img='/home/dom/ai/rpa_vision_v3/visual_workflow_builder/backend/data/anchors/anchor_0438bd2d9bdd_1778161174_full.png'
for t in ['Imagerie','Notes médicales','Synthèse Urgences','Codage','Examens cliniques']:
    r = _resolve_by_ocr_text(img, t, 2560, 1600)
    print(f'{t:25s} -> cx={r[\"x_pct\"]:.4f} cy={r[\"y_pct\"]:.4f} score={r[\"score\"]:.2f}')
"
```

Success criterion: `Imagerie / Notes médicales / Synthèse Urgences` have `cx` values separated by at least 0.05 (≈ 130 px at a width of 2560 px).
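The criterion can be turned into an automatic check on the values printed by the test above. This is a sketch under assumptions: `min_pairwise_gap` and `tabs_are_separated` are hypothetical helpers, and the input is a plain dict mapping each target to its normalized `cx`.

```python
from itertools import combinations


def min_pairwise_gap(cx_by_target):
    """Smallest horizontal distance between any two resolved targets (normalized units)."""
    values = list(cx_by_target.values())
    if len(values) < 2:
        raise ValueError("need at least two targets to compare")
    return min(abs(a - b) for a, b in combinations(values, 2))


def tabs_are_separated(cx_by_target, min_gap=0.05):
    """True when every pair of targets is at least `min_gap` apart."""
    return min_pairwise_gap(cx_by_target) >= min_gap
```

Two targets resolving to the same `cx`, as in the collision described in the 8 May diagnostic, make the check fail immediately.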

**NOT to be applied hot during the demo** (see §5 of the 8 May diagnostic). The demo quick fix remains the client timeout `5 → 30 s`. This patch applies on runner 2 (post-demo).

---

## 5. Cross-dependencies with the other axes

- **AXE_A5 (UI tokenization / OmniParser)**: OmniParser uses PaddleOCR for icon OCR. If we move to OmniParser-style tokenization as cascade stage `1.5` (between OCR and VLM), we will have to choose **a single OCR engine for the whole pipeline** or accept two engines (docTR for resolve_engine, PaddleOCR/RapidOCR for tokenization). See the AXE_A5 deliverable.
- **AXE_B2 (Validator)**: the SSIM-ROI proposed in §3 directly feeds the Validator component of the Planner-Actor-Validator (see SYNTHESE §5.2). It is the semantic signal "the click did something in the expected zone" that eliminates the class of bugs "clicked somewhere, REPORT success=True".
- **DETTE-001**: the §4 patch + the §3 SSIM-ROI close the debt (the OCR pre-check stops being spatially blind because it targets an exact span, and the post-click verification runs on a targeted ROI).
- **Drift exemption ≥ 0.95**: the LightGlue ratification (§2.4) makes it possible to lower the threshold towards 0.80 without reintroducing false positives.

---

## 6. Sources (grouped by sub-axis)

- [docTR — Word/Line geometry — Discussion #570 (2022, valide en 2026)](https://github.com/mindee/doctr/discussions/570)
- [docTR — I/O modules (doc officielle)](https://mindee.github.io/doctr/modules/io.html)
- [docTR — repo principal (release v0.10, 2026-04)](https://github.com/mindee/doctr)
- [docTR — PyPI python-doctr](https://pypi.org/project/python-doctr/)
- [PaddleOCR — return_word_box Issue #15760 (2024)](https://github.com/PaddlePaddle/PaddleOCR/issues/15760)
- [PaddleOCR 3.0 Technical Report (2025-07)](https://arxiv.org/pdf/2507.05595)
- [Surya OCR — datalab-to/surya](https://github.com/datalab-to/surya)
- [Surya OCR — PyPI v0.17.1](https://pypi.org/project/surya-ocr/)
- [EasyOCR — Character bbox Issue #631](https://github.com/JaidedAI/EasyOCR/issues/631)
- [RapidOCR — releases (v3.x, 2026-04-11)](https://github.com/RapidAI/RapidOCR/releases)
- [RapidOCR — repo](https://github.com/RapidAI/RapidOCR)
- [pytesseract — image_to_data + hOCR (PyPI)](https://pypi.org/project/pytesseract/)
- [Codesota — PaddleOCR vs EasyOCR Speed 2025](https://www.codesota.com/ocr/paddleocr-vs-easyocr)
- [Codesota — PaddleOCR vs Tesseract vs EasyOCR 2026](https://www.codesota.com/ocr/paddleocr-vs-tesseract)
- [Buttondown — PaddleOCR vs EasyOCR vs Doctr Memory & Latency](https://buttondown.com/ckae930413/archive/paddleocr-vs-easyocr-vs-doctr-memory-latency-test/)
- [LightGlue — ICCV 2023 paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Lindenberger_LightGlue_Local_Feature_Matching_at_Light_Speed_ICCV_2023_paper.pdf)
- [LightGlue — repo cvg/LightGlue](https://github.com/cvg/LightGlue)
- [LightGlue ONNX — fabio-sim/LightGlue-ONNX](https://github.com/fabio-sim/LightGlue-ONNX)
- [LightGlue — HuggingFace Transformers integration](https://huggingface.co/docs/transformers/model_doc/lightglue)
- [Efficient LoFTR — arXiv 2403.04765 (CVPR 2024)](https://arxiv.org/pdf/2403.04765)
- [RoMa — CVPR 2024](https://openaccess.thecvf.com/content/CVPR2024/html/Edstedt_RoMa_Robust_Dense_Feature_Matching_CVPR_2024_paper.html)
- [RoMa v2 — emergent mind 2025-11](https://www.emergentmind.com/papers/2511.15706)
- [DINOv2 — features tutorial](https://www.lightly.ai/blog/dinov2)
- [DINO-RotateMatch — arXiv 2512.03715 (2025)](https://arxiv.org/pdf/2512.03715)
- [PyImageSearch — Multi-scale Template Matching (2015, ref classique)](https://pyimagesearch.com/2015/01/26/multi-scale-template-matching-using-python-opencv/)
- [Medium — Template Matching Beyond Basics: Rotation & Scale Invariant](https://medium.com/@coders.stop/template-matching-beyond-basics-rotation-and-scale-invariant-detection-2ae78d8fa190)
- [Eureka — SSIM vs LPIPS](https://eureka.patsnap.com/article/ssim-vs-lpips-which-metric-should-you-trust-for-image-quality-evaluation)
- [skimage — structural_similarity doc](https://scikit-image.org/docs/dev/auto_examples/transform/plot_ssim.html)
- [Wopee — Screenshot Comparison Algorithms](https://wopee.io/blog/screenshot-comparison-algorithms-visual-testing/)
- [Medium — CLIP vs DINOv2 image similarity](https://medium.com/aimonks/clip-vs-dinov2-in-image-similarity-6fa5aa7ed8c6)
- [DinoHash — arXiv 2503.11195](https://arxiv.org/pdf/2503.11195)
- [OmniParser — DeepWiki OCR module](https://deepwiki.com/microsoft/OmniParser/2.2-ocr-and-image-processing)

---

*Research document. No code modified. Any application must be validated by Dom on a case-by-case basis.*