fix: stratégie hybride OCR→grounding VLM / icônes→template matching

Résolution 4/4 (100%) validée localement : - Texte OCR (by_text_source="ocr") → grounding Qwen2.5-VL (dist < 0.04) - Icônes sans texte (by_text_source="") → template matching crop 80x80 (dist = 0.000) Le VLM identify element est supprimé pour les icônes (descriptions non-déterministes qui faisaient échouer le grounding). Le template matching est instantané et parfait quand le crop est net (80x80). Ajout de by_text_source dans target_spec pour distinguer OCR vs VLM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 23:21:06 +02:00
parent d99b17394a
commit 6724f43950
2 changed files with 20 additions and 18 deletions
--- a/agent_v0/server_v1/api_stream.py
+++ b/agent_v0/server_v1/api_stream.py
@@ -3970,12 +3970,14 @@ def _resolve_target_sync(
                vlm_description = _build_target_description(target_spec)

        # ---------------------------------------------------------------
-        # Étape 0 : Grounding VLM Direct (Qwen2.5-VL)
-        # Le VLM reçoit le screenshot + description textuelle et retourne
-        # directement les coordonnées. Plus fiable que SomEngine + numérotation.
+        # Étape 0 : Choisir la stratégie selon le type d'élément
+        # - Texte OCR fiable → grounding VLM (description textuelle)
+        # - Icône sans texte → template matching (crop 80x80)
        # ---------------------------------------------------------------
-        grounding_desc = by_text_strict or vlm_description
-        if grounding_desc:
+        by_text_source = target_spec.get("by_text_source", "")
+
+        if by_text_strict and by_text_source == "ocr":
+            # Texte OCR fiable → grounding VLM direct
            grounding_result = _resolve_by_grounding(
                screenshot_path=screenshot_path,
                target_spec=target_spec,
@@ -3987,14 +3989,11 @@ def _resolve_target_sync(
                    "Strict resolve GROUNDING : OK (%.4f, %.4f) pour '%s'",
                    grounding_result.get("x_pct", 0),
                    grounding_result.get("y_pct", 0),
-                    grounding_desc[:50],
+                    by_text_strict[:50],
                )
                return grounding_result

-        # ---------------------------------------------------------------
-        # Étape 0.5 : Template matching pour icônes sans texte (crop 80x80)
-        # ---------------------------------------------------------------
-        if not by_text_strict:
+        if not by_text_strict or by_text_source != "ocr":
            result = _resolve_by_template_matching(
                screenshot_path=screenshot_path,
                anchor_image_b64=anchor_image_b64,