feat: chaîne de grounding 3 niveaux + refonte capture écran

Grounding en cascade quand CLIP/template échouent : 1. OCR (docTR) → cherche le texte exact sur l'écran (~1s) 2. UI-TARS grounding → "click on X" → coordonnées (~3s, 94% ScreenSpot) 3. VLM reasoning → raisonnement complet + confirmation OCR (~10s) find_element_on_screen() dans input_handler.py (partagé VWB + Léa). Câblé dans find_and_click() et execute_action() comme fallback. Refonte capture écran : - mss.monitors[0] (composite) pour capturer la VM en plein écran - FullscreenSelector réécrit : overlay via getBoundingClientRect() - Bboxes et sélection alignées avec l'image (calcul JS, pas CSS) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-21 09:31:38 +02:00
parent 14a9442343
commit 73ddcdb29d
3 changed files with 392 additions and 2 deletions
--- a/visual_workflow_builder/backend/services/intelligent_executor.py
+++ b/visual_workflow_builder/backend/services/intelligent_executor.py
@@ -656,7 +656,9 @@ def find_and_click(
    anchor_image_base64: str,
    anchor_bbox: Optional[Dict[str, int]] = None,
    method: str = 'clip',
-    detection_threshold: float = 0.35
+    detection_threshold: float = 0.35,
+    target_text: str = '',
+    target_description: str = ''
 ) -> Dict[str, Any]:
    """
    Fonction utilitaire pour trouver une ancre et retourner les coordonnées de clic.
@@ -665,11 +667,16 @@ def find_and_click(
    - 'clip': UI-DETR-1 + CLIP (matching sémantique intelligent, recommandé)
    - 'zoned': Template matching zonée (fallback)

+    En dernier recours, si target_text est fourni, utilise la chaîne de grounding
+    (OCR → UI-TARS → VLM) via find_element_on_screen.
+
    Args:
        anchor_image_base64: Image de l'ancre en base64
        anchor_bbox: Bounding box originale
        method: 'clip' pour UI-DETR-1+CLIP, 'zoned' pour template zonée
        detection_threshold: Seuil de détection pour UI-DETR-1
+        target_text: Texte de l'élément à trouver (pour fallback grounding)
+        target_description: Description longue (pour fallback grounding)

    Returns:
        Dict avec found, coordinates, confidence, etc.
@@ -815,6 +822,35 @@ def find_and_click(
        except Exception as seeclick_err:
            print(f"⚠️ [Vision] Erreur SeeClick: {seeclick_err}")

+        # === FALLBACK: Chaîne de grounding (OCR → UI-TARS → VLM) ===
+        if target_text or target_description:
+            try:
+                from core.execution.input_handler import find_element_on_screen
+                print(f"🔗 [Vision] Dernier recours: chaîne de grounding pour '{target_text or target_description}'...")
+                grounding_result = find_element_on_screen(
+                    target_text=target_text,
+                    target_description=target_description,
+                    anchor_image_base64=anchor_image_base64
+                )
+                if grounding_result:
+                    gx, gy = grounding_result['x'], grounding_result['y']
+                    gmethod = grounding_result['method']
+                    gconf = grounding_result['confidence']
+                    print(f"✅ [Vision] Grounding réussi via {gmethod} à ({gx}, {gy}) conf={gconf:.2f}")
+                    return {
+                        'found': True,
+                        'confidence': gconf,
+                        'coordinates': {'x': gx, 'y': gy},
+                        'bbox': anchor_bbox,
+                        'method': f'grounding_{gmethod}',
+                        'search_time_ms': (_time.time() - start_time) * 1000,
+                        'candidates': []
+                    }
+                else:
+                    print(f"❌ [Vision] Chaîne de grounding échouée pour '{target_text or target_description}'")
+            except Exception as grounding_err:
+                print(f"⚠️ [Vision] Erreur chaîne de grounding: {grounding_err}")
+
        # === Toutes les méthodes visuelles ont échoué ===
        if anchor_bbox:
            best_conf = max(global_result.get('confidence', 0), 0)