fix: vérification croisée CLIP+OCR + description ancre avant exécution
Some checks failed
security-audit / Bandit (scan statique) (push) Successful in 12s
security-audit / pip-audit (CVE dépendances) (push) Successful in 10s
security-audit / Scan secrets (grep) (push) Successful in 9s
tests / Lint (ruff + black) (push) Successful in 14s
tests / Tests unitaires (sans GPU) (push) Failing after 13s
tests / Tests sécurité (critique) (push) Has been skipped
Some checks failed
security-audit / Bandit (scan statique) (push) Successful in 12s
security-audit / pip-audit (CVE dépendances) (push) Successful in 10s
security-audit / Scan secrets (grep) (push) Successful in 9s
tests / Lint (ruff + black) (push) Successful in 14s
tests / Tests unitaires (sans GPU) (push) Failing after 13s
tests / Tests sécurité (critique) (push) Has been skipped
Quand CLIP dit "trouvé", on vérifie par OCR que le texte à cette position correspond au target. Si CLIP clique sur "Ce PC" au lieu de "CR_patient_demo", l'OCR le rejette → fallback sur la cascade. Description VLM de l'ancre AVANT le CLIP quand le label est un type d'action (double_click_anchor → "text file icon CR_patient"). Le target_text enrichi sert à la vérification croisée ET au grounding. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -813,10 +813,22 @@ def execute_action(action_type: str, params: dict) -> dict:
|
||||
'height': bbox.get('height', 0)
|
||||
}
|
||||
|
||||
# Extraire le texte cible pour le grounding en dernier recours
|
||||
# Extraire le texte cible pour le grounding et la vérification CLIP
|
||||
_fc_target_text = params.get('visual_anchor', {}).get('target_text', '')
|
||||
if not _fc_target_text:
|
||||
_fc_target_text = params.get('_step_label', '')
|
||||
# Si le label est juste le type d'action, essayer de décrire l'ancre
|
||||
_action_types = {'click_anchor', 'double_click_anchor', 'right_click_anchor',
|
||||
'hover_anchor', 'focus_anchor', 'scroll_to_anchor'}
|
||||
if _fc_target_text in _action_types and screenshot_base64:
|
||||
try:
|
||||
from core.execution.input_handler import _describe_anchor_image
|
||||
_desc = _describe_anchor_image(screenshot_base64)
|
||||
if _desc:
|
||||
print(f"🏷️ [Vision] Ancre décrite: '{_desc}'")
|
||||
_fc_target_text = _desc
|
||||
except Exception as e:
|
||||
print(f"⚠️ [Vision] Description ancre échouée: {e}")
|
||||
_fc_target_desc = params.get('visual_anchor', {}).get('description', '')
|
||||
|
||||
# Trouver l'ancre avec la vision (CLIP + position - cf VISION_RPA_INTELLIGENT.md)
|
||||
|
||||
Reference in New Issue
Block a user