fix: CLIP+OCR cross-check + anchor description before execution
Some checks failed
security-audit / Bandit (scan statique) (push) Successful in 12s
security-audit / pip-audit (CVE dépendances) (push) Successful in 10s
security-audit / Scan secrets (grep) (push) Successful in 9s
tests / Lint (ruff + black) (push) Successful in 14s
tests / Tests unitaires (sans GPU) (push) Failing after 13s
tests / Tests sécurité (critique) (push) Has been skipped
When CLIP reports "found", OCR now verifies that the text at that position matches the target. If CLIP picks "Ce PC" instead of "CR_patient_demo", the OCR check rejects the match → fallback to the cascade.

When the label is an action type, a VLM description of the anchor is generated BEFORE CLIP runs (double_click_anchor → "text file icon CR_patient"). The enriched target_text is used for both the cross-check AND grounding.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
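The cross-check described above can be sketched as a standalone helper. This is a minimal illustration, not the code of this commit: the name `ocr_confirms` and the `radius` parameter are hypothetical, and it assumes OCR words arrive as dicts with a `text` key and a `bbox` of `(x1, y1, x2, y2)`, which is how the change indexes them.

```python
from math import hypot

# Labels that name an action type rather than visible text (list taken from the commit).
ACTION_LABELS = {
    "click_anchor", "double_click_anchor", "right_click_anchor",
    "hover_anchor", "focus_anchor",
}


def ocr_confirms(words, center, target_text, radius=100.0):
    """Return True when OCR text near `center` is consistent with `target_text`.

    Hypothetical helper: `words` is a list of {'text': str, 'bbox': (x1, y1, x2, y2)}.
    """
    # Nothing to compare against: accept the CLIP hit as-is.
    if not target_text or target_text in ACTION_LABELS:
        return True

    cx, cy = center["x"], center["y"]
    nearby = []
    for w in words:
        x1, y1, x2, y2 = w["bbox"]
        # Distance from the word's bbox centre to the CLIP click point.
        if hypot((x1 + x2) / 2 - cx, (y1 + y2) / 2 - cy) < radius:
            nearby.append(w["text"])

    target = target_text.lower()
    # Accept if the target appears in the nearby text...
    if target in " ".join(nearby).lower():
        return True
    # ...or if a nearby word longer than 2 characters is part of the target.
    return any(len(t) > 2 and t.lower() in target for t in nearby)


words = [
    {"text": "Ce", "bbox": (100, 100, 120, 120)},
    {"text": "PC", "bbox": (125, 100, 145, 120)},
    {"text": "CR_patient_demo", "bbox": (500, 500, 620, 520)},
]
print(ocr_confirms(words, {"x": 120, "y": 110}, "CR_patient_demo"))  # click landed on "Ce PC"
print(ocr_confirms(words, {"x": 560, "y": 510}, "CR_patient_demo"))  # click landed on the file
```

The second acceptance rule is deliberately lenient: OCR often splits or truncates labels, so a partial word match is treated as confirmation rather than as a rejection.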
@@ -724,15 +724,39 @@ def find_and_click(
             # clip_result.found is gated by the thresholds in find_anchor_in_screen
             if clip_result.found:
-                print(f"✅ [Vision] UI-DETR-1+CLIP réussi! Confiance: {clip_result.confidence:.2f}")
-                return {
-                    'found': True,
-                    'confidence': clip_result.confidence,
-                    'coordinates': clip_result.center,
-                    'bbox': clip_result.bbox,
-                    'method': 'clip',
-                    'search_time_ms': (_time.time() - start_time) * 1000
-                }
+                # OCR cross-check: does the text at this position match the target?
+                clip_validated = True
+                if target_text and target_text not in ('click_anchor', 'double_click_anchor',
+                                                       'right_click_anchor', 'hover_anchor', 'focus_anchor'):
+                    try:
+                        from services.ocr_service import ocr_extract_words
+                        words = ocr_extract_words(screen_image)
+                        cx, cy = clip_result.center['x'], clip_result.center['y']
+                        nearby_texts = []
+                        for w in words:
+                            wx = (w['bbox'][0] + w['bbox'][2]) / 2
+                            wy = (w['bbox'][1] + w['bbox'][3]) / 2
+                            dist = ((wx - cx)**2 + (wy - cy)**2) ** 0.5
+                            if dist < 100:
+                                nearby_texts.append(w['text'])
+                        nearby_str = ' '.join(nearby_texts).lower()
+                        target_lower = target_text.lower()
+                        if target_lower not in nearby_str and not any(t.lower() in target_lower for t in nearby_texts if len(t) > 2):
+                            print(f"⛔ [Vision] CLIP rejeté par OCR: texte proche='{nearby_str}' ne contient pas '{target_text}'")
+                            clip_validated = False
+                    except Exception as ocr_err:
+                        print(f"⚠️ [Vision] Vérification OCR échouée: {ocr_err}")
+
+                if clip_validated:
+                    print(f"✅ [Vision] UI-DETR-1+CLIP réussi! Confiance: {clip_result.confidence:.2f}")
+                    return {
+                        'found': True,
+                        'confidence': clip_result.confidence,
+                        'coordinates': clip_result.center,
+                        'bbox': clip_result.bbox,
+                        'method': 'clip',
+                        'search_time_ms': (_time.time() - start_time) * 1000
+                    }
             else:
                 print(f"⚠️ [Vision] UI-DETR-1+CLIP: rejeté (confiance: {clip_result.confidence:.2f})")
         except Exception as clip_err:
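The control flow of this hunk (return on a validated hit, otherwise fall through to the next method) can be illustrated in isolation. The sketch below is an assumption about the general shape of such a cascade, not the project's implementation: `run_cascade`, `finders`, and `validate` are hypothetical names, and the real cascade stages are not shown on this page.

```python
import time


def run_cascade(finders, validate):
    """Try each (name, finder) pair in order and return the first validated hit.

    Hypothetical sketch: `finder()` returns coordinates or None, and
    `validate(name, hit)` plays the role of the OCR cross-check.
    """
    start = time.time()
    for name, finder in finders:
        try:
            hit = finder()
        except Exception as err:
            # A failing stage must not abort the cascade; move on to the next one.
            print(f"[cascade] {name} failed: {err}")
            continue
        if hit is None or not validate(name, hit):
            continue  # not found, or rejected by the cross-check
        return {
            "found": True,
            "coordinates": hit,
            "method": name,
            "search_time_ms": (time.time() - start) * 1000,
        }
    return {
        "found": False,
        "coordinates": None,
        "method": None,
        "search_time_ms": (time.time() - start) * 1000,
    }


# Usage: the first stage finds something, the validator rejects it,
# and the second stage's result is returned instead.
finders = [
    ("clip", lambda: {"x": 120, "y": 110}),
    ("template", lambda: {"x": 560, "y": 510}),
]
result = run_cascade(finders, validate=lambda name, hit: name != "clip")
print(result["method"], result["coordinates"])
```

Keeping validation separate from detection means a rejected hit costs only the time already spent, and later stages run exactly as if the first had found nothing.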