feat: direct VLM grounding (Qwen2.5-VL), a new resolution strategy

New approach based on state-of-the-art research:
- _resolve_by_grounding(): the VLM returns the coordinates directly
  (no SomEngine + intermediate numbering step)
- Uses Qwen2.5-VL (trained for GUI grounding) instead of qwen3-vl
- Parses the native formats: bbox_2d, JSON x/y, raw arrays
- Multi-image fallback: screenshot + crop → grounding without a description
- Icon identification via Qwen2.5-VL (better than qwen3-vl)
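
To illustrate the parsing step listed above, here is a minimal sketch of turning a grounding reply into a click point. It is not the code from this commit: the function names (parse_grounding_reply, _center, _from_json) are hypothetical, and it assumes the reply carries pixel coordinates as a bbox_2d object, an x/y object, or a bare array, possibly wrapped in a Markdown fence or surrounded by prose.

```python
import json
import re
from typing import Optional, Tuple

Point = Tuple[float, float]
FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in


def _center(values) -> Optional[Point]:
    """Turn [x, y] or [x1, y1, x2, y2] into a single click point."""
    if not isinstance(values, (list, tuple)):
        return None
    if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return None
    if len(values) == 2:
        return float(values[0]), float(values[1])
    if len(values) == 4:
        x1, y1, x2, y2 = values
        return (x1 + x2) / 2.0, (y1 + y2) / 2.0
    return None


def _from_json(data) -> Optional[Point]:
    """Handle the three assumed shapes: bbox_2d, x/y object, raw array."""
    if isinstance(data, dict):
        if "bbox_2d" in data:
            return _center(data["bbox_2d"])
        if "x" in data and "y" in data:
            return _center([data["x"], data["y"]])
        return None
    if isinstance(data, list):
        point = _center(data)
        if point is not None:
            return point
        # A list of detections: take the first one that parses.
        for item in data:
            point = _from_json(item)
            if point is not None:
                return point
    return None


def parse_grounding_reply(text: str) -> Optional[Point]:
    """Return the (x, y) center found in a model reply, or None."""
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", text).strip()
    try:
        return _from_json(json.loads(cleaned))
    except json.JSONDecodeError:
        pass
    # Fallback: look for a flat JSON object or array embedded in prose.
    for match in re.finditer(r"\{[^{}]*\}|\[[^\[\]]*\]", cleaned):
        try:
            point = _from_json(json.loads(match.group(0)))
        except json.JSONDecodeError:
            continue
        if point is not None:
            return point
    return None
```

Returning None rather than raising lets the caller fall through to the next strategy when the reply contains no usable coordinates.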

Results on a real session (local validation):
- Elements with text (Word, Document, Fichier): 100% correct
- Icons without text (Windows logo, floppy disk): improvement in progress

Strict-mode cascade:
0. Direct VLM grounding (Qwen2.5-VL), NEW
0.5. Template matching for icons
1. VLM Quick Find (fallback)
1.5. SoM + VLM
2. Strict template matching

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dom
2026-03-31 18:55:00 +02:00
parent 875367dea9
commit d99b17394a
2 changed files with 243 additions and 25 deletions


@@ -458,26 +458,24 @@ def _vlm_identify_element(anchor_b64: str, window_title: str = "") -> str:
 img.save(tmp, format="PNG")
 tmp_path = tmp.name
-from core.detection.ollama_client import OllamaClient
-client = OllamaClient(
-    endpoint="http://localhost:11434",
-    model="qwen3-vl:8b",
-    timeout=15,
-)
-context = f" in the window '{window_title}'" if window_title else ""
-result = client.generate(
-    prompt=(
-        f"This is a cropped UI element{context}. "
-        "What is it? Answer with a short label (2-5 words max). "
-        "Examples: 'search bar icon', 'Word application icon', 'close button', "
-        "'file menu', 'save button'.\n"
-        "Answer ONLY the label, nothing else."
-    ),
-    image_path=tmp_path,
-    system_prompt="You identify UI elements. Answer with a short label only.",
-    temperature=0.1,
-    max_tokens=20,
-)
+import requests as _requests
+context = f" from the window '{window_title}'" if window_title else ""
+# Use Qwen2.5-VL (better at UI identification than qwen3-vl)
+crop_b64 = base64.b64encode(open(tmp_path, "rb").read()).decode()
+resp = _requests.post("http://localhost:11434/api/chat", json={
+    "model": "qwen2.5vl:7b",
+    "messages": [
+        {"role": "system", "content": "You name UI elements in 2-5 words. No explanation."},
+        {"role": "user", "content": (
+            f"This is a UI element{context}. "
+            "Name it in 2-5 words. Examples: 'save icon in title bar', "
+            "'Windows search icon', 'close button', 'file menu'."
+        ), "images": [crop_b64]},
+    ],
+    "stream": False,
+    "options": {"temperature": 0.1, "num_predict": 20},
+}, timeout=30)
+result = {"success": resp.ok, "response": resp.json().get("message", {}).get("content", "")}
 import os
 os.unlink(tmp_path)
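
As a reading aid for the hunk above, the new request can be split into two small pure functions: one that builds the /api/chat payload and one that extracts the label from the reply. This is a sketch, not code from the commit. build_identify_payload and label_from_reply are hypothetical names, and the prompt is shortened. The payload fields (model, messages, images, stream, options) and the message.content reply field are the ones the diff itself uses.

```python
import base64
from typing import Any, Dict


def build_identify_payload(crop_png: bytes, window_title: str = "") -> Dict[str, Any]:
    """Build the JSON body for POST /api/chat with one base64-encoded crop."""
    context = f" from the window '{window_title}'" if window_title else ""
    return {
        "model": "qwen2.5vl:7b",
        "messages": [
            {
                "role": "system",
                "content": "You name UI elements in 2-5 words. No explanation.",
            },
            {
                "role": "user",
                "content": f"This is a UI element{context}. Name it in 2-5 words.",
                "images": [base64.b64encode(crop_png).decode("ascii")],
            },
        ],
        "stream": False,
        "options": {"temperature": 0.1, "num_predict": 20},
    }


def label_from_reply(body: Dict[str, Any]) -> str:
    """Extract the label text; return an empty string if it is missing."""
    message = body.get("message") or {}
    return str(message.get("content") or "").strip()
```

Separating payload construction from the HTTP call lets both halves be tested without a running Ollama server. Taking the crop as bytes also avoids reopening the temporary file by path.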