feat: VLM grounding direct (Qwen2.5-VL) — nouvelle stratégie de résolution
Nouvelle approche basée sur les recherches état de l'art : - _resolve_by_grounding() : le VLM retourne directement les coordonnées (pas de SomEngine + numérotation intermédiaire) - Utilise Qwen2.5-VL (entraîné pour le GUI grounding) au lieu de qwen3-vl - Parse les formats natifs : bbox_2d, JSON x/y, arrays bruts - Fallback multi-image : screenshot + crop → grounding sans description - Identification des icônes via Qwen2.5-VL (meilleur que qwen3-vl) Résultats sur session réelle (validation locale) : - Éléments avec texte (Word, Document, Fichier) : 100% corrects - Icônes sans texte (Windows logo, disquette) : en cours d'amélioration Cascade strict mode : 0. Grounding VLM direct (Qwen2.5-VL) — NOUVEAU 0.5. Template matching pour icônes 1. VLM Quick Find (fallback) 1.5. SoM + VLM 2. Template matching strict Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -458,26 +458,24 @@ def _vlm_identify_element(anchor_b64: str, window_title: str = "") -> str:
|
||||
img.save(tmp, format="PNG")
|
||||
tmp_path = tmp.name
|
||||
|
||||
from core.detection.ollama_client import OllamaClient
|
||||
client = OllamaClient(
|
||||
endpoint="http://localhost:11434",
|
||||
model="qwen3-vl:8b",
|
||||
timeout=15,
|
||||
)
|
||||
context = f" in the window '{window_title}'" if window_title else ""
|
||||
result = client.generate(
|
||||
prompt=(
|
||||
f"This is a cropped UI element{context}. "
|
||||
"What is it? Answer with a short label (2-5 words max). "
|
||||
"Examples: 'search bar icon', 'Word application icon', 'close button', "
|
||||
"'file menu', 'save button'.\n"
|
||||
"Answer ONLY the label, nothing else."
|
||||
),
|
||||
image_path=tmp_path,
|
||||
system_prompt="You identify UI elements. Answer with a short label only.",
|
||||
temperature=0.1,
|
||||
max_tokens=20,
|
||||
)
|
||||
import requests as _requests
|
||||
context = f" from the window '{window_title}'" if window_title else ""
|
||||
# Utiliser Qwen2.5-VL (meilleur pour l'identification UI que qwen3-vl)
|
||||
crop_b64 = base64.b64encode(open(tmp_path, "rb").read()).decode()
|
||||
resp = _requests.post("http://localhost:11434/api/chat", json={
|
||||
"model": "qwen2.5vl:7b",
|
||||
"messages": [
|
||||
{"role": "system", "content": "You name UI elements in 2-5 words. No explanation."},
|
||||
{"role": "user", "content": (
|
||||
f"This is a UI element{context}. "
|
||||
"Name it in 2-5 words. Examples: 'save icon in title bar', "
|
||||
"'Windows search icon', 'close button', 'file menu'."
|
||||
), "images": [crop_b64]},
|
||||
],
|
||||
"stream": False,
|
||||
"options": {"temperature": 0.1, "num_predict": 20},
|
||||
}, timeout=30)
|
||||
result = {"success": resp.ok, "response": resp.json().get("message", {}).get("content", "")}
|
||||
|
||||
import os
|
||||
os.unlink(tmp_path)
|
||||
|
||||
Reference in New Issue
Block a user