fix(vision): Tighten CLIP/template matching thresholds to prevent erroneous clicks

Problem solved:
- The workflow clicked in the wrong place (200-500px off target)
- Matching thresholds were too permissive

Fixes applied:
- CLIP: MAX_DISTANCE=120px, MIN_SCORE=0.55, MIN_COMBINED=0.5
- Zoned template: MAX_DISTANCE=150px
- Global template: MAX_DISTANCE=150px (was 500px)
- Added detailed logs to debug rejected candidates
- Disabled the debug overlay (needless intensive polling)

Files modified:
- intelligent_executor.py: strict thresholds + logs
- execute.py: execution logic for basic/intelligent/debug modes
- ui_detection_service.py: UI-DETR-1 backend
- App.tsx: overlay disabled
- ExecutionOverlay.tsx: fixed API URLs

Documentation:
- docs/REFERENCE_VISION_RPA.md: full reference guide

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/REFERENCE_VISION_RPA.md — 230 lines (new file)

@@ -0,0 +1,230 @@
# VWB Vision RPA - Reference Document

## Session of January 24, 2026

---
## 1. SUMMARY OF THE INITIAL PROBLEM

The "Onlyoffice" workflow (12 steps) clicked in the wrong place:

- **Symptom**: Gedit opened instead of OnlyOffice
- **Cause**: Matching thresholds were too permissive (matches 200+ pixels away were accepted)
- **Impact**: The workflow kept going even after an erroneous click

---
## 2. VISION SYSTEM ARCHITECTURE

```
┌─────────────────────────────────────────────────────────────┐
│                      MATCHING PIPELINE                      │
├─────────────────────────────────────────────────────────────┤
│  1. UI-DETR-1 (rfdetr)                                      │
│     → Detects every UI element on screen                    │
│     → Returns bounding boxes                                │
│                                                             │
│  2. CLIP (OpenCLIP)                                         │
│     → Compares the anchor with each detected element        │
│     → Semantic similarity score (0-1)                       │
│     → Weighted by distance to the original position         │
│                                                             │
│  3. Template Matching (OpenCV)                              │
│     → Fallback if CLIP fails                                │
│     → Pixel-by-pixel comparison                             │
│     → Zoned (100-200px) then global                         │
│                                                             │
│  4. Static Fallback                                         │
│     → Last resort: original coordinates                     │
└─────────────────────────────────────────────────────────────┘
```

---
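The pipeline above is a chain of fallbacks: each stage is tried in order, and the recorded coordinates are used only when every vision stage fails. A minimal sketch of that control flow (strategy names, lambdas, and the return shape are illustrative, not the actual `intelligent_executor.py` API):

```python
# Illustrative sketch of the 4-stage pipeline; the real strategies call
# UI-DETR-1 / CLIP / OpenCV. Here each strategy returns
# (confidence, (x, y)) when it matches, or None when it fails.
def locate_anchor(strategies, static_coords):
    for name, strategy in strategies:
        result = strategy()
        if result is not None:
            confidence, coords = result
            return {'method': name, 'confidence': confidence, 'coords': coords}
    # Stage 4: static fallback - original recorded coordinates
    return {'method': 'static_fallback', 'confidence': 0.0, 'coords': static_coords}

# Example: CLIP fails (returns None), zoned template matching succeeds
strategies = [
    ('clip', lambda: None),
    ('template_zoned', lambda: (0.82, (640, 350))),
    ('template_global', lambda: (0.76, (900, 120))),
]
print(locate_anchor(strategies, static_coords=(512, 384)))
```

Because later stages never run once an earlier one succeeds, tightening a stage's thresholds shifts work (and risk) to the stages below it.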
## 3. CRITICAL THRESHOLDS (CURRENT VALUES)

### In `intelligent_executor.py` - CLIP method

```python
# === BALANCED THRESHOLDS ===
MAX_DISTANCE_PX = 120      # Reject any element > 120px from the original position
MIN_CLIP_SCORE = 0.55      # Minimum required CLIP score
MIN_COMBINED_SCORE = 0.5   # Minimum combined score to accept a match
```

### In `intelligent_executor.py` - Zoned template matching

```python
MAX_TEMPLATE_DISTANCE = 150  # In zoned_template_match()
```

### In `intelligent_executor.py` - Global template matching

```python
MAX_GLOBAL_DISTANCE = 150  # In find_and_click()
```
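Section 2 says the CLIP score is "weighted by distance to the original position", but the exact formula is not shown in this document. The following is only a plausible sketch of how such a combined score could gate on the three thresholds above; the 0.7/0.3 blend weights are invented for illustration:

```python
MAX_DISTANCE_PX = 120
MIN_CLIP_SCORE = 0.55
MIN_COMBINED_SCORE = 0.5

def combined_score(clip_score: float, distance_px: float) -> float:
    """Blend CLIP similarity with proximity to the original position.
    Hypothetical 0.7/0.3 weighting; candidates beyond MAX_DISTANCE_PX score 0."""
    if distance_px > MAX_DISTANCE_PX:
        return 0.0
    position_score = 1.0 - distance_px / MAX_DISTANCE_PX
    return 0.7 * clip_score + 0.3 * position_score

def accept(clip_score: float, distance_px: float) -> bool:
    """A candidate must clear both the raw CLIP gate and the combined gate."""
    return (clip_score >= MIN_CLIP_SCORE
            and combined_score(clip_score, distance_px) >= MIN_COMBINED_SCORE)
```

Under this scheme, a visually perfect match 200px away is still rejected, which is exactly the behavior the fix aims for.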
---
## 4. MODIFIED FILES

| File | Changes |
|------|---------|
| `services/intelligent_executor.py` | CLIP thresholds, distance limits, detailed logs |
| `api_v3/execute.py` | Execution logic with basic/intelligent/debug modes |
| `services/ui_detection_service.py` | UI-DETR-1 backend |
| `frontend_v4/src/App.tsx` | Debug overlay disabled |
| `frontend_v4/src/components/ExecutionOverlay.tsx` | Fixed API URLs |
| `catalog_routes_v2_vlm.py` | Ollama VLM integration |

---
## 5. EXECUTION MODES

| Mode | Behavior | Speed | Use case |
|------|----------|-------|----------|
| **basic** | Static coordinates only | Fast | Screen identical to the recording |
| **intelligent** | Vision (CLIP + template) | Slow | UI may have changed |
| **debug** | Vision + detailed logs | Slow | Debugging |

---
## 6. MATCHING STRATEGY ORDER

```
1. CLIP (UI-DETR-1 + CLIP embeddings)
   ├── If found with confidence ≥ 0.5 and distance ≤ 120px → USE
   └── Otherwise → fall back

2. Zoned template matching (100px)
   ├── If found with confidence ≥ 0.7 and distance ≤ 150px → USE
   └── Otherwise → widen

3. Widened zoned template matching (200px)
   ├── If found with confidence ≥ 0.6 and distance ≤ 150px → USE
   └── Otherwise → global

4. Global template matching
   ├── If found with confidence ≥ 0.75 and distance ≤ 150px → USE
   └── Otherwise → static fallback

5. Static fallback
   └── Use the original coordinates from the recording
```
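The per-strategy acceptance rules above can be written as a single threshold table; a sketch (the numbers mirror the confidences and distances listed, while the function shape is illustrative):

```python
# (min_confidence, max_distance_px) per strategy, straight from the list above
THRESHOLDS = {
    'clip':               (0.50, 120),
    'template_zoned_100': (0.70, 150),
    'template_zoned_200': (0.60, 150),
    'template_global':    (0.75, 150),
}

def is_acceptable(strategy: str, confidence: float, distance_px: float) -> bool:
    """True if a candidate passes its strategy's confidence and distance gates."""
    min_conf, max_dist = THRESHOLDS[strategy]
    return confidence >= min_conf and distance_px <= max_dist
```

Centralizing the gates this way makes the tuning advice in section 7 a one-line change per strategy.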
---
## 7. COMMON PROBLEMS AND SOLUTIONS

### Problem: "No valid candidate (all rejected by strict thresholds)"

**Cause**: The CLIP thresholds are too strict, or UI-DETR-1 does not detect the element

**Solution**:
- Lower `MIN_CLIP_SCORE` (e.g. 0.50)
- Raise `MAX_DISTANCE_PX` (e.g. 150)

### Problem: Click in the wrong place

**Cause**: Template matching finds a false positive far from the target

**Solution**:
- Reduce `MAX_TEMPLATE_DISTANCE` and `MAX_GLOBAL_DISTANCE`
- Check that the anchor is distinctive enough

### Problem: Workflow is very slow

**Cause**:
- Models reloaded at every step
- Ollama running on CPU
- Multiple fallbacks

**Solutions**:
- Use `basic` mode for stable workflows
- Configure Ollama for GPU
- Implement a model cache

### Problem: Ollama on CPU instead of GPU

**Check**: `ollama ps`

**Solution**:

```bash
# Check CUDA
nvidia-smi

# Restart Ollama with the GPU
CUDA_VISIBLE_DEVICES=0 ollama serve
```

---
## 8. MODELS USED

| Model | Use | Location |
|-------|-----|----------|
| UI-DETR-1 (rfdetr) | UI element detection | `/home/dom/ai/rpa_vision_v3/models/ui-detr-1/model.pth` |
| CLIP (ViT-B-32) | Semantic similarity | OpenCLIP (downloaded automatically) |
| qwen2.5vl:3b | AI analysis (vision) | Ollama |

### Recommended Ollama models for better quality:
- `qwen2.5vl:7b` - Better than 3b
- `llama3.2-vision:11b` - Better still
- `mistral:7b` - Plain text only (no vision)

---
## 9. USEFUL COMMANDS

```bash
# Start the VWB backend
cd /home/dom/ai/rpa_vision_v3/visual_workflow_builder/backend
./venv/bin/python app.py

# Check port 5001
lsof -i :5001

# Watch the execution logs
tail -f /tmp/vwb_backend.log | grep -E "(Execute|Vision|CLIP)"

# Check the status of an execution
curl -s http://localhost:5001/api/v3/execute/status | python3 -m json.tool

# List the Ollama models
ollama list

# See whether Ollama is using the GPU
ollama ps
```

---
## 10. FINAL RESULT

The "Onlyoffice" workflow (12 steps) now works:

| Step | Action | Method | Status |
|------|--------|--------|--------|
| 1 | Click menu | CLIP 99.8% | ✅ |
| 2 | Type "onlyoffice" | - | ✅ |
| 3 | Click OnlyOffice | static_fallback | ✅ |
| 4 | Click docx | CLIP 99.2% | ✅ |
| 5 | Wait 5s | - | ✅ |
| 6 | Type text | - | ✅ |
| 7 | AI analysis | qwen2.5vl:3b | ✅ |
| 8 | Click menu | CLIP 98.9% | ✅ |
| 9 | Type "gedit" | - | ✅ |
| 10 | Click gedit | static_fallback | ✅ |
| 11 | Wait 10s | - | ✅ |
| 12 | Paste AI result | - | ✅ |

---
## 11. SUGGESTED NEXT IMPROVEMENTS

1. **Model cache**: Load UI-DETR-1 and CLIP once at startup
2. **Ollama GPU**: Configure it to use the GPU
3. **Adaptive thresholds**: Adjust automatically based on context
4. **Post-action verification**: Confirm the action had the expected effect
5. **Hybrid mode**: Basic by default, vision only on failure
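Improvement 1 (the model cache) can be as simple as a process-wide lazy singleton; a sketch with a generic loader (the names are illustrative, not the current code):

```python
# Load each heavy model (UI-DETR-1, CLIP, ...) at most once per process.
_MODEL_CACHE: dict = {}

def get_model(name: str, loader):
    """Return the cached model, calling loader() only on first use."""
    if name not in _MODEL_CACHE:
        _MODEL_CACHE[name] = loader()
    return _MODEL_CACHE[name]

# Usage: get_model('ui-detr-1', load_ui_detr)
# The loader runs once; later calls at each workflow step reuse the instance.
```

In a multithreaded Flask backend a lock around the first load would also be needed; it is omitted here for brevity.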
---
## 12. CONTACT / HISTORY

- **Resolution date**: January 24, 2026
- **Debugging time**: ~2 hours
- **Backed-up files**: `/home/dom/ai/rpa_vision_v3/backups_24janv.2026_vision_fix/`

---

*Automatically generated document - Do not edit manually*
@@ -16,7 +16,26 @@ import threading
 import time
 import base64
 import os
+import subprocess
 from . import api_v3_bp
+
+
+def minimize_active_window():
+    """Minimize the active window (Linux, via xdotool)."""
+    try:
+        # Wait briefly so the HTTP request finishes being handled
+        time.sleep(0.3)
+        # Minimize the active window
+        subprocess.run(['xdotool', 'getactivewindow', 'windowminimize'],
+                       capture_output=True, timeout=2)
+        print("📦 [Execute] Browser window minimized")
+        return True
+    except FileNotFoundError:
+        print("⚠️ [Execute] xdotool not installed - cannot minimize")
+        return False
+    except Exception as e:
+        print(f"⚠️ [Execute] Minimize error: {e}")
+        return False
 from db.models import db, Workflow, Step, Execution, ExecutionStep, VisualAnchor, get_session_state
 from contracts.action_contracts import enforce_action_contract, ContractValidationError, get_required_params
@@ -32,7 +51,8 @@ _execution_state = {
     'is_paused': False,
     'should_stop': False,
     'current_execution_id': None,
-    'thread': None
+    'thread': None,
+    'execution_mode': 'basic'  # 'basic', 'intelligent', 'debug'
 }
@@ -99,9 +119,11 @@ def execute_workflow_thread(execution_id: str, workflow_id: str, app):
         if step.anchor_id:
             anchor = VisualAnchor.query.get(step.anchor_id)
             if anchor:
-                # Load the base64 image from the file
-                if anchor.image_path and os.path.exists(anchor.image_path):
-                    with open(anchor.image_path, 'rb') as f:
+                # Load the CROPPED image (thumbnail) for template matching
+                # thumbnail_path = anchor zone, image_path = full screen
+                anchor_image_path = anchor.thumbnail_path or anchor.image_path
+                if anchor_image_path and os.path.exists(anchor_image_path):
+                    with open(anchor_image_path, 'rb') as f:
                         image_base64 = base64.b64encode(f.read()).decode('utf-8')
                 else:
                     image_base64 = None
@@ -202,57 +224,249 @@ def execute_workflow_thread(execution_id: str, workflow_id: str, app):
     _execution_state['current_execution_id'] = None
 
 
+def execute_ai_analyze(params: dict) -> dict:
+    """
+    Run an AI analysis with Ollama.
+    Captures the anchor zone and sends it to the AI for analysis.
+    """
+    import requests
+
+    try:
+        # Read the parameters
+        anchor = params.get('visual_anchor', {})
+        prompt = params.get('analysis_prompt', params.get('prompt', ''))
+        model = params.get('model', params.get('ollama_model', 'qwen2.5-vl:7b'))
+        output_variable = params.get('output_variable', 'resultat_analyse')
+        timeout_ms = params.get('timeout_ms', 60000)
+        temperature = params.get('temperature', 0.3)
+
+        # Get the anchor image
+        screenshot_base64 = anchor.get('screenshot')
+
+        if not screenshot_base64:
+            # Capture the screen if the anchor has no image
+            try:
+                from PIL import ImageGrab
+                import io
+
+                bbox = anchor.get('bounding_box', {})
+                if bbox:
+                    # Capture the specific zone
+                    x, y = int(bbox.get('x', 0)), int(bbox.get('y', 0))
+                    w, h = int(bbox.get('width', 100)), int(bbox.get('height', 100))
+                    screenshot = ImageGrab.grab(bbox=(x, y, x + w, y + h))
+                else:
+                    # Capture the whole screen
+                    screenshot = ImageGrab.grab()
+
+                buffer = io.BytesIO()
+                screenshot.save(buffer, format='PNG')
+                screenshot_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
+            except Exception as cap_err:
+                return {'success': False, 'error': f"Capture error: {cap_err}"}
+
+        if not prompt:
+            prompt = "Describe what you see in this image."
+
+        print(f"🤖 [AI] Analyzing with {model}...")
+        print(f"   Prompt: {prompt[:80]}...")
+
+        # Call Ollama
+        ollama_url = params.get('ollama_url', 'http://localhost:11434')
+
+        payload = {
+            "model": model,
+            "prompt": prompt,
+            "images": [screenshot_base64],
+            "stream": False,
+            "options": {
+                "temperature": temperature,
+                "num_predict": 1000
+            }
+        }
+
+        response = requests.post(
+            f"{ollama_url}/api/generate",
+            json=payload,
+            timeout=timeout_ms / 1000
+        )
+
+        if response.status_code == 200:
+            result = response.json()
+            analysis_text = result.get('response', '').strip()
+
+            print(f"✅ [AI] Analysis finished ({len(analysis_text)} characters)")
+            print(f"   Result: {analysis_text[:150]}...")
+
+            # Store the result in the execution context for variables
+            global _execution_state
+            if 'variables' not in _execution_state:
+                _execution_state['variables'] = {}
+            _execution_state['variables'][output_variable] = analysis_text
+
+            return {
+                'success': True,
+                'output': {
+                    'analysis': analysis_text,
+                    'variable': output_variable,
+                    'model': model
+                }
+            }
+        else:
+            return {'success': False, 'error': f"Ollama error: {response.status_code}"}
+
+    except requests.exceptions.Timeout:
+        return {'success': False, 'error': f"Ollama timeout after {timeout_ms}ms"}
+    except requests.exceptions.ConnectionError:
+        return {'success': False, 'error': "Ollama unreachable (check that it is running)"}
+    except Exception as e:
+        return {'success': False, 'error': str(e)}
+
+
 def execute_action(action_type: str, params: dict) -> dict:
     """
     Execute an RPA action.
     Uses pyautogui for interactions.
+    In intelligent/debug mode, uses vision to locate elements.
     """
     import pyautogui
     import time
+
+    execution_mode = _execution_state.get('execution_mode', 'basic')
+
     try:
         if action_type in ['click_anchor', 'click', 'double_click_anchor', 'right_click_anchor']:
             # Get the coordinates from the anchor
             anchor = params.get('visual_anchor', {})
             bbox = anchor.get('bounding_box', {})
+            screenshot_base64 = anchor.get('screenshot')
+
             if not bbox:
                 return {'success': False, 'error': 'No bounding_box in visual_anchor'}
 
-            # Compute the center
+            # Determine the click type
+            click_type = 'left'
+            if action_type == 'double_click_anchor':
+                click_type = 'double'
+            elif action_type == 'right_click_anchor':
+                click_type = 'right'
+
+            # === INTELLIGENT / DEBUG MODE ===
+            if execution_mode in ['intelligent', 'debug'] and screenshot_base64:
+                try:
+                    from services.intelligent_executor import find_and_click
+
+                    print(f"🧠 [Action] Mode {execution_mode}: visual search for the anchor...")
+
+                    # Convert bbox to the expected format
+                    anchor_bbox = {
+                        'x': bbox.get('x', 0),
+                        'y': bbox.get('y', 0),
+                        'width': bbox.get('width', 0),
+                        'height': bbox.get('height', 0)
+                    }
+
+                    # Locate the anchor with vision (CLIP + position - see VISION_RPA_INTELLIGENT.md)
+                    result = find_and_click(
+                        anchor_image_base64=screenshot_base64,
+                        anchor_bbox=anchor_bbox,
+                        method='clip',  # UI-DETR-1 + CLIP with distance weighting
+                        detection_threshold=0.35
+                    )
+
+                    if result['found'] and result['coordinates']:
+                        x, y = result['coordinates']['x'], result['coordinates']['y']
+                        confidence = result['confidence']
+
+                        print(f"✅ [Vision] Anchor found at ({x}, {y}) - confidence: {confidence:.2f}")
+
+                        # Perform the click
+                        if click_type == 'double':
+                            pyautogui.doubleClick(x, y)
+                        elif click_type == 'right':
+                            pyautogui.rightClick(x, y)
+                        else:
+                            pyautogui.click(x, y)
+
+                        # Delay after the click so the application can react
+                        # 2 seconds to give applications time to open
+                        time.sleep(2.0)
+
+                        return {
+                            'success': True,
+                            'output': {
+                                'clicked_at': {'x': x, 'y': y},
+                                'mode': execution_mode,
+                                'confidence': confidence,
+                                'method': result.get('method', 'template')
+                            }
+                        }
+                    else:
+                        # In intelligent/debug mode, refuse to fall back to static coordinates
+                        # when the anchor is not found - this avoids clicks in the wrong place
+                        reason = result.get('reason', 'Anchor not found on screen')
+                        confidence = result.get('confidence', 0)
+                        print(f"❌ [Vision] Anchor NOT found (confidence: {confidence:.2f})")
+                        print(f"   Reason: {reason}")
+                        return {
+                            'success': False,
+                            'error': f"Anchor not found on screen (confidence: {confidence:.2f}). {reason}"
+                        }
+
+                except Exception as vision_err:
+                    print(f"❌ [Vision] Error: {vision_err}")
+                    return {
+                        'success': False,
+                        'error': f"Vision error: {vision_err}"
+                    }
+
+            # === BASIC MODE (or fallback) ===
+            # Compute the center from the static coordinates
             x = bbox.get('x', 0) + bbox.get('width', 0) / 2
             y = bbox.get('y', 0) + bbox.get('height', 0) / 2
 
-            # TODO: Use visual detection (OmniParser/VLM) here
-            # For now, static coordinates are used
-            print(f"🖱️ [Action] Click at ({x}, {y})")
+            print(f"🖱️ [Action] {click_type} click at ({x}, {y}) [mode: {execution_mode}]")
 
-            if action_type == 'double_click_anchor':
+            if click_type == 'double':
                 pyautogui.doubleClick(x, y)
-            elif action_type == 'right_click_anchor':
+            elif click_type == 'right':
                 pyautogui.rightClick(x, y)
             else:
                 pyautogui.click(x, y)
 
-            return {'success': True, 'output': {'clicked_at': {'x': x, 'y': y}}}
+            return {'success': True, 'output': {'clicked_at': {'x': x, 'y': y}, 'mode': execution_mode}}
 
         elif action_type in ['type_text', 'type']:
             text = params.get('text', '')
             if not text:
                 return {'success': False, 'error': 'No text to type'}
 
-            print(f"⌨️ [Action] Typing: {text[:30]}...")
+            # Replace {{variable}} placeholders with their value
+            import re
+            variables = _execution_state.get('variables', {})
+
+            def replace_var(match):
+                var_name = match.group(1)
+                value = variables.get(var_name, match.group(0))  # Keep {{var}} if not found
+                print(f"   📌 Variable {{{{{var_name}}}}} → {str(value)[:50]}...")
+                return str(value)
+
+            text = re.sub(r'\{\{(\w+)\}\}', replace_var, text)
+
+            print(f"⌨️ [Action] Typing: {text[:50]}...")
+
+            # Clear the field first if requested
+            if params.get('clear_before', False):
+                pyautogui.hotkey('ctrl', 'a')
+                time.sleep(0.1)
+
             # Small delay to make sure focus is right
             time.sleep(0.2)
 
-            if text.isascii():
-                pyautogui.typewrite(text, interval=0.05)
-            else:
-                pyautogui.write(text)
+            # Use write() to support unicode (French characters, etc.)
+            pyautogui.write(text)
 
-            return {'success': True, 'output': {'typed': text}}
+            return {'success': True, 'output': {'typed': text[:100] + '...' if len(text) > 100 else text}}
 
         elif action_type in ['wait_for_anchor', 'wait']:
             timeout_ms = params.get('timeout_ms', params.get('timeout', 5000))
@@ -269,6 +483,10 @@ def execute_action(action_type: str, params: dict) -> dict:
             pyautogui.hotkey(*keys)
             return {'success': True, 'output': {'hotkey': keys}}
 
+        elif action_type == 'ai_analyze_text':
+            # Text analysis with AI (Ollama)
+            return execute_ai_analyze(params)
+
         else:
             return {'success': False, 'error': f"Unsupported action type: {action_type}"}
@@ -297,6 +515,12 @@ def start_execution():
     data = request.get_json() or {}
     workflow_id = data.get('workflow_id')
+    execution_mode = data.get('execution_mode', 'basic')
+    minimize_browser = data.get('minimize_browser', True)  # Enabled by default
+
+    # Validate the mode
+    if execution_mode not in ['basic', 'intelligent', 'debug']:
+        execution_mode = 'basic'
+
     # Use the active workflow if none specified
     if not workflow_id:
@@ -340,6 +564,13 @@ def start_execution():
     _execution_state['is_paused'] = False
     _execution_state['should_stop'] = False
     _execution_state['current_execution_id'] = execution.id
+    _execution_state['execution_mode'] = execution_mode
+
+    print(f"🎯 [API v3] Execution mode: {execution_mode}")
+
+    # Minimize the browser window if requested
+    if minimize_browser:
+        minimize_active_window()
+
     # Launch the execution thread
     from flask import current_app
@@ -474,6 +705,7 @@ def get_execution_status():
         'success': True,
         'is_running': _execution_state['is_running'],
         'is_paused': _execution_state['is_paused'],
+        'execution_mode': _execution_state.get('execution_mode', 'basic'),
         'execution': execution.to_dict() if execution else None,
         'session': session.to_dict()
     })
visual_workflow_builder/backend/services/intelligent_executor.py — 816 lines (new file)

@@ -0,0 +1,816 @@
"""
|
||||||
|
Service d'exécution intelligente pour VWB
|
||||||
|
Utilise UI-DETR-1 pour la détection et le matching d'ancres visuelles
|
||||||
|
"""
|
||||||
|
|
||||||
|
import time
|
||||||
|
import base64
|
||||||
|
import io
|
||||||
|
from typing import Dict, Any, Optional, List, Tuple
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from PIL import Image
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
# Import du service de détection UI
|
||||||
|
from .ui_detection_service import detect_ui_elements, DetectionResult, UIElement
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class MatchResult:
|
||||||
|
"""Résultat de matching d'ancre"""
|
||||||
|
found: bool
|
||||||
|
confidence: float
|
||||||
|
element: Optional[UIElement]
|
||||||
|
center: Optional[Dict[str, int]]
|
||||||
|
bbox: Optional[Dict[str, int]]
|
||||||
|
method: str
|
||||||
|
search_time_ms: float
|
||||||
|
all_candidates: List[Dict[str, Any]]
|
||||||
|
|
||||||
|
|
||||||
|
class IntelligentExecutor:
|
||||||
|
"""
|
||||||
|
Exécuteur intelligent qui utilise la vision pour localiser les éléments.
|
||||||
|
|
||||||
|
Modes de matching:
|
||||||
|
1. Template matching (comparaison pixel)
|
||||||
|
2. Embedding similarity (CLIP - à implémenter)
|
||||||
|
3. Position-based fallback (si template échoue)
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, detection_threshold: float = 0.35):
|
||||||
|
self.detection_threshold = detection_threshold
|
||||||
|
self._clip_model = None # Lazy loading
|
||||||
|
|
||||||
|
def find_anchor_in_screen(
|
||||||
|
self,
|
||||||
|
screen_image: Image.Image,
|
||||||
|
anchor_image: Image.Image,
|
||||||
|
anchor_bbox: Optional[Dict[str, int]] = None,
|
||||||
|
method: str = 'clip'
|
||||||
|
) -> MatchResult:
|
||||||
|
"""
|
||||||
|
Trouve une ancre visuelle dans l'écran actuel.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
screen_image: Screenshot actuel (PIL Image)
|
||||||
|
anchor_image: Image de l'ancre à trouver (PIL Image)
|
||||||
|
anchor_bbox: Bounding box originale de l'ancre (pour fallback)
|
||||||
|
method: Méthode de matching ('template', 'clip', 'hybrid')
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
MatchResult avec les coordonnées si trouvé
|
||||||
|
"""
|
||||||
|
start_time = time.time()
|
||||||
|
|
||||||
|
# Étape 1: Détecter tous les éléments UI avec UI-DETR-1
|
||||||
|
detection_result = detect_ui_elements(screen_image, self.detection_threshold)
|
||||||
|
|
||||||
|
if len(detection_result.elements) == 0:
|
||||||
|
return MatchResult(
|
||||||
|
found=False,
|
||||||
|
confidence=0.0,
|
||||||
|
element=None,
|
||||||
|
center=None,
|
||||||
|
bbox=None,
|
||||||
|
method=method,
|
||||||
|
search_time_ms=(time.time() - start_time) * 1000,
|
||||||
|
all_candidates=[]
|
||||||
|
)
|
||||||
|
|
||||||
|
# Étape 2: Matcher l'ancre avec les éléments détectés
|
||||||
|
if method == 'template':
|
||||||
|
match = self._template_match(screen_image, anchor_image, detection_result.elements)
|
||||||
|
elif method == 'clip':
|
||||||
|
# CLIP avec pondération par position originale
|
||||||
|
match = self._clip_match(screen_image, anchor_image, detection_result.elements, anchor_bbox)
|
||||||
|
elif method == 'hybrid':
|
||||||
|
# Essayer CLIP d'abord (conforme au doc), puis template si échec
|
||||||
|
match = self._clip_match(screen_image, anchor_image, detection_result.elements, anchor_bbox)
|
||||||
|
if not match['found'] or match['confidence'] < 0.5:
|
||||||
|
template_match = self._template_match(screen_image, anchor_image, detection_result.elements)
|
||||||
|
if template_match['confidence'] > match['confidence']:
|
||||||
|
match = template_match
|
||||||
|
else:
|
||||||
|
# Fallback sur position si méthode inconnue
|
||||||
|
match = self._position_fallback(detection_result.elements, anchor_bbox, screen_image.size)
|
||||||
|
|
||||||
|
search_time_ms = (time.time() - start_time) * 1000
|
||||||
|
|
||||||
|
if match['found']:
|
||||||
|
elem = match['element']
|
||||||
|
return MatchResult(
|
||||||
|
found=True,
|
||||||
|
confidence=match['confidence'],
|
||||||
|
element=elem,
|
||||||
|
center={'x': elem.center['x'], 'y': elem.center['y']},
|
||||||
|
bbox=elem.bbox,
|
||||||
|
method=match['method'],
|
||||||
|
search_time_ms=search_time_ms,
|
||||||
|
all_candidates=match.get('candidates', [])
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
return MatchResult(
|
||||||
|
found=False,
|
||||||
|
confidence=match.get('confidence', 0.0),
|
||||||
|
element=None,
|
||||||
|
center=None,
|
||||||
|
bbox=None,
|
||||||
|
method=match['method'],
|
||||||
|
search_time_ms=search_time_ms,
|
||||||
|
all_candidates=match.get('candidates', [])
|
||||||
|
)
|
||||||
|
|
||||||
|
    def _template_match(
        self,
        screen_image: Image.Image,
        anchor_image: Image.Image,
        elements: List[UIElement]
    ) -> Dict[str, Any]:
        """
        Pixel-based template matching.
        Compares the anchor against every detected element.
        """
        import cv2

        # Convert the anchor to numpy
        anchor_np = np.array(anchor_image.convert('RGB'))
        anchor_gray = cv2.cvtColor(anchor_np, cv2.COLOR_RGB2GRAY)
        anchor_h, anchor_w = anchor_gray.shape

        # Convert the screen to numpy
        screen_np = np.array(screen_image.convert('RGB'))
        screen_gray = cv2.cvtColor(screen_np, cv2.COLOR_RGB2GRAY)

        best_match = None
        best_score = 0.0
        candidates = []

        for elem in elements:
            # Extract the element region
            x1, y1 = elem.bbox['x1'], elem.bbox['y1']
            x2, y2 = elem.bbox['x2'], elem.bbox['y2']

            # Make sure the coordinates are valid
            x1 = max(0, x1)
            y1 = max(0, y1)
            x2 = min(screen_gray.shape[1], x2)
            y2 = min(screen_gray.shape[0], y2)

            if x2 <= x1 or y2 <= y1:
                continue

            elem_region = screen_gray[y1:y2, x1:x2]

            # Skip regions too small for matching
            elem_h, elem_w = elem_region.shape
            if elem_h < 5 or elem_w < 5:
                continue

            try:
                # Resize the anchor to the element's size for comparison
                anchor_resized = cv2.resize(anchor_gray, (elem_w, elem_h))

                # Compute similarity (normalized cross-correlation)
                result = cv2.matchTemplate(elem_region, anchor_resized, cv2.TM_CCOEFF_NORMED)
                score = float(np.max(result))

                candidates.append({
                    'element_id': elem.id,
                    'score': score,
                    'bbox': elem.bbox
                })

                if score > best_score:
                    best_score = score
                    best_match = elem

            except Exception:
                # Ignore matching errors for this element
                continue

        # Sort candidates by score
        candidates.sort(key=lambda x: x['score'], reverse=True)

        return {
            'found': best_score > 0.5,  # Template matching threshold
            'confidence': best_score,
            'element': best_match,
            'method': 'template_matching',
            'candidates': candidates[:5]  # Top 5
        }

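`cv2.TM_CCOEFF_NORMED` computes a zero-mean normalized cross-correlation; for two same-size patches (the per-element case above, where the anchor is resized to the element region) it reduces to a single correlation coefficient in [-1, 1]. A numpy-only sketch of that single-position formula:

```python
import numpy as np

def ccoeff_normed(patch: np.ndarray, template: np.ndarray) -> float:
    """Zero-mean normalized cross-correlation for two same-size patches,
    i.e. the single-position case of cv2.TM_CCOEFF_NORMED."""
    a = patch.astype(np.float64) - patch.mean()
    b = template.astype(np.float64) - template.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, (24, 24))
print(round(ccoeff_normed(patch, patch), 3))        # identical patches -> 1.0
print(round(ccoeff_normed(patch, 255 - patch), 3))  # inverted patch -> -1.0
```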
    def _clip_match(
        self,
        screen_image: Image.Image,
        anchor_image: Image.Image,
        elements: List[UIElement],
        anchor_bbox: Optional[Dict[str, int]] = None
    ) -> Dict[str, Any]:
        """
        Matching by CLIP embedding similarity, weighted by distance.
        Combines the semantic score with proximity to the original position.

        Strict thresholds to avoid false positives:
        - MAX_DISTANCE_PX: absolute maximum distance (120px)
        - MIN_CLIP_SCORE: minimum CLIP score (0.55)
        - MIN_COMBINED_SCORE: minimum combined score (0.5)
        """
        # === BALANCED THRESHOLDS ===
        # Allow reasonable variation while avoiding false positives
        MAX_DISTANCE_PX = 120     # Reject any element > 120px from the original position
        MIN_CLIP_SCORE = 0.55     # Minimum required CLIP score (0.55 = reasonable similarity)
        MIN_COMBINED_SCORE = 0.5  # Minimum combined score to accept a match

        try:
            # Try to import and use CLIP
            from core.embedding.clip_embedder import CLIPEmbedder

            if self._clip_model is None:
                print("🔄 [CLIP] Chargement du modèle CLIP...")
                self._clip_model = CLIPEmbedder()
                print("✅ [CLIP] Modèle chargé")

            # Original anchor position (for weighting)
            anchor_center_x = None
            anchor_center_y = None
            if anchor_bbox:
                anchor_center_x = anchor_bbox.get('x', 0) + anchor_bbox.get('width', 0) // 2
                anchor_center_y = anchor_bbox.get('y', 0) + anchor_bbox.get('height', 0) // 2
                print(f"📍 [CLIP] Position originale de l'ancre: ({anchor_center_x}, {anchor_center_y})")

            # Screen diagonal, used to normalize distances
            screen_diagonal = np.sqrt(screen_image.width ** 2 + screen_image.height ** 2)

            # Embed the anchor
            anchor_embedding = self._clip_model.embed_image(anchor_image)

            best_match = None
            best_combined_score = 0.0
            candidates = []
            rejected_candidates = []  # Debug: keep track of rejected candidates

            print(f"🔍 [CLIP] {len(elements)} éléments détectés par UI-DETR-1")

            for elem in elements:
                # Extract the element region
                x1, y1 = elem.bbox['x1'], elem.bbox['y1']
                x2, y2 = elem.bbox['x2'], elem.bbox['y2']

                elem_crop = screen_image.crop((x1, y1, x2, y2))

                # Embed the element
                elem_embedding = self._clip_model.embed_image(elem_crop)

                # Cosine similarity (semantic CLIP score)
                clip_score = float(np.dot(anchor_embedding, elem_embedding) /
                                   (np.linalg.norm(anchor_embedding) * np.linalg.norm(elem_embedding)))

                # Distance weighting, when the original position is known
                distance_factor = 1.0
                distance = None
                rejected_reason = None

                if anchor_center_x is not None and anchor_center_y is not None:
                    elem_center_x = (x1 + x2) // 2
                    elem_center_y = (y1 + y2) // 2
                    distance = np.sqrt(
                        (elem_center_x - anchor_center_x) ** 2 +
                        (elem_center_y - anchor_center_y) ** 2
                    )

                    # Distance weighting
                    normalized_distance = distance / screen_diagonal
                    distance_factor = max(0.2, 1.0 - (normalized_distance * 5.0))

                    # STRICT REJECTION: distance > MAX_DISTANCE_PX
                    if distance > MAX_DISTANCE_PX:
                        rejected_reason = f"distance {distance:.0f}px > {MAX_DISTANCE_PX}px"
                        rejected_candidates.append({
                            'element_id': elem.id,
                            'clip_score': clip_score,
                            'distance': distance,
                            'reason': rejected_reason,
                            'center': {'x': elem_center_x, 'y': elem_center_y}
                        })
                        continue

                # STRICT REJECTION: CLIP score < MIN_CLIP_SCORE
                if clip_score < MIN_CLIP_SCORE:
                    rejected_reason = f"CLIP {clip_score:.2f} < {MIN_CLIP_SCORE}"
                    rejected_candidates.append({
                        'element_id': elem.id,
                        'clip_score': clip_score,
                        'distance': distance,
                        'reason': rejected_reason,
                        'center': {'x': (x1 + x2) // 2, 'y': (y1 + y2) // 2}
                    })
                    continue

                # Combined score: CLIP * distance_factor
                combined_score = clip_score * distance_factor

                candidates.append({
                    'element_id': elem.id,
                    'clip_score': clip_score,
                    'distance': distance,
                    'distance_factor': distance_factor,
                    'combined_score': combined_score,
                    'bbox': elem.bbox
                })

                if combined_score > best_combined_score:
                    best_combined_score = combined_score
                    best_match = elem

            # Sort by combined score
            candidates.sort(key=lambda x: x['combined_score'], reverse=True)

            # Debug logging
            if candidates:
                top = candidates[0]
                print(f"🎯 [CLIP] Meilleur candidat: {top['element_id']} "
                      f"(CLIP: {top['clip_score']:.2f}, distance: {top.get('distance', 'N/A'):.0f}px, "
                      f"combiné: {top['combined_score']:.2f})")
            else:
                print(f"⚠️ [CLIP] Aucun candidat valide ({len(rejected_candidates)} rejetés)")
                # Show the 3 best rejected candidates to understand the failure
                rejected_candidates.sort(key=lambda x: x['clip_score'], reverse=True)
                for i, rej in enumerate(rejected_candidates[:3]):
                    print(f"  📊 Rejeté #{i+1}: elem={rej['element_id']} CLIP={rej['clip_score']:.2f} "
                          f"dist={rej.get('distance', 'N/A')}px pos=({rej['center']['x']},{rej['center']['y']}) "
                          f"→ {rej['reason']}")

            # Final check against the strict combined threshold
            found = best_combined_score >= MIN_COMBINED_SCORE
            if not found and best_match:
                print(f"⛔ [CLIP] Match rejeté: score combiné {best_combined_score:.2f} < {MIN_COMBINED_SCORE}")

            return {
                'found': found,
                'confidence': best_combined_score,
                'element': best_match if found else None,
                'method': 'clip_embedding',
                'candidates': [{'element_id': c['element_id'], 'score': c['combined_score'], 'bbox': c['bbox']}
                               for c in candidates[:5]]
            }

        except ImportError:
            # CLIP unavailable, fall back to template matching
            print("⚠️ CLIP non disponible, fallback sur template matching")
            return self._template_match(screen_image, anchor_image, elements)
        except Exception as e:
            print(f"⚠️ Erreur CLIP: {e}, fallback sur template matching")
            return self._template_match(screen_image, anchor_image, elements)

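The gating and weighting above can be isolated into a small pure function: a candidate is rejected outright beyond 120px or below a CLIP score of 0.55, otherwise its CLIP score is multiplied by a distance factor that decays with the fraction of the screen diagonal covered. A sketch (`combined_clip_score` is an illustrative helper, not part of the module):

```python
import numpy as np

def combined_clip_score(clip_score, distance, screen_diagonal,
                        max_distance_px=120, min_clip=0.55):
    """Reproduces the candidate gating above: hard distance/score
    rejections, then clip_score * distance_factor."""
    if distance > max_distance_px or clip_score < min_clip:
        return None  # rejected candidate
    distance_factor = max(0.2, 1.0 - (distance / screen_diagonal) * 5.0)
    return clip_score * distance_factor

diag = float(np.hypot(1920, 1080))
print(combined_clip_score(0.80, 0.0, diag))    # same spot: full score 0.8
print(combined_clip_score(0.80, 200.0, diag))  # too far: None
print(combined_clip_score(0.50, 10.0, diag))   # weak CLIP score: None
```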
    def _position_fallback(
        self,
        elements: List[UIElement],
        anchor_bbox: Optional[Dict[str, int]],
        screen_size: Tuple[int, int]
    ) -> Dict[str, Any]:
        """
        Position-based fallback.
        Finds the element closest to the anchor's original position.
        """
        if not anchor_bbox or not elements:
            return {
                'found': False,
                'confidence': 0.0,
                'element': None,
                'method': 'position_fallback',
                'candidates': []
            }

        # Original anchor position
        anchor_center_x = anchor_bbox.get('x', 0) + anchor_bbox.get('width', 0) // 2
        anchor_center_y = anchor_bbox.get('y', 0) + anchor_bbox.get('height', 0) // 2

        best_match = None
        best_distance = float('inf')
        candidates = []

        for elem in elements:
            # Distance between the element center and the original position
            distance = np.sqrt(
                (elem.center['x'] - anchor_center_x) ** 2 +
                (elem.center['y'] - anchor_center_y) ** 2
            )

            candidates.append({
                'element_id': elem.id,
                'distance': distance,
                'bbox': elem.bbox
            })

            if distance < best_distance:
                best_distance = distance
                best_match = elem

        candidates.sort(key=lambda x: x['distance'])

        # Confidence score based on distance:
        # the closer the element, the higher the confidence
        max_distance = np.sqrt(screen_size[0]**2 + screen_size[1]**2)
        confidence = max(0, 1 - (best_distance / (max_distance * 0.1)))  # 10% of the screen = confidence 0

        return {
            'found': best_distance < max_distance * 0.05,  # Max 5% of the diagonal
            'confidence': confidence,
            'element': best_match,
            'method': 'position_fallback',
            'candidates': [{'element_id': c['element_id'], 'score': 1 / (1 + c['distance']), 'bbox': c['bbox']}
                           for c in candidates[:5]]
        }

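The fallback's acceptance rule and confidence formula are worth seeing in isolation: confidence decays linearly and reaches zero at 10% of the screen diagonal, while a match is only accepted under 5% of it. A minimal sketch (`position_confidence` is an illustrative helper, not module code):

```python
import math

def position_confidence(distance, screen_w, screen_h):
    """Confidence/acceptance rule of the position fallback: confidence
    hits 0 at 10% of the diagonal, matches are accepted under 5% of it."""
    diag = math.hypot(screen_w, screen_h)
    confidence = max(0.0, 1.0 - distance / (diag * 0.1))
    return confidence, distance < diag * 0.05

conf, found = position_confidence(0.0, 1920, 1080)
print(conf, found)            # perfect hit: 1.0 True
conf, found = position_confidence(300.0, 1920, 1080)
print(round(conf, 2), found)  # ~13.6% of the diagonal: 0.0 False
```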
def direct_template_match(
    screen_image: Image.Image,
    anchor_image: Image.Image,
    threshold: float = 0.7
) -> Dict[str, Any]:
    """
    Direct template matching over the whole screen.
    More reliable than matching through UI-DETR-1 because it does not depend on detection.
    """
    import cv2

    # Convert to grayscale numpy arrays
    screen_np = np.array(screen_image.convert('RGB'))
    screen_gray = cv2.cvtColor(screen_np, cv2.COLOR_RGB2GRAY)

    anchor_np = np.array(anchor_image.convert('RGB'))
    anchor_gray = cv2.cvtColor(anchor_np, cv2.COLOR_RGB2GRAY)
    anchor_h, anchor_w = anchor_gray.shape

    # Multi-scale template matching
    best_score = 0.0
    best_loc = None
    best_scale = 1.0

    # Try several scales (0.8x to 1.2x)
    for scale in [1.0, 0.95, 1.05, 0.9, 1.1, 0.85, 1.15, 0.8, 1.2]:
        # Resize the anchor
        scaled_w = int(anchor_w * scale)
        scaled_h = int(anchor_h * scale)
        if scaled_w < 10 or scaled_h < 10:
            continue
        if scaled_w > screen_gray.shape[1] or scaled_h > screen_gray.shape[0]:
            continue

        anchor_scaled = cv2.resize(anchor_gray, (scaled_w, scaled_h))

        # Template matching
        result = cv2.matchTemplate(screen_gray, anchor_scaled, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)

        if max_val > best_score:
            best_score = max_val
            best_loc = max_loc
            best_scale = scale

    if best_loc and best_score >= threshold:
        # Compute the center
        center_x = best_loc[0] + int(anchor_w * best_scale / 2)
        center_y = best_loc[1] + int(anchor_h * best_scale / 2)

        return {
            'found': True,
            'confidence': best_score,
            'coordinates': {'x': center_x, 'y': center_y},
            'bbox': {
                'x': best_loc[0],
                'y': best_loc[1],
                'width': int(anchor_w * best_scale),
                'height': int(anchor_h * best_scale)
            },
            'method': 'direct_template',
            'scale': best_scale
        }

    return {
        'found': False,
        'confidence': best_score,
        'coordinates': None,
        'bbox': None,
        'method': 'direct_template'
    }

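The hand-written scale list above is ordered by closeness to 1.0, so the most likely scales are tried first. A sketch of a generator producing the same ordering (`scale_candidates` is hypothetical, not part of the module):

```python
def scale_candidates(step=0.05, max_dev=0.2):
    """Generate scales ordered by distance from 1.0, matching the
    hand-written list [1.0, 0.95, 1.05, 0.9, 1.1, ...] above."""
    scales = [1.0]
    k = 1
    while k * step <= max_dev + 1e-9:
        scales += [round(1.0 - k * step, 2), round(1.0 + k * step, 2)]
        k += 1
    return scales

print(scale_candidates())  # [1.0, 0.95, 1.05, 0.9, 1.1, 0.85, 1.15, 0.8, 1.2]
```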
def zoned_template_match(
    screen_image: Image.Image,
    anchor_image: Image.Image,
    anchor_bbox: Dict[str, int],
    zone_margin: int = 100,        # Reduced from 200 to 100 to be stricter
    threshold: float = 0.6,
    distance_weight: float = 0.15  # Weight of the proximity bonus
) -> Dict[str, Any]:
    """
    Template matching inside a zone around the original position.
    Faster, and avoids false positives far from the target.

    The final score combines:
    - the template matching score (85%)
    - a proximity bonus toward the original position (15%)

    Args:
        screen_image: Full screenshot
        anchor_image: Anchor image
        anchor_bbox: Original position {x, y, width, height}
        zone_margin: Margin around the original position (pixels)
        threshold: Confidence threshold
        distance_weight: Weight of the proximity bonus (0-1)
    """
    import cv2
    import math

    # Original position
    orig_x = anchor_bbox.get('x', 0)
    orig_y = anchor_bbox.get('y', 0)
    orig_w = anchor_bbox.get('width', 100)
    orig_h = anchor_bbox.get('height', 100)

    # Original anchor center
    orig_center_x = orig_x + orig_w / 2
    orig_center_y = orig_y + orig_h / 2

    # Define the search zone (with the reduced margin)
    zone_x1 = max(0, orig_x - zone_margin)
    zone_y1 = max(0, orig_y - zone_margin)
    zone_x2 = min(screen_image.width, orig_x + orig_w + zone_margin)
    zone_y2 = min(screen_image.height, orig_y + orig_h + zone_margin)

    # Extract the zone
    zone_image = screen_image.crop((zone_x1, zone_y1, zone_x2, zone_y2))

    # Convert to grayscale
    zone_np = np.array(zone_image.convert('RGB'))
    zone_gray = cv2.cvtColor(zone_np, cv2.COLOR_RGB2GRAY)

    anchor_np = np.array(anchor_image.convert('RGB'))
    anchor_gray = cv2.cvtColor(anchor_np, cv2.COLOR_RGB2GRAY)
    anchor_h, anchor_w = anchor_gray.shape

    # Make sure the anchor fits inside the zone
    if anchor_w > zone_gray.shape[1] or anchor_h > zone_gray.shape[0]:
        return {'found': False, 'confidence': 0, 'method': 'zoned_template'}

    # Maximum possible distance inside the zone (for normalization)
    max_distance = math.sqrt(zone_margin**2 + zone_margin**2) * 2

    best_combined_score = 0.0
    best_template_score = 0.0
    best_loc = None
    best_scale = 1.0

    # Multi-scale
    for scale in [1.0, 0.95, 1.05, 0.9, 1.1]:
        scaled_w = int(anchor_w * scale)
        scaled_h = int(anchor_h * scale)
        if scaled_w < 10 or scaled_h < 10:
            continue
        if scaled_w > zone_gray.shape[1] or scaled_h > zone_gray.shape[0]:
            continue

        anchor_scaled = cv2.resize(anchor_gray, (scaled_w, scaled_h))
        result = cv2.matchTemplate(zone_gray, anchor_scaled, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)

        if max_val > 0.5:  # Minimum score worth considering
            # Match center in screen coordinates
            match_center_x = zone_x1 + max_loc[0] + scaled_w / 2
            match_center_y = zone_y1 + max_loc[1] + scaled_h / 2

            # Distance to the original center
            distance = math.sqrt((match_center_x - orig_center_x)**2 +
                                 (match_center_y - orig_center_y)**2)

            # Proximity bonus (1.0 if perfect, 0.0 if very far)
            proximity_bonus = max(0, 1.0 - distance / max_distance)

            # Combined score: template matching + proximity bonus
            combined_score = max_val * (1 - distance_weight) + proximity_bonus * distance_weight

            print(f"  📍 Match scale={scale:.2f}: template={max_val:.3f}, "
                  f"distance={distance:.0f}px, combined={combined_score:.3f}")

            if combined_score > best_combined_score:
                best_combined_score = combined_score
                best_template_score = max_val
                best_loc = max_loc
                best_scale = scale

    if best_loc and best_template_score >= threshold:
        # Convert to screen coordinates (add the zone offset)
        center_x = zone_x1 + best_loc[0] + int(anchor_w * best_scale / 2)
        center_y = zone_y1 + best_loc[1] + int(anchor_h * best_scale / 2)

        # === MAXIMUM DISTANCE CHECK ===
        # Reject any match too far from the original position
        MAX_TEMPLATE_DISTANCE = 150  # Absolute limit in pixels
        final_distance = math.sqrt((center_x - orig_center_x)**2 + (center_y - orig_center_y)**2)

        if final_distance > MAX_TEMPLATE_DISTANCE:
            print(f"  ⛔ Match rejeté: distance {final_distance:.0f}px > {MAX_TEMPLATE_DISTANCE}px max")
            return {
                'found': False,
                'confidence': best_template_score,
                'coordinates': None,
                'method': 'zoned_template',
                'reason': f'Distance {final_distance:.0f}px > {MAX_TEMPLATE_DISTANCE}px max'
            }

        print(f"  ✅ Meilleur match: ({center_x}, {center_y}) conf={best_template_score:.3f}, dist={final_distance:.0f}px")

        return {
            'found': True,
            'confidence': best_template_score,
            'coordinates': {'x': center_x, 'y': center_y},
            'bbox': {
                'x': zone_x1 + best_loc[0],
                'y': zone_y1 + best_loc[1],
                'width': int(anchor_w * best_scale),
                'height': int(anchor_h * best_scale)
            },
            'method': 'zoned_template',
            'zone': {'x1': zone_x1, 'y1': zone_y1, 'x2': zone_x2, 'y2': zone_y2}
        }

    return {
        'found': False,
        'confidence': best_template_score,
        'coordinates': None,
        'method': 'zoned_template'
    }

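The 85/15 blend of template score and proximity bonus can be checked numerically in isolation. A sketch of the scoring formula used above (`zoned_combined_score` is an illustrative helper, not module code):

```python
import math

def zoned_combined_score(template_score, match_center, orig_center,
                         zone_margin=100, distance_weight=0.15):
    """85/15 blend used by zoned_template_match: template score plus a
    proximity bonus normalized by the largest distance inside the zone."""
    max_distance = math.sqrt(2 * zone_margin ** 2) * 2
    distance = math.dist(match_center, orig_center)
    proximity = max(0.0, 1.0 - distance / max_distance)
    return template_score * (1 - distance_weight) + proximity * distance_weight

print(round(zoned_combined_score(0.8, (500, 300), (500, 300)), 3))  # on target: 0.83
print(round(zoned_combined_score(0.8, (640, 300), (500, 300)), 3))  # 140px away: 0.756
```

A match sitting exactly on the original position gets the full 0.15 bonus; one 140px away keeps most of its template score but loses about half the bonus, which is exactly what makes nearby candidates win ties.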
def find_and_click(
    anchor_image_base64: str,
    anchor_bbox: Optional[Dict[str, int]] = None,
    method: str = 'clip',
    detection_threshold: float = 0.35
) -> Dict[str, Any]:
    """
    Utility function: find an anchor and return the click coordinates.

    Available methods:
    - 'clip': UI-DETR-1 + CLIP (semantic matching, recommended)
    - 'zoned': zoned template matching (fallback)

    Args:
        anchor_image_base64: Anchor image as base64
        anchor_bbox: Original bounding box
        method: 'clip' for UI-DETR-1+CLIP, 'zoned' for zoned template matching
        detection_threshold: Detection threshold for UI-DETR-1

    Returns:
        Dict with found, coordinates, confidence, etc.
    """
    import time as _time
    start_time = _time.time()

    try:
        # Capture the current screen
        import mss

        with mss.mss() as sct:
            monitor = sct.monitors[1]  # Primary monitor
            screenshot = sct.grab(monitor)
            screen_image = Image.frombytes('RGB', screenshot.size, screenshot.bgra, 'raw', 'BGRX')

        # Decode the anchor image
        if ',' in anchor_image_base64:
            anchor_image_base64 = anchor_image_base64.split(',')[1]
        anchor_bytes = base64.b64decode(anchor_image_base64)
        anchor_image = Image.open(io.BytesIO(anchor_bytes))

        # === CLIP METHOD: UI-DETR-1 + CLIP (semantic matching) ===
        if method == 'clip':
            print("🧠 [Vision] Essai UI-DETR-1 + CLIP (matching sémantique)...")
            try:
                executor = IntelligentExecutor(detection_threshold=detection_threshold)
                clip_result = executor.find_anchor_in_screen(
                    screen_image=screen_image,
                    anchor_image=anchor_image,
                    anchor_bbox=anchor_bbox,
                    method='clip'
                )

                # clip_result.found is already gated by MIN_COMBINED_SCORE (0.5)
                # and the strict thresholds (MAX_DISTANCE_PX=120, MIN_CLIP_SCORE=0.55)
                if clip_result.found:
                    print(f"✅ [Vision] UI-DETR-1+CLIP réussi! Confiance: {clip_result.confidence:.2f}")
                    return {
                        'found': True,
                        'confidence': clip_result.confidence,
                        'coordinates': clip_result.center,
                        'bbox': clip_result.bbox,
                        'method': 'clip',
                        'search_time_ms': (_time.time() - start_time) * 1000
                    }
                else:
                    # Strict thresholds: MAX_DISTANCE=120px, MIN_CLIP=0.55, MIN_COMBINED=0.5
                    print(f"⚠️ [Vision] UI-DETR-1+CLIP: rejeté (confiance: {clip_result.confidence:.2f} < 0.5 ou distance > 120px)")
            except Exception as clip_err:
                print(f"⚠️ [Vision] Erreur UI-DETR-1+CLIP: {clip_err}")
                import traceback
                traceback.print_exc()

            # Fall back to zoned template matching if CLIP fails
            print("🔄 [Vision] Fallback sur template zonée...")

        # === ZONED STRATEGY: template matching inside a zone ===
        if anchor_bbox:
            print("🔍 [Vision] Essai Template zonée (100px)...")
            result = zoned_template_match(screen_image, anchor_image, anchor_bbox,
                                          zone_margin=100, threshold=0.7)
            if result['found']:
                print(f"✅ [Vision] Template zonée réussi! Confiance: {result['confidence']:.2f}")
                result['search_time_ms'] = (_time.time() - start_time) * 1000
                return result

            # === Widened zone on failure ===
            print("🔍 [Vision] Essai Template zonée élargie (200px)...")
            result = zoned_template_match(screen_image, anchor_image, anchor_bbox,
                                          zone_margin=200, threshold=0.6)
            if result['found']:
                print(f"✅ [Vision] Template zonée élargie réussi! Confiance: {result['confidence']:.2f}")
                result['search_time_ms'] = (_time.time() - start_time) * 1000
                return result

        # === GLOBAL STRATEGY: global template matching (strict threshold) ===
        print("🔍 [Vision] Essai Template global (seuil strict)...")
        global_result = direct_template_match(screen_image, anchor_image, threshold=0.75)

        if global_result['found']:
            # Make sure the result is not too far from the original position
            if anchor_bbox:
                orig_x = anchor_bbox.get('x', 0) + anchor_bbox.get('width', 0) // 2
                orig_y = anchor_bbox.get('y', 0) + anchor_bbox.get('height', 0) // 2
                found_x = global_result['coordinates']['x']
                found_y = global_result['coordinates']['y']
                distance = np.sqrt((found_x - orig_x)**2 + (found_y - orig_y)**2)

                # Reject if too far (> 150px from the original position)
                MAX_GLOBAL_DISTANCE = 150
                if distance > MAX_GLOBAL_DISTANCE:
                    print(f"⛔ [Vision] Template global rejeté: distance {distance:.0f}px > {MAX_GLOBAL_DISTANCE}px max")
                else:
                    print(f"✅ [Vision] Template global réussi! Confiance: {global_result['confidence']:.2f}")
                    global_result['search_time_ms'] = (_time.time() - start_time) * 1000
                    return global_result
            else:
                print(f"✅ [Vision] Template global réussi! Confiance: {global_result['confidence']:.2f}")
                global_result['search_time_ms'] = (_time.time() - start_time) * 1000
                return global_result

        # === STRATEGY 4: static coordinates (last resort) ===
        if anchor_bbox:
            best_conf = max(global_result.get('confidence', 0), 0)

            # Use static coordinates only if confidence >= 0.5
            if best_conf >= 0.5:
                print(f"⚠️ [Vision] Fallback: coordonnées statiques (confiance: {best_conf:.2f})")
                center_x = anchor_bbox.get('x', 0) + anchor_bbox.get('width', 0) // 2
                center_y = anchor_bbox.get('y', 0) + anchor_bbox.get('height', 0) // 2
                return {
                    'found': True,
                    'coordinates': {'x': int(center_x), 'y': int(center_y)},
                    'bbox': anchor_bbox,
                    'confidence': best_conf,
                    'method': 'static_fallback',
                    'search_time_ms': (_time.time() - start_time) * 1000,
                    'candidates': []
                }
            else:
                print(f"❌ [Vision] Ancre non trouvée (confiance: {best_conf:.2f})")
                return {
                    'found': False,
                    'coordinates': None,
                    'bbox': anchor_bbox,
                    'confidence': best_conf,
                    'method': 'not_found',
                    'search_time_ms': (_time.time() - start_time) * 1000,
                    'candidates': [],
                    'reason': 'Ancre non trouvée à l\'écran'
                }

        # No bbox, nothing to search with
        return {
            'found': False,
            'coordinates': None,
            'bbox': None,
            'confidence': 0,
            'method': 'no_bbox',
            'search_time_ms': (_time.time() - start_time) * 1000,
            'candidates': []
        }

    except Exception as e:
        print(f"❌ [Vision] Erreur: {e}")
        return {
            'found': False,
            'error': str(e),
            'coordinates': None,
            'confidence': 0.0
        }

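The chain above (CLIP, then zoned template, then widened zone, then global template, then static coordinates) is a first-hit cascade. A stripped-down sketch of that control flow, with stub strategies standing in for the real matchers (`cascade` is a hypothetical helper, not part of the module):

```python
def cascade(strategies):
    """Run matching strategies in order and return the first hit,
    mirroring find_and_click's clip -> zoned -> global -> static chain."""
    for name, fn in strategies:
        result = fn()
        if result.get('found'):
            result['method'] = name
            return result
    return {'found': False}

order = [
    ('clip', lambda: {'found': False}),
    ('zoned_template', lambda: {'found': True, 'confidence': 0.72}),
    ('direct_template', lambda: {'found': True, 'confidence': 0.9}),
]
print(cascade(order)['method'])  # zoned_template - the first strategy that hits
```

Ordering matters: the later global strategy could report a higher raw confidence, but the cascade deliberately trusts the stricter, position-aware strategies first.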
@@ -1,25 +1,33 @@
|
|||||||
"""
|
"""
|
||||||
Service de détection UI utilisant UI-DETR-1
|
Service de détection UI - Multi-backend
|
||||||
Détecte les éléments d'interface utilisateur dans un screenshot
|
Détecte les éléments d'interface utilisateur dans un screenshot
|
||||||
|
|
||||||
|
Backends supportés (par ordre de priorité):
|
||||||
|
1. UI-DETR-1 (rfdetr) - Le plus précis si disponible
|
||||||
|
2. OmniParser (Microsoft) - Fallback GPU, bonne précision
|
||||||
|
3. Désactivé - Message d'erreur explicite
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import os
|
import os
|
||||||
|
import sys
|
||||||
import time
|
import time
|
||||||
import base64
|
import base64
|
||||||
import io
|
import io
|
||||||
from typing import List, Dict, Any, Optional
|
from typing import List, Dict, Any, Optional, Tuple
|
||||||
from dataclasses import dataclass
|
from dataclasses import dataclass
|
||||||
import numpy as np
|
import numpy as np
|
||||||
from PIL import Image
|
from PIL import Image
|
||||||
|
|
||||||
# Configuration du modèle
|
# Configuration
|
||||||
MODEL_PATH = "/home/dom/ai/rpa_vision_v3/models/ui-detr-1/model.pth"
|
MODEL_PATH = "/home/dom/ai/rpa_vision_v3/models/ui-detr-1/model.pth"
|
||||||
CONFIDENCE_THRESHOLD = 0.35
|
CONFIDENCE_THRESHOLD = 0.35
|
||||||
RESOLUTION = 1600
|
RESOLUTION = 1600
|
||||||
|
|
||||||
# Instance globale du modèle (lazy loading)
|
# État des backends
|
||||||
_model = None
|
_rfdetr_model = None
|
||||||
_model_loading = False
|
_rfdetr_available = None # None = pas encore testé
|
||||||
|
_omniparser = None
|
||||||
|
_omniparser_available = False # DÉSACTIVÉ - on utilise uniquement UI-DETR-1
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
@@ -30,6 +38,7 @@ class UIElement:
|
|||||||
center: Dict[str, int] # x, y
|
center: Dict[str, int] # x, y
|
||||||
confidence: float
|
confidence: float
|
||||||
area: int
|
area: int
|
||||||
|
label: str = ""
|
||||||
|
|
||||||
def to_dict(self) -> Dict[str, Any]:
|
def to_dict(self) -> Dict[str, Any]:
|
||||||
return {
|
return {
|
||||||
@@ -37,7 +46,8 @@ class UIElement:
|
|||||||
"bbox": self.bbox,
|
"bbox": self.bbox,
|
||||||
"center": self.center,
|
"center": self.center,
|
||||||
"confidence": round(self.confidence, 3),
|
"confidence": round(self.confidence, 3),
|
||||||
"area": self.area
|
"area": self.area,
|
||||||
|
"label": self.label
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
@@ -47,55 +57,161 @@ class DetectionResult:
     elements: List[UIElement]
     processing_time_ms: float
     image_size: Dict[str, int]
-    model_name: str = "UI-DETR-1"
+    model_name: str = "unknown"
+    error: Optional[str] = None
 
     def to_dict(self) -> Dict[str, Any]:
-        return {
+        result = {
             "elements": [e.to_dict() for e in self.elements],
             "count": len(self.elements),
             "processing_time_ms": round(self.processing_time_ms, 1),
             "image_size": self.image_size,
             "model": self.model_name
         }
+        if self.error:
+            result["error"] = self.error
+        return result
 
 
-def load_model():
-    """Charge le modèle UI-DETR-1 (lazy loading)"""
-    global _model, _model_loading
-
-    if _model is not None:
-        return _model
-
-    if _model_loading:
-        # Attendre que le chargement soit terminé
-        while _model_loading and _model is None:
-            time.sleep(0.1)
-        return _model
-
-    _model_loading = True
-
-    try:
-        print(f"[UI-DETR-1] Chargement du modèle depuis {MODEL_PATH}...")
-        start = time.time()
-
-        from rfdetr.detr import RFDETRMedium
-
-        if not os.path.exists(MODEL_PATH):
-            raise FileNotFoundError(f"Modèle non trouvé: {MODEL_PATH}")
-
-        _model = RFDETRMedium(pretrain_weights=MODEL_PATH, resolution=RESOLUTION)
-
-        elapsed = time.time() - start
-        print(f"[UI-DETR-1] Modèle chargé en {elapsed:.1f}s")
-
-        return _model
-    except Exception as e:
-        print(f"[UI-DETR-1] Erreur chargement modèle: {e}")
-        _model_loading = False
-        raise
-    finally:
-        _model_loading = False
+# ==============================================================================
+# Backend 1: UI-DETR-1 (rfdetr)
+# ==============================================================================
+
+def _check_rfdetr_available() -> bool:
+    """Vérifie si rfdetr est disponible"""
+    global _rfdetr_available
+    if _rfdetr_available is not None:
+        return _rfdetr_available
+
+    try:
+        from rfdetr.detr import RFDETRMedium
+        _rfdetr_available = os.path.exists(MODEL_PATH)
+        if _rfdetr_available:
+            print(f"✅ [UI-Detection] Backend rfdetr disponible")
+        else:
+            print(f"⚠️ [UI-Detection] rfdetr installé mais modèle non trouvé: {MODEL_PATH}")
+    except ImportError:
+        print(f"⚠️ [UI-Detection] rfdetr non installé")
+        _rfdetr_available = False
+
+    return _rfdetr_available
+
+
+def _load_rfdetr():
+    """Charge le modèle rfdetr"""
+    global _rfdetr_model
+    if _rfdetr_model is not None:
+        return _rfdetr_model
+
+    from rfdetr.detr import RFDETRMedium
+    print(f"[UI-DETR-1] Chargement du modèle...")
+    start = time.time()
+    _rfdetr_model = RFDETRMedium(pretrain_weights=MODEL_PATH, resolution=RESOLUTION)
+    print(f"[UI-DETR-1] Modèle chargé en {time.time() - start:.1f}s")
+    return _rfdetr_model
+
+
+def _detect_with_rfdetr(image: Image.Image, threshold: float) -> Tuple[List[UIElement], str]:
+    """Détection avec rfdetr"""
+    model = _load_rfdetr()
+    image_np = np.array(image.convert('RGB'))
+    detections = model.predict(image_np, threshold=threshold)
+
+    elements = []
+    boxes = detections.xyxy
+    scores = detections.confidence
+
+    for i, (box, score) in enumerate(zip(boxes, scores)):
+        x1, y1, x2, y2 = map(int, box)
+        elements.append(UIElement(
+            id=i,
+            bbox={"x1": x1, "y1": y1, "x2": x2, "y2": y2},
+            center={"x": (x1 + x2) // 2, "y": (y1 + y2) // 2},
+            confidence=float(score),
+            area=(x2 - x1) * (y2 - y1)
+        ))
+
+    return elements, "UI-DETR-1"
+
+
+# ==============================================================================
+# Backend 2: OmniParser (Microsoft)
+# ==============================================================================
+
+def _check_omniparser_available() -> bool:
+    """Vérifie si OmniParser est disponible"""
+    global _omniparser_available, _omniparser
+    if _omniparser_available is not None:
+        return _omniparser_available
+
+    try:
+        # Ajouter les chemins nécessaires
+        if '/home/dom/ai/rpa_vision_v3' not in sys.path:
+            sys.path.insert(0, '/home/dom/ai/rpa_vision_v3')
+        if '/home/dom/ai/OmniParser' not in sys.path:
+            sys.path.insert(0, '/home/dom/ai/OmniParser')
+
+        from core.detection.omniparser_adapter import get_omniparser
+        _omniparser = get_omniparser()
+        _omniparser_available = _omniparser.available
+
+        if _omniparser_available:
+            print(f"✅ [UI-Detection] Backend OmniParser disponible")
+        else:
+            print(f"⚠️ [UI-Detection] OmniParser non disponible")
+    except Exception as e:
+        print(f"⚠️ [UI-Detection] Erreur chargement OmniParser: {e}")
+        _omniparser_available = False
+
+    return _omniparser_available
+
+
+def _detect_with_omniparser(image: Image.Image, threshold: float) -> Tuple[List[UIElement], str]:
+    """Détection avec OmniParser"""
+    global _omniparser
+
+    if _omniparser is None:
+        _check_omniparser_available()
+
+    if not _omniparser or not _omniparser.available:
+        raise RuntimeError("OmniParser non disponible")
+
+    # OmniParser détecte les éléments avec sa méthode detect()
+    detected = _omniparser.detect(image)
+
+    elements = []
+    for i, elem in enumerate(detected):
+        # DetectedElement a: bbox (tuple), label, confidence, center (tuple)
+        x1, y1, x2, y2 = elem.bbox
+        cx, cy = elem.center
+
+        # Filtrer par seuil de confiance
+        if elem.confidence < threshold:
+            continue
+
+        elements.append(UIElement(
+            id=i,
+            bbox={"x1": x1, "y1": y1, "x2": x2, "y2": y2},
+            center={"x": cx, "y": cy},
+            confidence=elem.confidence,
+            area=(x2 - x1) * (y2 - y1),
+            label=elem.label
+        ))
+
+    return elements, "OmniParser"
+
+
+# ==============================================================================
+# API Publique
+# ==============================================================================
+
+def get_available_backend() -> Optional[str]:
+    """Retourne le nom du backend disponible"""
+    if _check_rfdetr_available():
+        return "UI-DETR-1"
+    if _check_omniparser_available():
+        return "OmniParser"
+    return None
 
 
 def detect_ui_elements(
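The hunk above replaces a single blocking loader with per-backend availability probes whose result is cached in a module-level global, so the expensive import/filesystem check runs only once. A minimal sketch of that cache-the-probe pattern (simplified, with hypothetical names — `check_backend_available` and `_backend_available` are not identifiers from this commit):

```python
_backend_available = None  # None = not probed yet; True/False once probed


def check_backend_available() -> bool:
    """Probe for an optional dependency once, then serve the cached answer."""
    global _backend_available
    if _backend_available is not None:
        return _backend_available  # second and later calls are free
    try:
        import importlib
        importlib.import_module("math")  # stand-in for the real optional import
        _backend_available = True
    except ImportError:
        _backend_available = False
    return _backend_available
```

Using `None` as the "not probed yet" sentinel keeps `False` available as a real cached answer, which is why the real code tests `is not None` rather than truthiness.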
@@ -113,37 +229,33 @@ def detect_ui_elements(
         DetectionResult avec la liste des éléments détectés
     """
     start_time = time.time()
 
-    # Charger le modèle
-    model = load_model()
-
-    # Convertir en numpy array RGB
-    image_np = np.array(image.convert('RGB'))
-
-    # Exécuter la détection
-    detections = model.predict(image_np, threshold=threshold)
-
-    # Parser les résultats
     elements = []
-    boxes = detections.xyxy  # [x1, y1, x2, y2]
-    scores = detections.confidence
-
-    for i, (box, score) in enumerate(zip(boxes, scores)):
-        x1, y1, x2, y2 = map(int, box)
-
-        element = UIElement(
-            id=i,
-            bbox={"x1": x1, "y1": y1, "x2": x2, "y2": y2},
-            center={"x": (x1 + x2) // 2, "y": (y1 + y2) // 2},
-            confidence=float(score),
-            area=(x2 - x1) * (y2 - y1)
-        )
-        elements.append(element)
-
-    # Trier par position (haut-gauche vers bas-droite)
+    model_name = "none"
+    error = None
+
+    # Essayer rfdetr d'abord
+    if _check_rfdetr_available():
+        try:
+            elements, model_name = _detect_with_rfdetr(image, threshold)
+        except Exception as e:
+            print(f"⚠️ [UI-Detection] Erreur rfdetr: {e}, fallback OmniParser...")
+            error = str(e)
+
+    # Fallback OmniParser
+    if not elements and _check_omniparser_available():
+        try:
+            elements, model_name = _detect_with_omniparser(image, threshold)
+            error = None  # Reset error si fallback réussit
+        except Exception as e:
+            print(f"⚠️ [UI-Detection] Erreur OmniParser: {e}")
+            error = str(e)
+
+    # Aucun backend disponible
+    if not elements and error is None:
+        error = "Aucun backend de détection disponible (rfdetr ou OmniParser requis)"
+
+    # Trier par position
     elements.sort(key=lambda e: (e.bbox["y1"], e.bbox["x1"]))
 
-    # Réassigner les IDs après tri
     for i, elem in enumerate(elements):
         elem.id = i
 
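The new `detect_ui_elements` body above follows a try-primary/fall-back-to-secondary shape: run the first backend, and on failure or an empty result try the second, clearing the earlier error only if the fallback succeeds. A self-contained sketch of that control flow (simplified; `detect_with_fallback` is a hypothetical name, not part of the commit):

```python
def detect_with_fallback(primary, fallback):
    """Run the primary detector; on failure or empty result, try the fallback.

    Each detector is a zero-argument callable returning (elements, backend_name).
    The error of the last failing backend is kept, and cleared when the
    fallback succeeds — mirroring the logic in detect_ui_elements above.
    """
    elements, model_name, error = [], "none", None
    try:
        elements, model_name = primary()
    except Exception as e:
        error = str(e)

    if not elements and fallback is not None:
        try:
            elements, model_name = fallback()
            error = None  # fallback succeeded: drop the primary's error
        except Exception as e:
            error = str(e)

    if not elements and error is None:
        error = "no detection backend available"
    return elements, model_name, error
```

Returning the error alongside the (possibly empty) element list lets the caller distinguish "nothing detected" from "detection never ran", which is exactly what the added `error` field on `DetectionResult` is for.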
@@ -152,7 +264,9 @@ def detect_ui_elements(
     return DetectionResult(
         elements=elements,
         processing_time_ms=processing_time,
-        image_size={"width": image.width, "height": image.height}
+        image_size={"width": image.width, "height": image.height},
+        model_name=model_name,
+        error=error
     )
 
 
@@ -160,21 +274,11 @@ def detect_from_base64(
     image_base64: str,
     threshold: float = CONFIDENCE_THRESHOLD
 ) -> DetectionResult:
-    """
-    Détecte les éléments UI depuis une image base64
-
-    Args:
-        image_base64: Image encodée en base64 (avec ou sans préfixe data:image/...)
-        threshold: Seuil de confiance
-
-    Returns:
-        DetectionResult
-    """
+    """Détecte les éléments UI depuis une image base64"""
     # Retirer le préfixe data:image/... si présent
     if ',' in image_base64:
         image_base64 = image_base64.split(',')[1]
 
-    # Décoder
     image_bytes = base64.b64decode(image_base64)
     image = Image.open(io.BytesIO(image_bytes))
 
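The prefix handling kept in `detect_from_base64` above works because browsers send screenshots as data URLs (`data:image/png;base64,<payload>`) and the base64 alphabet never contains a comma, so splitting on the first comma safely isolates the payload. A stdlib-only sketch of just that decoding step (`strip_data_url_prefix` is a hypothetical helper name, not from the commit):

```python
import base64


def strip_data_url_prefix(image_base64: str) -> bytes:
    """Decode base64 image data, tolerating an optional data-URL prefix."""
    # "data:image/png;base64,AAAA..." -> keep only the part after the comma
    if ',' in image_base64:
        image_base64 = image_base64.split(',')[1]
    return base64.b64decode(image_base64)
```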
@@ -185,16 +289,7 @@ def detect_from_file(
     file_path: str,
     threshold: float = CONFIDENCE_THRESHOLD
 ) -> DetectionResult:
-    """
-    Détecte les éléments UI depuis un fichier image
-
-    Args:
-        file_path: Chemin vers l'image
-        threshold: Seuil de confiance
-
-    Returns:
-        DetectionResult
-    """
+    """Détecte les éléments UI depuis un fichier image"""
     image = Image.open(file_path)
     return detect_ui_elements(image, threshold)
 
@@ -205,69 +300,42 @@ def create_annotated_image(
     show_ids: bool = True,
     show_confidence: bool = False
 ) -> Image.Image:
-    """
-    Crée une image annotée avec les bboxes et IDs
-
-    Args:
-        image: Image originale
-        detection_result: Résultat de détection
-        show_ids: Afficher les numéros d'ID
-        show_confidence: Afficher les scores de confiance
-
-    Returns:
-        Image annotée
-    """
+    """Crée une image annotée avec les bboxes et IDs"""
     from PIL import ImageDraw, ImageFont
 
-    # Copier l'image
     annotated = image.copy()
     draw = ImageDraw.Draw(annotated)
 
-    # Essayer de charger une police, sinon utiliser la police par défaut
     try:
         font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 14)
-        small_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf", 10)
     except:
         font = ImageFont.load_default()
-        small_font = font
 
-    # Couleurs pour les bboxes
-    bbox_color = (233, 69, 96)  # Rouge/rose
-    text_bg_color = (233, 69, 96)
+    bbox_color = (233, 69, 96)
     text_color = (255, 255, 255)
 
     for elem in detection_result.elements:
         bbox = elem.bbox
         x1, y1, x2, y2 = bbox["x1"], bbox["y1"], bbox["x2"], bbox["y2"]
 
-        # Dessiner la bbox
         draw.rectangle([x1, y1, x2, y2], outline=bbox_color, width=2)
 
         if show_ids:
-            # Texte à afficher
             label = str(elem.id)
             if show_confidence:
                 label += f" ({elem.confidence:.0%})"
 
-            # Mesurer le texte
             text_bbox = draw.textbbox((0, 0), label, font=font)
             text_width = text_bbox[2] - text_bbox[0]
             text_height = text_bbox[3] - text_bbox[1]
 
-            # Position du label (en haut à gauche de la bbox)
-            label_x = x1
-            label_y = y1 - text_height - 4
-            if label_y < 0:
-                label_y = y1 + 2
-
-            # Fond du label
+            label_y = y1 - text_height - 4 if y1 - text_height - 4 > 0 else y1 + 2
             draw.rectangle(
-                [label_x - 2, label_y - 2, label_x + text_width + 4, label_y + text_height + 2],
-                fill=text_bg_color
+                [x1 - 2, label_y - 2, x1 + text_width + 4, label_y + text_height + 2],
+                fill=bbox_color
             )
-            # Texte du label
-            draw.text((label_x, label_y), label, fill=text_color, font=font)
+            draw.text((x1, label_y), label, fill=text_color, font=font)
 
     return annotated
 
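The label placement condensed in the hunk above places the ID label just above the bounding box, flipping it inside the box when it would clip past the top of the image. The arithmetic, isolated as a pure function (`label_position` is a hypothetical name for illustration only):

```python
def label_position(y1: int, text_height: int, pad: int = 4) -> int:
    """Y coordinate for a label above the bbox, or just inside it near the top edge.

    y1 is the top of the bounding box; the label needs text_height + pad
    pixels of room above it, otherwise it is drawn 2 px inside the box.
    """
    candidate = y1 - text_height - pad
    return candidate if candidate > 0 else y1 + 2
```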
@@ -278,9 +346,7 @@ def annotated_image_to_base64(
     show_ids: bool = True,
     show_confidence: bool = False
 ) -> str:
-    """
-    Crée une image annotée et la retourne en base64
-    """
+    """Crée une image annotée et la retourne en base64"""
     annotated = create_annotated_image(image, detection_result, show_ids, show_confidence)
 
     buffer = io.BytesIO()
@@ -290,9 +356,36 @@ def annotated_image_to_base64(
     return base64.b64encode(buffer.read()).decode('utf-8')
 
 
-# Préchargement optionnel
+# ==============================================================================
+# Compatibilité avec l'ancienne API
+# ==============================================================================
+
+# Alias pour l'ancienne variable _model (utilisé par l'API)
+_model = None  # Sera non-None si un backend est chargé
+
+
 def preload_model():
-    """Précharge le modèle en arrière-plan"""
-    import threading
-    thread = threading.Thread(target=load_model, daemon=True)
-    thread.start()
+    """
+    Précharge le modèle de détection (pour éviter la latence du premier appel).
+    Compatible avec l'ancienne API.
+    """
+    global _model
+
+    # Essayer rfdetr d'abord
+    if _check_rfdetr_available():
+        try:
+            _load_rfdetr()
+            _model = _rfdetr_model
+            print("[UI-Detection] Modèle rfdetr préchargé")
+            return
+        except Exception as e:
+            print(f"⚠️ [UI-Detection] Erreur préchargement rfdetr: {e}")
+
+    # Fallback OmniParser
+    if _check_omniparser_available():
+        _model = _omniparser
+        print("[UI-Detection] OmniParser préchargé")
+
+
+# Vérification au chargement du module
+print(f"[UI-Detection] Backends disponibles: rfdetr={_check_rfdetr_available()}, omniparser={_check_omniparser_available()}")
@@ -17,9 +17,14 @@ import StepNode from './components/StepNode';
 import ToolPalette from './components/ToolPalette';
 import PropertiesPanel from './components/PropertiesPanel';
 import CapturePanel from './components/CapturePanel';
-import WorkflowList from './components/WorkflowList';
+import WorkflowSelector from './components/WorkflowSelector';
+import WorkflowManagerModal from './components/WorkflowManagerModal';
 import ExecutionControls from './components/ExecutionControls';
 import ExecutionModeToggle from './components/ExecutionModeToggle';
+import ExecutionOverlay from './components/ExecutionOverlay';
+import VariableManager from './components/VariableManager';
+import type { Variable } from './components/VariableManager';
+import CaptureLibrary from './components/CaptureLibrary';
 
 const nodeTypes: NodeTypes = {
   step: StepNode,
@@ -32,6 +37,12 @@ function App() {
   const [capture, setCapture] = useState<Capture | null>(null);
   const [error, setError] = useState<string | null>(null);
   const [executionMode, setExecutionMode] = useState<ExecutionMode>('basic');
+  const [showDebugOverlay, setShowDebugOverlay] = useState(false);
+  const [isExecutionRunning, setIsExecutionRunning] = useState(false);
+  const [detectionZone, setDetectionZone] = useState<{x: number; y: number; width: number; height: number} | null>(null);
+  const [variables, setVariables] = useState<Variable[]>([]);
+  const [showWorkflowManager, setShowWorkflowManager] = useState(false);
+  const [currentCapture, setCurrentCapture] = useState<Capture | null>(null);
 
   // Charger l'état initial
   const loadState = useCallback(async () => {
@@ -48,6 +59,31 @@ function App() {
     loadState();
   }, [loadState]);
 
+  // Polling du status d'exécution
+  useEffect(() => {
+    if (!isExecutionRunning) return;
+
+    const pollStatus = async () => {
+      try {
+        const status = await api.getExecutionStatus();
+        setIsExecutionRunning(status.is_running);
+
+        // Mettre à jour l'état si l'exécution est terminée
+        // Note: Ne PAS fermer l'overlay automatiquement pour permettre
+        // à l'utilisateur de voir les résultats de détection
+        if (!status.is_running) {
+          await loadState();
+          // L'overlay reste visible, l'utilisateur peut le fermer manuellement
+        }
+      } catch (err) {
+        console.error('Erreur polling status:', err);
+      }
+    };
+
+    const interval = setInterval(pollStatus, 500);
+    return () => clearInterval(interval);
+  }, [isExecutionRunning, loadState]);
+
   // Convertir les étapes en nœuds React Flow
   const updateNodesFromWorkflow = (steps: Step[]) => {
     const newNodes: Node[] = steps.map((step, index) => ({
@@ -97,7 +133,6 @@ function App() {
   };
 
   const handleDeleteWorkflow = async (id: string) => {
-    if (!confirm('Supprimer ce workflow ?')) return;
     try {
       await api.deleteWorkflow(id);
       await loadState();
@@ -106,6 +141,29 @@ function App() {
     }
   };
 
+  const handleRenameWorkflow = async (id: string, newName: string) => {
+    try {
+      await api.updateWorkflow(id, { name: newName });
+      await loadState();
+    } catch (err) {
+      setError((err as Error).message);
+    }
+  };
+
+  const handleUpdateWorkflowMeta = async (id: string, metadata: { description?: string; tags?: string[]; trigger_examples?: string[] }) => {
+    try {
+      // Convertir trigger_examples en triggerExamples pour l'API
+      const apiData: { description?: string; tags?: string[]; triggerExamples?: string[] } = {};
+      if (metadata.description !== undefined) apiData.description = metadata.description;
+      if (metadata.tags !== undefined) apiData.tags = metadata.tags;
+      if (metadata.trigger_examples !== undefined) apiData.triggerExamples = metadata.trigger_examples;
+      await api.updateWorkflow(id, apiData);
+      await loadState();
+    } catch (err) {
+      setError((err as Error).message);
+    }
+  };
+
   const handleAddStep = async (actionType: ActionType, position?: { x: number; y: number }) => {
     if (!appState?.session.active_workflow_id) {
       setError('Sélectionnez un workflow d\'abord');
@@ -163,11 +221,17 @@ function App() {
     try {
       const result = await api.captureScreen();
       setCapture(result.capture);
+      setCurrentCapture(result.capture);
     } catch (err) {
       setError((err as Error).message);
     }
   };
 
+  const handleSelectCaptureFromLibrary = (cap: Capture) => {
+    setCapture(cap);
+    setCurrentCapture(cap);
+  };
+
   const handleSelectAnchor = async (bbox: { x: number; y: number; width: number; height: number }, screenshotBase64?: string) => {
     if (!appState?.session.selected_step_id) {
       setError('Sélectionnez une étape d\'abord');
@@ -183,7 +247,14 @@ function App() {
 
   const handleStartExecution = async () => {
     try {
-      await api.startExecution();
+      await api.startExecution(undefined, executionMode);
+      setIsExecutionRunning(true);
+
+      // Overlay désactivé - génère trop de requêtes et n'est pas utile
+      // if (executionMode === 'debug') {
+      //   setShowDebugOverlay(true);
+      // }
+
       await loadState();
     } catch (err) {
       setError((err as Error).message);
@@ -193,12 +264,31 @@ function App() {
   const handleStopExecution = async () => {
     try {
       await api.stopExecution();
+      setIsExecutionRunning(false);
+      setShowDebugOverlay(false);
       await loadState();
     } catch (err) {
       setError((err as Error).message);
     }
   };
 
+  // Gestion des variables
+  const handleVariableCreate = (data: Omit<Variable, 'id'>) => {
+    const newVariable: Variable = {
+      ...data,
+      id: `var_${Date.now()}`,
+    };
+    setVariables(prev => [...prev, newVariable]);
+  };
+
+  const handleVariableUpdate = (id: string, data: Partial<Variable>) => {
+    setVariables(prev => prev.map(v => v.id === id ? { ...v, ...data } : v));
+  };
+
+  const handleVariableDelete = (id: string) => {
+    setVariables(prev => prev.filter(v => v.id !== id));
+  };
+
   // Drop d'un outil sur le canvas
   const onDrop = useCallback(
     (event: React.DragEvent) => {
@@ -230,7 +320,15 @@ function App() {
     <div className="app">
       {/* Header */}
       <header className="header">
-        <h1>VWB - Visual Workflow Builder</h1>
+        <h1>VWB</h1>
+        <WorkflowSelector
+          workflows={appState?.workflows_list || []}
+          activeWorkflow={appState?.workflow ? { id: appState.workflow.id, name: appState.workflow.name } : null}
+          onSelect={handleSelectWorkflow}
+          onCreate={handleCreateWorkflow}
+          onOpenManager={() => setShowWorkflowManager(true)}
+          onRename={handleRenameWorkflow}
+        />
         <ExecutionModeToggle
           mode={executionMode}
           onChange={setExecutionMode}
@@ -251,15 +349,8 @@ function App() {
       )}
 
       <div className="main-layout">
-        {/* Sidebar gauche: Workflows + Outils */}
+        {/* Sidebar gauche: Outils */}
         <aside className="sidebar left">
-          <WorkflowList
-            workflows={appState?.workflows_list || []}
-            activeId={appState?.session.active_workflow_id || null}
-            onSelect={handleSelectWorkflow}
-            onCreate={handleCreateWorkflow}
-            onDelete={handleDeleteWorkflow}
-          />
           <ToolPalette />
         </aside>
 
@@ -286,7 +377,7 @@ function App() {
         )}
         </main>
 
-        {/* Sidebar droite: Propriétés + Capture */}
+        {/* Sidebar droite: Propriétés + Capture + Variables */}
         <aside className="sidebar right">
           <PropertiesPanel
             step={selectedStep || null}
@@ -299,6 +390,19 @@ function App() {
             onSelectAnchor={handleSelectAnchor}
             hasSelectedStep={!!appState?.session.selected_step_id}
             executionMode={executionMode}
+            detectionZone={detectionZone}
+            onSetDetectionZone={setDetectionZone}
+          />
+          <CaptureLibrary
+            currentCapture={currentCapture}
+            onSelectCapture={handleSelectCaptureFromLibrary}
+            onCapture={handleCapture}
+          />
+          <VariableManager
+            variables={variables}
+            onVariableCreate={handleVariableCreate}
+            onVariableUpdate={handleVariableUpdate}
+            onVariableDelete={handleVariableDelete}
           />
         </aside>
       </div>
@@ -308,6 +412,27 @@ function App() {
           <span>{EXECUTION_MODES[executionMode].icon}</span>
           <span>Mode {EXECUTION_MODES[executionMode].label}</span>
         </div>
+
+        {/* Overlay de debug en temps réel */}
+        <ExecutionOverlay
+          isVisible={showDebugOverlay}
+          isRunning={isExecutionRunning}
+          onClose={() => setShowDebugOverlay(false)}
+          initialDetectionZone={detectionZone}
+        />
+
+        {/* Modal de gestion des workflows */}
+        {showWorkflowManager && (
+          <WorkflowManagerModal
+            workflows={appState?.workflows_list || []}
+            activeWorkflowId={appState?.session.active_workflow_id || null}
+            onSelect={handleSelectWorkflow}
+            onDelete={handleDeleteWorkflow}
+            onRename={handleRenameWorkflow}
+            onUpdateMetadata={handleUpdateWorkflowMeta}
+            onClose={() => setShowWorkflowManager(false)}
+          />
+        )}
     </div>
   );
 }
@@ -0,0 +1,436 @@
|
|||||||
|
/**
|
||||||
|
* Overlay de debug en temps réel pendant l'exécution
|
||||||
|
* Affiche la détection UI et les actions en cours
|
||||||
|
*/
|
||||||
|
|
||||||
|
import { useState, useEffect, useCallback } from 'react';
|
||||||
|
import type { UIElement, DetectionResult } from '../services/uiDetection';
|
||||||
|
import { detectUIElements } from '../services/uiDetection';
|
||||||
|
|
||||||
|
interface ExecutionEvent {
|
||||||
|
type: 'step_start' | 'detection' | 'click' | 'step_end' | 'error';
|
||||||
|
stepIndex: number;
|
||||||
|
stepType: string;
|
||||||
|
timestamp: number;
|
||||||
|
data?: {
|
||||||
|
elements?: UIElement[];
|
||||||
|
targetElement?: UIElement;
|
||||||
|
clickCoordinates?: { x: number; y: number };
|
||||||
|
confidence?: number;
|
||||||
|
method?: string;
|
||||||
|
error?: string;
|
||||||
|
};
|
||||||
|
}
|
||||||
|
|
||||||
|
interface DetectionZone {
|
||||||
|
x: number;
|
||||||
|
y: number;
|
||||||
|
width: number;
|
||||||
|
height: number;
|
||||||
|
}
|
||||||
|
|
||||||
|
interface Props {
|
||||||
|
isVisible: boolean;
|
||||||
|
isRunning: boolean;
|
||||||
|
onClose: () => void;
|
||||||
|
initialDetectionZone?: DetectionZone | null;
|
||||||
|
}
|
||||||
|
|
||||||
|
export default function ExecutionOverlay({ isVisible, isRunning, onClose, initialDetectionZone }: Props) {
  const [screenshot, setScreenshot] = useState<string | null>(null);
  const [elements, setElements] = useState<UIElement[]>([]);
  const [targetElement, setTargetElement] = useState<UIElement | null>(null);
  const [clickPoint, setClickPoint] = useState<{ x: number; y: number } | null>(null);
  const [isDetecting, setIsDetecting] = useState(false);
  const [lastEvent, setLastEvent] = useState<ExecutionEvent | null>(null);
  const [confidence, setConfidence] = useState<number | null>(null);
  const [imageSize, setImageSize] = useState({ width: 1920, height: 1080 });
  const [detectionZone, setDetectionZone] = useState<DetectionZone | null>(initialDetectionZone || null);
  const [isSelectingZone, setIsSelectingZone] = useState(false);
  const [zoneStart, setZoneStart] = useState<{ x: number; y: number } | null>(null);
  const [tempZone, setTempZone] = useState<DetectionZone | null>(null);

  // Crop a base64 image to the given zone
  const cropImage = useCallback(async (
    imageBase64: string,
    zone: DetectionZone
  ): Promise<string> => {
    return new Promise((resolve) => {
      const img = new Image();
      img.onload = () => {
        const canvas = document.createElement('canvas');
        canvas.width = zone.width;
        canvas.height = zone.height;
        const ctx = canvas.getContext('2d');
        if (ctx) {
          ctx.drawImage(
            img,
            zone.x, zone.y, zone.width, zone.height,
            0, 0, zone.width, zone.height
          );
          resolve(canvas.toDataURL('image/png'));
        } else {
          resolve(imageBase64);
        }
      };
      img.src = imageBase64;
    });
  }, []);

  // Capture the screen and detect UI elements
  const captureAndDetect = useCallback(async () => {
    // Allow capture even once execution has finished (to inspect the final screen)
    if (isDetecting) return;

    setIsDetecting(true);
    try {
      // Call the capture API on the backend (port 5001)
      const API_BASE = 'http://localhost:5001';
      const response = await fetch(`${API_BASE}/api/v3/capture/screen`, { method: 'POST' });
      const data = await response.json();

      if (data.success && data.capture) {
        const screenshotBase64 = `data:image/png;base64,${data.capture.screenshot_base64}`;
        setScreenshot(screenshotBase64);
        setImageSize({
          width: data.capture.width,
          height: data.capture.height
        });

        // If a detection zone is defined, crop the image to it
        let imageToDetect = screenshotBase64;
        let offsetX = 0;
        let offsetY = 0;

        if (detectionZone) {
          imageToDetect = await cropImage(screenshotBase64, detectionZone);
          offsetX = detectionZone.x;
          offsetY = detectionZone.y;
        }

        // Detect UI elements
        const detectionResult = await detectUIElements(imageToDetect, {
          threshold: 0.30 // Lower threshold to catch small elements
        });

        // Shift coordinates back into full-screen space if the image was cropped
        const adjustedElements = detectionResult.elements.map(elem => ({
          ...elem,
          bbox: {
            x1: elem.bbox.x1 + offsetX,
            y1: elem.bbox.y1 + offsetY,
            x2: elem.bbox.x2 + offsetX,
            y2: elem.bbox.y2 + offsetY,
          },
          center: {
            x: elem.center.x + offsetX,
            y: elem.center.y + offsetY,
          }
        }));

        setElements(adjustedElements);
      }
    } catch (err) {
      console.error('Capture/detection error:', err);
    } finally {
      setIsDetecting(false);
    }
  }, [isDetecting, detectionZone, cropImage]);

  // Poll for screen updates during execution
  useEffect(() => {
    if (!isVisible) return;

    // Initial capture (even if execution is not running, to show the current screen)
    captureAndDetect();

    // Poll every 500 ms, but only while execution is running
    if (isRunning) {
      const interval = setInterval(captureAndDetect, 500);
      return () => clearInterval(interval);
    }
  }, [isVisible, isRunning, captureAndDetect]);

  // Poll the execution status endpoint for events
  useEffect(() => {
    if (!isVisible || !isRunning) return;

    const pollStatus = async () => {
      try {
        const API_BASE = 'http://localhost:5001';
        const response = await fetch(`${API_BASE}/api/v3/execute/status`);
        const data = await response.json();

        if (data.success && data.execution) {
          // Synthesize an event from the status payload
          const event: ExecutionEvent = {
            type: 'step_start',
            stepIndex: data.execution.current_step_index || 0,
            stepType: 'click',
            timestamp: Date.now()
          };
          setLastEvent(event);
        }
      } catch (err) {
        console.error('Status polling error:', err);
      }
    };

    const interval = setInterval(pollStatus, 200);
    return () => clearInterval(interval);
  }, [isVisible, isRunning]);

  // Zone selection handlers
  const handleMouseDown = (e: React.MouseEvent) => {
    if (!isSelectingZone) return;

    const rect = e.currentTarget.getBoundingClientRect();
    const x = (e.clientX - rect.left) / scale;
    const y = (e.clientY - rect.top) / scale;

    setZoneStart({ x, y });
    setTempZone({ x, y, width: 0, height: 0 });
  };

  const handleMouseMove = (e: React.MouseEvent) => {
    if (!isSelectingZone || !zoneStart) return;

    const rect = e.currentTarget.getBoundingClientRect();
    const currentX = (e.clientX - rect.left) / scale;
    const currentY = (e.clientY - rect.top) / scale;

    const width = currentX - zoneStart.x;
    const height = currentY - zoneStart.y;

    setTempZone({
      x: width < 0 ? currentX : zoneStart.x,
      y: height < 0 ? currentY : zoneStart.y,
      width: Math.abs(width),
      height: Math.abs(height)
    });
  };

  const handleMouseUp = () => {
    if (!isSelectingZone || !tempZone) return;

    if (tempZone.width > 50 && tempZone.height > 50) {
      setDetectionZone({
        x: Math.round(tempZone.x),
        y: Math.round(tempZone.y),
        width: Math.round(tempZone.width),
        height: Math.round(tempZone.height)
      });
    }

    setIsSelectingZone(false);
    setZoneStart(null);
    setTempZone(null);
  };

  const clearDetectionZone = () => {
    setDetectionZone(null);
    setElements([]);
  };

  // Highlight the hovered element as the target (demo behaviour)
  const handleElementHover = (elem: UIElement) => {
    setTargetElement(elem);
    setClickPoint({
      x: elem.center.x,
      y: elem.center.y
    });
    setConfidence(elem.confidence);
  };

  // Initialise the detection zone from props
  useEffect(() => {
    if (initialDetectionZone) {
      setDetectionZone(initialDetectionZone);
    }
  }, [initialDetectionZone]);

  // Reset when execution stops
  useEffect(() => {
    if (!isRunning) {
      setTargetElement(null);
      setClickPoint(null);
      setConfidence(null);
    }
  }, [isRunning]);

  // Escape shortcut to close the overlay
  useEffect(() => {
    if (!isVisible) return;

    const handleKeyDown = (e: KeyboardEvent) => {
      if (e.key === 'Escape') {
        onClose();
      }
    };

    document.addEventListener('keydown', handleKeyDown);
    return () => document.removeEventListener('keydown', handleKeyDown);
  }, [isVisible, onClose]);

  // Display scale (the mouse handlers above capture this via closure; it is
  // initialised by the time they run, since events fire after render)
  const displayWidth = Math.min(window.innerWidth * 0.9, 1400);
  const scale = displayWidth / imageSize.width;
  const displayHeight = imageSize.height * scale;

  if (!isVisible) return null;

  return (
    <div className="execution-overlay-modal">
      <div className="execution-overlay-header">
        <div className="header-left">
          <span className="status-indicator running" />
          <span className="status-text">
            {isRunning ? 'Exécution en cours' : 'En pause'}
          </span>
          {lastEvent && (
            <span className="step-info">
              Étape {lastEvent.stepIndex + 1}
            </span>
          )}
        </div>
        <div className="header-center">
          <button
            className={`zone-btn ${isSelectingZone ? 'active' : ''}`}
            onClick={() => setIsSelectingZone(!isSelectingZone)}
          >
            {isSelectingZone ? '✋ Annuler' : '✂️ Sélectionner zone'}
          </button>
          {detectionZone && (
            <button className="zone-btn clear" onClick={clearDetectionZone}>
              ❌ Effacer zone
            </button>
          )}
          <span className="detection-count">
            {elements.length} éléments détectés
            {detectionZone && ' (zone)'}
          </span>
          {confidence !== null && (
            <span className="confidence-badge">
              Confiance: {(confidence * 100).toFixed(0)}%
            </span>
          )}
        </div>
        <div className="header-right">
          <button onClick={onClose}>Fermer (Échap)</button>
        </div>
      </div>

      <div className="execution-overlay-content">
        {screenshot ? (
          <div
            className={`screen-container ${isSelectingZone ? 'selecting' : ''}`}
            style={{
              width: displayWidth,
              height: displayHeight,
              position: 'relative',
              cursor: isSelectingZone ? 'crosshair' : 'default'
            }}
            onMouseDown={handleMouseDown}
            onMouseMove={handleMouseMove}
            onMouseUp={handleMouseUp}
            onMouseLeave={handleMouseUp}
          >
            <img
              src={screenshot}
              alt="Écran en temps réel"
              style={{ width: '100%', height: '100%', display: 'block', pointerEvents: 'none' }}
            />

            {/* Configured detection zone */}
            {detectionZone && (
              <div
                className="detection-zone"
                style={{
                  position: 'absolute',
                  left: detectionZone.x * scale,
                  top: detectionZone.y * scale,
                  width: detectionZone.width * scale,
                  height: detectionZone.height * scale,
                }}
              />
            )}

            {/* Zone currently being drawn */}
            {tempZone && tempZone.width > 0 && (
              <div
                className="detection-zone temp"
                style={{
                  position: 'absolute',
                  left: tempZone.x * scale,
                  top: tempZone.y * scale,
                  width: tempZone.width * scale,
                  height: tempZone.height * scale,
                }}
              />
            )}

            {/* Detected elements */}
            {!isSelectingZone && elements.map((elem) => {
              const isTarget = targetElement?.id === elem.id;
              return (
                <div
                  key={elem.id}
                  className={`overlay-bbox ${isTarget ? 'target' : ''}`}
                  style={{
                    position: 'absolute',
                    left: elem.bbox.x1 * scale,
                    top: elem.bbox.y1 * scale,
                    width: (elem.bbox.x2 - elem.bbox.x1) * scale,
                    height: (elem.bbox.y2 - elem.bbox.y1) * scale,
                  }}
                  onMouseEnter={() => handleElementHover(elem)}
                  onMouseLeave={() => {
                    if (!isRunning) {
                      setTargetElement(null);
                      setClickPoint(null);
                    }
                  }}
                >
                  <span className="bbox-id">{elem.id}</span>
                </div>
              );
            })}

            {/* Animated click indicator */}
            {clickPoint && (
              <div
                className="click-indicator"
                style={{
                  position: 'absolute',
                  left: clickPoint.x * scale - 20,
                  top: clickPoint.y * scale - 20,
                }}
              >
                <div className="click-ring" />
                <div className="click-center" />
              </div>
            )}

            {/* Loading indicator */}
            {isDetecting && (
              <div className="detecting-indicator">
                <span>Détection...</span>
              </div>
            )}
          </div>
        ) : (
          <div className="loading-screen">
            <span>Capture de l'écran...</span>
          </div>
        )}
      </div>

      {/* Bottom info bar */}
      <div className="execution-overlay-footer">
        <span>Mode Debug - Vision AI activée</span>
        <span>UI-DETR-1 | Template Matching</span>
        <span>Survolez un élément pour voir le point de clic</span>
      </div>
    </div>
  );
}
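The crop-and-offset logic inside `captureAndDetect` is worth isolating: detections run on a cropped image must be shifted back into full-screen coordinates by the crop origin, otherwise the overlay boxes (and any click derived from them) land in the wrong place. A minimal standalone sketch of that adjustment — `adjustForCrop` and its local types are hypothetical helpers for illustration, not part of this commit:

```typescript
// Hypothetical standalone version of the coordinate adjustment performed in
// captureAndDetect: shift crop-relative detections by the crop origin.
interface Box { x1: number; y1: number; x2: number; y2: number; }
interface Detected { bbox: Box; center: { x: number; y: number }; }

function adjustForCrop(elems: Detected[], offsetX: number, offsetY: number): Detected[] {
  return elems.map(e => ({
    bbox: {
      x1: e.bbox.x1 + offsetX,
      y1: e.bbox.y1 + offsetY,
      x2: e.bbox.x2 + offsetX,
      y2: e.bbox.y2 + offsetY,
    },
    // The click point is derived from center, so it must be shifted too
    center: { x: e.center.x + offsetX, y: e.center.y + offsetY },
  }));
}

const adjusted = adjustForCrop(
  [{ bbox: { x1: 10, y1: 20, x2: 50, y2: 60 }, center: { x: 30, y: 40 } }],
  100, 200
);
console.log(adjusted[0].center); // { x: 130, y: 240 }
```

Keeping this transformation in one place makes it straightforward to unit-test, which matters here given that the commit exists precisely because clicks were landing 200-500 px away from their targets.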