""" ExtractionEngine - Orchestrateur principal du moteur d'extraction de donnees Orchestre le cycle complet : naviguer -> screenshot -> extraire -> valider -> stocker -> suivant S'appuie sur FieldExtractor (VLM/OCR), DataStore (SQLite), et IterationController (navigation) pour realiser l'extraction automatisee de donnees depuis des interfaces utilisateur. """ import logging import time from datetime import datetime from pathlib import Path from typing import Any, Callable, Dict, List, Optional import requests from .data_store import DataStore from .field_extractor import FieldExtractor from .iteration_controller import IterationController from .schema import ExtractionSchema logger = logging.getLogger(__name__) class ExtractionEngine: """ Moteur d'extraction principal. Orchestre le cycle : naviguer -> screenshot -> extraire -> stocker -> suivant. Modes d'utilisation : 1. Automatique : start_extraction() — boucle complete avec navigation 2. Manuel : extract_current_screen() — extraction ponctuelle d'un screenshot """ def __init__( self, schema: ExtractionSchema, store: Optional[DataStore] = None, field_extractor: Optional[FieldExtractor] = None, streaming_server_url: str = "http://localhost:5005", screenshot_dir: str = "data/extractions/screenshots", ): """ Args: schema: Schema d'extraction decrivant les champs et la navigation store: DataStore pour le stockage (cree un par defaut si absent) field_extractor: Extracteur de champs (cree un par defaut si absent) streaming_server_url: URL du streaming server Agent V1 screenshot_dir: Repertoire pour sauvegarder les screenshots """ self.schema = schema self.store = store or DataStore() self.field_extractor = field_extractor or FieldExtractor() self.controller = IterationController(schema, streaming_server_url) self.streaming_server_url = streaming_server_url.rstrip("/") self.screenshot_dir = Path(screenshot_dir) self.screenshot_dir.mkdir(parents=True, exist_ok=True) # Etat interne self._current_extraction_id: Optional[str] = None self._is_running = False self._should_stop = False self._progress_callback: Optional[Callable] = None # ------------------------------------------------------------------ # API publique - Extraction automatique # ------------------------------------------------------------------ def start_extraction( self, session_id: str, on_progress: Optional[Callable[[Dict[str, Any]], None]] = None, ) -> str: """ Demarrer une session d'extraction automatique. Boucle : 1. Creer l'extraction dans le store 2. Pour chaque enregistrement : a. Prendre un screenshot b. Extraire les champs c. Valider d. Stocker e. Naviguer au suivant 3. Finaliser et retourner l'extraction_id Args: session_id: ID de la session de streaming (pour navigation) on_progress: Callback appele a chaque record (optionnel) Returns: extraction_id """ self._is_running = True self._should_stop = False self._progress_callback = on_progress # Creer la session d'extraction extraction_id = self.store.create_extraction(self.schema) self._current_extraction_id = extraction_id logger.info( "Demarrage extraction %s (schema=%s, max=%d)", extraction_id[:8], self.schema.name, self.controller.max_records, ) try: while self.controller.has_next() and not self._should_stop: idx = self.controller.current_index # 1. Screenshot screenshot_path = self._take_screenshot(session_id, idx) if screenshot_path is None: logger.warning("Screenshot echoue a l'index %d, on continue", idx) # Naviguer quand meme pour ne pas rester bloque self.controller.navigate_to_next(session_id) continue # 2. Extraction result = self.extract_current_screen(screenshot_path) # 3. Stockage self.store.add_record( extraction_id=extraction_id, data=result["data"], screenshot_path=screenshot_path, confidence=result["confidence"], errors=result.get("errors"), ) # 4. Callback de progression if self._progress_callback: progress = self.get_progress() progress["last_record"] = result["data"] progress["last_confidence"] = result["confidence"] self._progress_callback(progress) logger.info( "Record %d/%d extrait (confiance=%.2f)", idx + 1, self.controller.max_records, result["confidence"], ) # 5. Navigation if not self.controller.navigate_to_next(session_id): logger.info("Fin de navigation a l'index %d", idx) break # Finaliser status = "stopped" if self._should_stop else "completed" self.store.finish_extraction(extraction_id, status=status) logger.info( "Extraction %s terminee : %s (%d records)", extraction_id[:8], status, self.controller.current_index, ) except Exception as e: logger.error("Erreur pendant l'extraction : %s", e) self.store.finish_extraction(extraction_id, status="error") raise finally: self._is_running = False self._current_extraction_id = None return extraction_id def stop_extraction(self) -> None: """Demander l'arret de l'extraction en cours.""" if self._is_running: logger.info("Arret demande pour l'extraction en cours") self._should_stop = True # ------------------------------------------------------------------ # API publique - Extraction ponctuelle # ------------------------------------------------------------------ def extract_current_screen(self, screenshot_path: str) -> Dict[str, Any]: """ Extraire les champs du screenshot actuel sans navigation. Args: screenshot_path: Chemin vers le screenshot Returns: Dict avec 'data', 'confidence', 'errors', 'validation' """ # Extraction result = self.field_extractor.extract_fields(screenshot_path, self.schema) # Validation contre le schema validation = self.schema.validate_record(result["data"]) result["validation"] = validation return result # ------------------------------------------------------------------ # API publique - Progression # ------------------------------------------------------------------ def get_progress(self) -> Dict[str, Any]: """Retourne la progression actuelle de l'extraction.""" nav_progress = self.controller.progress stats = {} if self._current_extraction_id: stats = self.store.get_stats(self._current_extraction_id) return { "extraction_id": self._current_extraction_id, "is_running": self._is_running, "navigation": nav_progress, "stats": stats, "schema_name": self.schema.name, } # ------------------------------------------------------------------ # Screenshot # ------------------------------------------------------------------ def _take_screenshot(self, session_id: str, index: int) -> Optional[str]: """ Prendre un screenshot via le streaming server. Essaie d'appeler l'API du streaming server pour obtenir le screenshot courant. En cas d'echec, retourne None. Args: session_id: ID de la session de streaming index: Index de l'enregistrement courant Returns: Chemin du screenshot sauvegarde, ou None """ try: response = requests.get( f"{self.streaming_server_url}/api/screenshot", params={"session_id": session_id}, timeout=10, ) if response.status_code == 200: # Sauvegarder le screenshot timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S") filename = f"record_{index:04d}_{timestamp}.png" filepath = self.screenshot_dir / filename with open(filepath, "wb") as f: f.write(response.content) return str(filepath) else: logger.warning( "Screenshot echoue : HTTP %d", response.status_code ) return None except requests.exceptions.ConnectionError: logger.warning( "Streaming server non accessible pour screenshot" ) return None except Exception as e: logger.error("Erreur screenshot : %s", e) return None # ------------------------------------------------------------------ # Utilitaires # ------------------------------------------------------------------ def extract_from_file(self, screenshot_path: str) -> Dict[str, Any]: """ Raccourci pour extraire depuis un fichier existant et stocker le resultat. Utile pour du retraitement offline de screenshots. Args: screenshot_path: Chemin vers un screenshot existant Returns: Dict avec les donnees extraites et le record_id """ if self._current_extraction_id is None: extraction_id = self.store.create_extraction(self.schema) else: extraction_id = self._current_extraction_id result = self.extract_current_screen(screenshot_path) record_id = self.store.add_record( extraction_id=extraction_id, data=result["data"], screenshot_path=screenshot_path, confidence=result["confidence"], errors=result.get("errors"), ) result["record_id"] = record_id result["extraction_id"] = extraction_id return result