feat(cognition): VRAM orchestrator + 7b VLM by default

VRAMOrchestrator: automatic switching between SHADOW and REPLAY modes.

- SHADOW: streaming server + agent_chat active
- REPLAY: VLM qwen2.5vl:7b loaded, non-essential services stopped

vlm_reason_about_screen() calls ensure_reasoning_ready() before each
reasoning step, freeing VRAM if necessary.

Benchmark: qwen2.5vl:7b takes 10s (warm) vs 44s when VRAM is saturated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
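The mode switch described above can be pictured with the minimal sketch below. The real `core/cognition/vram_orchestrator.py` is not shown in this diff, so everything except the names `get_orchestrator`, `ensure_reasoning_ready`, `SHADOW` and `REPLAY` is hypothetical: the `Mode` enum, the `ensure_shadow()` method and the four injected callbacks (`stop_services`, `start_services`, `load_reasoning_model`, `unload_reasoning_model`) stand in for whatever the actual module does to stop services and load or unload the model.

```python
import threading
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    SHADOW = "shadow"  # streaming server + agent_chat running
    REPLAY = "replay"  # reasoning VLM loaded, non-essential services stopped


class VRAMOrchestrator:
    """Hypothetical sketch of a two-mode VRAM orchestrator.

    The callbacks are assumptions standing in for the real service and
    model management; they are injected so the switching logic is testable.
    """

    def __init__(
        self,
        stop_services: Callable[[], None],
        start_services: Callable[[], None],
        load_reasoning_model: Callable[[], None],
        unload_reasoning_model: Callable[[], None],
    ) -> None:
        self._stop_services = stop_services
        self._start_services = start_services
        self._load_reasoning_model = load_reasoning_model
        self._unload_reasoning_model = unload_reasoning_model
        self._lock = threading.Lock()  # serialize concurrent mode switches
        self.mode = Mode.SHADOW

    def ensure_reasoning_ready(self) -> None:
        """Switch to REPLAY if needed; a no-op when already in REPLAY."""
        with self._lock:
            if self.mode is Mode.REPLAY:
                return
            self._stop_services()         # free VRAM held by non-essential services
            self._load_reasoning_model()  # e.g. warm up qwen2.5vl:7b
            self.mode = Mode.REPLAY

    def ensure_shadow(self) -> None:
        """Hypothetical inverse switch: unload the VLM, restart services."""
        with self._lock:
            if self.mode is Mode.SHADOW:
                return
            self._unload_reasoning_model()
            self._start_services()
            self.mode = Mode.SHADOW


_orchestrator: Optional[VRAMOrchestrator] = None
_orchestrator_lock = threading.Lock()


def get_orchestrator() -> VRAMOrchestrator:
    """Process-wide singleton accessor (no-op callbacks in this sketch)."""
    global _orchestrator
    with _orchestrator_lock:
        if _orchestrator is None:
            _orchestrator = VRAMOrchestrator(
                stop_services=lambda: None,
                start_services=lambda: None,
                load_reasoning_model=lambda: None,
                unload_reasoning_model=lambda: None,
            )
        return _orchestrator
```

Because `ensure_reasoning_ready()` returns early when already in REPLAY, calling it before every reasoning step (as `vlm_reason_about_screen()` now does) only pays the switching cost on the first call after a SHADOW period.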
2026-04-20 22:13:29 +02:00
parent cbe8dc95d2
commit 5da4581e76
2 changed files with 196 additions and 1 deletions
--- a/core/execution/input_handler.py
+++ b/core/execution/input_handler.py
@@ -286,8 +286,12 @@ Si tu vois un dialogue ou une popup, indique quel bouton cliquer.
 Si l'écran est normal sans action nécessaire, réponds action="nothing".
 Réponds UNIQUEMENT le JSON, pas d'explication."""

+        from core.cognition.vram_orchestrator import get_orchestrator
+        orch = get_orchestrator()
+        orch.ensure_reasoning_ready()
+
        ollama_url = os.environ.get("OLLAMA_URL", "http://localhost:11434")
-        model = os.environ.get("RPA_REASONING_MODEL", os.environ.get("RPA_VLM_MODEL", "qwen2.5vl:3b"))
+        model = os.environ.get("RPA_REASONING_MODEL", "qwen2.5vl:7b")

        response = requests.post(
            f"{ollama_url}/api/generate",