chore(dgx): WIP consolidation snapshot for poc DGX transfer

Bundles the uncommitted WIP required for the DGX clone/runtime (Option A):
- api_stream.py: replay preflight + model health smoke test + WP-B 403 handler
- de-hardcode VLM: vlm_config, gpu/*, vram_orchestrator, ollama_manager
- stream_processor, semantic_matcher, agent_chat (app/planner/intent)
- workflows.db (acquired data; the artifacts transfer will update it + rewrite paths)
- docs: DGX plans, VLM/grounder benchmarks, SOTA research, June 8 coordination

Snapshot intended for the poc-dgx branch pushed to Gitea so the DGX can clone it.
Anti-secret scan: clean. graphify (embedded repo) excluded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 16:33:58 +02:00
parent f18de016d7
commit 6d34b3cb68
204 changed files with 15744 additions and 47 deletions
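
The "de-hardcode VLM" item replaces literal model names with a call to `get_reasoning_model`. The sketch below illustrates that pattern in isolation. The body of `get_reasoning_model` is an assumption (the real one lives in `core/detection/vlm_config.py`, which is not shown in this excerpt), as is the fallback to the previously hard-coded `gemma4:e4b`; only the `RPA_VLM_MODEL` variable name and the two call sites come from the commit.

```python
import os
from dataclasses import dataclass, field
from typing import Optional

# Assumption: fallback used when RPA_VLM_MODEL is unset. The actual default
# is defined in core.detection.vlm_config, which this excerpt does not show.
_FALLBACK_MODEL = "gemma4:e4b"


def get_reasoning_model() -> str:
    """Hypothetical stand-in: resolve the model name at call time."""
    return os.environ.get("RPA_VLM_MODEL", "").strip() or _FALLBACK_MODEL


@dataclass
class GPUResourceConfig:
    ollama_endpoint: str = "http://localhost:11434"
    # default_factory defers the lookup until the config is instantiated,
    # so the environment is read per instance rather than at import time.
    vlm_model: str = field(default_factory=get_reasoning_model)


class OllamaManager:
    def __init__(
        self,
        endpoint: str = "http://localhost:11434",
        model: Optional[str] = None,
    ):
        self._endpoint = endpoint.rstrip("/")
        # An explicit argument wins; None falls back to the central setting.
        self._model = model or get_reasoning_model()
```

Under these assumptions, a caller that passes `model="..."` keeps full control, while every other call site follows whatever `RPA_VLM_MODEL` is set to on the target machine.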
--- a/core/gpu/__init__.py
+++ b/core/gpu/__init__.py
@@ -2,7 +2,7 @@
 GPU Resource Management Module for RPA Vision V3

 This module provides dynamic GPU resource allocation between ML models:
- Ollama VLM (gemma4:e4b par défaut, configurable via RPA_VLM_MODEL) for UI classification
+- Ollama VLM (modèle central configurable via RPA_VLM_MODEL) for UI classification
 - CLIP (ViT-B-32) for embedding matching

 The GPUResourceManager optimizes VRAM usage by:
--- a/core/gpu/gpu_resource_manager.py
+++ b/core/gpu/gpu_resource_manager.py
@@ -2,7 +2,7 @@
 GPU Resource Manager - Central orchestrator for GPU resource allocation

 Manages dynamic allocation of GPU resources between:
- Ollama VLM (gemma4:e4b par défaut) - ~10 GB VRAM for UI classification
+- Ollama VLM (modèle reasoning/VLM central) - ~10 GB VRAM for UI classification
 - CLIP (ViT-B-32) - ~500 MB VRAM for embedding matching

 Optimizes VRAM usage based on execution mode:
@@ -21,6 +21,8 @@ from datetime import datetime
 from enum import Enum
 from typing import Any, Callable, Dict, Iterator, List, Optional

+from core.detection.vlm_config import get_reasoning_model
+
 logger = logging.getLogger(__name__)


@@ -54,7 +56,7 @@ class VRAMInfo:
 class GPUResourceConfig:
    """Configuration for GPU resource management."""
    ollama_endpoint: str = "http://localhost:11434"
-    vlm_model: str = "gemma4:e4b"
+    vlm_model: str = field(default_factory=get_reasoning_model)
    clip_model: str = "ViT-B-32"
    idle_timeout_seconds: int = 300  # 5 minutes
    vram_threshold_for_clip_gpu_mb: int = 1024  # 1 GB
--- a/core/gpu/ollama_manager.py
+++ b/core/gpu/ollama_manager.py
@@ -13,6 +13,8 @@ from typing import List, Optional

 import aiohttp

+from core.detection.vlm_config import get_reasoning_model
+
 logger = logging.getLogger(__name__)


@@ -32,7 +34,7 @@ class OllamaManager:
    def __init__(
        self,
        endpoint: str = "http://localhost:11434",
-        model: str = "gemma4:e4b",
+        model: Optional[str] = None,
        default_keep_alive: str = "5m"
    ):
        """
@@ -44,7 +46,7 @@ class OllamaManager:
            default_keep_alive: Default keep-alive duration
        """
        self._endpoint = endpoint.rstrip("/")
-        self._model = model
+        self._model = model or get_reasoning_model()
        self._default_keep_alive = default_keep_alive
        self._session: Optional[aiohttp.ClientSession] = None