# Design Document - GPU Resource Manager

## Overview

The GPU Resource Manager is a singleton component that orchestrates dynamic allocation of GPU resources among the ML models of the RPA Vision V3 system. It primarily manages two resources:

1. **Ollama VLM (qwen3-vl:8b)** - ~10.5 GB VRAM, used for UI classification during recording
2. **CLIP (ViT-B-32)** - ~500 MB VRAM, used for embedding matching

The manager optimizes VRAM usage by:

- Unloading the VLM when it is not needed (autopilot mode)
- Migrating CLIP to the GPU when VRAM is available
- Enforcing an idle timeout to release resources automatically

## Architecture

```mermaid
graph TB
    subgraph "GPU Resource Manager"
        GRM[GPUResourceManager]
        OM[OllamaManager]
        CM[CLIPManager]
        VM[VRAMMonitor]
        EE[EventEmitter]
    end

    subgraph "External Services"
        OL[Ollama API :11434]
        NV[nvidia-smi / pynvml]
    end

    subgraph "Consumers"
        EL[ExecutionLoop]
        UD[UIDetector]
        FE[FusionEngine]
    end

    GRM --> OM
    GRM --> CM
    GRM --> VM
    GRM --> EE
    OM --> OL
    VM --> NV
    EL --> GRM
    UD --> GRM
    FE --> GRM
```

## Components and Interfaces

### GPUResourceManager (Singleton)

```python
class GPUResourceManager:
    """Central manager for GPU resources."""

    # Lifecycle
    def __init__(self, config: GPUResourceConfig)
    def shutdown(self) -> None

    # Mode Management
    def set_execution_mode(self, mode: ExecutionMode) -> None
    def get_execution_mode(self) -> ExecutionMode

    # VLM Management
    async def ensure_vlm_loaded(self) -> bool
    async def ensure_vlm_unloaded(self) -> bool
    def is_vlm_loaded(self) -> bool
    def get_vlm_state(self) -> ModelState

    # CLIP Management
    def get_clip_device(self) -> str  # "cpu" or "cuda"
    async def migrate_clip_to_gpu(self) -> bool
    async def migrate_clip_to_cpu(self) -> bool

    # Monitoring
    def get_status(self) -> GPUResourceStatus
    def get_vram_usage(self) -> VRAMInfo

    # Events
    def on_resource_changed(self, callback: Callable) -> None
    def on_mode_changed(self, callback: Callable) -> None
    def on_idle_unload(self, callback: Callable) -> None
```

### OllamaManager

```python
class OllamaManager:
    """Manages the lifecycle of Ollama models."""

    def __init__(self, endpoint: str = "http://localhost:11434")
    async def load_model(self, model: str, keep_alive: str = "5m") -> bool
    async def unload_model(self, model: str) -> bool
    async def is_model_loaded(self, model: str) -> bool
    async def list_loaded_models(self) -> List[str]
    def is_available(self) -> bool
```

### CLIPManager

```python
class CLIPManager:
    """Handles CPU/GPU migration of the CLIP model."""

    def __init__(self, model_name: str = "ViT-B-32")
    def get_current_device(self) -> str
    async def migrate_to_device(self, device: str) -> bool
    def get_model(self) -> Any  # Returns the CLIP model
    def reinitialize_pipeline(self) -> None
```

### VRAMMonitor

```python
class VRAMMonitor:
    """Monitors VRAM usage."""

    def __init__(self, poll_interval_ms: int = 1000)
    def get_vram_info(self) -> VRAMInfo
    def get_available_vram_mb(self) -> int
    def start_monitoring(self) -> None
    def stop_monitoring(self) -> None
    def on_vram_changed(self, callback: Callable, threshold_mb: int = 100) -> None
```
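The OllamaManager interface above maps onto Ollama's documented `keep_alive` semantics: an empty `/api/generate` request loads a model and keeps it resident for the given duration, `"keep_alive": 0` evicts it immediately, and `/api/ps` lists resident models. Below is a minimal sketch of that wiring, assuming the `aiohttp` client (the design does not prescribe an HTTP library) and collapsing all error handling to a `False` return:

```python
# Sketch of OllamaManager internals; not the authoritative implementation.
import aiohttp
from typing import List


class OllamaManager:
    """Manages the lifecycle of Ollama models via the HTTP API."""

    def __init__(self, endpoint: str = "http://localhost:11434") -> None:
        self._endpoint = endpoint

    async def load_model(self, model: str, keep_alive: str = "5m") -> bool:
        # A generate request with no prompt loads the model into memory;
        # keep_alive tells the server how long to keep it in VRAM.
        # The 30 s budget mirrors load_timeout_seconds in the config.
        try:
            async with aiohttp.ClientSession() as session:
                async with session.post(
                    f"{self._endpoint}/api/generate",
                    json={"model": model, "keep_alive": keep_alive},
                    timeout=aiohttp.ClientTimeout(total=30),
                ) as resp:
                    return resp.status == 200
        except aiohttp.ClientError:
            return False

    async def unload_model(self, model: str) -> bool:
        # keep_alive=0 asks Ollama to evict the model immediately
        # (5 s budget, mirroring unload_timeout_seconds).
        try:
            async with aiohttp.ClientSession() as session:
                async with session.post(
                    f"{self._endpoint}/api/generate",
                    json={"model": model, "keep_alive": 0},
                    timeout=aiohttp.ClientTimeout(total=5),
                ) as resp:
                    return resp.status == 200
        except aiohttp.ClientError:
            return False

    async def list_loaded_models(self) -> List[str]:
        # /api/ps reports the models currently resident in memory.
        async with aiohttp.ClientSession() as session:
            async with session.get(f"{self._endpoint}/api/ps") as resp:
                data = await resp.json()
                return [m["name"] for m in data.get("models", [])]
```

Loading via an empty generate request avoids issuing a real inference just to warm the model, and keeps both load and unload on the same endpoint.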
## Data Models

```python
from enum import Enum
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime

class ExecutionMode(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    AUTOPILOT = "autopilot"

class ModelState(str, Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    LOADED = "loaded"
    UNLOADING = "unloading"
    ERROR = "error"

@dataclass
class VRAMInfo:
    total_mb: int
    used_mb: int
    free_mb: int
    gpu_name: str
    gpu_utilization_percent: int

@dataclass
class GPUResourceStatus:
    execution_mode: ExecutionMode
    vlm_state: ModelState
    vlm_model: str
    clip_device: str
    vram: VRAMInfo
    idle_timeout_seconds: int
    last_vlm_request: Optional[datetime]
    degraded_mode: bool
    degraded_reason: Optional[str]

@dataclass
class GPUResourceConfig:
    ollama_endpoint: str = "http://localhost:11434"
    vlm_model: str = "qwen3-vl:8b"
    clip_model: str = "ViT-B-32"
    idle_timeout_seconds: int = 300  # 5 minutes
    vram_threshold_for_clip_gpu_mb: int = 1024  # 1 GB
    max_load_retries: int = 3
    load_timeout_seconds: int = 30
    unload_timeout_seconds: int = 5

@dataclass
class ResourceChangedEvent:
    timestamp: datetime
    event_type: str  # "vram_changed", "model_loaded", "model_unloaded", "device_changed"
    details: dict
```

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system; essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Mode transition triggers VLM unload

*For any* GPU Resource Manager in RECORDING mode with the VLM loaded, transitioning to AUTOPILOT mode should result in the VLM being unloaded within 5 seconds.

**Validates: Requirements 1.1**

### Property 2: Mode transition triggers VLM load

*For any* GPU Resource Manager in AUTOPILOT mode with the VLM unloaded, transitioning to RECORDING mode should result in the VLM being loaded within 30 seconds.

**Validates: Requirements 1.2**

### Property 3: CLIP on GPU in AUTOPILOT

*For any* GPU Resource Manager in AUTOPILOT mode with available VRAM > 1 GB, CLIP should be on the GPU device.

**Validates: Requirements 1.3, 3.1**

### Property 4: VRAM decrease on VLM unload

*For any* VLM unload operation, VRAM usage should decrease by at least 8 GB.

**Validates: Requirements 1.4**

### Property 5: Status query completeness

*For any* call to get_status(), the returned GPUResourceStatus should contain valid values for all fields, including vram, vlm_state, clip_device, and execution_mode.

**Validates: Requirements 2.1**

### Property 6: CLIP migration ordering

*For any* VLM load request when CLIP is on the GPU, CLIP should be migrated to the CPU before VLM loading completes.

**Validates: Requirements 3.2**

### Property 7: Embedding pipeline consistency

*For any* CLIP device change, the embedding pipeline should produce valid embeddings after reinitialization.

**Validates: Requirements 3.3**

### Property 8: Idle timeout behavior

*For any* configured idle_timeout value, the VLM should be unloaded after that duration of inactivity (not the default).

**Validates: Requirements 4.1, 4.3**

### Property 9: On-demand VLM loading

*For any* VLM request when the VLM is unloaded, the request should complete successfully after the VLM is loaded.

**Validates: Requirements 4.2**

### Property 10: ensure_vlm_loaded blocking

*For any* call to ensure_vlm_loaded(), the function should only return when is_vlm_loaded() returns True.

**Validates: Requirements 5.1**

### Property 11: ensure_vlm_unloaded blocking

*For any* call to ensure_vlm_unloaded(), the function should only return when is_vlm_loaded() returns False.

**Validates: Requirements 5.2**
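Properties 10 and 11, together with the sequential-processing property below, suggest serializing every load/unload behind a single `asyncio.Lock`. The sketch below shows one possible shape, not the prescribed implementation; the `_VLMGate` name and retry loop are illustrative, and a real manager would also honor `load_timeout_seconds` and emit state-change events:

```python
# Illustrative sketch of the blocking contract behind Properties 10/11/13.
import asyncio


class _VLMGate:
    """Hypothetical internal gate serializing VLM load/unload operations."""

    def __init__(self, ollama: "OllamaManager", model: str, max_retries: int = 3) -> None:
        self._ollama = ollama
        self._model = model
        self._max_retries = max_retries
        self._lock = asyncio.Lock()  # one lock: operations run one at a time
        self._loaded = False

    def is_vlm_loaded(self) -> bool:
        return self._loaded

    async def ensure_vlm_loaded(self) -> bool:
        async with self._lock:  # Property 13: no concurrent model operations
            if self._loaded:
                return True
            for _ in range(self._max_retries):
                if await self._ollama.load_model(self._model):
                    self._loaded = True  # Property 10: True only once loaded
                    return True
            return False  # degraded path: a logged failure, never a hang

    async def ensure_vlm_unloaded(self) -> bool:
        async with self._lock:
            if not self._loaded:
                return True
            if await self._ollama.unload_model(self._model):
                self._loaded = False  # Property 11: False only once unloaded
                return True
            return False
```

On the success path both methods return only after the underlying Ollama call has confirmed the state change, which is exactly the postcondition the two blocking properties require.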
### Property 12: get_clip_device validity

*For any* call to get_clip_device(), the return value should be either "cpu" or "cuda".

**Validates: Requirements 5.3**

### Property 13: Sequential operation processing

*For any* concurrent model operations, they should be processed sequentially without race conditions.

**Validates: Requirements 5.4**

## Error Handling

### Ollama Unavailable

- Detect via connection timeout or HTTP error
- Set `degraded_mode = True` with a reason
- CLIP continues on CPU
- VLM operations return False with a logged warning
- Periodic retry every 30 seconds

### GPU Not Available

- Detect via pynvml initialization failure
- Force CPU-only mode for all models
- Log a warning at startup
- All GPU migration requests return False gracefully

### VRAM Insufficient

- Check available VRAM before operations
- Return an error with current VRAM info
- Suggest unloading other models

### Load/Unload Timeout

- Implement timeout with cancellation
- Retry up to max_load_retries
- Mark the model as ERROR state after repeated failures
- Emit an error event for monitoring

## Testing Strategy

### Unit Tests

- Test each manager component in isolation
- Mock Ollama API responses
- Mock nvidia-smi/pynvml responses
- Test state machine transitions
- Test event emission

### Property-Based Tests (using Hypothesis)

- Generate random sequences of mode transitions
- Verify invariants hold after each transition
- Test concurrent operation handling
- Test timeout behavior with various configurations

### Integration Tests

- Test against a real Ollama instance
- Test on a real GPU (if available)
- Test the full workflow: RECORDING → AUTOPILOT → RECORDING
- Measure actual VRAM changes

### Test Configuration

```python
# pytest configuration for property tests
HYPOTHESIS_SETTINGS = {
    "max_examples": 100,
    "deadline": 30000,  # 30 seconds (ms units) for GPU operations
}
```
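As a concrete instance of the property-based strategy, a Hypothesis test for Property 12 might generate random mode sequences and check the device invariant after every transition. The module path and `make_manager()` fixture below are hypothetical test scaffolding, not part of the design:

```python
# Sketch of a Hypothesis property test for Property 12 (get_clip_device validity).
from hypothesis import given, settings, strategies as st

from gpu_resource_manager import ExecutionMode  # hypothetical module layout
from tests.fixtures import make_manager         # hypothetical fixture


@settings(max_examples=100, deadline=30000)  # mirrors HYPOTHESIS_SETTINGS above
@given(modes=st.lists(st.sampled_from(list(ExecutionMode)), min_size=1, max_size=8))
def test_clip_device_always_valid(modes):
    manager = make_manager()  # fresh manager per generated example
    for mode in modes:
        manager.set_execution_mode(mode)
        # Property 12: get_clip_device() may only ever report "cpu" or "cuda".
        assert manager.get_clip_device() in ("cpu", "cuda")
```

Building a fresh manager per example keeps singleton state from leaking between generated sequences, which is what makes the invariant check meaningful across runs.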