Design Document - GPU Resource Manager

Overview

The GPU Resource Manager is a singleton component that orchestrates the dynamic allocation of GPU resources among the ML models of the RPA Vision V3 system. It primarily manages two resources:

  1. Ollama VLM (qwen3-vl:8b) - ~10.5 GB VRAM, used for UI classification during recording
  2. CLIP (ViT-B-32) - ~500 MB VRAM, used for embedding matching

The manager optimizes VRAM usage by:

  • Unloading the VLM when it is not needed (autopilot mode)
  • Migrating CLIP to the GPU when VRAM is available
  • Enforcing an idle timeout to release resources automatically, as sketched below
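
The intended call pattern, as a minimal sketch (the module path gpu_resource_manager and these call sites are assumptions; the interfaces match those defined under Components and Interfaces):

import asyncio

from gpu_resource_manager import (
    ExecutionMode, GPUResourceConfig, GPUResourceManager)

async def main():
    manager = GPUResourceManager(GPUResourceConfig())

    # Recording: the VLM must be resident for UI classification.
    manager.set_execution_mode(ExecutionMode.RECORDING)
    await manager.ensure_vlm_loaded()

    # Autopilot: free ~10.5 GB of VRAM and hand the budget to CLIP.
    manager.set_execution_mode(ExecutionMode.AUTOPILOT)
    await manager.ensure_vlm_unloaded()
    await manager.migrate_clip_to_gpu()

    print(manager.get_status())
    manager.shutdown()

asyncio.run(main())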

Architecture

graph TB
    subgraph "GPU Resource Manager"
        GRM[GPUResourceManager]
        OM[OllamaManager]
        CM[CLIPManager]
        VM[VRAMMonitor]
        EE[EventEmitter]
    end
    
    subgraph "External Services"
        OL[Ollama API :11434]
        NV[nvidia-smi / pynvml]
    end
    
    subgraph "Consumers"
        EL[ExecutionLoop]
        UD[UIDetector]
        FE[FusionEngine]
    end
    
    GRM --> OM
    GRM --> CM
    GRM --> VM
    GRM --> EE
    
    OM --> OL
    VM --> NV
    
    EL --> GRM
    UD --> GRM
    FE --> GRM

Components and Interfaces

GPUResourceManager (Singleton)

class GPUResourceManager:
    """Gestionnaire central des ressources GPU."""
    
    # Lifecycle
    def __init__(self, config: GPUResourceConfig)
    def shutdown(self) -> None
    
    # Mode Management
    def set_execution_mode(self, mode: ExecutionMode) -> None
    def get_execution_mode(self) -> ExecutionMode
    
    # VLM Management
    async def ensure_vlm_loaded(self) -> bool
    async def ensure_vlm_unloaded(self) -> bool
    def is_vlm_loaded(self) -> bool
    def get_vlm_state(self) -> ModelState
    
    # CLIP Management
    def get_clip_device(self) -> str  # "cpu" or "cuda"
    async def migrate_clip_to_gpu(self) -> bool
    async def migrate_clip_to_cpu(self) -> bool
    
    # Monitoring
    def get_status(self) -> GPUResourceStatus
    def get_vram_usage(self) -> VRAMInfo
    
    # Events
    def on_resource_changed(self, callback: Callable) -> None
    def on_mode_changed(self, callback: Callable) -> None
    def on_idle_unload(self, callback: Callable) -> None
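
A possible singleton accessor, as a sketch (the lock-guarded module-level instance is an assumption about how the singleton is realized, not the mandated design):

import threading

_instance: "GPUResourceManager | None" = None
_instance_lock = threading.Lock()

def get_gpu_resource_manager(config: "GPUResourceConfig | None" = None) -> "GPUResourceManager":
    """Return the process-wide manager, creating it on first use."""
    global _instance
    with _instance_lock:  # serializes first-time construction across threads
        if _instance is None:
            _instance = GPUResourceManager(config or GPUResourceConfig())
        return _instance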

OllamaManager

class OllamaManager:
    """Gère le cycle de vie des modèles Ollama."""
    
    def __init__(self, endpoint: str = "http://localhost:11434")
    
    async def load_model(self, model: str, keep_alive: str = "5m") -> bool
    async def unload_model(self, model: str) -> bool
    async def is_model_loaded(self, model: str) -> bool
    async def list_loaded_models(self) -> List[str]
    def is_available(self) -> bool
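
Loading and unloading can piggyback on Ollama's documented keep_alive behaviour: an /api/generate request with no prompt loads the model and sets its keep-alive, keep_alive: 0 evicts it immediately, and /api/ps lists resident models. A minimal sketch (the choice of httpx as HTTP client is an assumption):

import httpx

class OllamaManager:
    def __init__(self, endpoint: str = "http://localhost:11434"):
        self.endpoint = endpoint

    async def load_model(self, model: str, keep_alive: str = "5m") -> bool:
        # A generate request with no prompt pulls the model into VRAM.
        async with httpx.AsyncClient(timeout=60.0) as client:
            r = await client.post(f"{self.endpoint}/api/generate",
                                  json={"model": model, "keep_alive": keep_alive})
            return r.status_code == 200

    async def unload_model(self, model: str) -> bool:
        # keep_alive=0 asks Ollama to evict the model immediately.
        async with httpx.AsyncClient(timeout=10.0) as client:
            r = await client.post(f"{self.endpoint}/api/generate",
                                  json={"model": model, "keep_alive": 0})
            return r.status_code == 200

    async def is_model_loaded(self, model: str) -> bool:
        return model in await self.list_loaded_models()

    async def list_loaded_models(self) -> list[str]:
        async with httpx.AsyncClient(timeout=5.0) as client:
            r = await client.get(f"{self.endpoint}/api/ps")
            return [m["name"] for m in r.json().get("models", [])]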

CLIPManager

class CLIPManager:
    """Gère la migration CPU/GPU du modèle CLIP."""
    
    def __init__(self, model_name: str = "ViT-B-32")
    
    def get_current_device(self) -> str
    async def migrate_to_device(self, device: str) -> bool
    def get_model(self) -> Any  # Returns the CLIP model
    def reinitialize_pipeline(self) -> None
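
The migration itself is a torch .to(device) move; the important detail is releasing cached VRAM after moving back to CPU. A sketch assuming open_clip and PyTorch (the "openai" pretrained tag is an assumption; the real model wrapper may differ):

import torch
import open_clip

class CLIPManager:
    def __init__(self, model_name: str = "ViT-B-32"):
        self.model, _, self.preprocess = open_clip.create_model_and_transforms(
            model_name, pretrained="openai")
        self.device = "cpu"

    def get_current_device(self) -> str:
        return self.device

    async def migrate_to_device(self, device: str) -> bool:
        if device == "cuda" and not torch.cuda.is_available():
            return False  # degrade gracefully on CPU-only hosts
        self.model = self.model.to(device)
        if device == "cpu":
            torch.cuda.empty_cache()  # return the freed memory to the driver
        self.device = device
        return True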

VRAMMonitor

class VRAMMonitor:
    """Surveille l'utilisation de la VRAM."""
    
    def __init__(self, poll_interval_ms: int = 1000)
    
    def get_vram_info(self) -> VRAMInfo
    def get_available_vram_mb(self) -> int
    def start_monitoring(self) -> None
    def stop_monitoring(self) -> None
    def on_vram_changed(self, callback: Callable, threshold_mb: int = 100) -> None
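
The monitor can be backed by pynvml (the nvidia-ml-py bindings named in the architecture diagram). A one-shot probe, as a sketch assuming a single GPU at index 0:

import pynvml

def probe_vram() -> "VRAMInfo":
    """One-shot VRAM reading; VRAMInfo is defined under Data Models."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        name = pynvml.nvmlDeviceGetName(handle)
        return VRAMInfo(
            total_mb=mem.total // (1024 * 1024),
            used_mb=mem.used // (1024 * 1024),
            free_mb=mem.free // (1024 * 1024),
            gpu_name=name.decode() if isinstance(name, bytes) else name,
            gpu_utilization_percent=util.gpu,
        )
    finally:
        pynvml.nvmlShutdown()

In the actual polling monitor, nvmlInit()/nvmlShutdown() would bracket the monitor's lifetime rather than each reading.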

Data Models

from enum import Enum
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime

class ExecutionMode(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    AUTOPILOT = "autopilot"

class ModelState(str, Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    LOADED = "loaded"
    UNLOADING = "unloading"
    ERROR = "error"

@dataclass
class VRAMInfo:
    total_mb: int
    used_mb: int
    free_mb: int
    gpu_name: str
    gpu_utilization_percent: int

@dataclass
class GPUResourceStatus:
    execution_mode: ExecutionMode
    vlm_state: ModelState
    vlm_model: str
    clip_device: str
    vram: VRAMInfo
    idle_timeout_seconds: int
    last_vlm_request: Optional[datetime]
    degraded_mode: bool
    degraded_reason: Optional[str]

@dataclass
class GPUResourceConfig:
    ollama_endpoint: str = "http://localhost:11434"
    vlm_model: str = "qwen3-vl:8b"
    clip_model: str = "ViT-B-32"
    idle_timeout_seconds: int = 300  # 5 minutes
    vram_threshold_for_clip_gpu_mb: int = 1024  # 1 GB
    max_load_retries: int = 3
    load_timeout_seconds: int = 30
    unload_timeout_seconds: int = 5

@dataclass
class ResourceChangedEvent:
    timestamp: datetime
    event_type: str  # "vram_changed", "model_loaded", "model_unloaded", "device_changed"
    details: dict
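
For example, a deployment with a larger GPU might keep the VLM warm longer and raise the CLIP threshold (the values below are purely illustrative):

config = GPUResourceConfig(
    idle_timeout_seconds=600,             # keep the VLM warm for 10 minutes
    vram_threshold_for_clip_gpu_mb=2048,  # only move CLIP when >= 2 GB is free
)
manager = GPUResourceManager(config)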

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Mode transition triggers VLM unload

For any GPU Resource Manager in RECORDING mode with VLM loaded, transitioning to AUTOPILOT mode should result in VLM being unloaded within 5 seconds. Validates: Requirements 1.1

Property 2: Mode transition triggers VLM load

For any GPU Resource Manager in AUTOPILOT mode with VLM unloaded, transitioning to RECORDING mode should result in VLM being loaded within 30 seconds. Validates: Requirements 1.2

Property 3: CLIP on GPU in AUTOPILOT

For any GPU Resource Manager in AUTOPILOT mode with available VRAM > 1GB, CLIP should be on GPU device. Validates: Requirements 1.3, 3.1

Property 4: VRAM decrease on VLM unload

For any VLM unload operation, the VRAM usage should decrease by at least 8 GB. Validates: Requirements 1.4

Property 5: Status query completeness

For any call to get_status(), the returned GPUResourceStatus should contain valid values for all fields including vram, vlm_state, clip_device, and execution_mode. Validates: Requirements 2.1

Property 6: CLIP migration ordering

For any VLM load request when CLIP is on GPU, CLIP should be migrated to CPU before VLM loading completes. Validates: Requirements 3.2

Property 7: Embedding pipeline consistency

For any CLIP device change, the embedding pipeline should produce valid embeddings after reinitialization. Validates: Requirements 3.3

Property 8: Idle timeout behavior

For any configured idle_timeout value, VLM should be unloaded after that duration of inactivity (not the default). Validates: Requirements 4.1, 4.3
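
One way to realize this property, as a sketch (it assumes the manager exposes last_vlm_request as a time.monotonic() timestamp and its config object; neither is fixed by the interfaces above):

import asyncio
import time

async def idle_watchdog(manager: "GPUResourceManager", poll_s: float = 1.0) -> None:
    """Unload the VLM once it has been idle for idle_timeout_seconds."""
    while True:
        await asyncio.sleep(poll_s)
        last = manager.last_vlm_request
        if (manager.is_vlm_loaded()
                and last is not None
                and time.monotonic() - last > manager.config.idle_timeout_seconds):
            await manager.ensure_vlm_unloaded()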

Property 9: On-demand VLM loading

For any VLM request when VLM is unloaded, the request should complete successfully after VLM is loaded. Validates: Requirements 4.2

Property 10: ensure_vlm_loaded blocking

For any call to ensure_vlm_loaded(), the function should only return when is_vlm_loaded() returns True. Validates: Requirements 5.1

Property 11: ensure_vlm_unloaded blocking

For any call to ensure_vlm_unloaded(), the function should only return when is_vlm_loaded() returns False. Validates: Requirements 5.2

Property 12: get_clip_device validity

For any call to get_clip_device(), the return value should be either "cpu" or "cuda". Validates: Requirements 5.3

Property 13: Sequential operation processing

For any concurrent model operations, they should be processed sequentially without race conditions. Validates: Requirements 5.4
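
A straightforward way to satisfy Properties 10, 11, and 13 together is to funnel every model operation through a single asyncio.Lock, as in this sketch (internal names such as _load_vlm are assumptions):

import asyncio

class GPUResourceManager:
    def __init__(self, config: "GPUResourceConfig"):
        self.config = config
        self._op_lock = asyncio.Lock()  # serializes all load/unload/migrate calls

    async def ensure_vlm_loaded(self) -> bool:
        async with self._op_lock:  # concurrent callers queue here in FIFO order
            if self.is_vlm_loaded():
                return True
            return await self._load_vlm()  # returns only once the model is resident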

Error Handling

Ollama Unavailable

  • Detect via connection timeout or HTTP error
  • Set degraded_mode = True with reason
  • CLIP continues on CPU
  • VLM operations return False with logged warning
  • Periodic retry every 30 seconds (see the sketch after this list)
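
A degraded-mode recovery loop might look like this (the internal _ollama handle and mutable degraded_mode/degraded_reason fields on the manager are assumptions):

import asyncio

async def ollama_recovery_loop(manager: "GPUResourceManager") -> None:
    """Probe Ollama every 30 s while degraded; clear the flag on recovery."""
    while manager.degraded_mode:
        await asyncio.sleep(30)
        if manager._ollama.is_available():
            manager.degraded_mode = False
            manager.degraded_reason = None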

GPU Not Available

  • Detect via pynvml initialization failure
  • Force CPU-only mode for all models
  • Log warning at startup
  • All GPU migration requests return False gracefully

VRAM Insufficient

  • Check available VRAM before operations
  • Return error with current VRAM info
  • Suggest unloading other models

Load/Unload Timeout

  • Implement timeout with cancellation
  • Retry up to max_load_retries (see the sketch after this list)
  • Mark model as ERROR state after failures
  • Emit error event for monitoring
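
A timeout-and-retry wrapper, as a sketch built on asyncio.wait_for, which cancels the pending operation when the timeout fires (the internal _vlm_state field and _emit hook are assumptions):

import asyncio

async def load_vlm_with_retries(manager: "GPUResourceManager") -> bool:
    cfg = manager.config
    for _ in range(cfg.max_load_retries):
        try:
            return await asyncio.wait_for(
                manager._ollama.load_model(cfg.vlm_model),
                timeout=cfg.load_timeout_seconds)
        except asyncio.TimeoutError:
            continue  # wait_for already cancelled the stalled load
    manager._vlm_state = ModelState.ERROR
    manager._emit("model_load_failed")  # assumed event hook for monitoring
    return False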

Testing Strategy

Unit Tests

  • Test each manager component in isolation
  • Mock Ollama API responses
  • Mock nvidia-smi/pynvml responses
  • Test state machine transitions
  • Test event emission

Property-Based Tests (using Hypothesis)

  • Generate random sequences of mode transitions (see the sketch after this list)
  • Verify invariants hold after each transition
  • Test concurrent operation handling
  • Test timeout behavior with various configurations
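
A sketch of such a property test with Hypothesis (the test body assumes the interfaces defined above and a synchronous set_execution_mode):

from hypothesis import given, settings, strategies as st

@given(modes=st.lists(st.sampled_from(list(ExecutionMode)), min_size=1, max_size=20))
@settings(deadline=None)  # GPU work can exceed Hypothesis's default deadline
def test_mode_transitions_preserve_invariants(modes):
    manager = GPUResourceManager(GPUResourceConfig())
    try:
        for mode in modes:
            manager.set_execution_mode(mode)
            status = manager.get_status()
            assert status.execution_mode == mode          # mode is always reflected
            assert status.clip_device in ("cpu", "cuda")  # Property 12
    finally:
        manager.shutdown()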

Integration Tests

  • Test with real Ollama instance
  • Test with real GPU (if available)
  • Test full workflow: RECORDING → AUTOPILOT → RECORDING
  • Measure actual VRAM changes

Test Configuration

# Hypothesis settings profile for property tests (e.g. in conftest.py)
from hypothesis import settings

settings.register_profile(
    "gpu",
    max_examples=100,
    deadline=30_000,  # 30-second deadline per example, for slow GPU operations
)
settings.load_profile("gpu")