# Design Document - GPU Resource Manager
## Overview
The GPU Resource Manager is a singleton component that orchestrates the dynamic allocation of GPU resources among the ML models of the RPA Vision V3 system. It manages two main resources:
1. **Ollama VLM (qwen3-vl:8b)** - ~10.5 GB of VRAM, used for UI classification during recording
2. **CLIP (ViT-B-32)** - ~500 MB of VRAM, used for embedding matching
The manager optimizes VRAM usage by:
- Unloading the VLM when it is not needed (autopilot mode)
- Migrating CLIP to the GPU when VRAM is available
- Enforcing an idle timeout that releases resources automatically
## Architecture
```mermaid
graph TB
    subgraph "GPU Resource Manager"
        GRM[GPUResourceManager]
        OM[OllamaManager]
        CM[CLIPManager]
        VM[VRAMMonitor]
        EE[EventEmitter]
    end
    subgraph "External Services"
        OL[Ollama API :11434]
        NV[nvidia-smi / pynvml]
    end
    subgraph "Consumers"
        EL[ExecutionLoop]
        UD[UIDetector]
        FE[FusionEngine]
    end
    GRM --> OM
    GRM --> CM
    GRM --> VM
    GRM --> EE
    OM --> OL
    VM --> NV
    EL --> GRM
    UD --> GRM
    FE --> GRM
```
## Components and Interfaces
### GPUResourceManager (Singleton)
```python
class GPUResourceManager:
    """Central manager for GPU resources."""

    # Lifecycle
    def __init__(self, config: GPUResourceConfig): ...
    def shutdown(self) -> None: ...

    # Mode Management
    def set_execution_mode(self, mode: ExecutionMode) -> None: ...
    def get_execution_mode(self) -> ExecutionMode: ...

    # VLM Management
    async def ensure_vlm_loaded(self) -> bool: ...
    async def ensure_vlm_unloaded(self) -> bool: ...
    def is_vlm_loaded(self) -> bool: ...
    def get_vlm_state(self) -> ModelState: ...

    # CLIP Management
    def get_clip_device(self) -> str: ...  # "cpu" or "cuda"
    async def migrate_clip_to_gpu(self) -> bool: ...
    async def migrate_clip_to_cpu(self) -> bool: ...

    # Monitoring
    def get_status(self) -> GPUResourceStatus: ...
    def get_vram_usage(self) -> VRAMInfo: ...

    # Events
    def on_resource_changed(self, callback: Callable) -> None: ...
    def on_mode_changed(self, callback: Callable) -> None: ...
    def on_idle_unload(self, callback: Callable) -> None: ...
```
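For illustration, a consumer-side sketch against the interface above (whether `set_execution_mode` performs the unload itself, as Property 1 below implies, or the caller invokes the primitives explicitly is an internal choice; the explicit variant is shown):
```python
async def enter_autopilot(manager: GPUResourceManager) -> None:
    manager.set_execution_mode(ExecutionMode.AUTOPILOT)
    # The VLM is only needed while recording; free its ~10.5 GB.
    await manager.ensure_vlm_unloaded()
    # Promote CLIP to the GPU when enough VRAM is free (threshold from config).
    if manager.get_vram_usage().free_mb > 1024:
        await manager.migrate_clip_to_gpu()
    status = manager.get_status()
    print(status.execution_mode, status.vlm_state, status.clip_device)
```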
### OllamaManager
```python
class OllamaManager:
    """Manages the lifecycle of Ollama models."""

    def __init__(self, endpoint: str = "http://localhost:11434"): ...
    async def load_model(self, model: str, keep_alive: str = "5m") -> bool: ...
    async def unload_model(self, model: str) -> bool: ...
    async def is_model_loaded(self, model: str) -> bool: ...
    async def list_loaded_models(self) -> List[str]: ...
    def is_available(self) -> bool: ...
```
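A plausible implementation sketch, not the project's actual code: Ollama's documented `keep_alive` parameter drives residency (a `/api/generate` call with no prompt loads the model, `keep_alive: 0` evicts it, and `/api/ps` lists what is resident). `httpx` is assumed as the HTTP client.
```python
import httpx
from typing import List

class OllamaManager:
    def __init__(self, endpoint: str = "http://localhost:11434") -> None:
        self.endpoint = endpoint

    async def load_model(self, model: str, keep_alive: str = "5m") -> bool:
        # A generate request with no prompt loads the model and keeps it
        # in memory for the keep_alive duration.
        async with httpx.AsyncClient() as client:
            r = await client.post(f"{self.endpoint}/api/generate",
                                  json={"model": model, "keep_alive": keep_alive},
                                  timeout=60.0)
        return r.status_code == 200

    async def unload_model(self, model: str) -> bool:
        # keep_alive = 0 asks Ollama to evict the model immediately.
        async with httpx.AsyncClient() as client:
            r = await client.post(f"{self.endpoint}/api/generate",
                                  json={"model": model, "keep_alive": 0},
                                  timeout=10.0)
        return r.status_code == 200

    async def list_loaded_models(self) -> List[str]:
        # /api/ps reports the models currently resident in memory.
        async with httpx.AsyncClient() as client:
            r = await client.get(f"{self.endpoint}/api/ps", timeout=5.0)
        return [m["name"] for m in r.json().get("models", [])]
```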
### CLIPManager
```python
class CLIPManager:
    """Handles CPU/GPU migration of the CLIP model."""

    def __init__(self, model_name: str = "ViT-B-32"): ...
    def get_current_device(self) -> str: ...
    async def migrate_to_device(self, device: str) -> bool: ...
    def get_model(self) -> Any: ...  # returns the CLIP model
    def reinitialize_pipeline(self) -> None: ...
```
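A sketch of the migration core, assuming a torch-based CLIP model (`load_clip` is a hypothetical loader standing in for open_clip or similar):
```python
import torch

class CLIPManager:
    def __init__(self, model_name: str = "ViT-B-32") -> None:
        self.model_name = model_name
        self.device = "cpu"
        self.model = load_clip(model_name)  # hypothetical loader (e.g. open_clip)

    async def migrate_to_device(self, device: str) -> bool:
        if device == "cuda" and not torch.cuda.is_available():
            return False  # degrade gracefully on CPU-only machines
        self.model = self.model.to(device)  # .to() moves all parameters/buffers
        if device == "cpu" and torch.cuda.is_available():
            torch.cuda.empty_cache()  # drop cached CUDA allocations
        self.device = device
        self.reinitialize_pipeline()  # Property 7: rebuild the embedding pipeline
        return True

    def reinitialize_pipeline(self) -> None:
        ...  # re-create preprocessing/encoding bound to the new device
```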
### VRAMMonitor
```python
class VRAMMonitor:
    """Monitors VRAM usage."""

    def __init__(self, poll_interval_ms: int = 1000): ...
    def get_vram_info(self) -> VRAMInfo: ...
    def get_available_vram_mb(self) -> int: ...
    def start_monitoring(self) -> None: ...
    def stop_monitoring(self) -> None: ...
    def on_vram_changed(self, callback: Callable, threshold_mb: int = 100) -> None: ...
```
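A minimal sketch of the NVML reading path using real `pynvml` calls (a production monitor would call `nvmlInit` once and poll on a background thread; per-call init/shutdown here keeps the example self-contained). Fields map onto the `VRAMInfo` dataclass defined in the next section:
```python
import pynvml

def read_vram_info(gpu_index: int = 0) -> "VRAMInfo":
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)        # values in bytes
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        return VRAMInfo(
            total_mb=mem.total // (1024 * 1024),
            used_mb=mem.used // (1024 * 1024),
            free_mb=mem.free // (1024 * 1024),
            gpu_name=name,
            gpu_utilization_percent=util.gpu,
        )
    finally:
        pynvml.nvmlShutdown()
```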
## Data Models
```python
from enum import Enum
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime

class ExecutionMode(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    AUTOPILOT = "autopilot"

class ModelState(str, Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    LOADED = "loaded"
    UNLOADING = "unloading"
    ERROR = "error"

@dataclass
class VRAMInfo:
    total_mb: int
    used_mb: int
    free_mb: int
    gpu_name: str
    gpu_utilization_percent: int

@dataclass
class GPUResourceStatus:
    execution_mode: ExecutionMode
    vlm_state: ModelState
    vlm_model: str
    clip_device: str
    vram: VRAMInfo
    idle_timeout_seconds: int
    last_vlm_request: Optional[datetime]
    degraded_mode: bool
    degraded_reason: Optional[str]

@dataclass
class GPUResourceConfig:
    ollama_endpoint: str = "http://localhost:11434"
    vlm_model: str = "qwen3-vl:8b"
    clip_model: str = "ViT-B-32"
    idle_timeout_seconds: int = 300  # 5 minutes
    vram_threshold_for_clip_gpu_mb: int = 1024  # 1 GB
    max_load_retries: int = 3
    load_timeout_seconds: int = 30
    unload_timeout_seconds: int = 5

@dataclass
class ResourceChangedEvent:
    timestamp: datetime
    event_type: str  # "vram_changed", "model_loaded", "model_unloaded", "device_changed"
    details: dict
```
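Because the dataclasses carry defaults, callers override only what differs. A small illustrative example (the assertion mirrors Property 8 below, which requires the configured value rather than the default to take effect):
```python
# Override only what differs from the defaults.
config = GPUResourceConfig(idle_timeout_seconds=120)  # 2 minutes instead of 5
manager = GPUResourceManager(config)
# Per Property 8 below, the configured value (not the default) must take effect.
assert manager.get_status().idle_timeout_seconds == 120
```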
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Mode transition triggers VLM unload
*For any* GPU Resource Manager in RECORDING mode with VLM loaded, transitioning to AUTOPILOT mode should result in VLM being unloaded within 5 seconds.
**Validates: Requirements 1.1**
### Property 2: Mode transition triggers VLM load
*For any* GPU Resource Manager in AUTOPILOT mode with VLM unloaded, transitioning to RECORDING mode should result in VLM being loaded within 30 seconds.
**Validates: Requirements 1.2**
### Property 3: CLIP on GPU in AUTOPILOT
*For any* GPU Resource Manager in AUTOPILOT mode with available VRAM > 1GB, CLIP should be on GPU device.
**Validates: Requirements 1.3, 3.1**
### Property 4: VRAM decrease on VLM unload
*For any* VLM unload operation, the VRAM usage should decrease by at least 8 GB.
**Validates: Requirements 1.4**
### Property 5: Status query completeness
*For any* call to get_status(), the returned GPUResourceStatus should contain valid values for all fields including vram, vlm_state, clip_device, and execution_mode.
**Validates: Requirements 2.1**
### Property 6: CLIP migration ordering
*For any* VLM load request when CLIP is on GPU, CLIP should be migrated to CPU before VLM loading completes.
**Validates: Requirements 3.2**
### Property 7: Embedding pipeline consistency
*For any* CLIP device change, the embedding pipeline should produce valid embeddings after reinitialization.
**Validates: Requirements 3.3**
### Property 8: Idle timeout behavior
*For any* configured idle_timeout value, the VLM should be unloaded after that configured duration of inactivity, not after the default.
**Validates: Requirements 4.1, 4.3**
### Property 9: On-demand VLM loading
*For any* VLM request when VLM is unloaded, the request should complete successfully after VLM is loaded.
**Validates: Requirements 4.2**
### Property 10: ensure_vlm_loaded blocking
*For any* call to ensure_vlm_loaded(), the function should only return when is_vlm_loaded() returns True.
**Validates: Requirements 5.1**
### Property 11: ensure_vlm_unloaded blocking
*For any* call to ensure_vlm_unloaded(), the function should only return when is_vlm_loaded() returns False.
**Validates: Requirements 5.2**
### Property 12: get_clip_device validity
*For any* call to get_clip_device(), the return value should be either "cpu" or "cuda".
**Validates: Requirements 5.3**
### Property 13: Sequential operation processing
*For any* concurrent model operations, they should be processed sequentially without race conditions.
**Validates: Requirements 5.4**
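One conventional way to satisfy Property 13, shown as an illustrative sketch rather than a prescribed design, is to funnel every load/unload operation through a single `asyncio.Lock`:
```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

class _ModelOpSerializer:
    """Serialize all model operations through one lock (Property 13)."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    async def run(self, op: Callable[[], Awaitable[T]]) -> T:
        # Concurrent callers queue here; operations never interleave.
        async with self._lock:
            return await op()

# Usage inside the manager (illustrative):
#   return await self._serializer.run(lambda: self._ollama.load_model(self._cfg.vlm_model))
```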
## Error Handling
### Ollama Unavailable
- Detect via connection timeout or HTTP error
- Set `degraded_mode = True` with reason
- CLIP continues on CPU
- VLM operations return False with logged warning
- Retry periodically every 30 seconds (see the sketch below)
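An illustrative watchdog for this recovery loop (`/api/tags` is a real Ollama endpoint; `clear_degraded_mode` is a hypothetical helper not part of the interface above):
```python
import asyncio
import httpx

async def ollama_availability_watchdog(manager: GPUResourceManager,
                                       endpoint: str = "http://localhost:11434") -> None:
    while manager.get_status().degraded_mode:
        await asyncio.sleep(30)  # periodic retry interval from the list above
        try:
            async with httpx.AsyncClient() as client:
                r = await client.get(f"{endpoint}/api/tags", timeout=2.0)
            if r.status_code == 200:
                manager.clear_degraded_mode()  # hypothetical: not in the interface above
        except httpx.HTTPError:
            continue  # still unreachable; keep polling
```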
### GPU Not Available
- Detect via pynvml initialization failure
- Force CPU-only mode for all models
- Log warning at startup
- All GPU migration requests return False gracefully
### VRAM Insufficient
- Check available VRAM before operations
- Return error with current VRAM info
- Suggest unloading other models
### Load/Unload Timeout
- Implement timeout with cancellation
- Retry up to max_load_retries
- Mark model as ERROR state after failures
- Emit an error event for monitoring (see the sketch below)
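A sketch of this timeout-and-retry flow, assuming the `OllamaManager` and config defined earlier (`asyncio.wait_for` cancels the pending load before raising):
```python
import asyncio

async def load_vlm_with_retries(ollama: OllamaManager,
                                config: GPUResourceConfig) -> ModelState:
    for attempt in range(config.max_load_retries):
        try:
            ok = await asyncio.wait_for(ollama.load_model(config.vlm_model),
                                        timeout=config.load_timeout_seconds)
            if ok:
                return ModelState.LOADED
        except asyncio.TimeoutError:
            pass  # the pending load was cancelled; try again
    # All retries exhausted: mark ERROR and let monitoring react to the event.
    return ModelState.ERROR
```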
## Testing Strategy
### Unit Tests
- Test each manager component in isolation
- Mock Ollama API responses
- Mock nvidia-smi/pynvml responses
- Test state machine transitions
- Test event emission
### Property-Based Tests (using Hypothesis)
- Generate random sequences of mode transitions (sketched after this list)
- Verify invariants hold after each transition
- Test concurrent operation handling
- Test timeout behavior with various configurations
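A sketch of the first bullet with Hypothesis (`make_test_manager` is a hypothetical factory that wires in the mocked Ollama/NVML backends from the unit-test section):
```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=100, deadline=30000)  # matches HYPOTHESIS_SETTINGS below
@given(modes=st.lists(st.sampled_from(ExecutionMode), min_size=1, max_size=10))
def test_random_mode_transitions_keep_invariants(modes):
    manager = make_test_manager()  # hypothetical factory with mocked backends
    for mode in modes:
        manager.set_execution_mode(mode)
        status = manager.get_status()
        assert status.execution_mode == mode          # mode is applied
        assert status.clip_device in ("cpu", "cuda")  # Property 12
```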
### Integration Tests
- Test with real Ollama instance
- Test with real GPU (if available)
- Test full workflow: RECORDING → AUTOPILOT → RECORDING
- Measure actual VRAM changes
### Test Configuration
```python
# Hypothesis settings for property-based tests
HYPOTHESIS_SETTINGS = {
    "max_examples": 100,
    "deadline": 30000,  # 30 seconds, for slow GPU operations
}
```