# Design Document - GPU Resource Manager
## Overview
The GPU Resource Manager is a singleton component that orchestrates the dynamic allocation of GPU resources among the ML models of the RPA Vision V3 system. It manages two main resources:
1. **Ollama VLM (qwen3-vl:8b)** - ~10.5 GB of VRAM, used for UI classification during recording
2. **CLIP (ViT-B-32)** - ~500 MB of VRAM, used for embedding matching
The manager optimizes VRAM usage by:
- Unloading the VLM when it is not needed (autopilot mode)
- Migrating CLIP to the GPU when VRAM is available
- Enforcing an idle timeout that releases resources automatically
## Architecture
```mermaid
graph TB
    subgraph "GPU Resource Manager"
        GRM[GPUResourceManager]
        OM[OllamaManager]
        CM[CLIPManager]
        VM[VRAMMonitor]
        EE[EventEmitter]
    end
    subgraph "External Services"
        OL[Ollama API :11434]
        NV[nvidia-smi / pynvml]
    end
    subgraph "Consumers"
        EL[ExecutionLoop]
        UD[UIDetector]
        FE[FusionEngine]
    end
    GRM --> OM
    GRM --> CM
    GRM --> VM
    GRM --> EE
    OM --> OL
    VM --> NV
    EL --> GRM
    UD --> GRM
    FE --> GRM
```
## Components and Interfaces
### GPUResourceManager (Singleton)
```python
class GPUResourceManager:
    """Central manager for GPU resources."""

    # Lifecycle
    def __init__(self, config: GPUResourceConfig): ...
    def shutdown(self) -> None: ...

    # Mode Management
    def set_execution_mode(self, mode: ExecutionMode) -> None: ...
    def get_execution_mode(self) -> ExecutionMode: ...

    # VLM Management
    async def ensure_vlm_loaded(self) -> bool: ...
    async def ensure_vlm_unloaded(self) -> bool: ...
    def is_vlm_loaded(self) -> bool: ...
    def get_vlm_state(self) -> ModelState: ...

    # CLIP Management
    def get_clip_device(self) -> str: ...  # "cpu" or "cuda"
    async def migrate_clip_to_gpu(self) -> bool: ...
    async def migrate_clip_to_cpu(self) -> bool: ...

    # Monitoring
    def get_status(self) -> GPUResourceStatus: ...
    def get_vram_usage(self) -> VRAMInfo: ...

    # Events
    def on_resource_changed(self, callback: Callable) -> None: ...
    def on_mode_changed(self, callback: Callable) -> None: ...
    def on_idle_unload(self, callback: Callable) -> None: ...
```
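For illustration, a consumer-side sketch against the interface above (whether `set_execution_mode` performs the unload itself, as Property 1 below implies, or the caller invokes the primitives explicitly is an internal choice; the explicit variant is shown):
```python
async def enter_autopilot(manager: GPUResourceManager) -> None:
    manager.set_execution_mode(ExecutionMode.AUTOPILOT)
    # The VLM is only needed while recording; free its ~10.5 GB.
    await manager.ensure_vlm_unloaded()
    # Promote CLIP to the GPU when enough VRAM is free (threshold from config).
    if manager.get_vram_usage().free_mb > 1024:
        await manager.migrate_clip_to_gpu()
    status = manager.get_status()
    print(status.execution_mode, status.vlm_state, status.clip_device)
```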
### OllamaManager
```python
class OllamaManager:
    """Manages the lifecycle of Ollama models."""

    def __init__(self, endpoint: str = "http://localhost:11434"): ...
    async def load_model(self, model: str, keep_alive: str = "5m") -> bool: ...
    async def unload_model(self, model: str) -> bool: ...
    async def is_model_loaded(self, model: str) -> bool: ...
    async def list_loaded_models(self) -> List[str]: ...
    def is_available(self) -> bool: ...
```
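A plausible implementation sketch, not the project's actual code: Ollama's documented `keep_alive` parameter drives residency (a `/api/generate` call with no prompt loads the model, `keep_alive: 0` evicts it, and `/api/ps` lists what is resident). `httpx` is assumed as the HTTP client.
```python
import httpx
from typing import List

class OllamaManager:
    def __init__(self, endpoint: str = "http://localhost:11434") -> None:
        self.endpoint = endpoint

    async def load_model(self, model: str, keep_alive: str = "5m") -> bool:
        # A generate request with no prompt loads the model and keeps it
        # in memory for the keep_alive duration.
        async with httpx.AsyncClient() as client:
            r = await client.post(f"{self.endpoint}/api/generate",
                                  json={"model": model, "keep_alive": keep_alive},
                                  timeout=60.0)
        return r.status_code == 200

    async def unload_model(self, model: str) -> bool:
        # keep_alive = 0 asks Ollama to evict the model immediately.
        async with httpx.AsyncClient() as client:
            r = await client.post(f"{self.endpoint}/api/generate",
                                  json={"model": model, "keep_alive": 0},
                                  timeout=10.0)
        return r.status_code == 200

    async def list_loaded_models(self) -> List[str]:
        # /api/ps reports the models currently resident in memory.
        async with httpx.AsyncClient() as client:
            r = await client.get(f"{self.endpoint}/api/ps", timeout=5.0)
        return [m["name"] for m in r.json().get("models", [])]
```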
### CLIPManager
```python
class CLIPManager:
    """Handles CPU/GPU migration of the CLIP model."""

    def __init__(self, model_name: str = "ViT-B-32"): ...
    def get_current_device(self) -> str: ...
    async def migrate_to_device(self, device: str) -> bool: ...
    def get_model(self) -> Any: ...  # returns the CLIP model
    def reinitialize_pipeline(self) -> None: ...
```
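A sketch of the migration core, assuming a torch-based CLIP model (`load_clip` is a hypothetical loader standing in for open_clip or similar):
```python
import torch

class CLIPManager:
    def __init__(self, model_name: str = "ViT-B-32") -> None:
        self.model_name = model_name
        self.device = "cpu"
        self.model = load_clip(model_name)  # hypothetical loader (e.g. open_clip)

    async def migrate_to_device(self, device: str) -> bool:
        if device == "cuda" and not torch.cuda.is_available():
            return False  # degrade gracefully on CPU-only machines
        self.model = self.model.to(device)  # .to() moves all parameters/buffers
        if device == "cpu" and torch.cuda.is_available():
            torch.cuda.empty_cache()  # drop cached CUDA allocations
        self.device = device
        self.reinitialize_pipeline()  # Property 7: rebuild the embedding pipeline
        return True

    def reinitialize_pipeline(self) -> None:
        ...  # re-create preprocessing/encoding bound to the new device
```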
### VRAMMonitor
```python
class VRAMMonitor:
    """Monitors VRAM usage."""

    def __init__(self, poll_interval_ms: int = 1000): ...
    def get_vram_info(self) -> VRAMInfo: ...
    def get_available_vram_mb(self) -> int: ...
    def start_monitoring(self) -> None: ...
    def stop_monitoring(self) -> None: ...
    def on_vram_changed(self, callback: Callable, threshold_mb: int = 100) -> None: ...
```
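A minimal sketch of the NVML reading path using real `pynvml` calls (a production monitor would call `nvmlInit` once and poll on a background thread; per-call init/shutdown here keeps the example self-contained). Fields map onto the `VRAMInfo` dataclass defined in the next section:
```python
import pynvml

def read_vram_info(gpu_index: int = 0) -> "VRAMInfo":
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)        # values in bytes
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is a percentage
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        return VRAMInfo(
            total_mb=mem.total // (1024 * 1024),
            used_mb=mem.used // (1024 * 1024),
            free_mb=mem.free // (1024 * 1024),
            gpu_name=name,
            gpu_utilization_percent=util.gpu,
        )
    finally:
        pynvml.nvmlShutdown()
```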
## Data Models
```python
from enum import Enum
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime

class ExecutionMode(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    AUTOPILOT = "autopilot"

class ModelState(str, Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    LOADED = "loaded"
    UNLOADING = "unloading"
    ERROR = "error"

@dataclass
class VRAMInfo:
    total_mb: int
    used_mb: int
    free_mb: int
    gpu_name: str
    gpu_utilization_percent: int

@dataclass
class GPUResourceStatus:
    execution_mode: ExecutionMode
    vlm_state: ModelState
    vlm_model: str
    clip_device: str
    vram: VRAMInfo
    idle_timeout_seconds: int
    last_vlm_request: Optional[datetime]
    degraded_mode: bool
    degraded_reason: Optional[str]

@dataclass
class GPUResourceConfig:
    ollama_endpoint: str = "http://localhost:11434"
    vlm_model: str = "qwen3-vl:8b"
    clip_model: str = "ViT-B-32"
    idle_timeout_seconds: int = 300  # 5 minutes
    vram_threshold_for_clip_gpu_mb: int = 1024  # 1 GB
    max_load_retries: int = 3
    load_timeout_seconds: int = 30
    unload_timeout_seconds: int = 5

@dataclass
class ResourceChangedEvent:
    timestamp: datetime
    event_type: str  # "vram_changed", "model_loaded", "model_unloaded", "device_changed"
    details: dict
```
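Because the dataclasses carry defaults, callers override only what differs. A small illustrative example (the assertion mirrors Property 8 below, which requires the configured value rather than the default to take effect):
```python
# Override only what differs from the defaults.
config = GPUResourceConfig(idle_timeout_seconds=120)  # 2 minutes instead of 5
manager = GPUResourceManager(config)
# Per Property 8 below, the configured value (not the default) must take effect.
assert manager.get_status().idle_timeout_seconds == 120
```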
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Mode transition triggers VLM unload
*For any* GPU Resource Manager in RECORDING mode with VLM loaded, transitioning to AUTOPILOT mode should result in VLM being unloaded within 5 seconds.
**Validates: Requirements 1.1**
### Property 2: Mode transition triggers VLM load
*For any* GPU Resource Manager in AUTOPILOT mode with VLM unloaded, transitioning to RECORDING mode should result in VLM being loaded within 30 seconds.
**Validates: Requirements 1.2**
### Property 3: CLIP on GPU in AUTOPILOT
*For any* GPU Resource Manager in AUTOPILOT mode with available VRAM > 1GB, CLIP should be on GPU device.
**Validates: Requirements 1.3, 3.1**
### Property 4: VRAM decrease on VLM unload
*For any* VLM unload operation, the VRAM usage should decrease by at least 8 GB.
**Validates: Requirements 1.4**
### Property 5: Status query completeness
*For any* call to get_status(), the returned GPUResourceStatus should contain valid values for all fields including vram, vlm_state, clip_device, and execution_mode.
**Validates: Requirements 2.1**
### Property 6: CLIP migration ordering
*For any* VLM load request when CLIP is on GPU, CLIP should be migrated to CPU before VLM loading completes.
**Validates: Requirements 3.2**
### Property 7: Embedding pipeline consistency
*For any* CLIP device change, the embedding pipeline should produce valid embeddings after reinitialization.
**Validates: Requirements 3.3**
### Property 8: Idle timeout behavior
*For any* configured idle_timeout value, the VLM should be unloaded after that configured duration of inactivity, not after the default.
**Validates: Requirements 4.1, 4.3**
### Property 9: On-demand VLM loading
*For any* VLM request when VLM is unloaded, the request should complete successfully after VLM is loaded.
**Validates: Requirements 4.2**
### Property 10: ensure_vlm_loaded blocking
*For any* call to ensure_vlm_loaded(), the function should only return when is_vlm_loaded() returns True.
**Validates: Requirements 5.1**
### Property 11: ensure_vlm_unloaded blocking
*For any* call to ensure_vlm_unloaded(), the function should only return when is_vlm_loaded() returns False.
**Validates: Requirements 5.2**
### Property 12: get_clip_device validity
*For any* call to get_clip_device(), the return value should be either "cpu" or "cuda".
**Validates: Requirements 5.3**
### Property 13: Sequential operation processing
*For any* concurrent model operations, they should be processed sequentially without race conditions.
**Validates: Requirements 5.4**
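One conventional way to satisfy Property 13, shown as an illustrative sketch rather than a prescribed design, is to funnel every load/unload operation through a single `asyncio.Lock`:
```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

class _ModelOpSerializer:
    """Serialize all model operations through one lock (Property 13)."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    async def run(self, op: Callable[[], Awaitable[T]]) -> T:
        # Concurrent callers queue here; operations never interleave.
        async with self._lock:
            return await op()

# Usage inside the manager (illustrative):
#   return await self._serializer.run(lambda: self._ollama.load_model(self._cfg.vlm_model))
```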
## Error Handling
### Ollama Unavailable
- Detect via connection timeout or HTTP error
- Set `degraded_mode = True` with reason
- CLIP continues on CPU
- VLM operations return False with logged warning
- Retry periodically every 30 seconds (see the sketch below)
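An illustrative watchdog for this recovery loop (`/api/tags` is a real Ollama endpoint; `clear_degraded_mode` is a hypothetical helper not part of the interface above):
```python
import asyncio
import httpx

async def ollama_availability_watchdog(manager: GPUResourceManager,
                                       endpoint: str = "http://localhost:11434") -> None:
    while manager.get_status().degraded_mode:
        await asyncio.sleep(30)  # periodic retry interval from the list above
        try:
            async with httpx.AsyncClient() as client:
                r = await client.get(f"{endpoint}/api/tags", timeout=2.0)
            if r.status_code == 200:
                manager.clear_degraded_mode()  # hypothetical: not in the interface above
        except httpx.HTTPError:
            continue  # still unreachable; keep polling
```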
### GPU Not Available
- Detect via pynvml initialization failure
- Force CPU-only mode for all models
- Log warning at startup
- All GPU migration requests return False gracefully
### VRAM Insufficient
- Check available VRAM before operations
- Return error with current VRAM info
- Suggest unloading other models
### Load/Unload Timeout
- Implement timeout with cancellation
- Retry up to max_load_retries
- Mark model as ERROR state after failures
- Emit an error event for monitoring (see the sketch below)
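A sketch of this timeout-and-retry flow, assuming the `OllamaManager` and config defined earlier (`asyncio.wait_for` cancels the pending load before raising):
```python
import asyncio

async def load_vlm_with_retries(ollama: OllamaManager,
                                config: GPUResourceConfig) -> ModelState:
    for attempt in range(config.max_load_retries):
        try:
            ok = await asyncio.wait_for(ollama.load_model(config.vlm_model),
                                        timeout=config.load_timeout_seconds)
            if ok:
                return ModelState.LOADED
        except asyncio.TimeoutError:
            pass  # the pending load was cancelled; try again
    # All retries exhausted: mark ERROR and let monitoring react to the event.
    return ModelState.ERROR
```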
## Testing Strategy
### Unit Tests
- Test each manager component in isolation
- Mock Ollama API responses
- Mock nvidia-smi/pynvml responses
- Test state machine transitions
- Test event emission
### Property-Based Tests (using Hypothesis)
- Generate random sequences of mode transitions (sketched after this list)
- Verify invariants hold after each transition
- Test concurrent operation handling
- Test timeout behavior with various configurations
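A sketch of the first bullet with Hypothesis (`make_test_manager` is a hypothetical factory that wires in the mocked Ollama/NVML backends from the unit-test section):
```python
from hypothesis import given, settings, strategies as st

@settings(max_examples=100, deadline=30000)  # matches HYPOTHESIS_SETTINGS below
@given(modes=st.lists(st.sampled_from(ExecutionMode), min_size=1, max_size=10))
def test_random_mode_transitions_keep_invariants(modes):
    manager = make_test_manager()  # hypothetical factory with mocked backends
    for mode in modes:
        manager.set_execution_mode(mode)
        status = manager.get_status()
        assert status.execution_mode == mode          # mode is applied
        assert status.clip_device in ("cpu", "cuda")  # Property 12
```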
### Integration Tests
- Test with real Ollama instance
- Test with real GPU (if available)
- Test full workflow: RECORDING → AUTOPILOT → RECORDING
- Measure actual VRAM changes
### Test Configuration
```python
# Hypothesis settings for property-based tests
HYPOTHESIS_SETTINGS = {
    "max_examples": 100,
    "deadline": 30000,  # 30 seconds, for slow GPU operations
}
```