Design Document - GPU Resource Manager
Overview
The GPU Resource Manager is a singleton component that orchestrates dynamic allocation of GPU resources among the ML models of the RPA Vision V3 system. It manages two main resources:
- Ollama VLM (qwen3-vl:8b): ~10.5 GB VRAM, used for UI classification during recording
- CLIP (ViT-B-32): ~500 MB VRAM, used for embedding matching
The manager optimizes VRAM usage by:
- Unloading the VLM when it is not needed (autopilot mode)
- Migrating CLIP to the GPU when VRAM is available
- Enforcing an idle timeout to release resources automatically
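The allocation policy above can be sketched as a pure function mapping the execution mode and free VRAM to a target placement. This is an illustrative sketch: the function and field names (`target_allocation`, `vlm_loaded`, `clip_device`) are not part of the interface.

```python
from enum import Enum

class ExecutionMode(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    AUTOPILOT = "autopilot"

def target_allocation(mode: ExecutionMode, free_vram_mb: int,
                      clip_gpu_threshold_mb: int = 1024) -> dict:
    """Target placement of the VLM and CLIP for a given execution mode."""
    if mode is ExecutionMode.RECORDING:
        # The VLM needs ~10.5 GB, so it owns the GPU; CLIP falls back to CPU.
        return {"vlm_loaded": True, "clip_device": "cpu"}
    # IDLE / AUTOPILOT: the VLM stays unloaded; CLIP takes the GPU if VRAM allows.
    clip_device = "cuda" if free_vram_mb > clip_gpu_threshold_mb else "cpu"
    return {"vlm_loaded": False, "clip_device": clip_device}
```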
Architecture
graph TB
    subgraph "GPU Resource Manager"
        GRM[GPUResourceManager]
        OM[OllamaManager]
        CM[CLIPManager]
        VM[VRAMMonitor]
        EE[EventEmitter]
    end
    subgraph "External Services"
        OL[Ollama API :11434]
        NV[nvidia-smi / pynvml]
    end
    subgraph "Consumers"
        EL[ExecutionLoop]
        UD[UIDetector]
        FE[FusionEngine]
    end
    GRM --> OM
    GRM --> CM
    GRM --> VM
    GRM --> EE
    OM --> OL
    VM --> NV
    EL --> GRM
    UD --> GRM
    FE --> GRM
Components and Interfaces
GPUResourceManager (Singleton)
class GPUResourceManager:
    """Central manager for GPU resources."""

    # Lifecycle
    def __init__(self, config: GPUResourceConfig)
    def shutdown(self) -> None

    # Mode Management
    def set_execution_mode(self, mode: ExecutionMode) -> None
    def get_execution_mode(self) -> ExecutionMode

    # VLM Management
    async def ensure_vlm_loaded(self) -> bool
    async def ensure_vlm_unloaded(self) -> bool
    def is_vlm_loaded(self) -> bool
    def get_vlm_state(self) -> ModelState

    # CLIP Management
    def get_clip_device(self) -> str  # "cpu" or "cuda"
    async def migrate_clip_to_gpu(self) -> bool
    async def migrate_clip_to_cpu(self) -> bool

    # Monitoring
    def get_status(self) -> GPUResourceStatus
    def get_vram_usage(self) -> VRAMInfo

    # Events
    def on_resource_changed(self, callback: Callable) -> None
    def on_mode_changed(self, callback: Callable) -> None
    def on_idle_unload(self, callback: Callable) -> None
OllamaManager
class OllamaManager:
    """Manages the lifecycle of Ollama models."""

    def __init__(self, endpoint: str = "http://localhost:11434")
    async def load_model(self, model: str, keep_alive: str = "5m") -> bool
    async def unload_model(self, model: str) -> bool
    async def is_model_loaded(self, model: str) -> bool
    async def list_loaded_models(self) -> List[str]
    def is_available(self) -> bool
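Ollama drives load/unload through the `keep_alive` field of a generate request: an empty request loads and pins the model, `keep_alive: 0` evicts it immediately, and `GET /api/ps` lists the models currently in memory. A sketch with stdlib `urllib` only; error handling and async wrapping omitted.

```python
import json
import urllib.request

OLLAMA = "http://localhost:11434"  # default endpoint from the config below

def load_payload(model: str, keep_alive: str = "5m") -> dict:
    # An empty generate request loads the model and pins it for `keep_alive`.
    return {"model": model, "keep_alive": keep_alive}

def unload_payload(model: str) -> dict:
    # keep_alive = 0 asks Ollama to evict the model immediately.
    return {"model": model, "keep_alive": 0}

def post(path: str, payload: dict) -> dict:
    req = urllib.request.Request(
        OLLAMA + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())

# post("/api/generate", load_payload("qwen3-vl:8b"))    # load / keep warm
# post("/api/generate", unload_payload("qwen3-vl:8b"))  # unload
```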
CLIPManager
class CLIPManager:
    """Handles CPU/GPU migration of the CLIP model."""

    def __init__(self, model_name: str = "ViT-B-32")
    def get_current_device(self) -> str
    async def migrate_to_device(self, device: str) -> bool
    def get_model(self) -> Any  # Returns the CLIP model
    def reinitialize_pipeline(self) -> None
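`migrate_to_device` boils down to torch's `.to()` plus a cache flush so the freed VRAM actually returns to the driver. The sketch below uses duck typing (`model` is any object exposing `.to()`) and imports `torch` lazily so it degrades on CPU-only machines; it is an assumption-laden sketch, not the real implementation.

```python
def migrate_to_device(model, device: str):
    """Move a CLIP model between devices; returns the moved model."""
    if device not in ("cpu", "cuda"):
        raise ValueError(f"unsupported device: {device}")
    moved = model.to(device)
    if device == "cpu":
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()  # hand freed blocks back to the driver
        except ImportError:
            pass  # CPU-only environment: nothing to release
    return moved
```

After a migration, `reinitialize_pipeline()` must rebuild any cached tensors so embeddings are produced on the new device (Property 7).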
VRAMMonitor
class VRAMMonitor:
    """Monitors VRAM usage."""

    def __init__(self, poll_interval_ms: int = 1000)
    def get_vram_info(self) -> VRAMInfo
    def get_available_vram_mb(self) -> int
    def start_monitoring(self) -> None
    def stop_monitoring(self) -> None
    def on_vram_changed(self, callback: Callable, threshold_mb: int = 100) -> None
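`get_vram_info` maps almost directly onto pynvml. In this sketch the byte-to-MB conversion is split into a helper so it can be tested without a GPU; the helper names are illustrative, and `VRAMInfo` is repeated here only to keep the block self-contained.

```python
from dataclasses import dataclass

@dataclass
class VRAMInfo:
    total_mb: int
    used_mb: int
    free_mb: int
    gpu_name: str
    gpu_utilization_percent: int

def to_vram_info(total_b: int, used_b: int, free_b: int,
                 name: str, util: int) -> VRAMInfo:
    mb = 1024 * 1024
    return VRAMInfo(total_b // mb, used_b // mb, free_b // mb, name, util)

def read_vram(index: int = 0) -> VRAMInfo:
    """Poll one NVIDIA GPU; raises pynvml.NVMLError when no GPU is present."""
    import pynvml
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        name = pynvml.nvmlDeviceGetName(handle)  # bytes on older pynvml versions
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        if isinstance(name, bytes):
            name = name.decode()
        return to_vram_info(mem.total, mem.used, mem.free, name, util)
    finally:
        pynvml.nvmlShutdown()
```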
Data Models
from enum import Enum
from dataclasses import dataclass
from typing import Optional, List
from datetime import datetime

class ExecutionMode(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    AUTOPILOT = "autopilot"

class ModelState(str, Enum):
    UNLOADED = "unloaded"
    LOADING = "loading"
    LOADED = "loaded"
    UNLOADING = "unloading"
    ERROR = "error"

@dataclass
class VRAMInfo:
    total_mb: int
    used_mb: int
    free_mb: int
    gpu_name: str
    gpu_utilization_percent: int

@dataclass
class GPUResourceStatus:
    execution_mode: ExecutionMode
    vlm_state: ModelState
    vlm_model: str
    clip_device: str
    vram: VRAMInfo
    idle_timeout_seconds: int
    last_vlm_request: Optional[datetime]
    degraded_mode: bool
    degraded_reason: Optional[str]

@dataclass
class GPUResourceConfig:
    ollama_endpoint: str = "http://localhost:11434"
    vlm_model: str = "qwen3-vl:8b"
    clip_model: str = "ViT-B-32"
    idle_timeout_seconds: int = 300  # 5 minutes
    vram_threshold_for_clip_gpu_mb: int = 1024  # 1 GB
    max_load_retries: int = 3
    load_timeout_seconds: int = 30
    unload_timeout_seconds: int = 5

@dataclass
class ResourceChangedEvent:
    timestamp: datetime
    event_type: str  # "vram_changed", "model_loaded", "model_unloaded", "device_changed"
    details: dict
Correctness Properties
A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.
Property 1: Mode transition triggers VLM unload
For any GPU Resource Manager in RECORDING mode with VLM loaded, transitioning to AUTOPILOT mode should result in VLM being unloaded within 5 seconds. Validates: Requirements 1.1
Property 2: Mode transition triggers VLM load
For any GPU Resource Manager in AUTOPILOT mode with VLM unloaded, transitioning to RECORDING mode should result in VLM being loaded within 30 seconds. Validates: Requirements 1.2
Property 3: CLIP on GPU in AUTOPILOT
For any GPU Resource Manager in AUTOPILOT mode with available VRAM > 1GB, CLIP should be on GPU device. Validates: Requirements 1.3, 3.1
Property 4: VRAM decrease on VLM unload
For any VLM unload operation, the VRAM usage should decrease by at least 8 GB. Validates: Requirements 1.4
Property 5: Status query completeness
For any call to get_status(), the returned GPUResourceStatus should contain valid values for all fields including vram, vlm_state, clip_device, and execution_mode. Validates: Requirements 2.1
Property 6: CLIP migration ordering
For any VLM load request when CLIP is on GPU, CLIP should be migrated to CPU before VLM loading completes. Validates: Requirements 3.2
Property 7: Embedding pipeline consistency
For any CLIP device change, the embedding pipeline should produce valid embeddings after reinitialization. Validates: Requirements 3.3
Property 8: Idle timeout behavior
For any configured idle_timeout value, VLM should be unloaded after that duration of inactivity (not the default). Validates: Requirements 4.1, 4.3
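Property 8's resettable countdown can be sketched with a cancellable asyncio task; the class name `IdleTimer` and the `touch()` hook are illustrative, not part of the interface.

```python
import asyncio

class IdleTimer:
    """Each VLM request resets the countdown; `unload` fires only after
    `timeout` seconds pass with no requests."""
    def __init__(self, timeout: float, unload):
        self.timeout = timeout
        self.unload = unload
        self._task = None

    def touch(self) -> None:
        """Call on every VLM request (must run inside an event loop)."""
        if self._task is not None:
            self._task.cancel()
        self._task = asyncio.create_task(self._countdown())

    async def _countdown(self):
        try:
            await asyncio.sleep(self.timeout)
            await self.unload()
        except asyncio.CancelledError:
            pass  # a new request arrived in time
```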
Property 9: On-demand VLM loading
For any VLM request when VLM is unloaded, the request should complete successfully after VLM is loaded. Validates: Requirements 4.2
Property 10: ensure_vlm_loaded blocking
For any call to ensure_vlm_loaded(), the function should only return when is_vlm_loaded() returns True. Validates: Requirements 5.1
Property 11: ensure_vlm_unloaded blocking
For any call to ensure_vlm_unloaded(), the function should only return when is_vlm_loaded() returns False. Validates: Requirements 5.2
Property 12: get_clip_device validity
For any call to get_clip_device(), the return value should be either "cpu" or "cuda". Validates: Requirements 5.3
Property 13: Sequential operation processing
For any concurrent model operations, they should be processed sequentially without race conditions. Validates: Requirements 5.4
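Properties 10, 11, and 13 combine naturally into one pattern: a single lock serializes all model operations, and the `ensure_*` calls return only once the target state holds. A minimal sketch, with the loader/unloader injected as callables:

```python
import asyncio

class VLMGate:
    def __init__(self, loader, unloader):
        self._loaded = False
        self._lock = asyncio.Lock()
        self._loader = loader
        self._unloader = unloader

    def is_vlm_loaded(self) -> bool:
        return self._loaded

    async def ensure_vlm_loaded(self) -> bool:
        async with self._lock:        # Property 13: one operation at a time
            if not self._loaded:
                await self._loader()  # on-demand load (Property 9)
                self._loaded = True
        return self._loaded           # Property 10: only returns once loaded

    async def ensure_vlm_unloaded(self) -> bool:
        async with self._lock:
            if self._loaded:
                await self._unloader()
                self._loaded = False
        return not self._loaded       # Property 11
```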
Error Handling
Ollama Unavailable
- Detect via connection timeout or HTTP error
- Set degraded_mode = True with the reason
- CLIP continues on CPU
- VLM operations return False with logged warning
- Periodic retry every 30 seconds
GPU Not Available
- Detect via pynvml initialization failure
- Force CPU-only mode for all models
- Log warning at startup
- All GPU migration requests return False gracefully
VRAM Insufficient
- Check available VRAM before operations
- Return error with current VRAM info
- Suggest unloading other models
Load/Unload Timeout
- Implement timeout with cancellation
- Retry up to max_load_retries
- Mark model as ERROR state after failures
- Emit error event for monitoring
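The timeout-with-retries policy above, sketched with `asyncio.wait_for`; the string return values stand in for the LOADED/ERROR model states, and event emission is left to the caller.

```python
import asyncio

async def load_with_retries(load, timeout_s: float = 30.0, retries: int = 3) -> str:
    """Attempt `load()` up to `retries` times, cancelling each attempt
    after `timeout_s` seconds; returns "loaded" or "error"."""
    for _ in range(retries):
        try:
            await asyncio.wait_for(load(), timeout=timeout_s)
            return "loaded"
        except (asyncio.TimeoutError, OSError):
            continue  # retry; the caller emits an error event on final failure
    return "error"
```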
Testing Strategy
Unit Tests
- Test each manager component in isolation
- Mock Ollama API responses
- Mock nvidia-smi/pynvml responses
- Test state machine transitions
- Test event emission
Property-Based Tests (using Hypothesis)
- Generate random sequences of mode transitions
- Verify invariants hold after each transition
- Test concurrent operation handling
- Test timeout behavior with various configurations
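A sketch of the transition-sequence check: the real suite would drive it with Hypothesis (e.g. `st.lists(st.sampled_from(ExecutionMode))`), but the shape of the property is the same with stdlib `random`. The `transition` mapping mirrors Properties 1-3 and is illustrative only.

```python
import random
from enum import Enum

class Mode(str, Enum):
    IDLE = "idle"
    RECORDING = "recording"
    AUTOPILOT = "autopilot"

def transition(mode: Mode) -> dict:
    """Expected resource state after entering `mode`."""
    if mode is Mode.RECORDING:
        return {"vlm_loaded": True, "clip_device": "cpu"}
    return {"vlm_loaded": False, "clip_device": "cuda"}

def check_invariant(state: dict) -> None:
    # The ~10.5 GB VLM and CLIP must never share the GPU.
    assert not (state["vlm_loaded"] and state["clip_device"] == "cuda")

rng = random.Random(0)
for _ in range(200):
    check_invariant(transition(rng.choice(list(Mode))))
```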
Integration Tests
- Test with real Ollama instance
- Test with real GPU (if available)
- Test full workflow: RECORDING → AUTOPILOT → RECORDING
- Measure actual VRAM changes
Test Configuration
# pytest configuration for property tests
HYPOTHESIS_SETTINGS = {
    "max_examples": 100,
    "deadline": 30000,  # 30 seconds for GPU operations
}