chore: sauvegarde complète avant factorisation executor
Some checks failed
security-audit / Bandit (scan statique) (push) Successful in 12s
security-audit / pip-audit (CVE dépendances) (push) Successful in 10s
security-audit / Scan secrets (grep) (push) Successful in 8s
tests / Lint (ruff + black) (push) Successful in 13s
tests / Tests unitaires (sans GPU) (push) Failing after 14s
tests / Tests sécurité (critique) (push) Has been skipped
Some checks failed
security-audit / Bandit (scan statique) (push) Successful in 12s
security-audit / pip-audit (CVE dépendances) (push) Successful in 10s
security-audit / Scan secrets (grep) (push) Successful in 8s
tests / Lint (ruff + black) (push) Successful in 13s
tests / Tests unitaires (sans GPU) (push) Failing after 14s
tests / Tests sécurité (critique) (push) Has been skipped
Point de sauvegarde incluant les fichiers non committés des sessions précédentes (systemd, docs, agents, GPU manager). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -2,7 +2,7 @@
|
||||
GPU Resource Management Module for RPA Vision V3
|
||||
|
||||
This module provides dynamic GPU resource allocation between ML models:
|
||||
- Ollama VLM (qwen3-vl:8b) for UI classification
|
||||
- Ollama VLM (gemma4:e4b par défaut, configurable via RPA_VLM_MODEL) for UI classification
|
||||
- CLIP (ViT-B-32) for embedding matching
|
||||
|
||||
The GPUResourceManager optimizes VRAM usage by:
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
GPU Resource Manager - Central orchestrator for GPU resource allocation
|
||||
|
||||
Manages dynamic allocation of GPU resources between:
|
||||
- Ollama VLM (qwen3-vl:8b) - ~10.5 GB VRAM for UI classification
|
||||
- Ollama VLM (gemma4:e4b par défaut) - ~10 GB VRAM for UI classification
|
||||
- CLIP (ViT-B-32) - ~500 MB VRAM for embedding matching
|
||||
|
||||
Optimizes VRAM usage based on execution mode:
|
||||
@@ -12,13 +12,14 @@ Optimizes VRAM usage based on execution mode:
|
||||
"""
|
||||
|
||||
import asyncio
|
||||
import contextlib
|
||||
import logging
|
||||
import threading
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import datetime
|
||||
from enum import Enum
|
||||
from typing import Any, Callable, Dict, List, Optional
|
||||
from typing import Any, Callable, Dict, Iterator, List, Optional
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
@@ -53,7 +54,7 @@ class VRAMInfo:
|
||||
class GPUResourceConfig:
|
||||
"""Configuration for GPU resource management."""
|
||||
ollama_endpoint: str = "http://localhost:11434"
|
||||
vlm_model: str = "qwen3-vl:8b"
|
||||
vlm_model: str = "gemma4:e4b"
|
||||
clip_model: str = "ViT-B-32"
|
||||
idle_timeout_seconds: int = 300 # 5 minutes
|
||||
vram_threshold_for_clip_gpu_mb: int = 1024 # 1 GB
|
||||
@@ -126,6 +127,12 @@ class GPUResourceManager:
|
||||
# Operation queue for sequential processing
|
||||
self._operation_queue: asyncio.Queue = asyncio.Queue()
|
||||
self._operation_lock = asyncio.Lock()
|
||||
|
||||
# Lock d'inférence synchrone : sérialise les appels GPU concurrents
|
||||
# (ScreenAnalyzer.analyze, UIDetector, CLIP.encode) entre
|
||||
# ExecutionLoop et stream_processor pour éviter la saturation VRAM
|
||||
# sur RTX 5070 (12 Go). Un seul analyze à la fois sur le GPU.
|
||||
self._inference_lock = threading.Lock()
|
||||
|
||||
# Event callbacks
|
||||
self._on_resource_changed: List[Callable[[ResourceChangedEvent], None]] = []
|
||||
@@ -207,7 +214,45 @@ class GPUResourceManager:
|
||||
def get_execution_mode(self) -> ExecutionMode:
|
||||
"""Get the current execution mode."""
|
||||
return self._execution_mode
|
||||
|
||||
|
||||
# =========================================================================
|
||||
# Inference serialization (sync)
|
||||
# =========================================================================
|
||||
|
||||
@contextlib.contextmanager
|
||||
def acquire_inference(self, timeout: Optional[float] = None) -> Iterator[bool]:
|
||||
"""
|
||||
Context manager synchrone pour sérialiser les inférences GPU.
|
||||
|
||||
Garantit qu'un seul appel d'inférence (ScreenAnalyzer.analyze,
|
||||
UIDetector.detect, CLIP.encode…) tourne à la fois sur le GPU.
|
||||
Évite la saturation VRAM quand ExecutionLoop et stream_processor
|
||||
appellent analyze() simultanément sur une RTX 5070 (12 Go).
|
||||
|
||||
Args:
|
||||
timeout: Délai max d'attente (secondes). None = bloquant.
|
||||
|
||||
Yields:
|
||||
True si le lock est acquis, False en cas de timeout.
|
||||
|
||||
Example:
|
||||
>>> with gpu_manager.acquire_inference(timeout=30.0) as acquired:
|
||||
... if not acquired:
|
||||
... logger.warning("GPU lock timeout")
|
||||
... state = analyzer.analyze(path)
|
||||
"""
|
||||
if timeout is None:
|
||||
self._inference_lock.acquire()
|
||||
acquired = True
|
||||
else:
|
||||
acquired = self._inference_lock.acquire(timeout=timeout)
|
||||
|
||||
try:
|
||||
yield acquired
|
||||
finally:
|
||||
if acquired:
|
||||
self._inference_lock.release()
|
||||
|
||||
# =========================================================================
|
||||
# VLM Management
|
||||
# =========================================================================
|
||||
|
||||
@@ -32,7 +32,7 @@ class OllamaManager:
|
||||
def __init__(
|
||||
self,
|
||||
endpoint: str = "http://localhost:11434",
|
||||
model: str = "qwen3-vl:8b",
|
||||
model: str = "gemma4:e4b",
|
||||
default_keep_alive: str = "5m"
|
||||
):
|
||||
"""
|
||||
|
||||
Reference in New Issue
Block a user