chore(dgx): snapshot consolidation WIP pour transfert poc DGX
Regroupe le WIP non committé requis pour le clone/runtime DGX (Option A) : - api_stream.py : préflight replay + smoke santé modèles + handler 403 WP-B - de-hardcode VLM : vlm_config, gpu/*, vram_orchestrator, ollama_manager - stream_processor, semantic_matcher, agent_chat (app/planner/intent) - workflows.db (acquis ; le transfert artifacts le mettra à jour + rewrite chemins) - docs : plans DGX, benchmarks VLM/grounders, recherche SOTA, coordination 8 juin Snapshot destiné à la branche poc-dgx poussée sur Gitea pour cloner le DGX. Scan anti-secret : clean. graphify (repo embarqué) exclu. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -2,7 +2,7 @@
|
||||
GPU Resource Management Module for RPA Vision V3
|
||||
|
||||
This module provides dynamic GPU resource allocation between ML models:
|
||||
- Ollama VLM (gemma4:e4b par défaut, configurable via RPA_VLM_MODEL) for UI classification
|
||||
- Ollama VLM (modèle central configurable via RPA_VLM_MODEL) for UI classification
|
||||
- CLIP (ViT-B-32) for embedding matching
|
||||
|
||||
The GPUResourceManager optimizes VRAM usage by:
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
GPU Resource Manager - Central orchestrator for GPU resource allocation
|
||||
|
||||
Manages dynamic allocation of GPU resources between:
|
||||
- Ollama VLM (gemma4:e4b par défaut) - ~10 GB VRAM for UI classification
|
||||
- Ollama VLM (modèle reasoning/VLM central) - ~10 GB VRAM for UI classification
|
||||
- CLIP (ViT-B-32) - ~500 MB VRAM for embedding matching
|
||||
|
||||
Optimizes VRAM usage based on execution mode:
|
||||
@@ -21,6 +21,8 @@ from datetime import datetime
|
||||
from enum import Enum
|
||||
from typing import Any, Callable, Dict, Iterator, List, Optional
|
||||
|
||||
from core.detection.vlm_config import get_reasoning_model
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@@ -54,7 +56,7 @@ class VRAMInfo:
|
||||
class GPUResourceConfig:
|
||||
"""Configuration for GPU resource management."""
|
||||
ollama_endpoint: str = "http://localhost:11434"
|
||||
vlm_model: str = "gemma4:e4b"
|
||||
vlm_model: str = field(default_factory=get_reasoning_model)
|
||||
clip_model: str = "ViT-B-32"
|
||||
idle_timeout_seconds: int = 300 # 5 minutes
|
||||
vram_threshold_for_clip_gpu_mb: int = 1024 # 1 GB
|
||||
|
||||
@@ -13,6 +13,8 @@ from typing import List, Optional
|
||||
|
||||
import aiohttp
|
||||
|
||||
from core.detection.vlm_config import get_reasoning_model
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
@@ -32,7 +34,7 @@ class OllamaManager:
|
||||
def __init__(
|
||||
self,
|
||||
endpoint: str = "http://localhost:11434",
|
||||
model: str = "gemma4:e4b",
|
||||
model: Optional[str] = None,
|
||||
default_keep_alive: str = "5m"
|
||||
):
|
||||
"""
|
||||
@@ -44,7 +46,7 @@ class OllamaManager:
|
||||
default_keep_alive: Default keep-alive duration
|
||||
"""
|
||||
self._endpoint = endpoint.rstrip("/")
|
||||
self._model = model
|
||||
self._model = model or get_reasoning_model()
|
||||
self._default_keep_alive = default_keep_alive
|
||||
self._session: Optional[aiohttp.ClientSession] = None
|
||||
|
||||
|
||||
Reference in New Issue
Block a user