rpa_vision_v3/deploy/VRAM_BUDGET.md

# VRAM Budget: RPA Vision V3

Target GPU: **NVIDIA RTX 5070, 12 GB VRAM**.

This document pins down the allocation choices to avoid overlaps and CUDA crashes when several services run in parallel. Current as of 26 April 2026 (Phase 1A: persistent grounding service).

## Usage assumption

Dom does either **RECORD** (creating a workflow in the VWB) or **REPLAY** (executing a workflow), but never both at the same time. This assumption is what makes the modes below fit within the budget.

## Permanently resident models (3.6 GB)

These models stay in VRAM at all times to avoid paying the load cost on every call.

| Model | VRAM | Hosting | Why permanent |
|---|---|---|---|
| **InfiGUI-G1-3B** (4-bit NF4) | 2.4 GB | `rpa-grounding.service` (Unix socket) | Otherwise each click pays ~15 s of model loading. |
| **CLIP ViT-B/32** | 0.7 GB | In-process singleton (`core/gpu/clip_manager.py`) | Called on every frame of the pipeline. |
| **EasyOCR fr+en** | 0.5 GB | In-process singleton (DialogHandler, TitleVerifier, input_handler) | Polled every 500 ms; on CPU it is 5–10× slower. |

**Note**: CLIP and EasyOCR are NOT shared through the service. There would be no gain as long as only one consumer process runs at a time (record OR replay). Worth revisiting in Phase 1B if real multi-process contention is measured.

## Ollama VLM: one at a time (config `ollama-vram-policy.conf`)

`OLLAMA_MAX_LOADED_MODELS=1` forces Ollama to keep only one VLM in VRAM. The swap happens on a mode change (~5 s, once).

| Mode | Ollama VLM | Ollama VRAM | Usage |
|---|---|---|---|
| **REPLAY** | `qwen2.5vl:7b` | ~5.5 GB | High-level reasoning (`observe_reason_act`) |
| **RECORD** | `qwen2.5vl:3b` | ~3.0 GB | Anchor description (`capture.py`) |

Models referenced but NOT resident (loaded on demand, subject to `OLLAMA_MAX_LOADED_MODELS=1`): `qwen3-vl:8b`, `gemma4:e4b`, `gemma4:latest`.
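
The one-VLM-at-a-time policy can be checked at runtime. The sketch below is only an illustration, not part of the repository: it assumes Ollama listens on its default address (`http://localhost:11434`) and uses the `GET /api/ps` endpoint, which lists the models currently loaded. The names `EXPECTED_VLM`, `fetch_loaded_models`, and `check_policy` are hypothetical.

```python
import json
import urllib.request

# Mode -> expected Ollama VLM, copied from the table above.
EXPECTED_VLM = {"REPLAY": "qwen2.5vl:7b", "RECORD": "qwen2.5vl:3b"}


def fetch_loaded_models(base_url="http://localhost:11434"):
    """Return the names of the models Ollama currently holds in memory.

    Assumption: Ollama runs on its default port and exposes GET /api/ps.
    """
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


def check_policy(loaded, mode):
    """Return a list of policy findings for the given mode (empty = OK)."""
    problems = []
    if len(loaded) > 1:
        problems.append(f"{len(loaded)} models loaded, expected at most 1")
    expected = EXPECTED_VLM[mode]
    for name in loaded:
        if name != expected:
            # May be legitimate if an on-demand model was just requested.
            problems.append(f"{name} loaded, expected {expected} in {mode}")
    return problems
```

A typical call would be `check_policy(fetch_loaded_models(), "REPLAY")`. A name mismatch is not necessarily a fault, since the on-demand models listed above can legitimately take the single slot.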

## Total VRAM matrix

| Mode | Resident (GB) | Ollama (GB) | **Total** | **Headroom on 12 GB** |
|---|---|---|---|---|
| Idle (lazy) | 2.4 (InfiGUI only) | 0 | **2.4 GB** | 9.6 GB |
| RECORD active | 3.6 | 3.0 (qwen 3b) | **6.6 GB** | 5.4 GB ✅ |
| REPLAY active | 3.6 | 5.5 (qwen 7b) | **9.1 GB** | 2.9 GB ✅ |

The headroom absorbs CUDA allocation peaks and the VRAM fragmentation observed on long runs.
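
The totals in the matrix follow directly from the per-model figures. As a minimal sketch (the constant and function names are illustrative, the numbers are the ones from the tables above), the arithmetic can be reproduced as follows:

```python
GPU_TOTAL_GB = 12.0

# Per-model figures from the resident-models table.
RESIDENT_GB = {"InfiGUI-G1-3B": 2.4, "CLIP ViT-B/32": 0.7, "EasyOCR fr+en": 0.5}
# Per-mode Ollama VLM figures (approximate, as stated in the document).
OLLAMA_GB = {"RECORD": 3.0, "REPLAY": 5.5}


def budget(mode):
    """Return (total_gb, headroom_gb) for 'IDLE', 'RECORD' or 'REPLAY'."""
    if mode == "IDLE":
        # Idle (lazy): only InfiGUI is loaded, no Ollama VLM.
        total = RESIDENT_GB["InfiGUI-G1-3B"]
    else:
        total = sum(RESIDENT_GB.values()) + OLLAMA_GB[mode]
    return round(total, 1), round(GPU_TOTAL_GB - total, 1)


for mode in ("IDLE", "RECORD", "REPLAY"):
    total, headroom = budget(mode)
    print(f"{mode:<7} total={total:.1f} GB  headroom={headroom:.1f} GB")
```

Keeping the figures in one place like this makes it easy to re-check the headroom whenever a model or quantization changes.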

## systemd configuration

### 1. Grounding service (already created)

```
deploy/systemd/rpa-grounding.service
```

It creates `RuntimeDirectory=rpa` (= `/run/rpa/`), which hosts:
- `grounding.sock`: communication socket
- `infigui_screen.png`, `infigui_anchor.png`: inference images (passed by file, not over the socket)

### 2. Consumer services

Patched to point at the socket and share `/run/rpa/`:

```ini
Environment="RPA_GROUNDING_SOCKET=/run/rpa/grounding.sock"
Environment="RPA_GROUNDING_IMG_DIR=/run/rpa"
RuntimeDirectory=rpa
RuntimeDirectoryMode=0755
RuntimeDirectoryPreserve=yes
```

Patched units: `rpa-vision-v3-api.service`, `rpa-vision-v3-worker.service`, `rpa-streaming.service`.

### 3. Ollama drop-in

```
deploy/systemd/ollama-vram-policy.conf
```

Installation:
```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d/
sudo cp deploy/systemd/ollama-vram-policy.conf /etc/systemd/system/ollama.service.d/vram-policy.conf
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

## Full activation

```bash
# 1. Grounding service
sudo cp deploy/systemd/rpa-grounding.service /etc/systemd/system/
# 2. Ollama drop-in
sudo mkdir -p /etc/systemd/system/ollama.service.d/
sudo cp deploy/systemd/ollama-vram-policy.conf /etc/systemd/system/ollama.service.d/vram-policy.conf
# 3. Install the patched units (api, worker, streaming)
sudo cp deploy/systemd/rpa-vision-v3-api.service /etc/systemd/system/
sudo cp deploy/systemd/rpa-vision-v3-worker.service /etc/systemd/system/
sudo cp deploy/systemd/rpa-streaming.service /etc/systemd/system/

sudo systemctl daemon-reload
sudo systemctl enable --now rpa-grounding
sudo systemctl restart ollama
sudo systemctl restart rpa-vision-v3-api rpa-vision-v3-worker rpa-streaming
```

## Verification

```bash
# 1. Grounding service active and socket in place
systemctl status rpa-grounding
ls -la /run/rpa/grounding.sock

# 2. Actual VRAM after warm-up
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader

# 3. Test a ground call through the socket
python -c "
from core.grounding.ui_tars_grounder import UITarsGrounder
from PIL import Image
g = UITarsGrounder.get_instance()
# Should print [InfiGUI/server/...] (and not [InfiGUI/subprocess/...])
print(g.ground(target_text='Save', screen_pil=Image.new('RGB', (800, 600))))
"
```

If you see `[InfiGUI/subprocess/...]` in the logs, the client has dropped to the fallback path; check `journalctl -u rpa-grounding`.
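
To compare the `nvidia-smi` reading from step 2 against the budget matrix, a small parser can help. This is a sketch under assumptions: the helper names and the 0.5 GB tolerance are illustrative, the input is the `"<used> MiB, <free> MiB"` line that `--format=csv,noheader` prints for a single GPU, and the document's "GB" figures are treated as GiB.

```python
def parse_nvidia_smi(line):
    """Parse a line such as '9216 MiB, 3072 MiB' into (used_gib, free_gib)."""
    used, free = (float(field.strip().split()[0]) / 1024 for field in line.split(","))
    return used, free


def within_budget(line, expected_gb, tolerance_gb=0.5):
    """True if measured usage does not exceed the expected total plus a tolerance.

    tolerance_gb is an illustrative default, not a value from the project.
    """
    used, _free = parse_nvidia_smi(line)
    return used <= expected_gb + tolerance_gb


# Example with a made-up reading, checked against the REPLAY total (9.1 GB).
sample = "9216 MiB, 3072 MiB"
print(parse_nvidia_smi(sample), within_budget(sample, expected_gb=9.1))
```

On a multi-GPU host `nvidia-smi` prints one line per device, so only the line for the target GPU should be passed in.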

## Guardrails

- **No overlap**: InfiGUI runs in its own process, Ollama in its own. Neither can overwrite the other's memory.
- **No double VLM**: `OLLAMA_MAX_LOADED_MODELS=1` prevents qwen 3b from lingering in VRAM while qwen 7b is being requested.
- **No CUDA crash from parent inheritance**: the `rpa-grounding` unit sets `CUDA_VISIBLE_DEVICES=0` explicitly and starts in its own process tree (not inherited directly from systemd).
- **Subprocess fallback**: if `rpa-grounding` is not started or crashes, the client code falls back to the previous one-shot subprocess (see `ui_tars_grounder.py`). No functional regression is expected, only the ~15 s per-click load time noted above.
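
The fallback guardrail boils down to "use the server if the socket answers, otherwise use the subprocess". The sketch below illustrates that decision only; it is not the logic in `ui_tars_grounder.py`. It assumes the service listens on a stream-type Unix socket, and the function names are hypothetical.

```python
import os
import socket

DEFAULT_SOCKET = "/run/rpa/grounding.sock"


def grounding_server_available(sock_path=None, timeout=1.0):
    """Return True if something accepts connections on the grounding socket.

    Assumption: the service uses a SOCK_STREAM Unix socket.
    """
    sock_path = sock_path or os.environ.get("RPA_GROUNDING_SOCKET", DEFAULT_SOCKET)
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect(sock_path)
        except OSError:
            # Missing socket file, refused connection, or timeout.
            return False
    return True


def pick_backend(sock_path=None):
    """Return 'server' when the socket is reachable, else 'subprocess'."""
    return "server" if grounding_server_available(sock_path) else "subprocess"


if __name__ == "__main__":
    print(pick_backend())
```

The probe opens and immediately closes a connection without sending anything, so the server should tolerate empty connections if this check is used in practice.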