
Dom e0b47e4518 docs(refs): commit groupé docs de référence session 2026-05-08

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-09 11:32:52 +02:00


VLM implementation history: audit of 2026-05-08

Branch: feature/qw-suite-mai · HEAD: 731b5bcae · Scope: every VLM backend (Ollama, vLLM, Transformers, dedicated services), whether the code is active, archived, or gone from the history.

1. Currently active VLM implementations

1.1. Transformers in-process (Qwen2.5-VL family)

| File | Function(s) | Model / Backend | Notes |
|---|---|---|---|
| `core/grounding/server.py` | `load_model`, `ground` (Flask `/ground`) | `InfiX-ai/InfiGUI-G1-3B` 4-bit NF4 via `Qwen2_5_VLForConditionalGeneration` + `qwen_vl_utils.process_vision_info` | Single-threaded Flask server on port 8200; contains `_smart_resize` (factor 28, MIN_PIXELS=100·28², MAX_PIXELS=5600·28²). |
| `core/grounding/infigui_worker.py` | `load_model`, `infer`, `main` (one-shot stdin/stdout) | Same (`InfiX-ai/InfiGUI-G1-3B` 4-bit NF4, transformers + qwen_vl_utils) | One-shot subprocess mode: reads JSON on stdin, writes to stdout. No full `_smart_resize` (short formula at L99-L101, without min/max clamping). |
| `core/grounding/infigui_server.py` | `InfiGUIServer.start`, `_do_ground`, `_do_ping` | Reuses `infigui_worker.load_model` / `infer` | Unix socket daemon (`/run/rpa/grounding.sock`), length-prefixed JSON protocol. systemd service `rpa-grounding.service`. |
| `core/grounding/ui_tars_grounder.py` | `UITarsGrounder.ground`, `_send_socket_request`, subprocess fallback | Client: socket → subprocess fallback (`python -m core.grounding.infigui_worker`) | No longer loads anything in-process. Coordinates socket and subprocess. File updated 2026-05-05. |
| `core/grounding/think_arbiter.py` | `ThinkArbiter.arbitrate` | Delegates to `UITarsGrounder` | THINK layer of the FAST→SMART→THINK pipeline. |
| `core/detection/owl_detector.py` | `OwlDetector` | `Owlv2Processor` + `Owlv2ForObjectDetection` (Google OWL-v2) via transformers | Wired into `core/detection/ui_detector.py` (L31, L113, L126). Not a GUI-grounding VLM but an open-vocabulary detector. |
| `core/detection/seeclick_adapter.py` | `SeeClickAdapter._load_model`, `ground` | `cckevinn/SeeClick` (Qwen-VL) via `AutoModelForCausalLM` + `AutoTokenizer` | Still exported by `core/detection/__init__.py` but flagged "broken" by commit `d1b556b6c` (April 2026), which removed it from `intelligent_executor.py`. No other active call site. |
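
The `_smart_resize` helper listed for `core/grounding/server.py` is not reproduced in this audit. As a rough illustration only, the sketch below follows the resizing scheme publicly associated with Qwen2-VL style preprocessing, using the constants quoted in the table (factor 28, MIN_PIXELS=100·28², MAX_PIXELS=5600·28²). The names `smart_resize` and `rescale_bbox`, and the exact rounding behaviour, are assumptions rather than the repository's code. `rescale_bbox` further assumes the model reports absolute pixel coordinates in the resized frame.

```python
import math

FACTOR = 28
MIN_PIXELS = 100 * 28 * 28     # 78_400, value quoted in the table
MAX_PIXELS = 5600 * 28 * 28    # 4_390_400, value quoted in the table


def smart_resize(height: int, width: int, factor: int = FACTOR,
                 min_pixels: int = MIN_PIXELS,
                 max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (height, width) snapped to multiples of `factor`, with the
    total pixel count clamped to [min_pixels, max_pixels]."""
    # Step 1: snap each side to the nearest multiple of the patch factor.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Step 2a: too large, shrink both sides by the same ratio (round down).
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        # Step 2b: too small, grow both sides by the same ratio (round up).
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w


def rescale_bbox(bbox, resized_hw, original_hw):
    """Map (x1, y1, x2, y2) from the resized frame back to original pixels.

    Assumption: the model reports absolute coordinates in the resized frame.
    """
    rh, rw = resized_hw
    oh, ow = original_hw
    sx, sy = ow / rw, oh / rh
    x1, y1, x2, y2 = bbox
    return (round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy))


if __name__ == "__main__":
    resized = smart_resize(1080, 1920)
    print(resized)
    print(rescale_bbox((966, 546, 1000, 580), resized, (1080, 1920)))
```

A 1920×1080 screenshot stays inside both bounds, so only the snapping step applies. The clamp branches matter for very small crops and for 4K captures.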

1.2. HTTP OpenAI-compatible (vLLM)

| File | Function | Details |
|---|---|---|
| `agent_v0/server_v1/resolve_engine.py` (L785-L816) | `_resolve_by_grounding` | First attempt: vLLM at `http://localhost:${VLLM_PORT}/v1/chat/completions`, model `Qwen/Qwen2.5-VL-7B-Instruct-AWQ` (env `VLLM_PORT`, default 8100; env `VLLM_MODEL`). Format: OpenAI chat.completions POST with `image_url: data:image/jpeg;base64`. Falls back to Ollama on failure. |

Verbatim excerpt, L789-L816:

    # Port vLLM configurable via env
    _vllm_port = os.environ.get("VLLM_PORT", "8100")
    _vllm_model = os.environ.get("VLLM_MODEL", "Qwen/Qwen2.5-VL-7B-Instruct-AWQ")

    # Essai 1 : vLLM (API OpenAI-compatible, GPU)
    try:
        vllm_resp = _requests.post(
            f"http://localhost:{_vllm_port}/v1/chat/completions",
            json={
                "model": _vllm_model,
                "messages": [
                    {"role": "system", "content": "You locate UI elements on screenshots. Return coordinates."},
                    {"role": "user", "content": [
                        {"type": "text", "text": prompt},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{shot_b64}"}},
                    ]},
                ],
                "temperature": 0.1,
                "max_tokens": 80,
            },
            timeout=30,
        )
        if vllm_resp.ok:
            content = vllm_resp.json().get("choices", [{}])[0].get("message", {}).get("content", "")
            if content:
                logger.debug("Grounding via vLLM OK")
    except Exception as e:
        logger.debug("vLLM non disponible (%s), fallback Ollama", e)
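
To make the request shape easier to reuse or unit-test, the excerpt above can be split into a pure payload builder and a response parser. The following is a minimal sketch under that assumption: `build_vllm_chat_request` and `extract_content` are hypothetical names, not functions from `resolve_engine.py`. One detail worth noting is that `.get("choices", [{}])[0]` in the excerpt raises `IndexError` when the server returns an empty `choices` list, which the broad `except` then logs as vLLM being unavailable. The parser below returns an empty string in that case instead.

```python
import base64
import os


def build_vllm_chat_request(prompt: str, jpeg_bytes: bytes) -> tuple[str, dict]:
    """Return (url, json_payload) mirroring the request made in the excerpt."""
    port = os.environ.get("VLLM_PORT", "8100")
    model = os.environ.get("VLLM_MODEL", "Qwen/Qwen2.5-VL-7B-Instruct-AWQ")
    shot_b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    payload = {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You locate UI elements on screenshots. Return coordinates."},
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{shot_b64}"}},
            ]},
        ],
        "temperature": 0.1,
        "max_tokens": 80,
    }
    return f"http://localhost:{port}/v1/chat/completions", payload


def extract_content(response_json: dict) -> str:
    """Pull the assistant text out of a chat.completions response body.

    Returns "" for a missing or empty `choices` list instead of raising.
    """
    choices = response_json.get("choices") or [{}]
    return choices[0].get("message", {}).get("content", "") or ""
```

The caller would then do something like `requests.post(url, json=payload, timeout=30)` and pass `resp.json()` to `extract_content`, keeping the network call as the only untested part.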

1.3. HTTP Ollama (the dominant path in production today)

| File | Function | Model |
|---|---|---|
| `agent_v0/server_v1/resolve_engine.py` (L818-L832) | `_resolve_by_grounding` (fallback from vLLM) | `qwen2.5vl:7b` via Ollama `/api/chat`. |
| `agent_v0/server_v1/resolve_engine.py` (L2536-L2585) | `_locate_popup_button` | `qwen2.5vl:7b` via `/api/chat`. |
| `core/detection/ollama_client.py` | `OllamaClient` | `qwen3-vl:8b`, `gemma4:e4b`, etc.; used by `core/detection/ui_detector.py`, `core/detection/som_engine.py`, `core/cognition/vram_orchestrator.py`. |
| `core/detection/vlm_config.py` | `FALLBACK_VLM_MODELS` | `["qwen3-vl:8b", "0000/ui-tars-1.5-7b-q8_0:7b"]` |
| `visual_workflow_builder/backend/vlm_provider.py` | `VLMProvider.detect_ui_element` | Ollama hub first, plus opt-in cloud providers (OpenAI/Gemini/Anthropic) when `VLM_ALLOW_CLOUD=true`. |
| `visual_workflow_builder/backend/api_v3/capture.py` (L245) | anchor description | `qwen2.5vl:3b` |
| `visual_workflow_builder/backend/api_v3/dag_execute.py` (L468) | LLMActionHandler | `qwen3-vl:8b` |
| `visual_workflow_builder/backend/catalog_routes_v2_vlm.py` | catalog visual detection | `qwen2.5vl` |
| `core/llm/ocr_extractor.py`, `core/llm/t2a_decision.py` | text-only LLM | Ollama (non-vision models). |
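
The vLLM-then-Ollama order used by `_resolve_by_grounding` can be expressed as a small backend cascade. The sketch below is illustrative: `first_non_empty` and `build_ollama_chat_payload` are hypothetical helpers. The payload shape (a per-message `images` list of base64 strings, with `stream` set to false) follows Ollama's documented `/api/chat` format, but the exact fields sent by `resolve_engine.py` are not shown in this audit.

```python
from typing import Callable, Iterable, Optional


def build_ollama_chat_payload(prompt: str, image_b64: str,
                              model: str = "qwen2.5vl:7b") -> dict:
    """Body for POST /api/chat with one user message carrying one image."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [image_b64]}],
        "stream": False,  # ask for a single JSON object instead of a stream
    }


def first_non_empty(
    backends: Iterable[tuple[str, Callable[[], str]]],
) -> Optional[tuple[str, str]]:
    """Try each (name, call) in order and return the first non-empty answer.

    A backend that raises or returns "" is skipped, which mirrors the
    "vLLM first, Ollama as fallback" behaviour described above.
    """
    for name, call in backends:
        try:
            content = call()
        except Exception:
            continue  # backend down or misbehaving: move on to the next one
        if content:
            return name, content
    return None
```

In use, each callable would wrap one HTTP request, for example `[("vllm", call_vllm), ("ollama", call_ollama)]`, so the ordering lives in a single list.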

2. VLM implementations archived in the filesystem

| Path | Size | Mtime | Backend identified |
|---|---|---|---|
| `_archive/dead_code_20260424/...` (9 files, ~6300 lines) | various | 2026-04-24 | No VLM file: these are workflow/visual modules unrelated to any VLM backend (e.g. `visual_persistence_manager.py`, `workflow_simulation_report.py`). Search `vllm |
| `archive/business_docs/`, `archive/historical_recall/` | 3 .md files | 2026-01 / 2026-05-04 | No code (business / memory Markdown). |

No *_old.py, *_v1.py, *_backup.py, *.py.bak file or tests_disabled/ directory was detected, with one exception: the only existing *_v1.py file is agent_v0/run_agent_v1.py (not VLM-related).
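
The check above can be repeated with a short script. This is a sketch that assumes a plain recursive scan of the working tree is sufficient. `find_leftovers` is a hypothetical helper, and the pattern list simply restates the names given in the paragraph.

```python
from pathlib import Path

# File-name patterns taken from the paragraph above.
BACKUP_PATTERNS = ("*_old.py", "*_v1.py", "*_backup.py", "*.py.bak")


def find_leftovers(root) -> list[Path]:
    """Return backup-style files and any tests_disabled/ directory under root."""
    root = Path(root)
    hits: set[Path] = set()
    for pattern in BACKUP_PATTERNS:
        hits.update(p for p in root.rglob(pattern) if p.is_file())
    # Disabled test suites are reported as directories, not file by file.
    hits.update(p for p in root.rglob("tests_disabled") if p.is_dir())
    return sorted(hits)


if __name__ == "__main__":
    for path in find_leftovers("."):
        print(path)
```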

3. Historical commits mentioning VLM/vLLM/Transformers/grounding

Reverse chronological list (≤ 25 relevant commits). Short SHA · date · message (verbatim) · VLM files touched (`--stat` summary).

| SHA | Date | Message | Key VLM files |
|---|---|---|---|
| `487bcb861` | 2026-04-26 | feat(execution): cascade post-raccourci pilotée par DialogHandler/OCR | `core/grounding/{dialog_handler,infigui_worker,think_arbiter,ui_tars_grounder}.py` |
| `3d6868f02` | 2026-04-26 | docs: cartographie + worker InfiGUI fichiers | `core/grounding/{server,ui_tars_grounder,infigui_worker,dialog_handler}.py` (worker created; server.py reworked from 494 to 124 lines) |
| `343d6fbe9` | 2026-04-26 | perf(ocr): EasyOCR remplace docTR | `core/grounding/{fast_detector,title_verifier}.py` |
| `cc6443973` | 2026-04-26 | feat(grounding): vérification titre OCR post-action | `core/grounding/title_verifier.py` (+158) |
| `90007cc7c` | 2026-04-26 | perf(grounding): réflexe pHash-only + max_new_tokens 64 | `core/grounding/server.py` |
| `77faa03ec` | 2026-04-26 | feat(grounding): InfiGUI-G1-3B remplace UI-TARS 7B | `core/grounding/server.py` (-75/+67) |
| `73cea2385` | 2026-04-25 | feat(grounding): Phase 6 Shadow Learning Hook | `core/grounding/shadow_learning_hook.py` (+156) |
| `e2046837c` | 2026-04-25 | feat(grounding): Phase 5 intégration FAST→SMART→THINK dans ORA | `core/execution/observe_reason_act.py` |
| `b30d4b665` | 2026-04-25 | feat(grounding): Phase 4 pipeline orchestré FAST→SMART→THINK | `core/grounding/fast_pipeline.py` |
| `e4a48e78b` | 2026-04-25 | feat(grounding): Phase 3 ThinkArbiter + SignatureStore | `core/grounding/{think_arbiter,element_signature}.py` |
| `ea36bba5c` | 2026-04-25 | feat(grounding): Phase 1-2 FAST→SMART détection + matching | `core/grounding/{fast_detector,smart_matcher,fast_types}.py` |
| `9da589c8c` | 2026-04-25 | feat(grounding): pipeline centralisé + serveur UI-TARS transformers | Creates `core/grounding/{server,pipeline,template_matcher,ui_tars_grounder,target,__init__}.py` + `tools/start_grounding_server.sh`. server.py is FastAPI on port 8200, model `ByteDance-Seed/UI-TARS-1.5-7B` 4-bit NF4. |
| `73ddcdb29` | 2026-04-21 | feat: chaîne de grounding 3 niveaux + refonte capture | `core/execution/input_handler.py`, `visual_workflow_builder/.../execute.py` |
| `d1b556b6c` | 2026-04-21 | fix(grounding): supprimer SeeClick cassé | `intelligent_executor.py` (-46) |
| `4509038bf` | 2026-04-09 | refactor: éclater api_stream.py 6400→3350 | moves the vLLM/Ollama code to `agent_v0/server_v1/resolve_engine.py` |
| `91614fbff` | 2026-04-04 | fix: prompt natif bbox_2d Qwen2.5-VL | `agent_v0/server_v1/api_stream.py` |
| `c1ce6a396` | (April) | fix: séparer grounding (qwen2.5vl) et compréhension (gemma4) | api_stream.py |
| `394342be7` | 2026-03-31 | feat: support vLLM (GPU) comme moteur de grounding, Ollama en fallback | `agent_v0/server_v1/api_stream.py` (+47/-14); this is the only commit that adds vLLM. |
| `d99b17394` | 2026-03-31 | feat: VLM grounding direct (Qwen2.5-VL) — nouvelle stratégie de résolution | `agent_v0/server_v1/api_stream.py` (+230) |
| `cbe8dc95d` | (March) | feat(cognition): timing + auto-apprentissage Shadow + VLM qwen2.5vl | - |
| `ad15237fe` | (March) | feat: smart systray Léa + support qwen3-vl | - |
| `38966de0d` | (earlier) | Feat: Action analyser_avec_ia (Ollama qwen2.5-vl) | - |
| `728fac3b5` | (earlier) | Feat: Actions validation avec OCR Ollama (qwen2.5-vl:7b) | - |
| `21bfa3b33` | 2026-01-24 | feat(vwb): SeeClick + Self-Healing | `core/detection/seeclick_adapter.py` (+) |

`git reflog | head -100` shows no trace of a destructive operation (no `reset --hard`, no clobbering checkout) that could have lost a VLM-related commit. All operations are clean commits.

4. Code in stashes or unmerged branches

`git stash list`: no stash.

Existing branches:

- main
- master (gitea remote only)
- feature/qw-suite-mai (current HEAD)
- feature/feedback-bus
- backup/pre-qw-suite-mai-2026-05-05
- demo/ght-2026-05-08
- dev/ia-tools-improvement

No divergent branch shows up in `git log --all -S "vllm"` beyond the two commits already listed (both reachable from feature/qw-suite-mai). The same holds for Qwen2_5_VL, smart_resize, qwen_vl_utils and BitsAndBytesConfig: every commit that introduces or modifies them is reachable from HEAD.

→ No unique VLM code has been lost in a divergent branch or a stash.
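
The pickaxe searches mentioned above can be scripted so that the same check runs for every identifier. This is a minimal sketch that assumes `git` is on the PATH and the script runs inside the repository. `pickaxe_command`, `parse_shas` and `commits_touching` are hypothetical helper names.

```python
import subprocess

TERMS = ("vllm", "Qwen2_5_VL", "smart_resize", "qwen_vl_utils", "BitsAndBytesConfig")


def pickaxe_command(term: str) -> list[str]:
    """git log over all refs, keeping commits that add or remove `term`."""
    return ["git", "log", "--all", "--oneline", f"-S{term}"]


def parse_shas(oneline_output: str) -> list[str]:
    """Extract the abbreviated SHA from each `--oneline` line."""
    return [line.split(" ", 1)[0] for line in oneline_output.splitlines() if line.strip()]


def commits_touching(term: str, repo: str = ".") -> list[str]:
    """Run the pickaxe search in `repo` and return the matching short SHAs."""
    result = subprocess.run(pickaxe_command(term), cwd=repo,
                            capture_output=True, text=True, check=True)
    return parse_shas(result.stdout)


if __name__ == "__main__":
    for term in TERMS:
        print(term, commits_touching(term))
```

Each SHA returned can then be checked against HEAD with `git merge-base --is-ancestor <sha> HEAD` to confirm it is reachable.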

5. Potentially lost code (commits that delete VLM code)

| SHA | Date | Action | Summary |
|---|---|---|---|
| `d1b556b6c` | 2026-04-21 | removal | Removes SeeClick from `intelligent_executor.py` (-46). The file `core/detection/seeclick_adapter.py` was never deleted: it still lives in `core/detection/` (11,421 bytes, mtime 2026-01-24) and is still exported by `core/detection/__init__.py`. → Code is still available but flagged "broken" (incompatible QWenConfig). |
| `3d6868f02` | 2026-04-26 | refactor | Shrinks `core/grounding/server.py` from 494 to 124 lines by moving the inference logic into `infigui_worker.py`. The full transformers logic is preserved in `infigui_worker.py` (and reused by `infigui_server.py`). → No loss. |
| `77faa03ec` | 2026-04-26 | model replacement | UI-TARS-1.5-7B replaced by InfiGUI-G1-3B in `core/grounding/server.py`. The official UI-TARS prompt (`Thought:/Action: click(start_box='(x1, y1)')`) and the `_evict_ollama_models()` function are gone but remain recoverable via `git show 9da589c8c:core/grounding/server.py`. |
| `9da589c8c` | 2026-04-25 | cleanup | "9 fichiers morts archivés dans `_archive/` (~6300 lignes)" (9 dead files archived, about 6300 lines). Verified: no VLM file in `_archive/dead_code_20260424/`. These 9 files are visual/workflow code, not grounding. → No VLM loss. |
| (others) | | | No other commit deletes usable VLM code. |

→ No useful VLM code is irretrievably lost: everything is recoverable via `git show`. The only item worth flagging is the official UI-TARS prompt found in `9da589c8c:core/grounding/server.py`, which is useful if a UI-TARS model is ever reloaded for comparison.

6. Factual summary

Number of distinct implementations that have existed:
- 7 active implementations today (see §1.1, §1.2 and §1.3; distinct models).
- 2 major historical implementations that were replaced in place: UI-TARS-1.5-7B (transformers) → InfiGUI-G1-3B; intermediate SoM+VLM → direct Qwen2.5-VL grounding.
Backends tried over time: Ollama (HTTP), vLLM (OpenAI-compatible HTTP), Transformers in-process (Flask server.py, one-shot subprocess infigui_worker.py, Unix socket daemon infigui_server.py), HuggingFace direct (standalone SeeClick, standalone OWL-v2), opt-in cloud (OpenAI/Gemini/Anthropic via vlm_provider.py).
Code directly usable for the migration to vLLM or Transformers:
- Yes for Transformers: core/grounding/server.py (loader + full _smart_resize with MIN/MAX_PIXELS) and core/grounding/infigui_worker.py (load_model, infer in classic mode plus image+anchor fusion) are close to turnkey for Qwen2.5-VL / Qwen3-VL. Only MODEL_ID needs to change (env GROUNDING_MODEL is already supported).
- Yes for vLLM: agent_v0/server_v1/resolve_engine.py lines 785-816 already contain the OpenAI-compatible HTTP call with image_url: data:image/jpeg;base64. The only missing piece is passing resized_width/resized_height explicitly (a vLLM extension to the OpenAI API), which is the bbox_2d scaling bug documented in docs/MIGRATION_VLM_PLAN_2026-05-09.md.
- The persistent socket + subprocess fallback infrastructure (infigui_server.py + ui_tars_grounder.py) can be reused as-is to serve another Transformers model or to wrap a vLLM client.
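
For readers who want to reuse the socket infrastructure, the framing idea behind a length-prefixed JSON protocol can be sketched as follows. The audit does not specify the header layout used by `infigui_server.py`, so the 4-byte big-endian length prefix is an assumption. The function names and the `op` field in the usage example are likewise hypothetical.

```python
import json
import socket
import struct

# Assumption: each frame is a 4-byte big-endian length followed by UTF-8 JSON.
HEADER = struct.Struct(">I")


def send_message(sock: socket.socket, obj: dict) -> None:
    """Serialize obj to JSON and send it as one length-prefixed frame."""
    payload = json.dumps(obj).encode("utf-8")
    sock.sendall(HEADER.pack(len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv may return partial data."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_message(sock: socket.socket) -> dict:
    """Read one length-prefixed frame and decode its JSON body."""
    (length,) = HEADER.unpack(_recv_exact(sock, HEADER.size))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))


if __name__ == "__main__":
    client, server = socket.socketpair()
    send_message(client, {"op": "ping"})  # "op" is a hypothetical field name
    print(recv_message(server))
    client.close()
    server.close()
```

The length prefix lets a persistent connection carry many requests without relying on connection close or a delimiter to mark message boundaries. That property is what makes a long-lived daemon preferable to the one-shot subprocess for repeated grounding calls.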

7. To clarify with Dom

- core/detection/seeclick_adapter.py is still exported by core/detection/__init__.py, but commit d1b556b6c says it is broken. Should it be removed from the import and archived, or should we try to repair it for Qwen3-VL?
- core/detection/owl_detector.py (Owlv2) is wired in via core/detection/ui_detector.py (L31, L113, L126), but there is no trace of a recent benchmark. Is it still called in production, or is it a candidate for archiving?
- tools/start_grounding_server.sh still mentions UI-TARS-1.5-7B in its banner, although the server has loaded InfiGUI since commit 77faa03ec. The text is outdated but has no runtime impact; fix it if the migration gets documented.
- core/grounding/server.py (Flask, port 8200) vs core/grounding/infigui_server.py (Unix socket) vs core/grounding/infigui_worker.py (one-shot subprocess): three distinct entry points for the same transformers logic. The systemd service rpa-grounding.service only launches infigui_server. Confirm that server.py (Flask) is kept deliberately as a dev / test alternative.
- The default vLLM model is hardcoded to Qwen/Qwen2.5-VL-7B-Instruct-AWQ (resolve_engine L791), whereas the migration plan targets Qwen3-VL. The VLLM_MODEL env var allows switching without touching the code; confirm that this is the intended migration method.
