# Track A1: State of the Art in VLM UI Grounding (2025-2026)

**Date:** 2026-05-23
**Author:** Dispatched research agent (Claude Opus 4.7 1M)
**Scope:** VLM models for grounding graphical UI elements, focus on 2025-2026, candidates deployable on an RTX 5070 with 12 GB VRAM, healthtech (permissive license).
**Internal master sources:** [`SYNTHESE_TECHNOS_REPLAY_2026-05-23.md`](../SYNTHESE_TECHNOS_REPLAY_2026-05-23.md), [`MIGRATION_VLM_PLAN_2026-05-09.md`](../MIGRATION_VLM_PLAN_2026-05-09.md), [`HISTORIQUE_VLM_IMPLEMENTATIONS_2026-05-08.md`](../HISTORIQUE_VLM_IMPLEMENTATIONS_2026-05-08.md)

> Desk research only, no runtime tests. Every figure comes from a linked paper, HF model card, or leaderboard. ScreenSpot-Pro scores sometimes vary by ±3 points across sources (paper vs. third-party leaderboard vs. user reproduction). We consistently report the figure declared by the authors or by the official HF model card.

---

## 1. TL;DR

1. **Open-source 7B SOTA on ScreenSpot-Pro more than doubled in under a year**: 18.9% (OS-Atlas-7B, Oct 2024) → 61.6% declared (UI-TARS-1.5-7B, Apr 2025, reproduced by users at ~40-48%) and 51.9% (InfiGUI-G1-7B, Aug 2025, AAAI 2026 Oral). Closed models (GPT-5.2 at 86%) keep widening the gap but cannot be used on-premise.
2. **Our current InfiGUI-G1-3B (45.2% SSPro / 91.1% SSv2) remains competitive** at ~3 GB VRAM in 4-bit. The performance-to-VRAM ratio is excellent. Migrating to the 7B (51.9% SSPro) is feasible without changing architecture (same `Qwen2_5_VLForConditionalGeneration`).
3. **Qwen3-VL-8B-Instruct (Oct 2025, Apache 2.0) does NOT fix the bbox scaling bug on its own**: it uses the same post-resize convention as Qwen2.5-VL. The fix lives in the **backend** (in-process vLLM/Transformers exposes `resized_height/resized_width`), not in the model.
4. **The coordinate-free approach is gaining ground** (GUI-Actor, MolmoPoint-GUI): the target is no longer JSON text but a visual patch token or grounding tokens. This structurally eliminates the scaling bug, but it requires a Transformers fork or a custom head.

**Top 3 recommendation for our case (12 GB VRAM, healthtech, commercial-friendly license):**

| # | Model | Why |
|---|---|---|
| 1 | **`InfiX-ai/InfiGUI-G1-7B`** ([HF](https://huggingface.co/InfiX-ai/InfiGUI-G1-7B), Apache 2.0) | Full continuity with our `core/grounding/` stack, +6.7 pts SSPro vs. G1-3B, fits in 4-bit NF4 (~6 GB), same post-resize point format as G1-3B, so the scaling bug is already handled by `_smart_resize` |
| 2 | **`Hcompany/Holo1.5-7B`** ([HF](https://huggingface.co/Hcompany/Holo1.5-7B), Apache 2.0) | Qwen2.5-VL-7B-Instruct base, **57.9% SSPro / 93.3% SSv2**, native 3840×2160 (useful for 2560×1600 Easily windows), GRPO-trained on real UIs |
| 3 | **`ByteDance-Seed/UI-TARS-1.5-7B`** ([HF](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B), Apache 2.0) | **61.6% SSPro / 94.2% SSv2** declared (but user reproduction at ~40-48% per [issue #215](https://github.com/bytedance/UI-TARS/issues/215)); fallback if InfiGUI disappoints in real use |

---

## 2. Full comparison table

Legend:
- VRAM: approximation for single-batch inference, BF16 baseline without optimization. The 4-bit figures in parentheses assume NF4 quantization (bitsandbytes-style) and land at roughly 1/2 to 1/3 of the BF16 figure.
- SS = ScreenSpot v1 (1200+ instructions, multi-OS); SSv2 = re-annotated by OS-Atlas (11% corrections); SSPro = professional high-resolution (1581 instructions, 23 apps, 5 industries, 3 OSes, ICLR 2025 paper).
- Coord. conv.: `0-1000` = normalized to 0-1000 regardless of image size (native Qwen2-VL). `post-resize` = bbox in the resolution **after smart_resize** on the model side (Qwen2.5-VL). `point-token` = grounding via attention over visual tokens, no coordinate text. `abs` = pixels in the original image.
- "not found" = no figure published in the sources consulted.
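
The coordinate conventions in the legend can be made concrete with a small conversion helper. This is a minimal sketch, not code from `core/grounding/`: the function name `to_original_pixels` and its signature are hypothetical, and the `0-1` branch covers the normalized outputs reported for GUI-Actor and ShowUI.

```python
def to_original_pixels(x, y, convention, original_size, resized_size=None):
    """Map a model-space point to pixels in the original screenshot.

    Hypothetical helper: `convention` is one of the labels used in the table.
    """
    orig_w, orig_h = original_size
    if convention == "abs":
        # Already expressed in original-image pixels.
        return int(x), int(y)
    if convention == "0-1000":
        return int(x / 1000 * orig_w), int(y / 1000 * orig_h)
    if convention == "0-1":
        return int(x * orig_w), int(y * orig_h)
    if convention == "post-resize":
        # Needs the size the model actually saw after smart_resize.
        if resized_size is None:
            raise ValueError("post-resize needs the resized (width, height)")
        res_w, res_h = resized_size
        return int(x / res_w * orig_w), int(y / res_h * orig_h)
    raise ValueError(f"unknown convention: {convention}")


print(to_original_pixels(500, 500, "0-1000", (2560, 1600)))  # (1280, 800)
```

Only the `post-resize` branch depends on information that the backend must expose, which is why that convention is the one tied to the scaling bug.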

| Model | Params | VRAM BF16 | SS | SSv2 | SSPro | Output | Coord. conv. | vLLM | Transformers | License | Release | HF |
|---|---:|---:|---:|---:|---:|---|---|:-:|:-:|---|---|---|
| **InfiGUI-G1-3B** ⭐ *(current)* | 3B | ~6 GB (~3 GB 4-bit) | 90.3 | 91.1 | 45.2 | point JSON | post-resize | ✅ | ✅ | Apache 2.0 | 2025-08-11 | [InfiX-ai/InfiGUI-G1-3B](https://huggingface.co/InfiX-ai/InfiGUI-G1-3B) |
| **InfiGUI-G1-7B** | 7B | ~14 GB (~6 GB 4-bit) | not found | 93.5 | 51.9 | point JSON | post-resize | ✅ | ✅ | Apache 2.0 | 2025-08-11 | [InfiX-ai/InfiGUI-G1-7B](https://huggingface.co/InfiX-ai/InfiGUI-G1-7B) |
| **InfiGUI-R1-3B** *(predecessor)* | 3B | ~6 GB | 87.5 | not found | 35.7 | point JSON | post-resize | ✅ | ✅ | Apache 2.0 | 2025-04-20 | [InfiX-ai/InfiGUI-R1-3B](https://huggingface.co/InfiX-ai/InfiGUI-R1-3B) |
| **UI-TARS-1.5-7B** | 7B | ~14 GB | not found | 94.2 | 61.6 *(~40-48 reproduced)* | action DSL `click(x,y)` | abs px | ✅ | ✅ | Apache 2.0 | 2025-04-16 | [ByteDance-Seed/UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) |
| **Qwen3-VL-8B-Instruct** | 9B | ~18 GB (~6 GB 4-bit) | ~94 (declared) | not found | 54.6 *(llm-stats leaderboard)* / 61.8 *(paper)* | bbox_2d or point JSON | post-resize (multiples of 32) | ✅ (vllm≥0.11) | ✅ | Apache 2.0 | 2025-10-15 | [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) |
| **Qwen3-VL-4B-Instruct** | 4B | ~8 GB | not found | not found | 59.5 *(leaderboard)* | bbox_2d or point JSON | post-resize | ✅ | ✅ | Apache 2.0 | 2025-10-15 | [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |
| **Qwen2.5-VL-7B-Instruct** *(legacy)* | 7B | ~14 GB | 88.8 | 88.8 | 26.8 | bbox_2d JSON | post-resize (multiples of 28) | ✅ | ✅ | Apache 2.0 | 2025-01 | [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) |
| **Holo1.5-7B** ⭐ | 7B | ~14 GB | not found | 93.31 | 57.94 | undocumented (likely point) | undocumented | ✅ | ✅ | Apache 2.0 | 2025-09 | [Hcompany/Holo1.5-7B](https://huggingface.co/Hcompany/Holo1.5-7B) |
| **Holo1.5-3B** | 3B | ~6 GB | not found | not found | not found | same | same | ✅ | ✅ | Apache 2.0 | 2025-09 | [Hcompany/Holo1.5-3B](https://huggingface.co/Hcompany/Holo1.5-3B) |
| **Holo1-7B** *(v1)* | 7B | ~14 GB | not found (avg UI 76.2) | not found | not found | undocumented | undocumented | ✅ | ✅ | Apache 2.0 | 2025-06 | [Hcompany/Holo1-7B](https://huggingface.co/Hcompany/Holo1-7B) |
| **OS-Atlas-Base-7B** | 8B | ~16 GB | 82.5 *(paper)* | 85.1 *(InfiGUI eval)* | 18.9 | bbox + point JSON | 0-1000 normalized | ✅ | ✅ | Apache 2.0 | 2024-10-30 | [OS-Copilot/OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B) |
| **OS-Atlas-Base-4B** | 4B | ~8 GB | not found | not found | not found | bbox + point JSON | 0-1000 normalized | ✅ | ✅ | Apache 2.0 | 2024-10-30 | [OS-Copilot/OS-Atlas-Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B) |
| **UGround-V1-7B** ⭐ | 7B | ~14 GB | 86.3 | not found | not found (likely ~36, paper) | point `(x,y)` | 0-1000 normalized | ✅ | ✅ | Apache 2.0 | 2024-10-07 / revised 2025-01 | [osunlp/UGround-V1-7B](https://huggingface.co/osunlp/UGround-V1-7B) |
| **UGround-V1-2B** | 2B | ~4 GB | not found | not found | not found | point `(x,y)` | 0-1000 | ✅ | ✅ | Apache 2.0 | 2025-01 | [osunlp/UGround-V1-2B](https://huggingface.co/osunlp/UGround-V1-2B) |
| **UGround-V1-72B** | 72B | ~144 GB | not found | not found | 34.5 *(original SSPro paper)* | point `(x,y)` | 0-1000 | ✅ | ✅ | Apache 2.0 | 2025-01 | [osunlp/UGround-V1-72B](https://huggingface.co/osunlp/UGround-V1-72B) |
| **Magma-8B** | 9B | ~18 GB | mobile 59.5 / desktop 64.1 / web 60.6 | not found | not found | Set-of-Mark + bbox | undocumented | ⚠️ Transformers fork | ✅ (fork) | **MIT** | 2025-02-18 | [microsoft/Magma-8B](https://huggingface.co/microsoft/Magma-8B) |
| **GUI-Actor-7B-Qwen2.5-VL** | 8B | ~16 GB | not found | 92.1 | 44.6 | **special token attention head** → normalized `topk_points` | normalized 0-1 (no coordinate text) | ⚠️ not mentioned | ✅ (fork) | **MIT** | 2025-06-03 | [microsoft/GUI-Actor-7B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL) |
| **GUI-Actor-7B-Qwen2-VL** | 8B | ~16 GB | not found | 89.5 | 40.7 | same | same | ⚠️ | ✅ (fork) | MIT | 2025-06 | [microsoft/GUI-Actor-7B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL) |
| **MolmoPoint-GUI-8B** ⭐ | 9B | ~18 GB | not found | not found | **61.1** (open SOTA) | grounding tokens `[id,img,x,y]` | abs px | ❌ (custom logits processor) | ✅ | Apache 2.0 | 2026-03-18 | [allenai/MolmoPoint-GUI-8B](https://huggingface.co/allenai/MolmoPoint-GUI-8B) |
| **AGUVIS-7B-720P** | 8B | ~16 GB | 84.4 *(paper)* | not found | 22.9 | bbox + action plan | undocumented (likely post-resize Qwen2-VL) | ✅ | ✅ | not found (likely Apache via base) | 2024-12 | [xlangai/Aguvis-7B-720P](https://huggingface.co/xlangai/Aguvis-7B-720P) |
| **ShowUI-2B** | 2B | ~4 GB | 75.1 | not found | 7.7 | point + action dict | normalized 0-1 | ✅ | ✅ | **MIT** | 2024-11-26 | [showlab/ShowUI-2B](https://huggingface.co/showlab/ShowUI-2B) |
| **CogAgent-9B-20241220** | 14B (9B lang + 5B vision) | ~28 GB | cited as leader, exact score not published | not found | not found | `CLICK(box=[x1,y1,x2,y2])` action DSL | undocumented (likely abs on 1120×1120) | ⚠️ partial | ✅ | **Other** (custom, non-Apache) | 2024-12-20 | [zai-org/cogagent-9b-20241220](https://huggingface.co/zai-org/cogagent-9b-20241220) |
| **SeeClick** *(historical)* | 9.6B | ~19 GB | 53.4 | not found | <10 (ScreenSpot-Pro paper) | bbox via Qwen-VL | undocumented | ❌ | ✅ | Apache 2.0 | 2024-04 (ACL 2024) | [cckevinn/SeeClick](https://huggingface.co/cckevinn/SeeClick) |
| **GUI-G2-7B** | 7B | ~14 GB | SOTA declared | SOTA declared | SOTA declared (Gaussian reward GRPO) | undocumented | undocumented | ✅ | ✅ | not found | 2026-01 (AAAI 2026) | [inclusionAI/GUI-G2-7B](https://huggingface.co/inclusionAI/GUI-G2-7B) |
| **GPT-5.2** *(closed)* | n/a | n/a (cloud) | n/a | n/a | **86.3** | n/a | n/a | n/a | n/a | OpenAI proprietary | 2026 | n/a |
| **Gemini 3 Pro** *(closed)* | n/a | n/a (cloud) | n/a | n/a | 72.7 | n/a | n/a | n/a | n/a | Google proprietary | 2026 | n/a |

**Sources for the main scores:** [ScreenSpot-Pro leaderboard llm-stats](https://llm-stats.com/benchmarks/screenspot-pro), [ScreenSpot-Pro paper arXiv:2504.07981](https://arxiv.org/abs/2504.07981), [InfiGUI-G1 paper arXiv:2508.05731](https://arxiv.org/abs/2508.05731), HF model cards cited in the rightmost column.

---

## 3. Detailed model sheets

### 3.1. InfiGUI-G1-3B / 7B (our current model + direct upgrade)

- **HF repo:** [InfiX-ai/InfiGUI-G1-3B](https://huggingface.co/InfiX-ai/InfiGUI-G1-3B), [InfiX-ai/InfiGUI-G1-7B](https://huggingface.co/InfiX-ai/InfiGUI-G1-7B)
- **Paper:** [arXiv:2508.05731](https://arxiv.org/abs/2508.05731), *InfiGUI-G1: Advancing GUI Grounding with Adaptive Exploration Policy Optimization*, AAAI 2026 Oral
- **GitHub:** [InfiXAI/InfiGUI-G1](https://github.com/InfiXAI/InfiGUI-G1)
- **Release:** 2025-08-11
- **License:** Apache 2.0
- **Base:** Qwen2.5-VL-3B (resp. 7B), GRPO via AEPO (Adaptive Exploration Policy Optimization)
- **Bench:** ScreenSpot 90.3 (3B) / 92.5+ (7B), SSv2 91.1 / 93.5, **SSPro 45.2 (3B) / 51.9 (7B)**, MMBench-GUI L2 73.4 (3B) / 80.8 (7B)
- **Output:** point JSON `[{"point_2d": [x, y]}, …]`, **post-resize** coordinates (the prompt exposes `{new_width}x{new_height}` to the model; mapping back to the original image is done client-side)
- **Minimal grounding code:**
  ```python
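  import json
  # Typical output; coordinates are in the resized image space
  raw = '[{"point_2d": [421, 612], "label": "OK button"}]'
  coords = json.loads(raw)[0]["point_2d"]
  # Illustrative sizes: new_* as exposed in the prompt, original_* = screenshot
  new_width, new_height = 1288, 812
  original_width, original_height = 2560, 1600
  # Map back to original-image pixels
  original_x = int(coords[0] / new_width * original_width)
  original_y = int(coords[1] / new_height * original_height)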
  ```
- **Why it matters for us:** already wired in (`core/grounding/server.py` + `infigui_worker.py` + `infigui_server.py`), with `_smart_resize` calibrated for factor 28. Moving from 3B to 7B is a `MODEL_ID` change (env `GROUNDING_MODEL`). 4-bit VRAM ≈ 6 GB, which fits on the RTX 5070.

### 3.2. UI-TARS-1.5-7B (ByteDance)

- **HF repo:** [ByteDance-Seed/UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B)
- **Paper:** [arXiv:2501.12326](https://arxiv.org/abs/2501.12326), UI-TARS-2 technical report [arXiv:2509.02544](https://arxiv.org/html/2509.02544v1)
- **GitHub:** [bytedance/UI-TARS](https://github.com/bytedance/ui-tars), desktop [bytedance/UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop)
- **Release:** 2025-04-16
- **License:** Apache 2.0
- **Declared bench:** SSv2 94.2, **SSPro 61.6**. However, [issue #215](https://github.com/bytedance/UI-TARS/issues/215) reports reproduction at ~40-48% depending on the prompt
- **Output:** native action DSL `click(start_box='[x1,y1,x2,y2]')`, **absolute pixel** coordinates on the original image
- **Note:** UI-TARS-2 (Sept 2025) exists but was not open-source as of the sources consulted (technical report only). Stay on 1.5.
- **Risk:** gap between declared and reproduced scores. Test locally before migrating.
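
To evaluate UI-TARS locally, the DSL string has to be turned into a click point. The sketch below assumes the output matches the `click(start_box='[x1,y1,x2,y2]')` form quoted above with integer coordinates; the helper name `parse_click_center` is hypothetical, and real model output may use other actions or formatting that this regex does not cover.

```python
import re

# Matches only the click form quoted in this section, with integer coordinates.
_BOX_RE = re.compile(
    r"click\(start_box='\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]'\)"
)


def parse_click_center(action: str):
    """Return the (x, y) center of the box in a click action, or None."""
    match = _BOX_RE.search(action)
    if match is None:
        return None
    x1, y1, x2, y2 = map(int, match.groups())
    return (x1 + x2) // 2, (y1 + y2) // 2


print(parse_click_center("click(start_box='[100,200,140,220]')"))  # (120, 210)
```

Returning `None` for anything unparsed keeps prompt-format failures visible as a separate category from wrong coordinates, which matters when trying to reproduce the declared score.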

### 3.3. Qwen3-VL-8B-Instruct (target of the May 9 migration plan)

- **HF repo:** [Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct), quantized variants [cpatonn/Qwen3-VL-8B-Instruct-AWQ-8bit](https://huggingface.co/cpatonn/Qwen3-VL-8B-Instruct-AWQ-8bit), [cyankiwi/Qwen3-VL-8B-Instruct-AWQ-4bit](https://huggingface.co/cyankiwi/Qwen3-VL-8B-Instruct-AWQ-4bit)
- **GitHub:** [QwenLM/Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- **Release:** 2025-10-15
- **License:** Apache 2.0
- **Bench:** [llm-stats SSPro 54.6](https://llm-stats.com/benchmarks/screenspot-pro) (rank 16), but the codersera blog cites ~94% ScreenSpot and 61.8% SSPro for the computer-use variant of Qwen3-VL
- **Output:** flexible, supports `bbox_2d` AND `point` depending on the prompt. **Post-resize convention with multiples of 32** (≠ Qwen2.5-VL, which used multiples of 28; **DETTE-014 in the repo now aligns with this new factor of 32**)
- **Resize:** the model accepts `resized_width` and `resized_height` as direct parameters (cf. Qwen3-VL GitHub: "Directly set resized_height and resized_width. These values will be rounded to the nearest multiple of 32")
- **vLLM support:** `vllm>=0.11.0` required
- **Why stay cautious:** the llm-stats bench places Qwen3-VL-8B-Instruct at 54.6% SSPro, slightly above InfiGUI-G1-7B (51.9%). A gap of about 3 points is within the error margin between evaluation protocols. The 4B (59.5%, rank 12) is oddly better than the 8B (54.6%), which needs investigating.
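
The factor 28 vs. 32 difference changes the resized dimensions the model sees, and therefore the divisor used when mapping coordinates back. The following is a simplified sketch modeled on the publicly known Qwen-style smart resize logic, not the repo's `_smart_resize`; the function name and the `min_pixels` / `max_pixels` defaults are assumptions chosen for illustration.

```python
import math


def smart_resize_sketch(height, width, factor,
                        min_pixels=56 * 56, max_pixels=12_845_056):
    """Round both sides to multiples of `factor`, keeping area within bounds.

    Illustrative only: the pixel bounds are assumed defaults, not values
    taken from our configuration.
    """
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink while preserving the aspect ratio.
        beta = math.sqrt(height * width / max_pixels)
        h = math.floor(height / beta / factor) * factor
        w = math.floor(width / beta / factor) * factor
    elif h * w < min_pixels:
        # Too small: grow while preserving the aspect ratio.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w


# A 2560x1600 screenshot (height=1600, width=2560):
print(smart_resize_sketch(1600, 2560, factor=28))  # (1596, 2548)
print(smart_resize_sketch(1600, 2560, factor=32))  # (1600, 2560)
```

With these assumed bounds, factor 28 yields 2548×1596 while factor 32 leaves 2560×1600 untouched, so reusing a factor-28 mapping with a factor-32 model would introduce a small systematic offset.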

### 3.4. Holo1.5-7B (H Company)

- **HF repo:** [Hcompany/Holo1.5-7B](https://huggingface.co/Hcompany/Holo1.5-7B), variants [3B](https://huggingface.co/Hcompany/Holo1.5-3B) and [72B](https://huggingface.co/Hcompany/Holo1.5-72B)
- **Blog:** [HF blog Holo1](https://huggingface.co/blog/Hcompany/holo1), [GRPO for GUI Grounding](https://huggingface.co/blog/HelloKKMe/grounding-r1)
- **Paper (Holo1):** [arXiv:2506.02865](https://arxiv.org/pdf/2506.02865) *Surfer-H Meets Holo1*
- **Release:** v1 June 2025, v1.5 September 2025
- **License:** Apache 2.0
- **Base:** Qwen2.5-VL-7B-Instruct
- **Bench:** **SSv2 93.31%, SSPro 57.94%**, WebClick 90.24%, native 3840×2160
- **Output:** not documented directly in the HF model card (likely a Qwen2.5-VL-like point format)
- **Why it matters:** trained specifically across environments (web + desktop + mobile) with GRPO, with a very solid SSPro score for an Apache 2.0 model. Likely a drop-in swap in `core/grounding/server.py` (same `Qwen2_5_VLForConditionalGeneration` architecture).

### 3.5. UGround-V1 (OSU NLP, ICLR'25 Oral)

- **HF repo:** [osunlp/UGround-V1-7B](https://huggingface.co/osunlp/UGround-V1-7B), [2B](https://huggingface.co/osunlp/UGround-V1-2B), [72B](https://huggingface.co/osunlp/UGround-V1-72B)
- **Paper:** [arXiv:2410.05243](https://arxiv.org/abs/2410.05243), ICLR 2025 Oral
- **GitHub:** [OSU-NLP-Group/UGround](https://github.com/OSU-NLP-Group/UGround)
- **License:** Apache 2.0
- **Base:** Qwen2-VL-7B-Instruct
- **Bench:** ScreenSpot 86.3% average (text/icon across desktop/mobile/web: 76-93%); UGround-V1-72B is cited at 34.5% on SSPro (original SSPro paper)
- **Output:** single point `(x, y)` as a string, **normalized [0, 1000) convention** independent of the image (inherited from Qwen2-VL)
- **Advantage:** the 0-1000 convention means no post-resize scaling bug. The model has learned to reason in a canonical space.
- **Drawback:** no public SSPro evaluation for the 7B (only for the 72B). No bbox support (point only).

### 3.6. OS-Atlas (UCSD / Shanghai AI Lab)

- **HF repo:** [OS-Copilot/OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B), [Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B), [Pro-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-7B), [Pro-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-4B)
- **Paper:** [arXiv:2410.23218](https://arxiv.org/abs/2410.23218), NeurIPS 2024
- **GitHub:** [OS-Copilot/OS-Atlas](https://github.com/OS-Copilot/OS-Atlas)
- **Release:** 2024-10-30
- **License:** Apache 2.0
- **Base of Base-7B:** Qwen2-VL-7B-Instruct, dataset of 13M cross-platform GUI elements
- **Bench:** ScreenSpot 82.5% avg, SSv2 85.1%, **SSPro 18.9%** *(lower than every 2025 model)*
- **Output:** bbox + point JSON, **normalized 0-1000** (native Qwen2-VL)
- **Status:** historical reference point. Outperformed by every 2025 model on SSPro. Worth keeping only for SSv2 / low-res desktop cases ("simple tabs").

### 3.7. AGUVIS-7B (Salesforce + HKU)

- **HF repo:** [xlangai/Aguvis-7B-720P](https://huggingface.co/xlangai/Aguvis-7B-720P)
- **Paper:** see [OSU paper list](https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List/blob/main/paper_by_key/paper_visual_grounding.md)
- **Release:** 2024-12
- **License:** not explicit on the HF model card (likely Apache 2.0 via the base model)
- **Base:** Qwen2-VL
- **Bench:** ScreenSpot 84.4% (paper), **SSPro 22.9%**
- **Output:** bbox + action plan (two-stage training: grounding, then action)
- **Status:** historically interesting (Salesforce's pure-vision unified framework). Weak SSPro score compared with the 2025 cohort. Not a priority.

### 3.8. Magma-8B (Microsoft, CVPR 2025)

- **HF repo:** [microsoft/Magma-8B](https://huggingface.co/microsoft/Magma-8B)
- **Paper:** [arXiv:2502.13130](https://arxiv.org/abs/2502.13130), CVPR 2025
- **GitHub:** [microsoft/Magma](https://github.com/microsoft/Magma)
- **Release:** 2025-02-18
- **License:** **MIT** (very permissive)
- **Bench:** ScreenSpot mobile 59.5 / desktop 64.1 / web 60.6, **no published SSPro evaluation**
- **Output:** Set-of-Mark (numbered marks on the image) + Trace-of-Mark (video). Hybrid GUI + robotics
- **Major drawback:** requires a **custom Transformers fork** (`git+https://github.com/jwyang/transformers.git@dev/jwyang-v4.48.2`), no standard vLLM support
- **Relevant if:** there is cross-domain interest (GUI + robotics). For pure GUI work, other models do better.

### 3.9. GUI-Actor-7B-Qwen2.5-VL (Microsoft, NeurIPS 2025)

- **HF repo:** [microsoft/GUI-Actor-7B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL), variant [Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL)
- **Paper:** [arXiv:2506.03143](https://arxiv.org/abs/2506.03143), NeurIPS 2025, *GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents*
- **GitHub:** [microsoft/GUI-Actor](https://github.com/microsoft/GUI-Actor)
- **Release:** 2025-06-03
- **License:** MIT
- **Bench:** SSv2 92.1, **SSPro 44.6** (without verifier); the score rises with the verifier
- **Output:** **coordinate-free**, via an attention-based action head that points directly at visual patches. The output is decoded into `topk_points` (normalized 0-1 coordinates, no text generation)
- **Major theoretical advantage:** structurally eliminates the scaling bug. The model directly aligns a special token with the relevant visual patches.
- **Drawback:** requires a custom fork (`Qwen2_5_VLForConditionalGenerationWithPointer`), no standard vLLM support mentioned.
- **R&D interest:** validate the coordinate-free direction as an architectural option for the v2 grounding.
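
To illustrate why an attention-over-patches output carries no resize dependency, here is a toy sketch that turns a grid of per-patch scores into normalized top-k points. It is not the GUI-Actor implementation: the function name, the plain nested-list input, and the choice of patch centers are assumptions made for illustration.

```python
def topk_points_sketch(scores, k=3):
    """Return the centers of the k highest-scoring patches, normalized to 0-1.

    `scores` is a 2D list [rows][cols] of attention weights over image patches
    (hypothetical input format).
    """
    rows, cols = len(scores), len(scores[0])
    flat = [(scores[r][c], r, c) for r in range(rows) for c in range(cols)]
    flat.sort(key=lambda item: item[0], reverse=True)
    # Patch center in normalized image coordinates: (x, y).
    return [((c + 0.5) / cols, (r + 0.5) / rows) for _, r, c in flat[:k]]


grid = [[0.0, 0.1],
        [0.7, 0.2]]
x, y = topk_points_sketch(grid, k=1)[0]
print(x * 2560, y * 1600)  # 640.0 1200.0 on a 2560x1600 screenshot
```

Because the point is expressed as a fraction of the image, the client multiplies by the original screenshot size and never needs the resized dimensions.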

### 3.10. MolmoPoint-GUI-8B (Allen AI, March 2026)

- **HF repo:** [allenai/MolmoPoint-GUI-8B](https://huggingface.co/allenai/MolmoPoint-GUI-8B)
- **Blog:** [MolmoPoint blog Ai2](https://allenai.org/blog/molmopoint)
- **Paper:** see the blog (no explicit paper reference in our sources)
- **GitHub:** [allenai/molmo2](https://github.com/allenai/molmo2)
- **Release:** March 2026
- **License:** Apache 2.0 (research/education, Ai2 Responsible Use Guidelines)
- **Base:** Qwen3-8B + MolmoPoint-8B finetuning
- **Bench:** **SSPro 61.1 (open SOTA)**, OSWorldG 70.0
- **Output:** grounding tokens `[object_id, image_num, x, y]`, **absolute pixel coordinates** (not post-resize!)
- **Training data:** MolmoPoint-GUISyn = 36k synthetic high-resolution screenshots (desktop + web + mobile)
- **Drawback:** **no vLLM support** (custom logits processor required), single-image only, no production training support
- **Note for us:** **highest SSPro score among open-source models**, and the absolute coordinate convention means NO scaling bug. Integration is heavier, though (custom logits processor).
### 3.11. CogAgent-9B-20241220 (Zhipu / THUDM)

- **HF repo:** [zai-org/cogagent-9b-20241220](https://huggingface.co/zai-org/cogagent-9b-20241220)
- **Paper:** [arXiv:2312.08914](https://arxiv.org/abs/2312.08914) (v1); v2 (Dec 2024) has no dedicated paper
- **GitHub:** [zai-org/CogAgent](https://github.com/zai-org/CogAgent)
- **Release:** 2024-12-20 (v2)
- **License:** **Other** (custom Zhipu license, not Apache; check commercial healthtech compatibility!)
- **Base:** GLM-4V-9B (14B total: 9B language + 5B vision)
- **Bench:** cited as the leader on ScreenSpot vs. GPT-4o/Claude/SeeClick, but **no exact SSPro figure is published in the sources consulted**
- **Output:** action DSL `CLICK(box=[[x1,y1,x2,y2]], element_info='...')`, convention probably absolute on 1120×1120
- **License risk:** "Other" custom license, to be validated line by line before commercial production use.

### 3.12. ShowUI-2B (Show Lab, CVPR 2025)

- **HF repo:** [showlab/ShowUI-2B](https://huggingface.co/showlab/ShowUI-2B)
- **Paper:** [arXiv:2411.17465](https://arxiv.org/abs/2411.17465), CVPR 2025
- **GitHub:** [showlab/ShowUI](https://github.com/showlab/ShowUI)
- **Release:** 2024-11-26
- **License:** MIT
- **Base:** Qwen2-VL-2B-Instruct
- **Bench:** ScreenSpot 75.1% (zero-shot), **SSPro 7.7%** *(very low; lightweight 2B model)*
- **Output:** normalized 0-1 point + structured action dict
- **Relevant if:** extreme VRAM constraint (4 GB), simple workflow, low-res windows. Not suitable for Easily at 2560×1600.
- **Successor FocusUI** ([CVPR 2026](https://github.com/showlab/FocusUI)): token-pruning framework on Qwen2.5-VL / Qwen3-VL at multiple sizes, reported to outperform previous SOTA.

### 3.13. SeeClick (historical reference, ACL 2024)

- **HF repo:** [cckevinn/SeeClick](https://huggingface.co/cckevinn/SeeClick)
- **Paper:** [arXiv:2401.10935](https://arxiv.org/abs/2401.10935), ACL 2024
- **GitHub:** [njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)
- **Release:** 2024-04
- **License:** Apache 2.0
- **Base:** Qwen-VL ≈9.6B + LoRA finetune
- **Bench:** ScreenSpot 53.4% average (Windows text 55.7, Windows icon/widget 32.5), **SSPro <10% (original SSPro paper)**
- **Status on our side:** already tested, removed from `intelligent_executor.py` at commit `d1b556b6c` (April 2026, "broken"). Do NOT reuse.

### 3.14. GUI-G2-7B (Zhejiang Univ / inclusionAI, AAAI 2026)

- **HF repo:** [inclusionAI/GUI-G2-7B](https://huggingface.co/inclusionAI/GUI-G2-7B)
- **GitHub:** [ZJU-REAL/GUI-G2](https://github.com/ZJU-REAL/GUI-G2)
- **Paper:** AAAI 2026 *GUI-G²: Gaussian Reward Modeling for GUI Grounding*
- **Innovation:** Gaussian reward modeling for RL, i.e. a continuous reward scaled by the size of the target element (≠ binary hit/miss). Relevant for small icons at high resolution (the Easily Assure case).
- **Bench:** SOTA declared on ScreenSpot/SSv2/SSPro (the InfiGUI paper cites GUI-G2-7B at 47.5% SSPro)
- **Status:** recent (Jan 2026), worth watching but not yet widely reproduced publicly.
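
The idea of a continuous, size-scaled reward can be sketched as follows. This is an illustrative formulation only, not the reward defined in the GUI-G² paper: the function name, the axis-aligned Gaussian, and the `alpha` scale parameter are assumptions.

```python
import math


def gaussian_point_reward(pred, bbox, alpha=0.5):
    """Continuous reward in (0, 1], peaking at the center of the target box.

    The spread along each axis is proportional to the box size (alpha is a
    hypothetical hyperparameter), so small targets penalize the same pixel
    offset more strongly. Assumes a non-degenerate box (x2 > x1, y2 > y1).
    """
    px, py = pred
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sigma_x = alpha * (x2 - x1)
    sigma_y = alpha * (y2 - y1)
    z = (px - cx) ** 2 / (2 * sigma_x ** 2) + (py - cy) ** 2 / (2 * sigma_y ** 2)
    return math.exp(-z)


# Same 10 px horizontal miss, large button vs. small icon:
print(gaussian_point_reward((60, 50), (0, 0, 100, 100)))   # close to 1
print(gaussian_point_reward((60, 50), (40, 40, 60, 60)))   # noticeably lower
```

Unlike a binary inside/outside reward, this gives a gradient of feedback around the target, which is the property the section highlights for small high-resolution icons.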

### 3.15. UI-Venus (inclusionAI, 2025-2026)

- **GitHub:** [inclusionAI/UI-Venus](https://github.com/inclusionAI/UI-Venus)
- **Status:** flagged during research as a native, screenshot-only UI agent. No detailed evaluation found in our sources.

### 3.16. Florence-2 (Microsoft): out of GUI scope

- **Note:** 0.27B model that encodes coordinates as tokens, **not trained on UIs** (general object/phrase grounding). Listed for completeness only; **NOT suited** to the GUI case and should be ruled out.

---

## 4. Qwen3-VL vs. InfiGUI-G1 vs. OS-Atlas vs. Magma for our use case

Concrete scope: Easily Assure on Windows desktop, 2560×1600 window often cropped by mss to 2560×60 (separate DETTE bug), 22+ steps mixing tabs, dropdowns, modal dialogs, toolbar buttons, and input fields.

| Criterion | InfiGUI-G1-7B (direct upgrade) | Qwen3-VL-8B-Instruct (migration plan) | OS-Atlas-Base-7B (2024 reference) | Magma-8B (Microsoft hybrid) |
|---|---|---|---|---|
| **4-bit VRAM on RTX 5070 12 GB** | ~6 GB ✅ | ~6 GB ✅ | ~6 GB ✅ | ~8 GB ⚠️ (Transformers fork) |
| **ScreenSpot-Pro (SSPro)** | 51.9 ✅ | 54.6 ✅ | 18.9 ❌ | no SSPro published |
| **Coordinate convention** | post-resize (factor 28), `_smart_resize` already in place | post-resize (factor 32), DETTE-014 to realign | 0-1000 normalized, **no scaling bug** | SoM/marks, complex |
| **bbox_2d scaling bug avoided?** | not by construction, but server-side `_smart_resize` is fine if well calibrated | not by construction, same caveat (factor 32 ≠ 28 → recalibration) | **YES** (0-1000 is resolution-independent) | undocumented |
| **Output format** | point JSON `point_2d` | bbox_2d OR point JSON | bbox + point JSON | SoM (numbers on the image) |
| **vLLM support** | ✅ native | ✅ (vllm≥0.11) | ✅ | ❌ custom fork |
| **Continuity with existing code** | **maximal**: same `Qwen2_5_VLForConditionalGeneration` architecture, same prompts, only `MODEL_ID` changes | medium: Qwen3-VL is a new architecture, factor 32 ≠ 28, prompts to adapt (think:false, num_predict≥128) | good (Qwen2-VL base), but the 0-1000 coordinate format means rewriting all the parsing | low: Transformers fork, SoM head, custom parser |
| **Commercial healthtech license** | ✅ Apache 2.0 | ✅ Apache 2.0 | ✅ Apache 2.0 | ✅ MIT (even more permissive) |
| **Demo risk (Easily 2560×1600)** | low | medium (factor 32 realignment + DETTE-014 + new prompts) | high (SSPro 18.9 means large errors on complex dialogs) | high (custom integration) |
| **Migration effort** | ~1 day | ~3-5 days | ~2-3 days (rewrite the parser for 0-1000) | ~1 week + special integration |

**Comparative conclusion:** **InfiGUI-G1-7B is the fastest and least risky upgrade**. Qwen3-VL-8B is technically just as good but requires recalibrating `_smart_resize` (DETTE-014 already documents the factor 28 vs. 32 pitfall). OS-Atlas loses 30+ pts SSPro against the 2025 cohort but offers the 0-1000 convention, which eliminates the scaling bug. Magma is interesting for R&D, not for short-term production.

---

## 5. The bbox_2d scaling bug: which models avoid it

Reminder of the bug (cf. [`MIGRATION_VLM_PLAN_2026-05-09.md`](../MIGRATION_VLM_PLAN_2026-05-09.md) §1.2): the returned coordinates are in the resolution **after the `smart_resize`** applied by the model, but the production code divides by `orig_w` instead of `resized_w`, so every coordinate is shifted toward the top-left. Ollama does not expose `resized_dimensions`, which makes a client-side fix impossible.
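
The effect of the wrong divisor can be reproduced in a few lines. This is a minimal sketch under assumed sizes (a 2560×1600 screenshot seen by the model at 1280×800); the function names are hypothetical and do not mirror the production code.

```python
def map_correct(x, y, resized_size, original_size):
    """Normalize by the resized dimensions, then scale to the original image."""
    res_w, res_h = resized_size
    orig_w, orig_h = original_size
    return round(x / res_w * orig_w), round(y / res_h * orig_h)


def map_buggy(x, y, original_size):
    """Bug: normalizes by the original dimensions, so the point is never rescaled."""
    orig_w, orig_h = original_size
    return round(x / orig_w * orig_w), round(y / orig_h * orig_h)


original = (2560, 1600)   # screenshot size (assumed for the example)
resized = (1280, 800)     # size seen by the model after smart_resize (assumed)
model_point = (640, 400)  # center of the resized image

print(map_correct(*model_point, resized, original))  # (1280, 800): true center
print(map_buggy(*model_point, original))             # (640, 400): pulled top-left
```

Whenever the resized image is smaller than the original, the buggy mapping lands closer to the top-left corner, and the error grows with the distance from the origin.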

### Models WITHOUT the scaling bug (by construction)

1. **UGround-V1 (all sizes)**: output normalized to `[0, 1000)`, official parser `actual_x = (x / 1000) * image_width`. The model has learned to reason in a canonical space independent of the actual resolution.
2. **OS-Atlas-Base-7B / Base-4B**: output normalized to 0-1000 (inherited from Qwen2-VL). No resize → coordinate round trip.
3. **MolmoPoint-GUI-8B**: output in absolute pixels (the grounding token is decoded into (x, y) on the original image). No transformation needed on the client side.
4. **GUI-Actor-7B-Qwen2.5-VL**: output as `topk_points` normalized to 0-1, with no coordinate text (attention head over visual patches). **Architecturally coordinate-free**, which removes the bug at the root.
5. **UI-TARS-1.5-7B**: output in absolute pixels in the DSL `click(start_box='[x1,y1,x2,y2]')`. Documented this way, but the model has an internal smart_resize whose consistency with its DSL needs to be checked in practice (GitHub issue #215 suggests inconsistent reproduction).

### Models WITH a latent scaling bug (to be handled client-side)

6. **InfiGUI-G1-3B / 7B**: `point_2d` is post-resize, but the HF model card **explicitly** exposes `{new_width}x{new_height}` in the prompt and provides the mapping. No surprises if the documentation is followed. Our `core/grounding/server.py` already has a calibrated `_smart_resize`.
7. **Qwen3-VL-8B-Instruct**: `bbox_2d` is post-resize (factor **32**, not 28!). With an in-process backend (vLLM or Transformers), `resized_width/resized_height` can be passed to the model. With Ollama this is impossible (cf. migration plan).
8. **Qwen2.5-VL-7B-Instruct** *(legacy)*: root of our current bug via Ollama. To be dropped.
9. **AGUVIS-7B-720P**: the `720P` in the name suggests a fixed resize to 720p, but the coordinate convention is undocumented.

### Recommendation

To eliminate the scaling bug **for good**:
- **Short term (code continuity)**: move to an in-process Transformers backend that explicitly exposes `resized_width/resized_height` (already in place in `core/grounding/server.py` for InfiGUI). Migrate 3B → 7B.
- **Medium term (architecture)**: evaluate GUI-Actor or MolmoPoint-GUI in R&D for the coordinate-free / absolute approach.

---

## 6. Actionable recommendation

If we had to migrate now for the next client demo (post-GHT):

### Option A: Surgical continuity (recommended)

**Model:** `InfiX-ai/InfiGUI-G1-7B`
**Backend:** in-process Transformers via `core/grounding/server.py` (already in place), change `MODEL_ID` (env `GROUNDING_MODEL`)
**Effort:** ~1 day
**Expected gain:** +6.7 pts SSPro vs. the current InfiGUI-G1-3B (45.2 → 51.9), same output format, `_smart_resize` factor 28 unchanged
**Risk:** low. Same architecture, same prompts, only +3 GB VRAM (~6 GB in 4-bit NF4, which fits)
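
As a sanity check on the VRAM figures, the weight footprint alone can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch, not a measurement: the helper name is hypothetical and the result is a lower bound that ignores activations, KV cache, and any components left unquantized.

```python
def weights_vram_gb(params_billion, bits_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes); a lower bound."""
    return params_billion * bits_per_param / 8


print(weights_vram_gb(7, 16))  # 14.0 -> matches the ~14 GB BF16 column for 7B
print(weights_vram_gb(7, 4))   # 3.5  -> weights only, in 4-bit
print(weights_vram_gb(3, 16))  # 6.0  -> matches the ~6 GB BF16 column for 3B
```

The ~6 GB quoted for the 7B in 4-bit NF4 sits above the 3.5 GB weights-only bound, presumably because of runtime overhead, so it should be confirmed by a real load test on the RTX 5070.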

### Option B: Jump to open SOTA (parallel R&D)

**Model:** `allenai/MolmoPoint-GUI-8B` (SSPro 61.1, open SOTA)
**Backend:** in-process Transformers with a custom logits processor (no vLLM)
**Effort:** ~3-5 days (special integration, training/eval pipelines)
**Expected gain:** +15.9 pts SSPro vs. current (45.2 → 61.1), **absolute coordinate convention → ZERO scaling bug**
**Risk:** medium. No vLLM, single-image only, non-standard integration

### Option C: Align with the existing migration plan (Qwen3-VL)

**Model:** `Qwen/Qwen3-VL-8B-Instruct` (target documented in [`MIGRATION_VLM_PLAN_2026-05-09.md`](../MIGRATION_VLM_PLAN_2026-05-09.md))
**Backend:** vLLM ≥0.11 (already wired in `resolve_engine.py:785-816`) or Transformers
**Effort:** ~3-5 days
**Expected gain:** +9.4 pts SSPro (54.6 vs. 45.2), `resized_width/resized_height` can be passed explicitly
**Risk:** medium. Factor 32 ≠ 28 (DETTE-014), new prompts (`think:false`, num_predict≥128), new Qwen3-VL architecture

### Recommended choice: **A now, B as parallel R&D, C deferred as long as A works**

Reasons:
- A minimizes short-term risk and builds on the `core/grounding/` infrastructure we have invested in since 2026-04-26.
- B tests the "coordinate-free / absolute" hypothesis, which could become the pattern going forward.
- C requires recalibrating smart_resize to factor 32 (explicit in DETTE-014), a one-off operation that is best scheduled after the demo.

**Open question for Dom:** is the 45.2 → 51.9 SSPro gain (option A) enough to unblock the Easily cases where grounding currently fails? If the primary cause is transport (cf. the May 8 diagnosis, [`REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md`](../REPLAY_BLOCAGE_NOTES_MEDICALES_2026-05-08.md)), a SOTA model will fix nothing.

---

## 7. Sources

### Benchmarks and leaderboards

- [ScreenSpot-Pro paper arXiv:2504.07981](https://arxiv.org/abs/2504.07981): *ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use* (ICLR 2025)
- [ScreenSpot-Pro leaderboard llm-stats](https://llm-stats.com/benchmarks/screenspot-pro): third-party leaderboard (21 models, GPT-5.2 leading at 86.3 %)
- [gui-agent.github.io grounding-leaderboard](https://gui-agent.github.io/grounding-leaderboard/): academic leaderboard infrastructure
- [ScreenSpot-Pro GitHub likaixin2000](https://github.com/likaixin2000/ScreenSpot-Pro-GUI-Grounding): official benchmark repo
- [HF blog Ziyang ScreenSpot-Pro](https://huggingface.co/blog/Ziyang/screenspot-pro): HF announcement of the benchmark
- [WindowsAgentArena Microsoft](https://microsoft.github.io/WindowsAgentArena/): Windows benchmark environment
- [Awesome Agents Computer Use leaderboard](https://awesomeagents.ai/leaderboards/computer-use-leaderboard/): third-party leaderboard
- [OSU GUI-Agents Paper List](https://github.com/OSU-NLP-Group/GUI-Agents-Paper-List/blob/main/paper_by_key/paper_visual_grounding.md): paper inventory

### Open-source models (Apache / MIT): repos and papers

- **InfiGUI-G1** : [HF 3B](https://huggingface.co/InfiX-ai/InfiGUI-G1-3B), [HF 7B](https://huggingface.co/InfiX-ai/InfiGUI-G1-7B), [paper arXiv:2508.05731](https://arxiv.org/abs/2508.05731), [GitHub InfiXAI/InfiGUI-G1](https://github.com/InfiXAI/InfiGUI-G1)
- **InfiGUI-R1** : [paper arXiv:2504.14239](https://arxiv.org/abs/2504.14239), [GitHub InfiXAI/InfiGUI-R1](https://github.com/InfiXAI/InfiGUI-R1)
- **UI-TARS-1.5** : [HF ByteDance-Seed/UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B), [GitHub bytedance/UI-TARS](https://github.com/bytedance/ui-tars), [GitHub UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop), [paper arXiv:2501.12326](https://arxiv.org/abs/2501.12326), [UI-TARS-2 tech report arXiv:2509.02544](https://arxiv.org/html/2509.02544v1)
- **Qwen3-VL** : [HF Qwen/Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct), [GitHub QwenLM/Qwen3-VL](https://github.com/QwenLM/Qwen3-VL), [HF Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct)
- **Qwen2.5-VL** : [HF Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), [discussion #13 bbox_2d resize bug](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/discussions/13)
- **Holo1 / Holo1.5** : [HF Hcompany/Holo1.5-7B](https://huggingface.co/Hcompany/Holo1.5-7B), [3B](https://huggingface.co/Hcompany/Holo1.5-3B), [HF blog Holo1](https://huggingface.co/blog/Hcompany/holo1), [paper Surfer-H arXiv:2506.02865](https://arxiv.org/pdf/2506.02865), [HF blog GRPO grounding-r1](https://huggingface.co/blog/HelloKKMe/grounding-r1)
- **UGround** : [HF osunlp/UGround-V1-7B](https://huggingface.co/osunlp/UGround-V1-7B), [2B](https://huggingface.co/osunlp/UGround-V1-2B), [72B](https://huggingface.co/osunlp/UGround-V1-72B), [paper arXiv:2410.05243](https://arxiv.org/abs/2410.05243), [GitHub OSU-NLP-Group/UGround](https://github.com/OSU-NLP-Group/UGround)
- **OS-Atlas** : [HF OS-Copilot/OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B), [Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B), [Pro-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-7B), [paper arXiv:2410.23218](https://arxiv.org/abs/2410.23218), [GitHub OS-Copilot/OS-Atlas](https://github.com/OS-Copilot/OS-Atlas)
- **AGUVIS** : [HF xlangai/Aguvis-7B-720P](https://huggingface.co/xlangai/Aguvis-7B-720P)
- **Magma** : [HF microsoft/Magma-8B](https://huggingface.co/microsoft/Magma-8B), [paper arXiv:2502.13130](https://arxiv.org/abs/2502.13130), [GitHub microsoft/Magma](https://github.com/microsoft/Magma), [Microsoft blog](https://www.microsoft.com/en-us/research/blog/magma-a-foundation-model-for-multimodal-ai-agents-across-digital-and-physical-worlds/)
- **GUI-Actor** : [HF microsoft/GUI-Actor-7B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL), [Qwen2-VL variant](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL), [paper arXiv:2506.03143](https://arxiv.org/abs/2506.03143), [GitHub microsoft/GUI-Actor](https://github.com/microsoft/GUI-Actor), [project page](https://microsoft.github.io/GUI-Actor/)
- **MolmoPoint-GUI** : [HF allenai/MolmoPoint-GUI-8B](https://huggingface.co/allenai/MolmoPoint-GUI-8B), [blog Ai2 MolmoPoint](https://allenai.org/blog/molmopoint), [GitHub allenai/molmo2](https://github.com/allenai/molmo2), [MolmoWeb blog](https://allenai.org/blog/molmoweb)
- **CogAgent v2** : [HF zai-org/cogagent-9b-20241220](https://huggingface.co/zai-org/cogagent-9b-20241220), [GitHub zai-org/CogAgent](https://github.com/zai-org/CogAgent), [paper v1 arXiv:2312.08914](https://arxiv.org/abs/2312.08914), [MarkTechPost announcement](https://www.marktechpost.com/2024/12/25/tsinghua-university-researchers-just-open-sourced-cogagent-9b-20241220-the-latest-version-of-cogagent/)
- **ShowUI / FocusUI** : [HF showlab/ShowUI-2B](https://huggingface.co/showlab/ShowUI-2B), [paper arXiv:2411.17465](https://arxiv.org/abs/2411.17465), [GitHub showlab/ShowUI](https://github.com/showlab/showui), [GitHub showlab/FocusUI](https://github.com/showlab/FocusUI)
- **SeeClick** : [HF cckevinn/SeeClick](https://huggingface.co/cckevinn/SeeClick), [paper arXiv:2401.10935](https://arxiv.org/abs/2401.10935), [GitHub njucckevin/SeeClick](https://github.com/njucckevin/SeeClick)
- **GUI-G2** : [HF inclusionAI/GUI-G2-7B](https://huggingface.co/inclusionAI/GUI-G2-7B), [GitHub ZJU-REAL/GUI-G2](https://github.com/ZJU-REAL/GUI-G2)
- **OmniParser V2** : [GitHub microsoft/OmniParser](https://github.com/microsoft/omniparser), [Microsoft Research V2 blog](https://www.microsoft.com/en-us/research/articles/omniparser-v2-turning-any-llm-into-a-computer-use-agent/)

### Annexes

- [Codersera Qwen3-VL Instruct vs Thinking guide 2026](https://codersera.com/blog/qwen3-vl-8b-instruct-vs-qwen3-vl-8b-thinking-2025-guide/)
- [Skywork blog Qwen3-VL GUI Automation 2025](https://skywork.ai/blog/llm/qwen3-vl-gui-automation-2025-visual-agent-revolution/)
- [BinaryVerse Qwen3-VL benchmarks](https://binaryverseai.com/qwen3-vl-benchmarks-local-installation-guide-use/)
- [The Decoder Qwen3-VL videos](https://the-decoder.com/qwen3-vl-can-scan-two-hour-videos-and-pinpoint-nearly-every-detail/)
- [HF discussion #13 Qwen2.5-VL bbox resize bug](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/discussions/13): documented root cause of the bug we are hitting
- [GitHub QwenLM/Qwen3-VL issue #1831, image zoom factor 32 vs 28](https://github.com/QwenLM/Qwen3-VL/issues/1831): directly related to our DETTE-014

---

## 8. Links to the project's other research axes

| Axis | Link |
|---|---|
| **A2: smart_resize** | The model choice determines the `factor` to use: Qwen2.5-VL = 28, Qwen3-VL = 32, OS-Atlas/UGround = no smart_resize (0-1000 space). The repo's DETTE-014 (`feedback_reread_before_code.md`) must be recalibrated for whichever model is finally selected. |
| **A3: Target bbox grounding bench** | The test to rerun (`MIGRATION_VLM_PLAN_2026-05-09.md` §5) must include the top 3 candidates, InfiGUI-G1-7B, Qwen3-VL-8B-Instruct, and Holo1.5-7B, on the 2560×1600 fixture `heartbeat_1773792436.png`. Criterion: OK button at cx ≈ 0.45-0.55. |
| **B2: Validator (Planner-Actor-Validator)** | GUI-Actor includes a grounding verifier to score candidates. MolmoPoint-GUI returns top-k points. This pattern should be integrated into our `replay_verifier.py`, which is currently too lax (see synthesis §5.2). |
| **B3: Coordinate-free architecture** | GUI-Actor (NeurIPS 2025) and MolmoPoint-GUI (Ai2 2026) open a post-coordinate path. Worth exploring for grounding v2, independently of the demo urgency. |
| **GHT demo post-mortem** | The primary bug in the 8-19 May demo was HTTP transport, not grounding (see [`LESSONS_LEARNED_GHT_2026-05.md`](../LESSONS_LEARNED_GHT_2026-05.md)). Migrating the VLM only makes sense once transport is stabilized (5 P0 bugs still open). |
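
The A3 criterion (OK button at cx ≈ 0.45-0.55) only makes sense once each model's output is normalized against the right coordinate space. A minimal sketch of such a check follows. The function names are hypothetical, the 1260×784 resized grid is an assumed example value, and the box coordinates are made up for illustration rather than taken from a real model run.

```python
def normalized_center(bbox, space_width, space_height):
    """Return the (cx, cy) center of an (x1, y1, x2, y2) box, normalized to 0-1.

    `space_width` and `space_height` describe the coordinate space the model
    answered in: (1000, 1000) for 0-1000 models, the resized grid for
    Qwen-style models, or the original screenshot size for absolute pixels.
    """
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2 / space_width, (y1 + y2) / 2 / space_height)


def passes_cx_criterion(bbox, space_width, space_height, lo=0.45, hi=0.55):
    """True when the normalized horizontal center falls inside [lo, hi]."""
    cx, _ = normalized_center(bbox, space_width, space_height)
    return lo <= cx <= hi


# A 0-1000 space answer (OS-Atlas / UGround style), centered horizontally.
ok_thousand = passes_cx_criterion((480, 490, 520, 510), 1000, 1000)

# A resized-pixel answer, read against the assumed 1260x784 resized grid.
ok_resized = passes_cx_criterion((610, 380, 650, 404), 1260, 784)

# The same answer misread against the original 2560x1600 screenshot:
# this is the scaling bug, and the check fails (cx is about 0.25).
misread = passes_cx_criterion((610, 380, 650, 404), 2560, 1600)
```

Keeping the coordinate space as an explicit argument lets the three candidates run through the same check, and a box read against the wrong space shows up as a criterion failure rather than a silent offset.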

---

*Document intended to inform the VLM migration decision after the GHT demo. No code changes. The operational decision (option A / B / C) must be validated by Dom.*