# LLM benchmark for T2A decisions: 11 dossiers, GHT Sud 95

_Generated 2026-05-05 19:00. **18 models** × 11 consolidated DPIs (electronic patient records): 4 UHCD / 7 Forfait_

> ⚠️ **Ground truth corrected 2026-05-05**: `25003284` reclassified as **FORFAIT** (discharged home after 3h37, J12.1 RSV); it was previously mislabeled UHCD. All scores below reflect the new ground truth.

## 🏆 Full ranking

| # | Model | Acc | p50 | HTTP err | Parse | Calib | Verdict |
|---|---|---:|---:|:---:|:---:|:---:|---|
| #1 | `gemma3:27b-cloud` | **8/11 (73%)** | 10.6s | ✅ | ✅ | ❌ 2 | 🟢 **Recommended for demo** |
| #2 | `qwen3:8b` | **7/11 (64%)** | 7.6s | ✅ | ✅ | ❌ 1 | 🟡 Acceptable with guardrail |
| #3 | `qwen2.5:7b` | **7/11 (64%)** | 10.0s | ✅ | ✅ | ❌ 2 | 🟡 Acceptable with guardrail |
| #4 | `qwen3-vl:235b-instruct-cloud` | **7/11 (64%)** | 20.3s | ✅ | ✅ | ❌ 4 | 🟡 Acceptable with guardrail |
| #5 | `qwen3.5:9b` | **7/11 (64%)** | 25.8s | ✅ | ⚠️1 | ❌ 1 | 🟡 Acceptable with guardrail |
| #6 | `t2a-gemma3-27b-q4:latest` | **7/11 (64%)** | 152.2s | ⚠️1 | ✅ | ❌ 3 | 🟡 Acceptable with guardrail |
| #7 | `thiagomoraes/medgemma-27b-it:Q4_K_S` | **7/11 (64%)** | 181.6s | ⚠️1 | ✅ | ❌ 3 | 🟡 Acceptable with guardrail |
| #8 | `gemma4:latest` | **6/11 (55%)** | 17.9s | ✅ | ✅ | ❌ 3 | 🔴 Insufficient |
| #9 | `pmsi-runpod:latest` | **6/11 (55%)** | 18.8s | ✅ | ⚠️1 | ❌ 4 | 🔴 Insufficient |
| #10 | `qwen2.5:14b` | **6/11 (55%)** | 68.0s | ✅ | ✅ | ❌ 1 | 🔴 Insufficient |
| #11 | `pmsi-coder:latest` | **5/11 (45%)** | 18.1s | ✅ | ⚠️1 | ❌ 5 | 🔴 Insufficient |
| #12 | `medgemma:4b` | **4/11 (36%)** | 4.4s | ✅ | ⚠️6 | ❌ 1 | 🟠 JSON format broken via Ollama |
| #13 | `gpt-oss:120b-cloud` | **3/11 (27%)** | 16.0s | ✅ | ⚠️5 | ❌ 3 | 🟠 JSON format broken via Ollama |
| #14 | `pmsi-coder-v2:latest` | **3/11 (27%)** | 23.4s | ✅ | ⚠️5 | ✅ | 🟠 JSON format broken via Ollama |
| #15 | `qwen3:14b` | **1/11 (9%)** | 1.2s | ✅ | ⚠️1 | ❌ 1 | 🔴 Insufficient |
| #16 | `charlestang06/openbiollm:latest` | **1/11 (9%)** | 6.0s | ✅ | ✅ | ❌ 3 | 🔴 Insufficient |
| #17 | `gpt-oss:20b-cloud` | **0/11 (0%)** | 0.0s | ✅ | ⚠️11 | ✅ | 🟠 JSON format broken via Ollama |
| #18 | `qwen3-next:80b-cloud` | **0/11 (0%)** | 0.0s | ✅ | ⚠️11 | ✅ | 🟠 JSON format broken via Ollama |
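
The order of the table is consistent with sorting by accuracy (descending) and breaking ties on p50 latency (ascending). The sketch below assumes that this is the ranking rule; the bench scripts may implement it differently. The rows are copied from the top of the table.

```python
# Rows copied from the ranking table: (model, correct answers out of 11, p50 in seconds).
rows = [
    ("qwen2.5:7b", 7, 10.0),
    ("gemma3:27b-cloud", 8, 10.6),
    ("qwen3.5:9b", 7, 25.8),
    ("qwen3:8b", 7, 7.6),
    ("qwen3-vl:235b-instruct-cloud", 7, 20.3),
]


def rank(rows):
    """Sort by accuracy (descending), then by p50 latency (ascending).

    Assumed rule, inferred from the table order rather than from the scripts.
    """
    return sorted(rows, key=lambda r: (-r[1], r[2]))


ranking = [model for model, _, _ in rank(rows)]
for position, model in enumerate(ranking, start=1):
    print(f"#{position} {model}")
```

With this rule, four models tied at 7/11 are separated only by latency, so ranks #2 to #5 say nothing about relative decision quality.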

## 🎯 Recommendation for the demo

**Selected model: `gemma3:27b-cloud`**, 8/11 (73%), p50 10.6s

**Local backup if the cloud is down**: `qwen3:8b` (7/11, 7.6s, only 5 GB, so it fits comfortably in a 12 GB GPU).
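
A minimal sketch of the cloud-then-local fallback described above. `call_model` is a hypothetical client function (not part of the bench scripts) that takes a model name and the DPI text, returns the decision, and raises on any failure.

```python
from typing import Callable, Tuple

PRIMARY = "gemma3:27b-cloud"  # selected model
BACKUP = "qwen3:8b"           # local backup


def decide_with_fallback(
    dpi_text: str, call_model: Callable[[str, str], str]
) -> Tuple[str, str]:
    """Return (model_used, decision), trying the cloud model first.

    `call_model(model, dpi_text)` is a hypothetical client that raises on failure.
    """
    try:
        return PRIMARY, call_model(PRIMARY, dpi_text)
    except Exception:
        # Any failure of the primary (network, HTTP error, timeout) triggers the backup.
        return BACKUP, call_model(BACKUP, dpi_text)


# Stub client simulating a cloud outage, for illustration only.
def flaky_client(model: str, dpi_text: str) -> str:
    if model.endswith("-cloud"):
        raise ConnectionError("cloud unreachable")
    return "Forfait"


used, decision = decide_with_fallback("example DPI text", flaky_client)
print(used, decision)
```

Returning the model name alongside the decision keeps it visible during the demo which of the two models actually answered.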

## 📊 Dossier-by-dossier detail (top 5)

| IPP | Case | Truth | `gemma3` | `qwen3` | `qwen2.5` | `qwen3-vl` | `qwen3.5` |
|---|---|:---:|:---:|:---:|:---:|:---:|:---:|
| 25003284 | RSV pneumonia - Forfait | **Forfait** | ❌ UHCD | ❌ UHCD | ❌ UHCD | ❌ UHCD | ✅ Forfait |
| 25003362 | Intoxication PE2 | **Forfait** | ✅ Forfait | ✅ Forfait | ✅ Forfait | ✅ Forfait | ✅ Forfait |
| 25003364 | ALS pneumonia - UHCD | **UHCD** | ✅ UHCD | ✅ UHCD | ✅ UHCD | ✅ UHCD | ✅ UHCD |
| 25003451 | Wound SU2 | **Forfait** | ✅ Forfait | ✅ Forfait | ❌ UHCD | ✅ Forfait | ✅ Forfait |
| 25003475 | Migraine aura - UHCD | **UHCD** | ✅ UHCD | ✅ UHCD | ❌ Forfait | ❌ Forfait | ❌ Forfait |
| 25005866 | Hockey trauma - UHCD | **UHCD** | ✅ UHCD | ❌ Forfait | ✅ UHCD | ✅ UHCD | ✅ UHCD |
| 25010621 | Laryngitis PE2 | **Forfait** | ✅ Forfait | ✅ Forfait | ✅ Forfait | ✅ Forfait | ✅ Forfait |
| 25012257 | Abdominal pain - UHCD | **UHCD** | ❌ Forfait | ✅ UHCD | ✅ UHCD | ✅ UHCD | ⚠️ parse |
| 25048485 | Seizure PE2 | **Forfait** | ✅ Forfait | ❌ UHCD | ✅ Forfait | ✅ Forfait | ✅ Forfait |
| 25056615 | Salpingitis std | **Forfait** | ❌ UHCD | ❌ UHCD | ❌ UHCD | ❌ UHCD | ❌ UHCD |
| 25151530 | Colic std | **Forfait** | ✅ Forfait | ✅ Forfait | ✅ Forfait | ❌ UHCD | ❌ UHCD |

## ⚠️ Universal problem cases (3+ of the top models wrong)

- **`25003284` (RSV pneumonia - Forfait, truth Forfait)**: 4/5 top models wrong → enrich the DPI OR challenge the ground truth with Pauline
- **`25003475` (Migraine aura - UHCD, truth UHCD)**: 3/5 top models wrong → enrich the DPI OR challenge the ground truth with Pauline
- **`25056615` (Salpingitis std, truth Forfait)**: 5/5 top models wrong → enrich the DPI OR challenge the ground truth with Pauline
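
The three cases above can be re-derived from the top-5 table. The sketch below hard-codes the per-dossier outcomes from that table (a parse failure is counted as wrong, which is an assumption) and flags every dossier where at least 3 of the 5 models miss.

```python
# Outcomes copied from the top-5 table, in column order:
# gemma3, qwen3, qwen2.5, qwen3-vl, qwen3.5.
# True = correct. Assumption: a parse failure counts as wrong.
outcomes = {
    "25003284": [False, False, False, False, True],
    "25003362": [True, True, True, True, True],
    "25003364": [True, True, True, True, True],
    "25003451": [True, True, False, True, True],
    "25003475": [True, True, False, False, False],
    "25005866": [True, False, True, True, True],
    "25010621": [True, True, True, True, True],
    "25012257": [False, True, True, True, False],
    "25048485": [True, False, True, True, True],
    "25056615": [False, False, False, False, False],
    "25151530": [True, True, True, False, False],
}

THRESHOLD = 3  # "3+ of the top models wrong"


def universal_failures(outcomes, threshold=THRESHOLD):
    """Map each flagged IPP to the number of top models that got it wrong."""
    flagged = {}
    for ipp, results in outcomes.items():
        wrong = results.count(False)
        if wrong >= threshold:
            flagged[ipp] = wrong
    return flagged


flagged = universal_failures(outcomes)
print(flagged)

# Sanity check: per-model totals should match the Acc column of the ranking.
per_model_correct = [sum(column) for column in zip(*outcomes.values())]
print(per_model_correct)
```

The per-model totals double as a consistency check between the detail table and the ranking table.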

## 🔬 Benchmark limits (transparency)

This benchmark is **good enough to shortlist candidate models** but **NOT a rigorous validation**:

- **n=11** dossiers: sample too small for robust statistics (target: 50-100)
- **1 inference per dossier**: no variance measured (the same DPI can yield two different answers)
- **Derived ground truth**: partially corrected but not yet fully validated by the DIM
- **Partly fictitious source DPIs**: see `REVUE_DOSSIERS_PAULINE.md` (40+ invented names, 4 serious clinical hallucinations, truncated vital signs). **The benchmark therefore runs on content that is not faithful to the real records.** A re-bench is planned once `data.js` is rebuilt.
- **No cross-validation**: no train/test split
- **No formal calibration**: the share of wrong "elevee" (high-confidence) answers is recorded, but no calibration error score is computed
- **External benchmarks not used**: MedQA, MedFrenchBenchmark, etc. could complement this one
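
To make the sample-size limit concrete, here is a rough sketch that computes a 95% Wilson score interval for the two best scores. The Wilson interval and `z = 1.96` are choices made for this illustration; the bench itself does not compute any interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


lo8, hi8 = wilson_interval(8, 11)  # gemma3:27b-cloud
lo7, hi7 = wilson_interval(7, 11)  # the five models tied at 7/11
print(f"8/11: [{lo8:.2f}, {hi8:.2f}]")
print(f"7/11: [{lo7:.2f}, {hi7:.2f}]")
print("intervals overlap:", lo8 < hi7)
```

With n=11 the two intervals are each about 50 points wide (roughly 0.43 to 0.90 and 0.35 to 0.85) and overlap almost entirely, so the one-dossier gap between #1 and #2 is not statistically meaningful on its own.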

**For a real product-validation benchmark (post-demo)**:
1. Extend to 50-100 diverse, anonymized dossiers (Pauline + partner DIMs)
2. Run 3 inferences per dossier (to measure variance)
3. k-fold cross-validation
4. Human inter-rater agreement as the baseline
5. Robustness tests (DPIs with typos, atypical abbreviations)
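
Step 2 could be summarized per dossier as a majority label plus an agreement rate. The sketch below is one possible way to do it; the dossier names and answers are made up for illustration and are not bench results.

```python
from collections import Counter


def summarize_runs(answers):
    """Return (majority label, share of runs agreeing with it) for one dossier."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


# Hypothetical answers from 3 runs per dossier (illustrative values only).
runs = {
    "dossier_A": ["UHCD", "UHCD", "UHCD"],
    "dossier_B": ["Forfait", "UHCD", "Forfait"],
}

summary = {dossier: summarize_runs(answers) for dossier, answers in runs.items()}
unstable = [d for d, (_, agreement) in summary.items() if agreement < 1.0]

for dossier, (label, agreement) in summary.items():
    print(f"{dossier}: {label} ({agreement:.0%} agreement)")
print("unstable dossiers:", unstable)
```

Dossiers listed as unstable are the ones where a single-inference score, as in the current bench, depends on which run happened to be recorded.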

## 🧪 Untested models worth exploring

- **MedGemma 1.5** (Jan 2026, 91% MedQA, surpassing Med-PaLM 2): available on HuggingFace, to pull if compatible with Ollama
- **DeepSeek-R1** (top open-source reasoning model, 2026): Ollama cloud returns 403 without a subscription
- **vllm models**: `qwen3-next:80b-cloud` and `gpt-oss:20b-cloud` had 100% parse errors via Ollama (`gpt-oss:120b-cloud`: 5/11); they may work via vllm with clean JSON output
- **Extended custom T2A fine-tune**: `t2a-gemma3-27b` (28 GB, non-quantized) on DGX Spark
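
Whether vllm resolves the parse failures is untested. Independently of the backend, a tolerant parser might recover some answers when a model wraps its JSON in prose or markdown fences. The sketch below is one such parser; the field names `decision` and `confidence` are assumptions about the response schema, which this report does not spell out.

```python
import json


def extract_json(raw):
    """Return the first JSON object found in raw model output, or None.

    Tries to decode a JSON value at every "{" and keeps the first dict that parses,
    so surrounding prose or markdown fences are ignored.
    """
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = raw.find("{", start + 1)
    return None


fence = "`" * 3  # markdown code fence, built here to keep the example readable
# Hypothetical model output; "decision" and "confidence" are assumed field names.
wrapped = f'Here is my answer:\n{fence}json\n{{"decision": "UHCD", "confidence": "elevee"}}\n{fence}'

parsed = extract_json(wrapped)
print(parsed)
print(extract_json("no JSON in this answer"))
```

Answers that still return `None` after this step would remain genuine parse errors and should keep counting against the model.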

## 📦 Appendices

- Raw trace, local: `/tmp/bench_t2a_full.json`
- Raw trace, cloud: `/tmp/bench_t2a_cloud.json`
- Raw trace, extra: `/tmp/bench_t2a_extra.json`
- Raw trace, retry: `/tmp/bench_t2a_retry.json`
- Scripts: `/tmp/bench_t2a*.py`
- Consolidated DPIs: `/tmp/dpis.json`