feat: T2A-Extractor pipeline with CIM-10 normalizer (31→0 warnings)

Initial commit with full extraction pipeline: PDF OCR (docTR), text segmentation, LLM extraction (Ollama), deterministic post-processing normalizer, validation, and Excel/CSV export.

The normalizer fixes OCR/LLM errors on CIM-10 codes:
- OCR digit→letter confusion in position 1 (1→I, 0→O, 5→S, 2→Z, 8→B)
- Missing dot separator (F050→F05.0, R410→R41.0)
- '+' instead of '.' (B99+1→B99.1, J961+0→J96.10)
- Excess decimals (Z04.880→Z04.88)
- OCR letter→digit in positions 2-3 (LO2.2→L02.2)
- Literal "null" string purge
- Auto-fill codes_retenus from decision context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 20:44:32 +01:00
commit f70d138db3
13 changed files with 1699 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,63 @@
+# T2A Extractor
+
+Structured extraction of T2A audit reports (UCR decisions) from native and scanned PDFs.
+
+## Architecture
+
+```
+UCR PDF → [pymupdf/docTR] → raw text → [regex] → OGC blocks → [VLM Ollama] → JSON → [validation] → Excel/CSV
+```
+
+### Pipeline
+
+1. **Text extraction**: Automatic native/scanned detection for each page; pymupdf handles native pages, docTR handles OCR.
+2. **Segmentation**: Regex-based splitting into blocks by Champ and by OGC (individual and grouped).
+3. **Structured extraction**: Each block is sent to the local VLM (Ollama), which returns structured JSON.
+4. **Validation**: Checks the CIM-10/CCAM codes and the consistency of the decisions.
+5. **Export**: Formatted Excel (with color-coded decisions) and optional CSV.
+
+## Output schema (11 columns)
+
+| Column | Description |
+|---|---|
+| `champ` | Champ number |
+| `num_ogc` | OGC number |
+| `type_desaccord` | Type of disagreement: DP / DAS / DP+DAS / Actes |
+| `codes_etablissement` | CIM-10/CCAM codes submitted by the facility |
+| `libelle_etablissement` | Label of the facility's coding |
+| `codes_controleurs` | CIM-10/CCAM codes proposed by the auditors |
+| `libelle_controleurs` | Label of the auditors' coding |
+| `decision_ucr` | Favorable / Défavorable (with respect to the facility) |
+| `codes_retenus` | Codes finally retained |
+| `ghm_ghs` | GHM/GHS, if mentioned |
+| `texte_decision` | Full text of the UCR decision |
+
+## Installation
+
+```bash
+chmod +x setup.sh
+./setup.sh
+```
+
+## Usage
+
+```bash
+source .venv/bin/activate
+python main.py rapport_ucr.pdf
+python main.py rapport_ucr.pdf --csv --verbose
+python main.py rapport_ucr.pdf -o /path/to/output --csv -v
+```
+
+## Prerequisites
+
+- Python 3.12+
+- Ollama with a VLM (gemma3:27b-it-qat by default)
+- GPU recommended for docTR (it also runs on CPU)
+
+## Configuration
+
+Edit `config.py` to adjust:
+- `OLLAMA_MODEL`: model to use
+- `OLLAMA_BASE_URL`: URL of the Ollama server
+- `OCR_DPI`: OCR resolution (default: 200)
+- `NATIVE_TEXT_MIN_CHARS`: threshold for native/scanned detection
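
The CIM-10 normalization rules listed in the commit message can be illustrated with a minimal Python sketch. This is not the normalizer shipped in the commit: the function name `normalize_cim10`, the two lookup tables, the order in which the rules run, and the choice to truncate to two decimals are all assumptions made for illustration. Only the input/output pairs come from the commit message, and the `codes_retenus` auto-fill is not covered.

```python
import re

# Digits that OCR may produce in place of the leading letter
# (pairs taken from the commit message).
DIGIT_TO_LETTER = {"1": "I", "0": "O", "5": "S", "2": "Z", "8": "B"}

# Letters that OCR may produce in place of digits in positions 2-3.
# Only O -> 0 is attested in the commit message (LO2.2 -> L02.2).
LETTER_TO_DIGIT = {"O": "0"}


def normalize_cim10(raw):
    """Hypothetical helper: return a cleaned CIM-10 code, or None if empty."""
    if raw is None:
        return None
    code = raw.strip().upper()
    # Purge empty values and the literal string "null".
    if not code or code == "NULL":
        return None
    # A '+' read in place of '.': drop it; the dot is re-inserted below.
    code = code.replace("+", "")
    if not code:
        return None
    # Position 1 must be a letter: map look-alike digits back to letters.
    if code[0] in DIGIT_TO_LETTER:
        code = DIGIT_TO_LETTER[code[0]] + code[1:]
    # Positions 2-3 must be digits: map look-alike letters back to digits.
    middle = "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in code[1:3])
    code = code[0] + middle + code[3:]
    # Missing dot separator: insert it after the third character.
    if "." not in code and len(code) > 3:
        code = code[:3] + "." + code[3:]
    # Excess decimals: keep at most two digits after the dot (assumption).
    match = re.fullmatch(r"([A-Z]\d{2})\.(\d{3,})", code)
    if match:
        code = f"{match.group(1)}.{match.group(2)[:2]}"
    return code
```

Dropping the '+' before re-inserting the dot lets one rule cover both `B99+1` and `J961+0`, which a plain '+' to '.' substitution would not.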