dom f70d138db3 feat: T2A-Extractor pipeline with CIM-10 normalizer (31→0 warnings)

Initial commit with full extraction pipeline: PDF OCR (docTR), text
segmentation, LLM extraction (Ollama), deterministic post-processing
normalizer, validation, and Excel/CSV export.

The normalizer fixes OCR/LLM errors on CIM-10 codes:
- OCR digit→letter confusion in position 1 (1→I, 0→O, 5→S, 2→Z, 8→B)
- Missing dot separator (F050→F05.0, R410→R41.0)
- '+' instead of '.' (B99+1→B99.1, J961+0→J96.10)
- Excess decimals (Z04.880→Z04.88)
- OCR letter→digit in positions 2-3 (LO2.2→L02.2)
- Literal "null" string purge
- Auto-fill codes_retenus from decision context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
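
The normalizer rules listed in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function name `normalize_cim10`, the inverse letter→digit table, and the two-decimal limit are assumptions inferred from the examples above. The `codes_retenus` auto-fill is left out because the commit message does not describe how it works.

```python
import re

# Digit -> letter fixes for position 1, as listed in the commit message.
DIGIT_TO_LETTER = {"1": "I", "0": "O", "5": "S", "2": "Z", "8": "B"}
# Letter -> digit fixes for positions 2-3 (assumed to be the inverse table).
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}


def normalize_cim10(raw):
    """Return a repaired CIM-10 code, or None for empty / literal "null" input."""
    if raw is None:
        return None
    code = raw.strip().upper()
    if code in ("", "NULL"):
        return None
    # Drop every separator ('.', '+', spaces); the dot is re-inserted below.
    compact = re.sub(r"[.+\s]", "", code)
    if len(compact) < 3:
        return code  # too short to repair; leave it for the validation step
    first = DIGIT_TO_LETTER.get(compact[0], compact[0])
    middle = "".join(LETTER_TO_DIGIT.get(c, c) for c in compact[1:3])
    tail = compact[3:5]  # keep at most two decimals (assumption)
    return f"{first}{middle}.{tail}" if tail else f"{first}{middle}"


for sample in ("F050", "B99+1", "J961+0", "Z04.880", "LO2.2", "null"):
    print(sample, "->", normalize_cim10(sample))
```

Stripping all separators first and re-inserting the dot after the third character handles the missing-dot, `+`, and excess-decimal cases with the same code path.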

2026-02-23 20:44:32 +01:00

Files (all last changed by f70d138db3, 2026-02-23 20:44:32 +01:00):

extractor
.gitignore
config.py
main.py
README.md
requirements.txt
setup.sh

README.md

T2A Extractor

Structured extraction of T2A audit reports (UCR decisions) from native and scanned PDFs.

Architecture

UCR PDF → [pymupdf/docTR] → raw text → [regex] → OGC blocks → [Ollama VLM] → JSON → [validation] → Excel/CSV

Pipeline

1. Text extraction: each page is automatically classified as native or scanned. pymupdf handles native pages, docTR runs OCR on scanned ones.
2. Segmentation: the text is split by regex into blocks per Champ and per OGC (individual and grouped).
3. Structured extraction: each block is sent to the local VLM (Ollama), which returns structured JSON.
4. Validation: CIM-10/CCAM codes are checked, along with the consistency of the decisions.
5. Export: formatted Excel (with color-coded decisions) and optional CSV.
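
As an illustration of step 3, here is a minimal sketch of sending one block to a local Ollama server through its `/api/generate` endpoint. The prompt text and the helper names (`build_payload`, `parse_reply`, `extract_block`) are hypothetical; the actual prompt and client code used by the project are not shown in this README.

```python
import json
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"  # Ollama's default address
OLLAMA_MODEL = "gemma3:27b-it-qat"          # default model named in this README

# Hypothetical prompt; the real one is not documented here.
PROMPT_TEMPLATE = (
    "Extract the UCR decision from the block below and answer with a single "
    "JSON object using the output schema keys.\n\n{block}"
)


def build_payload(block, model=OLLAMA_MODEL):
    """Build the request body for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(block=block),
        "stream": False,   # ask for one complete reply instead of chunks
        "format": "json",  # ask Ollama to constrain the output to JSON
    }


def parse_reply(body):
    """Decode the HTTP body, then the JSON string held in its 'response' field."""
    envelope = json.loads(body)
    return json.loads(envelope["response"])


def extract_block(block, base_url=OLLAMA_BASE_URL):
    """Send one OGC block to Ollama and return the decoded JSON object."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(build_payload(block)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as reply:
        return parse_reply(reply.read())
```

Keeping payload construction and reply parsing separate from the network call makes both halves testable without a running Ollama server.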

Output schema (11 columns)

| Column | Description |
|---|---|
| `champ` | Champ (audit field) number |
| `num_ogc` | OGC number |
| `type_desaccord` | Type of disagreement: DP / DAS / DP+DAS / Actes |
| `codes_etablissement` | CIM-10/CCAM codes submitted by the facility |
| `libelle_etablissement` | Label of the facility's coding |
| `codes_controleurs` | CIM-10/CCAM codes proposed by the auditors |
| `libelle_controleurs` | Label of the auditors' coding |
| `decision_ucr` | Favorable / Défavorable (from the facility's point of view) |
| `codes_retenus` | Codes finally retained |
| `ghm_ghs` | GHM/GHS, if mentioned |
| `texte_decision` | Full text of the UCR decision |
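
The schema above could be modelled and checked along these lines. This is a sketch under stated assumptions: the class name `DecisionRow`, the all-string field types, the code separators, and both regular expressions are illustrative, and the project's real validation rules may be stricter.

```python
import re
from dataclasses import dataclass, fields

# Assumed code shapes: CIM-10 as letter + 2 digits + optional 1-2 decimals,
# CCAM as 4 letters + 3 digits.
CIM10_RE = re.compile(r"^[A-Z]\d{2}(\.\d{1,2})?$")
CCAM_RE = re.compile(r"^[A-Z]{4}\d{3}$")
DECISIONS = {"Favorable", "Défavorable"}


@dataclass
class DecisionRow:
    """One output row; field names mirror the 11 columns of the schema."""
    champ: str = ""
    num_ogc: str = ""
    type_desaccord: str = ""
    codes_etablissement: str = ""
    libelle_etablissement: str = ""
    codes_controleurs: str = ""
    libelle_controleurs: str = ""
    decision_ucr: str = ""
    codes_retenus: str = ""
    ghm_ghs: str = ""
    texte_decision: str = ""


def split_codes(cell):
    """Split a cell on commas, semicolons, or whitespace (assumed separators)."""
    return [code for code in re.split(r"[,;\s]+", cell.strip()) if code]


def validate_row(row):
    """Return a list of warning strings; an empty list means the row looks fine."""
    warnings = []
    if row.decision_ucr not in DECISIONS:
        warnings.append(f"unexpected decision_ucr: {row.decision_ucr!r}")
    for name in ("codes_etablissement", "codes_controleurs", "codes_retenus"):
        for code in split_codes(getattr(row, name)):
            if not (CIM10_RE.match(code) or CCAM_RE.match(code)):
                warnings.append(f"{name}: malformed code {code!r}")
    return warnings


row = DecisionRow(decision_ucr="Favorable", codes_retenus="F050, HHFA001")
print(len(fields(DecisionRow)), validate_row(row))
```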

Installation

chmod +x setup.sh
./setup.sh

Usage

source .venv/bin/activate
python main.py rapport_ucr.pdf
python main.py rapport_ucr.pdf --csv --verbose
python main.py rapport_ucr.pdf -o /path/to/output --csv -v
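
For reference, a command line accepting the invocations above could be declared with `argparse` roughly like this. The long option name `--output`, its default, and the help strings are assumptions; only `-o`, `--csv`, `-v`, and `--verbose` appear in the usage examples.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare a CLI matching the usage examples (option details are assumed)."""
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Extract UCR decisions from a T2A audit report.",
    )
    parser.add_argument("pdf", type=Path, help="UCR report to process")
    # Only the short form -o is documented; --output and the default are assumed.
    parser.add_argument("-o", "--output", type=Path, default=Path("."),
                        help="output directory")
    parser.add_argument("--csv", action="store_true",
                        help="also write a CSV file")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress details")
    return parser


args = build_parser().parse_args(["rapport_ucr.pdf", "--csv", "-v"])
print(args.pdf, args.output, args.csv, args.verbose)
```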

Requirements

Python 3.12+
Ollama with a VLM (gemma3:27b-it-qat by default)
GPU recommended for docTR (it also runs on CPU)

Configuration

Edit config.py to adjust:

OLLAMA_MODEL: model to use
OLLAMA_BASE_URL: URL of the Ollama server
OCR_DPI: OCR resolution (default: 200)
NATIVE_TEXT_MIN_CHARS: threshold for native/scanned detection
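
To show how these settings might fit together, here is a sketch of a `config.py`-style module and of the per-page native/scanned decision. Only the model name and the DPI value come from this README; the base URL is Ollama's usual default, while the threshold value of 50, the `>=` comparison, and the helper names `is_native_page` and `route_pages` are hypothetical.

```python
# Values documented in this README.
OLLAMA_MODEL = "gemma3:27b-it-qat"
OCR_DPI = 200

# Assumed values: Ollama's usual local address and a hypothetical threshold.
OLLAMA_BASE_URL = "http://localhost:11434"
NATIVE_TEXT_MIN_CHARS = 50


def is_native_page(page_text, min_chars=NATIVE_TEXT_MIN_CHARS):
    """Treat a page as native when its embedded text layer is long enough."""
    return len(page_text.strip()) >= min_chars


def route_pages(page_texts):
    """Pick an extraction backend per page from its embedded text."""
    return ["pymupdf" if is_native_page(text) else "docTR" for text in page_texts]


# A page with a real text layer versus one with only whitespace.
print(route_pages(["Décision UCR : " + "x" * 100, "  \n"]))
```

A scanned page usually carries little or no embedded text, so a simple character count is enough to route it to OCR; the right threshold depends on the documents being processed.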