feat: T2A pipeline - anonymization, ICD-10 (CIM-10) extraction, and edsnlp integration · 4a12cd2676 - t2a

Dom/t2a

feat: T2A pipeline - anonymization, ICD-10 (CIM-10) extraction, and edsnlp integration

Complete pipeline for processing medical PDF documents:
- Text extraction (pdfplumber) and classification (Trackare/CRH)
- Multi-layer anonymization (regex + CamemBERT NER + sweep)
- Hybrid ICD-10 (CIM-10) medical extraction: edsnlp (AP-HP) enriches
  diagnoses, medications (ATC codes via Romedi), and negation,
  with a regex fallback for specific patterns
- Fix: sentencepiece pinned to <0.2.0 for CamemBERT compatibility
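
The diff for this commit is not visible here, so the following is only a minimal sketch of what a regex layer like the one described above might look like. It covers the regex step of the anonymization and a regex fallback for ICD-10 code shapes. It does not attempt the CamemBERT NER layer, the sweep, or the edsnlp integration. All names (`PII_PATTERNS`, `anonymize`, `find_icd10_codes`), the patterns, and the placeholder tags are hypothetical and are not taken from the repository.

```python
import re

# Hypothetical patterns for illustration only; the commit's real regexes are not shown.
PII_PATTERNS = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),          # e.g. 03/02/2026
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),      # French 10-digit numbers
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

# Assumed ICD-10 code shape: one letter, two digits, optional dot and 1-2 more digits.
ICD10_CODE = re.compile(r"\b[A-Z]\d{2}(?:\.?\d{1,2})?\b")


def anonymize(text: str) -> str:
    """Replace each regex match with a bracketed placeholder such as [DATE]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def find_icd10_codes(text: str) -> list[str]:
    """Return every substring that looks like an ICD-10 code (shape check only)."""
    return ICD10_CODE.findall(text)


if __name__ == "__main__":
    sample = "Patient vu le 03/02/2026, tel 06 12 34 56 78. Diagnostic : I10, E11.9."
    cleaned = anonymize(sample)
    print(cleaned)
    print(find_icd10_codes(cleaned))
```

A shape-only regex cannot tell a real code from a look-alike token, nor detect negation, which is presumably why the commit treats regex as a fallback and relies on edsnlp for the enriched extraction.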

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dom

2026-02-10 15:24:12 +01:00

commit 4a12cd2676

25 changed files with 7592 additions and 0 deletions

rapport_analyse_pdfs.md (normal file, 4075 changes): diff suppressed because it is too large
