feat: réduction FP + gazetteers adresses FINESS + batch parallèle + corrections multi-axes

- Token min length relevé de 2-3 → 4 chars (élimine FP EPO, IRC, SIB...)
- Stop-words enrichis : acronymes médicaux 3 lettres, termes pharma, soins infirmiers
- BDPM stop-words : ~7300 noms commerciaux + DCI/substances actives
- Gazetteers adresses FINESS : 63K patterns Aho-Corasick (position-preserving normalization)
- Filtre contextuel anatomique pour FINESS établissements
- Nouvelles regex : RE_CIVILITE_COMMA_LIST, RE_EXTRACT_NOM_UTILISE, RE_EXTRACT_PRENOM,
  RE_NUM_EXAMEN_PATIENT, RE_ADRESSE_LIEU_DIT, RE_CIVILITE_INITIALE, Dr X.NOM
- URLs complètes (RE_URL) + détection multiline
- N° venue inversé (layout-aware) + EPISODE/NDA dans _CRITICAL_PII_TYPES
- HospitalFilter désactivé pour ADRESSE/TEL/VILLE/EPISODE (identifient le patient)
- Batch silver export parallélisé (multiprocessing spawn, N workers)
- Seuil sur-masquage relevé à 8%, server.py enrichi (source regex/ner)
- Blacklist villes : COURANT, PARIS ; contexte villes étendu (UHCD, spécialités)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-16 09:26:56 +01:00
parent a827d860f1
commit 49ff464e6e
18 changed files with 358579 additions and 232 deletions

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff