feat: FP reduction + FINESS address gazetteers + parallel batch + multi-axis fixes

- Minimum token length raised from 2-3 to 4 chars (eliminates FPs: EPO, IRC, SIB...)
- Enriched stop-words: 3-letter medical acronyms, pharma terms, nursing care
- BDPM stop-words: ~7300 brand names + INN (DCI)/active substances
- FINESS address gazetteers: 63K Aho-Corasick patterns (position-preserving normalization)
- Anatomical context filter for FINESS establishments
- New regexes: RE_CIVILITE_COMMA_LIST, RE_EXTRACT_NOM_UTILISE, RE_EXTRACT_PRENOM, RE_NUM_EXAMEN_PATIENT, RE_ADRESSE_LIEU_DIT, RE_CIVILITE_INITIALE, Dr X.NOM
- Full URLs (RE_URL) + multiline detection
- Reversed visit number ("N° venue", layout-aware) + EPISODE/NDA added to _CRITICAL_PII_TYPES
- HospitalFilter disabled for ADRESSE/TEL/VILLE/EPISODE (these identify the patient)
- Silver batch export parallelized (multiprocessing spawn, N workers)
- Over-masking threshold raised to 8%; server.py enriched (source regex/ner)
- City blacklist: COURANT, PARIS; city context extended (UHCD, specialties)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
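The "position-preserving normalization" mentioned for the FINESS address gazetteers can be illustrated with a minimal sketch. The idea assumed here is that the normalized text keeps exactly one character per input character, so match offsets found in the normalized text can be used directly on the original text. The function names `normalize_keep_positions` and `find_gazetteer_hits` are hypothetical and not taken from the repository, and a naive `str.find` loop stands in for the Aho-Corasick automaton to keep the example dependency-free.

```python
import unicodedata


def normalize_keep_positions(text: str) -> str:
    """Uppercase and strip accents while keeping one output char per input char."""
    out = []
    for ch in text:
        # Decompose the character and drop combining marks (accents).
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        cand = base.upper()
        # Anything that is not a single alphanumeric char becomes a space,
        # so the output length always equals the input length.
        out.append(cand if len(cand) == 1 and cand.isalnum() else " ")
    return "".join(out)


def find_gazetteer_hits(text: str, patterns: list[str]) -> list[tuple[int, int, str]]:
    """Return (start, end, original_span) for every pattern occurrence.

    Naive scan used for illustration only; the commit mentions Aho-Corasick.
    """
    norm = normalize_keep_positions(text)
    hits = []
    for pattern in patterns:
        key = normalize_keep_positions(pattern)
        if not key.strip():
            continue  # skip empty or punctuation-only patterns
        start = norm.find(key)
        while start != -1:
            end = start + len(key)
            # Offsets are valid in the original text because lengths match.
            hits.append((start, end, text[start:end]))
            start = norm.find(key, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    sample = "Patient domicilié 12 rue de l'Église, Besançon."
    for hit in find_gazetteer_hits(sample, ["RUE DE L'EGLISE"]):
        print(hit)
```

Because the offsets refer to the original string, the matched span can be masked in place without re-aligning the text. The sketch assumes precomposed (NFC) input; decomposed input would need an extra alignment step.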
@@ -1,18 +1,18 @@
  {
- "date": "2026-03-12T10:24:59.261417",
+ "date": "2026-03-12T17:16:25.993851",
  "scores": {
  "global_score": 97.0,
  "leak_score": 100.0,
  "fp_score": 90,
  "totals": {
  "documents": 29,
- "audit_hits": 2797,
- "name_tokens_known": 461,
+ "audit_hits": 3186,
+ "name_tokens_known": 457,
  "leak_audit": 0,
  "leak_occurrences": 0,
  "leak_regex": 0,
  "leak_insee_high": 0,
- "leak_insee_medium": 569,
+ "leak_insee_medium": 570,
  "fp_medical": 0,
  "fp_overmasking": 2
  }
@@ -110,7 +110,7 @@
  "leak_audit": 0,
  "leak_regex": 0,
  "leak_insee_high": 0,
- "leak_insee_medium": 23,
+ "leak_insee_medium": 24,
  "fp_medical": 0,
  "fp_overmasking": 0
  },
@@ -206,7 +206,7 @@
  "leak_audit": 0,
  "leak_regex": 0,
  "leak_insee_high": 0,
- "leak_insee_medium": 32,
+ "leak_insee_medium": 33,
  "fp_medical": 0,
  "fp_overmasking": 0
  },
@@ -222,7 +222,7 @@
  "leak_audit": 0,
  "leak_regex": 0,
  "leak_insee_high": 0,
- "leak_insee_medium": 34,
+ "leak_insee_medium": 32,
  "fp_medical": 0,
  "fp_overmasking": 0
  },
@@ -246,7 +246,7 @@
  "leak_audit": 0,
  "leak_regex": 0,
  "leak_insee_high": 0,
- "leak_insee_medium": 26,
+ "leak_insee_medium": 27,
  "fp_medical": 0,
  "fp_overmasking": 0
  }
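
As a quick sanity check on the numbers in this diff, the changed totals can be compared with a few lines of Python. This is only a sketch: the values are copied from the hunks above, the helper name `totals_delta` is hypothetical, and treating the four later hunks as per-document entries is an assumption, since the diff does not label them.

```python
# Top-level totals copied from the first hunk (old value, then new value).
before = {"audit_hits": 2797, "name_tokens_known": 461, "leak_insee_medium": 569}
after = {"audit_hits": 3186, "name_tokens_known": 457, "leak_insee_medium": 570}

# leak_insee_medium changes in the four later hunks (assumed per-document).
per_hunk_changes = [(23, 24), (32, 33), (34, 32), (26, 27)]


def totals_delta(old: dict, new: dict) -> dict:
    """Return new - old for every key present in both dicts whose value changed."""
    return {
        key: new[key] - old[key]
        for key in old
        if key in new and new[key] != old[key]
    }


delta = totals_delta(before, after)
per_hunk_sum = sum(new - old for old, new in per_hunk_changes)

print(delta)
print(per_hunk_sum == delta["leak_insee_medium"])
```

Under that assumption, the per-hunk changes to `leak_insee_medium` sum to the same net change as the top-level total, which is what one would expect if the total is an aggregate of the per-document counts.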