feat: false-positive reduction + FINESS address gazetteers + parallel batch + multi-axis fixes

- Minimum token length raised from 2-3 to 4 chars (eliminates false positives such as EPO, IRC, SIB...)
- Expanded stop-words: 3-letter medical acronyms, pharma terms, nursing care
- BDPM stop-words: ~7300 brand names + INN (DCI)/active substances
- FINESS address gazetteers: 63K Aho-Corasick patterns (position-preserving normalization)
- Anatomical context filter for FINESS facilities
- New regexes: RE_CIVILITE_COMMA_LIST, RE_EXTRACT_NOM_UTILISE, RE_EXTRACT_PRENOM, RE_NUM_EXAMEN_PATIENT, RE_ADRESSE_LIEU_DIT, RE_CIVILITE_INITIALE, Dr X.NOM
- Full URLs (RE_URL) + multiline detection
- Inverted visit number (N° venue, layout-aware) + EPISODE/NDA added to _CRITICAL_PII_TYPES
- HospitalFilter disabled for ADRESSE/TEL/VILLE/EPISODE (they identify the patient)
- Silver batch export parallelized (multiprocessing spawn, N workers)
- Over-masking threshold raised to 8%; server.py enriched (regex/ner source)
- City blacklist: COURANT, PARIS; city context extended (UHCD, specialties)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
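The "position-preserving normalization" mentioned for the FINESS address gazetteers can be illustrated with a small sketch. It assumes the idea is to fold case and strip accents one character at a time, so that offsets found in the normalized text map directly back onto the original text. The function names below are hypothetical, and a plain `str.find` loop stands in for the Aho-Corasick automaton; this is not the project's actual implementation.

```python
import unicodedata


def normalize_keep_positions(text: str) -> str:
    """Lowercase and strip accents while keeping len(result) == len(text)."""
    out = []
    for ch in text:
        # NFD splits an accented letter into base letter + combining marks;
        # keeping only the first code point drops the accent.
        base = unicodedata.normalize("NFD", ch)[0]
        folded = base.lower()
        # A few characters grow when lowercased; keep the original in that
        # case so that every input character maps to exactly one output one.
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


def find_gazetteer_hits(text: str, patterns: list[str]) -> list[tuple[int, int, str]]:
    """Return (start, end, original_span) for each pattern occurrence.

    Naive multi-pattern search (assumption: the real code uses Aho-Corasick).
    """
    norm = normalize_keep_positions(text)
    hits = []
    for pattern in patterns:
        needle = normalize_keep_positions(pattern)
        if not needle:
            continue
        start = norm.find(needle)
        while start != -1:
            end = start + len(needle)
            # Offsets are valid in the original text because the
            # normalization never changes the string length.
            hits.append((start, end, text[start:end]))
            start = norm.find(needle, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    sample = "Patient adressé au 12 Rue de l'Église, 75001."
    print(find_gazetteer_hits(sample, ["12 rue de l'eglise"]))
```

Because the normalized string has the same length as the input, the matched span can be masked in the original document without any offset remapping.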
27
server.py
@@ -210,17 +210,34 @@ async def anonymize_text(
     final_text = selective_rescan(final_text, cfg=cfg)

     elapsed = time.time() - t0
-    audit_list = [
-        {"kind": h.kind, "original": h.original, "placeholder": h.placeholder, "page": h.page}
-        for h in anon.audit
-        if h.page != -1  # exclude global propagations
-    ]

+    # Include all hits (regex page≥0 + NER page=-1) with their source
+    ner_prefixes = ("NER_", "EDS_")
+    audit_list = []
+    ner_count = 0
+    regex_count = 0
+    for h in anon.audit:
+        is_ner = h.kind.startswith(ner_prefixes) or h.page == -1
+        entry = {
+            "kind": h.kind,
+            "original": h.original,
+            "placeholder": h.placeholder,
+            "page": h.page,
+            "source": "ner" if is_ner else "regex",
+        }
+        audit_list.append(entry)
+        if is_ner:
+            ner_count += 1
+        else:
+            regex_count += 1
+
     return {
         "text_anonymized": final_text,
         "audit": audit_list,
         "stats": {
             "pii_detected": len(audit_list),
+            "regex_count": regex_count,
+            "ner_count": ner_count,
             "elapsed_seconds": round(elapsed, 3),
             "ner_active": use_ner and _eds_manager is not None,
         },
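The source-tagging logic in this hunk can be exercised in isolation with the sketch below. `Hit` is a hypothetical stand-in for the entries of `anon.audit` (only the four attributes used in the diff are modelled), and `build_audit` is an illustrative helper name, not a function from server.py. One thing the sketch makes visible: any hit with `page == -1` is labelled `"ner"`, so if regex hits can also be propagated globally with `page == -1`, they would be counted as NER.

```python
from dataclasses import dataclass

NER_PREFIXES = ("NER_", "EDS_")


@dataclass
class Hit:
    """Hypothetical stand-in for one entry of anon.audit."""
    kind: str
    original: str
    placeholder: str
    page: int


def build_audit(hits):
    """Tag each hit with its source and count hits per source."""
    audit_list = []
    counts = {"regex": 0, "ner": 0}
    for h in hits:
        # Same rule as in the diff: NER-style kind prefix, or page == -1.
        is_ner = h.kind.startswith(NER_PREFIXES) or h.page == -1
        source = "ner" if is_ner else "regex"
        audit_list.append({
            "kind": h.kind,
            "original": h.original,
            "placeholder": h.placeholder,
            "page": h.page,
            "source": source,
        })
        counts[source] += 1
    stats = {
        "pii_detected": len(audit_list),
        "regex_count": counts["regex"],
        "ner_count": counts["ner"],
    }
    return audit_list, stats


if __name__ == "__main__":
    demo = [
        Hit("TEL", "01 02 03 04 05", "[TEL]", 0),
        Hit("NER_PER", "Dupont", "[NOM]", -1),
        Hit("NOM", "Dupont", "[NOM]", -1),  # regex kind, but page == -1
    ]
    audit, stats = build_audit(demo)
    print(stats)
    print([entry["source"] for entry in audit])
```

By construction `pii_detected` always equals `regex_count + ner_count`, which gives API consumers a simple consistency check.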