feat(phase2): Detect establishments with Aho-Corasick over 108K FINESS names

- New script build_finess_gazetteers.py: extracts distinctive names, cities, and numbers from the open data CSV
- Aho-Corasick automaton (pyahocorasick) for multi-pattern matching in ~1.7 ms/page
- 108K patterns indexed (multi-word names >= 8 chars, single words >= 10 chars)
- Blacklist of generic words (clinique, pharmacie, etc.) and medical stop words
- Position-preserving normalization (accents stripped, same length)
- Lazy construction of the AC automaton (after the stop words are loaded)
- Integration into _mask_line_by_regex and selective_rescan
- New gazetteer villes_finess.txt (11,660 cities)
- Results: "Girandières" → masked, "Côte Basque" → masked, 0 FP on common medical terms
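
The position-preserving normalization in the list above can be illustrated as follows. This is a minimal sketch, not the code from this commit: the function names (normalize_preserving_positions, find_spans, mask_spans) and the "[MASK]" token are hypothetical, and a plain str.find loop stands in for the pyahocorasick automaton so the example needs only the standard library. The point is that, because the normalized string has the same length as the original, offsets found in the normalized text can be applied directly to the original text.

```python
import unicodedata


def normalize_preserving_positions(text: str) -> str:
    """Lowercase and strip accents, emitting exactly one char per input char."""
    out = []
    for ch in text:
        # NFD splits an accented letter into base letter + combining marks;
        # keeping only the first code point drops the accent.
        base = unicodedata.normalize("NFD", ch)[0]
        lowered = base.lower()
        # Guard: if lowercasing ever yields more than one char, keep the
        # original char so that offsets stay aligned.
        out.append(lowered if len(lowered) == 1 else ch)
    return "".join(out)


def find_spans(text: str, patterns: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every pattern occurrence.

    Patterns are assumed to be already normalized. A real implementation
    would use an Aho-Corasick automaton here instead of repeated str.find.
    """
    norm = normalize_preserving_positions(text)
    spans = []
    for pat in patterns:
        start = norm.find(pat)
        while start != -1:
            spans.append((start, start + len(pat)))
            start = norm.find(pat, start + 1)
    return sorted(spans)


def mask_spans(text: str, spans: list[tuple[int, int]], token: str = "[MASK]") -> str:
    """Replace each span in the ORIGINAL text (spans assumed non-overlapping)."""
    # Work right to left so earlier offsets remain valid after each replacement.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + token + text[end:]
    return text
```

Since the accented original and its normalized form share offsets, no separate index map is needed between the matching step and the masking step.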

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 22:56:43 +01:00
parent 4488a1d4a0
commit 7a2af5c905
7 changed files with 132575 additions and 3504 deletions

File diff suppressed because it is too large

File diff suppressed because it is too large

@@ -101938,4 +101938,4 @@
980503346
980503395
980600027
980600035
980600035

@@ -113234,14 +113234,3 @@
0989324787
0989450002
0999991377
1473143031
1545915013
1677708604
1698408775
1749771871
2561752635
3080428670
3088102831
3134335540
4242745138
4326549224

11660
data/finess/villes_finess.txt Normal file

File diff suppressed because it is too large