feat(phase2): Detect establishments with Aho-Corasick over 108K FINESS names

- New script build_finess_gazetteers.py: extracts distinctive names, cities, and numbers from the open data CSV
- Aho-Corasick automaton (pyahocorasick) for multi-pattern matching at ~1.7 ms/page
- 108K patterns indexed (compound names >= 8 chars, single words >= 10 chars)
- Blacklist of generic words (clinique, pharmacie, etc.) and medical stop words
- Position-preserving normalization (accents removed, same length)
- Lazy construction of the AC automaton (after the stop words are loaded)
- Integration into _mask_line_by_regex and selective_rescan
- New gazetteer villes_finess.txt (11,660 cities)
- Results: "Girandières" → masked, "Côte Basque" → masked, 0 FP on common medical terms

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
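The position-preserving normalization and multi-pattern matching described in the message can be sketched as follows. This is an illustrative sketch only, not the code from this commit: the commit uses pyahocorasick, whereas the sketch below swaps in a small pure-Python Aho-Corasick automaton so it runs with the standard library alone. All function names (`normalize_keep_length`, `build_automaton`, `find_spans`, `mask`) are hypothetical, and the length thresholds, blacklist, and stop words are left out.

```python
import unicodedata
from collections import deque


def normalize_keep_length(text):
    """Lowercase and strip accents one character at a time.

    Each input character yields exactly one output character, so a match
    offset in the normalized text is also valid in the original text.
    """
    out = []
    for ch in text:
        base = unicodedata.normalize("NFD", ch)[0]  # drop combining marks
        low = base.lower()
        out.append(low if len(low) == 1 else ch)  # never change the length
    return "".join(out)


def build_automaton(patterns):
    """Build a trie with failure links (Aho-Corasick) over the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        if not pat:
            continue  # an empty pattern would match everywhere
        state = 0
        for ch in pat:
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        out[state].append(len(pat))  # remember the matched length
    # Breadth-first pass: depth-1 states keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit suffix matches
    return goto, fail, out


def find_spans(text, automaton):
    """Return (start, end) offsets in the ORIGINAL text for every match."""
    goto, fail, out = automaton
    state, spans = 0, []
    for i, ch in enumerate(normalize_keep_length(text)):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for length in out[state]:
            spans.append((i - length + 1, i + 1))
    return spans


def mask(text, spans, fill="*"):
    """Overwrite each span with a same-length run of the fill character."""
    chars = list(text)
    for start, end in spans:
        chars[start:end] = fill * (end - start)
    return "".join(chars)


# Hypothetical two-entry gazetteer; the real one holds ~108K names.
names = ["Girandières", "Côte Basque"]
automaton = build_automaton([normalize_keep_length(n) for n in names])
sample = "Séjour aux Girandières, puis clinique Côte Basque."
print(mask(sample, find_spans(sample, automaton)))
```

Because the normalization never changes the string length, the spans found in the normalized text can be applied directly to the original line, which is what makes in-place masking possible without an offset map.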
116606
data/finess/etablissements_distinctifs.txt
Normal file
File diff suppressed because it is too large
@@ -101938,4 +101938,4 @@
980503346
980503395
980600027
980600035
980600035
@@ -113234,14 +113234,3 @@
0989324787
0989450002
0999991377
1473143031
1545915013
1677708604
1698408775
1749771871
2561752635
3080428670
3088102831
3134335540
4242745138
4326549224
11660
data/finess/villes_finess.txt
Normal file
File diff suppressed because it is too large