feat(phase2): Fine-tuning CamemBERT-bio v2 (F1=0.90) + enrichissement données

- Fine-tuning camembert-bio-base : F1=0.903, Recall=0.930 (vs 0.89/0.85)
- Data augmentation : substitution noms INSEE (219K patronymes, x3 copies)
- Hard negatives BDPM (5.7K médicaments) + QUAERO (1319 termes médicaux)
- Annotations silver enrichies par gazetteers (+612 VILLE, +5 HOPITAL)
- Export silver avec support multi-répertoires (--extra-dir)
- Gazetteers QUAERO : CHEM, DISO, PROC, ANAT depuis DrBenchmark/QUAERO
- Gazetteers INSEE : noms de famille fréquents (96K) et complets (219K)
- Batch silver 1194 PDFs (run_batch_silver_export.py) pour dataset v3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-03-10 02:06:08 +01:00
parent 274e2fa586
commit c9572c383a
38 changed files with 318811 additions and 1406 deletions

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff