Initial commit with full extraction pipeline: PDF OCR (docTR), text segmentation, LLM extraction (Ollama), deterministic post-processing normalizer, validation, and Excel/CSV export.

The normalizer fixes OCR/LLM errors on CIM-10 codes:
- OCR digit→letter confusion in position 1 (1→I, 0→O, 5→S, 2→Z, 8→B)
- Missing dot separator (F050→F05.0, R410→R41.0)
- '+' instead of '.' (B99+1→B99.1, J961+0→J96.10)
- Excess decimals (Z04.880→Z04.88)
- OCR letter→digit in positions 2-3 (LO2.2→L02.2)
- Literal "null" string purge
- Auto-fill codes_retenus from decision context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
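The normalizer rules listed in the commit message can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function name `normalize_cim10` and both lookup tables are hypothetical, only the `O→0` pair for positions 2-3 is attested in the message (the other pairs are assumed by symmetry), and the cap at two decimals is generalized from the single `Z04.880→Z04.88` example. The `codes_retenus` auto-fill is not covered.

```python
from typing import Optional

# Position 1 must be a letter: map digits OCR tends to read in its place.
FIRST_CHAR_FIXES = {"1": "I", "0": "O", "5": "S", "2": "Z", "8": "B"}
# Positions 2-3 must be digits. Only O->0 appears in the commit message;
# the remaining pairs are an assumption (inverse of the table above).
DIGIT_FIXES = {"O": "0", "I": "1", "S": "5", "Z": "2", "B": "8"}


def normalize_cim10(raw: Optional[str]) -> Optional[str]:
    """Best-effort repair of a CIM-10 code; returns None for empty/"null"."""
    if raw is None:
        return None
    code = raw.strip().upper()
    if code in ("", "NULL"):
        return None  # purge the literal "null" string
    # Drop '.' and the '+' sometimes produced in its place.
    compact = code.replace(".", "").replace("+", "")
    if len(compact) < 3:
        return code  # too short to repair; leave it for validation
    first = FIRST_CHAR_FIXES.get(compact[0], compact[0])
    middle = "".join(DIGIT_FIXES.get(ch, ch) for ch in compact[1:3])
    decimals = compact[3:5]  # assumed cap: keep at most two decimals
    head = first + middle
    return f"{head}.{decimals}" if decimals else head


if __name__ == "__main__":
    for sample in ["F050", "B99+1", "J961+0", "Z04.880", "LO2.2", "null"]:
        print(sample, "->", normalize_cim10(sample))
```

Stripping both separators first and re-inserting the dot after the third character lets a single code path handle the missing-dot, `+`, and excess-decimal cases.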
18 lines
232 B
Plaintext
# T2A Extractor - Dependencies
# PDF
PyMuPDF>=1.24.0

# OCR
python-doctr[torch]>=0.9.0
torch>=2.0.0
torchvision>=0.15.0

# LLM
requests>=2.31.0

# Export
openpyxl>=3.1.0

# Validation (optional, for future use)
# pydantic>=2.0.0
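As a rough way to confirm that an environment satisfies the minimum versions above, the sketch below parses `name>=version` lines and compares them against installed distributions using the standard-library `importlib.metadata`. The helper names (`parse_requirements`, `version_tuple`, `check_installed`) are hypothetical, the file is assumed to be saved as `requirements.txt`, and the version comparison is deliberately naive (first three numeric components only); it is not a substitute for pip's resolver.

```python
import re
from importlib import metadata

# Matches lines such as "python-doctr[torch]>=0.9.0"; extras are ignored.
REQ_LINE = re.compile(r"^([A-Za-z0-9_.-]+)(\[[^\]]*\])?\s*>=\s*([0-9][0-9.]*)$")


def parse_requirements(text):
    """Return (name, minimum_version) pairs, skipping comments and blanks."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = REQ_LINE.match(line)
        if match:
            pairs.append((match.group(1), match.group(3)))
    return pairs


def version_tuple(version):
    """Naive version key: first three numeric components, zero-padded."""
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def check_installed(pairs):
    """Map each package name to 'ok', 'too old', or 'missing'."""
    report = {}
    for name, minimum in pairs:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = "ok" if ok else "too old"
    return report


if __name__ == "__main__":
    # Assumes the dependency list is stored as requirements.txt.
    with open("requirements.txt", encoding="utf-8") as handle:
        for name, status in check_installed(parse_requirements(handle.read())).items():
            print(f"{name}: {status}")
```

Because the `pydantic` line is commented out, it is skipped by the parser and never reported as missing.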