t2a-extractor/requirements.txt at f70d138db3d50d8776ddb359cf99e904f54442ae

Dom/t2a-extractor


dom f70d138db3 feat: T2A-Extractor pipeline with CIM-10 normalizer (31→0 warnings)

Initial commit with full extraction pipeline: PDF OCR (docTR), text
segmentation, LLM extraction (Ollama), deterministic post-processing
normalizer, validation, and Excel/CSV export.

The normalizer fixes OCR/LLM errors on CIM-10 codes:
- OCR digit→letter confusion in position 1 (1→I, 0→O, 5→S, 2→Z, 8→B)
- Missing dot separator (F050→F05.0, R410→R41.0)
- '+' instead of '.' (B99+1→B99.1, J961+0→J96.10)
- Excess decimals (Z04.880→Z04.88)
- OCR letter→digit in positions 2-3 (LO2.2→L02.2)
- Literal "null" string purge
- Auto-fill codes_retenus from decision context
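
The code-level rules above can be sketched in Python as follows. This is a minimal illustration and not the repository's actual normalizer: the function name `normalize_cim10`, the reverse letter→digit table (assumed to mirror the digit→letter table), and the two-decimal cap (inferred from the single Z04.880 example) are all assumptions. The codes_retenus auto-fill is not covered.

```python
import re
from typing import Optional

# Digit -> letter fixes for position 1, as listed in the commit message.
DIGIT_TO_LETTER = {"1": "I", "0": "O", "5": "S", "2": "Z", "8": "B"}
# Assumption: positions 2-3 use the mirror-image table (O->0, I->1, ...).
LETTER_TO_DIGIT = {letter: digit for digit, letter in DIGIT_TO_LETTER.items()}


def normalize_cim10(raw: Optional[str]) -> Optional[str]:
    """Hypothetical helper: apply the listed fixes to one CIM-10 code."""
    if raw is None:
        return None
    text = raw.strip()
    # Purge empty values and the literal string "null".
    if not text or text.lower() == "null":
        return None
    # Drop '.', '+' and whitespace so the dot can be re-inserted uniformly.
    compact = re.sub(r"[.+\s]", "", text.upper())
    if not compact:
        return None
    # Position 1 must be a letter: undo digit-for-letter OCR confusion.
    first = DIGIT_TO_LETTER.get(compact[0], compact[0])
    # Positions 2-3 must be digits: undo letter-for-digit OCR confusion.
    category = "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in compact[1:3])
    # Assumption: keep at most two characters after the dot.
    decimals = compact[3:5]
    code = first + category
    return f"{code}.{decimals}" if decimals else code


if __name__ == "__main__":
    for sample in ["F050", "B99+1", "J961+0", "Z04.880", "LO2.2", "110", "null"]:
        print(f"{sample!r} -> {normalize_cim10(sample)!r}")
```

Removing every separator first and then re-inserting a single dot after the third character lets one code path handle the missing-dot, '+' and excess-decimal cases.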

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-23 20:44:32 +01:00


 # T2A Extractor - Dependencies
 # PDF
 PyMuPDF>=1.24.0
 # OCR
 python-doctr[torch]>=0.9.0
 torch>=2.0.0
 torchvision>=0.15.0
 # LLM
 requests>=2.31.0
 # Export
 openpyxl>=3.1.0
 # Validation (optional, for future use)
 # pydantic>=2.0.0