feat(pipeline): OGC extraction via Qwen2.5-VL-3B

Modular pipeline replacing the extract_ogc.py monolith (kept as legacy for comparison).

Modules:
- ingest.py: PDF → PNG at 300 dpi, cached by SHA256
- ocr_qwen.py: singleton wrapper around Qwen2.5-VL-3B (bfloat16, ~7 GB VRAM)
- ocr_glm.py: GLM-OCR 0.9B wrapper (alternative, kept)
- classify.py: page-type detection + routing by standard index (the fixed order of the 6 OGC pages → -50% OCR calls)
- prompts.py: JSON schemas per page type (recueil, concertation 1/2/2/2, preuves) + classification keywords
- checkboxes.py: Accord/Désaccord detection by pixel density (inner-frac 0.35, 17/17 correct on a verified sample; GLM-OCR and Qwen both fail on checkboxes, cf. scratch/test_prompt_crop_v2.py)
- extract.py: orchestration for one dossier (ingest → classify → OCR → loop-tolerant JSON parsing + ATIH validation)
- persist.py: saves JSON + metadata (pipeline_version, ocr_model, timestamp)
- cli.py: `python -m pipeline.cli <pdf|dir>`

Measured time: ~35 s/dossier (6 pages) on an RTX 5070.

Qwen2.5-VL-3B was chosen after comparison with GLM-OCR 0.9B, GOT-OCR2.0, Surya and PaddleOCR (cf. scratch/). It correctly extracts dp_libelle, praticien_conseil and the 4 GHM/GHS where the others fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 15:05:40 +02:00
parent ddebd8dfbf
commit ed4d9bd765
10 changed files with 704 additions and 0 deletions
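
Note on the checkbox step: the commit message describes Accord/Désaccord detection by pixel density with inner-frac 0.35. The sketch below illustrates that general idea only; it is not the code from checkboxes.py. It assumes that inner-frac means the fraction of each side kept as a central window, that each checkbox arrives as a grayscale crop (here a plain list of rows, 0 = black, 255 = white), and that the darker of the two boxes is the ticked one. The names ink_density, pick_checked, make_box, dark_threshold and min_gap are hypothetical.

```python
def ink_density(box, inner_frac=0.35, dark_threshold=128):
    """Fraction of dark pixels in the central window of a grayscale crop.

    Assumption: inner_frac is the share of each side that is kept, so the
    printed border of the checkbox is excluded from the measurement.
    """
    h, w = len(box), len(box[0])
    ih = max(1, round(h * inner_frac))
    iw = max(1, round(w * inner_frac))
    top, left = (h - ih) // 2, (w - iw) // 2
    window = [row[left:left + iw] for row in box[top:top + ih]]
    dark = sum(1 for row in window for px in row if px < dark_threshold)
    return dark / (ih * iw)


def pick_checked(accord_box, desaccord_box, min_gap=0.05):
    """Return 'accord', 'desaccord', or None when the two boxes look alike."""
    a = ink_density(accord_box)
    d = ink_density(desaccord_box)
    if abs(a - d) < min_gap:
        return None  # ambiguous: neither box is clearly darker
    return "accord" if a > d else "desaccord"


def make_box(ticked, size=40):
    """Synthetic checkbox for the demo: 2 px dark border, optional X mark."""
    box = [[255] * size for _ in range(size)]
    for i in range(size):
        for b in (0, 1, size - 2, size - 1):
            box[b][i] = 0  # horizontal border lines
            box[i][b] = 0  # vertical border lines
    if ticked:
        for i in range(8, size - 8):
            box[i][i] = 0             # main diagonal of the X
            box[i][size - 1 - i] = 0  # anti-diagonal of the X
    return box


if __name__ == "__main__":
    print(pick_checked(make_box(True), make_box(False)))  # accord
```

Measuring only the central window keeps the printed frame of the box out of the count, so an empty box scores close to zero regardless of how thick its border is. Comparing the two boxes against each other, rather than against a fixed cut-off, is a design choice of this sketch and may differ from what checkboxes.py does.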
--- a/pipeline/cli.py
+++ b/pipeline/cli.py
@@ -0,0 +1,53 @@
+"""CLI: process a single PDF or a directory of PDFs.
+
+Usage:
+    python -m pipeline.cli <pdf|dir> [--out output/v2]
+"""
+import argparse
+import glob
+import sys
+import time
+from pathlib import Path
+
+from .extract import extract_dossier
+from .persist import save_result
+
+
+def main():
+    p = argparse.ArgumentParser(description="OGC pipeline v1 (Qwen2.5-VL-3B)")
+    p.add_argument("input", help="Single PDF or directory containing PDFs")
+    p.add_argument("--out", default="output/v2", help="JSON output directory")
+    p.add_argument("--quiet", action="store_true")
+    args = p.parse_args()
+
+    input_path = Path(args.input)
+    if input_path.is_dir():
+        pdfs = sorted(input_path.glob("*.pdf"))
+    elif input_path.is_file() and input_path.suffix.lower() == ".pdf":
+        pdfs = [input_path]
+    else:
+        # Fall back to globbing when the path contains spaces or patterns
+        pdfs = [Path(f) for f in sorted(glob.glob(str(input_path))) if f.lower().endswith(".pdf")]
+
+    if not pdfs:
+        print(f"No PDF found for: {args.input}")
+        return 1
+
+    print(f"{len(pdfs)} PDF(s) to process → {args.out}")
+    t0 = time.time()
+    for pdf in pdfs:
+        t_pdf = time.time()
+        try:
+            result = extract_dossier(pdf, verbose=not args.quiet)
+            out_path = save_result(result, args.out)
+            print(f"  ✓ {pdf.name} → {out_path}  ({time.time()-t_pdf:.1f}s)")
+        except Exception as e:
+            print(f"  ✗ {pdf.name} : {e}")
+            import traceback
+            traceback.print_exc()
+    print(f"Done in {time.time()-t0:.1f}s")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())