feat(pipeline): OGC extraction via Qwen2.5-VL-3B

Modular pipeline replacing the monolithic extract_ogc.py (kept as legacy for comparison).

Modules:
- ingest.py: PDF → PNG at 300 dpi, cached by SHA256
- ocr_qwen.py: singleton wrapper for Qwen2.5-VL-3B (bfloat16, ~7 GB VRAM)
- ocr_glm.py: wrapper for GLM-OCR 0.9B (alternative, kept)
- classify.py: page-type detection + routing by standard index (order of the 6 OGC pages → 50% fewer OCR calls)
- prompts.py: JSON schemas per page type (recueil, concertation 1/2/2/2, preuves) + classification keywords
- checkboxes.py: Accord/Désaccord detection by pixel density (inner-frac 0.35, 17/17 correct on a verified sample; GLM-OCR and Qwen both fail on checkboxes, cf. scratch/test_prompt_crop_v2.py)
- extract.py: orchestration for one case file (ingest → classify → OCR → loop-tolerant JSON parsing + ATIH validation)
- persist.py: saves JSON + metadata (pipeline_version, ocr_model, timestamp)
- cli.py: `python -m pipeline.cli <pdf|dir>`

Measured time: ~35 s per case file (6 pages) on an RTX 5070.

Qwen2.5-VL-3B was selected after comparison with GLM-OCR 0.9B, GOT-OCR2.0, Surya and PaddleOCR (cf. scratch/). It correctly extracts dp_libelle, praticien_conseil and the 4 GHM/GHS where the others fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
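
For readers unfamiliar with the pixel-density approach mentioned for checkboxes.py, here is a minimal sketch of the idea. It is not the code from this commit: the function names, the list-of-rows input format, the darkness threshold of 128, and the reading of "inner-frac 0.35" as "the central 35% of each crop side" are all assumptions made for illustration.

```python
def inner_dark_ratio(gray, inner_frac=0.35, dark_threshold=128):
    """Share of dark pixels in the central window of a grayscale crop.

    `gray` is a list of rows of 0-255 ints (hypothetical input format).
    Only the central `inner_frac` of each side is inspected, so the
    printed border of the checkbox does not count as ink.
    """
    h, w = len(gray), len(gray[0])
    ih = max(1, round(h * inner_frac))
    iw = max(1, round(w * inner_frac))
    top, left = (h - ih) // 2, (w - iw) // 2
    window = [row[left:left + iw] for row in gray[top:top + ih]]
    dark = sum(px < dark_threshold for row in window for px in row)
    return dark / (ih * iw)


def pick_checked(accord_crop, desaccord_crop, inner_frac=0.35):
    """Return the label of the denser box, or None when they tie."""
    a = inner_dark_ratio(accord_crop, inner_frac)
    d = inner_dark_ratio(desaccord_crop, inner_frac)
    if a == d:
        return None
    return "Accord" if a > d else "Désaccord"


def make_box(size=20, ticked=False):
    """Synthetic checkbox: black border, optional diagonal stroke."""
    box = [[255] * size for _ in range(size)]
    for i in range(size):
        box[0][i] = box[size - 1][i] = 0
        box[i][0] = box[i][size - 1] = 0
        if ticked:
            box[i][i] = 0
    return box


print(pick_checked(make_box(ticked=True), make_box()))
```

Comparing the two boxes against each other, rather than against a fixed density cutoff, keeps the decision independent of scan contrast; whether checkboxes.py does the same is not stated in the commit message.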
2026-04-24 15:05:40 +02:00
parent ddebd8dfbf
commit ed4d9bd765
10 changed files with 704 additions and 0 deletions
--- a/pipeline/ocr_glm.py
+++ b/pipeline/ocr_glm.py
@@ -0,0 +1,60 @@
+"""Singleton wrapper for GLM-OCR 0.9B."""
+import time
+from pathlib import Path
+import torch
+from transformers import AutoProcessor, AutoModelForImageTextToText
+
+MODEL_PATH = "zai-org/GLM-OCR"
+
+
+class GLMOCR:
+    """Loads GLM-OCR once and reuses the model for every page."""
+    _instance = None
+
+    def __new__(cls):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+            cls._instance._init_model()
+        return cls._instance
+
+    def _init_model(self):
+        t0 = time.time()
+        self.processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
+        self.model = AutoModelForImageTextToText.from_pretrained(
+            MODEL_PATH,
+            torch_dtype="auto",
+            device_map="auto",
+            trust_remote_code=True,
+        )
+        self.load_time = time.time() - t0
+        self.vram_gb = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
+
+    def run(self, image_path: str | Path, prompt: str, max_new_tokens: int = 4096) -> dict:
+        """Runs GLM-OCR on one image with a prompt; returns {text, elapsed_s}."""
+        image_path = str(image_path)
+        messages = [{
+            "role": "user",
+            "content": [
+                {"type": "image", "url": image_path},
+                {"type": "text", "text": prompt},
+            ],
+        }]
+        t0 = time.time()
+        inputs = self.processor.apply_chat_template(
+            messages,
+            tokenize=True,
+            add_generation_prompt=True,
+            return_dict=True,
+            return_tensors="pt",
+        ).to(self.model.device)
+        inputs.pop("token_type_ids", None)
+
+        with torch.no_grad():
+            generated_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
+        output = self.processor.decode(
+            generated_ids[0][inputs["input_ids"].shape[1]:],
+            skip_special_tokens=False,
+        )
+        # Strip the trailing <|user|> turn marker left in the decoded output
+        output = output.replace("<|user|>", "").strip()
+        return {"text": output, "elapsed_s": time.time() - t0}
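
The `__new__`-based singleton used by `GLMOCR` can be checked without downloading any weights. The sketch below reproduces only the construction pattern with a hypothetical stand-in class (`LazySingleton` and `load_count` are illustrative names, not part of the pipeline) to show that the expensive load runs once however many times the class is instantiated.

```python
class LazySingleton:
    """Stand-in for GLMOCR: same __new__ pattern, no real model."""
    _instance = None
    load_count = 0  # counts how many times the "model" was loaded

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._init_model()
        return cls._instance

    def _init_model(self):
        # In the real wrapper this is where from_pretrained() is called.
        type(self).load_count += 1
        self.model = object()


first = LazySingleton()
second = LazySingleton()
print(first is second, LazySingleton.load_count)
```

Because the load happens in `__new__` rather than `__init__`, a second `LazySingleton()` call returns the cached instance without re-running the loader. The pattern is not thread-safe as written, which is acceptable for a sequential CLI but would need a lock if pages were processed from several threads.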