spec: Complete architecture with VLM (5 detection layers)

- Add VLM documentation (Ollama qwen2.5vl:7b)
- Full pipeline: Regex → VLM → EDS-Pseudo → CamemBERT → Contextual
- New requirements REQ-013/REQ-014 for VLM optimization
- Phase 2.5 tasks: prompt improvement, cross-validation, performance
- ARCHITECTURE_REELLE.md document with full details
- Hardware: AMD Ryzen 9 9950X, 128GB RAM, RTX 5070 12GB
- Targets: Recall ≥99.5%, Precision ≥97%, F1 ≥0.98
2026-03-02 09:52:49 +01:00
parent cb84698c2d
commit 0067738df6
8 changed files with 3251 additions and 0 deletions
--- a/.kiro/specs/anonymization-quality-optimization/QUICKSTART.md
+++ b/.kiro/specs/anonymization-quality-optimization/QUICKSTART.md
@@ -0,0 +1,571 @@
+# Quick Start Guide
+
+## Installing Dependencies
+
+```bash
+# Install the new dependencies
+pip install pytest pytest-cov pydantic structlog jinja2 matplotlib
+
+# Verify the installation
+python -c "import pytest, pydantic, structlog, jinja2, matplotlib; print('✅ All dependencies are installed')"
+
+# Check CUDA availability
+python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else \"N/A\"}')"
+```
+
+## Environment Setup
+
+```bash
+# Create the required directories
+mkdir -p tests/ground_truth
+mkdir -p tests/unit
+mkdir -p tests/regression
+mkdir -p reports/quality
+mkdir -p evaluation
+mkdir -p detectors
+mkdir -p tools
+mkdir -p docs
+mkdir -p config
+
+# Check the structure
+tree -L 2 tests/ evaluation/ detectors/ tools/
+```
+
+## Phase 1: Creating the Test Dataset
+
+### Step 1.1: Select the Documents
+
+```bash
+# Go to the source directory
+cd "/home/dom/Téléchargements/II-1 Ctrl_T2A_2025_CHCB_DocJustificatifs (1)/"
+
+# List all PDFs
+find . -name "*.pdf" -type f > /tmp/all_pdfs.txt
+wc -l /tmp/all_pdfs.txt  # Count the total
+
+# Analyze the distribution per OGC folder
+for dir in */; do
+  count=$(find "$dir" -name "*.pdf" -type f | wc -l)
+  echo "$dir: $count PDFs"
+done | sort -t: -k2 -n
+
+# Manually select 30 documents:
+# - 10 simple (1-2 pages, little PII)
+# - 15 medium (3-5 pages, varied PII)
+# - 5 complex (>5 pages, many PII items)
+
+# Copy the selected documents
+# Example:
+cp "257_23209962/FC14.pdf" ~/path/to/project/tests/ground_truth/ogc_257_fc14.pdf
+cp "257_23209962/FC16.pdf" ~/path/to/project/tests/ground_truth/ogc_257_fc16.pdf
+# ... repeat for all 30 documents
+```
+
+### Step 1.2: Create the Annotation Tool
+
+```bash
+# Create the file
+cat > tools/annotation_tool.py << 'EOF'
+#!/usr/bin/env python3
+"""
+CLI annotation tool for building the test dataset.
+Usage: python tools/annotation_tool.py tests/ground_truth/
+"""
+import json
+from pathlib import Path
+import sys
+
+# TODO: Implement the annotation tool
+# See design.md section 2.1.2 for the specifications
+
+if __name__ == "__main__":
+    if len(sys.argv) < 2:
+        print("Usage: python tools/annotation_tool.py <ground_truth_dir>")
+        sys.exit(1)
+    
+    ground_truth_dir = Path(sys.argv[1])
+    print(f"Annotating documents in: {ground_truth_dir}")
+    # TODO: Implement the annotation logic
+EOF
+
+chmod +x tools/annotation_tool.py
+```
+
+### Step 1.3: Annotate the Documents
+
+```bash
+# Run the annotation tool
+python tools/annotation_tool.py tests/ground_truth/
+
+# For each PDF, the tool must:
+# 1. Extract and display the text
+# 2. Prompt for the PII (type, text, page, context)
+# 3. Prompt for the medical terms to preserve
+# 4. Save the result as JSON
+
+# Expected output format:
+# tests/ground_truth/ogc_257_fc14.pdf
+# tests/ground_truth/ogc_257_fc14.annotations.json
+```
+
+## Phase 2: Implementing the Evaluation
+
+### Step 2.1: Create the Quality Evaluator
+
+```bash
+# Create the file
+cat > evaluation/quality_evaluator.py << 'EOF'
+#!/usr/bin/env python3
+"""
+Anonymization quality evaluator.
+Compares detections against the manual annotations.
+"""
+from dataclasses import dataclass
+from pathlib import Path
+from typing import List, Dict
+import json
+
+@dataclass
+class EvaluationResult:
+    true_positives: int
+    false_positives: int
+    false_negatives: int
+    precision: float
+    recall: float
+    f1_score: float
+    missed_pii: List[Dict]
+    false_detections: List[Dict]
+
+class QualityEvaluator:
+    def __init__(self, ground_truth_dir: Path):
+        self.ground_truth_dir = ground_truth_dir
+    
+    def evaluate(self, pdf_path: Path, audit_path: Path) -> EvaluationResult:
+        # TODO: Implement the evaluation
+        # See design.md section 2.2.1 for the specifications
+        pass
+
+if __name__ == "__main__":
+    # TODO: Add CLI
+    pass
+EOF
+
+chmod +x evaluation/quality_evaluator.py
+```
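The fields of `EvaluationResult` follow directly from the match counts. The sketch below shows one way to derive them, assuming each PII item is compared as an exact `(type, text, page)` tuple; that matching rule and the function names are assumptions, not the method specified in design.md.

```python
def count_matches(expected: set, detected: set) -> tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for exact matches."""
    tp = len(expected & detected)
    fp = len(detected - expected)
    fn = len(expected - detected)
    return tp, fp, fn


def compute_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative data only: two correct detections, one spurious, one missed.
expected = {("NOM", "Dupont", 1), ("TEL", "06 12 34 56 78", 1), ("EMAIL", "a@chu-x.fr", 2)}
detected = {("NOM", "Dupont", 1), ("TEL", "06 12 34 56 78", 1), ("NOM", "Paris", 2)}
tp, fp, fn = count_matches(expected, detected)
precision, recall, f1 = compute_metrics(tp, fp, fn)
```

Exact matching is strict: a detection that covers only part of a name counts as both a false positive and a false negative, so a real evaluator may prefer overlap-based matching.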
+
+### Step 2.2: Create the Leak Scanner
+
+```bash
+# Create the file
+cat > evaluation/leak_scanner.py << 'EOF'
+#!/usr/bin/env python3
+"""
+PII leak scanner for anonymized documents.
+"""
+from dataclasses import dataclass
+from pathlib import Path
+from typing import List, Dict
+
+@dataclass
+class LeakReport:
+    is_safe: bool
+    leak_count: int
+    leaks: List[Dict]
+
+class LeakScanner:
+    def scan(self, anonymized_pdf: Path, original_audit: Path) -> LeakReport:
+        # TODO: Implement the scan
+        # See design.md section 2.2.2 for the specifications
+        pass
+
+if __name__ == "__main__":
+    # TODO: Add CLI
+    pass
+EOF
+
+chmod +x evaluation/leak_scanner.py
+```
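One simple way to implement the scan is to check whether any original value recorded in the audit file still appears in the text of the anonymized document. The sketch below assumes the text has already been extracted from the PDF and that each JSONL audit line stores the original value under an `original` key; both the key name and the helper names are hypothetical.

```python
import json
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not hide a leak."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def find_leaks(anonymized_text: str, audit_lines: Iterable[str], field: str = "original") -> list[dict]:
    """Return the audit entries whose original value is still present in the text."""
    haystack = normalize(anonymized_text)
    leaks = []
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL file
        entry = json.loads(line)
        needle = normalize(str(entry.get(field, "")))
        if needle and needle in haystack:
            leaks.append(entry)
    return leaks
```

Plain substring search can over-report for very short values, so a production scanner would likely add word-boundary checks or a minimum length.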
+
+### Step 2.3: Create the Benchmark
+
+```bash
+# Create the file
+cat > evaluation/benchmark.py << 'EOF'
+#!/usr/bin/env python3
+"""
+Performance benchmark for the anonymization system.
+"""
+from pathlib import Path
+import time
+import json
+
+class Benchmark:
+    def __init__(self, test_data_dir: Path):
+        self.test_data_dir = test_data_dir
+    
+    def run(self) -> dict:
+        # TODO: Implement the benchmark
+        # See design.md section 2.2.3 for the specifications
+        pass
+
+if __name__ == "__main__":
+    # TODO: Add CLI
+    pass
+EOF
+
+chmod +x evaluation/benchmark.py
+```
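The core of such a benchmark is a timing loop around the anonymization call. The sketch below is one possible shape, assuming a `process` callable that handles a single document; the function name and the report keys are hypothetical.

```python
import statistics
import time
from typing import Callable, Iterable


def time_documents(process: Callable[[str], object], documents: Iterable[str], iterations: int = 3) -> dict:
    """Time `process` on each document and summarize wall-clock durations in seconds.

    `iterations` must be at least 1, otherwise there is nothing to average.
    """
    report = {}
    for doc in documents:
        runs = []
        for _ in range(iterations):
            start = time.perf_counter()
            process(doc)
            runs.append(time.perf_counter() - start)
        report[doc] = {"runs": len(runs), "mean_s": statistics.mean(runs), "max_s": max(runs)}
    return report
```

`time.perf_counter()` is used because it is monotonic and has the highest available resolution, which makes it better suited to measuring durations than `time.time()`.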
+
+## Phase 3: Measuring the Baseline
+
+### Step 3.1: Anonymize the Test Documents
+
+```bash
+# Option 1: via the GUI
+python Pseudonymisation_Gui_V5.py
+# Select the tests/ground_truth/ folder
+# Start the anonymization
+
+# Option 2: via the CLI (if available)
+python anonymizer_core_refactored_onnx.py \
+  tests/ground_truth/*.pdf \
+  --output tests/ground_truth/anonymized/ \
+  --hf \
+  --raster
+```
+
+### Step 3.2: Evaluate the Baseline
+
+```bash
+# Evaluate each document
+python evaluation/quality_evaluator.py \
+  --ground-truth tests/ground_truth/ \
+  --anonymized tests/ground_truth/anonymized/ \
+  --output reports/baseline_evaluation.json
+
+# Generate the HTML report
+python evaluation/quality_evaluator.py \
+  --ground-truth tests/ground_truth/ \
+  --anonymized tests/ground_truth/anonymized/ \
+  --output reports/baseline_report.html \
+  --format html
+```
+
+### Step 3.3: Scan for Leaks
+
+```bash
+# Scan all anonymized documents
+for pdf in tests/ground_truth/anonymized/*.pdf; do
+  audit="${pdf%.pdf}.audit.jsonl"
+  python evaluation/leak_scanner.py \
+    --anonymized "$pdf" \
+    --audit "$audit" \
+    --output "reports/leak_$(basename "$pdf" .pdf).json"
+done
+
+# Generate a consolidated report
+python evaluation/leak_scanner.py \
+  --batch tests/ground_truth/anonymized/ \
+  --output reports/leak_report.html
+```
+
+### Step 3.4: Benchmark Performance
+
+```bash
+# Run the benchmark
+python evaluation/benchmark.py \
+  --test-dir tests/ground_truth/ \
+  --iterations 3 \
+  --output reports/baseline_benchmark.json
+
+# Show the results
+python evaluation/benchmark.py \
+  --show reports/baseline_benchmark.json
+```
+
+## Phase 4: Improving the Detectors
+
+### Step 4.1: Create the Improved Regexes
+
+```bash
+# Create the file
+cat > detectors/improved_regex.py << 'EOF'
+#!/usr/bin/env python3
+"""
+Improved regexes for PII detection.
+"""
+import re
+
+# Improved phone pattern (fragmented formats)
+RE_TEL_IMPROVED = re.compile(
+    r"(?<!\d)"
+    r"(?:"
+        r"(?:\+33|0033)\s*[1-9](?:[\s.\-]?\d){8}"
+        r"|"
+        r"0[1-9](?:[\s.\-]?\d){8}"
+        r"|"
+        r"0[1-9][\s.\-]?\d{1,2}[\s.\-]?\d{1,2}[\s\n]{1,3}\d{1,2}[\s.\-]?\d{1,2}[\s.\-]?\d{1,2}"
+    r")"
+    r"(?!\d)",
+    re.MULTILINE
+)
+
+# Improved email pattern (medical domains)
+RE_EMAIL_IMPROVED = re.compile(
+    r"\b[A-Za-z0-9._%+-]+"
+    r"@"
+    r"(?:"
+        r"(?:chu|ch|aphp|ap-hm|hospices-civils|clinique|hopital|ehpad)"
+        r"[\w\-]*\.[a-z]{2,}"
+        r"|"
+        r"[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
+    r")\b",
+    re.IGNORECASE
+)
+
+# TODO: Add the other improved regexes
+# See design.md section 2.3.1 for the specifications
+EOF
+```
+
+### Step 4.2: Create the Contextual Detector
+
+```bash
+# Create the file
+cat > detectors/contextual.py << 'EOF'
+#!/usr/bin/env python3
+"""
+Contextual detector for person names.
+"""
+import re
+from typing import List, Dict
+
+class ContextualDetector:
+    def __init__(self):
+        self.strong_contexts = [
+            r"(?:Dr\.?|Docteur|Pr\.?|Professeur)\s+{name}",
+            r"(?:Mme|Madame|M\.|Monsieur)\s+{name}",
+            r"Patient(?:e)?\s*:\s*{name}",
+        ]
+    
+    def detect(self, text: str, page: int) -> List[Dict]:
+        # TODO: Implement the contextual detection
+        # See design.md section 2.3.2 for the specifications
+        pass
+EOF
+```
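The `{name}` placeholder in `strong_contexts` has to be replaced by an actual name pattern before the expressions can be compiled. The sketch below shows one way to do that; the `NAME` pattern, the `"NOM"` type label and the `detect_names` helper are assumptions made for illustration, not the design.md specification.

```python
import re

# Hypothetical name pattern: one or two capitalized tokens
# (accented capitals, hyphens and apostrophes allowed).
NAME = r"(?P<name>[A-ZÀ-ÖØ-Ý][\w'\-]+(?:\s+[A-ZÀ-ÖØ-Ý][\w'\-]+)?)"

STRONG_CONTEXTS = [
    r"(?:Dr\.?|Docteur|Pr\.?|Professeur)\s+{name}",
    r"(?:Mme|Madame|M\.|Monsieur)\s+{name}",
    r"Patient(?:e)?\s*:\s*{name}",
]

# str.replace is used instead of str.format so that regex braces such as {8}
# would not be misread as format fields. The lookbehind keeps a title from
# matching in the middle of a longer word.
PATTERNS = [re.compile(r"(?<!\w)" + ctx.replace("{name}", NAME)) for ctx in STRONG_CONTEXTS]


def detect_names(text: str, page: int) -> list[dict]:
    """Return names found after a strong context, ordered by position, without duplicates."""
    hits = {}
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            start, end = match.span("name")
            hits[(start, end)] = {
                "type": "NOM",
                "text": match.group("name"),
                "page": page,
                "start": start,
                "end": end,
            }
    return [hits[span] for span in sorted(hits)]
```

The two-token name pattern is deliberately loose and can swallow a capitalized word that follows the name, so its output is best treated as candidates to be confirmed by the other detection layers.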
+
+### Step 4.3: Create the Hybrid Detector
+
+```bash
+# Create the file
+cat > detectors/hybrid.py << 'EOF'
+#!/usr/bin/env python3
+"""
+Hybrid detector combining several methods.
+"""
+from typing import List, Dict
+
+class HybridDetector:
+    def __init__(self):
+        # TODO: Initialize the detectors
+        pass
+    
+    def detect(self, text: str, page: int) -> List[Dict]:
+        # TODO: Implement the hybrid pipeline
+        # See design.md section 2.3.3 for the specifications
+        pass
+EOF
+```
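When several layers (regex, VLM, EDS-Pseudo, CamemBERT, contextual) report the same region, the hybrid detector needs a rule for combining their spans. The sketch below merges overlapping character spans and records which detectors agreed; the dictionary keys (`start`, `end`, `type`, `source`) and the merge rule are assumptions, not the pipeline specified in design.md.

```python
def merge_detections(detections: list[dict]) -> list[dict]:
    """Merge overlapping spans; the merged span keeps the type of its earliest detection."""
    # Sort by start, then longest first, so an enclosing span precedes the spans it contains.
    ordered = sorted(detections, key=lambda d: (d["start"], -d["end"]))
    merged: list[dict] = []
    for det in ordered:
        if merged and det["start"] < merged[-1]["end"]:
            last = merged[-1]
            last["end"] = max(last["end"], det["end"])
            last["sources"] = sorted(set(last["sources"]) | {det["source"]})
        else:
            merged.append({
                "start": det["start"],
                "end": det["end"],
                "type": det["type"],
                "sources": [det["source"]],
            })
    return merged
```

Taking the union of overlapping spans favors recall over precision, which is consistent with the stated targets (Recall ≥99.5%, Precision ≥97%). The `sources` list can also serve as a simple agreement signal between layers.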
+
+## Phase 5: Tests and Validation
+
+### Step 5.1: Create the Unit Tests
+
+```bash
+# Create the test files
+mkdir -p tests/unit
+
+# Regex tests
+cat > tests/unit/test_improved_regex.py << 'EOF'
+import pytest
+from detectors.improved_regex import RE_TEL_IMPROVED, RE_EMAIL_IMPROVED
+
+class TestImprovedRegex:
+    @pytest.mark.parametrize("phone,should_match", [
+        ("06 12 34 56 78", True),
+        ("0612345678", True),
+        ("06 12 34\n56 78", True),
+        ("12345678901", False),
+    ])
+    def test_phone_detection(self, phone, should_match):
+        match = RE_TEL_IMPROVED.search(phone)
+        assert (match is not None) == should_match
+    
+    # TODO: Add more tests
+EOF
+
+# Run the tests (python -m puts the project root on sys.path so `detectors` can be imported)
+python -m pytest tests/unit/ -v --cov=detectors
+```
+
+### Step 5.2: Create the Regression Tests
+
+```bash
+# Create the file
+cat > tests/regression/test_regression.py << 'EOF'
+import pytest
+from pathlib import Path
+from evaluation.quality_evaluator import QualityEvaluator
+
+class TestRegression:
+    def test_quality_metrics(self):
+        """Checks that the quality targets are met"""
+        evaluator = QualityEvaluator(Path("tests/ground_truth"))
+        # TODO: Implement the test
+        # Check Recall >= 99.5%, Precision >= 97%
+        pass
+    
+    def test_no_performance_degradation(self):
+        """Checks that performance has not regressed"""
+        # TODO: Implement the test
+        # Check time < 30s per PDF
+        pass
+EOF
+
+# Run the regression tests (python -m puts the project root on sys.path)
+python -m pytest tests/regression/ -v
+```
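The regression checks reduce to comparing measured metrics with the targets stated in the commit message (Recall ≥99.5%, Precision ≥97%, F1 ≥0.98). A minimal sketch, assuming the evaluator exposes its metrics as a plain dictionary; the dictionary keys and the `failed_thresholds` helper are hypothetical.

```python
# Targets taken from the commit message; metric values are fractions in [0, 1].
THRESHOLDS = {"recall": 0.995, "precision": 0.97, "f1_score": 0.98}


def failed_thresholds(metrics: dict) -> list[str]:
    """Return one human-readable message per metric below its target."""
    return [
        f"{name}={metrics[name]:.4f} < {minimum}"
        for name, minimum in THRESHOLDS.items()
        if metrics[name] < minimum
    ]


# Illustrative values only, not measured results.
example = {"recall": 0.990, "precision": 0.980, "f1_score": 0.985}
problems = failed_thresholds(example)
```

Inside a pytest test, `assert not problems, problems` would make the failure output list every metric that missed its target rather than only the first one.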
+
+### Step 5.3: Validate on the Full Corpus
+
+```bash
+# Anonymize all documents
+python Pseudonymisation_Gui_V5.py
+# Select: /home/dom/Téléchargements/II-1 Ctrl_T2A_2025_CHCB_DocJustificatifs (1)/
+# Start the anonymization
+
+# Scan for all leaks
+python evaluation/leak_scanner.py \
+  --batch "/home/dom/Téléchargements/II-1 Ctrl_T2A_2025_CHCB_DocJustificatifs (1)/anonymise/" \
+  --output reports/full_corpus_leak_report.html
+
+# Check that no critical leak is detected
+grep -i "CRITIQUE" reports/full_corpus_leak_report.html
+```
+
+## Useful Commands
+
+### Checking Code Quality
+
+```bash
+# Linter
+pylint detectors/ evaluation/ tools/
+
+# Formatter
+black detectors/ evaluation/ tools/
+
+# Type checker
+mypy detectors/ evaluation/ tools/
+
+# All at once
+pylint detectors/ && black --check detectors/ && mypy detectors/
+```
+
+### Generating the Documentation
+
+```bash
+# Generate the API docs with Sphinx (optional); sphinx-apidoc takes one module path per call
+for pkg in detectors evaluation tools; do sphinx-apidoc -o docs/api "$pkg"; done
+sphinx-build -b html docs/ docs/_build/
+
+# Or simply rely on docstrings
+python -m pydoc detectors.hybrid
+```
+
+### Exporting the Results
+
+```bash
+# Export the metrics as CSV
+python evaluation/quality_evaluator.py \
+  --ground-truth tests/ground_truth/ \
+  --output reports/metrics.csv \
+  --format csv
+
+# Export the charts
+python evaluation/quality_evaluator.py \
+  --ground-truth tests/ground_truth/ \
+  --output reports/charts/ \
+  --format charts
+```
+
+## Troubleshooting
+
+### Problem: Annotation takes too long
+
+**Solution**: Split the work between 2 annotators
+```bash
+# Annotator 1: documents 1-15
+python tools/annotation_tool.py tests/ground_truth/ --range 1-15
+
+# Annotator 2: documents 16-30
+python tools/annotation_tool.py tests/ground_truth/ --range 16-30
+```
+
+### Problem: GPU not detected
+
+**Solution**: Check the CUDA installation
+```bash
+# Check CUDA
+nvidia-smi
+
+# Check PyTorch CUDA
+python -c "import torch; print(torch.cuda.is_available())"
+
+# Reinstall PyTorch with CUDA if needed (the RTX 5070 needs a CUDA 12.8 build)
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+```
+
+### Problem: Insufficient VRAM
+
+**Solution**: Reduce the batch size
+```yaml
+# In config/quality_config.yml
+gpu:
+  batch_size: 8  # Reduce from 16 to 8
+  max_vram_gb: 8  # Lower the limit
+```
+
+### Problem: Processing too slow even with GPU
+
+**Solution**: Check that the GPU is actually being used
+```python
+# Add logging to the code
+import torch
+print(f"Device in use: {torch.cuda.get_device_name(0)}")
+print(f"VRAM used: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
+```
+
+### Problem: Insufficient RAM (unlikely with 128 GB)
+
+**Solution**: Reduce the number of workers
+```yaml
+# In config/quality_config.yml
+performance:
+  max_workers: 4  # Reduce from 8 to 4
+```
+
+## Resources
+
+- **Documentation**: `docs/`
+- **Specifications**: `.kiro/specs/anonymization-quality-optimization/`
+- **Tests**: `tests/`
+- **Reports**: `reports/`
+
+## Support
+
+For any question:
+1. See `requirements.md` for the requirements
+2. See `design.md` for the architecture
+3. See `tasks.md` for the detailed plan
+4. See `SUMMARY.md` for the executive summary