chore: add .gitignore
docs/analyse_t2a_v2.md
@@ -0,0 +1,365 @@
# Analysis report: t2a_v2

**Date**: 2026-02-19
**Codebase**: `/home/dom/ai/t2a_v2/`
**Branch**: `master` (HEAD: `5c8c281`)

---

## 1. Overview

| Metric | Value |
|--------|-------|
| Source files (.py) | 46 |
| Lines of code (src/) | 12 596 |
| Test files | 24 |
| Lines of tests | 7 095 |
| Test functions | 685 |
| Test/code ratio | 0.56 |
| Monoliths (>500L) | 10 files |
| Modules (sub-packages) | 7 |

### Comparison t2a v1 -> v2

| Aspect | t2a (v1) | t2a_v2 | Delta |
|--------|----------|--------|-------|
| Source lines | 10 508 | 12 596 | +2 088 (+20%) |
| Source files | 44 | 46 | +2 |
| Test/code ratio | 0.68 | 0.56 | -0.12 |
| Largest monolith | 1 227L | 1 352L | +125L |
| YAML config | None | 6 files | + flexibility |
| quality/ module | - | 1 226L | NEW |

---

## 2. Module structure

```
src/                                    12 596L total
|
+-- config.py                  (746L)   -- Unified config + Pydantic models + YAML loading
+-- main.py                    (640L)   -- Main CLI orchestrator
|
+-- anonymization/             (904L)   -- NER + regex anonymization
|   +-- anonymizer.py          (529L)
|   +-- entity_registry.py     (86L)
|   +-- ner_anonymizer.py      (95L)
|   +-- regex_patterns.py      (194L)
|
+-- control/                   (1161L)  -- CPAM controls
|   +-- cpam_parser.py         (115L)   -- CPAM Excel parsing
|   +-- cpam_response.py       (1046L)  -- Multi-pass counter-argumentation
|
+-- export/                    (190L)   -- RUM export
|   +-- rum_export.py          (190L)
|
+-- extraction/                (928L)   -- PDF document extraction
|   +-- trackare_parser.py     (424L)
|   +-- crh_parser.py          (129L)
|   +-- document_splitter.py   (124L)
|   +-- document_classifier.py (94L)
|   +-- page_tracker.py        (91L)
|   +-- pdf_extractor.py       (66L)
|
+-- medical/                   (5323L)  -- Business core: CIM-10/CCAM/RAG
|   +-- cim10_extractor.py     (1352L)  -- CIM-10 code extraction (MONOLITH)
|   +-- rag_search.py          (849L)   -- RAG FAISS + embedding + reranking
|   +-- rag_index.py           (803L)   -- Dual FAISS index (ref + proc)
|   +-- clinical_context.py    (315L)   -- Clinical context enrichment
|   +-- fusion.py              (294L)   -- Multi-PDF merge
|   +-- cim10_dict.py          (243L)   -- CIM-10 dictionary
|   +-- severity.py            (242L)   -- Severity computation + CMA levels
|   +-- ghm.py                 (231L)   -- GHM estimation
|   +-- ccam_dict.py           (191L)   -- CCAM dictionary
|   +-- exclusion_rules.py     (169L)   -- Filtering of impossible codes
|   +-- das_filter.py          (152L)   -- 11 rules filtering noisy DAS
|   +-- edsnlp_pipeline.py     (140L)   -- edsnlp wrapper (optional)
|   +-- ollama_client.py       (135L)   -- Ollama client + retry + JSON
|   +-- ccam_noncumul.py       (122L)   -- CCAM non-cumulation rules
|   +-- ollama_cache.py        (85L)    -- Persistent JSON cache
|
+-- quality/                   (1226L)  -- NEW: deterministic post-LLM quality layer
|   +-- decision_engine.py     (609L)   -- KEEP/DOWNGRADE/REMOVE decisions
|   +-- veto_engine.py         (411L)   -- Vetoes + contestability
|   +-- rules_router.py        (205L)   -- Dynamic routing of rule packs
|
+-- viewer/                    (1478L)  -- Flask web interface
    +-- app.py                 (872L)   -- Routes + dashboard
    +-- validation.py          (272L)   -- Validation manager (DIM mode)
    +-- referentiels.py        (160L)   -- Reference document upload/indexing
    +-- pdf_redactor.py        (154L)   -- Source PDF redaction
    +-- __main__.py            (20L)
```

---

## 3. Execution pipeline

### 3.1 CLI (`python -m src.main`)

```
main()
+-- For each PDF:
|   +-- extract_text_with_pages()        [extraction/pdf_extractor.py]
|   +-- classify()                       [extraction/document_classifier.py]
|   +-- split_documents()                [extraction/document_splitter.py]
|   +-- parse_trackare() or parse_crh()  [extraction/]
|   +-- Anonymizer.anonymize()           [anonymization/anonymizer.py]
|   +-- _run_edsnlp()                    [medical/edsnlp_pipeline.py] (optional)
|   +-- extract_medical_info()           [medical/cim10_extractor.py] << MONOLITH
|   |   +-- RAG FAISS + Ollama (if RAG is enabled)
|   |   +-- CIM-10 dict validation + supplements
|   |   +-- CCAM procedure extraction
|   |   +-- Biology confidence scoring
|   +-- build_rules_runtime_context()    [quality/rules_router.py]
|   +-- apply_vetos()                    [quality/veto_engine.py]
|   +-- apply_decisions()                [quality/decision_engine.py]
|   +-- _compute_metrics()               [main.py]
|   +-- estimate_ghm()                   [medical/ghm.py]
|   +-- write_outputs()                  [main.py]
|
+-- FUSION (if several PDFs belong to the same group)
|   +-- merge_dossiers()                 [medical/fusion.py]
|   +-- Re-apply vetoes/decisions
|   +-- Re-estimate GHM
|
+-- CPAM CONTROL (if an Excel file is detected)
    +-- match_dossier_ogc()              [control/cpam_parser.py]
    +-- generate_cpam_response()         [control/cpam_response.py] << MONOLITH
        +-- Pass 1: structured extraction
        +-- 5 targeted RAG queries
        +-- Pass 2: 3-axis argumentation
        +-- Pass 3: adversarial validation
```

### 3.2 Viewer (`python -m src.viewer --debug`)

- Dashboard: `/` -- dossier list + stats
- Detail: `/dossier/<nom>` -- CIM-10 codes, DAS, CPAM, GHM
- Admin: `/admin/models` -- Ollama model management
- Reference documents: `/referentiels` -- PDF upload/indexing
- Validation: DIM mode to validate/correct codes

### 3.3 CLI flags

```bash
--no-ner            # Disable NER anonymization
--no-edsnlp         # Disable the edsnlp pipeline
--no-rag            # Disable RAG (LLM only)
--build-dict        # Rebuild the CIM-10 dictionary
--build-ccam-dict   # Rebuild the CCAM dictionary
--rebuild-index     # Rebuild the FAISS index
--export-rum        # RUM V016 export
--control-cpam      # CPAM Excel file for counter-argumentation
```

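As an illustration, the flags listed in section 3.3 could be declared with `argparse` roughly as follows. This is a minimal sketch, not the content of `main.py`: the help strings, the `build_parser()` name, and the assumption that `--control-cpam` takes a file path are all hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="python -m src.main")
    # Boolean switches: absent means the feature stays enabled.
    parser.add_argument("--no-ner", action="store_true", help="Disable NER anonymization")
    parser.add_argument("--no-edsnlp", action="store_true", help="Disable the edsnlp pipeline")
    parser.add_argument("--no-rag", action="store_true", help="Disable RAG (LLM only)")
    parser.add_argument("--build-dict", action="store_true", help="Rebuild the CIM-10 dictionary")
    parser.add_argument("--build-ccam-dict", action="store_true", help="Rebuild the CCAM dictionary")
    parser.add_argument("--rebuild-index", action="store_true", help="Rebuild the FAISS index")
    parser.add_argument("--export-rum", action="store_true", help="Export RUM V016")
    # Assumption: the flag carries the path of the CPAM Excel file.
    parser.add_argument("--control-cpam", metavar="XLSX", default=None,
                        help="CPAM Excel file for counter-argumentation")
    return parser


args = build_parser().parse_args(["--no-rag", "--control-cpam", "controle.xlsx"])
use_rag = not args.no_rag  # negative flag turned into a positive setting
```

Turning the negative flags into positive settings once, right after parsing, keeps the rest of the pipeline free of double negations.
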
---

## 4. New module: quality/

The `src/quality/` module is the major architectural addition in v2. It implements the deterministic post-LLM validation layer.

### decision_engine.py (609L)

Post-processes the codes proposed by the LLM. Each code receives a decision:
- **KEEP**: valid code, retained
- **DOWNGRADE**: reduced confidence (e.g. an R00-R99 symptom code alongside a precise diagnosis)
- **REMOVE**: code rejected (invalid, redundant, or irrelevant)

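The three outcomes can be sketched as below. This is an illustrative reduction, not the logic of `decision_engine.py`: the `CodeProposal` type, the `decide()` function, and the exact rules (unknown or duplicate code is removed, symptom code is downgraded when a precise diagnosis is present) are assumptions chosen to match the bullet descriptions.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    KEEP = "KEEP"
    DOWNGRADE = "DOWNGRADE"
    REMOVE = "REMOVE"


@dataclass
class CodeProposal:
    code: str          # CIM-10 code proposed by the LLM, e.g. "R05"
    confidence: float  # LLM confidence, unused by this simplified sketch


def is_symptom(code: str) -> bool:
    """CIM-10 chapter XVIII (R00-R99) covers symptoms and signs."""
    return code.upper().startswith("R")


def decide(proposals, known_codes):
    """Return (code, Decision) pairs, in input order."""
    # A "precise" diagnosis here means any valid non-symptom code.
    has_precise = any(
        p.code in known_codes and not is_symptom(p.code) for p in proposals
    )
    seen = set()
    results = []
    for p in proposals:
        if p.code not in known_codes or p.code in seen:
            results.append((p.code, Decision.REMOVE))     # invalid or redundant
        elif is_symptom(p.code) and has_precise:
            results.append((p.code, Decision.DOWNGRADE))  # symptom next to a diagnosis
        else:
            results.append((p.code, Decision.KEEP))
        seen.add(p.code)
    return results
```

Because the function is pure and deterministic, the same LLM output always yields the same decisions, which is what makes this layer straightforward to unit test.
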
### veto_engine.py (411L)

Detects deterministic vetoes:
- Negation in the source text
- Conditional wording (unconfirmed diagnoses)
- Medical history not relevant to the stay
- Conflicts between codes

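A negation or conditional veto can be approximated by looking for cue phrases just before the diagnosis mention. The sketch below assumes French clinical text; the cue lists, the 40-character window, and the `find_veto()` name are hypothetical and far simpler than what a 411-line engine presumably does.

```python
import re
from typing import Optional


def _cue_pattern(cues):
    # (?<!\w) prevents a cue from matching in the middle of a longer word.
    return re.compile("|".join(r"(?<!\w)" + re.escape(c) for c in cues), re.IGNORECASE)


# Hypothetical cue lists, for illustration only.
NEGATION = _cue_pattern(["pas de ", "pas d'", "absence de ", "absence d'", "sans "])
CONDITIONAL = _cue_pattern(["suspicion de ", "suspicion d'", "possible ", "probable "])


def find_veto(text: str, term: str, window: int = 40) -> Optional[str]:
    """Return 'NEGATION', 'CONDITIONAL' or None for the first mention of term."""
    idx = text.lower().find(term.lower())
    if idx == -1:
        return None  # term not mentioned, nothing to veto
    left_context = text[max(0, idx - window):idx]
    if NEGATION.search(left_context):
        return "NEGATION"
    if CONDITIONAL.search(left_context):
        return "CONDITIONAL"
    return None
```

Only the first occurrence of the term is inspected here; a real implementation would need to consider every mention and sentence boundaries.
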
### rules_router.py (205L)

Dynamically routes rule packs according to signals in the dossier:
- Biology pack enabled when lab values are present
- CPAM pack enabled when a CPAM control is detected
- Configured via `config/rules/router.yaml`

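Signal-based routing of this kind can be sketched as a table of predicates. Everything named here is hypothetical: the `ROUTES` table stands in for whatever `router.yaml` contains, the dossier keys `lab_values` and `cpam_control` are invented, and the always-active `base` pack is an assumption.

```python
from typing import Any, Callable

# Hypothetical stand-in for config/rules/router.yaml: pack name -> activation signal.
ROUTES: dict[str, Callable[[dict], bool]] = {
    "bio": lambda dossier: bool(dossier.get("lab_values")),
    "cpam": lambda dossier: dossier.get("cpam_control") is not None,
}


def select_packs(dossier: dict[str, Any]) -> list[str]:
    """Return the rule packs to activate for one dossier."""
    packs = ["base"]  # assumption: a base pack always applies
    packs += [name for name, signal in ROUTES.items() if signal(dossier)]
    return packs
```

For example, `select_packs({"lab_values": {"Na": 128}})` activates the base and biology packs, while an empty dossier only gets the base pack.
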
### Associated YAML configuration

```
config/
+-- reference_ranges.yaml     -- Normal biological ranges (adult/child)
+-- bio_rules.yaml            -- Rules for hyponatremia, hyperkalemia, etc.
+-- lab_value_sanity.yaml     -- OCR guard rails (K, Na, platelets, Hb, etc.)
+-- rules/
    +-- base.yaml             -- Full rule catalog
    +-- enabled.yaml          -- Activation overlay
    +-- router.yaml           -- Pack routing by signals
```

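The catalog-plus-overlay split between `base.yaml` and `enabled.yaml` could work along these lines. The sketch uses plain dicts in place of the parsed YAML, and the schema (a per-rule `enabled` default overridden by a flat `rule_id -> bool` overlay) is an assumption, since the report does not show the file contents.

```python
def apply_overlay(base_rules: dict, enabled: dict) -> dict:
    """Return the catalog entries left active once the overlay is applied."""
    active = {}
    for rule_id, rule in base_rules.items():
        default = rule.get("enabled", True)  # catalog default when the overlay is silent
        if enabled.get(rule_id, default):    # the overlay wins when it names the rule
            active[rule_id] = rule
    return active


# Hypothetical rule ids and fields, for illustration only.
base_rules = {
    "BIO-HYPONATREMIA": {"enabled": True, "pack": "bio"},
    "BIO-HYPERKALEMIA": {"enabled": False, "pack": "bio"},
    "DAS-NOISE": {"pack": "base"},
}
overlay = {"BIO-HYPONATREMIA": False, "BIO-HYPERKALEMIA": True}
active_rules = apply_overlay(base_rules, overlay)
```

Keeping activation in a separate overlay means a rule can be switched off for one deployment without editing the shared catalog.
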
---

## 5. Growth by module (v1 -> v2)

| Module | t2a (v1) | t2a_v2 | Delta | % |
|--------|----------|--------|-------|---|
| anonymization | 904 | 904 | 0 | 0% |
| control | 1 062 | 1 161 | +99 | +9% |
| extraction | 928 | 928 | 0 | 0% |
| medical | 4 912 | 5 323 | +411 | +8% |
| viewer | 1 486 | 1 478 | -8 | -0.5% |
| **quality** | **0** | **1 226** | **+1 226** | **NEW** |
| root (config+main) | 669 | 1 386 | +717 | +107% |
| **TOTAL** | **10 508** | **12 596** | **+2 088** | **+20%** |

The table does not break out `export/` (190L in v2), so the listed rows do not sum to the totals.

Growth comes mainly from:
1. **quality/** (+1 226L): new deterministic module
2. **config.py** (+484L): YAML loading, rules context, additional Pydantic models
3. **main.py** (+233L): multi-PDF fusion, vetoes/decisions, metrics

---

## 6. Identified monoliths (>500L)

| # | File | Lines | Responsibilities |
|---|------|-------|------------------|
| 1 | cim10_extractor.py | 1 352 | LLM extraction + validation + filtering + RAG |
| 2 | cpam_response.py | 1 046 | Multi-query CPAM RAG + prompt engineering |
| 3 | app.py | 872 | Flask routes + dashboard + admin |
| 4 | rag_search.py | 849 | Embedding + reranker + FAISS + generation |
| 5 | rag_index.py | 803 | Dual indexing + CIM-10 chunking |
| 6 | config.py | 746 | Config + Pydantic + YAML loading |
| 7 | main.py | 640 | Full pipeline orchestration |
| 8 | decision_engine.py | 609 | KEEP/DOWNGRADE/REMOVE decisions |
| 9 | anonymizer.py | 529 | 3 anonymization phases |
| 10 | veto_engine.py | 411 | Vetoes + contestability |

Note: `veto_engine.py` (411L) is listed although it falls below the 500L threshold.

---

## 7. Tests

### Coverage by file (top 10)

| Test file | Lines | Functions |
|-----------|-------|-----------|
| test_cpam_response.py | 1 289 | 75 |
| test_rag.py | 1 089 | 72 |
| test_medical.py | 686 | 94 |
| test_fusion.py | 493 | 33 |
| test_viewer.py | 299 | 31 |
| test_das_llm.py | 272 | 13 |
| test_clinical_context.py | 264 | 36 |
| test_das_filter.py | 260 | 67 |
| test_justification.py | 245 | 13 |
| test_rum_export.py | 212 | 29 |

### Under-tested areas

- **quality/**: new module, no dedicated test file visible
- **rag_index.py**: 803L with no specific test (exercised through test_rag.py)
- **Overall ratio**: 0.56 (down from 0.68 in v1) -- the code grew faster than the tests

---

## 8. External dependencies

| Package | Role | Criticality |
|---------|------|-------------|
| pdfplumber | PDF extraction | High |
| PyMuPDF | Alternative PDF backend + redaction | High |
| torch + transformers | HF models | High |
| sentence-transformers | RAG embeddings | High |
| faiss-cpu | Semantic index | High |
| edsnlp | French medical NLP | Medium (optional) |
| flask | Web viewer | Medium |
| pydantic | Data validation | High |
| requests | HTTP client (Ollama) | High |
| openpyxl + pandas | CPAM Excel parsing | Medium |
| PyYAML | YAML configuration | High (v2) |

---

## 9. Global variables and thread safety

### Thread-safe

| Module | Variable | Technique |
|--------|----------|-----------|
| config.py | `_RULES_RUNTIME_CTX` | contextvars.ContextVar |
| rag_search.py | `_embed_model` | Lock + double-check + sentinel |
| rag_search.py | `_reranker_model` | Lazy singleton |
| cim10_dict.py | `_dict_cache` | @lru_cache(maxsize=1) |
| ccam_dict.py | `_dict_cache` | @lru_cache(maxsize=1) |
| ollama_cache.py | JSON | File-based lock (fcntl) |

### Not thread-safe (risk)

| Module | Variable | Risk |
|--------|----------|------|
| main.py:139 | `_use_edsnlp` | Race condition in multi-threaded batch |
| main.py:141 | `_use_rag` | Race condition in multi-threaded batch |

---

## 10. Technical debt

### High priority

| # | Description | File | Impact |
|---|-------------|------|--------|
| T1 | Flags `_use_edsnlp`, `_use_rag` not thread-safe | main.py | Unpredictable behavior in batch |
| T2 | cim10_extractor.py (1352L) mixes 4+ responsibilities | medical/ | Testability, maintenance |
| T3 | cpam_response.py (1046L) -- hard-coded prompts, no templates | control/ | Versioning, A/B testing |
| T4 | Missing docstrings on extract_medical_info() | cim10_extractor.py | API documentation |
| T5 | `except Exception:` without re-raise in main.py | main.py | Silent bugs |

### Medium priority

| # | Description | File | Impact |
|---|-------------|------|--------|
| T6 | Hard-coded LLM prompts (~50 lines) | cim10_extractor.py | Versioning |
| T7 | No pytest-cov -> coverage unknown | tests/ | Regression risk |
| T8 | Ollama cache has no TTL and grows indefinitely | ollama_cache.py | Disk usage |
| T9 | GHM estimated for only 28% of dossiers | ghm.py | Incomplete reporting |
| T10 | quality/ has no dedicated tests | tests/ | Insufficient coverage |

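For T8, a time-to-live could be enforced by pruning stale entries whenever the cache is loaded or saved. The sketch below assumes each cache entry stores a `created_at` epoch timestamp, which the report does not confirm; `prune_cache()` is a hypothetical helper.

```python
import time
from typing import Optional


def prune_cache(entries: dict, ttl_seconds: float, now: Optional[float] = None) -> dict:
    """Return a copy of the cache without entries older than ttl_seconds.

    Assumption: every entry is a dict carrying a 'created_at' epoch timestamp.
    """
    now = time.time() if now is None else now
    return {
        key: entry
        for key, entry in entries.items()
        if now - entry["created_at"] <= ttl_seconds
    }
```

Passing `now` explicitly keeps the function deterministic in tests; production code would rely on the `time.time()` default.
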
### Low priority

| # | Description | File | Impact |
|---|-------------|------|--------|
| T11 | Viewer pagination (500+ dossiers) | viewer/app.py | UX |
| T12 | Sparse CCAM extraction (~1 per dossier) | cim10_extractor.py | Completeness |
| T13 | Vetoes/decisions applied twice (PDF + fusion) -- duplicated code | main.py | Maintenance |

---

## 11. Architectural strengths

1. **Deterministic quality/ layer**: the LLM proposes, the rules engine decides -- consistent with medical-AI principles
2. **Multi-pass CPAM pipeline**: extraction -> argumentation -> adversarial validation, potentially with different models
3. **Editable YAML configuration**: rules, bio thresholds, and dynamic routing without touching the code
4. **Graceful fallbacks**: CUDA->CPU (embedding), Ollama->Anthropic (LLM), optional edsnlp
5. **Dual-index RAG**: reference documents and procedures kept separate for better precision
6. **Multi-PDF fusion**: native handling of dossiers split across several files
7. **Traceability**: tags such as [BIO-1], [IMG-2] in CPAM arguments

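Traceability tags of this shape are easy to check mechanically. The sketch below assumes every tag is an uppercase prefix, a hyphen, and an integer inside square brackets, generalizing from the two examples given; `extract_tags()` is a hypothetical helper.

```python
import re

# Assumed tag format, inferred from [BIO-1] and [IMG-2]: [PREFIX-number].
TAG_PATTERN = re.compile(r"\[([A-Z]+)-(\d+)\]")


def extract_tags(argument: str) -> list:
    """Return (prefix, index) pairs for every evidence tag found in an argument."""
    return [(prefix, int(index)) for prefix, index in TAG_PATTERN.findall(argument)]
```

A validator could compare the extracted pairs with the evidence actually present in the dossier to flag arguments that cite a non-existent item.
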
---

## 12. Recommendations

### Short term (stability)

1. Replace `_use_edsnlp` / `_use_rag` with contextvars (thread safety)
2. Add docstrings to the main functions of the monoliths
3. Replace bare `except Exception:` handlers with logging (`exc_info=True`) and re-raise fatal errors
4. Add dedicated tests for quality/ (decision_engine, veto_engine, rules_router)

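Recommendation 3 could take the following shape. The `FatalPipelineError` class, the `process_all()` loop, and the logger name are hypothetical; which errors count as fatal is a project decision the report does not specify.

```python
import logging

logger = logging.getLogger("t2a")  # hypothetical logger name


class FatalPipelineError(Exception):
    """Hypothetical marker for errors that must stop the whole batch."""


def process_all(paths, process_one):
    """Process every path; log recoverable failures, re-raise fatal ones."""
    failed = []
    for path in paths:
        try:
            process_one(path)
        except FatalPipelineError:
            # exc_info=True attaches the full traceback to the log record.
            logger.critical("Fatal error on %s", path, exc_info=True)
            raise
        except Exception:
            logger.error("Failed to process %s", path, exc_info=True)
            failed.append(path)  # keep going, but remember the failure
    return failed
```

Returning the list of failed paths makes partial failures visible to the caller instead of being swallowed silently.
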
### Medium term (maintenance)

1. Move LLM prompts into `src/prompts/` (versionable templates)
2. Refactor cim10_extractor.py: separate LLM extraction / validation / RAG enrichment
3. Add pytest-cov and target 70%+ coverage
4. Extract the duplicated vetoes+decisions logic into a helper `_apply_quality_checks()`

### Long term (professional architecture)

1. Layered architecture: Domain / Use Cases / Adapters
2. Event bus for vetoes/decisions (allows A/B testing of rules without code changes)
3. Multi-model LLM architecture (role-based dispatch: coding, cpam, validation, qc)