Initial commit

2026-03-05 01:20:14 +01:00
commit 2163e574c1
184 changed files with 354881 additions and 0 deletions
--- a/TASK_4_SUMMARY.md
+++ b/TASK_4_SUMMARY.md
@@ -0,0 +1,149 @@
+# Task 4 Summary: Référentiels Manager Implementation
+
+## Completed: Subtask 4.1 ✅
+
+### What was implemented:
+
+1. **ReferentielsManager Class** (`src/pipeline_mco_pmsi/rag/referentiels_manager.py`)
+   - ✅ `__init__()`: Initializes manager with data directory and embedding model configuration
+   - ✅ `import_referentiel()`: Imports PDF files, generates SHA-256 hash, extracts text
+   - ✅ `get_version_info()`: Retrieves version information for a referentiel type
+   - ✅ `_extract_text_from_pdf()`: Extracts text from PDF files using pypdf
+   - ✅ `chunk_referentiel()`: Delegates to specific chunking methods
+   - ✅ `chunk_guide_mco()`: Basic chunking for Guide Méthodologique MCO
+   - ✅ `chunk_cim10()`: Basic chunking for CIM-10 FR
+   - ✅ `chunk_ccam()`: Basic chunking for CCAM descriptive
+   - ⏳ `build_index()`: Placeholder (to be implemented in subtask 4.3)
+
+2. **Data Models**
+   - ✅ `Chunk`: Represents a chunk of referentiel with metadata
+   - ✅ `VectorIndex`: Represents a vector index with metadata
+   - ✅ Uses existing `ReferentielVersion` from models.metadata
+
+3. **Unit Tests** (`tests/test_referentiels_manager.py`)
+   - ✅ 15 tests covering all implemented functionality
+   - ✅ All tests passing
+   - ✅ Test coverage: 79% for referentiels_manager.py
+
+### Key Features:
+
+- **SHA-256 Hashing**: Every imported referentiel file is fingerprinted with a SHA-256 content hash, which is used for versioning
+- **Text Extraction**: Robust PDF text extraction with error handling
+- **Version Caching**: Imported versions are cached for quick retrieval
+- **Flexible Chunking**: Different chunking strategies for each referentiel type
+- **Error Handling**: Comprehensive error handling with logging
+
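A minimal sketch of the hashing and versioning step described above, using only the standard library. The names `VersionRecord`, `sha256_of_file`, and `record_import` are hypothetical and are not taken from `referentiels_manager.py`; the real `ReferentielVersion` model may carry different fields.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical stand-in for the project's ReferentielVersion model."""

    referentiel_type: str
    file_hash: str
    imported_at: datetime


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file block by block so a large PDF is never fully loaded in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # read() returns b"" at end of file, which stops the iterator.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def record_import(path: Path, referentiel_type: str) -> VersionRecord:
    """Pair the content hash with a UTC import timestamp."""
    return VersionRecord(
        referentiel_type=referentiel_type,
        file_hash=sha256_of_file(path),
        imported_at=datetime.now(timezone.utc),
    )
```

Because the hash depends only on the file bytes, re-importing an unchanged PDF yields the same value, which is what makes it usable as a version identifier.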
+### Requirements Satisfied:
+
+- ✅ **Requirement 3.1**: Maintain versioned copies of the CIM-10 PMSI referentiel with hash and import date
+- ✅ **Requirement 3.2**: Maintain versioned copies of the CCAM PMSI referentiel with hash and import date
+- ✅ **Requirement 3.3**: Maintain versioned copies of the MCO guide with hash and import date
+- ✅ **Requirement 13.1**: Generate a hash when new referentiel files are ingested
+
+## Remaining: Subtask 4.2 ⏳
+
+### To be implemented:
+
+1. **Intelligent Chunking for Guide MCO**
+   - Parse chapter/section structure
+   - Preserve complete rules (exclusion rules, hierarchization rules)
+   - Extract eligibility criteria for DP/DAS
+   - Target: 500-1000 tokens per chunk with 100 token overlap
+
+2. **Intelligent Chunking for CIM-10**
+   - Parse code blocks with inclusion/exclusion notes
+   - Separate vectorization for alphabetical indexes vs analytical codes
+   - Maintain natural language ↔ code links (e.g., "Gastrite" → "K29.7")
+   - Target: 300-600 tokens per chunk
+
+3. **Intelligent Chunking for CCAM**
+   - Parse acts with ATIH extensions (7+3 character codes)
+   - Preserve technical notes and application conditions
+   - Vectorize alphabetical indexes for natural language search
+   - Target: 400-800 tokens per chunk
+
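The size and overlap targets above can be illustrated with a fixed-size sliding window. This is only a sketch under stated assumptions: tokens are taken to be a pre-split list (the tokenizer is not specified in this summary), `chunk_tokens` is a hypothetical helper, and the structure-aware splitting planned for subtask 4.2 is not attempted here.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], max_tokens: int = 1000, overlap: int = 100
) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens, sharing `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")

    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once the current window already reaches the end of the input,
        # otherwise a trailing chunk made only of overlap would be emitted.
        if start + max_tokens >= len(tokens):
            break
    return chunks
```

With the defaults, each chunk starts 900 tokens after the previous one, so the last 100 tokens of a chunk reappear at the start of the next. A structure-aware chunker would instead move the cut points to rule or section boundaries.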
+### Requirements to satisfy:
+
+- ⏳ **Requirement 23.2**: Chunk the Guide Méthodologique MCO into logical sections that preserve the context of the rules
+- ⏳ **Requirement 23.3**: Chunk the CIM-10 FR while preserving inclusion/exclusion notes and blocks
+- ⏳ **Requirement 23.4**: Chunk the CCAM descriptive while preserving ATIH extensions and technical notes
+
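Preserving ATIH extensions first requires recognizing them in the extracted text. The sketch below assumes a 7-character base code (four uppercase letters followed by three digits) and reads the "+3" as a hyphen plus two digits. That layout is an assumption to verify against the actual CCAM descriptive file, and `CCAM_CODE`, `CcamCode`, and `find_ccam_codes` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed layout: 4 letters + 3 digits, optionally followed by "-NN".
CCAM_CODE = re.compile(r"\b([A-Z]{4}\d{3})(?:-(\d{2}))?\b")


class CcamCode(NamedTuple):
    base: str                  # 7-character act code
    extension: Optional[str]   # two-digit extension, or None when absent


def find_ccam_codes(text: str) -> List[CcamCode]:
    """Return every code found in the text, in order of appearance."""
    return [CcamCode(m.group(1), m.group(2)) for m in CCAM_CODE.finditer(text)]
```

Keeping the base code and the extension as separate fields lets a chunker attach the extension to the act it qualifies instead of treating it as free text.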
+## Remaining: Subtask 4.3 ⏳
+
+### To be implemented:
+
+1. **Embedding Model Integration**
+   - Load French medical embedding model (CamemBERT-bio or DrBERT)
+   - Configure sentence-transformers
+   - Generate embeddings for chunks (768 dimensions)
+   - L2 normalization for cosine similarity
+
+2. **FAISS Index Creation**
+   - Build HNSW (Hierarchical Navigable Small World) index
+   - Configure index parameters (M, efConstruction)
+   - Store index to disk
+   - Generate index hash for versioning
+
+3. **Alphabetical Index Vectorization**
+   - Separate vectorization for alphabetical indexes
+   - Maintain bidirectional links (terms ↔ codes)
+   - Enable natural language search
+
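The reason for L2 normalization is that, once vectors have unit length, their inner product equals their cosine similarity, so an inner-product or L2 index can serve cosine queries. A dependency-free sketch of that identity follows; the helper names are hypothetical, and the real pipeline would operate on model embeddings rather than hand-written lists.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
```

For example, `dot(l2_normalize(a), l2_normalize(b))` and `cosine_similarity(a, b)` agree up to floating-point error for any non-zero `a` and `b`.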
+### Requirements to satisfy:
+
+- ⏳ **Requirement 23.1**: Implement a RAG architecture for searching the referentiels
+- ⏳ **Requirement 23.5**: Vectorize the alphabetical indexes in addition to the analytical codes
+- ⏳ **Requirement 27.1**: Vectorize the CIM-10 and CCAM alphabetical indexes
+
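One way to keep the bidirectional term ↔ code links is a pair of dictionaries updated together. The sketch below is an assumption about the data structure, not the project's design: `TermCodeIndex` and `normalize_term` are hypothetical, and accent and case folding is just one possible normalization choice for French index terms.

```python
import unicodedata
from collections import defaultdict
from typing import DefaultDict, Set


def normalize_term(term: str) -> str:
    """Strip accents and case so 'Gastrite' and 'gastrite' share one key."""
    decomposed = unicodedata.normalize("NFKD", term)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold().strip()


class TermCodeIndex:
    """Two dictionaries kept in sync: term -> codes and code -> terms."""

    def __init__(self) -> None:
        self._codes_by_term: DefaultDict[str, Set[str]] = defaultdict(set)
        self._terms_by_code: DefaultDict[str, Set[str]] = defaultdict(set)

    def add(self, term: str, code: str) -> None:
        key = normalize_term(term)
        self._codes_by_term[key].add(code)
        self._terms_by_code[code].add(key)

    def codes_for(self, term: str) -> Set[str]:
        # .get() avoids creating empty entries for unknown terms.
        return set(self._codes_by_term.get(normalize_term(term), set()))

    def terms_for(self, code: str) -> Set[str]:
        return set(self._terms_by_code.get(code, set()))
```

After `index.add("Gastrite", "K29.7")`, a lookup by term returns the code and a lookup by code returns the normalized term, so a vector hit on an alphabetical entry can be resolved to its analytical code and back.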
+## Optional Subtasks
+
+### Subtask 4.4 (Optional): Property Tests ⏳
+
+Property tests to implement:
+- **Property 8**: Every referentiel must have a version, a hash, and an import date
+- **Property 36**: Every import must generate a SHA-256 hash
+- **Property 46**: Every chunk must preserve its context
+
+### Subtask 4.5 (Optional): Unit Tests for Chunking ⏳
+
+Additional unit tests:
+- Test preservation of CIM-10 inclusion/exclusion notes
+- Test preservation of CCAM ATIH extensions
+- Test chunk size constraints
+- Test overlap behavior
+
+## Files Created/Modified
+
+### Created:
+- `src/pipeline_mco_pmsi/rag/referentiels_manager.py` (477 lines)
+- `src/pipeline_mco_pmsi/rag/__init__.py`
+- `tests/test_referentiels_manager.py` (260 lines)
+
+### Modified:
+- None (all new files)
+
+## Next Steps
+
+1. **Implement Subtask 4.2**: Intelligent chunking with structure preservation
+   - Parse PDF structure more intelligently
+   - Implement rule/note detection
+   - Preserve semantic context
+
+2. **Implement Subtask 4.3**: Vectorization and indexation
+   - Integrate sentence-transformers
+   - Build FAISS HNSW index
+   - Implement alphabetical index vectorization
+
+3. **Test with Real PDFs**: Verify chunking quality with actual ATIH documents
+   - guide_methodo_mco_2026_version_provisoire.pdf
+   - cim-10-fr_2026_a_usage_pmsi_version_provisoire_111225.pdf
+   - actualisation_ccam_descriptive_a_usage_pmsi_v4_2025.pdf
+
+4. **Optional**: Implement property-based tests for robustness
+
+## Notes
+
+- The current chunking implementation is basic (paragraph-based) and will need to be enhanced in subtask 4.2
+- Until the index is actually built in subtask 4.3, `index_hash` holds a placeholder value (`"0" * 64`)
+- All PDF files are available in the workspace root for testing
+- The implementation follows the design document specifications closely