Initial commit

Dom
2026-03-05 01:20:14 +01:00
commit 2163e574c1
184 changed files with 354881 additions and 0 deletions

TASK_4_SUMMARY.md Normal file

@@ -0,0 +1,149 @@
# Task 4 Summary: Référentiels Manager Implementation
## Completed: Subtask 4.1 ✅
### What was implemented:
1. **ReferentielsManager Class** (`src/pipeline_mco_pmsi/rag/referentiels_manager.py`)
- `__init__()`: Initializes manager with data directory and embedding model configuration
- `import_referentiel()`: Imports PDF files, generates SHA-256 hash, extracts text
- `get_version_info()`: Retrieves version information for a referentiel type
- `_extract_text_from_pdf()`: Extracts text from PDF files using pypdf
- `chunk_referentiel()`: Delegates to specific chunking methods
- `chunk_guide_mco()`: Basic chunking for Guide Méthodologique MCO
- `chunk_cim10()`: Basic chunking for CIM-10 FR
- `chunk_ccam()`: Basic chunking for CCAM descriptive
- `build_index()`: Placeholder (to be implemented in subtask 4.3)
2. **Data Models**
- `Chunk`: Represents a chunk of referentiel with metadata
- `VectorIndex`: Represents a vector index with metadata
- ✅ Uses existing `ReferentielVersion` from models.metadata
3. **Unit Tests** (`tests/test_referentiels_manager.py`)
- ✅ 15 tests covering all implemented functionality
- ✅ All tests passing
- ✅ Test coverage: 79% for referentiels_manager.py
### Key Features:
- **SHA-256 Hashing**: Every imported referentiel gets a unique hash for versioning
- **Text Extraction**: Robust PDF text extraction with error handling
- **Version Caching**: Imported versions are cached for quick retrieval
- **Flexible Chunking**: Different chunking strategies for each referentiel type
- **Error Handling**: Comprehensive error handling with logging
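
As a rough illustration of the hashing and version caching features listed above, the sketch below hashes a file block by block with `hashlib` and keeps the resulting record in an in-memory dict. The names `VersionRecord`, `hash_file`, and `import_with_cache` are hypothetical and are not taken from `referentiels_manager.py`; the project's `ReferentielVersion` model may well have different fields.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical stand-in for the project's ReferentielVersion model."""
    referentiel_type: str
    sha256: str
    imported_at: datetime


def hash_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Assumed in-memory cache keyed by referentiel type (e.g. "cim10", "ccam", "guide_mco").
_version_cache: dict[str, VersionRecord] = {}


def import_with_cache(referentiel_type: str, path: Path) -> VersionRecord:
    """Hash the file and cache the record; identical content returns the cached record."""
    sha256 = hash_file(path)
    cached = _version_cache.get(referentiel_type)
    if cached is not None and cached.sha256 == sha256:
        return cached
    record = VersionRecord(referentiel_type, sha256, datetime.now(timezone.utc))
    _version_cache[referentiel_type] = record
    return record
```

Keying the cache by type while comparing hashes means a changed PDF replaces the cached record, whereas re-importing the same bytes leaves the original import date untouched.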
### Requirements Satisfied:
- **Exigence 3.1**: Maintain versioned copies of the CIM-10 PMSI referentiel with hash and import date
- **Exigence 3.2**: Maintain versioned copies of the CCAM PMSI referentiel with hash and import date
- **Exigence 3.3**: Maintain versioned copies of the Guide MCO with hash and import date
- **Exigence 13.1**: Generate a hash when new referentiel files are ingested
## Remaining: Subtask 4.2 ⏳
### To be implemented:
1. **Intelligent Chunking for Guide MCO**
- Parse chapter/section structure
- Preserve complete rules (exclusion rules, hierarchization rules)
- Extract eligibility criteria for DP/DAS
- Target: 500-1000 tokens per chunk with 100 token overlap
2. **Intelligent Chunking for CIM-10**
- Parse code blocks with inclusion/exclusion notes
- Separate vectorization for alphabetical indexes vs analytical codes
- Maintain natural language ↔ code links (e.g., "Gastrite" → "K29.7")
- Target: 300-600 tokens per chunk
3. **Intelligent Chunking for CCAM**
- Parse acts with ATIH extensions (7+3 character codes)
- Preserve technical notes and application conditions
- Vectorize alphabetical indexes for natural language search
- Target: 400-800 tokens per chunk
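
The size and overlap targets above (for example, at most 1000 tokens per chunk with a 100-token overlap for the Guide MCO) can be illustrated with a simple sliding window. This is only a sketch: `chunk_with_overlap` is a hypothetical helper, whitespace-separated words stand in for tokens, and the structure-aware parsing that subtask 4.2 calls for (rules, notes, code blocks) is not addressed here.

```python
def chunk_with_overlap(text: str, max_tokens: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of at most max_tokens tokens.

    Each window shares `overlap` tokens with the previous one. Tokens are
    approximated by whitespace-separated words (an assumption; a real
    tokenizer would give different counts).
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks
```

The early `break` avoids emitting a trailing chunk that would contain nothing but tokens already covered by the previous window.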
### Requirements to satisfy:
- **Exigence 23.2**: Chunk the Guide Méthodologique MCO into logical sections that preserve the context of the rules
- **Exigence 23.3**: Chunk the CIM-10 FR while preserving inclusion/exclusion notes and blocks
- **Exigence 23.4**: Chunk the CCAM descriptive while preserving ATIH extensions and technical notes
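
The natural language ↔ code links listed for CIM-10 chunking (e.g., "Gastrite" → "K29.7") could be kept in a small two-way lookup. The sketch below is an assumption about one possible shape: `TermCodeIndex` and its methods are hypothetical names, and term matching is limited to case folding (accent folding and fuzzy matching are left out).

```python
from collections import defaultdict


class TermCodeIndex:
    """Hypothetical two-way lookup between index terms and codes."""

    def __init__(self) -> None:
        self._codes_by_term: dict[str, set[str]] = defaultdict(set)
        self._terms_by_code: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _normalize(term: str) -> str:
        # Case-insensitive matching only; accents are kept as-is in this sketch.
        return term.strip().casefold()

    def add(self, term: str, code: str) -> None:
        """Register the link in both directions."""
        key = self._normalize(term)
        self._codes_by_term[key].add(code)
        self._terms_by_code[code].add(key)

    def codes_for(self, term: str) -> set[str]:
        """Return a copy of the codes linked to a term (empty set if unknown)."""
        return set(self._codes_by_term.get(self._normalize(term), set()))

    def terms_for(self, code: str) -> set[str]:
        """Return a copy of the normalized terms linked to a code (empty set if unknown)."""
        return set(self._terms_by_code.get(code, set()))
```

For instance, after `index.add("Gastrite", "K29.7")`, both `index.codes_for("gastrite")` and `index.terms_for("K29.7")` resolve the link from their respective side.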
## Remaining: Subtask 4.3 ⏳
### To be implemented:
1. **Embedding Model Integration**
- Load French medical embedding model (CamemBERT-bio or DrBERT)
- Configure sentence-transformers
- Generate embeddings for chunks (768 dimensions)
- L2 normalization for cosine similarity
2. **FAISS Index Creation**
- Build HNSW (Hierarchical Navigable Small World) index
- Configure index parameters (M, efConstruction)
- Store index to disk
- Generate index hash for versioning
3. **Alphabetical Index Vectorization**
- Separate vectorization for alphabetical indexes
- Maintain bidirectional links (terms ↔ codes)
- Enable natural language search
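
The "L2 normalization for cosine similarity" step above relies on the fact that, once vectors are scaled to unit length, their inner product equals their cosine similarity. The pure-Python sketch below demonstrates this on small vectors; the helper names are hypothetical, and a real implementation would operate on the 768-dimensional embeddings with an array library rather than Python lists.

```python
import math


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed directly from the unnormalized vectors."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)


# After normalization, a plain inner product gives the same score.
a, b = [3.0, 4.0], [4.0, 3.0]
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine_similarity(a, b))
```

This is why normalizing once at indexing time lets an inner-product index rank results by cosine similarity without recomputing norms at query time.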
### Requirements to satisfy:
- **Exigence 23.1**: Implement a RAG architecture for searching the referentiels
- **Exigence 23.5**: Vectorize the alphabetical indexes in addition to the analytical codes
- **Exigence 27.1**: Vectorize the CIM-10 and CCAM alphabetical indexes
## Optional Subtasks
### Subtask 4.4 (Optional): Property Tests ⏳
Property tests to implement:
- **Propriété 8**: Every referentiel must have a version, a hash, and an import date
- **Propriété 36**: Every import must generate a SHA-256 hash
- **Propriété 46**: Every chunk must preserve its context
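
To give an idea of what a property test for Propriété 36 might look like, the sketch below checks that hashing arbitrary byte strings always yields a well-formed, deterministic SHA-256 digest. It uses a seeded `random.Random` from the standard library to generate inputs; a property-testing library such as Hypothesis would normally take over input generation and shrinking. `sha256_of` is a hypothetical stand-in for the hash computed inside `import_referentiel()`.

```python
import hashlib
import random
import string

HEX_DIGITS = set(string.hexdigits.lower())


def sha256_of(content: bytes) -> str:
    """Hypothetical stand-in for the hash computed at import time."""
    return hashlib.sha256(content).hexdigest()


def check_hash_property(trials: int = 200, seed: int = 0) -> None:
    """For random byte strings, the digest is 64 lowercase hex characters and deterministic."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 4096)
        content = bytes(rng.getrandbits(8) for _ in range(length))
        digest = sha256_of(content)
        assert len(digest) == 64
        assert set(digest) <= HEX_DIGITS
        assert digest == sha256_of(content)  # same input, same hash
```

Calling `check_hash_property()` raises `AssertionError` on the first input that violates the property and returns silently otherwise.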
### Subtask 4.5 (Optional): Unit Tests for Chunking ⏳
Additional unit tests:
- Test preservation of CIM-10 inclusion/exclusion notes
- Test preservation of CCAM ATIH extensions
- Test chunk size constraints
- Test overlap behavior
## Files Created/Modified
### Created:
- `src/pipeline_mco_pmsi/rag/referentiels_manager.py` (477 lines)
- `src/pipeline_mco_pmsi/rag/__init__.py`
- `tests/test_referentiels_manager.py` (260 lines)
### Modified:
- None (all new files)
## Next Steps
1. **Implement Subtask 4.2**: Intelligent chunking with structure preservation
- Parse PDF structure more intelligently
- Implement rule/note detection
- Preserve semantic context
2. **Implement Subtask 4.3**: Vectorization and indexation
- Integrate sentence-transformers
- Build FAISS HNSW index
- Implement alphabetical index vectorization
3. **Test with Real PDFs**: Verify chunking quality with actual ATIH documents
- guide_methodo_mco_2026_version_provisoire.pdf
- cim-10-fr_2026_a_usage_pmsi_version_provisoire_111225.pdf
- actualisation_ccam_descriptive_a_usage_pmsi_v4_2025.pdf
4. **Optional**: Implement property-based tests for robustness
## Notes
- The current chunking implementation is basic (paragraph-based) and will need to be enhanced in subtask 4.2
- `index_hash` is set to a placeholder value (`"0" * 64`) until the index is actually built in subtask 4.3
- All PDF files are available in the workspace root for testing
- The implementation follows the design document specifications closely