# Task 4 Summary: Référentiels Manager Implementation

## Completed: Subtask 4.1 ✅

### What was implemented:

1. **ReferentielsManager Class** (`src/pipeline_mco_pmsi/rag/referentiels_manager.py`)
   - ✅ `__init__()`: Initializes manager with data directory and embedding model configuration
   - ✅ `import_referentiel()`: Imports PDF files, generates SHA-256 hash, extracts text
   - ✅ `get_version_info()`: Retrieves version information for a referentiel type
   - ✅ `_extract_text_from_pdf()`: Extracts text from PDF files using pypdf
   - ✅ `chunk_referentiel()`: Delegates to specific chunking methods
   - ✅ `chunk_guide_mco()`: Basic chunking for Guide Méthodologique MCO
   - ✅ `chunk_cim10()`: Basic chunking for CIM-10 FR
   - ✅ `chunk_ccam()`: Basic chunking for CCAM descriptive
   - ⏳ `build_index()`: Placeholder (to be implemented in subtask 4.3)

2. **Data Models**
   - ✅ `Chunk`: Represents a chunk of referentiel with metadata
   - ✅ `VectorIndex`: Represents a vector index with metadata
   - ✅ Uses existing `ReferentielVersion` from `models.metadata`

3. **Unit Tests** (`tests/test_referentiels_manager.py`)
   - ✅ 15 tests covering all implemented functionality
   - ✅ All tests passing
   - ✅ Test coverage: 79% for `referentiels_manager.py`

### Key Features:

- **SHA-256 Hashing**: Each imported referentiel is hashed with SHA-256, and the hash identifies its version
- **Text Extraction**: PDF text extraction with error handling
- **Version Caching**: Imported versions are cached for quick retrieval
- **Flexible Chunking**: A separate chunking strategy for each referentiel type
- **Error Handling**: Errors are handled and logged

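The hashing and version-caching features above could look roughly like the following sketch. It is not the code in `referentiels_manager.py`: `VersionRecord`, `VersionCache`, and `sha256_of_file` are hypothetical names, and the fields of the real `ReferentielVersion` model may differ.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, Optional


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical stand-in for ReferentielVersion (real fields may differ)."""
    referentiel_type: str
    file_hash: str
    imported_at: datetime


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file block by block so a large PDF is never fully loaded."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


class VersionCache:
    """Minimal in-memory cache keyed by referentiel type (assumed design)."""

    def __init__(self) -> None:
        self._versions: Dict[str, VersionRecord] = {}

    def register(self, referentiel_type: str, path: Path) -> VersionRecord:
        """Hash the file, timestamp the import, and remember the record."""
        record = VersionRecord(
            referentiel_type=referentiel_type,
            file_hash=sha256_of_file(path),
            imported_at=datetime.now(timezone.utc),
        )
        self._versions[referentiel_type] = record
        return record

    def get(self, referentiel_type: str) -> Optional[VersionRecord]:
        """Return the cached record, or None if nothing was imported."""
        return self._versions.get(referentiel_type)
```

Because the hash depends only on the file content, importing the same PDF twice yields the same `file_hash`, which is what makes it usable as a version identifier.
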
### Requirements Satisfied:

- ✅ **Requirement 3.1**: Maintain versioned copies of the CIM-10 PMSI referentiel, with hash and import date
- ✅ **Requirement 3.2**: Maintain versioned copies of the CCAM PMSI referentiel, with hash and import date
- ✅ **Requirement 3.3**: Maintain versioned copies of the MCO guide, with hash and import date
- ✅ **Requirement 13.1**: Generate a hash when new referentiel files are ingested

## Remaining: Subtask 4.2 ⏳

### To be implemented:

1. **Intelligent Chunking for Guide MCO**
   - Parse chapter/section structure
   - Preserve complete rules (exclusion and prioritization rules)
   - Extract eligibility criteria for DP/DAS
   - Target: 500-1000 tokens per chunk with 100 token overlap

2. **Intelligent Chunking for CIM-10**
   - Parse code blocks with inclusion/exclusion notes
   - Separate vectorization for alphabetical indexes vs analytical codes
   - Maintain natural language ↔ code links (e.g., "Gastrite" → "K29.7")
   - Target: 300-600 tokens per chunk

3. **Intelligent Chunking for CCAM**
   - Parse acts with ATIH extensions (7+3 character codes)
   - Preserve technical notes and application conditions
   - Vectorize alphabetical indexes for natural language search
   - Target: 400-800 tokens per chunk

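The size and overlap targets listed above can be illustrated with a plain sliding window. This sketch covers only the size/overlap part, not structure or rule preservation. It assumes whitespace-separated words as a rough proxy for tokens; the real token count would depend on the tokenizer of the chosen embedding model. `chunk_tokens` and `chunk_text` are hypothetical helper names.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int = 1000,
                 overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens, each sharing
    `overlap` tokens with the previous window."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once this window reaches the end, to avoid emitting a
        # trailing chunk made only of already-covered overlap tokens.
        if start + max_tokens >= len(tokens):
            break
        start += step
    return chunks


def chunk_text(text: str, max_tokens: int = 1000,
               overlap: int = 100) -> List[str]:
    """Whitespace-token approximation (assumption, not a real tokenizer)."""
    return [" ".join(chunk)
            for chunk in chunk_tokens(text.split(), max_tokens, overlap)]
```

With the Guide MCO targets (`max_tokens=1000`, `overlap=100`), each window starts 900 tokens after the previous one, so a rule that straddles a boundary appears in full in at least one chunk only if it is shorter than the overlap. That limitation is why the list above also calls for structure-aware parsing.
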
### Requirements to satisfy:

- ⏳ **Requirement 23.2**: Chunk the Guide Méthodologique MCO into logical sections that preserve the context of the rules
- ⏳ **Requirement 23.3**: Chunk the CIM-10 FR while preserving inclusion/exclusion notes and blocks
- ⏳ **Requirement 23.4**: Chunk the CCAM descriptive while preserving ATIH extensions and technical notes

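One possible way to keep each code together with its notes, as items 23.3 and 23.4 above ask, is to start a new block whenever a line begins with a code. The sketch below rests on assumptions that would need checking against the real PDFs: CIM-10 codes are taken to look like a letter, two digits, and an optional decimal part; CCAM codes are taken to be four letters and three digits, optionally followed by a hyphen and two digits for the ATIH extension. The patterns, the helper `split_by_code`, and the sample lines are all illustrative.

```python
import re
from typing import Dict, List, Optional, Pattern

# Assumed code shapes; the actual referentiels may need broader patterns.
CIM10_CODE = re.compile(r"^[A-Z]\d{2}(?:\.\d{1,2})?\b")
CCAM_CODE = re.compile(r"^[A-Z]{4}\d{3}(?:-\d{2})?\b")


def split_by_code(lines: List[str],
                  code_pattern: Pattern[str]) -> Dict[str, List[str]]:
    """Group lines into blocks, each opened by a line starting with a code.

    Lines that follow a code line (notes, conditions) stay attached to
    that code until the next code line. Lines before the first code and
    blank lines are dropped.
    """
    blocks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines:
        stripped = line.strip()
        match = code_pattern.match(stripped)
        if match:
            current = match.group(0)
            blocks[current] = [stripped]
        elif current is not None and stripped:
            blocks[current].append(stripped)
    return blocks


# Illustrative lines only, not quoted from the referentiels.
sample = [
    "K29.7 Gastrite, sans précision",
    "  Note: illustrative exclusion note attached to K29.7",
    "",
    "K30 Dyspepsie",
]
cim10_blocks = split_by_code(sample, CIM10_CODE)
```

Here `cim10_blocks["K29.7"]` holds both the code line and its note, so a chunker working on blocks rather than paragraphs would not separate them.
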
## Remaining: Subtask 4.3 ⏳

### To be implemented:

1. **Embedding Model Integration**
   - Load French medical embedding model (CamemBERT-bio or DrBERT)
   - Configure sentence-transformers
   - Generate embeddings for chunks (768 dimensions)
   - L2 normalization for cosine similarity

2. **FAISS Index Creation**
   - Build HNSW (Hierarchical Navigable Small World) index
   - Configure index parameters (M, efConstruction)
   - Store index to disk
   - Generate index hash for versioning

3. **Alphabetical Index Vectorization**
   - Separate vectorization for alphabetical indexes
   - Maintain bidirectional links (terms ↔ codes)
   - Enable natural language search

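The L2 normalization step listed above matters because, once vectors have unit length, their inner product equals their cosine similarity, so an inner-product index ranks results by cosine similarity. The sketch below shows this in pure Python on toy 2-dimensional vectors; the real pipeline would apply the same operation to 768-dimensional embeddings, most likely with an array library. The function names are illustrative.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
direct = cosine_similarity(a, b)
via_normalized_dot = dot(l2_normalize(a), l2_normalize(b))
```

Both `direct` and `via_normalized_dot` come out at 0.96 for these vectors (up to floating-point rounding), which is the equivalence the normalization step relies on.
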
### Requirements to satisfy:

- ⏳ **Requirement 23.1**: Implement a RAG architecture for searching the referentiels
- ⏳ **Requirement 23.5**: Vectorize the alphabetical indexes in addition to the analytical codes
- ⏳ **Requirement 27.1**: Vectorize the CIM-10 and CCAM alphabetical indexes

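The bidirectional term ↔ code links that accompany alphabetical index vectorization could be kept in two mirrored dictionaries. This is a minimal sketch under assumptions: `BidirectionalIndex` is a hypothetical class, and normalizing terms by trimming and case-folding is an illustrative choice, not a decision taken from the design document.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Set, Tuple


class BidirectionalIndex:
    """Maps index terms to codes and codes back to terms (assumed design)."""

    def __init__(self, pairs: Iterable[Tuple[str, str]] = ()) -> None:
        self._codes_by_term: DefaultDict[str, Set[str]] = defaultdict(set)
        self._terms_by_code: DefaultDict[str, Set[str]] = defaultdict(set)
        for term, code in pairs:
            self.add(term, code)

    @staticmethod
    def _normalize(term: str) -> str:
        # Illustrative normalization: trim whitespace and ignore case.
        return term.strip().casefold()

    def add(self, term: str, code: str) -> None:
        """Record the link in both directions."""
        key = self._normalize(term)
        self._codes_by_term[key].add(code)
        self._terms_by_code[code].add(key)

    def codes_for(self, term: str) -> Set[str]:
        """Codes linked to a term; empty set if the term is unknown."""
        return set(self._codes_by_term.get(self._normalize(term), ()))

    def terms_for(self, code: str) -> Set[str]:
        """Normalized terms linked to a code; empty set if unknown."""
        return set(self._terms_by_code.get(code, ()))


index = BidirectionalIndex([("Gastrite", "K29.7")])
```

A vector search over alphabetical index entries would return terms; `codes_for` then resolves them to codes, and `terms_for` supports the reverse lookup when explaining why a code was proposed.
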
## Optional Subtasks

### Subtask 4.4 (Optional): Property Tests ⏳

Property tests to implement:
- **Property 8**: Every referentiel must have a version, a hash, and an import date
- **Property 36**: Every import must generate a SHA-256 hash
- **Property 46**: Every chunk must preserve its context

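As an illustration of what a check for property 36 might assert, the sketch below generates random payloads with the standard library and verifies that each one hashes to a well-formed, deterministic SHA-256 digest. A dedicated property-testing library would normally generate and shrink the inputs; the seeded `random` loop is a stand-in. `hash_payload` is a hypothetical substitute for the hashing step inside `import_referentiel()`.

```python
import hashlib
import random
import string

HEX_DIGITS = set(string.hexdigits.lower())


def hash_payload(payload: bytes) -> str:
    """Stand-in for the hashing step of import_referentiel() (assumption)."""
    return hashlib.sha256(payload).hexdigest()


def check_hash_property(trials: int = 200, seed: int = 0) -> int:
    """Check the property on `trials` random payloads; return the count checked."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(trials):
        size = rng.randint(0, 4096)
        payload = bytes(rng.getrandbits(8) for _ in range(size))
        digest = hash_payload(payload)
        assert len(digest) == 64, "SHA-256 hex digest must be 64 characters"
        assert set(digest) <= HEX_DIGITS, "digest must be lowercase hex"
        assert digest == hash_payload(payload), "hashing must be deterministic"
    return trials
```

Properties 8 and 46 would need the real `ReferentielVersion` and `Chunk` models, so they are not sketched here.
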
### Subtask 4.5 (Optional): Unit Tests for Chunking ⏳

Additional unit tests:
- Test preservation of CIM-10 inclusion/exclusion notes
- Test preservation of CCAM ATIH extensions
- Test chunk size constraints
- Test overlap behavior

## Files Created/Modified

### Created:
- `src/pipeline_mco_pmsi/rag/referentiels_manager.py` (477 lines)
- `src/pipeline_mco_pmsi/rag/__init__.py`
- `tests/test_referentiels_manager.py` (260 lines)

### Modified:
- None (all new files)

## Next Steps

1. **Implement Subtask 4.2**: Intelligent chunking with structure preservation
   - Parse PDF structure more intelligently
   - Implement rule/note detection
   - Preserve semantic context

2. **Implement Subtask 4.3**: Vectorization and indexation
   - Integrate sentence-transformers
   - Build FAISS HNSW index
   - Implement alphabetical index vectorization

3. **Test with Real PDFs**: Verify chunking quality with actual ATIH documents
   - `guide_methodo_mco_2026_version_provisoire.pdf`
   - `cim-10-fr_2026_a_usage_pmsi_version_provisoire_111225.pdf`
   - `actualisation_ccam_descriptive_a_usage_pmsi_v4_2025.pdf`

4. **Optional**: Implement property-based tests for robustness

## Notes

- The current chunking implementation is basic (paragraph-based) and will be enhanced in subtask 4.2
- `index_hash` holds a placeholder value (`"0" * 64`) until the index is actually built in subtask 4.3
- All PDF files are available in the workspace root for testing
- The implementation follows the design document specifications closely