Task 4 Summary: Référentiels Manager Implementation
Completed: Subtask 4.1 ✅
What was implemented:
- ReferentielsManager Class (src/pipeline_mco_pmsi/rag/referentiels_manager.py)
  - ✅ __init__(): Initializes manager with data directory and embedding model configuration
  - ✅ import_referentiel(): Imports PDF files, generates SHA-256 hash, extracts text
  - ✅ get_version_info(): Retrieves version information for a referentiel type
  - ✅ _extract_text_from_pdf(): Extracts text from PDF files using pypdf
  - ✅ chunk_referentiel(): Delegates to specific chunking methods
  - ✅ chunk_guide_mco(): Basic chunking for Guide Méthodologique MCO
  - ✅ chunk_cim10(): Basic chunking for CIM-10 FR
  - ✅ chunk_ccam(): Basic chunking for CCAM descriptive
  - ⏳ build_index(): Placeholder (to be implemented in subtask 4.3)
- Data Models
  - ✅ Chunk: Represents a chunk of referentiel with metadata
  - ✅ VectorIndex: Represents a vector index with metadata
  - ✅ Uses existing ReferentielVersion from models.metadata
- Unit Tests (tests/test_referentiels_manager.py)
  - ✅ 15 tests covering all implemented functionality
  - ✅ All tests passing
  - ✅ Test coverage: 79% for referentiels_manager.py
Key Features:
- SHA-256 Hashing: Every imported referentiel gets a unique hash for versioning
- Text Extraction: Robust PDF text extraction with error handling
- Version Caching: Imported versions are cached for quick retrieval
- Flexible Chunking: Different chunking strategies for each referentiel type
- Error Handling: Comprehensive error handling with logging
Requirements Satisfied:
- ✅ Requirement 3.1: Maintain versioned copies of the CIM-10 PMSI referentiel with hash and import date
- ✅ Requirement 3.2: Maintain versioned copies of the CCAM PMSI referentiel with hash and import date
- ✅ Requirement 3.3: Maintain versioned copies of the MCO guide with hash and import date
- ✅ Requirement 13.1: Generate a hash when new referentiel files are ingested
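
The hash-and-import-date requirement above can be illustrated with a minimal sketch. It assumes the hash is taken over the raw bytes of the PDF; `VersionRecord`, `sha256_of_file`, and `record_import` are hypothetical names for illustration and do not reproduce the actual ReferentielVersion model or import_referentiel() code.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical stand-in for ReferentielVersion; field names are assumed."""
    referentiel_type: str
    file_hash: str
    imported_at: datetime


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file block by block so a large PDF is never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def record_import(path: Path, referentiel_type: str) -> VersionRecord:
    """Build a version record with the content hash and a UTC import timestamp."""
    return VersionRecord(
        referentiel_type=referentiel_type,
        file_hash=sha256_of_file(path),
        imported_at=datetime.now(timezone.utc),
    )
```

Because the hash depends only on file content, importing the same PDF twice yields the same hash, which is what makes it usable as a version identifier.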
Remaining: Subtask 4.2 ⏳
To be implemented:
- Intelligent Chunking for Guide MCO
  - Parse chapter/section structure
  - Preserve complete rules (exclusion rules, hierarchization)
  - Extract eligibility criteria for DP/DAS
  - Target: 500-1000 tokens per chunk with 100 token overlap
- Intelligent Chunking for CIM-10
  - Parse code blocks with inclusion/exclusion notes
  - Separate vectorization for alphabetical indexes vs analytical codes
  - Maintain natural language ↔ code links (e.g., "Gastrite" → "K29.7")
  - Target: 300-600 tokens per chunk
- Intelligent Chunking for CCAM
  - Parse acts with ATIH extensions (7+3 character codes)
  - Preserve technical notes and application conditions
  - Vectorize alphabetical indexes for natural language search
  - Target: 400-800 tokens per chunk
Requirements to satisfy:
- ⏳ Requirement 23.2: Chunk the Guide Méthodologique MCO into logical sections that preserve the context of the rules
- ⏳ Requirement 23.3: Chunk the CIM-10 FR while preserving inclusion/exclusion notes and blocks
- ⏳ Requirement 23.4: Chunk the CCAM descriptive while preserving ATIH extensions and technical notes
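
The size targets listed for subtask 4.2 (for example, up to 1000 tokens per Guide MCO chunk with a 100 token overlap) come down to sliding-window arithmetic. The sketch below shows only that arithmetic. It assumes the text has already been split into tokens by some tokenizer, and `chunk_tokens` is a hypothetical helper, not part of the existing class. It does not detect rule or section boundaries, which the subtask also requires.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], max_tokens: int = 1000, overlap: int = 100
) -> List[List[str]]:
    """Split tokens into windows of at most max_tokens, each sharing
    `overlap` tokens with the previous window."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    step = max_tokens - overlap  # how far the window start advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, a 2500-token section gives three windows starting at tokens 0, 900, and 1800. A structure-aware chunker would likely move each cut to the nearest section or rule boundary rather than cutting at a fixed offset.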
Remaining: Subtask 4.3 ⏳
To be implemented:
- Embedding Model Integration
  - Load French medical embedding model (CamemBERT-bio or DrBERT)
  - Configure sentence-transformers
  - Generate embeddings for chunks (768 dimensions)
  - L2 normalization for cosine similarity
- FAISS Index Creation
  - Build HNSW (Hierarchical Navigable Small World) index
  - Configure index parameters (M, efConstruction)
  - Store index to disk
  - Generate index hash for versioning
- Alphabetical Index Vectorization
  - Separate vectorization for alphabetical indexes
  - Maintain bidirectional links (terms ↔ codes)
  - Enable natural language search
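
The "L2 normalization for cosine similarity" step relies on a simple identity: once two vectors are scaled to unit length, their inner product equals their cosine similarity, so an index that ranks by inner product ranks by cosine. The sketch below demonstrates this with plain Python lists. The function names are hypothetical, and real embeddings would be 768-dimensional arrays produced by the embedding model rather than the toy vectors used here.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization the plain inner product matches the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine_similarity(a, b))
```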
Requirements to satisfy:
- ⏳ Requirement 23.1: Implement a RAG architecture for searching the referentiels
- ⏳ Requirement 23.5: Vectorize the alphabetical indexes in addition to the analytical codes
- ⏳ Requirement 27.1: Vectorize the CIM-10 and CCAM alphabetical indexes
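
The bidirectional term ↔ code links planned for the alphabetical indexes can be pictured as two dictionaries kept in sync. This is a minimal sketch under that assumption. `TermCodeIndex` is a hypothetical class, it performs exact (case-insensitive) lookups only, and the vector search described in this subtask would sit in front of it to map free text to index terms.

```python
from collections import defaultdict
from typing import Dict, Set


class TermCodeIndex:
    """Two-way mapping between index terms and codes (illustrative only)."""

    def __init__(self) -> None:
        self._codes_by_term: Dict[str, Set[str]] = defaultdict(set)
        self._terms_by_code: Dict[str, Set[str]] = defaultdict(set)

    def add(self, term: str, code: str) -> None:
        """Register the link in both directions."""
        self._codes_by_term[term.casefold()].add(code)
        self._terms_by_code[code].add(term)

    def codes_for(self, term: str) -> Set[str]:
        """Codes linked to a term, matched case-insensitively."""
        return set(self._codes_by_term.get(term.casefold(), ()))

    def terms_for(self, code: str) -> Set[str]:
        """Terms linked to a code, in their original spelling."""
        return set(self._terms_by_code.get(code, ()))


index = TermCodeIndex()
index.add("Gastrite", "K29.7")
assert index.codes_for("gastrite") == {"K29.7"}
assert index.terms_for("K29.7") == {"Gastrite"}
```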
Optional Subtasks
Subtask 4.4 (Optional): Property Tests ⏳
Property tests to implement:
- Property 8: Every referentiel must have a version, a hash, and an import date
- Property 36: Every import must generate a SHA-256 hash
- Property 46: Every chunk must preserve its context
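
As an illustration of what the hash property could look like as a test, the sketch below checks it over randomly generated byte strings. It is dependency-free and uses the standard `random` module; a property-testing library would normally generate and shrink the inputs instead. `content_hash` is a hypothetical stand-in that assumes the hash is SHA-256 over the raw file bytes.

```python
import hashlib
import random
import string


def content_hash(data: bytes) -> str:
    """Assumed hashing rule: SHA-256 over the raw bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def check_hash_property(trials: int = 200, seed: int = 0) -> None:
    """For random inputs, the hash must be 64 lowercase hex characters
    and must be stable when computed twice on the same bytes."""
    rng = random.Random(seed)
    hex_chars = set(string.hexdigits.lower())
    for _ in range(trials):
        size = rng.randint(0, 512)
        data = bytes(rng.getrandbits(8) for _ in range(size))
        digest = content_hash(data)
        assert len(digest) == 64
        assert set(digest) <= hex_chars
        assert digest == content_hash(data)


check_hash_property()
```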
Subtask 4.5 (Optional): Unit Tests for Chunking ⏳
Additional unit tests:
- Test preservation of CIM-10 inclusion/exclusion notes
- Test preservation of CCAM ATIH extensions
- Test chunk size constraints
- Test overlap behavior
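
The first test in this list presupposes a chunker that keeps each note attached to the code it belongs to. One simple way to get there is sketched below, assuming that a line starting with a CIM-10 style code (such as "K29.7") opens a new entry and that every following non-code line belongs to that entry. The regular expression, the function name, and the sample lines are illustrative assumptions, not the layout of the actual ATIH PDF.

```python
import re
from typing import List

# Assumed code shape: one letter, two digits, optional ".d" or ".dd" suffix.
CODE_LINE = re.compile(r"^[A-Z]\d{2}(?:\.\d{1,2})?\b")


def group_code_blocks(lines: List[str]) -> List[str]:
    """Group lines so each code line stays with the notes that follow it."""
    blocks: List[List[str]] = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content
        if CODE_LINE.match(stripped) or not blocks:
            blocks.append([stripped])  # a code line opens a new block
        else:
            blocks[-1].append(stripped)  # notes attach to the current code
    return ["\n".join(block) for block in blocks]


# Made-up sample lines; the note text is a placeholder.
sample = [
    "K29.6 placeholder label",
    "  Inclusion note: placeholder",
    "",
    "K29.7 Gastrite",
    "Exclusion note: placeholder",
]
blocks = group_code_blocks(sample)
assert blocks[1] == "K29.7 Gastrite\nExclusion note: placeholder"
```

A unit test for note preservation could then assert that no note line ever ends up in a block without a code line, and a size check could flag blocks that exceed the chunk target.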
Files Created/Modified
Created:
- src/pipeline_mco_pmsi/rag/referentiels_manager.py (477 lines)
- src/pipeline_mco_pmsi/rag/__init__.py
- tests/test_referentiels_manager.py (260 lines)
Modified:
- None (all new files)
Next Steps
- Implement Subtask 4.2: Intelligent chunking with structure preservation
  - Parse PDF structure more intelligently
  - Implement rule/note detection
  - Preserve semantic context
- Implement Subtask 4.3: Vectorization and indexation
  - Integrate sentence-transformers
  - Build FAISS HNSW index
  - Implement alphabetical index vectorization
- Test with Real PDFs: Verify chunking quality with actual ATIH documents
  - guide_methodo_mco_2026_version_provisoire.pdf
  - cim-10-fr_2026_a_usage_pmsi_version_provisoire_111225.pdf
  - actualisation_ccam_descriptive_a_usage_pmsi_v4_2025.pdf
- Optional: Implement property-based tests for robustness
Notes
- The current chunking implementation is basic (paragraph-based) and will need to be enhanced in subtask 4.2
- The placeholder hash ("0" * 64) for index_hash is used until the index is actually built in subtask 4.3
- All PDF files are available in the workspace root for testing
- The implementation follows the design document specifications closely
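
The placeholder value noted above has the same length as a real SHA-256 hex digest (64 characters), so code that stores or validates index_hash does not need a special case. The sketch below shows one way a caller might tell a placeholder from a real hash. `IndexInfo` is a hypothetical, reduced stand-in for VectorIndex, and hashing the serialized index bytes is an assumption about how subtask 4.3 might compute the value.

```python
import hashlib
from dataclasses import dataclass

PLACEHOLDER_HASH = "0" * 64  # same length as a SHA-256 hex digest


@dataclass
class IndexInfo:
    """Hypothetical, reduced stand-in for VectorIndex."""
    index_hash: str = PLACEHOLDER_HASH

    @property
    def is_built(self) -> bool:
        """True once the placeholder has been replaced by a real hash."""
        return self.index_hash != PLACEHOLDER_HASH


def hash_index_bytes(serialized_index: bytes) -> str:
    """Assumed rule: SHA-256 over the serialized index."""
    return hashlib.sha256(serialized_index).hexdigest()


assert not IndexInfo().is_built
assert IndexInfo(hash_index_bytes(b"serialized index")).is_built
```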