Dom/aivanov_CIM

Dom 2163e574c1 Initial commit

2026-03-05 01:20:14 +01:00

Task 4 Summary: Référentiels Manager Implementation

Completed: Subtask 4.1 ✅

What was implemented:

ReferentielsManager Class (src/pipeline_mco_pmsi/rag/referentiels_manager.py)
- ✅ __init__(): Initializes manager with data directory and embedding model configuration
- ✅ import_referentiel(): Imports PDF files, generates SHA-256 hash, extracts text
- ✅ get_version_info(): Retrieves version information for a referentiel type
- ✅ _extract_text_from_pdf(): Extracts text from PDF files using pypdf
- ✅ chunk_referentiel(): Delegates to specific chunking methods
- ✅ chunk_guide_mco(): Basic chunking for Guide Méthodologique MCO
- ✅ chunk_cim10(): Basic chunking for CIM-10 FR
- ✅ chunk_ccam(): Basic chunking for CCAM descriptive
- ⏳ build_index(): Placeholder (to be implemented in subtask 4.3)

Data Models
- ✅ Chunk: Represents a chunk of referentiel with metadata
- ✅ VectorIndex: Represents a vector index with metadata
- ✅ Uses existing ReferentielVersion from models.metadata

Unit Tests (tests/test_referentiels_manager.py)
- ✅ 15 tests covering all implemented functionality
- ✅ All tests passing
- ✅ Test coverage: 79% for referentiels_manager.py

Key Features:

SHA-256 Hashing: Every imported referentiel file is hashed with SHA-256, so each version is identified by its content
Text Extraction: Robust PDF text extraction with error handling
Version Caching: Imported versions are cached for quick retrieval
Flexible Chunking: Different chunking strategies for each referentiel type
Error Handling: Comprehensive error handling with logging
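
The hashing and version-caching features above can be sketched as follows. This is a minimal illustration, not the code in referentiels_manager.py: `VersionRecord`, `sha256_of_file`, and `register_import` are hypothetical names, and the fields of the real ReferentielVersion model are assumed rather than known.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class VersionRecord:
    """Hypothetical stand-in for ReferentielVersion (fields are assumed)."""
    referentiel_type: str
    file_hash: str
    imported_at: datetime


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Hash the file block by block so large PDFs are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def register_import(cache: dict, referentiel_type: str, path: Path) -> VersionRecord:
    """Store the hash and import date in a cache keyed by referentiel type."""
    record = VersionRecord(
        referentiel_type=referentiel_type,
        file_hash=sha256_of_file(path),
        imported_at=datetime.now(timezone.utc),
    )
    cache[referentiel_type] = record
    return record
```

Because the hash is computed from the file bytes, re-importing an unchanged PDF yields the same hash, while any change to the file yields a different one.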

Requirements Satisfied:

✅ Requirement 3.1: Maintain versioned copies of the CIM-10 PMSI referentiel with hash and import date
✅ Requirement 3.2: Maintain versioned copies of the CCAM PMSI referentiel with hash and import date
✅ Requirement 3.3: Maintain versioned copies of the Guide MCO with hash and import date
✅ Requirement 13.1: Generate a hash when ingesting new referentiel files

Remaining: Subtask 4.2 ⏳

To be implemented:

Intelligent Chunking for Guide MCO
- Parse chapter/section structure
- Preserve complete rules (exclusion rules, hierarchization rules)
- Extract eligibility criteria for DP/DAS
- Target: 500-1000 tokens per chunk with 100 token overlap

Intelligent Chunking for CIM-10
- Parse code blocks with inclusion/exclusion notes
- Separate vectorization for alphabetical indexes vs analytical codes
- Maintain natural language ↔ code links (e.g., "Gastrite" → "K29.7")
- Target: 300-600 tokens per chunk

Intelligent Chunking for CCAM
- Parse acts with ATIH extensions (7+3 character codes)
- Preserve technical notes and application conditions
- Vectorize alphabetical indexes for natural language search
- Target: 400-800 tokens per chunk
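
The token targets above imply a sliding window with overlap. The sketch below shows only that windowing step, under two assumptions: tokens are approximated by whitespace splitting (the real count depends on the embedding model's tokenizer), and `chunk_tokens` is a hypothetical helper, not a method of ReferentielsManager. It does not attempt the structure-aware parsing (rules, notes, code blocks) that subtask 4.2 calls for.

```python
def chunk_tokens(tokens, max_tokens=1000, overlap=100):
    """Split a token list into windows of at most max_tokens tokens.

    Consecutive windows share `overlap` tokens so that a sentence cut at a
    boundary still appears whole in at least one neighbouring chunk.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def chunk_text(text, max_tokens=1000, overlap=100):
    """Whitespace tokenization is an assumption made for illustration."""
    return [" ".join(c) for c in chunk_tokens(text.split(), max_tokens, overlap)]
```

For the Guide MCO target, `chunk_text(text, max_tokens=1000, overlap=100)` would apply; the CIM-10 and CCAM targets would use smaller `max_tokens` values.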

Requirements to satisfy:

⏳ Requirement 23.2: Chunk the Guide Méthodologique MCO into logical sections that preserve the context of the rules
⏳ Requirement 23.3: Chunk the CIM-10 FR while preserving inclusion/exclusion notes and blocks
⏳ Requirement 23.4: Chunk the CCAM descriptive while preserving ATIH extensions and technical notes

Remaining: Subtask 4.3 ⏳

To be implemented:

Embedding Model Integration
- Load French medical embedding model (CamemBERT-bio or DrBERT)
- Configure sentence-transformers
- Generate embeddings for chunks (768 dimensions)
- L2 normalization for cosine similarity

FAISS Index Creation
- Build HNSW (Hierarchical Navigable Small World) index
- Configure index parameters (M, efConstruction)
- Store index to disk
- Generate index hash for versioning

Alphabetical Index Vectorization
- Separate vectorization for alphabetical indexes
- Maintain bidirectional links (terms ↔ codes)
- Enable natural language search
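
The reason for L2 normalization before indexing can be shown with a small standard-library sketch: once vectors have unit length, their inner product equals their cosine similarity. The helper names are hypothetical, and the two-dimensional vectors stand in for the 768-dimensional embeddings mentioned above.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# The inner product of the normalized vectors matches the cosine similarity.
assert math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine_similarity(a, b))
```

For unit vectors the squared Euclidean distance is 2 − 2·cos, so an index that ranks by L2 distance and one that ranks by inner product return neighbours in the same order.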

Requirements to satisfy:

⏳ Requirement 23.1: Implement a RAG architecture for searching the referentiels
⏳ Requirement 23.5: Vectorize the alphabetical indexes in addition to the analytical codes
⏳ Requirement 27.1: Vectorize the CIM-10 and CCAM alphabetical indexes
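
The bidirectional term ↔ code links could be kept in two plain dictionaries built from alphabetical index lines. The sketch below is an assumption-laden illustration: the line format ("term followed by code"), the simplified CIM-10 code pattern, and the name `build_links` are all hypothetical, and real ATIH index layouts will likely need a more careful parser.

```python
import re
from collections import defaultdict

# Simplified CIM-10 pattern (assumed): a letter, two digits, and an optional
# decimal part such as "K29.7". PMSI-specific extensions are not handled.
CIM10_CODE = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,2})?\b")


def build_links(index_lines):
    """Return (term_to_codes, code_to_terms) from 'term ... code' lines."""
    term_to_codes = defaultdict(set)
    code_to_terms = defaultdict(set)
    for line in index_lines:
        codes = CIM10_CODE.findall(line)
        if not codes:
            continue  # skip lines that carry no code
        # What remains after removing the codes is treated as the term.
        term = CIM10_CODE.sub("", line).strip(" \t.,;:-").lower()
        if not term:
            continue
        for code in codes:
            term_to_codes[term].add(code)
            code_to_terms[code].add(term)
    return dict(term_to_codes), dict(code_to_terms)


terms, codes = build_links(["Gastrite K29.7"])
assert terms["gastrite"] == {"K29.7"}
assert codes["K29.7"] == {"gastrite"}
```

Keeping both directions lets a natural language hit on an index term resolve to its codes, and a code found in an analytical chunk resolve back to the terms that point to it.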

Optional Subtasks

Subtask 4.4 (Optional): Property Tests ⏳

Property tests to implement:

Property 8: Every referentiel must have a version, a hash, and an import date
Property 36: Every import must generate a SHA-256 hash
Property 46: Every chunk must preserve its context
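
As a rough idea of what a check for Property 36 might look like, the sketch below generates random byte strings with the standard library and asserts that each one yields a well-formed, deterministic SHA-256 digest. It is a hand-rolled stand-in: a property-testing library would normally generate and shrink the inputs, and `sha256_hex` and `check_hash_property` are hypothetical names that do not exercise the real import_referentiel() method.

```python
import hashlib
import random

HEX_DIGITS = set("0123456789abcdef")


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def check_hash_property(trials: int = 200, seed: int = 0) -> int:
    """Assert the hash property on `trials` random inputs; return the count."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        size = rng.randint(0, 512)
        data = bytes(rng.getrandbits(8) for _ in range(size))
        digest = sha256_hex(data)
        assert len(digest) == 64           # 256 bits as 64 hex characters
        assert set(digest) <= HEX_DIGITS   # lowercase hexadecimal only
        assert digest == sha256_hex(data)  # same input, same hash
    return trials


check_hash_property()
```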

Subtask 4.5 (Optional): Unit Tests for Chunking ⏳

Additional unit tests:

Test preservation of CIM-10 inclusion/exclusion notes
Test preservation of CCAM ATIH extensions
Test chunk size constraints
Test overlap behavior

Files Created/Modified

Created:

src/pipeline_mco_pmsi/rag/referentiels_manager.py (477 lines)
src/pipeline_mco_pmsi/rag/__init__.py
tests/test_referentiels_manager.py (260 lines)

Modified:

None (all new files)

Next Steps

Implement Subtask 4.2: Intelligent chunking with structure preservation
- Parse PDF structure more intelligently
- Implement rule/note detection
- Preserve semantic context

Implement Subtask 4.3: Vectorization and indexation
- Integrate sentence-transformers
- Build FAISS HNSW index
- Implement alphabetical index vectorization

Test with Real PDFs: Verify chunking quality with actual ATIH documents
- guide_methodo_mco_2026_version_provisoire.pdf
- cim-10-fr_2026_a_usage_pmsi_version_provisoire_111225.pdf
- actualisation_ccam_descriptive_a_usage_pmsi_v4_2025.pdf

Optional: Implement property-based tests for robustness

Notes

The current chunking implementation is basic (paragraph-based) and will need to be enhanced in subtask 4.2
index_hash holds a placeholder value ("0" * 64) until the index is actually built in subtask 4.3
All PDF files are available in the workspace root for testing
The implementation follows the design document specifications closely
