EDS-NLP Integration - Implementation Status
Date
February 13, 2026
Overview
Professional integration of EDS-NLP to improve clinical fact extraction in the MCO PMSI pipeline.
Completed Tasks ✅
Task 1: Set up EDS-NLP dependencies and configuration ✅
Status: COMPLETED
Deliverables:
- ✅ Added edsnlp>=0.10.0 to pyproject.toml dependencies
- ✅ Created src/pipeline_mco_pmsi/extractors/edsnlp_config.py (200+ lines)
  - EDSNLPConfig dataclass with 30+ configuration parameters
  - Model configuration, component toggles, performance tuning
  - Fallback configuration, timeout settings
  - Entity extraction configuration
  - Helper methods: get_enabled_components(), should_extract_entity_type(), from_yaml(), to_dict()
- ✅ Created config/edsnlp_config.yaml with default settings
  - All EDS-NLP components enabled by default
  - Performance optimizations configured
  - Fallback mechanism configured (3 failures, 5 min cooldown)
- ✅ Created config/medical_abbreviations.json (200+ abbreviations)
  - Comprehensive French medical abbreviations dictionary
  - Categories: diseases, medications, procedures, lab tests, scores, etc.
- ✅ Updated .gitignore to exclude spaCy models
Validates: Requirements 1.1
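The configuration dataclass described above might look roughly like the following. This is a minimal sketch, not the contents of edsnlp_config.py: the field names (batch_size, processing_timeout_s, components, entity_types) and the component names are assumptions chosen to match the defaults listed in this document, and from_yaml() is left out so the example has no third-party dependency.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class EDSNLPConfig:
    """Illustrative subset of the configuration; field names are assumptions."""

    model_name: str = "fr_core_news_sm"
    batch_size: int = 32
    processing_timeout_s: float = 30.0
    fallback_enabled: bool = True
    max_failures: int = 3
    cooldown_s: float = 300.0
    # Component toggles: component name -> enabled flag (names are hypothetical).
    components: Dict[str, bool] = field(
        default_factory=lambda: {"sentences": True, "negation": True, "family": True}
    )
    # Entity types to extract; an empty list means "extract every type".
    entity_types: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject values that would make the pipeline unusable.
        if self.batch_size <= 0:
            raise ValueError("batch_size must be positive")
        if self.processing_timeout_s <= 0:
            raise ValueError("processing_timeout_s must be positive")

    def get_enabled_components(self) -> List[str]:
        return [name for name, enabled in self.components.items() if enabled]

    def should_extract_entity_type(self, entity_type: str) -> bool:
        return not self.entity_types or entity_type in self.entity_types

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


config = EDSNLPConfig(components={"sentences": True, "negation": False})
print(config.get_enabled_components())  # ['sentences']
```

Validating in __post_init__() means a bad YAML value fails at load time rather than in the middle of a document batch.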
Task 2: Implement custom exceptions for EDS-NLP integration ✅
Status: COMPLETED
Deliverables:
- ✅ Created src/pipeline_mco_pmsi/extractors/edsnlp_exceptions.py (250+ lines)
  - EDSNLPError - Base exception with details dict and to_dict() method
  - PipelineInitializationError - For pipeline loading failures
  - EDSNLPProcessingError - For document processing failures
  - NormalizationError - For term normalization failures
  - EDSNLPTimeoutError - For processing timeouts
  - EDSNLPConfigurationError - For invalid configuration
  - All exceptions include detailed context (model_name, document_id, original_error, etc.)
Validates: Requirements 7.1, 7.2
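To illustrate the pattern of a base exception carrying a details dict and a to_dict() method, here is a minimal sketch. The class names come from the list above; the constructor signature, the keys returned by to_dict(), and the choice to derive every subclass directly from EDSNLPError are assumptions, not a copy of edsnlp_exceptions.py.

```python
from typing import Any, Dict, Optional


class EDSNLPError(Exception):
    """Base exception; carries structured context for logging."""

    def __init__(self, message: str, details: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.message = message
        # Copy so later mutation by the caller does not alter the exception.
        self.details: Dict[str, Any] = dict(details or {})

    def to_dict(self) -> Dict[str, Any]:
        return {
            "error_type": type(self).__name__,
            "message": self.message,
            "details": self.details,
        }


class PipelineInitializationError(EDSNLPError):
    """Raised when the pipeline cannot be loaded."""


class EDSNLPTimeoutError(EDSNLPError):
    """Raised when processing exceeds the configured timeout."""


try:
    raise PipelineInitializationError(
        "could not load pipeline", {"model_name": "fr_core_news_sm"}
    )
except EDSNLPError as exc:
    payload = exc.to_dict()

print(payload["error_type"])  # PipelineInitializationError
```

Catching the base class lets an orchestrator handle every integration failure in one place while still logging the specific type.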
Task 3: Implement supporting data structures ✅
Status: COMPLETED
Deliverables:
- ✅ Created src/pipeline_mco_pmsi/extractors/edsnlp_types.py (450+ lines)
  - Span - Text span with validation, overlap/contains methods
  - QualifierResult - Qualifiers with confidence, cues, spans
    - Methods: has_any_qualifier(), should_exclude_from_coding(), to_dict()
  - ExtractedEntity - Medical entity with type, span, qualifiers, normalized text
    - Methods: should_include_in_coding(), get_adjusted_confidence(), to_dict()
  - Sentence - Sentence with propositions
    - Methods: has_propositions(), to_dict()
  - NormalizedTerm - Original + normalized with steps
    - Methods: was_modified(), to_dict()
  - ProcessingResult - Document processing result with entities, sentences, metadata
    - Methods: get_entity_count_by_type(), get_entities_for_coding(), to_dict()
  - ExtractionResult - High-level extraction result
    - Methods: was_successful(), to_dict()
- All dataclasses include validation in __post_init__()
- All include comprehensive to_dict() methods for serialization
Validates: Requirements 2.6, 3.6, 4.3, 5.6
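As an example of the validation-plus-helpers pattern these data structures follow, a Span could be sketched as below. The field names, the half-open [start, end) offset convention, and the method names overlaps() and contains() are assumptions; the actual definitions live in edsnlp_types.py.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Span:
    """Character span using half-open offsets [start, end) (an assumption)."""

    start: int
    end: int
    text: str = ""

    def __post_init__(self) -> None:
        # Validation runs on construction, so an invalid span can never exist.
        if self.start < 0 or self.end < self.start:
            raise ValueError(f"invalid span: start={self.start}, end={self.end}")

    def overlaps(self, other: "Span") -> bool:
        # Two half-open intervals overlap when each starts before the other ends.
        return self.start < other.end and other.start < self.end

    def contains(self, other: "Span") -> bool:
        return self.start <= other.start and other.end <= self.end

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)


sentence = Span(0, 40, "Pas de fièvre. Antécédent d'HTA connu.")
entity = Span(28, 31, "HTA")
print(sentence.contains(entity))  # True
```

With half-open offsets, adjacent spans such as (0, 5) and (5, 10) do not overlap, which matches how Python slicing treats text[start:end].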
Summary of Completed Work
Files Created (7 files)
- src/pipeline_mco_pmsi/extractors/edsnlp_config.py - Configuration dataclass
- src/pipeline_mco_pmsi/extractors/edsnlp_exceptions.py - Exception hierarchy
- src/pipeline_mco_pmsi/extractors/edsnlp_types.py - Data structures
- config/edsnlp_config.yaml - Default configuration
- config/medical_abbreviations.json - Abbreviations dictionary
- EDSNLP_INTEGRATION_STATUS.md - This status document
Files Modified (2 files)
- pyproject.toml - Added edsnlp dependency
- .gitignore - Excluded spaCy models
Lines of Code Written
- Configuration: ~200 lines
- Exceptions: ~250 lines
- Data structures: ~450 lines
- Config files: ~250 lines
- Total: ~1150 lines of production code
Quality Metrics
- ✅ All code includes comprehensive docstrings
- ✅ Type hints on all functions and methods
- ✅ Input validation in all dataclasses
- ✅ Error handling with detailed context
- ✅ Serialization methods for all data structures
- ✅ Helper methods for common operations
- ✅ Professional code structure and organization
Remaining Tasks (22 tasks)
Phase 2: Core Components (Tasks 4-13)
- 4. Extend existing data models with EDS-NLP fields
- 4.1 Update ClinicalFact model
- 4.2 Update Qualifier model
- * 4.3 Write unit tests for model extensions
- * 4.4 Write property test for JSON serialization
- 5. Implement ClinicalTermNormalizer
- 5.1 Create normalizer class
- * 5.2 Write unit tests
- * 5.3 Write property test
- 6. Implement EDSNLPProcessor core functionality
- 6.1 Create processor class with pipeline loading
- * 6.2 Write unit tests for pipeline initialization
- 7. Implement entity extraction in EDSNLPProcessor
- 7.1 Add extract_entities() method
- * 7.2 Write unit tests
- * 7.3 Write property test
- 8. Implement qualifier detection in EDSNLPProcessor
- 8.1 Add detect_qualifiers() method
- * 8.2 Write unit tests
- * 8.3 Write property test
- 9. Implement document processing and segmentation
- 9.1 Add process_document() method
- * 9.2 Write unit tests
- * 9.3-9.5 Write property tests
- 10. Implement batch processing
- * 10.1 Write unit tests
- 11. Implement qualifier-to-model mapping
- 11.1 Add _map_qualifier_to_model() method
- * 11.2 Write property test
- 12. Implement confidence score calculation
- 13. Implement family context exclusion logic
- * 13.1 Write property test
Phase 3: Integration (Tasks 14-20)
- 14. Checkpoint - Ensure EDSNLPProcessor tests pass
- 15. Implement ExtractionOrchestrator
- 15.1 Create orchestrator class
- * 15.2 Write unit tests
- * 15.3 Write property test
- 16. Implement logging and metrics
- 16.1-16.3 Add structured logging
- * 16.4 Write unit tests
- 17. Integrate with existing ClinicalFactsExtractor
- 17.1 Refactor ClinicalFactsExtractor
- * 17.2 Write unit tests
- * 17.3 Write property test
- 18. Implement RAG integration with normalized terms
- * 18.1 Write property test
- 19. Implement pipeline reuse optimization
- * 19.1 Write property test
- 20. Checkpoint - Ensure integration tests pass
Phase 4: Testing & Documentation (Tasks 21-25)
- 21. Write integration tests
- * 21.1 End-to-end extraction test
- * 21.2 Fallback mechanism test
- * 21.3 Backward compatibility test
- * 22. Write property test for empty document handling
- * 23. Write performance tests
- 24. Create documentation
- 24.1-24.6 Architecture, configuration, usage, troubleshooting docs
- 25. Final checkpoint - Complete system validation
Next Steps
Immediate Priority (Phase 2)
- Task 4: Extend ClinicalFact and Qualifier models with optional EDS-NLP fields
- Task 5: Implement ClinicalTermNormalizer for term normalization
- Task 6: Implement EDSNLPProcessor with pipeline loading and caching
- Task 7: Add entity extraction to EDSNLPProcessor
- Task 8: Add qualifier detection to EDSNLPProcessor
Implementation Strategy
Each task follows the same workflow:
- Mark task as "in_progress" using taskStatus tool
- Implement the code with comprehensive docstrings and type hints
- Include input validation and error handling
- Add helper methods for common operations
- Mark task as "completed" using taskStatus tool
- Move to next task
Testing Strategy
- Unit tests for specific scenarios and edge cases
- Property tests (marked with *) for universal correctness
- Integration tests for component interactions
- Performance tests for non-functional requirements
Architecture Overview
ClinicalFactsExtractor (existing API maintained)
↓
ExtractionOrchestrator (new - to be implemented)
├─→ EDSNLPProcessor (new - to be implemented)
│ ├─ spaCy pipeline with EDS-NLP components
│ ├─ ClinicalTermNormalizer (new - to be implemented)
│ ├─ Entity extraction
│ └─ Qualifier detection
└─→ RegexFallbackExtractor (existing - fallback)
Key Design Decisions
- Backward Compatibility: All new fields in ClinicalFact and Qualifier are optional
- Graceful Degradation: Automatic fallback to regex on EDS-NLP failures
- Performance: Pipeline caching, batch processing, lazy loading
- Robustness: Comprehensive error handling with detailed logging
- Testability: Clear separation of concerns, dependency injection
- Extensibility: Component-level configuration, modular architecture
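The graceful-degradation decision, combined with the configured threshold of 3 failures and a 5 minute cooldown, can be sketched as a small circuit breaker around two extractors. Everything here is hypothetical: the class name FallbackOrchestrator, the callable-based interface, and the choice to count consecutive failures (the document only says "3 failures") are assumptions, not the planned ExtractionOrchestrator.

```python
import time
from typing import Callable, List, Optional

Extractor = Callable[[str], List[str]]


class FallbackOrchestrator:
    """Try the primary extractor; fall back after errors or during cooldown."""

    def __init__(
        self,
        primary: Extractor,
        fallback: Extractor,
        max_failures: int = 3,
        cooldown_s: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._primary = primary
        self._fallback = fallback
        self._max_failures = max_failures
        self._cooldown_s = cooldown_s
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._disabled_until: Optional[float] = None

    def extract(self, text: str) -> List[str]:
        now = self._clock()
        if self._disabled_until is not None:
            if now < self._disabled_until:
                # Still cooling down: skip the primary extractor entirely.
                return self._fallback(text)
            # Cooldown elapsed: give the primary extractor another chance.
            self._disabled_until = None
            self._failures = 0
        try:
            result = self._primary(text)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._disabled_until = now + self._cooldown_s
            return self._fallback(text)
        self._failures = 0  # a success resets the consecutive-failure count
        return result


def failing_primary(text: str) -> List[str]:
    raise RuntimeError("pipeline unavailable")


orchestrator = FallbackOrchestrator(failing_primary, lambda text: ["regex-fact"])
print(orchestrator.extract("HTA connue"))  # ['regex-fact']
```

Injecting the clock keeps the cooldown logic testable without sleeping for five minutes.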
Configuration
Default Configuration (config/edsnlp_config.yaml)
- Model: fr_core_news_sm
- All components enabled
- Pipeline caching enabled
- Batch size: 32
- Fallback enabled (3 failures → 5min cooldown)
- Processing timeout: 30s
- Normalization enabled
Medical Abbreviations (config/medical_abbreviations.json)
- 200+ French medical abbreviations
- Categories: diseases, medications, procedures, lab tests, vital signs, scores
- Examples: avc→accident vasculaire cérébral, hta→hypertension artérielle
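The two example entries above are enough to sketch how abbreviation expansion could work. This is an illustration only: it assumes a flat mapping with lowercase keys, whereas the real medical_abbreviations.json is organized by category and its exact layout is not shown here; the function name expand_abbreviations is hypothetical.

```python
import re
from typing import Dict

# Flat, lowercase-keyed mapping (an assumption about the loaded data).
ABBREVIATIONS: Dict[str, str] = {
    "avc": "accident vasculaire cérébral",
    "hta": "hypertension artérielle",
}


def expand_abbreviations(text: str, abbreviations: Dict[str, str]) -> str:
    """Replace whole-word abbreviations, ignoring case."""
    if not abbreviations:
        return text
    # Longest keys first so a longer abbreviation wins over its own prefix.
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(key) for key in keys) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda match: abbreviations[match.group(0).lower()], text)


print(expand_abbreviations("Patient avec HTA et antécédent d'AVC.", ABBREVIATIONS))
# Patient avec hypertension artérielle et antécédent d'accident vasculaire cérébral.
```

The word-boundary anchors keep the pattern from firing inside longer tokens, which matters for short abbreviations that are also substrings of ordinary words.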
Quality Assurance
Code Quality
- ✅ Comprehensive docstrings (Google style)
- ✅ Type hints on all functions
- ✅ Input validation in dataclasses
- ✅ Error handling with context
- ✅ Serialization methods
- ✅ Helper methods for common operations
Testing Coverage (Planned)
- 16 property-based tests (Hypothesis, 100 iterations each)
- 40+ unit tests for specific scenarios
- 10+ integration tests
- 5+ performance tests
Documentation (Planned)
- Architecture documentation
- Configuration guide
- Usage examples
- Troubleshooting guide
- Migration guide
Metrics to Track
- edsnlp.extraction.success - Successful extractions
- edsnlp.extraction.failure - Failed extractions
- edsnlp.extraction.fallback - Fallback activations
- edsnlp.processing.time - Processing time histogram
- edsnlp.entities.extracted - Entities by type
- edsnlp.qualifiers.detected - Qualifiers by type
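One way these metrics could be recorded is with a small in-memory collector, sketched below. The metric names are the ones listed above; the InMemoryMetrics class, its increment() and observe() methods, and the single-tag design are assumptions, and a real deployment would more likely forward to whatever metrics backend the pipeline already uses.

```python
import time
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple


class InMemoryMetrics:
    """Hypothetical collector: counters keyed by (name, tag), raw timing samples."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()
        self.timings: DefaultDict[str, List[float]] = defaultdict(list)

    def increment(self, name: str, tag: str = "", value: int = 1) -> None:
        key: Tuple[str, str] = (name, tag)
        self.counters[key] += value

    def observe(self, name: str, seconds: float) -> None:
        # Raw samples are kept; histogram bucketing is left to the backend.
        self.timings[name].append(seconds)


metrics = InMemoryMetrics()
start = time.perf_counter()
entity_types = ["diagnosis", "diagnosis", "medication"]  # stand-in for real output
metrics.observe("edsnlp.processing.time", time.perf_counter() - start)
metrics.increment("edsnlp.extraction.success")
for entity_type in entity_types:
    metrics.increment("edsnlp.entities.extracted", tag=entity_type)

print(metrics.counters[("edsnlp.entities.extracted", "diagnosis")])  # 2
```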
Performance Targets
- Pipeline loading: < 2 seconds (first load only)
- Document processing: < 500ms per document (average)
- Batch processing: > 10 documents/second
- Memory usage: < 500MB for pipeline instance
- Fallback overhead: < 50ms additional latency
Conclusion
Phase 1 (Setup & Infrastructure) is COMPLETE: 3 tasks done and roughly 1150 lines written across configuration, exceptions, data structures, and config files.
The integration is well-architected with:
- Comprehensive configuration system
- Robust error handling
- Rich data structures with validation
- Clear separation of concerns
- Extensible design
Next: Continue with Phase 2 (Core Components) to implement the actual EDS-NLP processing logic.