Task 8.1 Summary: ClinicalFactsExtractor Implementation
Overview
Task 8.1 implemented the ClinicalFactsExtractor class, which extracts structured clinical facts from medical documents, detects qualifiers (negation, suspicion, temporality), and associates each fact with its textual evidence.
Implementation Details
Files Created
- src/pipeline_mco_pmsi/extractors/__init__.py - Module initialization
- src/pipeline_mco_pmsi/extractors/clinical_facts_extractor.py - Main extractor implementation
- tests/test_clinical_facts_extractor.py - Comprehensive unit tests
Key Features Implemented
1. Clinical Facts Extraction (extract_facts())
- Extracts structured facts from clinical documents
- Supports multiple fact types:
- Diagnostics: Diagnoses, conclusions, impressions
- Actes: Medical procedures, interventions, surgeries
- Examens: Tests, imaging, lab results
- Traitements: Medications, prescriptions, therapies
- Associates each fact with precise textual evidence (document_id, span, text)
- Generates unique fact IDs using UUID
2. Qualifier Detection (detect_qualifiers())
- Negation Detection: Identifies negated facts using markers:
- "pas de", "absence de", "sans", "aucun", "ni"
- "exclu", "infirmé", "non retrouvé"
- Suspicion Detection: Identifies suspected/uncertain facts:
- "possible", "suspecté", "probable", "à confirmer"
- "évocateur", "compatible avec", "hypothèse"
- Priority System: Negation takes priority over suspicion
- Confidence Adjustment: Reduces confidence scores based on qualifiers:
- Negated facts: confidence = 0.3
- Suspected facts: confidence = 0.6
- Affirmed facts: confidence = 1.0
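The priority rule and confidence values above can be sketched as follows. This is a minimal illustration, not the actual detect_qualifiers() implementation: the `Qualifier` tuple, the `_find_markers` helper, and the whole-word matching strategy are assumptions made for the example.

```python
import re
from typing import List, NamedTuple

# Marker lists and confidence values are taken from the summary above.
NEGATION_MARKERS = ["pas de", "absence de", "sans", "aucun", "ni",
                    "exclu", "infirmé", "non retrouvé"]
SUSPICION_MARKERS = ["possible", "suspecté", "probable", "à confirmer",
                     "évocateur", "compatible avec", "hypothèse"]
CONFIDENCE = {"nié": 0.3, "suspecté": 0.6, "affirmé": 1.0}


class Qualifier(NamedTuple):
    """Hypothetical result container; the real model may differ."""
    certainty: str
    markers: List[str]
    confidence: float


def _find_markers(text: str, markers: List[str]) -> List[str]:
    """Return the markers present in text as whole words, ignoring case."""
    found = []
    for marker in markers:
        # Lookarounds keep "ni" from matching inside words such as "pneumonie".
        pattern = r"(?<!\w)" + re.escape(marker) + r"(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(marker)
    return found


def detect_qualifiers(context: str) -> Qualifier:
    """Classify a context window; negation is checked before suspicion."""
    negations = _find_markers(context, NEGATION_MARKERS)
    if negations:
        return Qualifier("nié", negations, CONFIDENCE["nié"])
    suspicions = _find_markers(context, SUSPICION_MARKERS)
    if suspicions:
        return Qualifier("suspecté", suspicions, CONFIDENCE["suspecté"])
    return Qualifier("affirmé", [], CONFIDENCE["affirmé"])
```

Because negation is tested first, a context such as "Pas de pneumonie possible" is classified as "nié" even though a suspicion marker is also present.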
3. Temporality Detection (_detect_temporality())
- Antécédents: "antécédent", "ancien", "histoire de", "connu pour"
- Chronique: "chronique", "persistant", "au long cours", "de longue date"
- Actuel: Default temporality when no markers detected
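A keyword lookup of this kind might look like the sketch below. The function name `detect_temporality`, the plain substring matching, and the order in which the two categories are checked are assumptions; the summary does not say which category wins when both kinds of marker appear.

```python
# Marker lists are taken from the summary above.
TEMPORALITY_MARKERS = {
    "antécédent": ["antécédent", "ancien", "histoire de", "connu pour"],
    "chronique": ["chronique", "persistant", "au long cours", "de longue date"],
}


def detect_temporality(context: str) -> str:
    """Return the first temporality whose marker occurs in the context."""
    lowered = context.lower()
    # Assumption: categories are checked in dictionary order (antécédent first).
    for label, markers in TEMPORALITY_MARKERS.items():
        if any(marker in lowered for marker in markers):
            return label
    # No marker found: fall back to the default temporality.
    return "actuel"
```

Substring matching lets "antécédents" (plural) match the "antécédent" marker without listing every inflection.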
4. Evidence Association
- Each fact includes:
- document_id: Source document identifier
- span: Exact character positions (start, end)
- text: Extracted text
- context: Surrounding text (±50 characters)
- Enables full traceability and auditability
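As an illustration of how such an evidence record could be assembled, the sketch below clamps the ±50-character window to the document boundaries and adds an ellipsis where the context was cut, as the context extraction tests suggest. The `Evidence` dataclass and the `build_evidence` helper are hypothetical names, not the project's API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Evidence:
    """Hypothetical stand-in for the project's evidence model."""
    document_id: str
    span: Tuple[int, int]
    text: str
    context: str


def build_evidence(document_id: str, document_text: str,
                   start: int, end: int, window: int = 50) -> Evidence:
    """Slice the fact text and a surrounding context window out of a document."""
    # Clamp the window so it never runs past either end of the document.
    ctx_start = max(0, start - window)
    ctx_end = min(len(document_text), end + window)
    context = document_text[ctx_start:ctx_end]
    # Mark truncation with an ellipsis on whichever side was cut.
    if ctx_start > 0:
        context = "..." + context
    if ctx_end < len(document_text):
        context = context + "..."
    return Evidence(document_id, (start, end), document_text[start:end], context)
```

Because the span indexes the original document text, `document_text[start:end]` always reproduces the stored `text`, which is what makes a fact auditable.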
5. Confidence Calculation
- Base confidence from qualifier detection
- Adjusted for temporality:
- Antécédents: ×0.9
- Chronique: ×0.95
- Final confidence bounded to [0.0, 1.0]
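The calculation reduces to one multiplication followed by a clamp. In the sketch below, the function name `compute_confidence` is hypothetical, and the neutral factor of 1.0 for "actuel" is an assumption (the summary lists no adjustment for it).

```python
# Multipliers for antécédent and chronique come from the summary above.
TEMPORALITY_FACTORS = {"antécédent": 0.9, "chronique": 0.95, "actuel": 1.0}


def compute_confidence(base_confidence: float, temporality: str) -> float:
    """Scale the qualifier confidence by temporality, then clamp to [0.0, 1.0]."""
    # Unknown temporalities are left unadjusted (assumption).
    factor = TEMPORALITY_FACTORS.get(temporality, 1.0)
    return max(0.0, min(1.0, base_confidence * factor))
```

For example, a suspected fact (0.6) recorded as an antécédent (×0.9) ends up with a confidence of about 0.54.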
Technical Implementation
Pattern-Based Extraction
- Uses compiled regex patterns for performance
- Separate patterns for each fact type
- Case-insensitive matching
- Captures both structured (e.g., "Diagnostic: ...") and free-text mentions
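The sketch below shows what per-type compiled patterns could look like for the structured "Label: value" form only; free-text mentions are not covered. The keyword lists inside each pattern and the `extract_raw_facts` helper are illustrative assumptions, not the extractor's real patterns.

```python
import re
import uuid
from typing import Dict, List

# Value part: text after "label:" up to a period, semicolon or newline,
# ending on a non-space character so the span carries no trailing whitespace.
_VALUE = r"\s*:\s*(?P<fact>[^\n.;]*[^\s.;])"

# Hypothetical keyword lists, one compiled pattern per fact type.
FACT_PATTERNS = {
    "diagnostic": re.compile(r"\b(?:diagnostic|conclusion|impression)" + _VALUE, re.IGNORECASE),
    "acte": re.compile(r"\b(?:acte|intervention|chirurgie)" + _VALUE, re.IGNORECASE),
    "examen": re.compile(r"\b(?:examen|imagerie|biologie)" + _VALUE, re.IGNORECASE),
    "traitement": re.compile(r"\b(?:traitement|prescription|médicament)" + _VALUE, re.IGNORECASE),
}


def extract_raw_facts(document_id: str, text: str) -> List[Dict]:
    """Run every pattern over the text and record type, text and span."""
    facts = []
    for fact_type, pattern in FACT_PATTERNS.items():
        for match in pattern.finditer(text):
            facts.append({
                "id": str(uuid.uuid4()),      # unique fact ID
                "type": fact_type,
                "text": match.group("fact"),
                "document_id": document_id,
                "span": match.span("fact"),   # character offsets in the text
            })
    return facts
```

Compiling the patterns once at import time avoids recompiling them for every section processed.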
Context Window Analysis
- 150-character window for qualifier detection
- Handles markers before, within, or after fact text
- Marker relevance check (max 50 characters distance)
Marker Relevance Algorithm
- Detects if marker is within the extracted fact text
- Checks proximity for markers before/after the fact
- Case-insensitive matching with fallback to first word
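One way to read the three rules above is sketched below. The function name `is_marker_relevant`, the exact distance measure, and the interpretation of "fallback to first word" (locating the fact by its first word when the full text is not found in the context) are assumptions. The sketch uses plain substring search, so word-boundary handling is left out.

```python
def is_marker_relevant(marker: str, fact_text: str, context: str,
                       max_distance: int = 50) -> bool:
    """Decide whether a marker found in the context applies to the fact."""
    ctx, fact, mk = context.lower(), fact_text.lower(), marker.lower()

    # Rule 1: a marker inside the extracted fact text is always relevant.
    if mk in fact:
        return True

    # Locate the fact in the context, falling back to its first word.
    fact_pos, fact_len = ctx.find(fact), len(fact)
    if fact_pos == -1 and fact.split():
        first_word = fact.split()[0]
        fact_pos, fact_len = ctx.find(first_word), len(first_word)
    if fact_pos == -1:
        return False
    fact_end = fact_pos + fact_len

    # Rule 2: check every occurrence of the marker before or after the fact.
    pos = ctx.find(mk)
    while pos != -1:
        marker_end = pos + len(mk)
        if marker_end <= fact_pos:
            distance = fact_pos - marker_end   # marker precedes the fact
        elif pos >= fact_end:
            distance = pos - fact_end          # marker follows the fact
        else:
            distance = 0                       # marker overlaps the fact
        if distance <= max_distance:
            return True
        pos = ctx.find(mk, pos + 1)
    return False
```

The distance check keeps a negation in an earlier sentence from being attached to a fact that only appears much later in the same 150-character window.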
Test Coverage
Unit Tests (35 tests, all passing)
Qualifier Detection Tests (8 tests)
- Negation with various markers
- Suspicion detection
- Affirmation (no markers)
- Priority handling
Temporality Detection Tests (6 tests)
- Antécédent keywords
- Chronique conditions
- Default to "actuel"
Fact Extraction Tests (8 tests)
- Extraction by fact type
- Negation handling
- Suspicion handling
- Antécédent handling
Stay-Level Extraction Tests (2 tests)
- Multi-section extraction
- Document ID preservation
Confidence Calculation Tests (5 tests)
- High confidence for affirmed facts
- Reduced confidence for suspected/negated
- Temporality adjustments
- Bounds checking
Context Extraction Tests (4 tests)
- Context window extraction
- Start/end boundary handling
- Ellipsis addition
Marker Relevance Tests (3 tests)
- Close markers
- Distant markers
- Markers after facts
Requirements Validated
Requirement 6.2: Structured fact extraction ✓
- Extracts diagnostics, actes, examens, traitements
- Structured data with type, text, qualifier, temporality
Requirement 6.3: Association with evidence ✓
- Each fact has evidence with document_id and span
- Exact character positions tracked
Requirement 6.4: Qualifier assignment ✓
- All facts have qualifiers (affirmé/nié/suspecté)
- Markers detected and recorded
Requirement 2.1: Negation detection ✓
- Negation markers detected
- Facts marked as "nié"
- Confidence reduced
Requirement 2.2: Suspicion detection ✓
- Suspicion markers detected
- Facts marked as "suspecté"
- Confidence reduced
Requirement 2.3: Temporality detection ✓
- Temporality markers detected
- Facts marked with temporality
- Confidence adjusted
Integration Points
Input
- StructuredStay from DocumentProcessor: contains segmented sections from clinical documents
Output
- List of ClinicalFact objects, each with:
- Unique ID
- Type (diagnostic/acte/examen/traitement)
- Text content
- Qualifier (certainty, markers, confidence)
- Temporality (actuel/antécédent/chronique)
- Evidence (document_id, span, text, context)
- Overall confidence score
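The output shape listed above can be pictured with the following sketch. It uses frozen standard-library dataclasses as a stand-in for the project's immutable Pydantic models; the field names follow the attributes used in the usage example (fact.qualifier.certainty, fact.evidence.span), but the class definitions themselves are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class Qualifier:
    certainty: str              # "affirmé", "nié" or "suspecté"
    markers: Tuple[str, ...]    # markers that triggered the qualifier
    confidence: float


@dataclass(frozen=True)
class Evidence:
    document_id: str
    span: Tuple[int, int]       # (start, end) character offsets
    text: str
    context: str


@dataclass(frozen=True)
class ClinicalFact:
    type: str                   # "diagnostic", "acte", "examen" or "traitement"
    text: str
    qualifier: Qualifier
    temporality: str            # "actuel", "antécédent" or "chronique"
    evidence: Evidence
    confidence: float
    # Unique ID generated at construction time.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


fact = ClinicalFact(
    type="diagnostic",
    text="pneumonie",
    qualifier=Qualifier("nié", ("pas de",), 0.3),
    temporality="actuel",
    evidence=Evidence("doc-1", (7, 16), "pneumonie", "Pas de pneumonie"),
    confidence=0.3,
)
```

Freezing the models means a fact cannot be altered after extraction, so the evidence it carries stays consistent with the source document.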
Next Steps
- Facts will be used by the Codeur to propose CIM-10/CCAM codes
- Evidence will support code justification
- Qualifiers will influence code selection (e.g., negated facts not coded)
Performance Characteristics
- Compiled regex patterns for fast matching
- Single-pass extraction per section
- O(n) complexity where n = document length
- Minimal memory overhead (streaming processing)
Code Quality
- Type hints throughout
- Comprehensive docstrings
- Immutable data models (Pydantic)
- 100% test pass rate
- Clear separation of concerns
Example Usage
from pipeline_mco_pmsi.extractors import ClinicalFactsExtractor
from pipeline_mco_pmsi.processors import DocumentProcessor
# Process documents
processor = DocumentProcessor()
structured_stay = processor.process_documents(documents, stay_metadata)
# Extract facts
extractor = ClinicalFactsExtractor()
facts = extractor.extract_facts(structured_stay)
# Analyze facts
for fact in facts:
    print(f"Type: {fact.type}")
    print(f"Text: {fact.text}")
    print(f"Certainty: {fact.qualifier.certainty}")
    print(f"Temporality: {fact.temporality}")
    print(f"Confidence: {fact.confidence}")
    print(f"Evidence: {fact.evidence.document_id} @ {fact.evidence.span}")
Conclusion
Task 8.1 is complete. The ClinicalFactsExtractor successfully extracts structured clinical facts with comprehensive qualifier detection and evidence association, meeting all specified requirements.