# Task 8.1 Summary: ClinicalFactsExtractor Implementation

## Overview

Implemented the `ClinicalFactsExtractor` class, which extracts structured clinical facts from medical documents, detects qualifiers, and associates each fact with its textual evidence.

## Implementation Details

### Files Created

1. **`src/pipeline_mco_pmsi/extractors/__init__.py`** - Module initialization
2. **`src/pipeline_mco_pmsi/extractors/clinical_facts_extractor.py`** - Main extractor implementation
3. **`tests/test_clinical_facts_extractor.py`** - Comprehensive unit tests

### Key Features Implemented

#### 1. Clinical Facts Extraction (`extract_facts()`)

- Extracts structured facts from clinical documents
- Supports multiple fact types:
  - **Diagnostics**: Diagnoses, conclusions, impressions
  - **Actes**: Medical procedures, interventions, surgeries
  - **Examens**: Tests, imaging, lab results
  - **Traitements**: Medications, prescriptions, therapies
- Associates each fact with precise textual evidence (document_id, span, text)
- Generates unique fact IDs using UUID

#### 2. Qualifier Detection (`detect_qualifiers()`)

- **Negation Detection**: Identifies negated facts using markers:
  - "pas de", "absence de", "sans", "aucun", "ni"
  - "exclu", "infirmé", "non retrouvé"
- **Suspicion Detection**: Identifies suspected or uncertain facts using markers:
  - "possible", "suspecté", "probable", "à confirmer"
  - "évocateur", "compatible avec", "hypothèse"
- **Priority System**: Negation takes priority over suspicion when both kinds of marker are present
- **Confidence Adjustment**: Sets the base confidence score from the qualifier:
  - Negated facts: confidence = 0.3
  - Suspected facts: confidence = 0.6
  - Affirmed facts: confidence = 1.0

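The priority rule and confidence values for qualifier detection can be sketched as follows. This is a simplified illustration, not the actual `detect_qualifiers()` implementation: the names `detect_qualifier_sketch`, `Qualifier`, and `_find_markers`, as well as the whole-word regex matching, are assumptions. Only the marker lists and the confidence values are taken from this summary.

```python
import re
from dataclasses import dataclass, field

# Marker lists and confidence values as listed in this summary.
NEGATION_MARKERS = ["pas de", "absence de", "sans", "aucun", "ni",
                    "exclu", "infirmé", "non retrouvé"]
SUSPICION_MARKERS = ["possible", "suspecté", "probable", "à confirmer",
                     "évocateur", "compatible avec", "hypothèse"]
CONFIDENCE = {"nié": 0.3, "suspecté": 0.6, "affirmé": 1.0}


@dataclass
class Qualifier:
    """Hypothetical container; the real model may differ."""
    certainty: str
    markers: list[str] = field(default_factory=list)
    confidence: float = 1.0


def _find_markers(context: str, markers: list[str]) -> list[str]:
    """Return the markers that occur as whole words in the context."""
    lowered = context.lower()
    # Whole-word matching (an assumption) keeps "ni" from firing inside "pneumonie".
    return [m for m in markers
            if re.search(rf"(?<!\w){re.escape(m)}(?!\w)", lowered)]


def detect_qualifier_sketch(context: str) -> Qualifier:
    """Negation is checked first, so it wins over suspicion."""
    negations = _find_markers(context, NEGATION_MARKERS)
    if negations:
        return Qualifier("nié", negations, CONFIDENCE["nié"])
    suspicions = _find_markers(context, SUSPICION_MARKERS)
    if suspicions:
        return Qualifier("suspecté", suspicions, CONFIDENCE["suspecté"])
    return Qualifier("affirmé", [], CONFIDENCE["affirmé"])
```

Checking negation before suspicion is the simplest way to encode the priority rule: a context such as "Pneumonie possible, pas de fièvre" is classified as "nié" in this sketch, because proximity to the fact is not considered here.
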
#### 3. Temporality Detection (`_detect_temporality()`)

- **Antécédents**: "antécédent", "ancien", "histoire de", "connu pour"
- **Chronique**: "chronique", "persistant", "au long cours", "de longue date"
- **Actuel**: Default temporality when no markers are detected

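A minimal sketch of the temporality lookup, assuming plain case-insensitive substring matching. The function name `detect_temporality_sketch` and the order in which the categories are checked (antécédent before chronique) are assumptions; the summary does not state how conflicts between the two are resolved.

```python
# Marker lists as given in this summary; dict order defines the check order (assumed).
TEMPORALITY_MARKERS = {
    "antécédent": ["antécédent", "ancien", "histoire de", "connu pour"],
    "chronique": ["chronique", "persistant", "au long cours", "de longue date"],
}


def detect_temporality_sketch(context: str) -> str:
    """Return the first temporality whose marker appears, else 'actuel'."""
    lowered = context.lower()
    for temporality, markers in TEMPORALITY_MARKERS.items():
        # Substring matching also catches inflected forms such as "antécédents".
        if any(marker in lowered for marker in markers):
            return temporality
    return "actuel"  # default when no marker is detected
```
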
#### 4. Evidence Association

- Each fact includes:
  - `document_id`: Source document identifier
  - `span`: Exact character positions (start, end)
  - `text`: Extracted text
  - `context`: Surrounding text (±50 characters)
- Enables full traceability and auditability

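The evidence fields can be illustrated with a short sketch that builds the ±50-character context around a span. `EvidenceSketch` and `build_evidence_sketch` are hypothetical names, and the `"..."` used to signal truncated context is an assumption about the ellipsis format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceSketch:
    """Hypothetical stand-in for the evidence model described above."""
    document_id: str
    span: tuple[int, int]
    text: str
    context: str


def build_evidence_sketch(document_id: str, document_text: str,
                          start: int, end: int, window: int = 50) -> EvidenceSketch:
    """Attach the exact span and its surrounding context to a fact."""
    # Clamp the window to the document boundaries.
    ctx_start = max(0, start - window)
    ctx_end = min(len(document_text), end + window)
    context = document_text[ctx_start:ctx_end]
    # Mark truncation on either side with an ellipsis (format assumed).
    if ctx_start > 0:
        context = "..." + context
    if ctx_end < len(document_text):
        context = context + "..."
    return EvidenceSketch(document_id, (start, end),
                          document_text[start:end], context)
```

Because `span` indexes directly into the source text, `document_text[start:end]` always reproduces `text`, which is what makes the evidence auditable.
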
#### 5. Confidence Calculation

- Base confidence from qualifier detection
- Adjusted for temporality:
  - Antécédents: ×0.9
  - Chronique: ×0.95
- Final confidence bounded to [0.0, 1.0]

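As a worked illustration of the confidence calculation, the sketch below multiplies the base confidence by the temporality factor and clamps the result. The function name `final_confidence_sketch` is hypothetical, and the factor of 1.0 for "actuel" (and for unknown values) is an assumption, since the summary lists adjustments only for antécédents and chronique.

```python
# Factors for antécédent and chronique come from this summary; 1.0 for "actuel" is assumed.
TEMPORALITY_FACTORS = {"antécédent": 0.9, "chronique": 0.95, "actuel": 1.0}


def final_confidence_sketch(base_confidence: float, temporality: str) -> float:
    """Scale the qualifier-based confidence and bound it to [0.0, 1.0]."""
    factor = TEMPORALITY_FACTORS.get(temporality, 1.0)
    return max(0.0, min(1.0, base_confidence * factor))
```

For example, a suspected fact (base 0.6) recorded as an antécédent ends up at roughly 0.6 × 0.9 = 0.54 under these assumptions.
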
### Technical Implementation

#### Pattern-Based Extraction

- Uses compiled regex patterns for performance
- Separate patterns for each fact type
- Case-insensitive matching
- Captures both structured (e.g., "Diagnostic: ...") and free-text mentions

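The pattern-based approach might look like the sketch below, which covers only structured mentions of the form "Label: value". The concrete regular expressions, the label keywords, the dictionary-based fact records, and the name `extract_structured_facts_sketch` are all assumptions made for illustration; the real patterns are not shown in this summary.

```python
import re
import uuid

# One precompiled, case-insensitive pattern per fact type (patterns are illustrative).
# The named group "fact" excludes surrounding whitespace and a trailing period,
# so its span points at exactly the captured text.
FACT_PATTERNS = {
    "diagnostic": re.compile(
        r"(?:diagnostic|conclusion|impression)[ \t]*:[ \t]*(?P<fact>[^\n.]*[^\s.])",
        re.IGNORECASE,
    ),
    "traitement": re.compile(
        r"(?:traitement|prescription)[ \t]*:[ \t]*(?P<fact>[^\n.]*[^\s.])",
        re.IGNORECASE,
    ),
}


def extract_structured_facts_sketch(document_id: str, text: str) -> list[dict]:
    """Collect one record per pattern match, with a UUID and an exact span."""
    facts = []
    for fact_type, pattern in FACT_PATTERNS.items():
        for match in pattern.finditer(text):
            facts.append({
                "id": str(uuid.uuid4()),       # unique fact ID
                "type": fact_type,
                "text": match.group("fact"),
                "document_id": document_id,
                "span": match.span("fact"),    # (start, end) in the source text
            })
    return facts
```

Compiling the patterns once at module level avoids recompiling them for every section, which is the performance benefit mentioned above.
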
#### Context Window Analysis

- 150-character window for qualifier detection
- Handles markers before, within, or after fact text
- Marker relevance check (max 50 characters distance)

#### Marker Relevance Algorithm

- Detects if marker is within the extracted fact text
- Checks proximity for markers before/after the fact
- Case-insensitive matching with fallback to first word

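One possible reading of the relevance check is sketched below. The name `is_marker_relevant_sketch`, the definition of distance as the character gap between marker and fact, the use of only the first occurrence of each, and the interpretation of "fallback to first word" (locating the fact by its first word when the full text is not found in the context) are all assumptions.

```python
def is_marker_relevant_sketch(context: str, fact_text: str, marker: str,
                              max_distance: int = 50) -> bool:
    """Decide whether a marker found in the context applies to the fact."""
    ctx, fact, mark = context.lower(), fact_text.lower(), marker.lower()

    # A marker contained in the fact text itself is always relevant.
    if mark in fact:
        return True

    marker_pos = ctx.find(mark)
    if marker_pos == -1:
        return False

    # Locate the fact; fall back to its first word if the full text is absent.
    fact_pos, fact_len = ctx.find(fact), len(fact)
    if fact_pos == -1:
        words = fact.split()
        if not words:
            return False
        fact_pos, fact_len = ctx.find(words[0]), len(words[0])
        if fact_pos == -1:
            return False

    marker_end = marker_pos + len(mark)
    fact_end = fact_pos + fact_len
    if marker_end <= fact_pos:        # marker before the fact
        distance = fact_pos - marker_end
    elif marker_pos >= fact_end:      # marker after the fact
        distance = marker_pos - fact_end
    else:                             # overlapping
        distance = 0
    return distance <= max_distance
```
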
## Test Coverage

### Unit Tests (35 tests, all passing)

1. **Qualifier Detection Tests** (8 tests)
   - Negation with various markers
   - Suspicion detection
   - Affirmation (no markers)
   - Priority handling

2. **Temporality Detection Tests** (6 tests)
   - Antécédent keywords
   - Chronique conditions
   - Default to "actuel"

3. **Fact Extraction Tests** (8 tests)
   - Extraction by fact type
   - Negation handling
   - Suspicion handling
   - Antécédent handling

4. **Stay-Level Extraction Tests** (2 tests)
   - Multi-section extraction
   - Document ID preservation

5. **Confidence Calculation Tests** (5 tests)
   - High confidence for affirmed facts
   - Reduced confidence for suspected/negated
   - Temporality adjustments
   - Bounds checking

6. **Context Extraction Tests** (4 tests)
   - Context window extraction
   - Start/end boundary handling
   - Ellipsis addition

7. **Marker Relevance Tests** (3 tests)
   - Close markers
   - Distant markers
   - Markers after facts

## Requirements Validated

### Requirement 6.2: Structured fact extraction ✓

- Extracts diagnostics, actes, examens, traitements
- Structured data with type, text, qualifier, temporality

### Requirement 6.3: Association with evidence ✓

- Each fact has evidence with document_id and span
- Exact character positions tracked

### Requirement 6.4: Qualifier assignment ✓

- All facts have qualifiers (affirmé/nié/suspecté)
- Markers detected and recorded

### Requirement 2.1: Negation detection ✓

- Negation markers detected
- Facts marked as "nié"
- Confidence reduced

### Requirement 2.2: Suspicion detection ✓

- Suspicion markers detected
- Facts marked as "suspecté"
- Confidence reduced

### Requirement 2.3: Temporality detection ✓

- Temporality markers detected
- Facts marked with temporality
- Confidence adjusted

## Integration Points

### Input

- `StructuredStay` from `DocumentProcessor`
- Contains segmented sections from clinical documents

### Output

- List of `ClinicalFact` objects
- Each with:
  - Unique ID
  - Type (diagnostic/acte/examen/traitement)
  - Text content
  - Qualifier (certainty, markers, confidence)
  - Temporality (actuel/antécédent/chronique)
  - Evidence (document_id, span, text, context)
  - Overall confidence score

### Next Steps

- Facts will be used by the `Codeur` to propose CIM-10/CCAM codes
- Evidence will support code justification
- Qualifiers will influence code selection (e.g., negated facts not coded)

## Performance Characteristics

- Compiled regex patterns for fast matching
- Single-pass extraction per section
- O(n) complexity where n = document length
- Minimal memory overhead (streaming processing)

## Code Quality

- Type hints throughout
- Comprehensive docstrings
- Immutable data models (Pydantic)
- 100% test pass rate
- Clear separation of concerns

## Example Usage

```python
from pipeline_mco_pmsi.extractors import ClinicalFactsExtractor
from pipeline_mco_pmsi.processors import DocumentProcessor

# Process documents
# (`documents` and `stay_metadata` are assumed to be supplied by the caller)
processor = DocumentProcessor()
structured_stay = processor.process_documents(documents, stay_metadata)

# Extract facts
extractor = ClinicalFactsExtractor()
facts = extractor.extract_facts(structured_stay)

# Analyze facts
for fact in facts:
    print(f"Type: {fact.type}")
    print(f"Text: {fact.text}")
    print(f"Certainty: {fact.qualifier.certainty}")
    print(f"Temporality: {fact.temporality}")
    print(f"Confidence: {fact.confidence}")
    print(f"Evidence: {fact.evidence.document_id} @ {fact.evidence.span}")
```

## Conclusion

Task 8.1 is complete. The `ClinicalFactsExtractor` extracts structured clinical facts with qualifier detection and evidence association, and meets all specified requirements.