Initial commit

2026-03-05 01:20:14 +01:00
commit 2163e574c1
184 changed files with 354881 additions and 0 deletions
--- a/TASK_8.1_SUMMARY.md
+++ b/TASK_8.1_SUMMARY.md
@@ -0,0 +1,205 @@
+# Task 8.1 Summary: ClinicalFactsExtractor Implementation
+
+## Overview
+Successfully implemented the `ClinicalFactsExtractor` class for extracting structured clinical facts from medical documents with qualifier detection and evidence association.
+
+## Implementation Details
+
+### Files Created
+1. **src/pipeline_mco_pmsi/extractors/__init__.py** - Module initialization
+2. **src/pipeline_mco_pmsi/extractors/clinical_facts_extractor.py** - Main extractor implementation
+3. **tests/test_clinical_facts_extractor.py** - Comprehensive unit tests
+
+### Key Features Implemented
+
+#### 1. Clinical Facts Extraction (`extract_facts()`)
+- Extracts structured facts from clinical documents
+- Supports multiple fact types:
+  - **Diagnostics**: Diagnoses, conclusions, impressions
+  - **Actes**: Medical procedures, interventions, surgeries
+  - **Examens**: Tests, imaging, lab results
+  - **Traitements**: Medications, prescriptions, therapies
+- Associates each fact with precise textual evidence (document_id, span, text)
+- Generates unique fact IDs using UUID
+
+#### 2. Qualifier Detection (`detect_qualifiers()`)
+- **Negation Detection**: Identifies negated facts using markers:
+  - "pas de", "absence de", "sans", "aucun", "ni"
+  - "exclu", "infirmé", "non retrouvé"
+- **Suspicion Detection**: Identifies suspected/uncertain facts:
+  - "possible", "suspecté", "probable", "à confirmer"
+  - "évocateur", "compatible avec", "hypothèse"
+- **Priority System**: Negation takes priority over suspicion
+- **Confidence Adjustment**: Reduces confidence scores based on qualifiers:
+  - Negated facts: confidence = 0.3
+  - Suspected facts: confidence = 0.6
+  - Affirmed facts: confidence = 1.0
+
+#### 3. Temporality Detection (`_detect_temporality()`)
+- **Antécédents**: "antécédent", "ancien", "histoire de", "connu pour"
+- **Chronique**: "chronique", "persistant", "au long cours", "de longue date"
+- **Actuel**: Default temporality when no markers detected
+
+#### 4. Evidence Association
+- Each fact includes:
+  - `document_id`: Source document identifier
+  - `span`: Exact character positions (start, end)
+  - `text`: Extracted text
+  - `context`: Surrounding text (±50 characters)
+- Enables full traceability and auditability
+
+#### 5. Confidence Calculation
+- Base confidence from qualifier detection
+- Adjusted for temporality:
+  - Antécédents: ×0.9
+  - Chronique: ×0.95
+- Final confidence bounded to [0.0, 1.0]
+
+### Technical Implementation
+
+#### Pattern-Based Extraction
+- Uses compiled regex patterns for performance
+- Separate patterns for each fact type
+- Case-insensitive matching
+- Captures both structured (e.g., "Diagnostic: ...") and free-text mentions
+
+#### Context Window Analysis
+- 150-character window for qualifier detection
+- Handles markers before, within, or after fact text
+- Marker relevance check (max 50 characters distance)
+
+#### Marker Relevance Algorithm
+- Detects if marker is within the extracted fact text
+- Checks proximity for markers before/after the fact
+- Case-insensitive matching with fallback to first word
+
+## Test Coverage
+
+### Unit Tests (35 tests, all passing)
+1. **Qualifier Detection Tests** (8 tests)
+   - Negation with various markers
+   - Suspicion detection
+   - Affirmation (no markers)
+   - Priority handling
+
+2. **Temporality Detection Tests** (6 tests)
+   - Antécédent keywords
+   - Chronique conditions
+   - Default to "actuel"
+
+3. **Fact Extraction Tests** (8 tests)
+   - Extraction by fact type
+   - Negation handling
+   - Suspicion handling
+   - Antécédent handling
+
+4. **Stay-Level Extraction Tests** (2 tests)
+   - Multi-section extraction
+   - Document ID preservation
+
+5. **Confidence Calculation Tests** (5 tests)
+   - High confidence for affirmed facts
+   - Reduced confidence for suspected/negated
+   - Temporality adjustments
+   - Bounds checking
+
+6. **Context Extraction Tests** (4 tests)
+   - Context window extraction
+   - Start/end boundary handling
+   - Ellipsis addition
+
+7. **Marker Relevance Tests** (3 tests)
+   - Close markers
+   - Distant markers
+   - Markers after facts
+
+## Requirements Validated
+
+### Exigence 6.2: Extraction de faits structurés ✓
+- Extracts diagnostics, actes, examens, traitements
+- Structured data with type, text, qualifier, temporality
+
+### Exigence 6.3: Association avec preuves ✓
+- Each fact has evidence with document_id and span
+- Exact character positions tracked
+
+### Exigence 6.4: Assignation de qualificateurs ✓
+- All facts have qualifiers (affirmé/nié/suspecté)
+- Markers detected and recorded
+
+### Exigence 2.1: Détection de négation ✓
+- Negation markers detected
+- Facts marked as "nié"
+- Confidence reduced
+
+### Exigence 2.2: Détection de suspicion ✓
+- Suspicion markers detected
+- Facts marked as "suspecté"
+- Confidence reduced
+
+### Exigence 2.3: Détection de temporalité ✓
+- Temporality markers detected
+- Facts marked with temporality
+- Confidence adjusted
+
+## Integration Points
+
+### Input
+- `StructuredStay` from `DocumentProcessor`
+- Contains segmented sections from clinical documents
+
+### Output
+- List of `ClinicalFact` objects
+- Each with:
+  - Unique ID
+  - Type (diagnostic/acte/examen/traitement)
+  - Text content
+  - Qualifier (certainty, markers, confidence)
+  - Temporality (actuel/antécédent/chronique)
+  - Evidence (document_id, span, text, context)
+  - Overall confidence score
+
+### Next Steps
+- Facts will be used by the `Codeur` to propose CIM-10/CCAM codes
+- Evidence will support code justification
+- Qualifiers will influence code selection (e.g., negated facts not coded)
+
+## Performance Characteristics
+- Compiled regex patterns for fast matching
+- Single-pass extraction per section
+- O(n) complexity where n = document length
+- Minimal memory overhead (streaming processing)
+
+## Code Quality
+- Type hints throughout
+- Comprehensive docstrings
+- Immutable data models (Pydantic)
+- 100% test pass rate
+- Clear separation of concerns
+
+## Example Usage
+
+```python
+from pipeline_mco_pmsi.extractors import ClinicalFactsExtractor
+from pipeline_mco_pmsi.processors import DocumentProcessor
+
+# Process documents
+processor = DocumentProcessor()
+structured_stay = processor.process_documents(documents, stay_metadata)
+
+# Extract facts
+extractor = ClinicalFactsExtractor()
+facts = extractor.extract_facts(structured_stay)
+
+# Analyze facts
+for fact in facts:
+    print(f"Type: {fact.type}")
+    print(f"Text: {fact.text}")
+    print(f"Certainty: {fact.qualifier.certainty}")
+    print(f"Temporality: {fact.temporality}")
+    print(f"Confidence: {fact.confidence}")
+    print(f"Evidence: {fact.evidence.document_id} @ {fact.evidence.span}")
+```
+
+## Conclusion
+Task 8.1 is complete. The `ClinicalFactsExtractor` successfully extracts structured clinical facts with comprehensive qualifier detection and evidence association, meeting all specified requirements.