Files

2026-03-05 01:20:14 +01:00

6.7 KiB

Raw Blame History

Task 8.1 Summary: ClinicalFactsExtractor Implementation

Overview

Successfully implemented the ClinicalFactsExtractor class for extracting structured clinical facts from medical documents with qualifier detection and evidence association.

Implementation Details

Files Created

src/pipeline_mco_pmsi/extractors/init.py - Module initialization
src/pipeline_mco_pmsi/extractors/clinical_facts_extractor.py - Main extractor implementation
tests/test_clinical_facts_extractor.py - Comprehensive unit tests

Key Features Implemented

1. Clinical Facts Extraction (`extract_facts()`)

Extracts structured facts from clinical documents
Supports multiple fact types:
- Diagnostics: Diagnoses, conclusions, impressions
- Actes: Medical procedures, interventions, surgeries
- Examens: Tests, imaging, lab results
- Traitements: Medications, prescriptions, therapies
Associates each fact with precise textual evidence (document_id, span, text)
Generates unique fact IDs using UUID

2. Qualifier Detection (`detect_qualifiers()`)

Negation Detection: Identifies negated facts using markers:
- "pas de", "absence de", "sans", "aucun", "ni"
- "exclu", "infirmé", "non retrouvé"
Suspicion Detection: Identifies suspected/uncertain facts:
- "possible", "suspecté", "probable", "à confirmer"
- "évocateur", "compatible avec", "hypothèse"
Priority System: Negation takes priority over suspicion
Confidence Adjustment: Reduces confidence scores based on qualifiers:
- Negated facts: confidence = 0.3
- Suspected facts: confidence = 0.6
- Affirmed facts: confidence = 1.0

3. Temporality Detection (`_detect_temporality()`)

Antécédents: "antécédent", "ancien", "histoire de", "connu pour"
Chronique: "chronique", "persistant", "au long cours", "de longue date"
Actuel: Default temporality when no markers detected

4. Evidence Association

Each fact includes:
- document_id: Source document identifier
- span: Exact character positions (start, end)
- text: Extracted text
- context: Surrounding text (±50 characters)
Enables full traceability and auditability

5. Confidence Calculation

Base confidence from qualifier detection
Adjusted for temporality:
- Antécédents: ×0.9
- Chronique: ×0.95
Final confidence bounded to [0.0, 1.0]

Technical Implementation

Pattern-Based Extraction

Uses compiled regex patterns for performance
Separate patterns for each fact type
Case-insensitive matching
Captures both structured (e.g., "Diagnostic: ...") and free-text mentions

Context Window Analysis

150-character window for qualifier detection
Handles markers before, within, or after fact text
Marker relevance check (max 50 characters distance)

Marker Relevance Algorithm

Detects if marker is within the extracted fact text
Checks proximity for markers before/after the fact
Case-insensitive matching with fallback to first word

Test Coverage

Unit Tests (35 tests, all passing)

Qualifier Detection Tests (8 tests)
- Negation with various markers
- Suspicion detection
- Affirmation (no markers)
- Priority handling
Temporality Detection Tests (6 tests)
- Antécédent keywords
- Chronique conditions
- Default to "actuel"
Fact Extraction Tests (8 tests)
- Extraction by fact type
- Negation handling
- Suspicion handling
- Antécédent handling
Stay-Level Extraction Tests (2 tests)
- Multi-section extraction
- Document ID preservation
Confidence Calculation Tests (5 tests)
- High confidence for affirmed facts
- Reduced confidence for suspected/negated
- Temporality adjustments
- Bounds checking
Context Extraction Tests (4 tests)
- Context window extraction
- Start/end boundary handling
- Ellipsis addition
Marker Relevance Tests (3 tests)
- Close markers
- Distant markers
- Markers after facts

Requirements Validated

Exigence 6.2: Extraction de faits structurés ✓

Extracts diagnostics, actes, examens, traitements
Structured data with type, text, qualifier, temporality

Exigence 6.3: Association avec preuves ✓

Each fact has evidence with document_id and span
Exact character positions tracked

Exigence 6.4: Assignation de qualificateurs ✓

All facts have qualifiers (affirmé/nié/suspecté)
Markers detected and recorded

Exigence 2.1: Détection de négation ✓

Negation markers detected
Facts marked as "nié"
Confidence reduced

Exigence 2.2: Détection de suspicion ✓

Suspicion markers detected
Facts marked as "suspecté"
Confidence reduced

Exigence 2.3: Détection de temporalité ✓

Temporality markers detected
Facts marked with temporality
Confidence adjusted

Integration Points

Input

StructuredStay from DocumentProcessor
Contains segmented sections from clinical documents

Output

List of ClinicalFact objects
Each with:
- Unique ID
- Type (diagnostic/acte/examen/traitement)
- Text content
- Qualifier (certainty, markers, confidence)
- Temporality (actuel/antécédent/chronique)
- Evidence (document_id, span, text, context)
- Overall confidence score

Next Steps

Facts will be used by the Codeur to propose CIM-10/CCAM codes
Evidence will support code justification
Qualifiers will influence code selection (e.g., negated facts not coded)

Performance Characteristics

Compiled regex patterns for fast matching
Single-pass extraction per section
O(n) complexity where n = document length
Minimal memory overhead (streaming processing)

Code Quality

Type hints throughout
Comprehensive docstrings
Immutable data models (Pydantic)
100% test pass rate
Clear separation of concerns

Example Usage

from pipeline_mco_pmsi.extractors import ClinicalFactsExtractor
from pipeline_mco_pmsi.processors import DocumentProcessor

# Process documents
processor = DocumentProcessor()
structured_stay = processor.process_documents(documents, stay_metadata)

# Extract facts
extractor = ClinicalFactsExtractor()
facts = extractor.extract_facts(structured_stay)

# Analyze facts
for fact in facts:
    print(f"Type: {fact.type}")
    print(f"Text: {fact.text}")
    print(f"Certainty: {fact.qualifier.certainty}")
    print(f"Temporality: {fact.temporality}")
    print(f"Confidence: {fact.confidence}")
    print(f"Evidence: {fact.evidence.document_id} @ {fact.evidence.span}")

Conclusion

Task 8.1 is complete. The ClinicalFactsExtractor successfully extracts structured clinical facts with comprehensive qualifier detection and evidence association, meeting all specified requirements.

6.7 KiB Raw Blame History Unescape Escape