aivanov_CIM/TASK_8.1_SUMMARY.md
2026-03-05 01:20:14 +01:00

Task 8.1 Summary: ClinicalFactsExtractor Implementation

Overview

Task 8.1 implements the ClinicalFactsExtractor class, which extracts structured clinical facts from medical documents, detects qualifiers (negation, suspicion) and temporality, and associates each fact with its textual evidence.

Implementation Details

Files Created

  1. src/pipeline_mco_pmsi/extractors/__init__.py - Module initialization
  2. src/pipeline_mco_pmsi/extractors/clinical_facts_extractor.py - Main extractor implementation
  3. tests/test_clinical_facts_extractor.py - Comprehensive unit tests

Key Features Implemented

1. Clinical Facts Extraction (extract_facts())

  • Extracts structured facts from clinical documents
  • Supports multiple fact types:
    • Diagnostics: Diagnoses, conclusions, impressions
    • Actes: Medical procedures, interventions, surgeries
    • Examens: Tests, imaging, lab results
    • Traitements: Medications, prescriptions, therapies
  • Associates each fact with precise textual evidence (document_id, span, text)
  • Generates unique fact IDs using UUID

2. Qualifier Detection (detect_qualifiers())

  • Negation Detection: Identifies negated facts using markers:
    • "pas de", "absence de", "sans", "aucun", "ni"
    • "exclu", "infirmé", "non retrouvé"
  • Suspicion Detection: Identifies suspected/uncertain facts:
    • "possible", "suspecté", "probable", "à confirmer"
    • "évocateur", "compatible avec", "hypothèse"
  • Priority System: Negation takes priority over suspicion
  • Confidence Adjustment: Reduces confidence scores based on qualifiers:
    • Negated facts: confidence = 0.3
    • Suspected facts: confidence = 0.6
    • Affirmed facts: confidence = 1.0
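The priority rule and confidence values above can be illustrated with a minimal sketch. It is not the project's detect_qualifiers() implementation: the function name detect_qualifier_sketch, the whole-word matching strategy, and the return shape are assumptions made for illustration. Only the marker lists and the confidence values come from this summary.

```python
import re

# Marker lists and confidence values copied from this summary.
NEGATION_MARKERS = ["pas de", "absence de", "sans", "aucun", "ni",
                    "exclu", "infirmé", "non retrouvé"]
SUSPICION_MARKERS = ["possible", "suspecté", "probable", "à confirmer",
                     "évocateur", "compatible avec", "hypothèse"]
CONFIDENCE = {"nié": 0.3, "suspecté": 0.6, "affirmé": 1.0}


def _find_markers(text, markers):
    """Return the markers found in text as whole words (case-insensitive)."""
    found = []
    for marker in markers:
        # Assumption: whole-word matching, so "ni" does not fire inside "pneumonie".
        pattern = rf"(?<!\w){re.escape(marker)}(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(marker)
    return found


def detect_qualifier_sketch(context):
    """Hypothetical helper: return (certainty, markers, confidence) for a context."""
    negations = _find_markers(context, NEGATION_MARKERS)
    if negations:
        # Negation is checked first, so it wins over any suspicion marker.
        return "nié", negations, CONFIDENCE["nié"]
    suspicions = _find_markers(context, SUSPICION_MARKERS)
    if suspicions:
        return "suspecté", suspicions, CONFIDENCE["suspecté"]
    return "affirmé", [], CONFIDENCE["affirmé"]
```

Whole-word matching is a trade-off in this sketch: it avoids false positives on short markers such as "ni", but it would miss inflected forms such as "exclue" unless they are added to the lists.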

3. Temporality Detection (_detect_temporality())

  • Antécédents: "antécédent", "ancien", "histoire de", "connu pour"
  • Chronique: "chronique", "persistant", "au long cours", "de longue date"
  • Actuel: Default temporality when no markers detected
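A minimal sketch of this keyword lookup follows. The function name detect_temporality_sketch, the substring matching, and the order in which the two marker groups are checked are assumptions; the summary only specifies the marker lists and the "actuel" default.

```python
# Marker lists copied from this summary.
ANTECEDENT_MARKERS = ("antécédent", "ancien", "histoire de", "connu pour")
CHRONIC_MARKERS = ("chronique", "persistant", "au long cours", "de longue date")


def detect_temporality_sketch(context):
    """Hypothetical helper: classify a context as antécédent, chronique or actuel."""
    lowered = context.lower()
    # Assumption: antécédent markers are checked before chronique markers.
    if any(marker in lowered for marker in ANTECEDENT_MARKERS):
        return "antécédent"
    if any(marker in lowered for marker in CHRONIC_MARKERS):
        return "chronique"
    # No marker found: fall back to the default temporality.
    return "actuel"
```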

4. Evidence Association

  • Each fact includes:
    • document_id: Source document identifier
    • span: Exact character positions (start, end)
    • text: Extracted text
    • context: Surrounding text (±50 characters)
  • Enables full traceability and auditability
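The evidence record can be sketched as follows. The Evidence dataclass is a stand-in for the project's Pydantic model, and build_evidence_sketch is a hypothetical helper; the ellipsis added when the context is truncated is inferred from the "Ellipsis addition" tests listed under Test Coverage, and its exact form here is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Evidence:
    """Stand-in for the project's evidence model (field names from this summary)."""
    document_id: str
    span: Tuple[int, int]
    text: str
    context: str


def build_evidence_sketch(document_id, document_text, start, end, window=50):
    """Hypothetical helper: slice the fact and its surrounding context window."""
    ctx_start = max(0, start - window)               # clamp at document start
    ctx_end = min(len(document_text), end + window)  # clamp at document end
    context = document_text[ctx_start:ctx_end]
    # Assumption: mark truncation on either side with an ellipsis.
    if ctx_start > 0:
        context = "..." + context
    if ctx_end < len(document_text):
        context = context + "..."
    return Evidence(document_id, (start, end), document_text[start:end], context)
```

Because the span indexes the original document text, a reviewer can recover the exact source passage from document_id and span alone.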

5. Confidence Calculation

  • Base confidence from qualifier detection
  • Adjusted for temporality:
    • Antécédents: ×0.9
    • Chronique: ×0.95
  • Final confidence bounded to [0.0, 1.0]
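As a worked example of this arithmetic, the sketch below multiplies the qualifier confidence by the temporality factor and clamps the result. The function name compute_confidence_sketch is hypothetical, and the factor of 1.0 for "actuel" is an assumption, since the summary lists adjustments only for antécédents and chronique.

```python
# Values taken from this summary, except the "actuel" factor (assumed to be 1.0).
QUALIFIER_CONFIDENCE = {"nié": 0.3, "suspecté": 0.6, "affirmé": 1.0}
TEMPORALITY_FACTOR = {"antécédent": 0.9, "chronique": 0.95, "actuel": 1.0}


def compute_confidence_sketch(certainty, temporality):
    """Hypothetical helper: base confidence times temporality factor, clamped."""
    raw = QUALIFIER_CONFIDENCE[certainty] * TEMPORALITY_FACTOR[temporality]
    # Bound the final score to [0.0, 1.0].
    return max(0.0, min(1.0, raw))
```

Under these assumptions, a suspected antécédent scores about 0.6 × 0.9 = 0.54, and a negated chronic condition about 0.3 × 0.95 = 0.285.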

Technical Implementation

Pattern-Based Extraction

  • Uses compiled regex patterns for performance
  • Separate patterns for each fact type
  • Case-insensitive matching
  • Captures both structured (e.g., "Diagnostic: ...") and free-text mentions
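The pattern-based approach could look roughly like the sketch below. The two regular expressions are invented for illustration, as the extractor's actual patterns are not shown in this summary; extract_matches_sketch and the dictionary layout are likewise hypothetical.

```python
import re
import uuid

# Hypothetical patterns, compiled once and keyed by fact type.
# They only cover structured mentions such as "Diagnostic: ...".
FACT_PATTERNS = {
    "diagnostic": re.compile(
        r"(?:diagnostic|conclusion)\s*:\s*(?P<fact>[^\n.]*[^\s.])", re.IGNORECASE),
    "traitement": re.compile(
        r"(?:traitement|prescription)\s*:\s*(?P<fact>[^\n.]*[^\s.])", re.IGNORECASE),
}


def extract_matches_sketch(document_id, text):
    """Return one dict per match, with a UUID and the exact character span."""
    facts = []
    for fact_type, pattern in FACT_PATTERNS.items():
        for match in pattern.finditer(text):
            start, end = match.span("fact")
            facts.append({
                "id": str(uuid.uuid4()),   # unique fact ID
                "type": fact_type,
                "text": match.group("fact"),
                "document_id": document_id,
                "span": (start, end),      # text[start:end] equals the fact text
            })
    return facts
```

The named group ends on a non-space, non-period character, so the recorded span matches the fact text exactly without any later trimming.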

Context Window Analysis

  • 150-character window for qualifier detection
  • Handles markers before, within, or after fact text
  • Marker relevance check (max 50 characters distance)

Marker Relevance Algorithm

  • Detects if marker is within the extracted fact text
  • Checks proximity for markers before/after the fact
  • Case-insensitive matching with fallback to first word
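One way to read the relevance check is sketched below. The function is_marker_relevant_sketch is hypothetical, it considers only the first occurrence of the marker and of the fact in the context, and it omits the first-word fallback because the summary does not describe that step in enough detail.

```python
def is_marker_relevant_sketch(context, marker, fact_text, max_distance=50):
    """Hypothetical helper: is the marker inside the fact or close enough to it?"""
    lowered = context.lower()
    marker_l = marker.lower()
    fact_l = fact_text.lower()

    # Case 1: the marker is part of the extracted fact text itself.
    if marker_l in fact_l:
        return True

    fact_start = lowered.find(fact_l)
    marker_start = lowered.find(marker_l)
    if fact_start == -1 or marker_start == -1:
        return False

    fact_end = fact_start + len(fact_l)
    marker_end = marker_start + len(marker_l)

    # Case 2: measure the character gap between marker and fact.
    if marker_end <= fact_start:
        gap = fact_start - marker_end   # marker precedes the fact
    else:
        gap = marker_start - fact_end   # marker follows (or overlaps) the fact
    return gap <= max_distance
```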

Test Coverage

Unit Tests (35 tests, all passing)

  1. Qualifier Detection Tests (8 tests)

    • Negation with various markers
    • Suspicion detection
    • Affirmation (no markers)
    • Priority handling
  2. Temporality Detection Tests (6 tests)

    • Antécédent keywords
    • Chronique conditions
    • Default to "actuel"
  3. Fact Extraction Tests (8 tests)

    • Extraction by fact type
    • Negation handling
    • Suspicion handling
    • Antécédent handling
  4. Stay-Level Extraction Tests (2 tests)

    • Multi-section extraction
    • Document ID preservation
  5. Confidence Calculation Tests (5 tests)

    • High confidence for affirmed facts
    • Reduced confidence for suspected/negated
    • Temporality adjustments
    • Bounds checking
  6. Context Extraction Tests (4 tests)

    • Context window extraction
    • Start/end boundary handling
    • Ellipsis addition
  7. Marker Relevance Tests (3 tests)

    • Close markers
    • Distant markers
    • Markers after facts

Requirements Validated

Requirement 6.2: Structured fact extraction ✓

  • Extracts diagnostics, actes, examens, traitements
  • Structured data with type, text, qualifier, temporality

Requirement 6.3: Association with evidence ✓

  • Each fact has evidence with document_id and span
  • Exact character positions tracked

Requirement 6.4: Qualifier assignment ✓

  • All facts have qualifiers (affirmé/nié/suspecté)
  • Markers detected and recorded

Requirement 2.1: Negation detection ✓

  • Negation markers detected
  • Facts marked as "nié"
  • Confidence reduced

Requirement 2.2: Suspicion detection ✓

  • Suspicion markers detected
  • Facts marked as "suspecté"
  • Confidence reduced

Requirement 2.3: Temporality detection ✓

  • Temporality markers detected
  • Facts marked with temporality
  • Confidence adjusted

Integration Points

Input

  • StructuredStay from DocumentProcessor
  • Contains segmented sections from clinical documents

Output

  • List of ClinicalFact objects
  • Each with:
    • Unique ID
    • Type (diagnostic/acte/examen/traitement)
    • Text content
    • Qualifier (certainty, markers, confidence)
    • Temporality (actuel/antécédent/chronique)
    • Evidence (document_id, span, text, context)
    • Overall confidence score
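The output shape described above can be sketched with frozen dataclasses. These are stand-ins for the project's immutable Pydantic models: the class name ClinicalFactSketch and the field types are assumptions, while the field names follow this summary and the example usage.

```python
import uuid
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class Qualifier:
    certainty: str              # "affirmé", "nié" or "suspecté"
    markers: Tuple[str, ...]    # markers that triggered the qualifier
    confidence: float


@dataclass(frozen=True)
class Evidence:
    document_id: str
    span: Tuple[int, int]       # (start, end) character positions
    text: str
    context: str


@dataclass(frozen=True)
class ClinicalFactSketch:
    type: str                   # "diagnostic", "acte", "examen" or "traitement"
    text: str
    qualifier: Qualifier
    temporality: str            # "actuel", "antécédent" or "chronique"
    evidence: Evidence
    confidence: float
    # Unique ID generated per fact, as described in this summary.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
```

With frozen=True, assigning to a field after construction raises an error, which approximates the immutability the summary attributes to the real models.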

Next Steps

  • Facts will be used by the Codeur to propose CIM-10/CCAM codes
  • Evidence will support code justification
  • Qualifiers will influence code selection (e.g., negated facts not coded)
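How the Codeur will consume qualifiers is not specified in this summary. As a purely illustrative sketch of the "negated facts not coded" rule, the hypothetical helper below drops negated facts before coding; FactView is a simplified stand-in for ClinicalFact.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FactView:
    """Simplified stand-in carrying only what the filter needs."""
    text: str
    certainty: str   # "affirmé", "nié" or "suspecté"


def select_codable_sketch(facts: List[FactView]) -> List[FactView]:
    """Hypothetical filter: keep every fact that is not negated."""
    return [fact for fact in facts if fact.certainty != "nié"]
```

Whether suspected facts should be coded, flagged for review, or excluded is a separate decision that this sketch does not attempt to settle.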

Performance Characteristics

  • Compiled regex patterns for fast matching
  • Single-pass extraction per section
  • O(n) complexity where n = document length
  • Minimal memory overhead (streaming processing)

Code Quality

  • Type hints throughout
  • Comprehensive docstrings
  • Immutable data models (Pydantic)
  • 100% test pass rate
  • Clear separation of concerns

Example Usage

from pipeline_mco_pmsi.extractors import ClinicalFactsExtractor
from pipeline_mco_pmsi.processors import DocumentProcessor

# Process documents (documents and stay_metadata are supplied by the caller)
processor = DocumentProcessor()
structured_stay = processor.process_documents(documents, stay_metadata)

# Extract facts
extractor = ClinicalFactsExtractor()
facts = extractor.extract_facts(structured_stay)

# Analyze facts
for fact in facts:
    print(f"Type: {fact.type}")
    print(f"Text: {fact.text}")
    print(f"Certainty: {fact.qualifier.certainty}")
    print(f"Temporality: {fact.temporality}")
    print(f"Confidence: {fact.confidence}")
    print(f"Evidence: {fact.evidence.document_id} @ {fact.evidence.span}")

Conclusion

Task 8.1 is complete. The ClinicalFactsExtractor extracts structured clinical facts with qualifier detection, temporality detection, and evidence association, covering requirements 6.2 to 6.4 and 2.1 to 2.3.