Initial commit

Dom
2026-03-05 01:20:14 +01:00
commit 2163e574c1
184 changed files with 354881 additions and 0 deletions

TASK_3_SUMMARY.md Normal file

@@ -0,0 +1,136 @@
# Task 3: PII Protector Implementation - Summary
## Completed: Task 3.1 - PIIProtector Class with Hybrid Detection
### Implementation Overview
Successfully implemented the `PIIProtector` class in `src/pipeline_mco_pmsi/processors/pii_protector.py` with the following features:
#### Core Functionality
1. **Hybrid Detection Approach (Regex + NER)**
   - Regex patterns for structured data (dates, NSS, phones, emails, addresses)
   - NER (Named Entity Recognition) support via spaCy for person names
   - Context-based name detection as fallback
2. **PII Types Detected**
   - **Names**: Using NER and context patterns (e.g., "Patient Jean Dupont", "M. Martin")
   - **Birth Dates**: Multiple formats (DD/MM/YYYY, YYYY-MM-DD, "15 mars 1960", etc.)
   - **NSS (French social security numbers)**: 15 digits, with or without spaces
   - **Phone Numbers**: Various formats (spaces, dots, dashes, international)
   - **Emails**: Standard email format
   - **Addresses**: Street addresses and postal codes
3. **Key Methods**
   - `detect_pii(text)`: Detects all PII in text, returns a list of `PIISpan` objects
   - `anonymize_text(text, pii_spans)`: Replaces PII with placeholders
   - `filter_logs(log_entry)`: Filters PII from log entries
   - `has_pii(text)`: Checks if text contains PII
4. **Design Principles**
   - **High Recall Preference**: Prefers false positives over false negatives to avoid PII leaks
   - **Span Merging**: Automatically merges overlapping detections
   - **Confidence Scoring**: Each detection has a confidence score (0.0-1.0)
   - **Lazy Loading**: NER model loaded only when needed
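
The regex half of the hybrid approach and the span-merging principle can be sketched as follows. This is a minimal illustration, not the code in `pii_protector.py`: `Span`, `PATTERNS`, `detect`, and `merge_overlapping` are hypothetical names, and the patterns, confidence values, and merge policy (union of the ranges, label of the more confident detection) are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for the PIISpan dataclass."""
    start: int
    end: int
    pii_type: str
    confidence: float


# Illustrative patterns and confidence values (assumptions, not the real ones).
PATTERNS = {
    "NSS": (re.compile(r"\b[12](?:\s?\d{2}){3}\s?\d{3}\s?\d{3}\s?\d{2}\b"), 0.95),
    "DATE": (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), 0.80),
    "EMAIL": (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), 0.90),
}


def merge_overlapping(spans):
    """Merge overlapping spans into one span covering their union."""
    merged = []
    for span in sorted(spans, key=lambda s: (s.start, -s.end)):
        if merged and span.start < merged[-1].end:
            last = merged[-1]
            # Keep the label of the more confident detection.
            if span.confidence > last.confidence:
                last.pii_type = span.pii_type
                last.confidence = span.confidence
            last.end = max(last.end, span.end)
        else:
            # Copy so the caller's spans are never mutated.
            merged.append(Span(span.start, span.end, span.pii_type, span.confidence))
    return merged


def detect(text):
    """Run every pattern over the text and return merged, ordered spans."""
    spans = [
        Span(m.start(), m.end(), kind, conf)
        for kind, (pattern, conf) in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return merge_overlapping(spans)
```

Merging by union rather than picking one of the overlapping spans is consistent with the high-recall preference: no character flagged by any detector is dropped.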
### Test Coverage
Created comprehensive unit tests in `tests/test_pii_protector.py`:
- **38 unit tests** covering:
  - Detection of all PII types with various formats
  - Anonymization functionality
  - Log filtering
  - Edge cases (empty text, overlapping spans, composite names, etc.)
- **Test Results**: ✅ All 38 tests passing
- **Code Coverage**: 82% for the `pii_protector.py` module
### Requirements Satisfied
- **Requirement 11.1**: Hybrid detection (regex + NER) implemented
- **Requirement 11.2**: PII excluded from logs via `filter_logs()`
- **Requirement 11.3**: PII excluded from error messages (via anonymization)
- **Requirement 5.10**: Audit logs maintained without PII exposure
### Key Features
1. **Flexible Regex Patterns**
   - Handles multiple date formats (slash, dash, ISO, text)
   - Detects NSS with/without spaces
   - Supports various phone number formats
   - Postal codes and street addresses
2. **Smart Name Detection**
   - Context-based detection ("Patient", "M.", "Mme", etc.)
   - Optional NER integration with spaCy
   - Handles composite names (Jean-Pierre, Dupont-Martin)
3. **Robust Anonymization**
   - Replaces PII with clear placeholders (`[NOM_ANONYMISÉ]`, `[NSS]`, etc.)
   - Preserves text structure
   - Handles multiple PII types in same text
4. **Conservative Approach**
   - High recall to minimize PII leaks
   - Treats false positives as an acceptable trade-off
   - Comprehensive pattern coverage
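
Context-based name detection and structure-preserving replacement might look like the sketch below. `NAME_TOKEN`, `CONTEXT_NAME`, and `anonymize_names` are hypothetical names, and the list of trigger words and the accented-letter ranges are rough assumptions rather than the patterns the module actually uses.

```python
import re

# One capitalized name token, optionally hyphenated (Jean-Pierre, Dupont-Martin).
# The accented ranges are an approximation for French capitals and lowercase.
NAME_TOKEN = r"[A-ZÀ-Ý][a-zà-ÿ]+(?:-[A-ZÀ-Ý][a-zà-ÿ]+)*"

# A trigger word followed by one or two name tokens; group 1 is the name only.
CONTEXT_NAME = re.compile(
    rf"\b(?:Patiente?|Mme|M\.|Dr\.?)\s+({NAME_TOKEN}(?:\s+{NAME_TOKEN})?)"
)


def anonymize_names(text, placeholder="[NOM_ANONYMISÉ]"):
    """Replace names found after a trigger word, keeping the trigger itself."""
    spans = [m.span(1) for m in CONTEXT_NAME.finditer(text)]
    # Replace from right to left so earlier offsets stay valid.
    for start, end in reversed(spans):
        text = text[:start] + placeholder + text[end:]
    return text
```

Replacing only the captured name keeps the surrounding words in place, which matches the "Patient [NOM_ANONYMISÉ]" shape of the anonymized output described in this summary.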
### Files Created
1. `src/pipeline_mco_pmsi/processors/pii_protector.py` (400+ lines)
   - PIIProtector class
   - PIISpan dataclass
   - Comprehensive regex patterns
   - NER integration support
2. `src/pipeline_mco_pmsi/processors/__init__.py`
   - Module exports
3. `tests/test_pii_protector.py` (450+ lines)
   - 38 unit tests
   - 4 test classes covering different aspects
   - Edge case testing
### Optional Tasks (Not Implemented)
- **Task 3.2**: Property-based tests (marked as optional)
- **Task 3.3**: Additional unit tests for edge cases (marked as optional)
These can be implemented later if needed for more exhaustive testing.
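
If Task 3.2 is picked up, one property worth checking is that a generated NSS never survives anonymization, whatever its spacing. The sketch below uses only the standard library with a fixed seed; a library such as Hypothesis could express the same property more compactly. `anonymize` is a hypothetical stand-in for `PIIProtector.anonymize_text`, and the NSS pattern is an assumption.

```python
import random
import re
import string

# Assumed NSS pattern: 15 digits grouped 1-2-2-2-3-3-2, spaces optional.
NSS_PATTERN = re.compile(r"\b[12](?:\s?\d{2}){3}\s?\d{3}\s?\d{3}\s?\d{2}\b")


def anonymize(text):
    """Hypothetical stand-in for PIIProtector.anonymize_text."""
    return NSS_PATTERN.sub("[NSS]", text)


def random_nss(rng):
    """Generate a 15-digit NSS-shaped string, either spaced or compact."""
    digits = rng.choice("12") + "".join(rng.choice(string.digits) for _ in range(14))
    groups, pos = [], 0
    for size in (1, 2, 2, 2, 3, 3, 2):
        groups.append(digits[pos:pos + size])
        pos += size
    return rng.choice(["", " "]).join(groups)


def random_word(rng):
    """Generate lowercase filler text that contains no digits."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(1, 8)))


def check_nss_never_survives(trials=200, seed=0):
    """Property: the NSS is replaced and the surrounding text is untouched."""
    rng = random.Random(seed)
    for _ in range(trials):
        prefix, suffix = random_word(rng), random_word(rng)
        result = anonymize(f"{prefix} {random_nss(rng)} {suffix}")
        assert result == f"{prefix} [NSS] {suffix}", result
    return trials
```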
### Next Steps
The PII Protector is now ready to be integrated into the pipeline:
- Can be used by the Audit Logger to filter logs
- Can be used to anonymize clinical text before export
- Can be used to validate that no PII leaks into system outputs
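
For the Audit Logger integration, one option is a standard `logging.Filter` that rewrites each record before any handler formats it. This is a sketch under assumptions: `PIIRedactingFilter` and `redact_emails` are hypothetical names, the `[EMAIL]` placeholder is illustrative, and in the pipeline the callable passed in would presumably be `protector.filter_logs`.

```python
import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(message):
    """Hypothetical stand-in for PIIProtector.filter_logs (emails only)."""
    return EMAIL.sub("[EMAIL]", message)


class PIIRedactingFilter(logging.Filter):
    """Redact PII from log records using any str -> str callable."""

    def __init__(self, redact_fn):
        super().__init__()
        self._redact = redact_fn

    def filter(self, record):
        # Render the message with its arguments first, so PII passed through
        # %-style args is redacted too, then store the redacted result.
        record.msg = self._redact(record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it


# Usage: attach the filter to a handler so every record it emits is redacted.
handler = logging.StreamHandler()
handler.addFilter(PIIRedactingFilter(redact_emails))
```

Attaching the filter to a handler rather than to a single logger covers records propagated from child loggers as well. Exception tracebacks are not touched by this sketch and would need separate handling.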
### Usage Example
```python
from pipeline_mco_pmsi.processors import PIIProtector
# Initialize protector
protector = PIIProtector(use_ner=False) # or True to enable NER
# Detect PII
text = "Patient Jean Dupont, né le 15/03/1960, NSS 1 60 03 75 123 456 78"
pii_spans = protector.detect_pii(text)
# Anonymize text
anonymized = protector.anonymize_text(text, pii_spans)
# Result: "Patient [NOM_ANONYMISÉ], né le [DATE_NAISSANCE], NSS [NSS]"
# Filter logs
log = "ERROR: Patient Jean Dupont - traitement échoué"
filtered = protector.filter_logs(log)
# Result: "ERROR: Patient [NOM_ANONYMISÉ] - traitement échoué"
# Check for PII
has_pii = protector.has_pii(text) # Returns True
```
## Conclusion
Task 3.1 is complete: the PII protection system passes all 38 unit tests and satisfies Requirements 11.1, 11.2, 11.3, and 5.10. The implementation follows the conservative approach specified in the requirements, prioritizing high recall to prevent PII leaks.