# OMOP CDM 5.4 Data Pipeline

A comprehensive ETL pipeline for transforming healthcare data to OMOP Common Data Model (CDM) version 5.4 format.

## Overview

This pipeline provides a complete solution for:

- Extracting data from staging tables
- Mapping source codes to OMOP standard concepts
- Transforming data to OMOP CDM 5.4 format
- Validating data quality and OMOP compliance
- Loading data into OMOP tables with parallel processing

## Features

- ✅ **Complete OMOP CDM 5.4 Support**: All clinical, vocabulary, and metadata tables
- ✅ **Automated Concept Mapping**: LRU-cached mapping with fallback strategies
- ✅ **Parallel Processing**: Multi-threaded ETL with configurable workers
- ✅ **Data Quality Validation**: Comprehensive validation rules and OMOP compliance checks
- ✅ **Error Handling**: Retry logic, circuit breaker, and checkpoint/resume functionality
- ✅ **Web Interface**: Modern React dashboard for managing ETL pipelines (NEW!)
- ✅ **REST API**: FastAPI backend with complete API documentation
- ✅ **CLI Interface**: User-friendly command-line interface for all operations
- ✅ **Vocabulary Management**: Tools for loading and managing OMOP vocabularies
- ✅ **Comprehensive Logging**: Detailed logging with audit trail

## Quick Start

### Option 1: Web Interface (Recommended)

```bash
cd omop

# Install dependencies
pip install -r requirements.txt
pip install -r requirements-api.txt

# Start web interface (API + Frontend)
./start_web.sh
```

Then open http://localhost:3000 in your browser.

See `QUICK_START_WEB.md` for detailed instructions.

### Option 2: Command Line Interface

```bash
# Change into the project directory
cd omop

# Install dependencies
pip install -r requirements.txt

# Or install in development mode
pip install -e .
```

### Configuration

1. Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

2. Edit `.env` with your database credentials:

   ```
   DB_HOST=localhost
   DB_PORT=5432
   DB_NAME=omop_db
   DB_USER=your_user
   DB_PASSWORD=your_password
   ```

3. Review and customize `config.yaml` as needed.

### Create Database Schemas

```bash
# Create all schemas (OMOP, staging, audit)
omop-pipeline schema create --type all

# Or create individually
omop-pipeline schema create --type omop
omop-pipeline schema create --type staging
omop-pipeline schema create --type audit
```

### Load Vocabularies

1. Download vocabularies from [Athena OHDSI](https://athena.ohdsi.org/)
2. Extract the ZIP file to a directory
3. Load vocabularies:

```bash
omop-pipeline vocab load --path /path/to/vocabularies
```

### Run ETL Pipeline

```bash
# Run complete ETL pipeline
omop-pipeline etl run --source staging.raw_patients --target person

# With custom batch size and workers
omop-pipeline etl run --source staging.raw_patients --target person --batch-size 5000 --workers 8

# Run in sequential mode (no parallelization)
omop-pipeline etl run --source staging.raw_patients --target person --sequential
```

## Web Interface

The pipeline includes a modern web interface built with FastAPI and React.

### Features

- 📊 **Dashboard**: Real-time statistics and performance metrics
- ⚙️ **ETL Manager**: Launch and monitor ETL pipelines
- 🗄️ **Schema Manager**: Create and validate database schemas
- ✅ **Validation**: Data quality checks and unmapped codes
- 📝 **Logs**: System logs and validation errors

### Quick Start

```bash
./start_web.sh
```

Access the interface at http://localhost:3000

For more details, see `README_WEB_INTERFACE.md` and `WEB_INTERFACE_SUMMARY.md`.

## CLI Commands

### Schema Management

```bash
# Create schemas
omop-pipeline schema create --type [omop|staging|audit|all]

# Validate schema
omop-pipeline schema validate
```

### ETL Operations

```bash
# Run complete ETL
omop-pipeline etl run --source <table> --target <table>

# Run extraction only
omop-pipeline etl extract --source <table>

# Run transformation only
omop-pipeline etl transform --target <table>

# Run loading only
omop-pipeline etl load --target <table>
```

### Data Validation

```bash
# Validate data quality
omop-pipeline validate

# Validate specific table
omop-pipeline validate --table person
```

### Statistics

```bash
# Show ETL statistics
omop-pipeline stats show

# Show summary
omop-pipeline stats summary
```

### Vocabulary Management

```bash
# Prepare vocabulary loading (shows instructions)
omop-pipeline vocab prepare

# Load vocabularies
omop-pipeline vocab load --path /path/to/vocabularies
```

### Configuration

```bash
# Validate configuration
omop-pipeline config validate
```

### Logs

```bash
# Show recent log entries
omop-pipeline logs show

# Show last 100 lines
omop-pipeline logs show --lines 100

# Filter by log level
omop-pipeline logs show --level ERROR
```

## Architecture

The pipeline consists of the following components:

- **Extractor**: Extracts data from staging tables with batch processing
- **Concept Mapper**: Maps source codes to OMOP concepts with LRU caching
- **Transformer**: Transforms data to OMOP format with validation
- **Validator**: Validates data quality and OMOP compliance
- **Loader**: Loads data into OMOP tables using bulk operations
- **Orchestrator**: Coordinates the complete ETL flow with parallel processing
- **Error Handler**: Manages errors with retry logic and circuit breaker
- **Schema Manager**: Creates and manages database schemas
- **Vocabulary Loader**: Loads OMOP vocabularies from CSV files

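
The Concept Mapper's LRU caching can be sketched as follows. This is a minimal illustration using Python's `functools.lru_cache`; the in-memory lookup table stands in for a real vocabulary query, and `map_concept` is a hypothetical name, not the pipeline's actual API:

```python
from functools import lru_cache

# Placeholder lookup standing in for a source-to-concept database query.
# (E11.9 -> 201826 is the standard OMOP concept for type 2 diabetes mellitus.)
SOURCE_TO_CONCEPT = {("ICD10CM", "E11.9"): 201826}

@lru_cache(maxsize=10000)  # mirrors concept_cache_size in config.yaml
def map_concept(vocabulary: str, code: str) -> int:
    """Return the OMOP standard concept_id for a source code, or 0 if unmapped."""
    return SOURCE_TO_CONCEPT.get((vocabulary, code), 0)

print(map_concept("ICD10CM", "E11.9"))  # cache miss, then cached: 201826
print(map_concept("ICD10CM", "X99.9"))  # unmapped: falls back to 0 ("No matching concept")
```

Because mappings for common codes repeat millions of times across a load, caching the lookup avoids a database round trip per record.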
## Configuration

The pipeline is configured via `config.yaml`:

```yaml
database:
  host: localhost
  port: 5432
  database: omop_db
  user: postgres
  password: ${DB_PASSWORD}  # From environment variable

etl:
  batch_size: 1000
  num_workers: 4
  concept_cache_size: 10000
  validate_before_load: true

logging:
  level: INFO
  file: logs/omop_pipeline.log
  max_bytes: 10485760
  backup_count: 5
```

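
The `${DB_PASSWORD}` placeholder above is resolved from the environment when the configuration is loaded. A minimal sketch of that substitution, assuming simple `${VAR}` syntax (the pipeline's actual loader may differ):

```python
import os
import re

def expand_env_vars(text: str) -> str:
    """Replace ${VAR} placeholders with values from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), text)

os.environ["DB_PASSWORD"] = "s3cret"  # demo value only
print(expand_env_vars("password: ${DB_PASSWORD}"))  # password: s3cret
```

Keeping secrets out of `config.yaml` this way means the file can be committed safely while credentials live only in `.env` or the shell environment.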
## Performance

The pipeline is optimized for high-volume data processing:

- **Parallel Processing**: Multi-threaded execution with configurable workers
- **Batch Operations**: Efficient batch processing with PostgreSQL COPY
- **Caching**: LRU cache for frequently used concept mappings
- **Connection Pooling**: Optimized database connection management

Typical performance on a 16-core, 125GB RAM system:

- **Throughput**: 5,000-10,000 records/second
- **Memory Usage**: ~2-4GB per worker
- **CPU Usage**: Scales linearly with number of workers

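
Batch operations amount to slicing the record stream into fixed-size chunks before each COPY. A generic sketch of that chunking (illustrative only, not the pipeline's internal API):

```python
from itertools import islice

def batches(rows, batch_size=1000):
    """Yield successive lists of at most batch_size rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# 2,500 records with batch_size=1000 -> chunks of 1000, 1000, 500
sizes = [len(b) for b in batches(range(2500), batch_size=1000)]
print(sizes)  # [1000, 1000, 500]
```

Each chunk can then be handed to a worker thread and written with a single COPY, which is why throughput scales with both `batch_size` and `num_workers`.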
## Data Quality

The pipeline includes comprehensive data quality checks:

- **Referential Integrity**: Validates all foreign key relationships
- **Date Consistency**: Ensures start dates <= end dates
- **Concept Validation**: Verifies all concept_ids exist
- **Value Ranges**: Checks numeric values are within acceptable ranges
- **OMOP Compliance**: Validates against OMOP CDM specifications

## Error Handling

The pipeline implements robust error handling:

- **Error Levels**: INFO, WARNING, ERROR, CRITICAL
- **Retry Logic**: Exponential backoff for transient errors
- **Circuit Breaker**: Prevents cascading failures
- **Checkpoint/Resume**: Resume processing after interruption
- **Audit Trail**: Complete error logging to audit tables

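
The retry behavior can be approximated as follows. This is a simplified sketch of exponential backoff; the attempt count and delays are illustrative, not the pipeline's actual defaults:

```python
import time

def retry_with_backoff(fn, max_attempts=3, base_delay=0.1):
    """Call fn(), retrying on exception with exponentially increasing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # transient-error budget exhausted; surface the failure
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...

attempts = {"n": 0}

def flaky():
    """Simulated transient failure: errors twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry_with_backoff(flaky)
print(result)  # succeeds on the third attempt: ok
```

A circuit breaker adds one more layer on top of this: after repeated failures it stops calling the target entirely for a cool-down period, so a struggling database is not hammered by every worker at once.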
## Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test file
pytest tests/test_transformer.py
```

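
A typical unit test in `tests/test_transformer.py` might look like the sketch below. The `transform_gender` helper is hypothetical, though 8507 and 8532 are the standard OMOP gender concept_ids:

```python
# Hypothetical transformer helper under test
GENDER_CONCEPTS = {"M": 8507, "F": 8532}  # OMOP standard gender concept_ids

def transform_gender(source_value: str) -> int:
    """Map a source gender code to its OMOP concept_id (0 when unmapped)."""
    return GENDER_CONCEPTS.get(source_value.strip().upper(), 0)

def test_transform_gender():
    assert transform_gender("f") == 8532
    assert transform_gender("M") == 8507
    assert transform_gender("unknown") == 0

test_transform_gender()  # pytest would discover and run this automatically
```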
## Documentation

- [User Guide](docs/user_guide.md) - Detailed usage instructions
- [Architecture](docs/architecture.md) - System architecture and design
- [Transformation Rules](docs/transformation_rules.md) - Data transformation specifications
- [CHANGELOG](CHANGELOG.md) - Version history and changes

## Requirements

- Python 3.12+
- PostgreSQL 16.11+
- 8GB+ RAM (16GB+ recommended for parallel processing)
- OMOP vocabularies from Athena OHDSI

## License

MIT License - see LICENSE file for details

## Support

For issues, questions, or contributions, please open an issue on GitHub.

## Acknowledgments

- OHDSI Community for OMOP CDM specifications
- Athena OHDSI for vocabulary management