# Design Document: Admin Monitoring System

## Overview

This design document describes the architecture and implementation of a comprehensive monitoring and administration system for RPA Vision V3. The system extends the existing web dashboard with workflow chain management, trigger configuration, Prometheus metrics integration, centralized logging, and log download capabilities.

## Architecture

```mermaid
graph TB
    subgraph "Admin Dashboard"
        UI[Web Interface]
        API[Flask API]
        WS[WebSocket Handler]
    end
    subgraph "Monitoring Core"
        Logger[Centralized Logger]
        Metrics[Prometheus Metrics]
        Collector[Metrics Collector]
    end
    subgraph "Management"
        ChainMgr[Chain Manager]
        TriggerMgr[Trigger Manager]
        LogExporter[Log Exporter]
    end
    subgraph "Storage"
        ChainStore[(Chains JSON)]
        TriggerStore[(Triggers JSON)]
        LogStore[(Log Files)]
    end

    UI --> API
    UI --> WS
    API --> ChainMgr
    API --> TriggerMgr
    API --> Logger
    API --> LogExporter
    API --> Metrics
    ChainMgr --> ChainStore
    TriggerMgr --> TriggerStore
    Logger --> LogStore
    Logger --> Metrics
    Collector --> Metrics
    WS --> Collector
```

## Components and Interfaces

### 1. Centralized Logger (`core/monitoring/logger.py`)

```python
@dataclass
class LogEntry:
    timestamp: datetime
    level: str  # INFO, WARNING, ERROR, DEBUG
    component: str
    message: str
    workflow_id: Optional[str] = None
    node_id: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]: ...

class RPALogger:
    def __init__(self, component: str, log_file: Optional[str] = None): ...
    def info(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def warning(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def error(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def debug(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def workflow_start(self, workflow_id: str, **metadata): ...
    def workflow_end(self, workflow_id: str, success: bool, duration: float): ...
    def get_recent_logs(self, limit: int = 100) -> List[LogEntry]: ...
    def export_logs(self, start_time: Optional[datetime] = None,
                    end_time: Optional[datetime] = None) -> str: ...

def get_logger(component: str) -> RPALogger: ...
```

### 2. Prometheus Metrics (`core/monitoring/metrics.py`)

```python
# Counters
workflow_executions_total = Counter(
    'workflow_executions_total',
    'Total workflow executions',
    ['workflow_id', 'status']
)
log_entries_total = Counter(
    'log_entries_total',
    'Total log entries',
    ['level', 'component']
)
chain_executions_total = Counter(
    'chain_executions_total',
    'Total chain executions',
    ['chain_id', 'status']
)
trigger_fires_total = Counter(
    'trigger_fires_total',
    'Total trigger fires',
    ['trigger_type', 'workflow_id']
)

# Histograms
workflow_duration_seconds = Histogram(
    'workflow_duration_seconds',
    'Workflow execution duration',
    ['workflow_id']
)

# Gauges
active_workflows = Gauge('active_workflows', 'Number of active workflows')
error_rate = Gauge('error_rate', 'Current error rate percentage')
```

### 3. Chain Manager (`core/monitoring/chain_manager.py`)

```python
@dataclass
class WorkflowChain:
    chain_id: str
    name: str
    workflows: List[str]  # Ordered list of workflow_ids
    status: str  # active, inactive, running
    created_at: datetime
    last_execution: Optional[datetime] = None
    success_rate: float = 0.0

class ChainManager:
    def __init__(self, storage_path: Path): ...
    def list_chains(self) -> List[WorkflowChain]: ...
    def get_chain(self, chain_id: str) -> Optional[WorkflowChain]: ...
    def create_chain(self, name: str, workflows: List[str]) -> WorkflowChain: ...
    def validate_workflows_exist(self, workflow_ids: List[str]) -> bool: ...
    def execute_chain(self, chain_id: str, on_progress: Callable) -> ChainExecutionResult: ...
    def delete_chain(self, chain_id: str) -> bool: ...
```

### 4. Trigger Manager (`core/monitoring/trigger_manager.py`)

```python
@dataclass
class Trigger:
    trigger_id: str
    trigger_type: str  # schedule, file, manual
    workflow_id: str
    config: Dict[str, Any]
    enabled: bool
    created_at: datetime
    last_fired: Optional[datetime] = None

class TriggerManager:
    def __init__(self, storage_path: Path): ...
    def list_triggers(self) -> List[Trigger]: ...
    def get_trigger(self, trigger_id: str) -> Optional[Trigger]: ...
    def create_trigger(self, trigger_type: str, workflow_id: str, config: Dict) -> Trigger: ...
    def validate_config(self, trigger_type: str, config: Dict) -> bool: ...
    def enable_trigger(self, trigger_id: str) -> bool: ...
    def disable_trigger(self, trigger_id: str) -> bool: ...
    def delete_trigger(self, trigger_id: str) -> bool: ...
```

### 5. Log Exporter (`core/monitoring/log_exporter.py`)

```python
class LogExporter:
    def __init__(self, logs_path: Path): ...
    def export_to_zip(
        self,
        start_time: Optional[datetime] = None,
        end_time: Optional[datetime] = None
    ) -> io.BytesIO: ...
    def get_execution_logs(self, start: datetime, end: datetime) -> List[Dict]: ...
    def get_error_logs(self, start: datetime, end: datetime) -> List[Dict]: ...
    def get_metrics_summary(self) -> Dict: ...
```

### 6. API Endpoints (additions to `web_dashboard/app.py`)

```python
# Chains API
@app.route('/api/chains')
def api_chains(): ...

@app.route('/api/chains', methods=['POST'])
def create_chain(): ...

@app.route('/api/chains/<chain_id>/execute', methods=['POST'])
def execute_chain(chain_id): ...

# Triggers API
@app.route('/api/triggers')
def api_triggers(): ...

@app.route('/api/triggers', methods=['POST'])
def create_trigger(): ...

@app.route('/api/triggers/<trigger_id>/toggle', methods=['POST'])
def toggle_trigger(trigger_id): ...

# Logs API
@app.route('/api/logs/download')
def download_logs(): ...

# Metrics API
@app.route('/metrics')
def prometheus_metrics(): ...
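# Sketch (illustrative, not part of the declared interface above): one
# minimal way the /metrics stub could be implemented, assuming the
# prometheus_client package, whose generate_latest() and
# CONTENT_TYPE_LATEST helpers render the default registry in Prometheus
# exposition format:
#
#     from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
#
#     @app.route('/metrics')
#     def prometheus_metrics():
#         return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)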
```

## Data Models

### Chain Storage Format (`data/chains/*.json`)

```json
{
  "chain_id": "chain_001",
  "name": "Complete Process",
  "workflows": ["wf_login", "wf_data_entry", "wf_submit"],
  "status": "active",
  "created_at": "2024-11-29T10:00:00",
  "last_execution": "2024-11-29T14:30:00",
  "success_rate": 92.5,
  "execution_history": [
    {
      "timestamp": "2024-11-29T14:30:00",
      "success": true,
      "duration": 45.2,
      "failed_at": null
    }
  ]
}
```

### Trigger Storage Format (`data/triggers/*.json`)

```json
{
  "trigger_id": "trigger_001",
  "trigger_type": "schedule",
  "workflow_id": "wf_login",
  "config": {
    "interval_seconds": 3600,
    "start_time": "08:00",
    "end_time": "18:00"
  },
  "enabled": true,
  "created_at": "2024-11-29T10:00:00",
  "last_fired": "2024-11-29T14:00:00"
}
```

### Log Entry Format

```json
{
  "timestamp": "2024-11-29T14:30:15.123",
  "level": "INFO",
  "component": "execution",
  "message": "Workflow started",
  "workflow_id": "wf_001",
  "node_id": "login_node",
  "metadata": {
    "trigger": "schedule",
    "user": "system"
  }
}
```

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Chain listing completeness

*For any* set of chains stored in the system, the chains API endpoint SHALL return all chains with their complete workflow sequences and status information.

**Validates: Requirements 1.1**

### Property 2: Chain workflow validation

*For any* chain creation request with workflow references, if any referenced workflow does not exist, the creation SHALL fail with a validation error.

**Validates: Requirements 1.2**

### Property 3: Chain execution stops on failure

*For any* chain execution where a workflow fails, the chain execution SHALL stop at the failed workflow and not execute subsequent workflows.

**Validates: Requirements 1.4**

### Property 4: Trigger listing completeness

*For any* set of triggers stored in the system, the triggers API endpoint SHALL return all triggers with type, workflow_id, and enabled status.

**Validates: Requirements 2.1**

### Property 5: Trigger state persistence

*For any* trigger enable/disable operation, the new state SHALL be persisted and returned correctly on subsequent queries.

**Validates: Requirements 2.3**

### Property 6: Prometheus metrics format validity

*For any* request to the /metrics endpoint, the response SHALL be valid Prometheus exposition format parseable by Prometheus.

**Validates: Requirements 3.1**

### Property 7: Workflow execution counter increment

*For any* workflow execution (success or failure), the workflow_executions_total counter SHALL increment by exactly 1 with correct labels.

**Validates: Requirements 3.2**

### Property 8: Workflow duration histogram recording

*For any* completed workflow execution with a measured duration, the workflow_duration_seconds histogram SHALL record that duration.

**Validates: Requirements 3.3**

### Property 9: Log entry structure completeness

*For any* log entry created by the logging system, the entry SHALL contain timestamp, level, component, and message fields.

**Validates: Requirements 4.1**

### Property 10: Workflow log metadata inclusion

*For any* log entry created with workflow context, the entry metadata SHALL include workflow_id and node_id when provided.

**Validates: Requirements 4.2**

### Property 11: Log filtering correctness

*For any* log query with filter parameters, all returned entries SHALL match the specified filter criteria.

**Validates: Requirements 4.3**

### Property 12: Log counter synchronization

*For any* log entry written, the corresponding Prometheus log counter SHALL be incremented by 1.

**Validates: Requirements 4.4**

### Property 13: ZIP archive validity

*For any* log download request, the response SHALL be a valid ZIP archive.

**Validates: Requirements 5.1**

### Property 14: ZIP archive contents

*For any* log download, the ZIP archive SHALL contain execution_logs.json, error_logs.json, and metrics.json files.

**Validates: Requirements 5.2**

### Property 15: Date range filtering

*For any* log download with date range parameters, all log entries in the archive SHALL have timestamps within the specified range.

**Validates: Requirements 5.4**

## Error Handling

### Chain Errors

- `ChainNotFoundError`: Chain ID does not exist
- `WorkflowNotFoundError`: Referenced workflow does not exist
- `ChainExecutionError`: Error during chain execution with failure point

### Trigger Errors

- `TriggerNotFoundError`: Trigger ID does not exist
- `InvalidTriggerConfigError`: Trigger configuration is invalid
- `WorkflowNotFoundError`: Target workflow does not exist

### Log Errors

- `LogExportError`: Error generating log archive
- `InvalidDateRangeError`: Start date is after end date

## Testing Strategy

### Property-Based Testing Library

The implementation will use **Hypothesis** for Python property-based testing.

### Test Configuration

- Minimum 100 iterations per property test
- Each property test tagged with: `**Feature: admin-monitoring, Property {number}: {property_text}**`

### Unit Tests

- Test individual component methods
- Test API endpoint responses
- Test error handling paths

### Property-Based Tests

Each correctness property will have a corresponding property-based test:

1. **Properties 1-2**: Generate random chain configurations, verify listing and validation
2. **Property 3**: Generate chains with failing workflows, verify execution stops
3. **Properties 4-5**: Generate random triggers, verify listing and state persistence
4. **Properties 6-8**: Generate workflow executions, verify metrics format and values
5. **Properties 9-12**: Generate log entries, verify structure and counter sync
6. **Properties 13-15**: Generate log data, verify ZIP contents and filtering

### Integration Tests

- End-to-end chain execution flow
- Trigger firing and workflow execution
- Log download with various filters
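To make the ZIP-related properties concrete, Properties 13 and 14 can be exercised against a core as small as the stdlib sketch below. The free function `export_to_zip` is illustrative only (the design declares it as a method on `LogExporter`); everything else uses standard-library `zipfile`, `json`, and `io`.

```python
import io
import json
import zipfile


def export_to_zip(execution_logs, error_logs, metrics):
    """Illustrative core of LogExporter.export_to_zip: bundle the three
    JSON documents required by Property 14 into an in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('execution_logs.json', json.dumps(execution_logs, indent=2))
        zf.writestr('error_logs.json', json.dumps(error_logs, indent=2))
        zf.writestr('metrics.json', json.dumps(metrics, indent=2))
    buf.seek(0)
    return buf


# Property 13/14 check: the result is a valid ZIP containing exactly
# the three expected members.
archive = export_to_zip([{"workflow_id": "wf_001"}], [], {"error_rate": 0.0})
assert zipfile.is_zipfile(archive)
archive.seek(0)
with zipfile.ZipFile(archive) as zf:
    assert set(zf.namelist()) == {
        'execution_logs.json', 'error_logs.json', 'metrics.json'
    }
```

A Hypothesis property test would wrap the same assertions in `@given(...)` with randomly generated log lists, running them for the configured 100 iterations.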