rpa_vision_v3/.kiro/specs/admin-monitoring/design.md
Dom a7de6a488b feat: E2E replay functional: 25/25 actions, 0 retries, SomEngine via server
Validated on a Windows PC (DESKTOP-58D5CAC, 2560x1600):
- 8 clicks resolved visually (1 anchor_template, 1 som_text_match, 6 som_vlm)
- Average score 0.75, average time 1.6 s
- Text typed correctly (bonjour, test word, date, email)
- 0 retries, 2 unverified actions (OK)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:04:41 +02:00


# Design Document: Admin Monitoring System
## Overview
This design document describes the architecture and implementation of a comprehensive monitoring and administration system for RPA Vision V3. The system extends the existing web dashboard with workflow chain management, trigger configuration, Prometheus metrics integration, centralized logging, and log download capabilities.
## Architecture
```mermaid
graph TB
    subgraph "Admin Dashboard"
        UI[Web Interface]
        API[Flask API]
        WS[WebSocket Handler]
    end
    subgraph "Monitoring Core"
        Logger[Centralized Logger]
        Metrics[Prometheus Metrics]
        Collector[Metrics Collector]
    end
    subgraph "Management"
        ChainMgr[Chain Manager]
        TriggerMgr[Trigger Manager]
        LogExporter[Log Exporter]
    end
    subgraph "Storage"
        ChainStore[(Chains JSON)]
        TriggerStore[(Triggers JSON)]
        LogStore[(Log Files)]
    end
    UI --> API
    UI --> WS
    API --> ChainMgr
    API --> TriggerMgr
    API --> Logger
    API --> LogExporter
    API --> Metrics
    ChainMgr --> ChainStore
    TriggerMgr --> TriggerStore
    Logger --> LogStore
    Logger --> Metrics
    Collector --> Metrics
    WS --> Collector
```
## Components and Interfaces
### 1. Centralized Logger (`core/monitoring/logger.py`)
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass
class LogEntry:
    timestamp: datetime
    level: str  # INFO, WARNING, ERROR, DEBUG
    component: str
    message: str
    workflow_id: Optional[str] = None
    node_id: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]: ...

class RPALogger:
    def __init__(self, component: str, log_file: Optional[str] = None): ...
    def info(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def warning(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def error(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def debug(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def workflow_start(self, workflow_id: str, **metadata): ...
    def workflow_end(self, workflow_id: str, success: bool, duration: float): ...
    def get_recent_logs(self, limit: int = 100) -> List[LogEntry]: ...
    def export_logs(self, start_time: Optional[datetime] = None,
                    end_time: Optional[datetime] = None) -> str: ...

def get_logger(component: str) -> RPALogger: ...
```
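To make the interface concrete, here is a minimal sketch of how the in-memory side of `RPALogger` could behave, assuming a bounded ring buffer behind `get_recent_logs()` (the `max_entries` parameter is an illustrative assumption; the real logger also writes to `log_file` and updates the Prometheus counters):

```python
# Illustrative sketch only: an in-memory ring buffer behind get_recent_logs().
# max_entries is assumed; file output and metrics updates are omitted.
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass
class LogEntry:
    timestamp: datetime
    level: str
    component: str
    message: str
    workflow_id: Optional[str] = None
    node_id: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

class RPALogger:
    def __init__(self, component: str, max_entries: int = 10_000):
        self.component = component
        self._entries: deque = deque(maxlen=max_entries)  # oldest dropped first

    def _log(self, level: str, message: str,
             workflow_id: Optional[str], metadata: Dict[str, Any]) -> None:
        self._entries.append(LogEntry(datetime.now(), level, self.component,
                                      message, workflow_id, metadata=metadata))

    def info(self, message: str, workflow_id: Optional[str] = None, **metadata):
        self._log("INFO", message, workflow_id, metadata)

    def get_recent_logs(self, limit: int = 100) -> List[LogEntry]:
        # Return the newest entries, oldest first within the slice.
        return list(self._entries)[-limit:]
```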
### 2. Prometheus Metrics (`core/monitoring/metrics.py`)
```python
from prometheus_client import Counter, Gauge, Histogram

# Counters
workflow_executions_total = Counter(
    'workflow_executions_total',
    'Total workflow executions',
    ['workflow_id', 'status']
)
log_entries_total = Counter(
    'log_entries_total',
    'Total log entries',
    ['level', 'component']
)
chain_executions_total = Counter(
    'chain_executions_total',
    'Total chain executions',
    ['chain_id', 'status']
)
trigger_fires_total = Counter(
    'trigger_fires_total',
    'Total trigger fires',
    ['trigger_type', 'workflow_id']
)

# Histograms
workflow_duration_seconds = Histogram(
    'workflow_duration_seconds',
    'Workflow execution duration',
    ['workflow_id']
)

# Gauges
active_workflows = Gauge('active_workflows', 'Number of active workflows')
error_rate = Gauge('error_rate', 'Current error rate percentage')
```
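Property 6 requires that `/metrics` emit valid Prometheus text exposition format. In production that text comes from `prometheus_client.generate_latest()`; purely to illustrate what the format looks like for one labeled counter, here is a dependency-free sketch (not the library's implementation):

```python
# Sketch of the Prometheus text exposition format for one labeled counter.
# Shown only to illustrate what Property 6 requires of /metrics; the real
# endpoint returns prometheus_client.generate_latest().
from typing import Dict, Tuple

def render_counter(name: str, help_text: str,
                   samples: Dict[Tuple[str, ...], float],
                   label_names: Tuple[str, ...]) -> str:
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label_values, value in sorted(samples.items()):
        labels = ",".join(f'{k}="{v}"' for k, v in zip(label_names, label_values))
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"
```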
### 3. Chain Manager (`core/monitoring/chain_manager.py`)
```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Callable, List, Optional

@dataclass
class WorkflowChain:
    chain_id: str
    name: str
    workflows: List[str]  # Ordered list of workflow_ids
    status: str  # active, inactive, running
    created_at: datetime
    last_execution: Optional[datetime] = None
    success_rate: float = 0.0

class ChainManager:
    def __init__(self, storage_path: Path): ...
    def list_chains(self) -> List[WorkflowChain]: ...
    def get_chain(self, chain_id: str) -> Optional[WorkflowChain]: ...
    def create_chain(self, name: str, workflows: List[str]) -> WorkflowChain: ...
    def validate_workflows_exist(self, workflow_ids: List[str]) -> bool: ...
    def execute_chain(self, chain_id: str, on_progress: Callable) -> ChainExecutionResult: ...
    def delete_chain(self, chain_id: str) -> bool: ...
```
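The fail-fast behavior demanded by Property 3 is the core of `execute_chain`. A minimal sketch, assuming a hypothetical `run_workflow` callable that returns success as a boolean:

```python
# Sketch of fail-fast chain execution (Property 3): stop at the first failing
# workflow and never run the ones after it. run_workflow is hypothetical.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class ChainExecutionResult:
    success: bool
    executed: List[str] = field(default_factory=list)
    failed_at: Optional[str] = None

def execute_chain(workflows: List[str],
                  run_workflow: Callable[[str], bool]) -> ChainExecutionResult:
    result = ChainExecutionResult(success=True)
    for wf_id in workflows:
        if not run_workflow(wf_id):
            # Record the failure point; subsequent workflows are skipped.
            result.success = False
            result.failed_at = wf_id
            break
        result.executed.append(wf_id)
    return result
```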
### 4. Trigger Manager (`core/monitoring/trigger_manager.py`)
```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional

@dataclass
class Trigger:
    trigger_id: str
    trigger_type: str  # schedule, file, manual
    workflow_id: str
    config: Dict[str, Any]
    enabled: bool
    created_at: datetime
    last_fired: Optional[datetime] = None

class TriggerManager:
    def __init__(self, storage_path: Path): ...
    def list_triggers(self) -> List[Trigger]: ...
    def get_trigger(self, trigger_id: str) -> Optional[Trigger]: ...
    def create_trigger(self, trigger_type: str, workflow_id: str, config: Dict) -> Trigger: ...
    def validate_config(self, trigger_type: str, config: Dict) -> bool: ...
    def enable_trigger(self, trigger_id: str) -> bool: ...
    def disable_trigger(self, trigger_id: str) -> bool: ...
    def delete_trigger(self, trigger_id: str) -> bool: ...
```
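`validate_config` is type-specific. A sketch for the `schedule` type, using the field names from the trigger storage format; the `file` rule (`watch_path`) is an assumption, since the design does not spell out file-trigger config:

```python
# Sketch of per-type trigger config validation. Schedule fields follow the
# storage format; the file-trigger "watch_path" field is assumed.
from datetime import datetime
from typing import Any, Dict

def validate_config(trigger_type: str, config: Dict[str, Any]) -> bool:
    if trigger_type == "schedule":
        try:
            # start_time / end_time must be HH:MM wall-clock strings
            datetime.strptime(config["start_time"], "%H:%M")
            datetime.strptime(config["end_time"], "%H:%M")
        except (KeyError, ValueError):
            return False
        interval = config.get("interval_seconds")
        return isinstance(interval, int) and interval > 0
    if trigger_type == "file":
        return bool(config.get("watch_path"))  # assumed field name
    return trigger_type == "manual"  # manual triggers need no config
```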
### 5. Log Exporter (`core/monitoring/log_exporter.py`)
```python
import io
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

class LogExporter:
    def __init__(self, logs_path: Path): ...

    def export_to_zip(
        self,
        start_time: Optional[datetime] = None,
        end_time: Optional[datetime] = None
    ) -> io.BytesIO: ...

    def get_execution_logs(self, start: datetime, end: datetime) -> List[Dict]: ...
    def get_error_logs(self, start: datetime, end: datetime) -> List[Dict]: ...
    def get_metrics_summary(self) -> Dict: ...
```
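Properties 13 and 14 pin down the archive's shape: a valid ZIP with exactly the three JSON members. A sketch of the in-memory assembly, assuming the three payloads have already been collected:

```python
# Sketch of in-memory ZIP assembly for export_to_zip() (Properties 13-14):
# three JSON members, built without touching disk.
import io
import json
import zipfile
from typing import Dict, List

def export_to_zip(execution_logs: List[Dict], error_logs: List[Dict],
                  metrics: Dict) -> io.BytesIO:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("execution_logs.json", json.dumps(execution_logs, indent=2))
        zf.writestr("error_logs.json", json.dumps(error_logs, indent=2))
        zf.writestr("metrics.json", json.dumps(metrics, indent=2))
    buf.seek(0)  # rewind so Flask's send_file can stream it
    return buf
```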
### 6. API Endpoints (additions to `web_dashboard/app.py`)
```python
# Chains API
@app.route('/api/chains')
def api_chains(): ...
@app.route('/api/chains', methods=['POST'])
def create_chain(): ...
@app.route('/api/chains/<chain_id>/execute', methods=['POST'])
def execute_chain(chain_id): ...
# Triggers API
@app.route('/api/triggers')
def api_triggers(): ...
@app.route('/api/triggers', methods=['POST'])
def create_trigger(): ...
@app.route('/api/triggers/<trigger_id>/toggle', methods=['POST'])
def toggle_trigger(trigger_id): ...
# Logs API
@app.route('/api/logs/download')
def download_logs(): ...
# Metrics API
@app.route('/metrics')
def prometheus_metrics(): ...
```
## Data Models
### Chain Storage Format (`data/chains/*.json`)
```json
{
  "chain_id": "chain_001",
  "name": "Complete Process",
  "workflows": ["wf_login", "wf_data_entry", "wf_submit"],
  "status": "active",
  "created_at": "2024-11-29T10:00:00",
  "last_execution": "2024-11-29T14:30:00",
  "success_rate": 92.5,
  "execution_history": [
    {
      "timestamp": "2024-11-29T14:30:00",
      "success": true,
      "duration": 45.2,
      "failed_at": null
    }
  ]
}
```
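The stored `success_rate` can be derived from `execution_history`. A one-function sketch (rounding to one decimal, matching the 92.5-style value above, is an assumption):

```python
# Sketch: success_rate as the percentage of successful runs in
# execution_history. One-decimal rounding is an assumed convention.
from typing import Dict, List

def compute_success_rate(history: List[Dict]) -> float:
    if not history:
        return 0.0
    wins = sum(1 for run in history if run.get("success"))
    return round(100.0 * wins / len(history), 1)
```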
### Trigger Storage Format (`data/triggers/*.json`)
```json
{
  "trigger_id": "trigger_001",
  "trigger_type": "schedule",
  "workflow_id": "wf_login",
  "config": {
    "interval_seconds": 3600,
    "start_time": "08:00",
    "end_time": "18:00"
  },
  "enabled": true,
  "created_at": "2024-11-29T10:00:00",
  "last_fired": "2024-11-29T14:00:00"
}
```
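The schedule config above implies a firing rule: fire only inside the daily `start_time`/`end_time` window, and only once `interval_seconds` have elapsed since `last_fired`. A sketch of that decision, under those assumptions:

```python
# Sketch of the firing decision for a schedule trigger: inside the daily
# window, and at least interval_seconds after the previous firing.
from datetime import datetime, time
from typing import Any, Dict, Optional

def should_fire(config: Dict[str, Any], now: datetime,
                last_fired: Optional[datetime]) -> bool:
    start = time.fromisoformat(config["start_time"])  # e.g. "08:00"
    end = time.fromisoformat(config["end_time"])      # e.g. "18:00"
    if not (start <= now.time() <= end):
        return False
    if last_fired is None:
        return True  # never fired before: fire on first opportunity
    return (now - last_fired).total_seconds() >= config["interval_seconds"]
```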
### Log Entry Format
```json
{
  "timestamp": "2024-11-29T14:30:15.123",
  "level": "INFO",
  "component": "execution",
  "message": "Workflow started",
  "workflow_id": "wf_001",
  "node_id": "login_node",
  "metadata": {
    "trigger": "schedule",
    "user": "system"
  }
}
```
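Property 11 requires that every returned entry match every supplied filter. Over entries in the dict form above, the match predicate can be sketched as:

```python
# Sketch of the log filter predicate (Property 11): an entry matches only if
# every supplied criterion matches; criteria left as None are ignored.
from typing import Any, Dict, Optional

def matches(entry: Dict[str, Any], level: Optional[str] = None,
            component: Optional[str] = None,
            workflow_id: Optional[str] = None) -> bool:
    for key, wanted in (("level", level), ("component", component),
                        ("workflow_id", workflow_id)):
        if wanted is not None and entry.get(key) != wanted:
            return False
    return True
```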
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system; in essence, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Chain listing completeness
*For any* set of chains stored in the system, the chains API endpoint SHALL return all chains with their complete workflow sequences and status information.
**Validates: Requirements 1.1**
### Property 2: Chain workflow validation
*For any* chain creation request with workflow references, if any referenced workflow does not exist, the creation SHALL fail with a validation error.
**Validates: Requirements 1.2**
### Property 3: Chain execution stops on failure
*For any* chain execution where a workflow fails, the chain execution SHALL stop at the failed workflow and not execute subsequent workflows.
**Validates: Requirements 1.4**
### Property 4: Trigger listing completeness
*For any* set of triggers stored in the system, the triggers API endpoint SHALL return all triggers with type, workflow_id, and enabled status.
**Validates: Requirements 2.1**
### Property 5: Trigger state persistence
*For any* trigger enable/disable operation, the new state SHALL be persisted and returned correctly on subsequent queries.
**Validates: Requirements 2.3**
### Property 6: Prometheus metrics format validity
*For any* request to the /metrics endpoint, the response SHALL be valid Prometheus exposition format parseable by Prometheus.
**Validates: Requirements 3.1**
### Property 7: Workflow execution counter increment
*For any* workflow execution (success or failure), the workflow_executions_total counter SHALL increment by exactly 1 with correct labels.
**Validates: Requirements 3.2**
### Property 8: Workflow duration histogram recording
*For any* completed workflow execution with a measured duration, the workflow_duration_seconds histogram SHALL record that duration.
**Validates: Requirements 3.3**
### Property 9: Log entry structure completeness
*For any* log entry created by the logging system, the entry SHALL contain timestamp, level, component, and message fields.
**Validates: Requirements 4.1**
### Property 10: Workflow log metadata inclusion
*For any* log entry created with workflow context, the entry metadata SHALL include workflow_id and node_id when provided.
**Validates: Requirements 4.2**
### Property 11: Log filtering correctness
*For any* log query with filter parameters, all returned entries SHALL match the specified filter criteria.
**Validates: Requirements 4.3**
### Property 12: Log counter synchronization
*For any* log entry written, the corresponding Prometheus log counter SHALL be incremented by 1.
**Validates: Requirements 4.4**
### Property 13: ZIP archive validity
*For any* log download request, the response SHALL be a valid ZIP archive.
**Validates: Requirements 5.1**
### Property 14: ZIP archive contents
*For any* log download, the ZIP archive SHALL contain execution_logs.json, error_logs.json, and metrics.json files.
**Validates: Requirements 5.2**
### Property 15: Date range filtering
*For any* log download with date range parameters, all log entries in the archive SHALL have timestamps within the specified range.
**Validates: Requirements 5.4**
## Error Handling
### Chain Errors
- `ChainNotFoundError`: Chain ID does not exist
- `WorkflowNotFoundError`: Referenced workflow does not exist
- `ChainExecutionError`: Error during chain execution with failure point
### Trigger Errors
- `TriggerNotFoundError`: Trigger ID does not exist
- `InvalidTriggerConfigError`: Trigger configuration is invalid
- `WorkflowNotFoundError`: Target workflow does not exist
### Log Errors
- `LogExportError`: Error generating log archive
- `InvalidDateRangeError`: Start date is after end date
## Testing Strategy
### Property-Based Testing Library
The implementation will use **Hypothesis** for Python property-based testing.
### Test Configuration
- Minimum 100 iterations per property test
- Each property test tagged with: `**Feature: admin-monitoring, Property {number}: {property_text}**`
### Unit Tests
- Test individual component methods
- Test API endpoint responses
- Test error handling paths
### Property-Based Tests
Each correctness property will have a corresponding property-based test:
1. **Property 1-2**: Generate random chain configurations, verify listing and validation
2. **Property 3**: Generate chains with failing workflows, verify execution stops
3. **Property 4-5**: Generate random triggers, verify listing and state persistence
4. **Property 6-8**: Generate workflow executions, verify metrics format and values
5. **Property 9-12**: Generate log entries, verify structure and counter sync
6. **Property 13-15**: Generate log data, verify ZIP contents and filtering
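Hypothesis' `@given` is the intended harness for the tests above. As a dependency-free illustration of the shape such a test takes, Property 15 (date-range filtering) can be exercised with stdlib randomness; `filter_by_range` stands in for the exporter's filtering step:

```python
# Dependency-free illustration of a property-style test for Property 15.
# The real suite would express this with Hypothesis' @given over generated
# timestamps; filter_by_range is a stand-in for the exporter's filter.
import random
from datetime import datetime, timedelta

def filter_by_range(entries, start, end):
    return [e for e in entries if start <= e["timestamp"] <= end]

def check_property_15(iterations: int = 100) -> bool:
    rng = random.Random(0)  # seeded for reproducibility
    base = datetime(2024, 11, 29)
    for _ in range(iterations):
        entries = [{"timestamp": base + timedelta(seconds=rng.randrange(86_400))}
                   for _ in range(rng.randrange(50))]
        start = base + timedelta(seconds=rng.randrange(43_200))
        end = start + timedelta(seconds=rng.randrange(43_200))
        # Property: every entry in the filtered output lies inside the range.
        if any(not (start <= e["timestamp"] <= end)
               for e in filter_by_range(entries, start, end)):
            return False
    return True
```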
### Integration Tests
- End-to-end chain execution flow
- Trigger firing and workflow execution
- Log download with various filters