# Design Document: RPA Analytics & Insights
## Overview

This design document describes the architecture and implementation of a comprehensive analytics and insights system for RPA Vision V3. The system collects execution metrics, performs real-time and historical analysis, detects anomalies, generates automated insights, and provides customizable dashboards and reports.

The analytics system is designed to be:

- **Non-intrusive**: Minimal impact on workflow execution performance
- **Scalable**: Handle high-volume metric collection and analysis
- **Real-time**: Provide sub-second latency for live monitoring
- **Intelligent**: Automatic anomaly detection and insight generation
- **Flexible**: Customizable dashboards, reports, and alerts
## Architecture
```mermaid
graph TB
    subgraph "Data Collection"
        EC[Execution Collector]
        MC[Metrics Collector]
        RC[Resource Collector]
        Buffer[Async Buffer]
    end

    subgraph "Storage Layer"
        TS[Time Series DB]
        MS[Metrics Store]
        AS[Archive Storage]
    end

    subgraph "Analytics Engine"
        PA[Performance Analyzer]
        AA[Anomaly Detector]
        IA[Insight Generator]
        CA[Comparative Analyzer]
    end

    subgraph "Query & Aggregation"
        QE[Query Engine]
        AG[Aggregator]
        Cache[Query Cache]
    end

    subgraph "Presentation"
        API[Analytics API]
        RT[Real-time Stream]
        RG[Report Generator]
        DM[Dashboard Manager]
    end

    EC --> Buffer
    MC --> Buffer
    RC --> Buffer
    Buffer --> TS
    Buffer --> MS

    TS --> QE
    MS --> QE
    QE --> AG
    AG --> Cache

    QE --> PA
    QE --> AA
    QE --> IA
    QE --> CA

    PA --> API
    AA --> API
    IA --> API
    CA --> API

    API --> RT
    API --> RG
    API --> DM

    MS --> AS
```
## Components and Interfaces
### 1. Metrics Collection (`core/analytics/collection/`)

#### A. Execution Collector
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Union
import threading


@dataclass
class ExecutionMetrics:
    """Metrics for a workflow execution."""
    execution_id: str
    workflow_id: str
    started_at: datetime
    completed_at: Optional[datetime]
    duration_ms: Optional[float]
    status: str  # 'running', 'completed', 'failed'
    steps_total: int
    steps_completed: int
    steps_failed: int
    error_message: Optional[str] = None
    context: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StepMetrics:
    """Metrics for a workflow step."""
    step_id: str
    execution_id: str
    workflow_id: str
    node_id: str
    action_type: str
    target_element: str
    started_at: datetime
    completed_at: datetime
    duration_ms: float
    status: str
    confidence_score: float
    retry_count: int = 0
    error_details: Optional[str] = None


class MetricsCollector:
    """Collects metrics from workflow executions."""

    def __init__(self, buffer_size: int = 1000, flush_interval_sec: float = 5.0):
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval_sec
        self._buffer: List[Union[ExecutionMetrics, StepMetrics]] = []
        self._lock = threading.Lock()
        self._flush_thread: Optional[threading.Thread] = None

    def record_execution_start(self, execution_id: str, workflow_id: str) -> None:
        """Record the start of a workflow execution."""

    def record_execution_complete(
        self,
        execution_id: str,
        status: str,
        error_message: Optional[str] = None
    ) -> None:
        """Record the completion of a workflow execution."""

    def record_step(self, step_metrics: StepMetrics) -> None:
        """Record metrics for a completed step."""

    def flush(self) -> None:
        """Flush buffered metrics to storage."""
```
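The buffered-collection pattern above can be sketched end to end. This is a minimal illustration, not the real `MetricsCollector`: `BufferedCollector`, `sink`, and `record` are hypothetical names, and the periodic flush thread is replaced by an explicit `flush()` call so the flow is easy to follow.

```python
import threading
from typing import Any, Callable, List

class BufferedCollector:
    """Sketch of the async-buffer pattern: records accumulate in memory and
    are flushed as a batch when the buffer fills. A background timer thread
    would call flush() on the same code path at each flush interval."""

    def __init__(self, sink: Callable[[List[Any]], None], buffer_size: int = 1000):
        self._sink = sink              # persistence callback, e.g. a store's write_metrics
        self._buffer_size = buffer_size
        self._buffer: List[Any] = []
        self._lock = threading.Lock()

    def record(self, item: Any) -> None:
        with self._lock:
            self._buffer.append(item)
            full = len(self._buffer) >= self._buffer_size
        if full:
            self.flush()

    def flush(self) -> None:
        # Swap the buffer out under the lock, then persist outside it so
        # slow storage never blocks the execution loop.
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._sink(batch)

written = []
collector = BufferedCollector(sink=written.extend, buffer_size=3)
for i in range(7):
    collector.record(i)
collector.flush()  # drain the final partial batch
```

Swapping the buffer under the lock and writing outside it is what keeps the collection path non-intrusive: the workflow thread only ever pays for an in-memory append.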

#### B. Resource Collector
```python
@dataclass
class ResourceMetrics:
    """System resource usage metrics."""
    timestamp: datetime
    workflow_id: Optional[str]
    execution_id: Optional[str]
    cpu_percent: float
    memory_mb: float
    gpu_utilization: float
    gpu_memory_mb: float
    disk_io_mb: float


class ResourceCollector:
    """Collects system resource usage metrics."""

    def __init__(self, sample_interval_sec: float = 1.0):
        self.sample_interval = sample_interval_sec
        self._running = False
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        """Start collecting resource metrics."""

    def stop(self) -> None:
        """Stop collecting resource metrics."""

    def get_current_metrics(self) -> ResourceMetrics:
        """Get current resource usage."""
```
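The start/stop sampling loop behind `ResourceCollector` can be sketched with an interruptible wait. The `Sampler` class and its pluggable `read_fn` are hypothetical; in the real collector `read_fn` would wrap actual probes (for example `psutil.cpu_percent()` and `psutil.virtual_memory()` for the CPU and memory fields).

```python
import threading
import time
from typing import Callable, List

class Sampler:
    """Sketch of the sampling loop: a daemon thread reads a probe on a
    fixed interval until stop() is requested."""

    def __init__(self, read_fn: Callable[[], float], interval_sec: float = 1.0):
        self._read = read_fn
        self._interval = interval_sec
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None
        self.samples: List[float] = []

    def start(self) -> None:
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            self.samples.append(self._read())
            # Event.wait doubles as an interruptible sleep: it returns True
            # immediately if stop() is called mid-interval.
            if self._stop.wait(self._interval):
                break

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

s = Sampler(read_fn=lambda: 42.0, interval_sec=0.01)
s.start()
time.sleep(0.05)
s.stop()
```

Using `Event.wait` instead of `time.sleep` is what lets `stop()` return promptly rather than waiting out a full sample interval.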

### 2. Storage Layer (`core/analytics/storage/`)

#### A. Time Series Store
```python
class TimeSeriesStore:
    """Store for time-series metrics data."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        # Use SQLite with time-series optimizations
        self.db_path = storage_path / 'timeseries.db'

    def write_metrics(self, metrics: List[Union[ExecutionMetrics, StepMetrics]]) -> None:
        """Write metrics to time-series storage."""

    def query_range(
        self,
        start_time: datetime,
        end_time: datetime,
        workflow_id: Optional[str] = None,
        metric_types: Optional[List[str]] = None
    ) -> List[Dict]:
        """Query metrics within a time range."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,  # 'avg', 'sum', 'count', 'min', 'max'
        group_by: List[str],
        start_time: datetime,
        end_time: datetime,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""
```
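Since the store is SQLite-backed, the `aggregate` call maps naturally onto a single `GROUP BY` statement. The sketch below shows that mapping against a cut-down stand-in for the `step_metrics` table (the time-range and filter clauses are omitted for brevity); note that metric and group-by identifiers cannot be bound as SQL parameters, so real code must validate them against an allow-list, as done here for the aggregation function.

```python
import sqlite3

# In-memory stand-in for the step_metrics table with only the columns the
# example needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE step_metrics (workflow_id TEXT, action_type TEXT, duration_ms REAL)")
conn.executemany(
    "INSERT INTO step_metrics VALUES (?, ?, ?)",
    [("wf1", "click", 100.0), ("wf1", "click", 300.0), ("wf1", "type", 50.0)],
)

AGG_FUNCS = {"avg": "AVG", "sum": "SUM", "count": "COUNT", "min": "MIN", "max": "MAX"}

def aggregate(metric: str, aggregation: str, group_by: list) -> list:
    """Build and run the aggregation SQL. In real code, `metric` and
    `group_by` would also be checked against known column names."""
    func = AGG_FUNCS[aggregation]          # allow-list lookup, not string interp of user input
    cols = ", ".join(group_by)
    sql = f"SELECT {cols}, {func}({metric}) FROM step_metrics GROUP BY {cols}"
    return conn.execute(sql).fetchall()

rows = dict(aggregate("duration_ms", "avg", ["action_type"]))
```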

#### B. Archive Storage
```python
class ArchiveStorage:
    """Archive storage for old metrics."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        self.archive_path = storage_path / 'archive'

    def archive_data(
        self,
        data: List[Dict],
        archive_date: datetime
    ) -> str:
        """Archive data with compression."""

    def query_archive(
        self,
        start_date: datetime,
        end_date: datetime,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Query archived data."""

    def apply_retention_policy(
        self,
        policy: Dict[str, int]  # metric_type -> retention_days
    ) -> int:
        """Apply retention policy and return the number of records deleted."""
```
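One way to realize "archive with compression" is gzip-compressed JSON lines, one file per archive date. This is a sketch under that assumption; the file-naming scheme and helper names are illustrative, and the round trip also demonstrates the integrity requirement of Property 20 (archived data must match the original when decompressed).

```python
import gzip
import json
import tempfile
from datetime import datetime
from pathlib import Path

def archive_data(archive_dir: Path, data: list, archive_date: datetime) -> Path:
    """Write one day's records as gzip-compressed JSON lines and return the
    archive file path (the date-based layout is an assumption)."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    path = archive_dir / f"{archive_date:%Y-%m-%d}.jsonl.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record) + "\n")
    return path

def query_archive(path: Path) -> list:
    """Read every record back from one archive file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

tmp = Path(tempfile.mkdtemp())
records = [{"execution_id": "e1", "duration_ms": 120.5}]
p = archive_data(tmp, records, datetime(2025, 1, 31))
restored = query_archive(p)
```

Date-partitioned files also make `apply_retention_policy` cheap: expiring a day of data is a single file delete rather than a row-by-row scan.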

### 3. Analytics Engine (`core/analytics/engine/`)

#### A. Performance Analyzer
```python
@dataclass
class PerformanceStats:
    """Performance statistics."""
    workflow_id: str
    time_period: str
    execution_count: int
    avg_duration_ms: float
    median_duration_ms: float
    p95_duration_ms: float
    p99_duration_ms: float
    min_duration_ms: float
    max_duration_ms: float
    std_dev_ms: float
    slowest_steps: List[Dict]


class PerformanceAnalyzer:
    """Analyzes workflow performance."""

    def __init__(self, time_series_store: TimeSeriesStore):
        self.store = time_series_store

    def analyze_workflow(
        self,
        workflow_id: str,
        start_time: datetime,
        end_time: datetime
    ) -> PerformanceStats:
        """Analyze performance for a workflow."""

    def identify_bottlenecks(
        self,
        workflow_id: str,
        threshold_percentile: float = 0.95
    ) -> List[Dict]:
        """Identify bottleneck steps in a workflow."""

    def detect_performance_degradation(
        self,
        workflow_id: str,
        baseline_period: timedelta,
        current_period: timedelta,
        threshold_percent: float = 20.0
    ) -> Optional[Dict]:
        """Detect performance degradation compared to a baseline."""
```
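The duration statistics that `analyze_workflow` must produce (Property 5) can be computed with the standard library alone. A sketch, using the nearest-rank percentile convention; interpolation-based definitions (e.g. NumPy's default) give slightly different p95/p99 values, so the chosen convention should be documented wherever these numbers are surfaced.

```python
import math
import statistics
from typing import Dict, List

def summarize_durations(durations_ms: List[float]) -> Dict[str, float]:
    """Compute the PerformanceStats duration fields for one workflow."""
    values = sorted(durations_ms)
    n = len(values)

    def pct(q: float) -> float:
        # Nearest-rank percentile: smallest value with at least q% of the
        # data at or below it.
        rank = math.ceil(q / 100 * n)
        return values[max(rank - 1, 0)]

    return {
        "avg": statistics.fmean(values),
        "median": statistics.median(values),
        "p95": pct(95),
        "p99": pct(99),
        "min": values[0],
        "max": values[-1],
        "std_dev": statistics.pstdev(values),
    }

# 95 fast runs and 5 slow outliers: the outliers move avg and p99 but not
# the median, which is why the spec tracks all of them.
stats = summarize_durations([100.0] * 95 + [1000.0] * 5)
```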

#### B. Anomaly Detector
```python
@dataclass
class Anomaly:
    """Detected anomaly."""
    anomaly_id: str
    workflow_id: str
    metric_name: str
    detected_at: datetime
    severity: float  # 0.0 to 1.0
    deviation: float
    baseline_value: float
    actual_value: float
    description: str
    recommended_action: Optional[str] = None


class AnomalyDetector:
    """Detects anomalies in workflow execution."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        sensitivity: float = 2.0  # standard deviations
    ):
        self.store = time_series_store
        self.sensitivity = sensitivity
        self.baselines: Dict[str, Dict] = {}

    def detect_anomalies(
        self,
        workflow_id: str,
        metrics: List[Dict]
    ) -> List[Anomaly]:
        """Detect anomalies in metrics."""

    def update_baseline(
        self,
        workflow_id: str,
        stable_period_days: int = 7
    ) -> None:
        """Update the baseline from a stable period."""

    def correlate_anomalies(
        self,
        anomalies: List[Anomaly],
        time_window_minutes: int = 30
    ) -> List[List[Anomaly]]:
        """Correlate related anomalies."""
```
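Since `sensitivity` is expressed in standard deviations, the core check is a z-score test against the baseline. A sketch of one metric's decision; the severity formula (excess deviation squashed into [0, 1], satisfying Property 11) is an assumption, as the spec only fixes the severity range.

```python
import statistics
from typing import Dict, List, Optional

def detect_anomaly(
    baseline: List[float], value: float, sensitivity: float = 2.0
) -> Optional[Dict[str, float]]:
    """Flag `value` if its z-score against the baseline exceeds
    `sensitivity`; return the Anomaly-like fields, or None."""
    mean = statistics.fmean(baseline)
    std = statistics.pstdev(baseline)
    if std == 0:
        return None                      # flat baseline: z-score undefined
    z = abs(value - mean) / std
    if z <= sensitivity:
        return None
    # Hypothetical severity: how far past the threshold, capped at 1.0.
    severity = min(1.0, (z - sensitivity) / sensitivity)
    return {
        "baseline_value": mean,
        "actual_value": value,
        "deviation": z,
        "severity": severity,
    }

a = detect_anomaly([10.0, 12.0, 11.0, 9.0, 10.0, 12.0], 30.0)  # far outside baseline
b = detect_anomaly([10.0, 12.0, 11.0, 9.0, 10.0, 12.0], 11.0)  # within baseline
```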

#### C. Insight Generator
```python
@dataclass
class Insight:
    """Generated insight."""
    insight_id: str
    workflow_id: str
    category: str  # 'performance', 'reliability', 'resource', 'best_practice'
    title: str
    description: str
    recommendation: str
    expected_impact: str
    ease_of_implementation: str  # 'easy', 'medium', 'hard'
    priority_score: float
    supporting_data: Dict[str, Any]
    created_at: datetime


class InsightGenerator:
    """Generates automated insights."""

    def __init__(
        self,
        performance_analyzer: PerformanceAnalyzer,
        anomaly_detector: AnomalyDetector
    ):
        self.performance_analyzer = performance_analyzer
        self.anomaly_detector = anomaly_detector

    def generate_insights(
        self,
        workflow_id: str,
        analysis_period_days: int = 30
    ) -> List[Insight]:
        """Generate insights for a workflow."""

    def prioritize_insights(
        self,
        insights: List[Insight]
    ) -> List[Insight]:
        """Prioritize insights by impact and ease of implementation."""

    def track_insight_implementation(
        self,
        insight_id: str,
        implemented: bool,
        actual_impact: Optional[Dict] = None
    ) -> None:
        """Track insight implementation and measure its impact."""
```
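"Prioritize by impact and ease" can be made concrete with a weighted score. The weight tables below are hypothetical (the spec does not fix them), and `MiniInsight` stands in for the full `Insight` dataclass; what the sketch pins down is the ordering contract of Property 14: descending `priority_score`.

```python
from dataclasses import dataclass
from typing import List

# Assumed weightings: higher impact raises priority, harder implementation
# lowers it. Tune to taste; only the descending sort is mandated.
IMPACT_WEIGHT = {"low": 0.3, "medium": 0.6, "high": 1.0}
EASE_WEIGHT = {"easy": 1.0, "medium": 0.6, "hard": 0.3}

@dataclass
class MiniInsight:
    title: str
    expected_impact: str           # 'low' | 'medium' | 'high'
    ease_of_implementation: str    # 'easy' | 'medium' | 'hard'
    priority_score: float = 0.0

def prioritize(insights: List[MiniInsight]) -> List[MiniInsight]:
    """Score each insight, then order by priority_score descending."""
    for ins in insights:
        ins.priority_score = (
            IMPACT_WEIGHT[ins.expected_impact] * EASE_WEIGHT[ins.ease_of_implementation]
        )
    return sorted(insights, key=lambda i: i.priority_score, reverse=True)

ranked = prioritize([
    MiniInsight("cache selector templates", "medium", "easy"),
    MiniInsight("parallelize steps", "high", "hard"),
    MiniInsight("raise retry backoff", "high", "easy"),
])
```

A multiplicative score like this naturally surfaces "quick wins" (high impact, easy) over high-impact but costly changes, which matches the intent of `ease_of_implementation`.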

### 4. Query Engine (`core/analytics/query/`)

```python
class QueryEngine:
    """Query engine for analytics data."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        archive_storage: ArchiveStorage,
        cache_size: int = 100
    ):
        self.ts_store = time_series_store
        self.archive = archive_storage
        self.cache = LRUCache(cache_size)

    def query(
        self,
        query: Dict[str, Any],
        use_cache: bool = True
    ) -> List[Dict]:
        """Execute a query against analytics data."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,
        group_by: List[str],
        filters: Dict[str, Any],
        time_range: Tuple[datetime, datetime]
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""

    def compare(
        self,
        workflow_ids: List[str],
        metrics: List[str],
        time_range: Tuple[datetime, datetime]
    ) -> Dict[str, Dict]:
        """Compare metrics across workflows."""
```
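The `LRUCache` the engine instantiates is not defined elsewhere in this document; a minimal version fitting the `LRUCache(cache_size)` constructor can be built on `collections.OrderedDict`. The `get`/`put` interface is an assumption.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional

class LRUCache:
    """Bounded cache with least-recently-used eviction."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("q1", [1])
cache.put("q2", [2])
cache.get("q1")        # refreshes q1, so q2 is now least recently used
cache.put("q3", [3])   # capacity exceeded: evicts q2
```

Since query dicts are unhashable, the cache key would in practice be a canonical serialization of the query (e.g. sorted-key JSON).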

### 5. Real-time Analytics (`core/analytics/realtime/`)

```python
class RealtimeAnalytics:
    """Real-time analytics for active workflows."""

    def __init__(self, metrics_collector: MetricsCollector):
        self.collector = metrics_collector
        self.active_executions: Dict[str, ExecutionMetrics] = {}
        self.subscribers: Dict[str, List[Callable]] = {}

    def track_execution(self, execution_id: str, workflow_id: str) -> None:
        """Start tracking an execution in real time."""

    def update_progress(
        self,
        execution_id: str,
        current_step: int,
        total_steps: int
    ) -> None:
        """Update execution progress."""

    def get_live_metrics(self, execution_id: str) -> Dict[str, Any]:
        """Get live metrics for an execution."""

    def subscribe(
        self,
        execution_id: str,
        callback: Callable[[Dict], None]
    ) -> None:
        """Subscribe to real-time updates."""
```
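The subscribe/notify path can be sketched as a small per-execution broker. `ProgressBroker` and the event payload shape are hypothetical; a WebSocket layer would simply register a callback that forwards each event to connected clients.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class ProgressBroker:
    """Callbacks registered per execution_id receive each progress update
    synchronously; a real implementation would dispatch asynchronously so a
    slow subscriber cannot stall the execution loop."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, execution_id: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[execution_id].append(callback)

    def update_progress(self, execution_id: str, current_step: int, total_steps: int) -> None:
        event = {
            "execution_id": execution_id,
            "current_step": current_step,
            "total_steps": total_steps,
            "percent": 100.0 * current_step / total_steps,
        }
        for callback in self._subscribers.get(execution_id, []):
            callback(event)

events = []
broker = ProgressBroker()
broker.subscribe("exec-1", events.append)
broker.update_progress("exec-1", 2, 8)
```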

## Data Models

### Metrics Schema

```sql
-- Execution metrics table
CREATE TABLE execution_metrics (
    execution_id TEXT PRIMARY KEY,
    workflow_id TEXT NOT NULL,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    duration_ms REAL,
    status TEXT NOT NULL,
    steps_total INTEGER,
    steps_completed INTEGER,
    steps_failed INTEGER,
    error_message TEXT,
    context JSON
);

CREATE INDEX idx_workflow_time ON execution_metrics(workflow_id, started_at);
CREATE INDEX idx_status ON execution_metrics(status);

-- Step metrics table
CREATE TABLE step_metrics (
    step_id TEXT PRIMARY KEY,
    execution_id TEXT NOT NULL,
    workflow_id TEXT NOT NULL,
    node_id TEXT NOT NULL,
    action_type TEXT NOT NULL,
    target_element TEXT,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP NOT NULL,
    duration_ms REAL NOT NULL,
    status TEXT NOT NULL,
    confidence_score REAL,
    retry_count INTEGER DEFAULT 0,
    error_details TEXT,
    FOREIGN KEY (execution_id) REFERENCES execution_metrics(execution_id)
);

CREATE INDEX idx_execution ON step_metrics(execution_id);
CREATE INDEX idx_workflow_action ON step_metrics(workflow_id, action_type);

-- Resource metrics table
CREATE TABLE resource_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TIMESTAMP NOT NULL,
    workflow_id TEXT,
    execution_id TEXT,
    cpu_percent REAL NOT NULL,
    memory_mb REAL NOT NULL,
    gpu_utilization REAL,
    gpu_memory_mb REAL,
    disk_io_mb REAL
);

CREATE INDEX idx_resource_time ON resource_metrics(timestamp);
CREATE INDEX idx_resource_workflow ON resource_metrics(workflow_id, timestamp);
```
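A quick sanity check of the schema against Property 8 (success rate = successful_count / total_count * 100): the rate falls out of a single grouped query. The sketch below trims `execution_metrics` to the columns the query touches and assumes `status = 'completed'` marks success; SQLite evaluates the comparison to 0/1, so `SUM` counts the matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the columns the success-rate query needs, for brevity.
conn.execute("""
    CREATE TABLE execution_metrics (
        execution_id TEXT PRIMARY KEY,
        workflow_id  TEXT NOT NULL,
        status       TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO execution_metrics VALUES (?, ?, ?)",
    [("e1", "wf1", "completed"), ("e2", "wf1", "failed"),
     ("e3", "wf1", "completed"), ("e4", "wf1", "completed")],
)

rates = dict(conn.execute("""
    SELECT workflow_id,
           100.0 * SUM(status = 'completed') / COUNT(*) AS success_rate
    FROM execution_metrics
    GROUP BY workflow_id
""").fetchall())
```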

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Metrics completeness
*For any* workflow execution, all required metrics (execution_id, workflow_id, timestamps, duration) SHALL be recorded.
**Validates: Requirements 1.1, 1.4**

### Property 2: Step metrics integrity
*For any* completed step, the step metrics SHALL include action_type, target_element, and the execution result (status, duration, and confidence score).
**Validates: Requirements 1.2**

### Property 3: Failure recording completeness
*For any* failed execution, the failure reason, error details, and context SHALL be recorded.
**Validates: Requirements 1.3**

### Property 4: Async persistence guarantee
*For any* buffered metrics, they SHALL eventually be persisted to storage within the flush interval.
**Validates: Requirements 1.5**

### Property 5: Statistical accuracy
*For any* dataset of execution times, the calculated average, median, p95, and p99 SHALL match standard statistical definitions.
**Validates: Requirements 2.1**

### Property 6: Bottleneck identification correctness
*For any* workflow, the identified bottleneck steps SHALL be the steps with the highest execution times.
**Validates: Requirements 2.3**

### Property 7: Performance degradation detection
*For any* workflow whose execution time increases above the threshold, an alert SHALL be generated.
**Validates: Requirements 2.4**

### Property 8: Success rate calculation accuracy
*For any* set of executions, the success rate SHALL equal (successful_count / total_count) * 100.
**Validates: Requirements 3.1**

### Property 9: Failure categorization completeness
*For any* set of failures, all failures SHALL be assigned to a category.
**Validates: Requirements 3.2**

### Property 10: Anomaly detection sensitivity
*For any* metric value that deviates from the baseline by more than the sensitivity threshold, an anomaly SHALL be detected.
**Validates: Requirements 4.1**

### Property 11: Severity score validity
*For any* detected anomaly, the severity score SHALL be between 0.0 and 1.0.
**Validates: Requirements 4.2**

### Property 12: Resource tracking completeness
*For any* workflow execution, CPU, memory, and GPU metrics SHALL be tracked.
**Validates: Requirements 5.1**

### Property 13: Insight generation consistency
*For any* workflow with performance issues, at least one actionable insight SHALL be generated.
**Validates: Requirements 6.1**

### Property 14: Insight prioritization correctness
*For any* set of insights, they SHALL be ordered by priority_score in descending order.
**Validates: Requirements 6.4**

### Property 15: Filter application correctness
*For any* query with filters, only records matching all filter criteria SHALL be returned.
**Validates: Requirements 7.1**

### Property 16: Export format validity
*For any* report export, the output SHALL be valid according to the target format specification (PDF, CSV, JSON).
**Validates: Requirements 7.3**

### Property 17: Comparison calculation accuracy
*For any* two workflows being compared, the difference calculations SHALL be mathematically correct.
**Validates: Requirements 8.1**

### Property 18: Real-time latency guarantee
*For any* real-time metric request, the response SHALL be delivered within 1 second.
**Validates: Requirements 9.1**

### Property 19: Retention policy enforcement
*For any* record older than its retention period, the record SHALL be archived or deleted according to policy.
**Validates: Requirements 10.2**

### Property 20: Archive data integrity
*For any* archived data, it SHALL be retrievable and match the original data when decompressed.
**Validates: Requirements 10.3**
## Integration Points

### With Execution Loop
- Hook into execution start/complete events
- Collect step-level metrics during execution
- Minimal performance impact (<1% overhead)

### With Self-Healing System
- Integrate recovery metrics
- Track recovery success rates
- Correlate failures with recovery attempts

### With Dashboard
- Provide a REST API for metrics
- WebSocket for real-time updates
- Export endpoints for reports
## Performance Considerations

### Optimization Strategies

1. **Async Collection**: Buffer metrics and persist asynchronously
2. **Query Caching**: Cache frequently accessed aggregations
3. **Index Optimization**: Strategic indexes on time-series data
4. **Data Partitioning**: Partition by time for efficient queries
5. **Archive Strategy**: Move old data to compressed archives

### Scalability Targets

- Handle 1000+ workflow executions per hour
- Support 10,000+ steps per hour
- Real-time queries < 1 second
- Historical queries < 5 seconds
- Storage growth < 1 GB per month
## Testing Strategy

### Property-Based Testing
Use Hypothesis to test the correctness properties:
- Generate random execution data
- Verify statistical calculations
- Test anomaly detection with synthetic data
- Validate query filters and aggregations
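The shape of such a test, for Property 5, looks like the following. The sketch uses a stdlib `random` loop so it is self-contained; with Hypothesis, the generation loop would be replaced by a `@given(st.lists(st.floats(...), min_size=1))` decorator, and `summarize` is a hypothetical stand-in for the analyzer's statistics path.

```python
import random
import statistics

def summarize(durations):
    """Stand-in for the analyzer's average/median calculation."""
    values = sorted(durations)
    return {"avg": sum(values) / len(values), "median": statistics.median(values)}

# Property 5: results must match standard statistical definitions on any
# generated dataset, not just hand-picked examples.
random.seed(7)
for _ in range(200):
    data = [random.uniform(1.0, 5000.0) for _ in range(random.randint(1, 50))]
    result = summarize(data)
    assert abs(result["avg"] - statistics.fmean(data)) < 1e-6
    assert result["median"] == statistics.median(data)
```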

### Integration Testing
- End-to-end metric collection and analysis
- Real-time analytics with simulated workflows
- Archive and retention policy testing
- Dashboard integration testing

### Performance Testing
- Load testing with high metric volume
- Query performance benchmarking
- Real-time latency testing
- Storage growth monitoring