# Design Document: RPA Analytics & Insights

## Overview

This design document describes the architecture and implementation of a comprehensive analytics and insights system for RPA Vision V3. The system collects execution metrics, performs real-time and historical analysis, detects anomalies, generates automated insights, and provides customizable dashboards and reports.

The analytics system is designed to be:

- **Non-intrusive**: Minimal impact on workflow execution performance
- **Scalable**: Handle high-volume metric collection and analysis
- **Real-time**: Provide sub-second latency for live monitoring
- **Intelligent**: Automatic anomaly detection and insight generation
- **Flexible**: Customizable dashboards, reports, and alerts

## Architecture

```mermaid
graph TB
    subgraph "Data Collection"
        EC[Execution Collector]
        MC[Metrics Collector]
        RC[Resource Collector]
        Buffer[Async Buffer]
    end

    subgraph "Storage Layer"
        TS[Time Series DB]
        MS[Metrics Store]
        AS[Archive Storage]
    end

    subgraph "Analytics Engine"
        PA[Performance Analyzer]
        AA[Anomaly Detector]
        IA[Insight Generator]
        CA[Comparative Analyzer]
    end

    subgraph "Query & Aggregation"
        QE[Query Engine]
        AG[Aggregator]
        Cache[Query Cache]
    end

    subgraph "Presentation"
        API[Analytics API]
        RT[Real-time Stream]
        RG[Report Generator]
        DM[Dashboard Manager]
    end

    EC --> Buffer
    MC --> Buffer
    RC --> Buffer
    Buffer --> TS
    Buffer --> MS
    TS --> QE
    MS --> QE
    QE --> AG
    AG --> Cache
    QE --> PA
    QE --> AA
    QE --> IA
    QE --> CA
    PA --> API
    AA --> API
    IA --> API
    CA --> API
    API --> RT
    API --> RG
    API --> DM
    MS --> AS
```

## Components and Interfaces

### 1. Metrics Collection (`core/analytics/collection/`)

#### A. Execution Collector

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Union
import threading


@dataclass
class ExecutionMetrics:
    """Metrics for a workflow execution."""
    execution_id: str
    workflow_id: str
    started_at: datetime
    completed_at: Optional[datetime]
    duration_ms: Optional[float]
    status: str  # 'running', 'completed', 'failed'
    steps_total: int
    steps_completed: int
    steps_failed: int
    error_message: Optional[str] = None
    context: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StepMetrics:
    """Metrics for a workflow step."""
    step_id: str
    execution_id: str
    workflow_id: str
    node_id: str
    action_type: str
    target_element: str
    started_at: datetime
    completed_at: datetime
    duration_ms: float
    status: str
    confidence_score: float
    retry_count: int = 0
    error_details: Optional[str] = None


class MetricsCollector:
    """Collects metrics from workflow executions."""

    def __init__(self, buffer_size: int = 1000, flush_interval_sec: float = 5.0):
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval_sec
        self._buffer: List[Union[ExecutionMetrics, StepMetrics]] = []
        self._lock = threading.Lock()
        self._flush_thread: Optional[threading.Thread] = None

    def record_execution_start(self, execution_id: str, workflow_id: str) -> None:
        """Record the start of a workflow execution."""

    def record_execution_complete(
        self,
        execution_id: str,
        status: str,
        error_message: Optional[str] = None
    ) -> None:
        """Record the completion of a workflow execution."""

    def record_step(self, step_metrics: StepMetrics) -> None:
        """Record metrics for a completed step."""

    def flush(self) -> None:
        """Flush buffered metrics to storage."""
```
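The collector interfaces above are intentionally left unimplemented. For illustration only, here is one way the buffered, asynchronous flush could be structured; the `sink` callable standing in for `TimeSeriesStore.write_metrics` and the `BufferedFlusher` name are assumptions of this sketch, not part of the interface:

```python
import threading
from typing import Callable


class BufferedFlusher:
    """Illustrative helper: drains a metrics buffer on a background thread."""

    def __init__(self, sink: Callable[[list], None], flush_interval_sec: float = 5.0):
        self._sink = sink                      # e.g. TimeSeriesStore.write_metrics
        self._flush_interval = flush_interval_sec
        self._buffer: list = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def add(self, metric) -> None:
        with self._lock:
            self._buffer.append(metric)

    def flush(self) -> None:
        # Swap the buffer under the lock, then persist outside it so
        # metric producers are never blocked on storage I/O.
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._sink(batch)

    def _run(self) -> None:
        while not self._stop.wait(self._flush_interval):
            self.flush()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
        self.flush()  # final drain so no metrics are lost on shutdown
```

The swap-then-write pattern keeps recording non-blocking while still persisting every buffered metric within one flush interval (see Property 4 below).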
#### B. Resource Collector

```python
@dataclass
class ResourceMetrics:
    """System resource usage metrics."""
    timestamp: datetime
    workflow_id: Optional[str]
    execution_id: Optional[str]
    cpu_percent: float
    memory_mb: float
    gpu_utilization: float
    gpu_memory_mb: float
    disk_io_mb: float


class ResourceCollector:
    """Collects system resource usage metrics."""

    def __init__(self, sample_interval_sec: float = 1.0):
        self.sample_interval = sample_interval_sec
        self._running = False
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        """Start collecting resource metrics."""

    def stop(self) -> None:
        """Stop collecting resource metrics."""

    def get_current_metrics(self) -> ResourceMetrics:
        """Get current resource usage."""
```

### 2. Storage Layer (`core/analytics/storage/`)

#### A. Time Series Store

```python
class TimeSeriesStore:
    """Store for time-series metrics data."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        # Use SQLite with time-series optimizations
        self.db_path = storage_path / 'timeseries.db'

    def write_metrics(self, metrics: List[Union[ExecutionMetrics, StepMetrics]]) -> None:
        """Write metrics to time-series storage."""

    def query_range(
        self,
        start_time: datetime,
        end_time: datetime,
        workflow_id: Optional[str] = None,
        metric_types: Optional[List[str]] = None
    ) -> List[Dict]:
        """Query metrics within a time range."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,  # 'avg', 'sum', 'count', 'min', 'max'
        group_by: List[str],
        start_time: datetime,
        end_time: datetime,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""
```

#### B. Archive Storage

```python
class ArchiveStorage:
    """Archive storage for old metrics."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        self.archive_path = storage_path / 'archive'

    def archive_data(
        self,
        data: List[Dict],
        archive_date: datetime
    ) -> str:
        """Archive data with compression."""

    def query_archive(
        self,
        start_date: datetime,
        end_date: datetime,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Query archived data."""

    def apply_retention_policy(
        self,
        policy: Dict[str, int]  # metric_type -> retention_days
    ) -> int:
        """Apply retention policy and return number of records deleted."""
```

### 3. Analytics Engine (`core/analytics/engine/`)

#### A. Performance Analyzer

```python
@dataclass
class PerformanceStats:
    """Performance statistics."""
    workflow_id: str
    time_period: str
    execution_count: int
    avg_duration_ms: float
    median_duration_ms: float
    p95_duration_ms: float
    p99_duration_ms: float
    min_duration_ms: float
    max_duration_ms: float
    std_dev_ms: float
    slowest_steps: List[Dict]


class PerformanceAnalyzer:
    """Analyzes workflow performance."""

    def __init__(self, time_series_store: TimeSeriesStore):
        self.store = time_series_store

    def analyze_workflow(
        self,
        workflow_id: str,
        start_time: datetime,
        end_time: datetime
    ) -> PerformanceStats:
        """Analyze performance for a workflow."""

    def identify_bottlenecks(
        self,
        workflow_id: str,
        threshold_percentile: float = 0.95
    ) -> List[Dict]:
        """Identify bottleneck steps in a workflow."""

    def detect_performance_degradation(
        self,
        workflow_id: str,
        baseline_period: timedelta,
        current_period: timedelta,
        threshold_percent: float = 20.0
    ) -> Optional[Dict]:
        """Detect performance degradation compared to baseline."""
```
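The exact statistics behind `analyze_workflow` are not prescribed by this design. The sketch below is one self-contained way to produce the `PerformanceStats` numbers, using nearest-rank percentiles; the helper name `summarize_durations` is illustrative:

```python
import math
import statistics
from typing import Dict, List


def summarize_durations(durations_ms: List[float]) -> Dict[str, float]:
    """Compute the summary statistics PerformanceStats expects."""
    if not durations_ms:
        raise ValueError("no executions in the selected time range")
    ordered = sorted(durations_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile; an interpolating variant is equally
        # defensible as long as tests use the same definition (Property 5).
        idx = max(math.ceil(p * len(ordered)) - 1, 0)
        return ordered[idx]

    return {
        "execution_count": len(ordered),
        "avg_duration_ms": statistics.fmean(ordered),
        "median_duration_ms": statistics.median(ordered),
        "p95_duration_ms": pct(0.95),
        "p99_duration_ms": pct(0.99),
        "min_duration_ms": ordered[0],
        "max_duration_ms": ordered[-1],
        "std_dev_ms": statistics.pstdev(ordered),
    }
```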
#### B. Anomaly Detector

```python
@dataclass
class Anomaly:
    """Detected anomaly."""
    anomaly_id: str
    workflow_id: str
    metric_name: str
    detected_at: datetime
    severity: float  # 0.0 to 1.0
    deviation: float
    baseline_value: float
    actual_value: float
    description: str
    recommended_action: Optional[str] = None


class AnomalyDetector:
    """Detects anomalies in workflow execution."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        sensitivity: float = 2.0  # Standard deviations
    ):
        self.store = time_series_store
        self.sensitivity = sensitivity
        self.baselines: Dict[str, Dict] = {}

    def detect_anomalies(
        self,
        workflow_id: str,
        metrics: List[Dict]
    ) -> List[Anomaly]:
        """Detect anomalies in metrics."""

    def update_baseline(
        self,
        workflow_id: str,
        stable_period_days: int = 7
    ) -> None:
        """Update baseline from stable period."""

    def correlate_anomalies(
        self,
        anomalies: List[Anomaly],
        time_window_minutes: int = 30
    ) -> List[List[Anomaly]]:
        """Correlate related anomalies."""
```

#### C. Insight Generator

```python
@dataclass
class Insight:
    """Generated insight."""
    insight_id: str
    workflow_id: str
    category: str  # 'performance', 'reliability', 'resource', 'best_practice'
    title: str
    description: str
    recommendation: str
    expected_impact: str
    ease_of_implementation: str  # 'easy', 'medium', 'hard'
    priority_score: float
    supporting_data: Dict[str, Any]
    created_at: datetime


class InsightGenerator:
    """Generates automated insights."""

    def __init__(
        self,
        performance_analyzer: PerformanceAnalyzer,
        anomaly_detector: AnomalyDetector
    ):
        self.performance_analyzer = performance_analyzer
        self.anomaly_detector = anomaly_detector

    def generate_insights(
        self,
        workflow_id: str,
        analysis_period_days: int = 30
    ) -> List[Insight]:
        """Generate insights for a workflow."""

    def prioritize_insights(
        self,
        insights: List[Insight]
    ) -> List[Insight]:
        """Prioritize insights by impact and ease."""

    def track_insight_implementation(
        self,
        insight_id: str,
        implemented: bool,
        actual_impact: Optional[Dict] = None
    ) -> None:
        """Track insight implementation and measure impact."""
```

### 4. Query Engine (`core/analytics/query/`)

```python
class QueryEngine:
    """Query engine for analytics data."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        archive_storage: ArchiveStorage,
        cache_size: int = 100
    ):
        self.ts_store = time_series_store
        self.archive = archive_storage
        self.cache = LRUCache(cache_size)

    def query(
        self,
        query: Dict[str, Any],
        use_cache: bool = True
    ) -> List[Dict]:
        """Execute a query against analytics data."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,
        group_by: List[str],
        filters: Dict[str, Any],
        time_range: Tuple[datetime, datetime]
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""

    def compare(
        self,
        workflow_ids: List[str],
        metrics: List[str],
        time_range: Tuple[datetime, datetime]
    ) -> Dict[str, Dict]:
        """Compare metrics across workflows."""
```
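`LRUCache` is referenced by `QueryEngine` but not defined in this document. Below is a minimal sketch of one possible shape, including a deterministic cache key derived from the query dict; the `key_for` helper and the canonical-JSON keying scheme are illustrative assumptions, not part of the design:

```python
import hashlib
import json
from collections import OrderedDict


class LRUCache:
    """Illustrative LRU cache keyed by canonicalized query dicts."""

    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self._entries: "OrderedDict[str, list]" = OrderedDict()

    @staticmethod
    def key_for(query: dict) -> str:
        # Canonical JSON (sorted keys, default=str for datetimes) gives a
        # stable key for logically identical queries.
        canonical = json.dumps(query, sort_keys=True, default=str)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, query: dict):
        key = self.key_for(query)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)        # mark as most recently used
        return self._entries[key]

    def put(self, query: dict, result: list) -> None:
        key = self.key_for(query)
        self._entries[key] = result
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
```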
### 5. Real-time Analytics (`core/analytics/realtime/`)

```python
class RealtimeAnalytics:
    """Real-time analytics for active workflows."""

    def __init__(self, metrics_collector: MetricsCollector):
        self.collector = metrics_collector
        self.active_executions: Dict[str, ExecutionMetrics] = {}
        self.subscribers: Dict[str, List[Callable]] = {}

    def track_execution(self, execution_id: str, workflow_id: str) -> None:
        """Start tracking an execution in real-time."""

    def update_progress(
        self,
        execution_id: str,
        current_step: int,
        total_steps: int
    ) -> None:
        """Update execution progress."""

    def get_live_metrics(self, execution_id: str) -> Dict[str, Any]:
        """Get live metrics for an execution."""

    def subscribe(
        self,
        execution_id: str,
        callback: Callable[[Dict], None]
    ) -> None:
        """Subscribe to real-time updates."""
```

## Data Models

### Metrics Schema

```sql
-- Execution metrics table
CREATE TABLE execution_metrics (
    execution_id TEXT PRIMARY KEY,
    workflow_id TEXT NOT NULL,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    duration_ms REAL,
    status TEXT NOT NULL,
    steps_total INTEGER,
    steps_completed INTEGER,
    steps_failed INTEGER,
    error_message TEXT,
    context JSON
);

CREATE INDEX idx_workflow_time ON execution_metrics(workflow_id, started_at);
CREATE INDEX idx_status ON execution_metrics(status);

-- Step metrics table
CREATE TABLE step_metrics (
    step_id TEXT PRIMARY KEY,
    execution_id TEXT NOT NULL,
    workflow_id TEXT NOT NULL,
    node_id TEXT NOT NULL,
    action_type TEXT NOT NULL,
    target_element TEXT,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP NOT NULL,
    duration_ms REAL NOT NULL,
    status TEXT NOT NULL,
    confidence_score REAL,
    retry_count INTEGER DEFAULT 0,
    error_details TEXT,
    FOREIGN KEY (execution_id) REFERENCES execution_metrics(execution_id)
);

CREATE INDEX idx_execution ON step_metrics(execution_id);
CREATE INDEX idx_workflow_action ON step_metrics(workflow_id, action_type);

-- Resource metrics table
CREATE TABLE resource_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TIMESTAMP NOT NULL,
    workflow_id TEXT,
    execution_id TEXT,
    cpu_percent REAL NOT NULL,
    memory_mb REAL NOT NULL,
    gpu_utilization REAL,
    gpu_memory_mb REAL,
    disk_io_mb REAL
);

CREATE INDEX idx_resource_time ON resource_metrics(timestamp);
CREATE INDEX idx_resource_workflow ON resource_metrics(workflow_id, timestamp);
```
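To make the mapping from the query interfaces onto this schema concrete, here is a small sketch of a daily success-rate aggregation over `execution_metrics`; the function name and the per-day grouping are illustrative, not part of the schema:

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def success_rate_by_day(db_path: Path, workflow_id: str,
                        start: datetime, end: datetime) -> list:
    """Daily success rate (%) for one workflow, computed from execution_metrics."""
    sql = """
        SELECT date(started_at) AS day,
               COUNT(*) AS total,
               SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS successful,
               100.0 * SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END)
                     / COUNT(*) AS success_rate
        FROM execution_metrics
        WHERE workflow_id = ? AND started_at BETWEEN ? AND ?
        GROUP BY date(started_at)
        ORDER BY day
    """
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        rows = conn.execute(sql, (workflow_id, start.isoformat(), end.isoformat()))
        return [dict(row) for row in rows]
```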
## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system; essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Metrics completeness

*For any* workflow execution, all required metrics (execution_id, workflow_id, timestamps, duration) SHALL be recorded.

**Validates: Requirements 1.1, 1.4**

### Property 2: Step metrics integrity

*For any* completed step, the step metrics SHALL include action_type, target_element, and execution_result.

**Validates: Requirements 1.2**

### Property 3: Failure recording completeness

*For any* failed execution, the failure reason, error details, and context SHALL be recorded.

**Validates: Requirements 1.3**

### Property 4: Async persistence guarantee

*For any* buffered metrics, they SHALL eventually be persisted to storage within the flush interval.

**Validates: Requirements 1.5**

### Property 5: Statistical accuracy

*For any* dataset of execution times, the calculated average, median, p95, and p99 SHALL match standard statistical definitions.

**Validates: Requirements 2.1**

### Property 6: Bottleneck identification correctness

*For any* workflow, the identified bottleneck steps SHALL be the steps with the highest execution times.

**Validates: Requirements 2.3**

### Property 7: Performance degradation detection

*For any* workflow where execution time increases above the configured threshold, an alert SHALL be generated.

**Validates: Requirements 2.4**

### Property 8: Success rate calculation accuracy

*For any* set of executions, the success rate SHALL equal (successful_count / total_count) * 100.

**Validates: Requirements 3.1**

### Property 9: Failure categorization completeness

*For any* set of failures, all failures SHALL be assigned to a category.

**Validates: Requirements 3.2**

### Property 10: Anomaly detection sensitivity

*For any* metric value that deviates from baseline by more than the sensitivity threshold, an anomaly SHALL be detected.

**Validates: Requirements 4.1**

### Property 11: Severity score validity

*For any* detected anomaly, the severity score SHALL be between 0.0 and 1.0.

**Validates: Requirements 4.2**

### Property 12: Resource tracking completeness

*For any* workflow execution, CPU, memory, and GPU metrics SHALL be tracked.

**Validates: Requirements 5.1**

### Property 13: Insight generation consistency

*For any* workflow with performance issues, at least one actionable insight SHALL be generated.

**Validates: Requirements 6.1**

### Property 14: Insight prioritization correctness

*For any* set of insights, they SHALL be ordered by priority_score in descending order.

**Validates: Requirements 6.4**

### Property 15: Filter application correctness

*For any* query with filters, only records matching all filter criteria SHALL be returned.

**Validates: Requirements 7.1**

### Property 16: Export format validity

*For any* report export, the output SHALL be valid according to the target format specification (PDF, CSV, JSON).

**Validates: Requirements 7.3**

### Property 17: Comparison calculation accuracy

*For any* two workflows being compared, the difference calculations SHALL be mathematically correct.

**Validates: Requirements 8.1**

### Property 18: Real-time latency guarantee

*For any* real-time metric request, the response SHALL be delivered within 1 second.

**Validates: Requirements 9.1**

### Property 19: Retention policy enforcement

*For any* data older than its retention period, it SHALL be archived or deleted according to policy.

**Validates: Requirements 10.2**

### Property 20: Archive data integrity

*For any* archived data, it SHALL be retrievable and match the original data when decompressed.

**Validates: Requirements 10.3**

## Integration Points

### With Execution Loop

- Hook into execution start/complete events
- Collect step-level metrics during execution
- Minimal performance impact (<1% overhead)

### With Self-Healing System

- Integrate recovery metrics
- Track recovery success rates
- Correlate failures with recovery attempts

### With Dashboard

- Provide REST API for metrics
- WebSocket for real-time updates
- Export endpoints for reports

## Performance Considerations

### Optimization Strategies

1. **Async Collection**: Buffer metrics and persist asynchronously
2. **Query Caching**: Cache frequently accessed aggregations
3. **Index Optimization**: Strategic indexes on time-series data
4. **Data Partitioning**: Partition by time for efficient queries
5. **Archive Strategy**: Move old data to compressed archive (sketched below)
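A minimal sketch of the archive step from strategy 5, assuming gzip-compressed JSON Lines files named by date; the layout and function names are illustrative, and `ArchiveStorage` may use any format that preserves Property 20:

```python
import gzip
import json
from datetime import datetime
from pathlib import Path


def archive_records(records: list, archive_dir: Path, archive_date: datetime) -> Path:
    """Write one day's records as gzip-compressed JSON Lines and return the file path."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    path = archive_dir / f"metrics-{archive_date:%Y-%m-%d}.jsonl.gz"
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, default=str) + "\n")
    return path


def read_archive(path: Path) -> list:
    """Read archived records back; used to verify they remain retrievable (Property 20)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh]
```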
### Scalability Targets

- Handle 1000+ workflow executions per hour
- Support 10,000+ steps per hour
- Real-time queries < 1 second
- Historical queries < 5 seconds
- Storage growth < 1GB per month

## Testing Strategy

### Property-Based Testing

Use Hypothesis to test correctness properties (see the sketch at the end of this document):

- Generate random execution data
- Verify statistical calculations
- Test anomaly detection with synthetic data
- Validate query filters and aggregations

### Integration Testing

- End-to-end metric collection and analysis
- Real-time analytics with simulated workflows
- Archive and retention policy testing
- Dashboard integration testing

### Performance Testing

- Load testing with high metric volume
- Query performance benchmarking
- Real-time latency testing
- Storage growth monitoring
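For example, Property 8 translates almost verbatim into a Hypothesis test. `calculate_success_rate` below is a local stand-in for whichever analytics function ends up owning that calculation:

```python
from hypothesis import given, strategies as st


def calculate_success_rate(statuses: list) -> float:
    """Stand-in for the analytics implementation of Property 8."""
    successful = sum(1 for s in statuses if s == 'completed')
    return (successful / len(statuses)) * 100.0


@given(st.lists(st.sampled_from(['completed', 'failed', 'running']), min_size=1))
def test_success_rate_matches_definition(statuses):
    # Property 8: success rate == (successful_count / total_count) * 100
    expected = statuses.count('completed') / len(statuses) * 100.0
    assert calculate_success_rate(statuses) == expected
```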