feat: E2E replay working - 25/25 actions, 0 retries, SomEngine via server

Validated on a Windows PC (DESKTOP-58D5CAC, 2560x1600):
- 8 clicks resolved visually (1 anchor_template, 1 som_text_match, 6 som_vlm)
- Average score 0.75, average time 1.6s
- Text typed correctly (bonjour, test word, date, email)
- 0 retries, 2 unverified actions (OK)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

# Design Document: RPA Analytics & Insights
## Overview
This design document describes the architecture and implementation of a comprehensive analytics and insights system for RPA Vision V3. The system collects execution metrics, performs real-time and historical analysis, detects anomalies, generates automated insights, and provides customizable dashboards and reports.
The analytics system is designed to be:
- **Non-intrusive**: Minimal impact on workflow execution performance
- **Scalable**: Handle high-volume metric collection and analysis
- **Real-time**: Provide sub-second latency for live monitoring
- **Intelligent**: Automatic anomaly detection and insight generation
- **Flexible**: Customizable dashboards, reports, and alerts
## Architecture
```mermaid
graph TB
    subgraph "Data Collection"
        EC[Execution Collector]
        MC[Metrics Collector]
        RC[Resource Collector]
        Buffer[Async Buffer]
    end
    subgraph "Storage Layer"
        TS[Time Series DB]
        MS[Metrics Store]
        AS[Archive Storage]
    end
    subgraph "Analytics Engine"
        PA[Performance Analyzer]
        AA[Anomaly Detector]
        IA[Insight Generator]
        CA[Comparative Analyzer]
    end
    subgraph "Query & Aggregation"
        QE[Query Engine]
        AG[Aggregator]
        Cache[Query Cache]
    end
    subgraph "Presentation"
        API[Analytics API]
        RT[Real-time Stream]
        RG[Report Generator]
        DM[Dashboard Manager]
    end

    EC --> Buffer
    MC --> Buffer
    RC --> Buffer
    Buffer --> TS
    Buffer --> MS
    TS --> QE
    MS --> QE
    QE --> AG
    AG --> Cache
    QE --> PA
    QE --> AA
    QE --> IA
    QE --> CA
    PA --> API
    AA --> API
    IA --> API
    CA --> API
    API --> RT
    API --> RG
    API --> DM
    MS --> AS
```
## Components and Interfaces
### 1. Metrics Collection (`core/analytics/collection/`)
#### A. Execution Collector
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Union
import threading


@dataclass
class ExecutionMetrics:
    """Metrics for a workflow execution."""
    execution_id: str
    workflow_id: str
    started_at: datetime
    completed_at: Optional[datetime]
    duration_ms: Optional[float]
    status: str  # 'running', 'completed', 'failed'
    steps_total: int
    steps_completed: int
    steps_failed: int
    error_message: Optional[str] = None
    context: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StepMetrics:
    """Metrics for a workflow step."""
    step_id: str
    execution_id: str
    workflow_id: str
    node_id: str
    action_type: str
    target_element: str
    started_at: datetime
    completed_at: datetime
    duration_ms: float
    status: str
    confidence_score: float
    retry_count: int = 0
    error_details: Optional[str] = None


class MetricsCollector:
    """Collects metrics from workflow executions."""

    def __init__(self, buffer_size: int = 1000, flush_interval_sec: float = 5.0):
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval_sec
        self._buffer: List[Union[ExecutionMetrics, StepMetrics]] = []
        self._lock = threading.Lock()
        self._flush_thread: Optional[threading.Thread] = None

    def record_execution_start(self, execution_id: str, workflow_id: str) -> None:
        """Record the start of a workflow execution."""

    def record_execution_complete(
        self,
        execution_id: str,
        status: str,
        error_message: Optional[str] = None
    ) -> None:
        """Record the completion of a workflow execution."""

    def record_step(self, step_metrics: StepMetrics) -> None:
        """Record metrics for a completed step."""

    def flush(self) -> None:
        """Flush buffered metrics to storage."""
```
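The buffer-and-flush mechanism is what keeps collection non-intrusive, so it is worth pinning down. Below is a minimal sketch of how it could work, with `write_fn` standing in for the TimeSeriesStore write path; the execution thread only ever pays for an append under a lock, while batched writes happen in the background:

```python
import threading
from typing import Callable, List


class BufferedFlusher:
    """Drains the buffer every flush_interval seconds, or as soon as it
    reaches buffer_size, whichever comes first."""

    def __init__(self, write_fn: Callable[[List], None],
                 buffer_size: int = 1000, flush_interval_sec: float = 5.0):
        self._write_fn = write_fn
        self._buffer: List = []
        self._buffer_size = buffer_size
        self._interval = flush_interval_sec
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def record(self, item) -> None:
        with self._lock:
            self._buffer.append(item)
            full = len(self._buffer) >= self._buffer_size
        if full:
            self.flush()

    def flush(self) -> None:
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._write_fn(batch)  # one batched write keeps overhead low

    def _run(self) -> None:
        # Event.wait doubles as the periodic timer and the stop signal
        while not self._stop.wait(self._interval):
            self.flush()

    def close(self) -> None:
        self._stop.set()
        self._thread.join()
        self.flush()  # drain anything recorded after the last tick
```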
#### B. Resource Collector
```python
@dataclass
class ResourceMetrics:
    """System resource usage metrics."""
    timestamp: datetime
    workflow_id: Optional[str]
    execution_id: Optional[str]
    cpu_percent: float
    memory_mb: float
    gpu_utilization: float
    gpu_memory_mb: float
    disk_io_mb: float


class ResourceCollector:
    """Collects system resource usage metrics."""

    def __init__(self, sample_interval_sec: float = 1.0):
        self.sample_interval = sample_interval_sec
        self._running = False
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        """Start collecting resource metrics."""

    def stop(self) -> None:
        """Stop collecting resource metrics."""

    def get_current_metrics(self) -> ResourceMetrics:
        """Get current resource usage."""
```
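`get_current_metrics` can be built almost entirely on `psutil`; GPU figures would come from a vendor library such as pynvml, which is stubbed out in this sketch:

```python
from datetime import datetime

import psutil  # assumed dependency for CPU/memory/disk sampling


def get_current_metrics() -> "ResourceMetrics":
    """Sample current system resource usage. GPU metrics are placeholders;
    a real implementation would query pynvml or similar."""
    mem = psutil.virtual_memory()
    io = psutil.disk_io_counters()  # may be None on some platforms
    disk_mb = (io.read_bytes + io.write_bytes) / (1024 * 1024) if io else 0.0
    return ResourceMetrics(
        timestamp=datetime.now(),
        workflow_id=None,
        execution_id=None,
        cpu_percent=psutil.cpu_percent(interval=None),
        memory_mb=mem.used / (1024 * 1024),
        gpu_utilization=0.0,  # placeholder: no GPU probe in this sketch
        gpu_memory_mb=0.0,
        disk_io_mb=disk_mb,
    )
```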
### 2. Storage Layer (`core/analytics/storage/`)
#### A. Time Series Store
```python
from pathlib import Path


class TimeSeriesStore:
    """Store for time-series metrics data."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        # Use SQLite with time-series optimizations
        self.db_path = storage_path / 'timeseries.db'

    def write_metrics(self, metrics: List[Union[ExecutionMetrics, StepMetrics]]) -> None:
        """Write metrics to time-series storage."""

    def query_range(
        self,
        start_time: datetime,
        end_time: datetime,
        workflow_id: Optional[str] = None,
        metric_types: Optional[List[str]] = None
    ) -> List[Dict]:
        """Query metrics within a time range."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,  # 'avg', 'sum', 'count', 'min', 'max'
        group_by: List[str],
        start_time: datetime,
        end_time: datetime,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""
```
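The `aggregate` method maps naturally onto SQL. A sketch of the query construction, assuming column names in `group_by` and `filters` are validated against the schema upstream (values go through placeholders, so only identifiers need that trust):

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

_ALLOWED = {'avg': 'AVG', 'sum': 'SUM', 'count': 'COUNT',
            'min': 'MIN', 'max': 'MAX'}


def build_aggregate_query(metric: str, aggregation: str, group_by: List[str],
                          start_time: datetime, end_time: datetime,
                          filters: Optional[Dict] = None) -> Tuple[str, list]:
    """Translate aggregate() arguments into parameterized SQL over the
    step_metrics table. Assumes a non-empty, pre-validated group_by."""
    func = _ALLOWED[aggregation]  # reject unknown aggregations early
    cols = ', '.join(group_by)
    sql = (f"SELECT {cols}, {func}({metric}) AS value "
           f"FROM step_metrics WHERE started_at BETWEEN ? AND ?")
    params: list = [start_time.isoformat(), end_time.isoformat()]
    for key, val in (filters or {}).items():
        sql += f" AND {key} = ?"
        params.append(val)
    sql += f" GROUP BY {cols}"
    return sql, params
```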
#### B. Archive Storage
```python
class ArchiveStorage:
    """Archive storage for old metrics."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        self.archive_path = storage_path / 'archive'

    def archive_data(
        self,
        data: List[Dict],
        archive_date: datetime
    ) -> str:
        """Archive data with compression."""

    def query_archive(
        self,
        start_date: datetime,
        end_date: datetime,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Query archived data."""

    def apply_retention_policy(
        self,
        policy: Dict[str, int]  # metric_type -> retention_days
    ) -> int:
        """Apply retention policy and return number of records deleted."""
```
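For `archive_data`, compressed JSON lines keep the archive both cheap and queryable with standard tooling. A minimal sketch under that assumption, writing one file per archive date:

```python
import gzip
import json
from datetime import datetime
from pathlib import Path
from typing import Dict, List


def archive_data(archive_path: Path, data: List[Dict],
                 archive_date: datetime) -> str:
    """Append records to a compressed JSON-lines file for the given date
    and return its path; query_archive would gunzip and filter these."""
    archive_path.mkdir(parents=True, exist_ok=True)
    out = archive_path / f"metrics-{archive_date:%Y-%m-%d}.jsonl.gz"
    with gzip.open(out, 'at', encoding='utf-8') as fh:
        for record in data:
            # default=str serializes datetime fields as ISO-like strings
            fh.write(json.dumps(record, default=str) + '\n')
    return str(out)
```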
### 3. Analytics Engine (`core/analytics/engine/`)
#### A. Performance Analyzer
```python
from datetime import timedelta


@dataclass
class PerformanceStats:
    """Performance statistics."""
    workflow_id: str
    time_period: str
    execution_count: int
    avg_duration_ms: float
    median_duration_ms: float
    p95_duration_ms: float
    p99_duration_ms: float
    min_duration_ms: float
    max_duration_ms: float
    std_dev_ms: float
    slowest_steps: List[Dict]


class PerformanceAnalyzer:
    """Analyzes workflow performance."""

    def __init__(self, time_series_store: TimeSeriesStore):
        self.store = time_series_store

    def analyze_workflow(
        self,
        workflow_id: str,
        start_time: datetime,
        end_time: datetime
    ) -> PerformanceStats:
        """Analyze performance for a workflow."""

    def identify_bottlenecks(
        self,
        workflow_id: str,
        threshold_percentile: float = 0.95
    ) -> List[Dict]:
        """Identify bottleneck steps in a workflow."""

    def detect_performance_degradation(
        self,
        workflow_id: str,
        baseline_period: timedelta,
        current_period: timedelta,
        threshold_percent: float = 20.0
    ) -> Optional[Dict]:
        """Detect performance degradation compared to baseline."""
```
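The distribution fields of `PerformanceStats` need nothing beyond the standard library. A sketch of the statistical core, assuming at least two duration samples:

```python
import statistics
from typing import Dict, List


def duration_stats(durations_ms: List[float]) -> Dict[str, float]:
    """Compute the PerformanceStats distribution fields from raw durations.
    Requires at least two samples for quantiles() and stdev()."""
    # quantiles(..., n=100) returns 99 cut points: index 94 is p95, 98 is p99
    cuts = statistics.quantiles(durations_ms, n=100)
    return {
        'avg_duration_ms': statistics.fmean(durations_ms),
        'median_duration_ms': statistics.median(durations_ms),
        'p95_duration_ms': cuts[94],
        'p99_duration_ms': cuts[98],
        'min_duration_ms': min(durations_ms),
        'max_duration_ms': max(durations_ms),
        'std_dev_ms': statistics.stdev(durations_ms),
    }
```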
#### B. Anomaly Detector
```python
@dataclass
class Anomaly:
    """Detected anomaly."""
    anomaly_id: str
    workflow_id: str
    metric_name: str
    detected_at: datetime
    severity: float  # 0.0 to 1.0
    deviation: float
    baseline_value: float
    actual_value: float
    description: str
    recommended_action: Optional[str] = None


class AnomalyDetector:
    """Detects anomalies in workflow execution."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        sensitivity: float = 2.0  # Standard deviations
    ):
        self.store = time_series_store
        self.sensitivity = sensitivity
        self.baselines: Dict[str, Dict] = {}

    def detect_anomalies(
        self,
        workflow_id: str,
        metrics: List[Dict]
    ) -> List[Anomaly]:
        """Detect anomalies in metrics."""

    def update_baseline(
        self,
        workflow_id: str,
        stable_period_days: int = 7
    ) -> None:
        """Update baseline from stable period."""

    def correlate_anomalies(
        self,
        anomalies: List[Anomaly],
        time_window_minutes: int = 30
    ) -> List[List[Anomaly]]:
        """Correlate related anomalies."""
```
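With `sensitivity` expressed in standard deviations, the core of `detect_anomalies` reduces to a z-score test against the stored baseline. A sketch, assuming each baseline records a per-metric mean and standard deviation; the severity mapping (saturating at twice the threshold) is one possible choice that keeps scores in the 0.0-1.0 range required by Property 11:

```python
from typing import Tuple


def z_score_anomaly(value: float, baseline_mean: float, baseline_std: float,
                    sensitivity: float = 2.0) -> Tuple[bool, float]:
    """Flag a value whose deviation from the baseline mean exceeds
    `sensitivity` standard deviations; returns (is_anomaly, severity)."""
    if baseline_std <= 0:
        return False, 0.0  # degenerate baseline: nothing to compare against
    deviation = abs(value - baseline_mean) / baseline_std
    if deviation <= sensitivity:
        return False, 0.0
    # Map excess deviation onto 0.0-1.0, saturating at 2x the threshold
    severity = min(1.0, (deviation - sensitivity) / sensitivity)
    return True, severity
```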
#### C. Insight Generator
```python
@dataclass
class Insight:
    """Generated insight."""
    insight_id: str
    workflow_id: str
    category: str  # 'performance', 'reliability', 'resource', 'best_practice'
    title: str
    description: str
    recommendation: str
    expected_impact: str
    ease_of_implementation: str  # 'easy', 'medium', 'hard'
    priority_score: float
    supporting_data: Dict[str, Any]
    created_at: datetime


class InsightGenerator:
    """Generates automated insights."""

    def __init__(
        self,
        performance_analyzer: PerformanceAnalyzer,
        anomaly_detector: AnomalyDetector
    ):
        self.performance_analyzer = performance_analyzer
        self.anomaly_detector = anomaly_detector

    def generate_insights(
        self,
        workflow_id: str,
        analysis_period_days: int = 30
    ) -> List[Insight]:
        """Generate insights for a workflow."""

    def prioritize_insights(
        self,
        insights: List[Insight]
    ) -> List[Insight]:
        """Prioritize insights by impact and ease."""

    def track_insight_implementation(
        self,
        insight_id: str,
        implemented: bool,
        actual_impact: Optional[Dict] = None
    ) -> None:
        """Track insight implementation and measure impact."""
```
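`prioritize_insights` itself is a descending sort on `priority_score` (Property 14); the interesting part is how that score is produced. A sketch, where `impact_estimate` is a hypothetical numeric estimate of expected improvement and the ease weights are illustrative:

```python
from typing import List

# Discount factors for ease_of_implementation; values are illustrative.
_EASE_WEIGHT = {'easy': 1.0, 'medium': 0.6, 'hard': 0.3}


def score_insight(impact_estimate: float, ease: str) -> float:
    """One plausible priority_score: expected impact discounted by
    implementation difficulty."""
    return impact_estimate * _EASE_WEIGHT.get(ease, 0.5)


def prioritize_insights(insights: List[Insight]) -> List[Insight]:
    """Order insights by priority_score, descending (Property 14)."""
    return sorted(insights, key=lambda i: i.priority_score, reverse=True)
```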
### 4. Query Engine (`core/analytics/query/`)
```python
from typing import Tuple


class QueryEngine:
    """Query engine for analytics data."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        archive_storage: ArchiveStorage,
        cache_size: int = 100
    ):
        self.ts_store = time_series_store
        self.archive = archive_storage
        self.cache = LRUCache(cache_size)

    def query(
        self,
        query: Dict[str, Any],
        use_cache: bool = True
    ) -> List[Dict]:
        """Execute a query against analytics data."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,
        group_by: List[str],
        filters: Dict[str, Any],
        time_range: Tuple[datetime, datetime]
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""

    def compare(
        self,
        workflow_ids: List[str],
        metrics: List[str],
        time_range: Tuple[datetime, datetime]
    ) -> Dict[str, Dict]:
        """Compare metrics across workflows."""
```
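Because queries arrive as dicts, the LRU cache needs a canonical, hashable key; encoding the query as sorted JSON makes logically equal queries share an entry. A minimal sketch of the `LRUCache` the engine instantiates:

```python
import json
from collections import OrderedDict
from typing import Any, Dict, List, Optional


class LRUCache:
    """Minimal LRU keyed on a canonical JSON encoding of the query dict."""

    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self._data: "OrderedDict[str, List[Dict]]" = OrderedDict()

    @staticmethod
    def key_for(query: Dict[str, Any]) -> str:
        # sort_keys makes logically equal queries share one cache entry
        return json.dumps(query, sort_keys=True, default=str)

    def get(self, query: Dict[str, Any]) -> Optional[List[Dict]]:
        key = self.key_for(query)
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, query: Dict[str, Any], result: List[Dict]) -> None:
        key = self.key_for(query)
        self._data[key] = result
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used
```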
### 5. Real-time Analytics (`core/analytics/realtime/`)
```python
from typing import Callable


class RealtimeAnalytics:
    """Real-time analytics for active workflows."""

    def __init__(self, metrics_collector: MetricsCollector):
        self.collector = metrics_collector
        self.active_executions: Dict[str, ExecutionMetrics] = {}
        self.subscribers: Dict[str, List[Callable]] = {}

    def track_execution(self, execution_id: str, workflow_id: str) -> None:
        """Start tracking an execution in real-time."""

    def update_progress(
        self,
        execution_id: str,
        current_step: int,
        total_steps: int
    ) -> None:
        """Update execution progress."""

    def get_live_metrics(self, execution_id: str) -> Dict[str, Any]:
        """Get live metrics for an execution."""

    def subscribe(
        self,
        execution_id: str,
        callback: Callable[[Dict], None]
    ) -> None:
        """Subscribe to real-time updates."""
```
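The subscription side is a plain fan-out. A sketch of the pub/sub core, assuming callbacks are cheap enough to invoke inline (slow consumers would need a queue in front of them):

```python
from typing import Callable, Dict, List


class ProgressBus:
    """Fan-out of live execution updates to subscribers (Requirement 9.4)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Dict], None]]] = {}

    def subscribe(self, execution_id: str,
                  callback: Callable[[Dict], None]) -> None:
        self._subscribers.setdefault(execution_id, []).append(callback)

    def publish(self, execution_id: str, update: Dict) -> None:
        for cb in self._subscribers.get(execution_id, []):
            try:
                cb(update)
            except Exception:
                pass  # one failing subscriber must not block the others
```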
## Data Models
### Metrics Schema
```sql
-- Execution metrics table
CREATE TABLE execution_metrics (
    execution_id TEXT PRIMARY KEY,
    workflow_id TEXT NOT NULL,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    duration_ms REAL,
    status TEXT NOT NULL,
    steps_total INTEGER,
    steps_completed INTEGER,
    steps_failed INTEGER,
    error_message TEXT,
    context JSON
);

CREATE INDEX idx_workflow_time ON execution_metrics(workflow_id, started_at);
CREATE INDEX idx_status ON execution_metrics(status);

-- Step metrics table
CREATE TABLE step_metrics (
    step_id TEXT PRIMARY KEY,
    execution_id TEXT NOT NULL,
    workflow_id TEXT NOT NULL,
    node_id TEXT NOT NULL,
    action_type TEXT NOT NULL,
    target_element TEXT,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP NOT NULL,
    duration_ms REAL NOT NULL,
    status TEXT NOT NULL,
    confidence_score REAL,
    retry_count INTEGER DEFAULT 0,
    error_details TEXT,
    FOREIGN KEY (execution_id) REFERENCES execution_metrics(execution_id)
);

CREATE INDEX idx_execution ON step_metrics(execution_id);
CREATE INDEX idx_workflow_action ON step_metrics(workflow_id, action_type);

-- Resource metrics table
CREATE TABLE resource_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TIMESTAMP NOT NULL,
    workflow_id TEXT,
    execution_id TEXT,
    cpu_percent REAL NOT NULL,
    memory_mb REAL NOT NULL,
    gpu_utilization REAL,
    gpu_memory_mb REAL,
    disk_io_mb REAL
);

CREATE INDEX idx_resource_time ON resource_metrics(timestamp);
CREATE INDEX idx_resource_workflow ON resource_metrics(workflow_id, timestamp);
```
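Since the collector writes continuously while dashboards read, the SQLite store would likely run in WAL mode so readers never block the writer. A sketch of store initialization, with `SCHEMA` standing in for a hypothetical file holding the DDL above:

```python
import sqlite3
from pathlib import Path

SCHEMA = Path('schema.sql')  # hypothetical file containing the DDL above


def open_store(db_path: Path) -> sqlite3.Connection:
    """Open the time-series database with settings suited to one concurrent
    writer (the collector) and many readers (dashboards, queries)."""
    fresh = not db_path.exists()
    conn = sqlite3.connect(db_path, check_same_thread=False)
    conn.execute("PRAGMA journal_mode=WAL")    # readers don't block the writer
    conn.execute("PRAGMA synchronous=NORMAL")  # durable enough for metrics
    if fresh:
        conn.executescript(SCHEMA.read_text(encoding='utf-8'))
    return conn
```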
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Metrics completeness
*For any* workflow execution, all required metrics (execution_id, workflow_id, timestamps, duration) SHALL be recorded.
**Validates: Requirements 1.1, 1.4**
### Property 2: Step metrics integrity
*For any* completed step, the step metrics SHALL include action_type, target_element, and execution_result.
**Validates: Requirements 1.2**
### Property 3: Failure recording completeness
*For any* failed execution, the failure reason, error details, and context SHALL be recorded.
**Validates: Requirements 1.3**
### Property 4: Async persistence guarantee
*For any* buffered metrics, they SHALL eventually be persisted to storage within the flush interval.
**Validates: Requirements 1.5**
### Property 5: Statistical accuracy
*For any* dataset of execution times, the calculated average, median, p95, and p99 SHALL match standard statistical definitions.
**Validates: Requirements 2.1**
### Property 6: Bottleneck identification correctness
*For any* workflow, the identified bottleneck steps SHALL be the steps with the highest execution times.
**Validates: Requirements 2.3**
### Property 7: Performance degradation detection
*For any* workflow where execution time increases above threshold, an alert SHALL be generated.
**Validates: Requirements 2.4**
### Property 8: Success rate calculation accuracy
*For any* set of executions, the success rate SHALL equal (successful_count / total_count) * 100.
**Validates: Requirements 3.1**
### Property 9: Failure categorization completeness
*For any* set of failures, all failures SHALL be assigned to a category.
**Validates: Requirements 3.2**
### Property 10: Anomaly detection sensitivity
*For any* metric value that deviates from baseline by more than sensitivity threshold, an anomaly SHALL be detected.
**Validates: Requirements 4.1**
### Property 11: Severity score validity
*For any* detected anomaly, the severity score SHALL be between 0.0 and 1.0.
**Validates: Requirements 4.2**
### Property 12: Resource tracking completeness
*For any* workflow execution, CPU, memory, and GPU metrics SHALL be tracked.
**Validates: Requirements 5.1**
### Property 13: Insight generation consistency
*For any* workflow with performance issues, at least one actionable insight SHALL be generated.
**Validates: Requirements 6.1**
### Property 14: Insight prioritization correctness
*For any* set of insights, they SHALL be ordered by priority_score in descending order.
**Validates: Requirements 6.4**
### Property 15: Filter application correctness
*For any* query with filters, only records matching all filter criteria SHALL be returned.
**Validates: Requirements 7.1**
### Property 16: Export format validity
*For any* report export, the output SHALL be valid according to the target format specification (PDF, CSV, JSON).
**Validates: Requirements 7.3**
### Property 17: Comparison calculation accuracy
*For any* two workflows being compared, the difference calculations SHALL be mathematically correct.
**Validates: Requirements 8.1**
### Property 18: Real-time latency guarantee
*For any* real-time metric request, the response SHALL be delivered within 1 second.
**Validates: Requirements 9.1**
### Property 19: Retention policy enforcement
*For any* data older than its retention period, it SHALL be archived or deleted according to policy.
**Validates: Requirements 10.2**
### Property 20: Archive data integrity
*For any* archived data, it SHALL be retrievable and match the original data when decompressed.
**Validates: Requirements 10.3**
## Integration Points
### With Execution Loop
- Hook into execution start/complete events
- Collect step-level metrics during execution
- Minimal performance impact (<1% overhead)
### With Self-Healing System
- Integrate recovery metrics
- Track recovery success rates
- Correlate failures with recovery attempts
### With Dashboard
- Provide REST API for metrics
- WebSocket for real-time updates
- Export endpoints for reports
## Performance Considerations
### Optimization Strategies
1. **Async Collection**: Buffer metrics and persist asynchronously
2. **Query Caching**: Cache frequently accessed aggregations
3. **Index Optimization**: Strategic indexes on time-series data
4. **Data Partitioning**: Partition by time for efficient queries
5. **Archive Strategy**: Move old data to compressed archive
### Scalability Targets
- Handle 1000+ workflow executions per hour
- Support 10,000+ steps per hour
- Real-time queries < 1 second
- Historical queries < 5 seconds
- Storage growth < 1GB per month
## Testing Strategy
### Property-Based Testing
Use Hypothesis to test correctness properties (a sketch follows the list):
- Generate random execution data
- Verify statistical calculations
- Test anomaly detection with synthetic data
- Validate query filters and aggregations
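For example, Property 8 pins the success rate to its definition. A minimal sketch, with `compute_success_rate` as the hypothetical function under test:

```python
from hypothesis import given, strategies as st


# Property 8: success rate equals (successful_count / total_count) * 100.
@given(st.lists(st.sampled_from(['completed', 'failed']), min_size=1))
def test_success_rate_matches_definition(statuses):
    rate = compute_success_rate(statuses)  # hypothetical function under test
    expected = statuses.count('completed') / len(statuses) * 100
    assert abs(rate - expected) < 1e-9
```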
### Integration Testing
- End-to-end metric collection and analysis
- Real-time analytics with simulated workflows
- Archive and retention policy testing
- Dashboard integration testing
### Performance Testing
- Load testing with high metric volume
- Query performance benchmarking
- Real-time latency testing
- Storage growth monitoring

# Requirements Document: RPA Analytics & Insights
## Introduction
This document specifies the requirements for an advanced analytics and reporting system for RPA Vision V3. The system will collect, analyze, and visualize workflow execution data to provide actionable insights, detect anomalies, and recommend optimizations.
## Glossary
- **Analytics Engine**: Analysis engine that processes execution data
- **Metric**: Quantitative measure of one aspect of the system (e.g., success rate, execution time)
- **KPI (Key Performance Indicator)**: Key indicator of performance
- **Anomaly**: Unusual behavior detected in the data
- **Insight**: Automatically generated observation or recommendation
- **Time Series**: Time-ordered data series used for trend analysis
- **Aggregation**: Grouping of data along a dimension (workflow, time period, etc.)
- **Dashboard**: Visual panel presenting the metrics
- **Report**: Report generated automatically or on demand
- **Baseline**: Reference for normal performance, used for comparison
## Requirements
### Requirement 1: Execution Metrics Collection
**User Story:** As a system administrator, I want comprehensive execution metrics to be collected automatically, so that I can analyze workflow performance and identify issues.
#### Acceptance Criteria
1. WHEN a workflow executes, THE Analytics System SHALL record execution start time, end time, and duration
2. WHEN a workflow step completes, THE Analytics System SHALL record step-level metrics including action type, target element, and execution result
3. WHEN an execution fails, THE Analytics System SHALL record failure reason, error details, and context information
4. WHEN metrics are collected, THE Analytics System SHALL store them with workflow ID, execution ID, and timestamp for later analysis
5. WHEN system resources are constrained, THE Analytics System SHALL buffer metrics and persist them asynchronously to avoid impacting workflow execution
### Requirement 2: Performance Analytics
**User Story:** As a workflow designer, I want to see performance analytics for my workflows, so that I can identify bottlenecks and optimize execution time.
#### Acceptance Criteria
1. WHEN analyzing workflow performance, THE Analytics System SHALL calculate average, median, p95, and p99 execution times
2. WHEN comparing time periods, THE Analytics System SHALL show performance trends over time with visual indicators
3. WHEN identifying bottlenecks, THE Analytics System SHALL highlight the slowest steps in each workflow
4. WHEN performance degrades, THE Analytics System SHALL detect and alert on execution time increases above threshold
5. WHEN analyzing step performance, THE Analytics System SHALL provide breakdown by action type and target element type
### Requirement 3: Success Rate Analytics
**User Story:** As a system administrator, I want to track success rates for workflows and steps, so that I can identify reliability issues and prioritize improvements.
#### Acceptance Criteria
1. WHEN calculating success rates, THE Analytics System SHALL compute percentage of successful executions per workflow
2. WHEN analyzing failures, THE Analytics System SHALL categorize failures by type and frequency
3. WHEN comparing workflows, THE Analytics System SHALL rank workflows by reliability score
4. WHEN success rate drops, THE Analytics System SHALL generate alerts with failure analysis
5. WHEN viewing trends, THE Analytics System SHALL show success rate evolution over configurable time windows
### Requirement 4: Anomaly Detection
**User Story:** As a system administrator, I want automatic anomaly detection, so that I can be alerted to unusual behavior before it becomes a major issue.
#### Acceptance Criteria
1. WHEN execution patterns deviate from baseline, THE Analytics System SHALL detect and flag anomalies
2. WHEN anomalies are detected, THE Analytics System SHALL calculate severity score based on deviation magnitude
3. WHEN multiple anomalies occur, THE Analytics System SHALL correlate them to identify systemic issues
4. WHEN anomalies persist, THE Analytics System SHALL escalate alerts based on duration and impact
5. WHEN baselines are outdated, THE Analytics System SHALL automatically update them based on recent stable periods
### Requirement 5: Resource Usage Analytics
**User Story:** As a system administrator, I want to monitor resource usage, so that I can optimize system capacity and prevent resource exhaustion.
#### Acceptance Criteria
1. WHEN workflows execute, THE Analytics System SHALL track CPU usage, memory consumption, and GPU utilization
2. WHEN analyzing resource patterns, THE Analytics System SHALL identify peak usage periods and resource-intensive workflows
3. WHEN resources approach limits, THE Analytics System SHALL generate capacity planning recommendations
4. WHEN comparing workflows, THE Analytics System SHALL show resource efficiency metrics per workflow
5. WHEN resource usage is abnormal, THE Analytics System SHALL detect and alert on resource leaks or inefficiencies
### Requirement 6: Automated Insights Generation
**User Story:** As a workflow designer, I want automated insights and recommendations, so that I can improve my workflows without deep analysis expertise.
#### Acceptance Criteria
1. WHEN analyzing workflow data, THE Analytics System SHALL generate actionable insights automatically
2. WHEN patterns are identified, THE Analytics System SHALL recommend specific optimizations with expected impact
3. WHEN comparing similar workflows, THE Analytics System SHALL suggest best practices from high-performing workflows
4. WHEN insights are generated, THE Analytics System SHALL prioritize them by potential impact and ease of implementation
5. WHEN insights are acted upon, THE Analytics System SHALL track implementation and measure actual impact
### Requirement 7: Custom Reports and Dashboards
**User Story:** As a business analyst, I want to create custom reports and dashboards, so that I can track metrics relevant to my specific needs.
#### Acceptance Criteria
1. WHEN creating reports, THE Analytics System SHALL support filtering by workflow, time period, execution status, and custom tags
2. WHEN configuring dashboards, THE Analytics System SHALL allow selection of metrics, visualizations, and layout
3. WHEN generating reports, THE Analytics System SHALL support export to PDF, CSV, and JSON formats
4. WHEN scheduling reports, THE Analytics System SHALL support automated generation and delivery via email or webhook
5. WHEN sharing dashboards, THE Analytics System SHALL support role-based access control and public sharing links
### Requirement 8: Comparative Analysis
**User Story:** As a system administrator, I want to compare workflows and time periods, so that I can understand what changed and why performance differs.
#### Acceptance Criteria
1. WHEN comparing workflows, THE Analytics System SHALL show side-by-side metrics with difference calculations
2. WHEN comparing time periods, THE Analytics System SHALL highlight significant changes with statistical significance
3. WHEN analyzing changes, THE Analytics System SHALL correlate performance changes with system events or deployments
4. WHEN comparing versions, THE Analytics System SHALL track workflow version changes and their performance impact
5. WHEN identifying regressions, THE Analytics System SHALL automatically detect performance degradations after changes
### Requirement 9: Real-time Analytics
**User Story:** As a system operator, I want real-time analytics during workflow execution, so that I can monitor active workflows and intervene if needed.
#### Acceptance Criteria
1. WHEN workflows are executing, THE Analytics System SHALL provide real-time metrics with sub-second latency
2. WHEN monitoring active workflows, THE Analytics System SHALL show current step, progress percentage, and estimated completion time
3. WHEN issues occur, THE Analytics System SHALL provide real-time alerts with context for immediate action
4. WHEN viewing live dashboards, THE Analytics System SHALL auto-refresh metrics without manual intervention
5. WHEN system load is high, THE Analytics System SHALL prioritize real-time metrics for active workflows over historical analysis
### Requirement 10: Data Retention and Archival
**User Story:** As a compliance officer, I want configurable data retention policies, so that I can meet regulatory requirements while managing storage costs.
#### Acceptance Criteria
1. WHEN configuring retention, THE Analytics System SHALL support different retention periods for different metric types
2. WHEN data ages out, THE Analytics System SHALL automatically archive or delete data according to policy
3. WHEN archiving data, THE Analytics System SHALL compress and store data in cost-effective storage
4. WHEN accessing archived data, THE Analytics System SHALL support querying with acceptable performance
5. WHEN retention policies change, THE Analytics System SHALL apply new policies to existing data without data loss

# Implementation Plan: RPA Analytics & Insights
- [x] 1. Set up analytics module structure
- Create `core/analytics/` directory with subdirectories
- Define base interfaces and data models
- _Requirements: All_
- [ ] 2. Implement data models and storage schema
- [x] 2.1 Create ExecutionMetrics and StepMetrics dataclasses
- Define all required fields with proper types
- Add serialization methods
- _Requirements: 1.1, 1.2_
- [x] 2.2 Create ResourceMetrics dataclass
- Define resource tracking fields
- Add timestamp and context fields
- _Requirements: 5.1_
- [x] 2.3 Create database schema for time-series storage
- Define tables for execution, step, and resource metrics
- Add indexes for efficient queries
- _Requirements: 1.4_
- [ ]* 2.4 Write property test for metrics completeness
- **Property 1: Metrics completeness**
- **Validates: Requirements 1.1, 1.4**
- [ ] 3. Implement metrics collection system
- [ ] 3.1 Create MetricsCollector class
- Implement buffering mechanism
- Add async flush to storage
- Thread-safe operations
- _Requirements: 1.1, 1.2, 1.5_
- [ ] 3.2 Create ResourceCollector class
- Implement periodic sampling
- Track CPU, memory, GPU metrics
- _Requirements: 5.1_
- [ ] 3.3 Implement execution lifecycle tracking
- Hook into execution start/complete events
- Record step-level metrics
- Handle failures gracefully
- _Requirements: 1.1, 1.2, 1.3_
- [ ]* 3.4 Write property test for async persistence
- **Property 4: Async persistence guarantee**
- **Validates: Requirements 1.5**
- [ ]* 3.5 Write property test for failure recording
- **Property 3: Failure recording completeness**
- **Validates: Requirements 1.3**
- [ ] 4. Implement time-series storage
- [ ] 4.1 Create TimeSeriesStore class
- Implement SQLite-based storage
- Add write_metrics method with batching
- _Requirements: 1.4_
- [ ] 4.2 Implement query_range method
- Support time-based queries
- Add filtering by workflow_id
- _Requirements: 7.1_
- [ ] 4.3 Implement aggregate method
- Support avg, sum, count, min, max
- Add group_by functionality
- _Requirements: 2.1, 2.5_
- [ ]* 4.4 Write property test for filter correctness
- **Property 15: Filter application correctness**
- **Validates: Requirements 7.1**
- [ ] 5. Implement performance analyzer
- [x] 5.1 Create PerformanceAnalyzer class
- Calculate statistical metrics (avg, median, p95, p99)
- Generate PerformanceStats objects
- _Requirements: 2.1_
- [x] 5.2 Implement bottleneck identification
- Identify slowest steps per workflow
- Calculate percentile thresholds
- _Requirements: 2.3_
- [x] 5.3 Implement performance degradation detection
- Compare current vs baseline periods
- Calculate percentage changes
- Generate alerts when threshold exceeded
- _Requirements: 2.4_
- [ ]* 5.4 Write property test for statistical accuracy
- **Property 5: Statistical accuracy**
- **Validates: Requirements 2.1**
- [ ]* 5.5 Write property test for bottleneck identification
- **Property 6: Bottleneck identification correctness**
- **Validates: Requirements 2.3**
- [ ] 6. Implement anomaly detection
- [x] 6.1 Create AnomalyDetector class
- Implement baseline calculation
- Detect deviations using statistical methods
- _Requirements: 4.1_
- [x] 6.2 Implement severity scoring
- Calculate severity based on deviation magnitude
- Normalize scores to 0.0-1.0 range
- _Requirements: 4.2_
- [x] 6.3 Implement anomaly correlation
- Group related anomalies by time window
- Identify systemic issues
- _Requirements: 4.3_
- [x] 6.4 Implement baseline auto-update
- Detect stable periods
- Update baselines automatically
- _Requirements: 4.5_
- [ ]* 6.5 Write property test for anomaly detection
- **Property 10: Anomaly detection sensitivity**
- **Validates: Requirements 4.1**
- [ ]* 6.6 Write property test for severity scores
- **Property 11: Severity score validity**
- **Validates: Requirements 4.2**
- [ ] 7. Implement insight generator
- [x] 7.1 Create InsightGenerator class
- Analyze performance data
- Generate actionable insights
- _Requirements: 6.1_
- [x] 7.2 Implement insight prioritization
- Score insights by impact and ease
- Sort by priority_score
- _Requirements: 6.4_
- [x] 7.3 Implement best practice suggestions
- Compare similar workflows
- Extract patterns from high performers
- _Requirements: 6.3_
- [x] 7.4 Implement impact tracking
- Track insight implementations
- Measure actual vs expected impact
- _Requirements: 6.5_
- [ ]* 7.5 Write property test for insight generation
- **Property 13: Insight generation consistency**
- **Validates: Requirements 6.1**
- [ ]* 7.6 Write property test for prioritization
- **Property 14: Insight prioritization correctness**
- **Validates: Requirements 6.4**
- [ ] 8. Implement query engine
- [x] 8.1 Create QueryEngine class
- Implement query method with caching
- Support complex filters
- _Requirements: 7.1_
- [x] 8.2 Implement aggregation queries
- Support multiple aggregation functions
- Add group_by with multiple dimensions
- _Requirements: 2.1, 2.5_
- [x] 8.3 Implement comparison queries
- Compare workflows side-by-side
- Calculate differences and changes
- _Requirements: 8.1, 8.2_
- [x] 8.4 Add query caching
- Implement LRU cache
- Cache invalidation on new data
- _Requirements: Performance_
- [ ]* 8.5 Write property test for comparison accuracy
- **Property 17: Comparison calculation accuracy**
- **Validates: Requirements 8.1**
- [x] 9. Implement real-time analytics
- [x] 9.1 Create RealtimeAnalytics class
- Track active executions
- Calculate live progress
- _Requirements: 9.1, 9.2_
- [x] 9.2 Implement subscription system
- WebSocket-based updates
- Pub/sub for real-time events
- _Requirements: 9.4_
- [x] 9.3 Implement real-time alerting
- Detect issues during execution
- Send immediate notifications
- _Requirements: 9.3_
- [x] 9.4 Optimize for low latency
- In-memory tracking for active workflows
- Prioritize real-time over historical
- _Requirements: 9.1, 9.5_
- [ ]* 9.5 Write property test for real-time latency
- **Property 18: Real-time latency guarantee**
- **Validates: Requirements 9.1**
- [x] 10. Implement success rate analytics
- [x] 10.1 Create success rate calculator
- Calculate per-workflow success rates
- Support time-windowed calculations
- _Requirements: 3.1, 3.5_
- [x] 10.2 Implement failure categorization
- Categorize failures by type
- Calculate frequency per category
- _Requirements: 3.2_
- [x] 10.3 Implement reliability ranking
- Rank workflows by reliability score
- Consider success rate and stability
- _Requirements: 3.3_
- [ ]* 10.4 Write property test for success rate accuracy
- **Property 8: Success rate calculation accuracy**
- **Validates: Requirements 3.1**
- [x] 11. Implement archive and retention
- [x] 11.1 Create ArchiveStorage class
- Implement compression for old data
- Support efficient archive queries
- _Requirements: 10.2, 10.3_
- [x] 11.2 Implement retention policy engine
- Support different policies per metric type
- Automatic archival/deletion
- _Requirements: 10.1, 10.2_
- [x] 11.3 Implement policy application
- Apply policies to existing data
- Ensure no data loss on policy changes
- _Requirements: 10.5_
- [ ]* 11.4 Write property test for retention enforcement
- **Property 19: Retention policy enforcement**
- **Validates: Requirements 10.2**
- [ ]* 11.5 Write property test for archive integrity
- **Property 20: Archive data integrity**
- **Validates: Requirements 10.3**
- [x] 12. Implement report generator
- [x] 12.1 Create ReportGenerator class
- Support multiple output formats
- Template-based report generation
- _Requirements: 7.3_
- [x] 12.2 Implement PDF export
- Generate formatted PDF reports
- Include charts and tables
- _Requirements: 7.3_
- [x] 12.3 Implement CSV/JSON export
- Export raw data in structured formats
- Support large datasets
- _Requirements: 7.3_
- [x] 12.4 Implement scheduled reports
- Cron-based scheduling
- Email/webhook delivery
- _Requirements: 7.4_
- [ ]* 12.5 Write property test for export validity
- **Property 16: Export format validity**
- **Validates: Requirements 7.3**
- [x] 13. Implement dashboard manager
- [x] 13.1 Create DashboardManager class
- Store dashboard configurations
- Support custom layouts
- _Requirements: 7.2_
- [x] 13.2 Implement access control
- Role-based permissions
- Public sharing links
- _Requirements: 7.5_
- [x] 13.3 Implement dashboard templates
- Pre-built dashboard templates
- Customizable widgets
- _Requirements: 7.2_
- [x] 14. Implement analytics API
- [x] 14.1 Create REST API endpoints
- GET /analytics/metrics
- GET /analytics/performance
- GET /analytics/anomalies
- GET /analytics/insights
- _Requirements: All_
- [ ] 14.2 Implement WebSocket endpoints
- Real-time metric streaming
- Live execution monitoring
- _Requirements: 9.1, 9.4_
- [ ] 14.3 Add API documentation
- OpenAPI/Swagger specs
- Example requests/responses
- _Requirements: Documentation_
- [x] 15. Integration with execution loop
- [x] 15.1 Add metrics collection hooks
- Hook into execution start/complete
- Collect step metrics automatically
- _Requirements: 1.1, 1.2_
- [x] 15.2 Integrate with self-healing system
- Track recovery metrics
- Correlate failures with recoveries
- _Requirements: Integration_
- [x] 15.3 Add resource monitoring
- Track resources during execution
- Associate with workflow executions
- _Requirements: 5.1_
- [ ] 16. Create web dashboard integration
- [ ] 16.1 Add analytics views to dashboard
- Performance overview page
- Anomaly detection page
- Insights page
- _Requirements: 7.2_
- [ ] 16.2 Implement real-time charts
- Live execution monitoring
- Auto-refreshing metrics
- _Requirements: 9.4_
- [ ] 16.3 Add export and sharing features
- Export buttons for reports
- Share dashboard links
- _Requirements: 7.3, 7.5_
- [ ] 17. Final checkpoint - Ensure all tests pass
- Run the full test suite and confirm every test passes; ask the user if questions arise.