rpa_vision_v3/.kiro/specs/rpa-analytics/design.md
Dom a7de6a488b feat: E2E replay working: 25/25 actions, 0 retries, SomEngine via server
Validated on a Windows PC (DESKTOP-58D5CAC, 2560x1600):
- 8 clicks resolved visually (1 anchor_template, 1 som_text_match, 6 som_vlm)
- Average score 0.75, average time 1.6 s
- Text typed correctly (bonjour, test word, date, email)
- 0 retries, 2 unverified actions (OK)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:04:41 +02:00


# Design Document: RPA Analytics & Insights
## Overview
This design document describes the architecture and implementation of a comprehensive analytics and insights system for RPA Vision V3. The system collects execution metrics, performs real-time and historical analysis, detects anomalies, generates automated insights, and provides customizable dashboards and reports.
The analytics system is designed to be:
- **Non-intrusive**: Minimal impact on workflow execution performance
- **Scalable**: Handle high-volume metric collection and analysis
- **Real-time**: Provide sub-second latency for live monitoring
- **Intelligent**: Automatic anomaly detection and insight generation
- **Flexible**: Customizable dashboards, reports, and alerts
## Architecture
```mermaid
graph TB
    subgraph "Data Collection"
        EC[Execution Collector]
        MC[Metrics Collector]
        RC[Resource Collector]
        Buffer[Async Buffer]
    end
    subgraph "Storage Layer"
        TS[Time Series DB]
        MS[Metrics Store]
        AS[Archive Storage]
    end
    subgraph "Analytics Engine"
        PA[Performance Analyzer]
        AA[Anomaly Detector]
        IA[Insight Generator]
        CA[Comparative Analyzer]
    end
    subgraph "Query & Aggregation"
        QE[Query Engine]
        AG[Aggregator]
        Cache[Query Cache]
    end
    subgraph "Presentation"
        API[Analytics API]
        RT[Real-time Stream]
        RG[Report Generator]
        DM[Dashboard Manager]
    end

    EC --> Buffer
    MC --> Buffer
    RC --> Buffer
    Buffer --> TS
    Buffer --> MS
    TS --> QE
    MS --> QE
    QE --> AG
    AG --> Cache
    QE --> PA
    QE --> AA
    QE --> IA
    QE --> CA
    PA --> API
    AA --> API
    IA --> API
    CA --> API
    API --> RT
    API --> RG
    API --> DM
    MS --> AS
```
## Components and Interfaces
### 1. Metrics Collection (`core/analytics/collection/`)
#### A. Execution Collector
```python
from __future__ import annotations

import threading
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Union


@dataclass
class ExecutionMetrics:
    """Metrics for a workflow execution."""
    execution_id: str
    workflow_id: str
    started_at: datetime
    completed_at: Optional[datetime]
    duration_ms: Optional[float]
    status: str  # 'running', 'completed', 'failed'
    steps_total: int
    steps_completed: int
    steps_failed: int
    error_message: Optional[str] = None
    context: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StepMetrics:
    """Metrics for a workflow step."""
    step_id: str
    execution_id: str
    workflow_id: str
    node_id: str
    action_type: str
    target_element: str
    started_at: datetime
    completed_at: datetime
    duration_ms: float
    status: str
    confidence_score: float
    retry_count: int = 0
    error_details: Optional[str] = None


class MetricsCollector:
    """Collects metrics from workflow executions."""

    def __init__(self, buffer_size: int = 1000, flush_interval_sec: float = 5.0):
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval_sec
        self._buffer: List[Union[ExecutionMetrics, StepMetrics]] = []
        self._lock = threading.Lock()
        self._flush_thread: Optional[threading.Thread] = None

    def record_execution_start(self, execution_id: str, workflow_id: str) -> None:
        """Record the start of a workflow execution."""

    def record_execution_complete(
        self,
        execution_id: str,
        status: str,
        error_message: Optional[str] = None,
    ) -> None:
        """Record the completion of a workflow execution."""

    def record_step(self, step_metrics: StepMetrics) -> None:
        """Record metrics for a completed step."""

    def flush(self) -> None:
        """Flush buffered metrics to storage."""
```
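The buffer-and-flush path is what Property 4 later depends on. A minimal, dependency-free sketch of that behavior (class and sink names are illustrative, not part of the design):

```python
import threading
from typing import Any, Callable, List

class BufferedCollector:
    """Illustrative sketch: thread-safe buffer that flushes when full."""

    def __init__(self, sink: Callable[[List[Any]], None], buffer_size: int = 1000):
        self.sink = sink              # e.g. TimeSeriesStore.write_metrics
        self.buffer_size = buffer_size
        self._buffer: List[Any] = []
        self._lock = threading.Lock()

    def record(self, metric: Any) -> None:
        """Append a metric; flush synchronously once the buffer is full."""
        with self._lock:
            self._buffer.append(metric)
            if len(self._buffer) >= self.buffer_size:
                self._flush_locked()

    def flush(self) -> None:
        with self._lock:
            self._flush_locked()

    def _flush_locked(self) -> None:
        if self._buffer:
            self.sink(self._buffer[:])   # hand a copy to storage
            self._buffer.clear()

# Usage: metrics land in `written` once the buffer fills or flush() is called.
written: List[Any] = []
collector = BufferedCollector(written.extend, buffer_size=2)
collector.record({"step_id": "s1"})
collector.record({"step_id": "s2"})   # buffer full: triggers a flush
collector.record({"step_id": "s3"})
collector.flush()
```

The real collector would additionally run a background thread that calls `flush()` every `flush_interval_sec`, so partially filled buffers still persist within the interval.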
#### B. Resource Collector
```python
@dataclass
class ResourceMetrics:
    """System resource usage metrics."""
    timestamp: datetime
    workflow_id: Optional[str]
    execution_id: Optional[str]
    cpu_percent: float
    memory_mb: float
    gpu_utilization: float
    gpu_memory_mb: float
    disk_io_mb: float


class ResourceCollector:
    """Collects system resource usage metrics."""

    def __init__(self, sample_interval_sec: float = 1.0):
        self.sample_interval = sample_interval_sec
        self._running = False
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        """Start collecting resource metrics."""

    def stop(self) -> None:
        """Stop collecting resource metrics."""

    def get_current_metrics(self) -> ResourceMetrics:
        """Get current resource usage."""
```
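The collector's sampling loop can be sketched with a pluggable reader. In practice the reader would query something like psutil; here it is a stub so the threading pattern stays self-contained (all names illustrative):

```python
import threading
import time
from typing import Callable, List

class Sampler:
    """Illustrative sketch of the ResourceCollector sampling loop."""

    def __init__(self, read: Callable[[], dict], interval_sec: float = 1.0):
        self.read = read                      # stub here; psutil-backed in reality
        self.interval = interval_sec
        self.samples: List[dict] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop.is_set():
            self.samples.append(self.read())
            self._stop.wait(self.interval)    # interruptible sleep

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

# Usage with a stub reader and a short interval for demonstration.
sampler = Sampler(lambda: {"cpu_percent": 12.5}, interval_sec=0.01)
sampler.start()
time.sleep(0.05)
sampler.stop()
```

Using `Event.wait()` instead of `time.sleep()` makes `stop()` take effect immediately rather than after the current interval elapses.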
### 2. Storage Layer (`core/analytics/storage/`)
#### A. Time Series Store
```python
class TimeSeriesStore:
    """Store for time-series metrics data."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        # Use SQLite with time-series optimizations
        self.db_path = storage_path / 'timeseries.db'

    def write_metrics(self, metrics: List[Union[ExecutionMetrics, StepMetrics]]) -> None:
        """Write metrics to time-series storage."""

    def query_range(
        self,
        start_time: datetime,
        end_time: datetime,
        workflow_id: Optional[str] = None,
        metric_types: Optional[List[str]] = None,
    ) -> List[Dict]:
        """Query metrics within a time range."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,  # 'avg', 'sum', 'count', 'min', 'max'
        group_by: List[str],
        start_time: datetime,
        end_time: datetime,
        filters: Optional[Dict] = None,
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""
```
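Since the store sits on SQLite, `aggregate` maps naturally onto `GROUP BY`. A minimal in-memory sketch (the table is trimmed to three columns for illustration):

```python
import sqlite3

# In-memory stand-in for timeseries.db; column names follow the
# step_metrics schema defined later in this document.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE step_metrics (workflow_id TEXT, action_type TEXT, duration_ms REAL)"
)
conn.executemany(
    "INSERT INTO step_metrics VALUES (?, ?, ?)",
    [("wf1", "click", 120.0), ("wf1", "click", 80.0), ("wf1", "type", 40.0)],
)

# aggregate('duration_ms', 'avg', group_by=['action_type']) becomes:
rows = conn.execute(
    "SELECT action_type, AVG(duration_ms) FROM step_metrics "
    "WHERE workflow_id = ? GROUP BY action_type ORDER BY action_type",
    ("wf1",),
).fetchall()
# rows == [('click', 100.0), ('type', 40.0)]
```

Time-range and filter arguments would extend the `WHERE` clause the same way.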
#### B. Archive Storage
```python
class ArchiveStorage:
    """Archive storage for old metrics."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        self.archive_path = storage_path / 'archive'

    def archive_data(
        self,
        data: List[Dict],
        archive_date: datetime,
    ) -> str:
        """Archive data with compression."""

    def query_archive(
        self,
        start_date: datetime,
        end_date: datetime,
        filters: Optional[Dict] = None,
    ) -> List[Dict]:
        """Query archived data."""

    def apply_retention_policy(
        self,
        policy: Dict[str, int],  # metric_type -> retention_days
    ) -> int:
        """Apply retention policy and return the number of records deleted."""
```
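One plausible shape for the compression path, assuming gzip-compressed JSON batches (the file format and helper names are assumptions of this sketch, not fixed by the design):

```python
import gzip
import json
import os
import tempfile
from typing import Dict, List

def archive_records(records: List[Dict], path: str) -> None:
    """Write one batch of records as gzip-compressed JSON."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(records, f)

def read_archive(path: str) -> List[Dict]:
    """Decompress and parse an archived batch."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

# Roundtrip: archived data must match the original when decompressed
# (this is exactly Property 20 below).
records = [{"execution_id": "e1", "duration_ms": 1234.5}]
path = os.path.join(tempfile.mkdtemp(), "2026-03.json.gz")
archive_records(records, path)
restored = read_archive(path)
```

A columnar format such as Parquet would compress better at scale; gzip+JSON keeps the sketch stdlib-only.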
### 3. Analytics Engine (`core/analytics/engine/`)
#### A. Performance Analyzer
```python
@dataclass
class PerformanceStats:
    """Performance statistics."""
    workflow_id: str
    time_period: str
    execution_count: int
    avg_duration_ms: float
    median_duration_ms: float
    p95_duration_ms: float
    p99_duration_ms: float
    min_duration_ms: float
    max_duration_ms: float
    std_dev_ms: float
    slowest_steps: List[Dict]


class PerformanceAnalyzer:
    """Analyzes workflow performance."""

    def __init__(self, time_series_store: TimeSeriesStore):
        self.store = time_series_store

    def analyze_workflow(
        self,
        workflow_id: str,
        start_time: datetime,
        end_time: datetime,
    ) -> PerformanceStats:
        """Analyze performance for a workflow."""

    def identify_bottlenecks(
        self,
        workflow_id: str,
        threshold_percentile: float = 0.95,
    ) -> List[Dict]:
        """Identify bottleneck steps in a workflow."""

    def detect_performance_degradation(
        self,
        workflow_id: str,
        baseline_period: timedelta,
        current_period: timedelta,
        threshold_percent: float = 20.0,
    ) -> Optional[Dict]:
        """Detect performance degradation compared to baseline."""
```
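The statistics feeding `PerformanceStats` are standard. A stdlib sketch using nearest-rank percentiles (the real analyzer may prefer interpolated quantiles; this is one reasonable choice):

```python
import statistics
from typing import Dict, List

def duration_stats(durations_ms: List[float]) -> Dict[str, float]:
    """Compute the summary statistics used by PerformanceStats (sketch)."""
    s = sorted(durations_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile; simple and monotone.
        return s[min(len(s) - 1, int(p * len(s)))]

    return {
        "avg": statistics.fmean(s),
        "median": statistics.median(s),
        "p95": pct(0.95),
        "p99": pct(0.99),
        "min": s[0],
        "max": s[-1],
        "std_dev": statistics.pstdev(s),
    }

stats = duration_stats([100.0, 200.0, 300.0, 400.0])
# stats["avg"] == 250.0 and stats["median"] == 250.0
```

Property 5 below requires exactly that these values match the standard definitions, which makes this function a natural target for property-based tests.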
#### B. Anomaly Detector
```python
@dataclass
class Anomaly:
    """Detected anomaly."""
    anomaly_id: str
    workflow_id: str
    metric_name: str
    detected_at: datetime
    severity: float  # 0.0 to 1.0
    deviation: float
    baseline_value: float
    actual_value: float
    description: str
    recommended_action: Optional[str] = None


class AnomalyDetector:
    """Detects anomalies in workflow execution."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        sensitivity: float = 2.0,  # Standard deviations
    ):
        self.store = time_series_store
        self.sensitivity = sensitivity
        self.baselines: Dict[str, Dict] = {}

    def detect_anomalies(
        self,
        workflow_id: str,
        metrics: List[Dict],
    ) -> List[Anomaly]:
        """Detect anomalies in metrics."""

    def update_baseline(
        self,
        workflow_id: str,
        stable_period_days: int = 7,
    ) -> None:
        """Update baseline from stable period."""

    def correlate_anomalies(
        self,
        anomalies: List[Anomaly],
        time_window_minutes: int = 30,
    ) -> List[List[Anomaly]]:
        """Correlate related anomalies."""
```
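The detection rule implied by the `sensitivity` parameter is a z-score test: flag a value whose distance from the baseline mean exceeds `sensitivity` standard deviations. A sketch, where the severity formula is an assumption of this example, chosen only to satisfy Property 11's [0, 1] bound:

```python
import statistics
from typing import List, Optional

def detect_anomaly(
    baseline: List[float], value: float, sensitivity: float = 2.0
) -> Optional[dict]:
    """Return anomaly details if `value` deviates from the baseline mean
    by more than `sensitivity` standard deviations, else None."""
    mean = statistics.fmean(baseline)
    std = statistics.pstdev(baseline)
    if std == 0:
        return None  # a flat baseline cannot be scored this way
    z = abs(value - mean) / std
    if z <= sensitivity:
        return None
    return {
        "baseline_value": mean,
        "actual_value": value,
        "deviation": z,
        # Clip into [0, 1] so Property 11 holds by construction.
        "severity": min(1.0, (z - sensitivity) / sensitivity),
    }

baseline = [100.0, 102.0, 98.0, 101.0, 99.0]
normal = detect_anomaly(baseline, 100.5)   # within band: None
anomaly = detect_anomaly(baseline, 160.0)  # far outside: flagged
```

`update_baseline` would recompute `mean`/`std` from the stable period; more robust variants (median/MAD, EWMA) fit the same interface.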
#### C. Insight Generator
```python
@dataclass
class Insight:
    """Generated insight."""
    insight_id: str
    workflow_id: str
    category: str  # 'performance', 'reliability', 'resource', 'best_practice'
    title: str
    description: str
    recommendation: str
    expected_impact: str
    ease_of_implementation: str  # 'easy', 'medium', 'hard'
    priority_score: float
    supporting_data: Dict[str, Any]
    created_at: datetime


class InsightGenerator:
    """Generates automated insights."""

    def __init__(
        self,
        performance_analyzer: PerformanceAnalyzer,
        anomaly_detector: AnomalyDetector,
    ):
        self.performance_analyzer = performance_analyzer
        self.anomaly_detector = anomaly_detector

    def generate_insights(
        self,
        workflow_id: str,
        analysis_period_days: int = 30,
    ) -> List[Insight]:
        """Generate insights for a workflow."""

    def prioritize_insights(
        self,
        insights: List[Insight],
    ) -> List[Insight]:
        """Prioritize insights by impact and ease."""

    def track_insight_implementation(
        self,
        insight_id: str,
        implemented: bool,
        actual_impact: Optional[Dict] = None,
    ) -> None:
        """Track insight implementation and measure impact."""
```
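How `priority_score` combines impact and ease is not fixed by this design. One plausible rule, with weights that are assumptions of this example, shown only to make Property 14 concrete:

```python
from typing import List

# Assumed ease weights: easier fixes get less discount on their impact.
EASE_WEIGHT = {"easy": 1.0, "medium": 0.6, "hard": 0.3}

def priority_score(expected_impact_pct: float, ease: str) -> float:
    """Impact discounted by implementation effort (illustrative rule)."""
    return expected_impact_pct * EASE_WEIGHT[ease]

def prioritize(insights: List[dict]) -> List[dict]:
    """Order insights by priority_score, descending (Property 14)."""
    return sorted(insights, key=lambda i: i["priority_score"], reverse=True)

insights = [
    {"title": "Cache template matches", "priority_score": priority_score(10, "hard")},
    {"title": "Batch OCR calls", "priority_score": priority_score(8, "easy")},
]
ordered = prioritize(insights)
# The smaller but easy win (8.0) outranks the larger hard one (3.0).
```

Whatever rule is chosen, `track_insight_implementation` lets the system compare `actual_impact` against the prediction and recalibrate the weights over time.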
### 4. Query Engine (`core/analytics/query/`)
```python
class QueryEngine:
    """Query engine for analytics data."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        archive_storage: ArchiveStorage,
        cache_size: int = 100,
    ):
        self.ts_store = time_series_store
        self.archive = archive_storage
        self.cache = LRUCache(cache_size)

    def query(
        self,
        query: Dict[str, Any],
        use_cache: bool = True,
    ) -> List[Dict]:
        """Execute a query against analytics data."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,
        group_by: List[str],
        filters: Dict[str, Any],
        time_range: Tuple[datetime, datetime],
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""

    def compare(
        self,
        workflow_ids: List[str],
        metrics: List[str],
        time_range: Tuple[datetime, datetime],
    ) -> Dict[str, Dict]:
        """Compare metrics across workflows."""
```
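`LRUCache` is referenced above but not defined in this document. A minimal stdlib version it could be backed by:

```python
from collections import OrderedDict
from typing import Any, Optional

class LRUCache:
    """Minimal LRU cache of the kind QueryEngine assumes (sketch)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("q1", [1])
cache.put("q2", [2])
cache.get("q1")          # q1 becomes most recent
cache.put("q3", [3])     # capacity exceeded: evicts q2
```

Cache keys would be a canonical serialization of the query dict; any write through `write_metrics` should invalidate overlapping entries so cached aggregations never go stale.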
### 5. Real-time Analytics (`core/analytics/realtime/`)
```python
class RealtimeAnalytics:
    """Real-time analytics for active workflows."""

    def __init__(self, metrics_collector: MetricsCollector):
        self.collector = metrics_collector
        self.active_executions: Dict[str, ExecutionMetrics] = {}
        self.subscribers: Dict[str, List[Callable]] = {}

    def track_execution(self, execution_id: str, workflow_id: str) -> None:
        """Start tracking an execution in real-time."""

    def update_progress(
        self,
        execution_id: str,
        current_step: int,
        total_steps: int,
    ) -> None:
        """Update execution progress."""

    def get_live_metrics(self, execution_id: str) -> Dict[str, Any]:
        """Get live metrics for an execution."""

    def subscribe(
        self,
        execution_id: str,
        callback: Callable[[Dict], None],
    ) -> None:
        """Subscribe to real-time updates."""
```
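The subscribe/notify path reduces to a per-execution callback registry: `update_progress` publishes to whoever subscribed to that execution. A sketch (class name illustrative):

```python
from collections import defaultdict
from typing import Callable, Dict, List

class ProgressBus:
    """Sketch of the subscribe/notify path behind update_progress."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, execution_id: str, callback: Callable[[dict], None]) -> None:
        self._subs[execution_id].append(callback)

    def publish(self, execution_id: str, update: dict) -> None:
        # Fan out the update to every subscriber of this execution.
        for cb in self._subs.get(execution_id, []):
            cb(update)

events: List[dict] = []
bus = ProgressBus()
bus.subscribe("exec-1", events.append)
bus.publish("exec-1", {"current_step": 3, "total_steps": 25})
```

In the dashboard integration, the callback would push the update over the WebSocket channel mentioned under Integration Points; keeping callbacks fast preserves the sub-second latency target of Property 18.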
## Data Models
### Metrics Schema
```sql
-- Execution metrics table
CREATE TABLE execution_metrics (
    execution_id TEXT PRIMARY KEY,
    workflow_id TEXT NOT NULL,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    duration_ms REAL,
    status TEXT NOT NULL,
    steps_total INTEGER,
    steps_completed INTEGER,
    steps_failed INTEGER,
    error_message TEXT,
    context JSON
);
CREATE INDEX idx_workflow_time ON execution_metrics(workflow_id, started_at);
CREATE INDEX idx_status ON execution_metrics(status);

-- Step metrics table
CREATE TABLE step_metrics (
    step_id TEXT PRIMARY KEY,
    execution_id TEXT NOT NULL,
    workflow_id TEXT NOT NULL,
    node_id TEXT NOT NULL,
    action_type TEXT NOT NULL,
    target_element TEXT,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP NOT NULL,
    duration_ms REAL NOT NULL,
    status TEXT NOT NULL,
    confidence_score REAL,
    retry_count INTEGER DEFAULT 0,
    error_details TEXT,
    FOREIGN KEY (execution_id) REFERENCES execution_metrics(execution_id)
);
CREATE INDEX idx_execution ON step_metrics(execution_id);
CREATE INDEX idx_workflow_action ON step_metrics(workflow_id, action_type);

-- Resource metrics table
CREATE TABLE resource_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TIMESTAMP NOT NULL,
    workflow_id TEXT,
    execution_id TEXT,
    cpu_percent REAL NOT NULL,
    memory_mb REAL NOT NULL,
    gpu_utilization REAL,
    gpu_memory_mb REAL,
    disk_io_mb REAL
);
CREATE INDEX idx_resource_time ON resource_metrics(timestamp);
CREATE INDEX idx_resource_workflow ON resource_metrics(workflow_id, timestamp);
```
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Metrics completeness
*For any* workflow execution, all required metrics (execution_id, workflow_id, timestamps, duration) SHALL be recorded.
**Validates: Requirements 1.1, 1.4**
### Property 2: Step metrics integrity
*For any* completed step, the step metrics SHALL include action_type, target_element, and the execution result (status and confidence_score).
**Validates: Requirements 1.2**
### Property 3: Failure recording completeness
*For any* failed execution, the failure reason, error details, and context SHALL be recorded.
**Validates: Requirements 1.3**
### Property 4: Async persistence guarantee
*For any* buffered metrics, they SHALL eventually be persisted to storage within the flush interval.
**Validates: Requirements 1.5**
### Property 5: Statistical accuracy
*For any* dataset of execution times, the calculated average, median, p95, and p99 SHALL match standard statistical definitions.
**Validates: Requirements 2.1**
### Property 6: Bottleneck identification correctness
*For any* workflow, the identified bottleneck steps SHALL be the steps with the highest execution times.
**Validates: Requirements 2.3**
### Property 7: Performance degradation detection
*For any* workflow where execution time increases above threshold, an alert SHALL be generated.
**Validates: Requirements 2.4**
### Property 8: Success rate calculation accuracy
*For any* set of executions, the success rate SHALL equal (successful_count / total_count) * 100.
**Validates: Requirements 3.1**
### Property 9: Failure categorization completeness
*For any* set of failures, all failures SHALL be assigned to a category.
**Validates: Requirements 3.2**
### Property 10: Anomaly detection sensitivity
*For any* metric value that deviates from baseline by more than sensitivity threshold, an anomaly SHALL be detected.
**Validates: Requirements 4.1**
### Property 11: Severity score validity
*For any* detected anomaly, the severity score SHALL be between 0.0 and 1.0.
**Validates: Requirements 4.2**
### Property 12: Resource tracking completeness
*For any* workflow execution, CPU, memory, and GPU metrics SHALL be tracked.
**Validates: Requirements 5.1**
### Property 13: Insight generation consistency
*For any* workflow with performance issues, at least one actionable insight SHALL be generated.
**Validates: Requirements 6.1**
### Property 14: Insight prioritization correctness
*For any* set of insights, they SHALL be ordered by priority_score in descending order.
**Validates: Requirements 6.4**
### Property 15: Filter application correctness
*For any* query with filters, only records matching all filter criteria SHALL be returned.
**Validates: Requirements 7.1**
### Property 16: Export format validity
*For any* report export, the output SHALL be valid according to the target format specification (PDF, CSV, JSON).
**Validates: Requirements 7.3**
### Property 17: Comparison calculation accuracy
*For any* two workflows being compared, the difference calculations SHALL be mathematically correct.
**Validates: Requirements 8.1**
### Property 18: Real-time latency guarantee
*For any* real-time metric request, the response SHALL be delivered within 1 second.
**Validates: Requirements 9.1**
### Property 19: Retention policy enforcement
*For any* data older than its retention period, it SHALL be archived or deleted according to policy.
**Validates: Requirements 10.2**
### Property 20: Archive data integrity
*For any* archived data, it SHALL be retrievable and match the original data when decompressed.
**Validates: Requirements 10.3**
## Integration Points
### With Execution Loop
- Hook into execution start/complete events
- Collect step-level metrics during execution
- Minimal performance impact (<1% overhead)
### With Self-Healing System
- Integrate recovery metrics
- Track recovery success rates
- Correlate failures with recovery attempts
### With Dashboard
- Provide REST API for metrics
- WebSocket for real-time updates
- Export endpoints for reports
## Performance Considerations
### Optimization Strategies
1. **Async Collection**: Buffer metrics and persist asynchronously
2. **Query Caching**: Cache frequently accessed aggregations
3. **Index Optimization**: Strategic indexes on time-series data
4. **Data Partitioning**: Partition by time for efficient queries
5. **Archive Strategy**: Move old data to compressed archive
### Scalability Targets
- Handle 1000+ workflow executions per hour
- Support 10,000+ steps per hour
- Real-time queries < 1 second
- Historical queries < 5 seconds
- Storage growth < 1GB per month
## Testing Strategy
### Property-Based Testing
Use Hypothesis to test correctness properties:
- Generate random execution data
- Verify statistical calculations
- Test anomaly detection with synthetic data
- Validate query filters and aggregations
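A sketch of testing Property 8 in this style; plain `random` stands in for Hypothesis's case generation so the example has no external dependency:

```python
import random
from typing import List

# Property 8: success_rate == successful_count / total_count * 100.
def success_rate(statuses: List[str]) -> float:
    return sum(s == "completed" for s in statuses) / len(statuses) * 100

# Hypothesis would generate and shrink these cases automatically;
# here we draw random ones by hand to show the shape of the check.
random.seed(0)
for _ in range(100):
    statuses = [
        random.choice(["completed", "failed"])
        for _ in range(random.randint(1, 50))
    ]
    rate = success_rate(statuses)
    assert 0.0 <= rate <= 100.0
    assert rate == statuses.count("completed") / len(statuses) * 100
```

The same pattern applies to Properties 5, 11, 14, and 15: generate random inputs, then assert the invariant rather than a single expected value.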
### Integration Testing
- End-to-end metric collection and analysis
- Real-time analytics with simulated workflows
- Archive and retention policy testing
- Dashboard integration testing
### Performance Testing
- Load testing with high metric volume
- Query performance benchmarking
- Real-time latency testing
- Storage growth monitoring