# Design Document: RPA Analytics & Insights
## Overview

This design document describes the architecture and implementation of a comprehensive analytics and insights system for RPA Vision V3. The system collects execution metrics, performs real-time and historical analysis, detects anomalies, generates automated insights, and provides customizable dashboards and reports.

The analytics system is designed to be:

- **Non-intrusive**: Minimal impact on workflow execution performance
- **Scalable**: Handle high-volume metric collection and analysis
- **Real-time**: Provide sub-second latency for live monitoring
- **Intelligent**: Automatic anomaly detection and insight generation
- **Flexible**: Customizable dashboards, reports, and alerts
## Architecture
```mermaid
graph TB
    subgraph "Data Collection"
        EC[Execution Collector]
        MC[Metrics Collector]
        RC[Resource Collector]
        Buffer[Async Buffer]
    end

    subgraph "Storage Layer"
        TS[Time Series DB]
        MS[Metrics Store]
        AS[Archive Storage]
    end

    subgraph "Analytics Engine"
        PA[Performance Analyzer]
        AA[Anomaly Detector]
        IA[Insight Generator]
        CA[Comparative Analyzer]
    end

    subgraph "Query & Aggregation"
        QE[Query Engine]
        AG[Aggregator]
        Cache[Query Cache]
    end

    subgraph "Presentation"
        API[Analytics API]
        RT[Real-time Stream]
        RG[Report Generator]
        DM[Dashboard Manager]
    end

    EC --> Buffer
    MC --> Buffer
    RC --> Buffer
    Buffer --> TS
    Buffer --> MS

    TS --> QE
    MS --> QE
    QE --> AG
    AG --> Cache

    QE --> PA
    QE --> AA
    QE --> IA
    QE --> CA

    PA --> API
    AA --> API
    IA --> API
    CA --> API

    API --> RT
    API --> RG
    API --> DM

    MS --> AS
```
## Components and Interfaces
### 1. Metrics Collection (`core/analytics/collection/`)

#### A. Execution Collector
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional, Union
import threading


@dataclass
class ExecutionMetrics:
    """Metrics for a workflow execution."""
    execution_id: str
    workflow_id: str
    started_at: datetime
    completed_at: Optional[datetime]
    duration_ms: Optional[float]
    status: str  # 'running', 'completed', 'failed'
    steps_total: int
    steps_completed: int
    steps_failed: int
    error_message: Optional[str] = None
    context: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StepMetrics:
    """Metrics for a workflow step."""
    step_id: str
    execution_id: str
    workflow_id: str
    node_id: str
    action_type: str
    target_element: str
    started_at: datetime
    completed_at: datetime
    duration_ms: float
    status: str
    confidence_score: float
    retry_count: int = 0
    error_details: Optional[str] = None


class MetricsCollector:
    """Collects metrics from workflow executions."""

    def __init__(self, buffer_size: int = 1000, flush_interval_sec: float = 5.0):
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval_sec
        self._buffer: List[Union[ExecutionMetrics, StepMetrics]] = []
        self._lock = threading.Lock()
        self._flush_thread: Optional[threading.Thread] = None

    def record_execution_start(self, execution_id: str, workflow_id: str) -> None:
        """Record the start of a workflow execution."""

    def record_execution_complete(
        self,
        execution_id: str,
        status: str,
        error_message: Optional[str] = None
    ) -> None:
        """Record the completion of a workflow execution."""

    def record_step(self, step_metrics: StepMetrics) -> None:
        """Record metrics for a completed step."""

    def flush(self) -> None:
        """Flush buffered metrics to storage."""
```
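The buffered-collection pattern above can be sketched end to end. This is a minimal illustration, not the real `MetricsCollector`: `BufferedCollector`, `sink`, and `record` are hypothetical names, and the periodic flush thread is replaced by an explicit `flush()` call so the flow is easy to follow.

```python
import threading
from typing import Any, Callable, List

class BufferedCollector:
    """Sketch of the async-buffer pattern: records accumulate in memory and
    are flushed as a batch when the buffer fills. A background timer thread
    would call flush() on the same code path at each flush interval."""

    def __init__(self, sink: Callable[[List[Any]], None], buffer_size: int = 1000):
        self._sink = sink              # persistence callback, e.g. a store's write_metrics
        self._buffer_size = buffer_size
        self._buffer: List[Any] = []
        self._lock = threading.Lock()

    def record(self, item: Any) -> None:
        with self._lock:
            self._buffer.append(item)
            full = len(self._buffer) >= self._buffer_size
        if full:
            self.flush()

    def flush(self) -> None:
        # Swap the buffer out under the lock, then persist outside it so
        # slow storage never blocks the execution loop.
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._sink(batch)

written = []
collector = BufferedCollector(sink=written.extend, buffer_size=3)
for i in range(7):
    collector.record(i)
collector.flush()  # drain the final partial batch
```

Swapping the buffer under the lock and writing outside it is what keeps the collection path non-intrusive: the workflow thread only ever pays for an in-memory append.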

#### B. Resource Collector
```python
@dataclass
class ResourceMetrics:
    """System resource usage metrics."""
    timestamp: datetime
    workflow_id: Optional[str]
    execution_id: Optional[str]
    cpu_percent: float
    memory_mb: float
    gpu_utilization: float
    gpu_memory_mb: float
    disk_io_mb: float


class ResourceCollector:
    """Collects system resource usage metrics."""

    def __init__(self, sample_interval_sec: float = 1.0):
        self.sample_interval = sample_interval_sec
        self._running = False
        self._thread: Optional[threading.Thread] = None

    def start(self) -> None:
        """Start collecting resource metrics."""

    def stop(self) -> None:
        """Stop collecting resource metrics."""

    def get_current_metrics(self) -> ResourceMetrics:
        """Get current resource usage."""
```
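The start/stop sampling loop behind `ResourceCollector` can be sketched with an interruptible wait. The `Sampler` class and its pluggable `read_fn` are hypothetical; in the real collector `read_fn` would wrap actual probes (for example `psutil.cpu_percent()` and `psutil.virtual_memory()` for the CPU and memory fields).

```python
import threading
import time
from typing import Callable, List

class Sampler:
    """Sketch of the sampling loop: a daemon thread reads a probe on a
    fixed interval until stop() is requested."""

    def __init__(self, read_fn: Callable[[], float], interval_sec: float = 1.0):
        self._read = read_fn
        self._interval = interval_sec
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None
        self.samples: List[float] = []

    def start(self) -> None:
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            self.samples.append(self._read())
            # Event.wait doubles as an interruptible sleep: it returns True
            # immediately if stop() is called mid-interval.
            if self._stop.wait(self._interval):
                break

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

s = Sampler(read_fn=lambda: 42.0, interval_sec=0.01)
s.start()
time.sleep(0.05)
s.stop()
```

Using `Event.wait` instead of `time.sleep` is what lets `stop()` return promptly rather than waiting out a full sample interval.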

### 2. Storage Layer (`core/analytics/storage/`)

#### A. Time Series Store
```python
class TimeSeriesStore:
    """Store for time-series metrics data."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        # Use SQLite with time-series optimizations
        self.db_path = storage_path / 'timeseries.db'

    def write_metrics(self, metrics: List[Union[ExecutionMetrics, StepMetrics]]) -> None:
        """Write metrics to time-series storage."""

    def query_range(
        self,
        start_time: datetime,
        end_time: datetime,
        workflow_id: Optional[str] = None,
        metric_types: Optional[List[str]] = None
    ) -> List[Dict]:
        """Query metrics within a time range."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,  # 'avg', 'sum', 'count', 'min', 'max'
        group_by: List[str],
        start_time: datetime,
        end_time: datetime,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""
```
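Since the store is SQLite-backed, the `aggregate` call maps naturally onto a single `GROUP BY` statement. The sketch below shows that mapping against a cut-down stand-in for the `step_metrics` table (the time-range and filter clauses are omitted for brevity); note that metric and group-by identifiers cannot be bound as SQL parameters, so real code must validate them against an allow-list, as done here for the aggregation function.

```python
import sqlite3

# In-memory stand-in for the step_metrics table with only the columns the
# example needs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE step_metrics (workflow_id TEXT, action_type TEXT, duration_ms REAL)")
conn.executemany(
    "INSERT INTO step_metrics VALUES (?, ?, ?)",
    [("wf1", "click", 100.0), ("wf1", "click", 300.0), ("wf1", "type", 50.0)],
)

AGG_FUNCS = {"avg": "AVG", "sum": "SUM", "count": "COUNT", "min": "MIN", "max": "MAX"}

def aggregate(metric: str, aggregation: str, group_by: list) -> list:
    """Build and run the aggregation SQL. In real code, `metric` and
    `group_by` would also be checked against known column names."""
    func = AGG_FUNCS[aggregation]          # allow-list lookup, not string interp of user input
    cols = ", ".join(group_by)
    sql = f"SELECT {cols}, {func}({metric}) FROM step_metrics GROUP BY {cols}"
    return conn.execute(sql).fetchall()

rows = dict(aggregate("duration_ms", "avg", ["action_type"]))
```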

#### B. Archive Storage
```python
class ArchiveStorage:
    """Archive storage for old metrics."""

    def __init__(self, storage_path: Path):
        self.storage_path = storage_path
        self.archive_path = storage_path / 'archive'

    def archive_data(
        self,
        data: List[Dict],
        archive_date: datetime
    ) -> str:
        """Archive data with compression."""

    def query_archive(
        self,
        start_date: datetime,
        end_date: datetime,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """Query archived data."""

    def apply_retention_policy(
        self,
        policy: Dict[str, int]  # metric_type -> retention_days
    ) -> int:
        """Apply retention policy and return the number of records deleted."""
```
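One way to realize "archive with compression" is gzip-compressed JSON lines, one file per archive date. This is a sketch under that assumption; the file-naming scheme and helper names are illustrative, and the round trip also demonstrates the integrity requirement of Property 20 (archived data must match the original when decompressed).

```python
import gzip
import json
import tempfile
from datetime import datetime
from pathlib import Path

def archive_data(archive_dir: Path, data: list, archive_date: datetime) -> Path:
    """Write one day's records as gzip-compressed JSON lines and return the
    archive file path (the date-based layout is an assumption)."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    path = archive_dir / f"{archive_date:%Y-%m-%d}.jsonl.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record) + "\n")
    return path

def query_archive(path: Path) -> list:
    """Read every record back from one archive file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

tmp = Path(tempfile.mkdtemp())
records = [{"execution_id": "e1", "duration_ms": 120.5}]
p = archive_data(tmp, records, datetime(2025, 1, 31))
restored = query_archive(p)
```

Date-partitioned files also make `apply_retention_policy` cheap: expiring a day of data is a single file delete rather than a row-by-row scan.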

### 3. Analytics Engine (`core/analytics/engine/`)

#### A. Performance Analyzer
```python
@dataclass
class PerformanceStats:
    """Performance statistics."""
    workflow_id: str
    time_period: str
    execution_count: int
    avg_duration_ms: float
    median_duration_ms: float
    p95_duration_ms: float
    p99_duration_ms: float
    min_duration_ms: float
    max_duration_ms: float
    std_dev_ms: float
    slowest_steps: List[Dict]


class PerformanceAnalyzer:
    """Analyzes workflow performance."""

    def __init__(self, time_series_store: TimeSeriesStore):
        self.store = time_series_store

    def analyze_workflow(
        self,
        workflow_id: str,
        start_time: datetime,
        end_time: datetime
    ) -> PerformanceStats:
        """Analyze performance for a workflow."""

    def identify_bottlenecks(
        self,
        workflow_id: str,
        threshold_percentile: float = 0.95
    ) -> List[Dict]:
        """Identify bottleneck steps in a workflow."""

    def detect_performance_degradation(
        self,
        workflow_id: str,
        baseline_period: timedelta,
        current_period: timedelta,
        threshold_percent: float = 20.0
    ) -> Optional[Dict]:
        """Detect performance degradation compared to a baseline."""
```
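The duration statistics that `analyze_workflow` must produce (Property 5) can be computed with the standard library alone. A sketch, using the nearest-rank percentile convention; interpolation-based definitions (e.g. NumPy's default) give slightly different p95/p99 values, so the chosen convention should be documented wherever these numbers are surfaced.

```python
import math
import statistics
from typing import Dict, List

def summarize_durations(durations_ms: List[float]) -> Dict[str, float]:
    """Compute the PerformanceStats duration fields for one workflow."""
    values = sorted(durations_ms)
    n = len(values)

    def pct(q: float) -> float:
        # Nearest-rank percentile: smallest value with at least q% of the
        # data at or below it.
        rank = math.ceil(q / 100 * n)
        return values[max(rank - 1, 0)]

    return {
        "avg": statistics.fmean(values),
        "median": statistics.median(values),
        "p95": pct(95),
        "p99": pct(99),
        "min": values[0],
        "max": values[-1],
        "std_dev": statistics.pstdev(values),
    }

# 95 fast runs and 5 slow outliers: the outliers move avg and p99 but not
# the median, which is why the spec tracks all of them.
stats = summarize_durations([100.0] * 95 + [1000.0] * 5)
```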

#### B. Anomaly Detector
```python
@dataclass
class Anomaly:
    """Detected anomaly."""
    anomaly_id: str
    workflow_id: str
    metric_name: str
    detected_at: datetime
    severity: float  # 0.0 to 1.0
    deviation: float
    baseline_value: float
    actual_value: float
    description: str
    recommended_action: Optional[str] = None


class AnomalyDetector:
    """Detects anomalies in workflow execution."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        sensitivity: float = 2.0  # standard deviations
    ):
        self.store = time_series_store
        self.sensitivity = sensitivity
        self.baselines: Dict[str, Dict] = {}

    def detect_anomalies(
        self,
        workflow_id: str,
        metrics: List[Dict]
    ) -> List[Anomaly]:
        """Detect anomalies in metrics."""

    def update_baseline(
        self,
        workflow_id: str,
        stable_period_days: int = 7
    ) -> None:
        """Update the baseline from a stable period."""

    def correlate_anomalies(
        self,
        anomalies: List[Anomaly],
        time_window_minutes: int = 30
    ) -> List[List[Anomaly]]:
        """Correlate related anomalies."""
```
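Since `sensitivity` is expressed in standard deviations, the core check is a z-score test against the baseline. A sketch of one metric's decision; the severity formula (excess deviation squashed into [0, 1], satisfying Property 11) is an assumption, as the spec only fixes the severity range.

```python
import statistics
from typing import Dict, List, Optional

def detect_anomaly(
    baseline: List[float], value: float, sensitivity: float = 2.0
) -> Optional[Dict[str, float]]:
    """Flag `value` if its z-score against the baseline exceeds
    `sensitivity`; return the Anomaly-like fields, or None."""
    mean = statistics.fmean(baseline)
    std = statistics.pstdev(baseline)
    if std == 0:
        return None                      # flat baseline: z-score undefined
    z = abs(value - mean) / std
    if z <= sensitivity:
        return None
    # Hypothetical severity: how far past the threshold, capped at 1.0.
    severity = min(1.0, (z - sensitivity) / sensitivity)
    return {
        "baseline_value": mean,
        "actual_value": value,
        "deviation": z,
        "severity": severity,
    }

a = detect_anomaly([10.0, 12.0, 11.0, 9.0, 10.0, 12.0], 30.0)  # far outside baseline
b = detect_anomaly([10.0, 12.0, 11.0, 9.0, 10.0, 12.0], 11.0)  # within baseline
```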

#### C. Insight Generator
```python
@dataclass
class Insight:
    """Generated insight."""
    insight_id: str
    workflow_id: str
    category: str  # 'performance', 'reliability', 'resource', 'best_practice'
    title: str
    description: str
    recommendation: str
    expected_impact: str
    ease_of_implementation: str  # 'easy', 'medium', 'hard'
    priority_score: float
    supporting_data: Dict[str, Any]
    created_at: datetime


class InsightGenerator:
    """Generates automated insights."""

    def __init__(
        self,
        performance_analyzer: PerformanceAnalyzer,
        anomaly_detector: AnomalyDetector
    ):
        self.performance_analyzer = performance_analyzer
        self.anomaly_detector = anomaly_detector

    def generate_insights(
        self,
        workflow_id: str,
        analysis_period_days: int = 30
    ) -> List[Insight]:
        """Generate insights for a workflow."""

    def prioritize_insights(
        self,
        insights: List[Insight]
    ) -> List[Insight]:
        """Prioritize insights by impact and ease of implementation."""

    def track_insight_implementation(
        self,
        insight_id: str,
        implemented: bool,
        actual_impact: Optional[Dict] = None
    ) -> None:
        """Track insight implementation and measure its impact."""
```
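"Prioritize by impact and ease" can be made concrete with a weighted score. The weight tables below are hypothetical (the spec does not fix them), and `MiniInsight` stands in for the full `Insight` dataclass; what the sketch pins down is the ordering contract of Property 14: descending `priority_score`.

```python
from dataclasses import dataclass
from typing import List

# Assumed weightings: higher impact raises priority, harder implementation
# lowers it. Tune to taste; only the descending sort is mandated.
IMPACT_WEIGHT = {"low": 0.3, "medium": 0.6, "high": 1.0}
EASE_WEIGHT = {"easy": 1.0, "medium": 0.6, "hard": 0.3}

@dataclass
class MiniInsight:
    title: str
    expected_impact: str           # 'low' | 'medium' | 'high'
    ease_of_implementation: str    # 'easy' | 'medium' | 'hard'
    priority_score: float = 0.0

def prioritize(insights: List[MiniInsight]) -> List[MiniInsight]:
    """Score each insight, then order by priority_score descending."""
    for ins in insights:
        ins.priority_score = (
            IMPACT_WEIGHT[ins.expected_impact] * EASE_WEIGHT[ins.ease_of_implementation]
        )
    return sorted(insights, key=lambda i: i.priority_score, reverse=True)

ranked = prioritize([
    MiniInsight("cache selector templates", "medium", "easy"),
    MiniInsight("parallelize steps", "high", "hard"),
    MiniInsight("raise retry backoff", "high", "easy"),
])
```

A multiplicative score like this naturally surfaces "quick wins" (high impact, easy) over high-impact but costly changes, which matches the intent of `ease_of_implementation`.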

### 4. Query Engine (`core/analytics/query/`)

```python
class QueryEngine:
    """Query engine for analytics data."""

    def __init__(
        self,
        time_series_store: TimeSeriesStore,
        archive_storage: ArchiveStorage,
        cache_size: int = 100
    ):
        self.ts_store = time_series_store
        self.archive = archive_storage
        self.cache = LRUCache(cache_size)

    def query(
        self,
        query: Dict[str, Any],
        use_cache: bool = True
    ) -> List[Dict]:
        """Execute a query against analytics data."""

    def aggregate(
        self,
        metric: str,
        aggregation: str,
        group_by: List[str],
        filters: Dict[str, Any],
        time_range: Tuple[datetime, datetime]
    ) -> List[Dict]:
        """Aggregate metrics with grouping."""

    def compare(
        self,
        workflow_ids: List[str],
        metrics: List[str],
        time_range: Tuple[datetime, datetime]
    ) -> Dict[str, Dict]:
        """Compare metrics across workflows."""
```
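The `LRUCache` the engine instantiates is not defined elsewhere in this document; a minimal version fitting the `LRUCache(cache_size)` constructor can be built on `collections.OrderedDict`. The `get`/`put` interface is an assumption.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional

class LRUCache:
    """Bounded cache with least-recently-used eviction."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._data: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        if key not in self._data:
            return None
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._capacity:
            self._data.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("q1", [1])
cache.put("q2", [2])
cache.get("q1")        # refreshes q1, so q2 is now least recently used
cache.put("q3", [3])   # capacity exceeded: evicts q2
```

Since query dicts are unhashable, the cache key would in practice be a canonical serialization of the query (e.g. sorted-key JSON).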

### 5. Real-time Analytics (`core/analytics/realtime/`)

```python
class RealtimeAnalytics:
    """Real-time analytics for active workflows."""

    def __init__(self, metrics_collector: MetricsCollector):
        self.collector = metrics_collector
        self.active_executions: Dict[str, ExecutionMetrics] = {}
        self.subscribers: Dict[str, List[Callable]] = {}

    def track_execution(self, execution_id: str, workflow_id: str) -> None:
        """Start tracking an execution in real time."""

    def update_progress(
        self,
        execution_id: str,
        current_step: int,
        total_steps: int
    ) -> None:
        """Update execution progress."""

    def get_live_metrics(self, execution_id: str) -> Dict[str, Any]:
        """Get live metrics for an execution."""

    def subscribe(
        self,
        execution_id: str,
        callback: Callable[[Dict], None]
    ) -> None:
        """Subscribe to real-time updates."""
```
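The subscribe/notify path can be sketched as a small per-execution broker. `ProgressBroker` and the event payload shape are hypothetical; a WebSocket layer would simply register a callback that forwards each event to connected clients.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class ProgressBroker:
    """Callbacks registered per execution_id receive each progress update
    synchronously; a real implementation would dispatch asynchronously so a
    slow subscriber cannot stall the execution loop."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, execution_id: str, callback: Callable[[dict], None]) -> None:
        self._subscribers[execution_id].append(callback)

    def update_progress(self, execution_id: str, current_step: int, total_steps: int) -> None:
        event = {
            "execution_id": execution_id,
            "current_step": current_step,
            "total_steps": total_steps,
            "percent": 100.0 * current_step / total_steps,
        }
        for callback in self._subscribers.get(execution_id, []):
            callback(event)

events = []
broker = ProgressBroker()
broker.subscribe("exec-1", events.append)
broker.update_progress("exec-1", 2, 8)
```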

## Data Models

### Metrics Schema

```sql
-- Execution metrics table
CREATE TABLE execution_metrics (
    execution_id TEXT PRIMARY KEY,
    workflow_id TEXT NOT NULL,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP,
    duration_ms REAL,
    status TEXT NOT NULL,
    steps_total INTEGER,
    steps_completed INTEGER,
    steps_failed INTEGER,
    error_message TEXT,
    context JSON
);

CREATE INDEX idx_workflow_time ON execution_metrics(workflow_id, started_at);
CREATE INDEX idx_status ON execution_metrics(status);

-- Step metrics table
CREATE TABLE step_metrics (
    step_id TEXT PRIMARY KEY,
    execution_id TEXT NOT NULL,
    workflow_id TEXT NOT NULL,
    node_id TEXT NOT NULL,
    action_type TEXT NOT NULL,
    target_element TEXT,
    started_at TIMESTAMP NOT NULL,
    completed_at TIMESTAMP NOT NULL,
    duration_ms REAL NOT NULL,
    status TEXT NOT NULL,
    confidence_score REAL,
    retry_count INTEGER DEFAULT 0,
    error_details TEXT,
    FOREIGN KEY (execution_id) REFERENCES execution_metrics(execution_id)
);

CREATE INDEX idx_execution ON step_metrics(execution_id);
CREATE INDEX idx_workflow_action ON step_metrics(workflow_id, action_type);

-- Resource metrics table
CREATE TABLE resource_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TIMESTAMP NOT NULL,
    workflow_id TEXT,
    execution_id TEXT,
    cpu_percent REAL NOT NULL,
    memory_mb REAL NOT NULL,
    gpu_utilization REAL,
    gpu_memory_mb REAL,
    disk_io_mb REAL
);

CREATE INDEX idx_resource_time ON resource_metrics(timestamp);
CREATE INDEX idx_resource_workflow ON resource_metrics(workflow_id, timestamp);
```
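A quick sanity check of the schema against Property 8 (success rate = successful_count / total_count * 100): the rate falls out of a single grouped query. The sketch below trims `execution_metrics` to the columns the query touches and assumes `status = 'completed'` marks success; SQLite evaluates the comparison to 0/1, so `SUM` counts the matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Only the columns the success-rate query needs, for brevity.
conn.execute("""
    CREATE TABLE execution_metrics (
        execution_id TEXT PRIMARY KEY,
        workflow_id  TEXT NOT NULL,
        status       TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO execution_metrics VALUES (?, ?, ?)",
    [("e1", "wf1", "completed"), ("e2", "wf1", "failed"),
     ("e3", "wf1", "completed"), ("e4", "wf1", "completed")],
)

rates = dict(conn.execute("""
    SELECT workflow_id,
           100.0 * SUM(status = 'completed') / COUNT(*) AS success_rate
    FROM execution_metrics
    GROUP BY workflow_id
""").fetchall())
```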

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: Metrics completeness
*For any* workflow execution, all required metrics (execution_id, workflow_id, timestamps, duration) SHALL be recorded.
**Validates: Requirements 1.1, 1.4**

### Property 2: Step metrics integrity
*For any* completed step, the step metrics SHALL include action_type, target_element, and the execution result (status, duration, and confidence score).
**Validates: Requirements 1.2**

### Property 3: Failure recording completeness
*For any* failed execution, the failure reason, error details, and context SHALL be recorded.
**Validates: Requirements 1.3**

### Property 4: Async persistence guarantee
*For any* buffered metrics, they SHALL eventually be persisted to storage within the flush interval.
**Validates: Requirements 1.5**

### Property 5: Statistical accuracy
*For any* dataset of execution times, the calculated average, median, p95, and p99 SHALL match standard statistical definitions.
**Validates: Requirements 2.1**

### Property 6: Bottleneck identification correctness
*For any* workflow, the identified bottleneck steps SHALL be the steps with the highest execution times.
**Validates: Requirements 2.3**

### Property 7: Performance degradation detection
*For any* workflow whose execution time increases above the threshold, an alert SHALL be generated.
**Validates: Requirements 2.4**

### Property 8: Success rate calculation accuracy
*For any* set of executions, the success rate SHALL equal (successful_count / total_count) * 100.
**Validates: Requirements 3.1**

### Property 9: Failure categorization completeness
*For any* set of failures, all failures SHALL be assigned to a category.
**Validates: Requirements 3.2**

### Property 10: Anomaly detection sensitivity
*For any* metric value that deviates from the baseline by more than the sensitivity threshold, an anomaly SHALL be detected.
**Validates: Requirements 4.1**

### Property 11: Severity score validity
*For any* detected anomaly, the severity score SHALL be between 0.0 and 1.0.
**Validates: Requirements 4.2**

### Property 12: Resource tracking completeness
*For any* workflow execution, CPU, memory, and GPU metrics SHALL be tracked.
**Validates: Requirements 5.1**

### Property 13: Insight generation consistency
*For any* workflow with performance issues, at least one actionable insight SHALL be generated.
**Validates: Requirements 6.1**

### Property 14: Insight prioritization correctness
*For any* set of insights, they SHALL be ordered by priority_score in descending order.
**Validates: Requirements 6.4**

### Property 15: Filter application correctness
*For any* query with filters, only records matching all filter criteria SHALL be returned.
**Validates: Requirements 7.1**

### Property 16: Export format validity
*For any* report export, the output SHALL be valid according to the target format specification (PDF, CSV, JSON).
**Validates: Requirements 7.3**

### Property 17: Comparison calculation accuracy
*For any* two workflows being compared, the difference calculations SHALL be mathematically correct.
**Validates: Requirements 8.1**

### Property 18: Real-time latency guarantee
*For any* real-time metric request, the response SHALL be delivered within 1 second.
**Validates: Requirements 9.1**

### Property 19: Retention policy enforcement
*For any* record older than its retention period, the record SHALL be archived or deleted according to policy.
**Validates: Requirements 10.2**

### Property 20: Archive data integrity
*For any* archived data, it SHALL be retrievable and match the original data when decompressed.
**Validates: Requirements 10.3**
## Integration Points

### With Execution Loop
- Hook into execution start/complete events
- Collect step-level metrics during execution
- Minimal performance impact (<1% overhead)

### With Self-Healing System
- Integrate recovery metrics
- Track recovery success rates
- Correlate failures with recovery attempts

### With Dashboard
- Provide a REST API for metrics
- WebSocket for real-time updates
- Export endpoints for reports
## Performance Considerations

### Optimization Strategies

1. **Async Collection**: Buffer metrics and persist asynchronously
2. **Query Caching**: Cache frequently accessed aggregations
3. **Index Optimization**: Strategic indexes on time-series data
4. **Data Partitioning**: Partition by time for efficient queries
5. **Archive Strategy**: Move old data to compressed archives

### Scalability Targets

- Handle 1000+ workflow executions per hour
- Support 10,000+ steps per hour
- Real-time queries < 1 second
- Historical queries < 5 seconds
- Storage growth < 1 GB per month
## Testing Strategy

### Property-Based Testing
Use Hypothesis to test the correctness properties:
- Generate random execution data
- Verify statistical calculations
- Test anomaly detection with synthetic data
- Validate query filters and aggregations
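The shape of such a test, for Property 5, looks like the following. The sketch uses a stdlib `random` loop so it is self-contained; with Hypothesis, the generation loop would be replaced by a `@given(st.lists(st.floats(...), min_size=1))` decorator, and `summarize` is a hypothetical stand-in for the analyzer's statistics path.

```python
import random
import statistics

def summarize(durations):
    """Stand-in for the analyzer's average/median calculation."""
    values = sorted(durations)
    return {"avg": sum(values) / len(values), "median": statistics.median(values)}

# Property 5: results must match standard statistical definitions on any
# generated dataset, not just hand-picked examples.
random.seed(7)
for _ in range(200):
    data = [random.uniform(1.0, 5000.0) for _ in range(random.randint(1, 50))]
    result = summarize(data)
    assert abs(result["avg"] - statistics.fmean(data)) < 1e-6
    assert result["median"] == statistics.median(data)
```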

### Integration Testing
- End-to-end metric collection and analysis
- Real-time analytics with simulated workflows
- Archive and retention policy testing
- Dashboard integration testing

### Performance Testing
- Load testing with high metric volume
- Query performance benchmarking
- Real-time latency testing
- Storage growth monitoring