rpa_vision_v3/.kiro/specs/admin-monitoring/design.md
Dom a7de6a488b feat: E2E replay functional: 25/25 actions, 0 retries, SomEngine via server
Validated on a Windows PC (DESKTOP-58D5CAC, 2560x1600):
- 8 clicks resolved visually (1 anchor_template, 1 som_text_match, 6 som_vlm)
- Average score 0.75, average time 1.6 s
- Text typed correctly (bonjour, test word, date, email)
- 0 retries, 2 unverified actions (OK)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 14:04:41 +02:00


# Design Document: Admin Monitoring System
## Overview
This design document describes the architecture and implementation of a comprehensive monitoring and administration system for RPA Vision V3. The system extends the existing web dashboard with workflow chain management, trigger configuration, Prometheus metrics integration, centralized logging, and log download capabilities.
## Architecture
```mermaid
graph TB
    subgraph "Admin Dashboard"
        UI[Web Interface]
        API[Flask API]
        WS[WebSocket Handler]
    end
    subgraph "Monitoring Core"
        Logger[Centralized Logger]
        Metrics[Prometheus Metrics]
        Collector[Metrics Collector]
    end
    subgraph "Management"
        ChainMgr[Chain Manager]
        TriggerMgr[Trigger Manager]
        LogExporter[Log Exporter]
    end
    subgraph "Storage"
        ChainStore[(Chains JSON)]
        TriggerStore[(Triggers JSON)]
        LogStore[(Log Files)]
    end
    UI --> API
    UI --> WS
    API --> ChainMgr
    API --> TriggerMgr
    API --> Logger
    API --> LogExporter
    API --> Metrics
    ChainMgr --> ChainStore
    TriggerMgr --> TriggerStore
    Logger --> LogStore
    Logger --> Metrics
    Collector --> Metrics
    WS --> Collector
```
## Components and Interfaces
### 1. Centralized Logger (`core/monitoring/logger.py`)
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass
class LogEntry:
    timestamp: datetime
    level: str  # INFO, WARNING, ERROR, DEBUG
    component: str
    message: str
    workflow_id: Optional[str] = None
    node_id: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]: ...

class RPALogger:
    def __init__(self, component: str, log_file: Optional[str] = None): ...
    def info(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def warning(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def error(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def debug(self, message: str, workflow_id: Optional[str] = None, **metadata): ...
    def workflow_start(self, workflow_id: str, **metadata): ...
    def workflow_end(self, workflow_id: str, success: bool, duration: float): ...
    def get_recent_logs(self, limit: int = 100) -> List[LogEntry]: ...
    def export_logs(self, start_time: Optional[datetime] = None,
                    end_time: Optional[datetime] = None) -> str: ...

def get_logger(component: str) -> RPALogger: ...
```
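To make the interface concrete, here is a minimal sketch of how the in-memory side of `RPALogger` could behave, assuming a bounded ring buffer behind `get_recent_logs()` (the `max_entries` parameter is an illustrative assumption; the real logger also writes to `log_file` and updates the Prometheus counters):

```python
# Illustrative sketch only: an in-memory ring buffer behind get_recent_logs().
# max_entries is assumed; file output and metrics updates are omitted.
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional

@dataclass
class LogEntry:
    timestamp: datetime
    level: str
    component: str
    message: str
    workflow_id: Optional[str] = None
    node_id: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

class RPALogger:
    def __init__(self, component: str, max_entries: int = 10_000):
        self.component = component
        self._entries: deque = deque(maxlen=max_entries)  # oldest dropped first

    def _log(self, level: str, message: str,
             workflow_id: Optional[str], metadata: Dict[str, Any]) -> None:
        self._entries.append(LogEntry(datetime.now(), level, self.component,
                                      message, workflow_id, metadata=metadata))

    def info(self, message: str, workflow_id: Optional[str] = None, **metadata):
        self._log("INFO", message, workflow_id, metadata)

    def get_recent_logs(self, limit: int = 100) -> List[LogEntry]:
        # Return the newest entries, oldest first within the slice.
        return list(self._entries)[-limit:]
```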
### 2. Prometheus Metrics (`core/monitoring/metrics.py`)
```python
from prometheus_client import Counter, Gauge, Histogram

# Counters
workflow_executions_total = Counter(
    'workflow_executions_total',
    'Total workflow executions',
    ['workflow_id', 'status']
)
log_entries_total = Counter(
    'log_entries_total',
    'Total log entries',
    ['level', 'component']
)
chain_executions_total = Counter(
    'chain_executions_total',
    'Total chain executions',
    ['chain_id', 'status']
)
trigger_fires_total = Counter(
    'trigger_fires_total',
    'Total trigger fires',
    ['trigger_type', 'workflow_id']
)

# Histograms
workflow_duration_seconds = Histogram(
    'workflow_duration_seconds',
    'Workflow execution duration',
    ['workflow_id']
)

# Gauges
active_workflows = Gauge('active_workflows', 'Number of active workflows')
error_rate = Gauge('error_rate', 'Current error rate percentage')
```
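Property 6 requires that `/metrics` emit valid Prometheus text exposition format. In production that text comes from `prometheus_client.generate_latest()`; purely to illustrate what the format looks like for one labeled counter, here is a dependency-free sketch (not the library's implementation):

```python
# Sketch of the Prometheus text exposition format for one labeled counter.
# Shown only to illustrate what Property 6 requires of /metrics; the real
# endpoint returns prometheus_client.generate_latest().
from typing import Dict, Tuple

def render_counter(name: str, help_text: str,
                   samples: Dict[Tuple[str, ...], float],
                   label_names: Tuple[str, ...]) -> str:
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for label_values, value in sorted(samples.items()):
        labels = ",".join(f'{k}="{v}"' for k, v in zip(label_names, label_values))
        lines.append(f"{name}{{{labels}}} {value}")
    return "\n".join(lines) + "\n"
```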
### 3. Chain Manager (`core/monitoring/chain_manager.py`)
```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Callable, List, Optional

@dataclass
class WorkflowChain:
    chain_id: str
    name: str
    workflows: List[str]  # Ordered list of workflow_ids
    status: str  # active, inactive, running
    created_at: datetime
    last_execution: Optional[datetime] = None
    success_rate: float = 0.0

class ChainManager:
    def __init__(self, storage_path: Path): ...
    def list_chains(self) -> List[WorkflowChain]: ...
    def get_chain(self, chain_id: str) -> Optional[WorkflowChain]: ...
    def create_chain(self, name: str, workflows: List[str]) -> WorkflowChain: ...
    def validate_workflows_exist(self, workflow_ids: List[str]) -> bool: ...
    def execute_chain(self, chain_id: str, on_progress: Callable) -> ChainExecutionResult: ...
    def delete_chain(self, chain_id: str) -> bool: ...
```
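The fail-fast behavior demanded by Property 3 is the core of `execute_chain`. A minimal sketch, assuming a hypothetical `run_workflow` callable that returns success as a boolean:

```python
# Sketch of fail-fast chain execution (Property 3): stop at the first failing
# workflow and never run the ones after it. run_workflow is hypothetical.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class ChainExecutionResult:
    success: bool
    executed: List[str] = field(default_factory=list)
    failed_at: Optional[str] = None

def execute_chain(workflows: List[str],
                  run_workflow: Callable[[str], bool]) -> ChainExecutionResult:
    result = ChainExecutionResult(success=True)
    for wf_id in workflows:
        if not run_workflow(wf_id):
            # Record the failure point; subsequent workflows are skipped.
            result.success = False
            result.failed_at = wf_id
            break
        result.executed.append(wf_id)
    return result
```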
### 4. Trigger Manager (`core/monitoring/trigger_manager.py`)
```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List, Optional

@dataclass
class Trigger:
    trigger_id: str
    trigger_type: str  # schedule, file, manual
    workflow_id: str
    config: Dict[str, Any]
    enabled: bool
    created_at: datetime
    last_fired: Optional[datetime] = None

class TriggerManager:
    def __init__(self, storage_path: Path): ...
    def list_triggers(self) -> List[Trigger]: ...
    def get_trigger(self, trigger_id: str) -> Optional[Trigger]: ...
    def create_trigger(self, trigger_type: str, workflow_id: str, config: Dict) -> Trigger: ...
    def validate_config(self, trigger_type: str, config: Dict) -> bool: ...
    def enable_trigger(self, trigger_id: str) -> bool: ...
    def disable_trigger(self, trigger_id: str) -> bool: ...
    def delete_trigger(self, trigger_id: str) -> bool: ...
```
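`validate_config` is type-specific. A sketch for the `schedule` type, using the field names from the trigger storage format; the `file` rule (`watch_path`) is an assumption, since the design does not spell out file-trigger config:

```python
# Sketch of per-type trigger config validation. Schedule fields follow the
# storage format; the file-trigger "watch_path" field is assumed.
from datetime import datetime
from typing import Any, Dict

def validate_config(trigger_type: str, config: Dict[str, Any]) -> bool:
    if trigger_type == "schedule":
        try:
            # start_time / end_time must be HH:MM wall-clock strings
            datetime.strptime(config["start_time"], "%H:%M")
            datetime.strptime(config["end_time"], "%H:%M")
        except (KeyError, ValueError):
            return False
        interval = config.get("interval_seconds")
        return isinstance(interval, int) and interval > 0
    if trigger_type == "file":
        return bool(config.get("watch_path"))  # assumed field name
    return trigger_type == "manual"  # manual triggers need no config
```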
### 5. Log Exporter (`core/monitoring/log_exporter.py`)
```python
import io
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional

class LogExporter:
    def __init__(self, logs_path: Path): ...

    def export_to_zip(
        self,
        start_time: Optional[datetime] = None,
        end_time: Optional[datetime] = None
    ) -> io.BytesIO: ...

    def get_execution_logs(self, start: datetime, end: datetime) -> List[Dict]: ...
    def get_error_logs(self, start: datetime, end: datetime) -> List[Dict]: ...
    def get_metrics_summary(self) -> Dict: ...
```
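Properties 13 and 14 pin down the archive's shape: a valid ZIP with exactly the three JSON members. A sketch of the in-memory assembly, assuming the three payloads have already been collected:

```python
# Sketch of in-memory ZIP assembly for export_to_zip() (Properties 13-14):
# three JSON members, built without touching disk.
import io
import json
import zipfile
from typing import Dict, List

def export_to_zip(execution_logs: List[Dict], error_logs: List[Dict],
                  metrics: Dict) -> io.BytesIO:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("execution_logs.json", json.dumps(execution_logs, indent=2))
        zf.writestr("error_logs.json", json.dumps(error_logs, indent=2))
        zf.writestr("metrics.json", json.dumps(metrics, indent=2))
    buf.seek(0)  # rewind so Flask's send_file can stream it
    return buf
```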
### 6. API Endpoints (additions to `web_dashboard/app.py`)
```python
# Chains API
@app.route('/api/chains')
def api_chains(): ...
@app.route('/api/chains', methods=['POST'])
def create_chain(): ...
@app.route('/api/chains/<chain_id>/execute', methods=['POST'])
def execute_chain(chain_id): ...
# Triggers API
@app.route('/api/triggers')
def api_triggers(): ...
@app.route('/api/triggers', methods=['POST'])
def create_trigger(): ...
@app.route('/api/triggers/<trigger_id>/toggle', methods=['POST'])
def toggle_trigger(trigger_id): ...
# Logs API
@app.route('/api/logs/download')
def download_logs(): ...
# Metrics API
@app.route('/metrics')
def prometheus_metrics(): ...
```
## Data Models
### Chain Storage Format (`data/chains/*.json`)
```json
{
  "chain_id": "chain_001",
  "name": "Complete Process",
  "workflows": ["wf_login", "wf_data_entry", "wf_submit"],
  "status": "active",
  "created_at": "2024-11-29T10:00:00",
  "last_execution": "2024-11-29T14:30:00",
  "success_rate": 92.5,
  "execution_history": [
    {
      "timestamp": "2024-11-29T14:30:00",
      "success": true,
      "duration": 45.2,
      "failed_at": null
    }
  ]
}
```
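The stored `success_rate` can be derived from `execution_history`. A one-function sketch (rounding to one decimal, matching the 92.5-style value above, is an assumption):

```python
# Sketch: success_rate as the percentage of successful runs in
# execution_history. One-decimal rounding is an assumed convention.
from typing import Dict, List

def compute_success_rate(history: List[Dict]) -> float:
    if not history:
        return 0.0
    wins = sum(1 for run in history if run.get("success"))
    return round(100.0 * wins / len(history), 1)
```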
### Trigger Storage Format (`data/triggers/*.json`)
```json
{
  "trigger_id": "trigger_001",
  "trigger_type": "schedule",
  "workflow_id": "wf_login",
  "config": {
    "interval_seconds": 3600,
    "start_time": "08:00",
    "end_time": "18:00"
  },
  "enabled": true,
  "created_at": "2024-11-29T10:00:00",
  "last_fired": "2024-11-29T14:00:00"
}
```
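The schedule config above implies a firing rule: fire only inside the daily `start_time`/`end_time` window, and only once `interval_seconds` have elapsed since `last_fired`. A sketch of that decision, under those assumptions:

```python
# Sketch of the firing decision for a schedule trigger: inside the daily
# window, and at least interval_seconds after the previous firing.
from datetime import datetime, time
from typing import Any, Dict, Optional

def should_fire(config: Dict[str, Any], now: datetime,
                last_fired: Optional[datetime]) -> bool:
    start = time.fromisoformat(config["start_time"])  # e.g. "08:00"
    end = time.fromisoformat(config["end_time"])      # e.g. "18:00"
    if not (start <= now.time() <= end):
        return False
    if last_fired is None:
        return True  # never fired before: fire on first opportunity
    return (now - last_fired).total_seconds() >= config["interval_seconds"]
```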
### Log Entry Format
```json
{
  "timestamp": "2024-11-29T14:30:15.123",
  "level": "INFO",
  "component": "execution",
  "message": "Workflow started",
  "workflow_id": "wf_001",
  "node_id": "login_node",
  "metadata": {
    "trigger": "schedule",
    "user": "system"
  }
}
```
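Property 11 requires that every returned entry match every supplied filter. Over entries in the dict form above, the match predicate can be sketched as:

```python
# Sketch of the log filter predicate (Property 11): an entry matches only if
# every supplied criterion matches; criteria left as None are ignored.
from typing import Any, Dict, Optional

def matches(entry: Dict[str, Any], level: Optional[str] = None,
            component: Optional[str] = None,
            workflow_id: Optional[str] = None) -> bool:
    for key, wanted in (("level", level), ("component", component),
                        ("workflow_id", workflow_id)):
        if wanted is not None and entry.get(key) != wanted:
            return False
    return True
```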
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system; in essence, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: Chain listing completeness
*For any* set of chains stored in the system, the chains API endpoint SHALL return all chains with their complete workflow sequences and status information.
**Validates: Requirements 1.1**
### Property 2: Chain workflow validation
*For any* chain creation request with workflow references, if any referenced workflow does not exist, the creation SHALL fail with a validation error.
**Validates: Requirements 1.2**
### Property 3: Chain execution stops on failure
*For any* chain execution where a workflow fails, the chain execution SHALL stop at the failed workflow and not execute subsequent workflows.
**Validates: Requirements 1.4**
### Property 4: Trigger listing completeness
*For any* set of triggers stored in the system, the triggers API endpoint SHALL return all triggers with type, workflow_id, and enabled status.
**Validates: Requirements 2.1**
### Property 5: Trigger state persistence
*For any* trigger enable/disable operation, the new state SHALL be persisted and returned correctly on subsequent queries.
**Validates: Requirements 2.3**
### Property 6: Prometheus metrics format validity
*For any* request to the /metrics endpoint, the response SHALL be valid Prometheus exposition format parseable by Prometheus.
**Validates: Requirements 3.1**
### Property 7: Workflow execution counter increment
*For any* workflow execution (success or failure), the workflow_executions_total counter SHALL increment by exactly 1 with correct labels.
**Validates: Requirements 3.2**
### Property 8: Workflow duration histogram recording
*For any* completed workflow execution with a measured duration, the workflow_duration_seconds histogram SHALL record that duration.
**Validates: Requirements 3.3**
### Property 9: Log entry structure completeness
*For any* log entry created by the logging system, the entry SHALL contain timestamp, level, component, and message fields.
**Validates: Requirements 4.1**
### Property 10: Workflow log metadata inclusion
*For any* log entry created with workflow context, the entry metadata SHALL include workflow_id and node_id when provided.
**Validates: Requirements 4.2**
### Property 11: Log filtering correctness
*For any* log query with filter parameters, all returned entries SHALL match the specified filter criteria.
**Validates: Requirements 4.3**
### Property 12: Log counter synchronization
*For any* log entry written, the corresponding Prometheus log counter SHALL be incremented by 1.
**Validates: Requirements 4.4**
### Property 13: ZIP archive validity
*For any* log download request, the response SHALL be a valid ZIP archive.
**Validates: Requirements 5.1**
### Property 14: ZIP archive contents
*For any* log download, the ZIP archive SHALL contain execution_logs.json, error_logs.json, and metrics.json files.
**Validates: Requirements 5.2**
### Property 15: Date range filtering
*For any* log download with date range parameters, all log entries in the archive SHALL have timestamps within the specified range.
**Validates: Requirements 5.4**
## Error Handling
### Chain Errors
- `ChainNotFoundError`: Chain ID does not exist
- `WorkflowNotFoundError`: Referenced workflow does not exist
- `ChainExecutionError`: Error during chain execution with failure point
### Trigger Errors
- `TriggerNotFoundError`: Trigger ID does not exist
- `InvalidTriggerConfigError`: Trigger configuration is invalid
- `WorkflowNotFoundError`: Target workflow does not exist
### Log Errors
- `LogExportError`: Error generating log archive
- `InvalidDateRangeError`: Start date is after end date
## Testing Strategy
### Property-Based Testing Library
The implementation will use **Hypothesis** for Python property-based testing.
### Test Configuration
- Minimum 100 iterations per property test
- Each property test tagged with: `**Feature: admin-monitoring, Property {number}: {property_text}**`
### Unit Tests
- Test individual component methods
- Test API endpoint responses
- Test error handling paths
### Property-Based Tests
Each correctness property will have a corresponding property-based test:
1. **Property 1-2**: Generate random chain configurations, verify listing and validation
2. **Property 3**: Generate chains with failing workflows, verify execution stops
3. **Property 4-5**: Generate random triggers, verify listing and state persistence
4. **Property 6-8**: Generate workflow executions, verify metrics format and values
5. **Property 9-12**: Generate log entries, verify structure and counter sync
6. **Property 13-15**: Generate log data, verify ZIP contents and filtering
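Hypothesis' `@given` is the intended harness for the tests above. As a dependency-free illustration of the shape such a test takes, Property 15 (date-range filtering) can be exercised with stdlib randomness; `filter_by_range` stands in for the exporter's filtering step:

```python
# Dependency-free illustration of a property-style test for Property 15.
# The real suite would express this with Hypothesis' @given over generated
# timestamps; filter_by_range is a stand-in for the exporter's filter.
import random
from datetime import datetime, timedelta

def filter_by_range(entries, start, end):
    return [e for e in entries if start <= e["timestamp"] <= end]

def check_property_15(iterations: int = 100) -> bool:
    rng = random.Random(0)  # seeded for reproducibility
    base = datetime(2024, 11, 29)
    for _ in range(iterations):
        entries = [{"timestamp": base + timedelta(seconds=rng.randrange(86_400))}
                   for _ in range(rng.randrange(50))]
        start = base + timedelta(seconds=rng.randrange(43_200))
        end = start + timedelta(seconds=rng.randrange(43_200))
        # Property: every entry in the filtered output lies inside the range.
        if any(not (start <= e["timestamp"] <= end)
               for e in filter_by_range(entries, start, end)):
            return False
    return True
```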
### Integration Tests
- End-to-end chain execution flow
- Trigger firing and workflow execution
- Log download with various filters