rpa_vision_v3/.kiro/specs/visual-workflow-builder-vision-refactor/design.md

Design Document

Overview

This complete overhaul of the Visual Workflow Builder turns the interface into a 100% vision-based system, eliminating all traditional selectors and implementing a pure visual selection approach consistent with the RPA Vision V3 architecture.

Architecture

System Architecture

graph TB
    subgraph "Frontend (React/TypeScript)"
        VWB[Visual Workflow Builder]
        VS[Visual Selector Component]
        VTP[Visual Target Preview]
        RSV[Reference Screenshot Viewer]
    end
    
    subgraph "Backend APIs"
        SCS[Screen Capture Service]
        EDE[Element Detection Engine]
        VTM[Visual Target Manager]
        EGS[Embedding Generation Service]
    end
    
    subgraph "Core RPA Vision V3"
        SC[Screen Capturer]
        UD[UI Detector]
        FE[Fusion Engine]
        EM[Embedding Manager]
    end
    
    VWB --> VS
    VS --> SCS
    SCS --> SC
    SCS --> UD
    UD --> EDE
    EDE --> FE
    FE --> EM
    EM --> VTM
    VTM --> VTP
    VTP --> RSV

Component Hierarchy

VisualWorkflowBuilder/
├── VisualCanvas/                    # Main canvas with workflow nodes
│   ├── VisualNode/                 # Node with visual preview
│   └── VisualConnection/           # Connections between nodes
├── VisualPalette/                  # Visual action palette
│   └── VisualActionCard/           # Action card with icon
├── VisualPropertiesPanel/          # Visual properties panel
│   ├── VisualTargetConfig/         # Visual target configuration
│   ├── ReferenceScreenshotView/    # Reference screenshot display
│   └── VisualMetadataDisplay/      # Visual metadata
├── VisualScreenSelector/           # Screen selector (complete redesign)
│   ├── ScreenCaptureView/          # Screen capture view
│   ├── ElementDetectionOverlay/    # Element detection overlay
│   └── BoundingBoxRenderer/        # Bounding box rendering
└── VisualValidationPanel/          # Visual validation panel
    ├── TargetPreview/              # Target preview
    └── ValidationFeedback/         # Validation feedback

Components and Interfaces

1. VisualScreenSelector (Complete Redesign)

Responsibilities:

  • Real-time screen capture
  • Automatic detection of UI elements
  • Display of the reference image with precise overlays
  • Pure visual selection, with no CSS/XPath selectors

Interface:

interface VisualScreenSelectorProps {
  open: boolean;
  onClose: () => void;
  onElementSelected: (target: VisualTarget) => void;
  captureMode: 'fullscreen' | 'window' | 'region';
}

interface DetectedElement {
  id: string;
  bounds: BoundingBox;
  elementType: ElementType;
  confidence: number;
  textContent?: string;
  visualFeatures: VisualFeatures;
}

interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
  screenScale: number;
  dpiScale: number;
}
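
The interface above does not state how `screenScale` and `dpiScale` combine when an overlay is drawn. The sketch below is one possible reading, not the project's confirmed behavior: it assumes `x`, `y`, `width` and `height` are in captured (physical) pixels, that `dpiScale` is the device pixel ratio at capture time, and that `screenScale` is the ratio between the rendered image and the captured image. `DisplayRect` and `toDisplayRect` are hypothetical names.

```typescript
interface BoundingBox {
  x: number;
  y: number;
  width: number;
  height: number;
  screenScale: number;
  dpiScale: number;
}

// Hypothetical output type: a rectangle in on-screen (CSS) pixels.
interface DisplayRect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// ASSUMPTION: physical pixels / dpiScale = logical pixels, and
// logical pixels * screenScale = pixels of the rendered image.
function toDisplayRect(box: BoundingBox): DisplayRect {
  if (box.dpiScale <= 0 || box.screenScale <= 0) {
    throw new RangeError('dpiScale and screenScale must be positive');
  }
  const factor = box.screenScale / box.dpiScale;
  return {
    left: box.x * factor,
    top: box.y * factor,
    width: box.width * factor,
    height: box.height * factor,
  };
}
```

Keeping the conversion in a single function means the hover overlay and the persisted selection box cannot drift apart, since both would go through the same factor.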

2. ReferenceScreenshotView

Responsibilities:

  • Display of the reference capture image
  • Precise overlay of selected elements
  • Zoom and pan for detailed inspection
  • Handling of different screen resolutions

Interface:

interface ReferenceScreenshotViewProps {
  screenshot: string; // Base64 image data
  selectedElement?: DetectedElement;
  detectedElements: DetectedElement[];
  onElementHover: (element: DetectedElement | null) => void;
  onElementClick: (element: DetectedElement) => void;
  zoomLevel: number;
  panOffset: { x: number; y: number };
}
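
The `zoomLevel` and `panOffset` props imply a mapping between image coordinates and view coordinates. A minimal sketch of that mapping follows, assuming an image point `p` is rendered at `p * zoomLevel + panOffset`; the helper names (`imageToView`, `viewToImage`, `zoomAt`) are illustrative and do not come from the codebase.

```typescript
interface Point {
  x: number;
  y: number;
}

interface ViewTransform {
  zoomLevel: number;
  panOffset: Point;
}

// ASSUMPTION: view = image * zoomLevel + panOffset.
function imageToView(p: Point, t: ViewTransform): Point {
  return { x: p.x * t.zoomLevel + t.panOffset.x, y: p.y * t.zoomLevel + t.panOffset.y };
}

// Inverse mapping, used to find which image pixel is under the cursor.
function viewToImage(p: Point, t: ViewTransform): Point {
  return { x: (p.x - t.panOffset.x) / t.zoomLevel, y: (p.y - t.panOffset.y) / t.zoomLevel };
}

// Change the zoom while keeping the image point under the cursor fixed.
function zoomAt(t: ViewTransform, cursor: Point, newZoom: number): ViewTransform {
  const anchor = viewToImage(cursor, t);
  return {
    zoomLevel: newZoom,
    panOffset: { x: cursor.x - anchor.x * newZoom, y: cursor.y - anchor.y * newZoom },
  };
}
```

Hover and click handlers would call `viewToImage` on the mouse position before comparing it with element bounds, so hit detection keeps working at any zoom level.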

3. VisualTargetConfig

Responsibilities:

  • Visual configuration of targets
  • Complete removal of CSS/XPath fields
  • Interface based on visual metadata
  • Real-time validation

Interface:

interface VisualTargetConfigProps {
  target: VisualTarget;
  onTargetUpdate: (target: VisualTarget) => void;
  onValidate: () => Promise<ValidationResult>;
  showAdvancedOptions: boolean;
}

interface VisualTarget {
  id: string;
  screenshot: string; // Image of the element with overlay
  embedding: Float32Array;
  boundingBox: BoundingBox;
  confidence: number;
  metadata: VisualMetadata;
  contextualInfo: ContextualInfo;
  validationStatus: ValidationStatus;
}

4. ElementDetectionOverlay

Responsibilities:

  • Real-time rendering of detection overlays
  • Handling of hover/click interactions
  • Display of confidence information
  • Smooth transition animations

Interface:

interface ElementDetectionOverlayProps {
  elements: DetectedElement[];
  hoveredElement?: DetectedElement;
  selectedElement?: DetectedElement;
  onElementInteraction: (element: DetectedElement, action: 'hover' | 'click') => void;
  overlayStyle: OverlayStyle;
}

interface OverlayStyle {
  hoverColor: string;
  selectedColor: string;
  borderWidth: number;
  opacity: number;
  animationDuration: number;
}
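
To report a hover or click through `onElementInteraction`, the overlay has to decide which detected element lies under the pointer. The sketch below shows one possible rule, assumed here rather than taken from the design: when several boxes contain the point, the one with the smallest area wins, so a button is preferred over the panel that contains it. The types are trimmed-down stand-ins for `DetectedElement`.

```typescript
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Reduced stand-in for DetectedElement, limited to what hit testing needs.
interface Detected {
  id: string;
  bounds: Box;
  confidence: number;
}

// ASSUMPTION: the innermost (smallest-area) element under the pointer is the intended one.
function hitTest(elements: Detected[], px: number, py: number): Detected | null {
  let best: Detected | null = null;
  for (const el of elements) {
    const b = el.bounds;
    const inside = px >= b.x && px < b.x + b.width && py >= b.y && py < b.y + b.height;
    if (!inside) continue;
    if (best === null || b.width * b.height < best.bounds.width * best.bounds.height) {
      best = el;
    }
  }
  return best;
}
```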

Data Models

VisualTarget (Enhanced)

interface VisualTarget {
  // Identification
  id: string;
  signature: string;
  
  // Visual data
  screenshot: string; // Reference image with overlay
  embedding: Float32Array;
  boundingBox: BoundingBox;
  
  // Detection metadata
  confidence: number;
  elementType: ElementType;
  textContent?: string;
  
  // Contextual information
  contextualInfo: {
    screenSize: { width: number; height: number };
    captureTimestamp: string;
    surroundingElements: DetectedElement[];
    relativePosition: string;
  };
  
  // Validation and quality
  validationStatus: 'pending' | 'valid' | 'invalid' | 'needs_review';
  validationScore: number;
  lastValidated?: Date;
  
  // Visual metadata
  visualMetadata: {
    colorProfile: ColorProfile;
    textualFeatures: TextualFeatures;
    spatialFeatures: SpatialFeatures;
    accessibilityInfo: AccessibilityInfo;
  };
  visualMetadata: {
    colorProfile: ColorProfile;
    textualFeatures: TextualFeatures;
    spatialFeatures: SpatialFeatures;
    accessibilityInfo: AccessibilityInfo;
  };
}
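
The model stores an `embedding` for each target but does not say how two embeddings are compared. Cosine similarity is a common choice for this kind of vector, so the sketch below uses it purely as an assumption; the metric actually used by the Embedding Manager may differ.

```typescript
// ASSUMPTION: embeddings are compared with cosine similarity (not confirmed by the design).
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) {
    throw new RangeError('embeddings must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as "no similarity".
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

A score close to 1 would mean the candidate region looks like the stored target; the threshold at which a match is accepted is left open here.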

ScreenCaptureResult

interface ScreenCaptureResult {
  screenshot: string; // Base64 image data
  detectedElements: DetectedElement[];
  captureMetadata: {
    timestamp: Date;
    screenResolution: { width: number; height: number };
    dpiScale: number;
    captureRegion: BoundingBox;
    processingTime: number;
  };
  qualityMetrics: {
    imageQuality: number;
    detectionConfidence: number;
    elementCount: number;
  };
}

ValidationResult

interface ValidationResult {
  isValid: boolean;
  confidence: number;
  issues: ValidationIssue[];
  suggestions: string[];
  visualFeedback: {
    highlightAreas: BoundingBox[];
    warningMessages: string[];
    successIndicators: string[];
  };
}

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Selector Generation Prohibition

For any workflow configuration generated by the Visual_Workflow_Builder, the configuration should never contain CSS selector strings or XPath selector strings. Validates: Requirements 1.3
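
One way to make Property 1 machine-checkable is to walk a generated configuration and report anything that looks like a selector. The sketch below is a heuristic under stated assumptions: the field names in `FORBIDDEN_KEYS` and the XPath pattern are illustrative guesses, not names taken from the actual workflow schema. CSS selectors are only caught by field name, because a bare string such as `#submit` cannot be reliably distinguished from ordinary text.

```typescript
// ASSUMPTION: hypothetical field names that would indicate a selector-based target.
const FORBIDDEN_KEYS = new Set(['selector', 'cssselector', 'css', 'xpath']);

// Heuristic: strings that start like an XPath expression.
const XPATH_LIKE = /^\s*(\/\/|\/html\b|\.\/\/|\(\s*\/\/)/i;

// Returns a list of "path: reason" strings; an empty list means no violation was found.
function findSelectorViolations(value: unknown, path: string = '$'): string[] {
  const violations: string[] = [];
  if (typeof value === 'string') {
    if (XPATH_LIKE.test(value)) violations.push(`${path}: XPath-like string`);
  } else if (Array.isArray(value)) {
    value.forEach((item, i) => {
      violations.push(...findSelectorViolations(item, `${path}[${i}]`));
    });
  } else if (value !== null && typeof value === 'object') {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      const childPath = `${path}.${key}`;
      if (FORBIDDEN_KEYS.has(key.toLowerCase())) {
        violations.push(`${childPath}: forbidden selector field`);
      }
      violations.push(...findSelectorViolations(child, childPath));
    }
  }
  return violations;
}
```

A property test would generate many workflows through the builder and assert that this function returns an empty list for each of them.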

Property 2: Visual Selection Method Exclusivity

For any action configuration interface, only visual selection methods should be available as configuration options. Validates: Requirements 1.4

Property 3: Screen Capture Trigger

For any element selection request, the Visual_Selector should initiate screen capture automatically. Validates: Requirements 2.1

Property 4: Reference Screenshot Display

For any screen capture result, the Visual_Selector should display the reference screenshot with detected elements highlighted. Validates: Requirements 2.2

Property 5: Hover Bounding Box Response

For any mouse hover event over a detected element, the Visual_Selector should display a precise bounding box overlay. Validates: Requirements 2.3

Property 6: Click Visual Target Creation

For any click event on a detected element, the Visual_Selector should automatically create a Visual_Target object. Validates: Requirements 2.4

Property 7: Metadata Visibility

For any detected element, the Visual_Selector should display confidence scores and metadata information. Validates: Requirements 2.5

Property 8: Zoom Pan Functionality

For any reference screenshot display, zoom and pan controls should function correctly and affect the display. Validates: Requirements 2.6

Property 9: Visual Target Screenshot Association

For any Visual_Target in the workflow, the Visual_Workflow_Builder should display its associated reference screenshot. Validates: Requirements 3.1

Property 10: Green Border Overlay

For any captured element in a reference screenshot, the element should be displayed with a green border overlay. Validates: Requirements 3.2

Property 11: Action Target Image Display

For any configured action being viewed, the Visual_Workflow_Builder should show the target element image. Validates: Requirements 3.3

Property 12: Contextual Information Inclusion

For any reference screenshot, contextual information including timestamp and screen size should be included and accurate. Validates: Requirements 3.4

Property 13: Screenshot Enlargement

For any reference screenshot, enlargement controls should work correctly to provide a detailed view. Validates: Requirements 3.5

Property 14: Pixel Perfect Alignment

For any detected UI element, the bounding box should be aligned with the element within acceptable pixel tolerance. Validates: Requirements 4.1
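
Property 14 refers to an "acceptable pixel tolerance" without giving a number. The sketch below shows how such a check could be expressed once a tolerance is chosen; the default of 1 pixel is a placeholder assumption, and `maxEdgeDeviation` and `isAligned` are hypothetical helper names.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Largest distance between corresponding edges (left, top, right, bottom).
function maxEdgeDeviation(detected: Rect, reference: Rect): number {
  return Math.max(
    Math.abs(detected.x - reference.x),
    Math.abs(detected.y - reference.y),
    Math.abs(detected.x + detected.width - (reference.x + reference.width)),
    Math.abs(detected.y + detected.height - (reference.y + reference.height)),
  );
}

// ASSUMPTION: 1 px default tolerance; the requirement does not fix a value.
function isAligned(detected: Rect, reference: Rect, tolerancePx: number = 1): boolean {
  return maxEdgeDeviation(detected, reference) <= tolerancePx;
}
```

Comparing edges rather than only the origin also catches boxes that start in the right place but have the wrong size.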

Property 15: Screen Scaling Compensation

For any overlay display across different screen configurations, the Visual_Selector should correctly account for screen scaling and resolution. Validates: Requirements 4.2

Property 16: Real-time Bounding Box Updates

For any mouse hover state change, bounding boxes should update immediately in real-time. Validates: Requirements 4.3

Property 17: Selection State Persistence

For any selected element, the bounding box should persist with correct coordinates after selection. Validates: Requirements 4.4

Property 18: Multi-monitor Coordinate Mapping

For any multi-monitor setup, the Visual_Selector should handle coordinate mapping correctly across all displays. Validates: Requirements 4.5
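
Coordinate mapping across displays can be illustrated with a small sketch. It assumes each monitor is described by its origin and size inside a shared virtual desktop, where monitors placed left of or above the primary one have negative origins; the `Monitor` and `MonitorPoint` shapes are hypothetical and not part of the Screen_Capture_Service API.

```typescript
// Hypothetical description of a display inside the virtual desktop.
interface Monitor {
  id: string;
  originX: number;
  originY: number;
  width: number;
  height: number;
}

interface MonitorPoint {
  monitorId: string;
  localX: number;
  localY: number;
}

// Maps a virtual-desktop point to the monitor containing it, or null if it is off-screen.
function locateOnMonitors(monitors: Monitor[], globalX: number, globalY: number): MonitorPoint | null {
  for (const m of monitors) {
    const inside =
      globalX >= m.originX && globalX < m.originX + m.width &&
      globalY >= m.originY && globalY < m.originY + m.height;
    if (inside) {
      return { monitorId: m.id, localX: globalX - m.originX, localY: globalY - m.originY };
    }
  }
  return null;
}
```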

Property 19: Screen Capture API Integration

For any screen capture operation, the Visual_Workflow_Builder should use the existing Screen_Capture_Service API with proper parameters. Validates: Requirements 5.1

Property 20: Element Detection Integration

For any element detection operation, the Visual_Workflow_Builder should properly integrate with the Element_Detection_Engine. Validates: Requirements 5.2

Property 21: Visual Target Manager Usage

For any Visual_Target storage or retrieval operation, the Visual_Workflow_Builder should use the Visual_Target_Manager. Validates: Requirements 5.3

Property 22: Embedding System Integration

For any embedding generation, the Visual_Workflow_Builder should leverage the existing embedding generation system. Validates: Requirements 5.4

Property 23: RPA Pipeline Compatibility

For any generated workflow, the Visual_Workflow_Builder should maintain compatibility with the core RPA Vision V3 pipeline. Validates: Requirements 5.5

Property 24: Visual Target Thumbnails

For any Visual_Target in the interface, the Visual_Workflow_Builder should display thumbnail previews. Validates: Requirements 6.2

Property 25: Visual Interaction Feedback

For any user interaction, the Visual_Workflow_Builder should provide appropriate visual feedback. Validates: Requirements 6.4

Property 26: Confidence Score Display

For any Visual_Target, the Visual_Workflow_Builder should display accurate confidence scores. Validates: Requirements 7.1

Property 27: Element Type Detection Display

For any detected element, the Visual_Workflow_Builder should show the correct element type (button, input, link, etc.). Validates: Requirements 7.2

Property 28: Contextual Information Display

For any Visual_Target, the Visual_Workflow_Builder should display accurate contextual information including surrounding elements and position. Validates: Requirements 7.3

Property 29: Embedding Quality Metrics

For any Visual_Target, the Visual_Workflow_Builder should show meaningful embedding quality metrics. Validates: Requirements 7.4

Property 30: Validation Status Display

For any Visual_Target, the Visual_Workflow_Builder should provide accurate and updated validation status. Validates: Requirements 7.5

Property 31: Capture Button Response

For any "Capture Screen" button click, the Screen_Capture_Service should initiate real-time screenshot capture. Validates: Requirements 8.1

Property 32: Interactive Element Detection Completeness

For any test screen with interactive elements, the Screen_Capture_Service should detect all interactive UI elements. Validates: Requirements 8.2

Property 33: Sub-pixel Coordinate Precision

For any element coordinate data returned, the Screen_Capture_Service should provide sub-pixel precision. Validates: Requirements 8.3

Property 34: Display Configuration Compatibility

For any screen resolution and DPI setting combination, the Screen_Capture_Service should handle the configuration correctly. Validates: Requirements 8.4

Property 35: Capture Failure Fallback

For any screen capture failure scenario, the Screen_Capture_Service should provide working fallback mechanisms. Validates: Requirements 8.5

Property 36: Selection Highlight Overlay

For any element selection, the Visual_Workflow_Builder should highlight the element with a colored overlay. Validates: Requirements 9.1

Property 37: Action Preview Display

For any selected element, the Visual_Workflow_Builder should show a preview of the action that will be performed. Validates: Requirements 9.2

Property 38: Validation Error Indicators

For any validation failure, the Visual_Workflow_Builder should display visual error indicators. Validates: Requirements 9.3

Property 39: Real-time Detection Feedback

For any element detection operation, the Visual_Workflow_Builder should provide real-time feedback during processing. Validates: Requirements 9.4

Property 40: Selection Testing Capability

For any element selection, the Visual_Workflow_Builder should allow testing the selection before saving. Validates: Requirements 9.5

Property 41: Screen Capture Performance

For any screen capture operation, the Visual_Workflow_Builder should complete the capture in less than 2 seconds. Validates: Requirements 10.1

Property 42: Element Detection Performance

For any element detection operation, the Visual_Workflow_Builder should complete detection in less than 3 seconds. Validates: Requirements 10.2

Property 43: Async Operation Loading Indicators

For any asynchronous operation, the Visual_Workflow_Builder should provide loading indicators. Validates: Requirements 10.3

Property 44: Screenshot Caching Efficiency

For any reference screenshot access, the Visual_Workflow_Builder should cache and retrieve screenshots efficiently. Validates: Requirements 10.4
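
The requirement does not prescribe a caching strategy. As an illustration only, the sketch below uses a small least-recently-used (LRU) cache keyed by target id, relying on the fact that a JavaScript `Map` iterates in insertion order. The class name `ScreenshotCache` and the eviction policy are assumptions.

```typescript
// ASSUMPTION: LRU eviction keyed by Visual_Target id; the design leaves the policy open.
class ScreenshotCache {
  private readonly entries = new Map<string, string>();
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new RangeError('capacity must be at least 1');
    this.capacity = capacity;
  }

  // Returns the cached image data and marks the entry as most recently used.
  get(targetId: string): string | undefined {
    const data = this.entries.get(targetId);
    if (data === undefined) return undefined;
    this.entries.delete(targetId);
    this.entries.set(targetId, data);
    return data;
  }

  // Stores image data, evicting the least recently used entry when full.
  set(targetId: string, data: string): void {
    if (this.entries.has(targetId)) {
      this.entries.delete(targetId);
    } else if (this.entries.size >= this.capacity) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(targetId, data);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Because screenshots are Base64 strings, a production cache would more likely bound total bytes than entry count; the entry count is used here only to keep the example short.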

Property 45: Responsive Image Optimization

For any screen size configuration, the Visual_Workflow_Builder should optimize image display appropriately. Validates: Requirements 10.5

Error Handling

Screen Capture Failures

  • Timeout Handling: Implement a 5-second timeout for capture operations and notify the user when it expires
  • Permission Errors: Graceful handling of screen capture permission denials
  • Multi-monitor Issues: Fallback to primary monitor if multi-monitor capture fails
  • Resolution Conflicts: Automatic scaling adjustment for resolution mismatches
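
The 5-second capture timeout could be enforced with a generic wrapper around the capture promise. This is a minimal sketch: `withTimeout` and `TimeoutError` are hypothetical names, and the capture call shown in the usage note is a placeholder, not the real Screen_Capture_Service client.

```typescript
class TimeoutError extends Error {
  constructor(operation: string, ms: number) {
    super(`${operation} timed out after ${ms} ms`);
    this.name = 'TimeoutError';
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms` milliseconds.
// Note: the underlying operation is not cancelled, only abandoned by the caller.
function withTimeout<T>(work: Promise<T>, ms: number, operation: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new TimeoutError(operation, ms)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

A caller might write `withTimeout(captureScreen(), 5000, 'screen capture')`, where `captureScreen` stands in for whatever function performs the request, and show the user notification when the rejection has `name === 'TimeoutError'`.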

Element Detection Failures

  • Low Confidence Elements: Warning indicators for elements below 70% confidence
  • No Elements Detected: Clear messaging and retry options when no elements are found
  • Overlapping Elements: Disambiguation UI for overlapping or nested elements
  • Performance Degradation: Progressive quality reduction for performance optimization
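
The first two cases in the list above can be expressed as a small triage step over the detection results. The sketch uses the 70% threshold stated above; the `DetectionOutcome` shape and the function name are illustrative assumptions.

```typescript
// Reduced stand-in for DetectedElement.
interface Detection {
  id: string;
  confidence: number; // assumed to be in the range 0..1
}

type DetectionOutcome =
  | { kind: 'empty' } // nothing detected: show a message and a retry option
  | { kind: 'ok'; confident: Detection[]; lowConfidence: Detection[] };

const LOW_CONFIDENCE_THRESHOLD = 0.7;

// Splits detections into confident ones and ones that need a warning indicator.
function triageDetections(elements: Detection[]): DetectionOutcome {
  if (elements.length === 0) return { kind: 'empty' };
  const confident: Detection[] = [];
  const lowConfidence: Detection[] = [];
  for (const el of elements) {
    (el.confidence < LOW_CONFIDENCE_THRESHOLD ? lowConfidence : confident).push(el);
  }
  return { kind: 'ok', confident, lowConfidence };
}
```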

Visual Target Validation

  • Invalid Targets: Clear error messages for targets that fail validation
  • Outdated Screenshots: Automatic re-capture suggestions for stale references
  • Coordinate Misalignment: Real-time correction suggestions for misaligned targets
  • Embedding Failures: Fallback to basic visual matching when embeddings fail

Testing Strategy

Unit Testing

  • Component Isolation: Test each visual component independently with mock data
  • API Integration: Mock all backend services for frontend component testing
  • Error Scenarios: Comprehensive testing of error handling paths
  • Performance Boundaries: Test performance limits and degradation scenarios

Property-Based Testing

  • Visual Target Generation: Test Visual_Target creation across various element types and screen configurations
  • Coordinate Precision: Verify bounding box accuracy across different DPI and scaling settings
  • Screenshot Processing: Test image processing pipeline with various image formats and sizes
  • Integration Workflows: Verify end-to-end workflows from capture to execution
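
As an illustration of the coordinate-precision item, the sketch below checks a round-trip property: converting a box from physical to logical pixels and back should return the original box for any DPI scale. It is self-contained and uses a small hand-written seeded generator so the run is reproducible; a library such as fast-check would normally supply the generators and shrinking. The DPI values, coordinate ranges, and epsilon are assumptions chosen for the example.

```typescript
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Small seeded pseudo-random generator returning values in [0, 1).
function createRng(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const toLogical = (r: Rect, dpi: number): Rect =>
  ({ x: r.x / dpi, y: r.y / dpi, width: r.width / dpi, height: r.height / dpi });
const toPhysical = (r: Rect, dpi: number): Rect =>
  ({ x: r.x * dpi, y: r.y * dpi, width: r.width * dpi, height: r.height * dpi });

// Returns the counterexamples found; an empty array means the property held.
function checkRoundTrip(runs: number, seed: number): Rect[] {
  const rand = createRng(seed);
  const dpiScales = [1, 1.25, 1.5, 2, 3]; // ASSUMPTION: representative scale factors
  const failures: Rect[] = [];
  for (let i = 0; i < runs; i++) {
    const box: Rect = {
      x: rand() * 3840,
      y: rand() * 2160,
      width: 1 + rand() * 500,
      height: 1 + rand() * 500,
    };
    const dpi = dpiScales[Math.floor(rand() * dpiScales.length)];
    const back = toPhysical(toLogical(box, dpi), dpi);
    const drift = Math.max(
      Math.abs(back.x - box.x),
      Math.abs(back.y - box.y),
      Math.abs(back.width - box.width),
      Math.abs(back.height - box.height),
    );
    if (drift > 1e-6) failures.push(box);
  }
  return failures;
}
```

Returning the failing boxes rather than a boolean makes it easy to log a concrete counterexample when the property breaks.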

Integration Testing

  • Backend API Integration: Test real integration with Screen_Capture_Service and Element_Detection_Engine
  • Cross-browser Compatibility: Verify functionality across different browsers and versions
  • Multi-monitor Support: Test coordinate mapping and capture across multiple display configurations
  • Performance Under Load: Test system behavior with multiple concurrent capture operations

User Acceptance Testing

  • Visual Selection Workflow: End-to-end testing of the complete visual selection process
  • Reference Screenshot Accuracy: Verify that reference screenshots accurately represent selected elements
  • Workflow Creation: Test complete workflow creation using only visual selection methods
  • Error Recovery: Test user experience during error scenarios and recovery processes