rpa_vision_v3/docs/specs/requirements.md
Dom a27b74cf22 v1.0 - Stable version: multi-PC, UI-DETR-1 detection, 3 execution modes
- Frontend v4 accessible on the local network (192.168.1.40)
- Open ports: 3002 (frontend), 5001 (backend), 5004 (dashboard)
- Ollama GPU operational
- Interactive self-healing
- Confidence dashboard

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 11:23:51 +01:00


Requirements Document - Workflow Graph Implementation

Introduction

This document defines the requirements for implementing the Workflow Graph architecture of the RPA Vision V2 system. The system progressively transforms raw screenshots into learned semantic workflows, enabling automation based on visual understanding rather than on click coordinates.

The architecture follows a 5-layer approach: RawSession (raw capture) → ScreenState (multi-modal analysis) → UIElement Detection (semantic detection) → State Embedding (multi-modal fusion) → Workflow Graph (graph modeling with progressive learning).

Glossary

  • System: The RPA Vision V2 system
  • ScreenState: Structured representation of a screen state at 4 levels (Raw, Perception, Sémantique UI, Contexte Métier)
  • UIElement: Detected interface element with type, role, and embeddings
  • State Embedding: Single vector (fingerprint) fusing all modalities of a screen
  • WorkflowNode: Screen-state template within a workflow graph
  • WorkflowEdge: Transition (action) between two nodes
  • Workflow Graph: Complete graph modeling a workflow with learning states
  • Learning State: Progression state (OBSERVATION, COACHING, AUTO_CANDIDATE, AUTO_CONFIRMÉ)
  • RawSession: Raw recording of user events with screenshots
  • Embedding: Numeric vector representing one modality (image, text, UI)
  • FAISS Index: Similarity-search index for embeddings
  • VLM: Vision-Language Model

Requirements

Requirement 1

User Story: As a system developer, I want to faithfully capture user sessions with all events and screenshots, so that I can analyze and learn workflows.

Acceptance Criteria

  1. WHEN THE System captures a user session THEN THE System SHALL record all mouse events with precise timestamps and window context
  2. WHEN THE System captures a user session THEN THE System SHALL record all keyboard events with key combinations and window context
  3. WHEN THE System captures a user session THEN THE System SHALL take screenshots at each significant event with unique identifiers
  4. WHEN THE System saves a RawSession THEN THE System SHALL serialize it to JSON format with schema version "rawsession_v1"
  5. WHEN THE System loads a RawSession THEN THE System SHALL deserialize it from JSON and validate schema compatibility

Requirement 2

User Story: As a system developer, I want to transform each screenshot into a structured 4-level ScreenState, so that I have a rich, exploitable representation of the screen state.

Acceptance Criteria

  1. WHEN THE System processes a screenshot THEN THE System SHALL create a ScreenState with Raw level containing image path and metadata
  2. WHEN THE System processes a screenshot THEN THE System SHALL create Perception level with image embedding using OpenCLIP
  3. WHEN THE System processes a screenshot THEN THE System SHALL detect text using VLM and create text embeddings
  4. WHEN THE System processes a screenshot THEN THE System SHALL detect UI elements and create Sémantique UI level
  5. WHEN THE System processes a screenshot THEN THE System SHALL extract window context and create Contexte Métier level
  6. WHEN THE System saves a ScreenState THEN THE System SHALL serialize it to JSON with all 4 levels preserved

Requirement 3

User Story: As a system developer, I want to detect UI elements semantically, with types and roles, so that I can identify and manipulate them independently of their exact position.

Acceptance Criteria

  1. WHEN THE System detects UI elements THEN THE System SHALL identify regions of interest using VLM
  2. WHEN THE System detects UI elements THEN THE System SHALL classify each element with a semantic type (button, text_input, checkbox, etc.)
  3. WHEN THE System detects UI elements THEN THE System SHALL assign a semantic role to each element (primary_action, cancel, form_input, etc.)
  4. WHEN THE System detects UI elements THEN THE System SHALL extract visual features (dominant color, shape, size category)
  5. WHEN THE System detects UI elements THEN THE System SHALL generate dual embeddings (image embedding and text embedding) for each element
  6. WHEN THE System detects UI elements THEN THE System SHALL compute a confidence score for each detection
  7. WHEN THE System saves UIElements THEN THE System SHALL serialize them to JSON with all attributes and embedding references

Requirement 4

User Story: As a system developer, I want to fuse all modalities of a screen into a single State Embedding, so that I can compare and match screen states robustly.

Acceptance Criteria

  1. WHEN THE System creates a State Embedding THEN THE System SHALL compute image embedding from the full screenshot
  2. WHEN THE System creates a State Embedding THEN THE System SHALL compute text embedding from all detected text concatenated
  3. WHEN THE System creates a State Embedding THEN THE System SHALL compute title embedding from window title
  4. WHEN THE System creates a State Embedding THEN THE System SHALL compute UI embedding by averaging all UIElement embeddings
  5. WHEN THE System creates a State Embedding THEN THE System SHALL fuse all embeddings using weighted combination with configurable weights
  6. WHEN THE System creates a State Embedding THEN THE System SHALL normalize the final embedding vector
  7. WHEN THE System compares two State Embeddings THEN THE System SHALL compute cosine similarity between vectors
  8. WHEN THE System saves a State Embedding THEN THE System SHALL store the vector in FAISS index and save metadata to JSON
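The fusion and comparison steps of criteria 5-7 can be sketched in plain Python. The modality names and the idea that weights come from configuration are assumptions; only the weighted combination, normalization, and cosine similarity are mandated by the criteria:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (criterion 6)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def fuse_state_embedding(modalities, weights):
    """Weighted combination of per-modality embeddings (criterion 5).

    `modalities` and `weights` map a modality name (e.g. image, text,
    title, ui) to its vector and its configurable weight.
    """
    dim = len(next(iter(modalities.values())))
    fused = [0.0] * dim
    for name, vec in modalities.items():
        w = weights.get(name, 0.0)
        for i, x in enumerate(l2_normalize(vec)):
            fused[i] += w * x
    return l2_normalize(fused)

def cosine_similarity(a, b):
    """Compare two State Embeddings (criterion 7)."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))
```

In production the vectors would be numpy arrays and the per-modality embeddings would share a projection into a common dimensionality; this sketch assumes same-length vectors.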

Requirement 5

User Story: As a system developer, I want to model workflows as explicit graphs with Nodes and Edges, so that states and transitions are represented clearly.

Acceptance Criteria

  1. WHEN THE System creates a WorkflowNode THEN THE System SHALL define a screen template with window constraints
  2. WHEN THE System creates a WorkflowNode THEN THE System SHALL define required text patterns for matching
  3. WHEN THE System creates a WorkflowNode THEN THE System SHALL define required UI elements with roles and types
  4. WHEN THE System creates a WorkflowNode THEN THE System SHALL compute an embedding prototype from sample ScreenStates
  5. WHEN THE System creates a WorkflowNode THEN THE System SHALL set a minimum similarity threshold for matching
  6. WHEN THE System saves a WorkflowNode THEN THE System SHALL serialize it to JSON with all template constraints

Requirement 6

User Story: As a system developer, I want to define transitions between nodes as WorkflowEdges with semantic actions, so that I can specify how to navigate the workflow.

Acceptance Criteria

  1. WHEN THE System creates a WorkflowEdge THEN THE System SHALL define source and target nodes
  2. WHEN THE System creates a WorkflowEdge THEN THE System SHALL define action type (mouse_click, key_press, text_input, compound)
  3. WHEN THE System creates a WorkflowEdge THEN THE System SHALL define target element by semantic role rather than coordinates
  4. WHEN THE System creates a WorkflowEdge THEN THE System SHALL define selection policy for target element (first, last, by_similarity)
  5. WHEN THE System creates a WorkflowEdge THEN THE System SHALL define pre-conditions and post-conditions for validation
  6. WHEN THE System creates a WorkflowEdge THEN THE System SHALL track execution statistics (success count, failure count, avg time)
  7. WHEN THE System saves a WorkflowEdge THEN THE System SHALL serialize it to JSON with all action details and stats
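A WorkflowEdge carrying the attributes of criteria 1-6 might look like the following data structure (field names are illustrative assumptions; the enumerated values come from the criteria):

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class WorkflowEdge:
    source_node: str
    target_node: str
    action_type: str                 # mouse_click, key_press, text_input, compound
    target_role: str                 # criterion 3: semantic role, not coordinates
    selection_policy: str = "first"  # criterion 4: first, last, by_similarity
    pre_conditions: list = field(default_factory=list)
    post_conditions: list = field(default_factory=list)
    # Criterion 6: execution statistics.
    success_count: int = 0
    failure_count: int = 0
    avg_time_ms: float = 0.0

    def to_json(self) -> str:
        """Criterion 7: serialize action details and stats to JSON."""
        return json.dumps(asdict(self))
```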

Requirement 7

User Story: As a system developer, I want to assemble Nodes and Edges into a complete Workflow Graph with metadata, so that I have a full representation of the workflow.

Acceptance Criteria

  1. WHEN THE System creates a Workflow Graph THEN THE System SHALL define entry nodes and end nodes
  2. WHEN THE System creates a Workflow Graph THEN THE System SHALL validate that all edges reference existing nodes
  3. WHEN THE System creates a Workflow Graph THEN THE System SHALL detect cycles and branching in the graph
  4. WHEN THE System creates a Workflow Graph THEN THE System SHALL assign a unique workflow_id
  5. WHEN THE System creates a Workflow Graph THEN THE System SHALL initialize learning state to OBSERVATION
  6. WHEN THE System saves a Workflow Graph THEN THE System SHALL serialize it to JSON with all nodes, edges and metadata
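The structural checks of criteria 2-3 are standard graph algorithms; a minimal sketch (representing edges as (source, target) id pairs, an assumption):

```python
def validate_edges(node_ids, edges):
    """Criterion 2: every edge must reference existing nodes."""
    ids = set(node_ids)
    return all(src in ids and dst in ids for src, dst in edges)

def has_cycle(node_ids, edges):
    """Criterion 3: detect cycles with a colored depth-first search."""
    adjacency = {n: [] for n in node_ids}
    for src, dst in edges:
        adjacency[src].append(dst)
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited / on the DFS stack / done
    color = {n: WHITE for n in node_ids}

    def dfs(node):
        color[node] = GREY
        for nxt in adjacency[node]:
            if color[nxt] == GREY:   # back edge -> cycle
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in node_ids)
```

Branching detection (also criterion 3) reduces to checking for nodes with more than one outgoing edge in the same adjacency map.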

Requirement 8

User Story: As a system developer, I want to implement the progressive learning states (OBSERVATION, COACHING, AUTO_CANDIDATE, AUTO_CONFIRMÉ), so that the system can learn gradually.

Acceptance Criteria

  1. WHEN THE System initializes a workflow THEN THE System SHALL set learning state to OBSERVATION
  2. WHEN THE System has observed a workflow 5 times with similarity > 0.90 THEN THE System SHALL transition to COACHING state
  3. WHEN THE System has assisted a workflow 10 times with success rate > 0.90 THEN THE System SHALL transition to AUTO_CANDIDATE state
  4. WHEN THE System has executed a workflow 20 times in AUTO_CANDIDATE with success rate > 0.95 THEN THE System SHALL be eligible for AUTO_CONFIRMÉ state
  5. WHEN THE System transitions learning state THEN THE System SHALL log the transition with reason and timestamp
  6. WHEN THE System is in AUTO_CONFIRMÉ state and confidence drops below 0.90 THEN THE System SHALL rollback to COACHING state

Requirement 9

User Story: As a system developer, I want to match the current ScreenState against existing WorkflowNodes, so that the system can recognize which workflow state it is in.

Acceptance Criteria

  1. WHEN THE System matches a ScreenState THEN THE System SHALL compute State Embedding for current screen
  2. WHEN THE System matches a ScreenState THEN THE System SHALL search FAISS index for similar node prototypes
  3. WHEN THE System matches a ScreenState THEN THE System SHALL validate window constraints for candidate nodes
  4. WHEN THE System matches a ScreenState THEN THE System SHALL validate required text patterns for candidate nodes
  5. WHEN THE System matches a ScreenState THEN THE System SHALL validate required UI elements for candidate nodes
  6. WHEN THE System matches a ScreenState THEN THE System SHALL return best matching node with confidence score
  7. WHEN THE System matches a ScreenState and no node matches above threshold THEN THE System SHALL return null match
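The matching pipeline can be sketched with a brute-force scan standing in for the FAISS search of criterion 2, and a pluggable predicate standing in for the window/text/UI checks of criteria 3-5 (both substitutions are assumptions made to keep the sketch self-contained):

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def match_screen_state(state_embedding, nodes, constraints_ok=lambda node_id: True):
    """Return (best_node_id, score), or (None, 0.0) when nothing clears its threshold.

    `nodes` maps node_id -> (prototype_vector, min_similarity), mirroring the
    per-node threshold of Requirement 5, criterion 5.
    """
    best_id, best_score = None, 0.0
    for node_id, (prototype, threshold) in nodes.items():
        score = cosine(state_embedding, prototype)
        # Criteria 3-5 collapse into constraints_ok here; criterion 7 is the null match.
        if score >= threshold and constraints_ok(node_id) and score > best_score:
            best_id, best_score = node_id, score
    return best_id, best_score
```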

Requirement 10

User Story: As a system developer, I want to execute the actions defined in WorkflowEdges by finding UIElements by role, so that workflow automation is robust.

Acceptance Criteria

  1. WHEN THE System executes a WorkflowEdge THEN THE System SHALL find target UIElement by semantic role in current ScreenState
  2. WHEN THE System executes a mouse_click action THEN THE System SHALL click on the center of the matched UIElement
  3. WHEN THE System executes a text_input action THEN THE System SHALL type text into the matched UIElement
  4. WHEN THE System executes a compound action THEN THE System SHALL execute all steps in sequence
  5. WHEN THE System executes an action THEN THE System SHALL wait for post-conditions to be satisfied
  6. WHEN THE System executes an action THEN THE System SHALL verify transition to expected target node
  7. WHEN THE System executes an action and post-conditions fail THEN THE System SHALL log failure and rollback if possible
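Criteria 1-2 (find by role, click the center) can be sketched as follows. The element dictionaries and the (left, top, width, height) bounding-box convention are assumptions:

```python
def find_element_by_role(elements, role, policy="first"):
    """Criterion 1: pick the target UIElement by semantic role.

    `policy` mirrors the selection policy of Requirement 6 (first / last);
    by_similarity would need element embeddings and is omitted here.
    """
    matches = [e for e in elements if e["role"] == role]
    if not matches:
        return None
    return matches[-1] if policy == "last" else matches[0]

def click_point(element):
    """Criterion 2: the center of the matched element's bounding box."""
    x, y, w, h = element["bbox"]   # (left, top, width, height) - an assumption
    return (x + w // 2, y + h // 2)
```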

Requirement 11

User Story: As a system developer, I want to automatically detect repeated patterns in RawSessions, so that Workflow Graphs can be built without manual intervention.

Acceptance Criteria

  1. WHEN THE System analyzes a RawSession THEN THE System SHALL group events by window context
  2. WHEN THE System analyzes a RawSession THEN THE System SHALL create ScreenStates for all screenshots
  3. WHEN THE System analyzes a RawSession THEN THE System SHALL compute State Embeddings for all ScreenStates
  4. WHEN THE System analyzes a RawSession THEN THE System SHALL detect repeated sequences using embedding similarity
  5. WHEN THE System detects a repeated sequence THEN THE System SHALL cluster similar ScreenStates into candidate nodes
  6. WHEN THE System detects a repeated sequence THEN THE System SHALL identify transitions as candidate edges
  7. WHEN THE System detects a repeated sequence with 3+ repetitions THEN THE System SHALL propose a Workflow Graph
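Criteria 5 and 7 can be sketched with a greedy similarity clustering followed by an n-gram count over the resulting cluster labels. The greedy strategy and the 0.90 clustering threshold are assumptions; only the 3-repetition minimum comes from criterion 7:

```python
import math
from collections import Counter

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def cluster_states(embeddings, threshold=0.90):
    """Criterion 5: assign each state to the first similar-enough cluster."""
    labels, centroids = [], []
    for emb in embeddings:
        for i, centroid in enumerate(centroids):
            if cosine(emb, centroid) >= threshold:
                labels.append(i)
                break
        else:
            centroids.append(emb)        # no match: open a new candidate node
            labels.append(len(centroids) - 1)
    return labels

def repeated_sequences(labels, length=2, min_repeats=3):
    """Criterion 7: label sub-sequences seen at least min_repeats times."""
    grams = Counter(tuple(labels[i:i + length])
                    for i in range(len(labels) - length + 1))
    return [list(g) for g, n in grams.items() if n >= min_repeats]
```

A real implementation would update centroids incrementally and mine variable-length sequences; this fixed-length n-gram count only illustrates the repetition test.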

Requirement 12

User Story: As a system developer, I want to persist all artifacts (ScreenStates, Embeddings, Workflow Graphs) in a structured way, so that they can be reloaded and analyzed.

Acceptance Criteria

  1. WHEN THE System saves a ScreenState THEN THE System SHALL write JSON file with schema version
  2. WHEN THE System saves embeddings THEN THE System SHALL write numpy arrays to .npy files
  3. WHEN THE System saves embeddings THEN THE System SHALL add vectors to FAISS index
  4. WHEN THE System saves a Workflow Graph THEN THE System SHALL write JSON file with all nodes and edges
  5. WHEN THE System loads a Workflow Graph THEN THE System SHALL deserialize JSON and reconstruct graph structure
  6. WHEN THE System loads embeddings THEN THE System SHALL load FAISS index and metadata mappings
  7. WHEN THE System saves artifacts THEN THE System SHALL organize files by date and workflow_id

Requirement 13

User Story: As a system developer, I want to validate the quality of State Embeddings, so that I can ensure they discriminate well between different states.

Acceptance Criteria

  1. WHEN THE System validates State Embeddings THEN THE System SHALL compute intra-node similarity (states of same node should be similar)
  2. WHEN THE System validates State Embeddings THEN THE System SHALL compute inter-node similarity (states of different nodes should be dissimilar)
  3. WHEN THE System validates State Embeddings THEN THE System SHALL compute embedding quality score as ratio of intra/inter similarity
  4. WHEN THE System validates State Embeddings and quality score is below 0.70 THEN THE System SHALL log warning
  5. WHEN THE System validates State Embeddings THEN THE System SHALL report discriminative power metric
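One way to read criteria 1-3 is the ratio of mean intra-node to mean inter-node similarity; the exact formula is not fixed by the requirement, so this is a sketch of that interpretation:

```python
import math
from itertools import combinations

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def embedding_quality(clusters):
    """Quality score = mean intra-node / mean inter-node similarity.

    `clusters` maps node_id -> list of state embeddings. States of the
    same node should be similar (criterion 1), states of different nodes
    dissimilar (criterion 2); a score below 0.70 warrants a warning
    (criterion 4).
    """
    intra = [cosine(a, b)
             for states in clusters.values()
             for a, b in combinations(states, 2)]
    inter = [cosine(a, b)
             for (_, s1), (_, s2) in combinations(clusters.items(), 2)
             for a in s1 for b in s2]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    inter_mean = mean(inter)
    return mean(intra) / inter_mean if inter_mean else float("inf")
```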

Requirement 14

User Story: As a system developer, I want to handle matching and execution errors robustly, so that the system is resilient to UI changes.

Acceptance Criteria

  1. WHEN THE System fails to match a ScreenState to any node THEN THE System SHALL log the unmatched state with screenshot
  2. WHEN THE System fails to find a target UIElement by role THEN THE System SHALL try fallback strategies (visual similarity, position)
  3. WHEN THE System fails to execute an action THEN THE System SHALL log the failure with context
  4. WHEN THE System detects UI change (similarity drop) THEN THE System SHALL pause execution and notify user
  5. WHEN THE System is in AUTO_CONFIRMÉ and confidence drops THEN THE System SHALL rollback to COACHING state
  6. WHEN THE System encounters repeated failures on same edge THEN THE System SHALL mark edge as problematic

Requirement 15

User Story: As a system developer, I want to optimize system performance, so that matching and execution are fast (< 400ms).

Acceptance Criteria

  1. WHEN THE System computes State Embedding THEN THE System SHALL complete in less than 100ms
  2. WHEN THE System matches ScreenState against nodes THEN THE System SHALL complete FAISS search in less than 50ms
  3. WHEN THE System detects UI elements THEN THE System SHALL complete detection in less than 200ms
  4. WHEN THE System executes an action THEN THE System SHALL complete execution in less than 50ms
  5. WHEN THE System processes a ScreenState end-to-end THEN THE System SHALL complete in less than 400ms total
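The per-stage budgets in criteria 1-4 can be checked with a small timing wrapper; the stage names and helper are illustrative, only the millisecond budgets come from the criteria:

```python
import time

BUDGETS_MS = {                    # per-stage budgets from criteria 1-4
    "state_embedding": 100,
    "faiss_search": 50,
    "ui_detection": 200,
    "action_execution": 50,
}

def timed(stage, fn, *args, **kwargs):
    """Run one pipeline stage and report whether it met its latency budget."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # 400 ms is the end-to-end fallback budget (criterion 5).
    return result, elapsed_ms, elapsed_ms <= BUDGETS_MS.get(stage, 400)
```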