test(e2e): reproducible replay harness (Léa V1 mock client against the real server)

Cuts the debug cycle for a workflow from 1-2 min (manual replay through Windows + Léa V1 + mock-up) to ~2-5 s (Linux mock client against the streaming server on localhost:5005). Up to 30-60× faster.

Architecture:
- tools/test_replay_e2e.py: CLI harness (~580 lines) that reproduces the real chain: VWB /api/v3/execute-windows → streaming /replay/raw → harness-side /replay/next loop with resolve_target on a fixture screenshot → POST /replay/result. No server changes.
- tests/e2e/test_urgence_aiva_demo.py: pytest wrapper (smoke test).
- tests/e2e/urgence_aiva_demo_expected.yaml: reference generated by --export-expected, used for automatic regression comparison.
- pytest.ini: adds the e2e marker.

Usage:
    python tools/test_replay_e2e.py --execution-mode autonomous --max-iter 120 --verbose
    python tools/test_replay_e2e.py --single-step 8 --shot <heartbeat>.png
    python tools/test_replay_e2e.py --expected tests/e2e/urgence_aiva_demo_expected.yaml
    pytest tests/e2e -v -m e2e

Output: Markdown table of step × method × score × pos × status × diag.

Known limitations (post-demo extensions):
- A single fixture screenshot is used for the whole replay, so realistic click_anchor steps fail as soon as the replay moves past the fixture screen. A step_id → fixture map is planned.
- extract_text/table/t2a_decision run server-side; they can be observed but not modified.
- screenshot_after is not simulated, so ReplayVerifier (Critic VLM) does not run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 22:11:07 +02:00
parent 7847a0e829
commit 35fd6cf4c5
5 changed files with 1012 additions and 0 deletions
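The poll loop described in the commit message (/replay/next → resolve_target → POST /replay/result) can be sketched roughly as follows. This is a minimal illustration, not the harness code: the `Transport` callable, the `StepReport` fields, the payload keys (`session_id`, `step_id`, `type`, `expected_pos`), and the simplified `resolve_target` stand-in are all assumptions, since tools/test_replay_e2e.py is not shown here. A fake in-memory transport replaces the real HTTP calls so the sketch runs without any server.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical transport: (method, path, payload) -> JSON-like dict.
# In a real harness this would wrap HTTP calls to the streaming server.
Transport = Callable[[str, str, dict], dict]


@dataclass
class StepReport:
    step_id: str
    action_type: str
    status: str


def resolve_target(action: dict, screenshot_path: str) -> Optional[tuple]:
    """Stand-in for target resolution on the fixture screenshot.

    Assumption: it simply returns the position carried by the action, if any.
    The real resolver is not shown in this commit.
    """
    pos = action.get("expected_pos")
    return tuple(pos) if pos else None


def run_replay(transport: Transport, session_id: str,
               screenshot_path: str, max_iter: int = 120) -> list:
    """Poll for actions, resolve them locally, and report each result."""
    reports = []
    for _ in range(max_iter):
        nxt = transport("GET", "/replay/next", {"session_id": session_id})
        action = nxt.get("action")
        if action is None:  # nothing left to replay
            break
        if action["type"] == "click_anchor":
            ok = resolve_target(action, screenshot_path) is not None
        else:
            ok = True  # e.g. a wait step needs no target
        status = "OK" if ok else "FAIL"
        transport("POST", "/replay/result", {
            "session_id": session_id,
            "step_id": action["step_id"],
            "status": status,
        })
        reports.append(StepReport(action["step_id"], action["type"], status))
    return reports


def make_fake_transport(actions: list):
    """Build an in-memory transport that serves a fixed list of actions."""
    queue = list(actions)
    posted = []

    def transport(method: str, path: str, payload: dict) -> dict:
        if path == "/replay/next":
            return {"action": queue.pop(0) if queue else None}
        posted.append(payload)  # record every reported result
        return {"ok": True}

    return transport, posted
```

Injecting the transport keeps the loop testable offline; the `max_iter` bound mirrors the `--max-iter` flag so a stuck replay cannot spin forever.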
--- a/tests/e2e/test_urgence_aiva_demo.py
+++ b/tests/e2e/test_urgence_aiva_demo.py
@@ -0,0 +1,118 @@
+"""E2E tests for the Urgence_aiva_demo workflow via the mock client harness.
+
+Markers: @pytest.mark.e2e @pytest.mark.slow
+Prerequisites: streaming server (5005) and VWB (5002) running.
+
+Run with:
+    pytest tests/e2e -v -m e2e
+
+The test is a smoke check: it verifies that a replay can be started, that
+actions can be polled, and that the harness finishes without crashing. It does
+NOT require every step to succeed (the fixture screenshot may be stale).
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+import requests
+
+from tools.test_replay_e2e import (
+    ReplayMockClient,
+    _find_latest_heartbeat,
+    _load_token,
+    DEFAULT_BASE_URL,
+    DEFAULT_VWB_URL,
+)
+
+WORKFLOW_ID = "wf_a38aeebea5e6_1778162737"  # Urgence_aiva_demo
+
+
+def _server_alive(url: str, timeout: float = 2.0) -> bool:
+    try:
+        resp = requests.get(f"{url}/health", timeout=timeout)
+        return resp.status_code == 200
+    except Exception:
+        return False
+
+
+def _vwb_alive(url: str, timeout: float = 2.0) -> bool:
+    try:
+        # VWB has no /health endpoint; probe /api/v3/session/state (a 404 still proves it is up)
+        resp = requests.get(f"{url}/api/v3/session/state", timeout=timeout)
+        return resp.status_code in (200, 404)
+    except Exception:
+        return False
+
+
+@pytest.fixture(scope="module")
+def streaming_url() -> str:
+    if not _server_alive(DEFAULT_BASE_URL):
+        pytest.skip(f"Streaming server not running at {DEFAULT_BASE_URL}")
+    return DEFAULT_BASE_URL
+
+
+@pytest.fixture(scope="module")
+def vwb_url() -> str:
+    if not _vwb_alive(DEFAULT_VWB_URL):
+        pytest.skip(f"VWB backend not running at {DEFAULT_VWB_URL}")
+    return DEFAULT_VWB_URL
+
+
+@pytest.fixture(scope="module")
+def heartbeat() -> str:
+    path = _find_latest_heartbeat()
+    if not path or not Path(path).exists():
+        pytest.skip("No heartbeat fixture available on disk")
+    return path
+
+
+@pytest.mark.e2e
+@pytest.mark.slow
+def test_urgence_aiva_demo_smoke(streaming_url, vwb_url, heartbeat):
+    """Smoke test: start and run the Urgence_aiva_demo workflow via the harness.
+
+    Checks that:
+    - the harness can compile and start the replay (no network exception)
+    - at least a few steps are reported (the chain is running)
+    - no unhandled exception is raised
+    """
+    import time as _time
+    import uuid as _uuid
+
+    ts = _time.strftime("%Y%m%dT%H%M%S")
+    client = ReplayMockClient(
+        base_url=streaming_url,
+        vwb_url=vwb_url,
+        token=_load_token(),
+        session_id=f"test_e2e_pytest_{ts}_{_uuid.uuid4().hex[:6]}",
+        machine_id=f"test_e2e_pytest_machine_{ts}",
+        screenshot_path=heartbeat,
+        verbose=False,
+        auto_resume=True,
+        execution_mode="autonomous",
+        timeout_poll=10.0,
+        single_step=None,
+        max_iter=80,
+    )
+
+    try:
+        client.cancel_stale_replays()
+        client.register_session()
+        info = client.start_replay(WORKFLOW_ID)
+        assert info.get("replay_id"), f"replay_id missing: {info}"
+        assert info.get("total_actions", 0) > 0
+        client.run()
+    finally:
+        try:
+            client.cancel_replay()
+        except Exception:
+            pass
+
+    # The harness must have produced at least one report
+    assert len(client.reports) > 0, "No action reported; is the harness broken?"
+
+    # The first step is a synthetic wait injected by VWB, so it must be OK
+    first = client.reports[0]
+    assert first.action_type == "wait", f"unexpected first step: {first}"
+    assert first.status == "OK"
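The commit message describes the harness output as a Markdown table (step × method × score × pos × status × diag) and mentions regression comparison against an expected reference. A rough sketch of both ideas is given below. The `Row` fields, the table layout, and the `expected` mapping of step number to status are assumptions made for illustration; the actual report structure and the YAML schema of urgence_aiva_demo_expected.yaml are not shown in this diff.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical per-step report; field names follow the table columns.
    step: int
    method: str
    score: float
    pos: tuple | None
    status: str
    diag: str = ""


def to_markdown(rows: list) -> str:
    """Render one Markdown table line per step."""
    lines = [
        "| step | method | score | pos | status | diag |",
        "|---|---|---|---|---|---|",
    ]
    for r in rows:
        pos = f"({r.pos[0]}, {r.pos[1]})" if r.pos else "-"
        lines.append(
            f"| {r.step} | {r.method} | {r.score:.2f} | {pos} | {r.status} | {r.diag} |"
        )
    return "\n".join(lines)


def diff_against_expected(rows: list, expected: dict) -> list:
    """List steps whose observed status differs from the expected one.

    `expected` maps step number -> status (assumed shape); steps absent
    from the reference are ignored rather than treated as failures.
    """
    mismatches = []
    for r in rows:
        want = expected.get(r.step)
        if want is not None and want != r.status:
            mismatches.append(f"step {r.step}: expected {want}, got {r.status}")
    return mismatches
```

Comparing only statuses keeps the reference stable across small score or position drift; a stricter check could add tolerances on `score` and `pos`.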