
# Pix2Struct vs CLIP Benchmark Results

**Date:** 2024-11-19

## Test Configuration

- Hardware: AMD Ryzen 9 9950X, RTX 5070 12 GB
- Device: CPU (both models run on CPU for a fair comparison)
- Test Images: 5 synthetic UI screenshots containing buttons

## Results Summary

| Metric | CLIP ViT-B/32 | Pix2Struct Base | Winner |
|---|---|---|---|
| Embedding Dimension | 512 | 768 | - |
| Time per Image | 19.78 ms | 2895.68 ms | CLIP (146x faster) |
| UI Discrimination | 0.1636 | 0.0178 | CLIP (9x better) |
| Model Size | ~350 MB | ~1.13 GB | CLIP |

## Detailed Analysis

### Speed

- CLIP: 19.78 ms per image (batch mode)
- Pix2Struct: 2895.68 ms per image (batch mode)
- Verdict: CLIP is 146x faster
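
The per-image timings above can be reproduced with a simple wall-clock harness; a minimal sketch, assuming a batched `embed_batch(images)` callable (hypothetical name) for each model:

```python
import time

def time_per_image(embed_batch, images):
    """Average wall-clock embedding time per image, in milliseconds."""
    start = time.perf_counter()
    embed_batch(images)  # one batched forward pass over all images
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / len(images)

# Stand-in embedder for illustration; real runs would call CLIP / Pix2Struct.
fake_embed = lambda imgs: [[0.0] * 512 for _ in imgs]
ms_per_image = time_per_image(fake_embed, ["img"] * 5)
```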

### Accuracy (UI Discrimination)

Test: distinguish a "Submit" button from a "Cancel" button. Discrimination is the same-pair similarity minus the cross-pair similarity (higher is better).

- CLIP:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.8364
  - Discrimination: 0.1636
- Pix2Struct:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.9822
  - Discrimination: 0.0178

Verdict: CLIP discriminates 9x better between different UI elements.
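
The metric above can be sketched in a few lines; this assumes cosine similarity over the raw embeddings (consistent with the 1.0000 self-similarity scores, but an assumption on my part):

```python
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def discrimination(ref, same, other):
    """Same-pair similarity minus cross-pair similarity (higher = better)."""
    return cosine_similarity(ref, same) - cosine_similarity(ref, other)

# An identical pair scores 1.0, so discrimination reduces to
# 1 - cross-pair similarity, e.g. 1.0000 - 0.8364 = 0.1636 for CLIP.
```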

## Why Pix2Struct Underperforms

1. Not optimized for simple UI elements: Pix2Struct targets complex documents and structured layouts, not isolated buttons
2. Encoder pooling: we mean-pool the encoder hidden states, which can discard the spatial information that separates visually similar buttons
3. Training data mismatch: Pix2Struct was pretrained on web page screenshots and documents, while our test images are far simpler
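
Point 2 can be made concrete: mean pooling collapses every patch position into one vector, averaging away *where* things are. A minimal NumPy sketch of the pooling step (the real pipeline operates on the model's torch tensors):

```python
import numpy as np

def mean_pool(encoder_states, attention_mask):
    """Reduce (seq_len, dim) encoder states to a single (dim,) embedding.

    Every unmasked patch contributes equally, so two buttons with similar
    pixels in different positions can pool to near-identical embeddings.
    """
    states = np.asarray(encoder_states, dtype=float)          # (seq_len, dim)
    mask = np.asarray(attention_mask, dtype=float)[:, None]   # (seq_len, 1)
    return (states * mask).sum(axis=0) / mask.sum()
```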

## Recommendation

**Use CLIP for GeniusIA v2 RPA.**

Reasons:

1. Much faster (146x) - critical for real-time RPA
2. Better discrimination - more accurate workflow matching
3. Smaller model - less memory, faster loading
4. Already working well - proven in our tests

When to consider Pix2Struct:

- Complex document understanding
- Layout-heavy applications
- When inference latency is not critical (~3 s per image on CPU)

## Configuration

For GeniusIA v2, use:

```python
embedding_manager = EmbeddingManager(model_name="clip")
```

Pix2Struct remains available as an option but is not recommended for this use case.
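
For illustration, the selection logic behind `EmbeddingManager` might look like the sketch below; this is a hypothetical reconstruction, not the actual GeniusIA v2 class:

```python
class EmbeddingManager:
    """Hypothetical sketch of model selection; the real class may differ."""

    # Supported backends mapped to their embedding dimensions (from the table).
    SUPPORTED = {"clip": 512, "pix2struct": 768}

    def __init__(self, model_name: str = "clip"):
        if model_name not in self.SUPPORTED:
            raise ValueError(f"unknown model: {model_name!r}")
        self.model_name = model_name
        self.dim = self.SUPPORTED[model_name]

embedding_manager = EmbeddingManager(model_name="clip")
```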

## Future Work

If we want to improve beyond CLIP:

1. Fine-tune CLIP on RPA-specific data (Phase 3)
2. Try DINOv2 (Meta) - strong general-purpose visual features
3. Try SigLIP (Google) - an improved CLIP-style contrastive model
4. Train a custom lightweight CNN specifically for UI elements
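
For item 1, CLIP-style fine-tuning typically optimizes a symmetric InfoNCE objective over matched (screenshot, label) embedding pairs. A minimal NumPy sketch of that loss, as an assumed setup rather than the Phase 3 implementation:

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched (image, text) embedding pairs.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarity logits
    labels = np.arange(len(img))            # matched pairs sit on the diagonal

    def xent(l):
        # Numerically stable cross-entropy against the diagonal labels.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2.0
```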