# Pix2Struct vs CLIP Benchmark Results

## Date

2024-11-19

## Test Configuration

- **Hardware**: AMD Ryzen 9 9950X, RTX 5070 12GB
- **Device**: CPU (for a fair comparison)
- **Test Images**: 5 synthetic UI screenshots with buttons

## Results Summary

| Metric | CLIP ViT-B/32 | Pix2Struct Base | Winner |
|--------|---------------|-----------------|--------|
| **Embedding Dimension** | 512 | 768 | - |
| **Time per Image** | 19.78 ms | 2895.68 ms | **CLIP (146x faster)** |
| **UI Discrimination** | 0.1636 | 0.0178 | **CLIP (9x better)** |
| **Model Size** | ~350 MB | ~1.13 GB | **CLIP** |

## Detailed Analysis

### Speed

- **CLIP**: 19.78 ms per image (batch mode)
- **Pix2Struct**: 2895.68 ms per image (batch mode)
- **Verdict**: CLIP is **146x faster**

### Accuracy (UI Discrimination)

Test: distinguish a "Submit" button from a "Cancel" button.

- **CLIP**:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.8364
  - **Discrimination: 0.1636** ✅
- **Pix2Struct**:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.9822
  - **Discrimination: 0.0178** ❌

**Verdict**: CLIP discriminates **9x better** between different UI elements.

### Why Pix2Struct Underperforms

1. **Not optimized for simple UI elements**: Pix2Struct is designed for complex documents and structured layouts, not simple buttons.
2. **Encoder pooling**: We use mean pooling of the encoder states, which may lose spatial information.
3. **Training data mismatch**: Pix2Struct was trained on documents/screenshots, while our test images are very simple.

## Recommendation

**Use CLIP for GeniusIA v2 RPA**

Reasons:

1. ✅ **Much faster** (146x) - critical for real-time RPA
2. ✅ **Better discrimination** - more accurate workflow matching
3. ✅ **Smaller model** - less memory, faster loading
4.
✅ **Already working well** - proven in tests

**When to consider Pix2Struct:**

- Complex document understanding
- Layout-heavy applications
- When slow inference (~3 s per image) is acceptable

## Configuration

For GeniusIA v2, use:

```python
embedding_manager = EmbeddingManager(model_name="clip")
```

Pix2Struct remains available as an option but is **not recommended** for this use case.

## Future Work

If we want to improve beyond CLIP:

1. **Fine-tune CLIP** on RPA-specific data (Phase 3)
2. Try **DINOv2** (Meta) - good general-purpose visual features
3. Try **SigLIP** (Google) - an improved CLIP variant
4. Train a custom **lightweight CNN** specifically for UI elements
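## Appendix: Metric Sketches

For reference, the discrimination score reported in the Accuracy section is simply 1 minus the cosine similarity of two L2-normalized embeddings. This is a minimal sketch; the 4-d vectors are hypothetical placeholders standing in for the real 512-d CLIP (or 768-d Pix2Struct) embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

def discrimination(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """1 - cosine similarity: higher means the model separates the inputs better."""
    return 1.0 - cosine_similarity(emb_a, emb_b)

# Hypothetical low-dimensional embeddings (illustration only).
submit = np.array([1.0, 0.2, 0.0, 0.5])
cancel = np.array([0.8, 0.6, 0.3, 0.1])

print(f"Submit vs Submit: {discrimination(submit, submit):.4f}")  # ~0 by construction
print(f"Submit vs Cancel: {discrimination(submit, cancel):.4f}")
```

By this definition, an identical pair scores 0 (similarity 1.0000, as in the tables above), and larger values mean the embeddings are easier to tell apart.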
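The per-image latencies in the Speed section can be measured with a simple wall-clock harness. This sketch assumes an `embed_fn` callable (hypothetical name) wrapping whichever model is under test; a warm-up pass keeps model loading and cache effects out of the measurement:

```python
import time

def time_per_image(embed_fn, images, warmup: int = 1) -> float:
    """Average wall-clock milliseconds per image for an embedding callable."""
    for img in images[:warmup]:
        embed_fn(img)  # warm-up: model load, JIT, caches
    start = time.perf_counter()
    for img in images:
        embed_fn(img)
    elapsed = time.perf_counter() - start
    return elapsed / len(images) * 1000.0  # ms per image
```

With 5 test images this is a coarse estimate; averaging over more images (or repeated passes) would tighten the numbers, but the ~146x gap reported above is far larger than that noise.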