# Pix2Struct vs CLIP Benchmark Results

**Date:** 2024-11-19

## Test Configuration
- Hardware: AMD Ryzen 9 9950X, RTX 5070 12GB
- Device: CPU (both models run on CPU for a fair comparison)
- Test Images: 5 synthetic UI screenshots with buttons
## Results Summary
| Metric | CLIP ViT-B/32 | Pix2Struct Base | Winner |
|---|---|---|---|
| Embedding Dimension | 512 | 768 | - |
| Time per Image | 19.78ms | 2895.68ms | CLIP (146x faster) |
| UI Discrimination | 0.1636 | 0.0178 | CLIP (9x better) |
| Model Size | ~350MB | ~1.13GB | CLIP |
## Detailed Analysis

### Speed
- CLIP: 19.78ms per image (batch mode)
- Pix2Struct: 2895.68ms per image (batch mode)
- Verdict: CLIP is 146x faster
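The per-image numbers above come from batch-mode wall-clock timing; a minimal harness is sketched below. `embed_batch` is a hypothetical stand-in for the actual model call (CLIP or Pix2Struct inference), not the real benchmark code.

```python
import time

import numpy as np


def embed_batch(images):
    # Hypothetical stand-in for the real model call; returns one
    # embedding per image (512-d, as CLIP ViT-B/32 would).
    return np.random.rand(len(images), 512)


def time_per_image(images, n_runs=3):
    """Average wall-clock time per image in batch mode, in milliseconds."""
    embed_batch(images)  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(n_runs):
        embed_batch(images)
    elapsed = time.perf_counter() - start
    return elapsed / (n_runs * len(images)) * 1000.0


images = [object()] * 5  # placeholders for the 5 synthetic screenshots
print(f"{time_per_image(images):.2f} ms per image")
```

Averaging over several runs after a warm-up pass avoids counting one-time costs (model loading, allocator warm-up) against per-image latency.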
### Accuracy (UI Discrimination)

Test: distinguish a "Submit" button from a "Cancel" button.

- CLIP:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.8364
  - Discrimination: 0.1636 ✅
- Pix2Struct:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.9822
  - Discrimination: 0.0178 ❌

Verdict: CLIP discriminates 9x better between different UI elements.
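The discrimination score is the gap between same-pair and different-pair similarity. A minimal sketch with cosine similarity and placeholder random embeddings (real scores would use actual model embeddings of the button screenshots):

```python
import numpy as np


def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def discrimination(same_a, same_b, different):
    """Same-pair similarity minus different-pair similarity.

    Higher means the model separates the two UI elements better."""
    return cosine_sim(same_a, same_b) - cosine_sim(same_a, different)


# Placeholder 512-d embeddings standing in for a "Submit" button
# and a "Cancel" button.
rng = np.random.default_rng(0)
submit = rng.normal(size=512)
cancel = rng.normal(size=512)

# Identical image vs itself gives similarity 1.0000, as in the table.
print(f"discrimination: {discrimination(submit, submit, cancel):.4f}")
```

With this definition, CLIP's score of 0.1636 vs Pix2Struct's 0.0178 is exactly the 1.0000 − 0.8364 and 1.0000 − 0.9822 gaps from the table.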
## Why Pix2Struct Underperforms
- Not optimized for simple UI elements: Pix2Struct is designed for complex documents and structured layouts, not simple buttons
- Encoder pooling: We use mean pooling of encoder states, which may lose spatial information
- Training data mismatch: Pix2Struct was trained on documents/screenshots, but our test is very simple
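The pooling point can be shown directly: mean pooling over patch tokens is permutation-invariant, so two layouts containing the same patches in different positions collapse to the same vector. A toy numpy sketch (the arrays stand in for encoder hidden states; the real states would come from the Pix2Struct encoder):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "encoder states": 9 patch tokens of dimension 4, standing in
# for a 3x3 grid of image patches.
patches = rng.normal(size=(9, 4))

# Same patches, spatially rearranged (e.g. the button moved elsewhere).
shuffled = patches[rng.permutation(9)]

pooled_a = patches.mean(axis=0)
pooled_b = shuffled.mean(axis=0)

# Mean pooling discards patch order, so both layouts embed identically.
print(np.allclose(pooled_a, pooled_b))  # → True
```

This is one plausible reason the pooled Pix2Struct embeddings barely separate visually similar buttons: position and layout information is averaged away.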
## Recommendation

**Use CLIP for GeniusIA v2 RPA.**
Reasons:
- ✅ Much faster (146x) - critical for real-time RPA
- ✅ Better discrimination - more accurate workflow matching
- ✅ Smaller model - less memory, faster loading
- ✅ Already working well - proven in tests
When to consider Pix2Struct:
- Complex document understanding
- Layout-heavy applications
- When slow inference (~3 s per image) is acceptable
## Configuration

For GeniusIA v2, use:

```python
embedding_manager = EmbeddingManager(model_name="clip")
```
Pix2Struct remains available as an option but is not recommended for this use case.
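A minimal sketch of what the dispatch behind that call might look like; the real `EmbeddingManager` interface is assumed, not shown here, and this class is purely illustrative:

```python
class EmbeddingManager:
    """Hypothetical sketch: selects an embedding backend by name.

    The real class presumably wraps the actual CLIP / Pix2Struct models."""

    SUPPORTED = ("clip", "pix2struct")

    def __init__(self, model_name="clip"):
        if model_name not in self.SUPPORTED:
            raise ValueError(f"unknown model: {model_name!r}")
        self.model_name = model_name

    def embed(self, image):
        # Placeholder: the real implementation would run model inference
        # and return an embedding vector.
        raise NotImplementedError


manager = EmbeddingManager(model_name="clip")
print(manager.model_name)  # → clip
```

Keeping `"pix2struct"` in the supported list matches the note above: it stays available as an option even though CLIP is the recommended default.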
## Future Work
If we want to improve beyond CLIP:
- Fine-tune CLIP on RPA-specific data (Phase 3)
- Try DINOv2 (Meta) - good for visual features
- Try SigLIP (Google) - improved CLIP variant
- Custom lightweight CNN trained specifically for UI elements