# Pix2Struct vs CLIP Benchmark Results

## Date

2024-11-19

## Test Configuration

- **Hardware**: AMD Ryzen 9 9950X, RTX 5070 12GB
- **Device**: CPU (both models on CPU, for a fair comparison)
- **Test Images**: 5 synthetic UI screenshots with buttons

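The test images are simple synthetic screenshots. A hedged sketch of how such images can be generated with Pillow; the sizes, colors, and labels here are our assumptions, not the benchmark's exact script:

```python
from PIL import Image, ImageDraw

def make_button_screenshot(label, size=(320, 120)):
    """Draw a flat button with a text label on a plain background."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # Rounded button roughly centered in the canvas
    draw.rounded_rectangle([60, 30, 260, 90], radius=8, fill="#4a90d9")
    draw.text((140, 52), label, fill="white")
    return img

images = [make_button_screenshot(t) for t in ("Submit", "Cancel")]
```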
## Results Summary

| Metric | CLIP ViT-B/32 | Pix2Struct Base | Winner |
|--------|---------------|-----------------|--------|
| **Embedding Dimension** | 512 | 768 | - |
| **Time per Image** | 19.78 ms | 2895.68 ms | **CLIP (146x faster)** |
| **UI Discrimination** | 0.1636 | 0.0178 | **CLIP (9x better)** |
| **Model Size** | ~350 MB | ~1.13 GB | **CLIP** |

## Detailed Analysis

### Speed

- **CLIP**: 19.78 ms per image (batch mode)
- **Pix2Struct**: 2895.68 ms per image (batch mode)
- **Verdict**: CLIP is **146x faster**

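The per-image timings can be reproduced with a small harness like the one below. This is a sketch, not the benchmark's actual code; `embed_fn` stands in for whatever callable produces an embedding:

```python
import time

def ms_per_image(embed_fn, images, warmup=1, repeats=3):
    """Average wall-clock milliseconds per image for an embedding callable."""
    for _ in range(warmup):           # warm caches / trigger lazy model init
        for img in images:
            embed_fn(img)
    start = time.perf_counter()
    for _ in range(repeats):
        for img in images:
            embed_fn(img)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / (repeats * len(images))
```

Warmup matters here: the first call often pays one-time costs (model loading, JIT) that would skew a 5-image average.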
### Accuracy (UI Discrimination)

Test: distinguish a "Submit" button from a "Cancel" button by comparing embedding similarities. "Discrimination" is the gap between the same-image similarity and the cross-image similarity; a larger gap means the model separates the two elements more clearly.

- **CLIP**:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.8364
  - **Discrimination: 0.1636** ✅

- **Pix2Struct**:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.9822
  - **Discrimination: 0.0178** ❌

**Verdict**: CLIP discriminates **9x better** between different UI elements.

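The discrimination score is just the similarity gap. A minimal sketch, assuming cosine similarity over the embeddings (function names are ours, not the benchmark's code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def discrimination(anchor, same, other):
    """Same-pair similarity minus cross-pair similarity; larger is better."""
    return cosine_sim(anchor, same) - cosine_sim(anchor, other)

# With the measured CLIP similarities: 1.0000 - 0.8364 = 0.1636
```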
### Why Pix2Struct Underperforms

1. **Not optimized for simple UI elements**: Pix2Struct targets complex documents and structured layouts, not isolated buttons
2. **Encoder pooling**: we mean-pool the encoder hidden states, which can discard spatial information
3. **Training data mismatch**: Pix2Struct was pretrained on documents and web screenshots, while our test images are far simpler

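Point 2 refers to collapsing the encoder's per-patch outputs into one vector. A sketch of the operation (shapes are illustrative, using Pix2Struct Base's 768-dim hidden size):

```python
import numpy as np

# Stand-in for encoder output: one 768-dim vector per image patch.
num_patches, hidden_dim = 196, 768
hidden_states = np.random.default_rng(0).normal(size=(num_patches, hidden_dim))

# Mean pooling averages over patch positions, so two layouts containing
# the same parts in different places can yield near-identical embeddings.
pooled = hidden_states.mean(axis=0)  # shape: (768,)
```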
## Recommendation

**Use CLIP for GeniusIA v2 RPA.**

Reasons:

1. ✅ **Much faster** (146x) - critical for real-time RPA
2. ✅ **Better discrimination** - more accurate workflow matching
3. ✅ **Smaller model** - less memory, faster loading
4. ✅ **Already working well** - proven in tests

**When to consider Pix2Struct:**

- Complex document understanding
- Layout-heavy applications
- When ~3 s per image inference latency is acceptable

## Configuration

For GeniusIA v2, use:

```python
embedding_manager = EmbeddingManager(model_name="clip")
```

Pix2Struct remains available as an option but is **not recommended** for this use case.

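For context, a minimal stand-in for how such a manager could dispatch on `model_name`. This is illustrative only; everything beyond the `model_name` argument shown above is our assumption, not the real `EmbeddingManager` API:

```python
class EmbeddingManager:
    """Illustrative stand-in: routes embedding calls to a backend by name."""

    SUPPORTED = {"clip", "pix2struct"}

    def __init__(self, model_name="clip"):
        if model_name not in self.SUPPORTED:
            raise ValueError(f"unknown model: {model_name!r}")
        self.model_name = model_name

    def embed(self, image):
        # A real backend would run the model here; we only record the route.
        return f"{self.model_name}-embedding"

embedding_manager = EmbeddingManager(model_name="clip")
```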
## Future Work

If we want to improve beyond CLIP:

1. **Fine-tune CLIP** on RPA-specific data (Phase 3)
2. Try **DINOv2** (Meta) - strong self-supervised visual features
3. Try **SigLIP** (Google) - an improved CLIP variant
4. Train a custom **lightweight CNN** specifically for UI elements