# Pix2Struct vs CLIP Benchmark Results

## Date

2024-11-19

## Test Configuration

- **Hardware**: AMD Ryzen 9 9950X, RTX 5070 12GB
- **Device**: CPU (both models on CPU, for a fair comparison)
- **Test Images**: 5 synthetic UI screenshots with buttons

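The test images are simple synthetic screenshots. A hedged sketch of how such images can be generated with Pillow; the sizes, colors, and labels here are our assumptions, not the benchmark's exact script:

```python
from PIL import Image, ImageDraw

def make_button_screenshot(label, size=(320, 120)):
    """Draw a flat button with a text label on a plain background."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # Rounded button roughly centered in the canvas
    draw.rounded_rectangle([60, 30, 260, 90], radius=8, fill="#4a90d9")
    draw.text((140, 52), label, fill="white")
    return img

images = [make_button_screenshot(t) for t in ("Submit", "Cancel")]
```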
## Results Summary

| Metric | CLIP ViT-B/32 | Pix2Struct Base | Winner |
|--------|---------------|-----------------|--------|
| **Embedding Dimension** | 512 | 768 | - |
| **Time per Image** | 19.78 ms | 2895.68 ms | **CLIP (146x faster)** |
| **UI Discrimination** | 0.1636 | 0.0178 | **CLIP (9x better)** |
| **Model Size** | ~350 MB | ~1.13 GB | **CLIP** |

## Detailed Analysis

### Speed

- **CLIP**: 19.78 ms per image (batch mode)
- **Pix2Struct**: 2895.68 ms per image (batch mode)
- **Verdict**: CLIP is **146x faster**

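The per-image timings can be reproduced with a small harness like the one below. This is a sketch, not the benchmark's actual code; `embed_fn` stands in for whatever callable produces an embedding:

```python
import time

def ms_per_image(embed_fn, images, warmup=1, repeats=3):
    """Average wall-clock milliseconds per image for an embedding callable."""
    for _ in range(warmup):           # warm caches / trigger lazy model init
        for img in images:
            embed_fn(img)
    start = time.perf_counter()
    for _ in range(repeats):
        for img in images:
            embed_fn(img)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / (repeats * len(images))
```

Warmup matters here: the first call often pays one-time costs (model loading, JIT) that would skew a 5-image average.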
### Accuracy (UI Discrimination)

Test: distinguish a "Submit" button from a "Cancel" button by comparing embedding similarities. "Discrimination" is the gap between the same-image similarity and the cross-image similarity; a larger gap means the model separates the two elements more clearly.

- **CLIP**:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.8364
  - **Discrimination: 0.1636** ✅

- **Pix2Struct**:
  - Submit vs Submit: 1.0000
  - Submit vs Cancel: 0.9822
  - **Discrimination: 0.0178** ❌

**Verdict**: CLIP discriminates **9x better** between different UI elements.

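The discrimination score is just the similarity gap. A minimal sketch, assuming cosine similarity over the embeddings (function names are ours, not the benchmark's code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def discrimination(anchor, same, other):
    """Same-pair similarity minus cross-pair similarity; larger is better."""
    return cosine_sim(anchor, same) - cosine_sim(anchor, other)

# With the measured CLIP similarities: 1.0000 - 0.8364 = 0.1636
```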
### Why Pix2Struct Underperforms

1. **Not optimized for simple UI elements**: Pix2Struct targets complex documents and structured layouts, not isolated buttons
2. **Encoder pooling**: we mean-pool the encoder hidden states, which can discard spatial information
3. **Training data mismatch**: Pix2Struct was pretrained on documents and web screenshots, while our test images are far simpler

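Point 2 refers to collapsing the encoder's per-patch outputs into one vector. A sketch of the operation (shapes are illustrative, using Pix2Struct Base's 768-dim hidden size):

```python
import numpy as np

# Stand-in for encoder output: one 768-dim vector per image patch.
num_patches, hidden_dim = 196, 768
hidden_states = np.random.default_rng(0).normal(size=(num_patches, hidden_dim))

# Mean pooling averages over patch positions, so two layouts containing
# the same parts in different places can yield near-identical embeddings.
pooled = hidden_states.mean(axis=0)  # shape: (768,)
```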
## Recommendation

**Use CLIP for GeniusIA v2 RPA.**

Reasons:

1. ✅ **Much faster** (146x) - critical for real-time RPA
2. ✅ **Better discrimination** - more accurate workflow matching
3. ✅ **Smaller model** - less memory, faster loading
4. ✅ **Already working well** - proven in tests

**When to consider Pix2Struct:**

- Complex document understanding
- Layout-heavy applications
- When ~3 s per image inference latency is acceptable

## Configuration

For GeniusIA v2, use:

```python
embedding_manager = EmbeddingManager(model_name="clip")
```

Pix2Struct remains available as an option but is **not recommended** for this use case.

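For context, a minimal stand-in for how such a manager could dispatch on `model_name`. This is illustrative only; everything beyond the `model_name` argument shown above is our assumption, not the real `EmbeddingManager` API:

```python
class EmbeddingManager:
    """Illustrative stand-in: routes embedding calls to a backend by name."""

    SUPPORTED = {"clip", "pix2struct"}

    def __init__(self, model_name="clip"):
        if model_name not in self.SUPPORTED:
            raise ValueError(f"unknown model: {model_name!r}")
        self.model_name = model_name

    def embed(self, image):
        # A real backend would run the model here; we only record the route.
        return f"{self.model_name}-embedding"

embedding_manager = EmbeddingManager(model_name="clip")
```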
## Future Work

If we want to improve beyond CLIP:

1. **Fine-tune CLIP** on RPA-specific data (Phase 3)
2. Try **DINOv2** (Meta) - strong self-supervised visual features
3. Try **SigLIP** (Google) - an improved CLIP variant
4. Train a custom **lightweight CNN** specifically for UI elements