Initial commit
docs/archive/old-summaries/EMBEDDING_SYSTEM_INTEGRATION_GUIDE.md
# Embedding System Integration Guide

## Overview

The new embedding system is ready to be integrated into GeniusIA v2. This guide explains how to use it.

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                      Orchestrator                       │
│  - Manages workflows                                    │
│  - Collects fine-tuning examples                        │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────┐
│                    EmbeddingManager                     │
│  - Model selection (CLIP recommended)                   │
│  - LRU cache (1000 entries)                             │
│  - Automatic fallback                                   │
└────────────────┬────────────────────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
┌──────────────┐  ┌──────────────────┐
│ CLIPEmbedder │  │ LightweightFine  │
│              │  │ Tuner            │
│ - Embeddings │  │ - Collection     │
│ - Fine-tune  │  │ - Auto trigger   │
└──────────────┘  └──────────────────┘
        │
        ▼
┌──────────────┐
│  FAISSIndex  │
│ - Search     │
│ - Persistence│
└──────────────┘
```

## Using It in the Orchestrator

### 1. Initialization

```python
from geniusia2.core.embedders import EmbeddingManager, LightweightFineTuner, FAISSIndex

class Orchestrator:
    def __init__(self, config):
        # Initialize the embedding system
        self.embedding_manager = EmbeddingManager(
            model_name="clip",  # Recommended
            cache_size=1000,
            fallback_enabled=True
        )

        # Initialize the FAISS index
        self.faiss_index = FAISSIndex(
            dimension=self.embedding_manager.get_dimension()
        )

        # Initialize the fine-tuner
        self.fine_tuner = LightweightFineTuner(
            embedder=self.embedding_manager.embedder,
            trigger_threshold=10,  # Fine-tune every 10 examples
            max_examples=1000
        )

        # Load the checkpoint if one exists
        self.fine_tuner.load_checkpoint("orchestrator_finetuning")
```

### 2. Generating Embeddings

```python
from PIL import Image

# Orchestrator method
def analyze_screenshot(self, screenshot_pil: Image.Image):
    """Analyze a screenshot and generate its embedding."""
    # Generate the embedding (cached automatically)
    embedding = self.embedding_manager.embed(screenshot_pil)

    return embedding
```
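To make the "cached automatically" behaviour concrete, here is a minimal sketch of an LRU cache keyed by a hash of the image bytes. It is an illustration under assumptions, not the `EmbeddingManager` code: `LRUEmbeddingCache`, `get_or_compute` and the choice of SHA-256 as cache key are all hypothetical.

```python
import hashlib
from collections import OrderedDict


class LRUEmbeddingCache:
    """Tiny LRU cache keyed by a hash of the raw image bytes (illustrative only)."""

    def __init__(self, capacity: int = 1000):
        self.capacity = capacity
        self._store = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key_for(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_compute(self, image_bytes: bytes, compute):
        key = self.key_for(image_bytes)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        value = compute(image_bytes)
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = LRUEmbeddingCache(capacity=2)
fake_embed = lambda data: [float(len(data))]  # stand-in for a real model call
cache.get_or_compute(b"a", fake_embed)    # miss
cache.get_or_compute(b"a", fake_embed)    # hit
cache.get_or_compute(b"bb", fake_embed)   # miss
cache.get_or_compute(b"ccc", fake_embed)  # miss, evicts b"a"
print(f"{cache.hit_rate:.0%}")  # 25%
```

With a capacity of 1000, the same mechanism only starts evicting once more than 1000 distinct images have been embedded.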

### 3. Finding Similar Workflows

```python
from PIL import Image

# Orchestrator method
def find_similar_workflows(self, screenshot_pil: Image.Image, k=5):
    """Find similar workflows via FAISS."""
    # Generate the embedding
    embedding = self.embedding_manager.embed(screenshot_pil)

    # Search the FAISS index
    results = self.faiss_index.search(embedding, k=k)

    return results
```
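Conceptually, a k-nearest-neighbour search ranks the stored embeddings by similarity to the query and keeps the best `k`. The brute-force sketch below shows that idea in plain Python. It is only a stand-in for what a vector index computes far more efficiently; the `(workflow_id, score)` result shape, the workflow names and the use of cosine similarity are assumptions, since the guide does not specify what `FAISSIndex.search` returns.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, stored: Dict[str, Sequence[float]], k: int = 5) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(wid, cosine_similarity(query, vec)) for wid, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical workflow embeddings (3 dimensions for readability).
stored = {
    "login_flow": [1.0, 0.0, 0.0],
    "export_report": [0.0, 1.0, 0.0],
    "login_variant": [0.9, 0.1, 0.0],
}
results = top_k([1.0, 0.0, 0.0], stored, k=2)
print([wid for wid, _ in results])  # ['login_flow', 'login_variant']
```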

### 4. Adding Examples for Fine-tuning

```python
import time

from PIL import Image

# Orchestrator methods
def on_workflow_accepted(self, screenshot_pil: Image.Image, workflow_id: str):
    """Called when the user accepts a workflow."""
    # Add a positive example for fine-tuning
    self.fine_tuner.add_positive_example(
        image=screenshot_pil,
        workflow_id=workflow_id,
        metadata={'timestamp': time.time()}
    )

    # Save a checkpoint periodically
    if self.fine_tuner.training_count % 5 == 0:
        self.fine_tuner.save_checkpoint("orchestrator_finetuning")

def on_workflow_rejected(self, screenshot_pil: Image.Image, workflow_id: str):
    """Called when the user rejects a workflow."""
    # Add a negative example for fine-tuning
    self.fine_tuner.add_negative_example(
        image=screenshot_pil,
        workflow_id=workflow_id,
        metadata={'timestamp': time.time()}
    )
```
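The interplay between `trigger_threshold` and `max_examples` can be sketched as a bounded buffer that signals when a training run is due. This is one plausible reading of those two parameters, not the `LightweightFineTuner` implementation; `ExampleBuffer` and its `add` method are hypothetical.

```python
from collections import deque


class ExampleBuffer:
    """Collects examples and reports when a training run is due (illustrative)."""

    def __init__(self, trigger_threshold: int = 10, max_examples: int = 1000):
        self.trigger_threshold = trigger_threshold
        self.examples = deque(maxlen=max_examples)  # oldest examples drop out first
        self._since_last_training = 0
        self.training_count = 0

    def add(self, example_id: str, positive: bool) -> bool:
        """Store one example; return True if this addition triggers a training run."""
        self.examples.append((example_id, positive))
        self._since_last_training += 1
        if self._since_last_training >= self.trigger_threshold:
            self._since_last_training = 0
            self.training_count += 1
            return True
        return False


buffer = ExampleBuffer(trigger_threshold=3, max_examples=5)
triggers = [buffer.add(f"wf_{i}", positive=(i % 2 == 0)) for i in range(7)]
print(triggers)              # [False, False, True, False, False, True, False]
print(len(buffer.examples))  # 5
```

In this sketch accepted and rejected workflows share one counter, so both kinds of feedback move the buffer toward the next training run.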

### 5. Saving on Shutdown

```python
# Orchestrator method
def shutdown(self):
    """Called when the application shuts down."""
    # Wait for any ongoing fine-tuning
    self.fine_tuner.wait_for_training(timeout=30)

    # Save the checkpoint
    self.fine_tuner.save_checkpoint("orchestrator_finetuning")

    # Save the FAISS index
    self.faiss_index.save("data/workflow_embeddings")
```
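Waiting for a background training run with a timeout typically comes down to `Thread.join(timeout=...)`. The sketch below shows that pattern with a stand-in job. `BackgroundTrainer` is hypothetical, and the boolean return value of its `wait_for_training` is an assumption; the guide does not say what the real method returns.

```python
import threading
import time


class BackgroundTrainer:
    """Runs a stand-in training job on a daemon thread (illustrative)."""

    def __init__(self):
        self.training_thread = None

    def start(self, duration_seconds: float) -> None:
        # time.sleep stands in for the actual fine-tuning work.
        self.training_thread = threading.Thread(
            target=time.sleep, args=(duration_seconds,), daemon=True
        )
        self.training_thread.start()

    def wait_for_training(self, timeout: float) -> bool:
        """Block for at most `timeout` seconds; return True if training finished."""
        if self.training_thread is None:
            return True
        self.training_thread.join(timeout=timeout)
        return not self.training_thread.is_alive()


trainer = BackgroundTrainer()
trainer.start(duration_seconds=0.05)
finished = trainer.wait_for_training(timeout=2.0)
print(finished)  # True
```

Because the thread is a daemon, a run that outlives the timeout is abandoned when the process exits, which is why saving the checkpoint after the wait matters.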

## Migrating from the Old System

### Old Code (EmbeddingsManager)

```python
# Old
from .embeddings_manager import EmbeddingsManager

embeddings = EmbeddingsManager()
embedding = embeddings.encode_image(numpy_image)  # numpy BGR
```

### New Code (EmbeddingManager)

```python
# New
from .embedders import EmbeddingManager
from PIL import Image
import cv2

embedding_manager = EmbeddingManager(model_name="clip")

# Convert numpy BGR → PIL RGB
image_rgb = cv2.cvtColor(numpy_image, cv2.COLOR_BGR2RGB)
pil_image = Image.fromarray(image_rgb)

embedding = embedding_manager.embed(pil_image)
```

### Compatibility in VisionAnalysis

The code in `vision_analysis.py` is already compatible with both systems:

```python
# Automatically detects which system is in use
if self._use_new_system:
    # New system
    region_rgb = cv2.cvtColor(region, cv2.COLOR_BGR2RGB)
    pil_image = Image.fromarray(region_rgb)
    embedding = self.embeddings.embed(pil_image)
else:
    # Old system
    embedding = self.embeddings.encode_image(region)
```
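The guide does not show how `_use_new_system` is set. One simple way such a flag could be derived is duck typing on the manager object, sketched below. The detection rule and the two stand-in classes are assumptions for illustration, not the actual logic in `vision_analysis.py`.

```python
class OldEmbeddings:
    """Stand-in for the old manager, which exposes encode_image()."""

    def encode_image(self, bgr_array):
        return "old-embedding"


class NewEmbeddings:
    """Stand-in for the new manager, which exposes embed()."""

    def embed(self, pil_image):
        return "new-embedding"


def uses_new_system(embeddings) -> bool:
    """Assumed detection rule: the new manager exposes `embed`, the old one does not."""
    return callable(getattr(embeddings, "embed", None))


print(uses_new_system(NewEmbeddings()))  # True
print(uses_new_system(OldEmbeddings()))  # False
```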

## Recommended Configuration

```python
config = {
    "embedding": {
        "model": "clip",  # "clip" or "pix2struct" (not recommended)
        "cache_size": 1000,
        "fallback_enabled": True
    },
    "fine_tuning": {
        "enabled": True,
        "trigger_threshold": 10,  # Fine-tune every 10 examples
        "max_examples": 1000,
        "checkpoint_dir": "data/fine_tuning"
    },
    "faiss": {
        "index_path": "data/workflow_embeddings"
    }
}
```
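If the configuration arrives as a plain dict, it can help to turn it into typed sections with defaults and a few sanity checks before wiring up the components. The sketch below uses the keys and default values shown above. The dataclasses, `parse_config` and the validation rules are hypothetical additions, not part of GeniusIA.

```python
from dataclasses import dataclass


@dataclass
class EmbeddingConfig:
    model: str = "clip"
    cache_size: int = 1000
    fallback_enabled: bool = True


@dataclass
class FineTuningConfig:
    enabled: bool = True
    trigger_threshold: int = 10
    max_examples: int = 1000
    checkpoint_dir: str = "data/fine_tuning"


def parse_config(raw: dict):
    """Build typed sections from the raw dict; missing keys fall back to defaults."""
    embedding = EmbeddingConfig(**raw.get("embedding", {}))
    fine_tuning = FineTuningConfig(**raw.get("fine_tuning", {}))
    if embedding.model not in ("clip", "pix2struct"):
        raise ValueError(f"Unknown embedding model: {embedding.model!r}")
    if fine_tuning.trigger_threshold > fine_tuning.max_examples:
        raise ValueError("trigger_threshold must not exceed max_examples")
    return embedding, fine_tuning


embedding_cfg, fine_tuning_cfg = parse_config({"embedding": {"cache_size": 500}})
print(embedding_cfg.cache_size, fine_tuning_cfg.trigger_threshold)  # 500 10
```

Passing an unknown key (for example a typo such as `cache_sise`) raises a `TypeError` from the dataclass constructor, which surfaces configuration mistakes early.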

## Metrics and Monitoring

### Cache Statistics

```python
stats = embedding_manager.get_stats()
print(f"Cache hit rate: {stats['cache_hit_rate']:.1%}")
print(f"Cache size: {stats['cache_size']}/{stats['cache_capacity']}")
```

### Fine-tuning Statistics

```python
stats = fine_tuner.get_stats()
print(f"Examples collected: {stats['total_examples']}")
print(f"Trainings completed: {stats['training_count']}")
print(f"Is training: {stats['is_training']}")

# Metrics history
for metrics in stats['metrics_history']:
    print(f"Training #{metrics['training_number']}: "
          f"loss={metrics['loss']:.4f}, "
          f"duration={metrics['duration_seconds']:.1f}s")
```

## Expected Performance

### CLIP (Recommended)
- **Embedding**: ~20 ms per image (batched)
- **Cache hit**: <1 ms
- **Fine-tuning**: 30 s to 2 min for 10 to 100 examples
- **Memory**: ~2 GB (model) + ~500 MB (FAISS for 10k embeddings)
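
As a rough cross-check on the FAISS figure, the raw vector storage of an uncompressed float32 index is count × dimension × 4 bytes. The sketch below assumes 512-dimensional embeddings (the output size of the CLIP ViT-B/32 variant; this guide does not state which variant is used), so the dimension is an assumption. The result covers raw vectors only, not metadata or library overhead.

```python
def flat_index_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Raw storage for uncompressed vectors (float32 = 4 bytes per value)."""
    return num_vectors * dimension * bytes_per_value


raw = flat_index_bytes(num_vectors=10_000, dimension=512)  # dimension is assumed
print(f"{raw / 1e6:.1f} MB")  # 20.5 MB
```

Under that assumption the raw vectors account for roughly 20 MB, so most of the ~500 MB budget quoted above would have to come from something other than the vectors themselves.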

### Pix2Struct (Not Recommended)
- **Embedding**: ~2900 ms per image (146x slower)
- **Discrimination**: 9x less precise than CLIP
- **Memory**: ~4 GB (model)

## Troubleshooting

### Problem: Dimension mismatch in FAISS

```python
# Solution: rebuild the index
if faiss_index.rebuild_if_needed(new_dimension):
    logger.warning("FAISS index rebuilt due to dimension change")
```
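What "rebuild if needed" means can be illustrated with a small in-memory index: vectors of different dimensions cannot be compared, so a dimension change forces the index to start over. `SimpleIndex` below is a hypothetical stand-in, and its behaviour (dropping all vectors, returning a boolean) is an assumption about what `rebuild_if_needed` does.

```python
class SimpleIndex:
    """Minimal in-memory stand-in for a vector index (illustrative only)."""

    def __init__(self, dimension: int):
        self.dimension = dimension
        self.vectors = []

    def add(self, vector) -> None:
        if len(vector) != self.dimension:
            raise ValueError(
                f"Expected dimension {self.dimension}, got {len(vector)}"
            )
        self.vectors.append(list(vector))

    def rebuild_if_needed(self, new_dimension: int) -> bool:
        """Reset the index when the dimension changed; return True if it was rebuilt."""
        if new_dimension == self.dimension:
            return False
        # Old vectors cannot be compared with new ones, so they are dropped.
        self.dimension = new_dimension
        self.vectors = []
        return True


index = SimpleIndex(dimension=3)
index.add([0.1, 0.2, 0.3])
print(index.rebuild_if_needed(3))  # False
print(index.rebuild_if_needed(4))  # True
print(len(index.vectors))          # 0
```

If the real index behaves this way, previously indexed workflows would need to be re-embedded with the new model and added again after a rebuild.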

### Problem: Fine-tuning blocks the application

```python
# Check that fine-tuning runs in a separate (daemon) thread
assert fine_tuner.training_thread.daemon
```

### Problem: Cache is not working

```python
# Check that use_cache=True (the default)
embedding = embedding_manager.embed(image, use_cache=True)
```

## Tests

Run the full test suite:

```bash
# Core system test
geniusia2/venv/bin/python test_embedding_system.py

# CLIP vs Pix2Struct benchmark
geniusia2/venv/bin/python test_pix2struct_vs_clip.py
```

## Next Steps

1. ✅ Integrate into `Orchestrator.__init__()`
2. ✅ Connect to workflow events (accept/reject)
3. ✅ Add saving on shutdown
4. ✅ Test under real-world conditions
5. ✅ Monitor fine-tuning metrics

## Support

For any questions, see:
- `PIX2STRUCT_BENCHMARK_RESULTS.md` - Benchmark results
- `.kiro/specs/embedding-improvement/` - Full spec
- Tests in `test_embedding_system.py`