docs(beta): plan 1b, wiring the 7 category toggles to the engine (P1-2)
TDD plan for per-category gating: disabled_kinds infrastructure + _CATEGORY_OF (default-deny) + Tier 1 audit filter (the safety carrier for the PDF), relaxation of the NIR/TEL residual rescan, Tier 2/3 text gates (dispatchers + selective_rescan + NER + phase-0), address burn guard, GUI wiring of the 7 booleans. Per-category behavioral tests + non-regression baseline. SECURITY CODE: Qwen review mandatory. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,247 @@
# GUI V6 beta, Plan 1b: wiring the 7 "Données à détecter" toggles to the engine

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans, task-by-task. Steps use checkbox (`- [ ]`) syntax. **SECURITY CODE: Qwen review mandatory** (spec decision P1-2).

**Goal:** Make the 7 "Données à détecter" ("Data to detect") switches actually effective: unchecking a category leaves it in clear text in the output (text AND PDF) and relaxes the safety net for that category, without ever unmasking a category that was not unchecked.

**Architecture:** Masking is inline and scattered (3 passes, ~50 sites, no chokepoint). We carry a `disabled_kinds: set[str]` (the names of the unchecked categories) through `cfg`, which is already threaded everywhere, and apply a **3-tier filter**: (T1) filter the `audit` before the PDF burn, which is the safety carrier for the PDF deliverable and is **default-deny**; (T2) gate the text in the dispatcher functions + `selective_rescan`; (T3) gate the multiline phase-0 blocks. On top of that come the relaxation of the residual rescan (NIR/TEL) and an address guard. The validation net is **end-to-end behavioral tests per category**.

**Tech Stack:** Python, pytest. Core file `anonymizer_core_refactored_onnx.py` (5731 lines) + `gui_v6/`.

**Spec reference:** `docs/superpowers/specs/2026-06-25-gui-v6-beta-prod-design.md` (workstream D, P1-2, decisions D2/D3: no hard floor; `EMAIL/IBAN/IPP/VILLE/FAX` are not toggleable = always masked).

**Category → audit kinds mapping** (`_CATEGORY_OF`, default-deny: any kind absent from the table stays masked):

- NOM ← NOM, NOM_FORCE, NOM_GLOBAL, NOM_EXTRACTED, NOM_INITIAL, NER_PER, EDS_NOM, EDS_PRENOM
- DATE_NAISSANCE ← DATE_NAISSANCE, DATE_NAISSANCE_GLOBAL
- ETAB ← ETAB, ETAB_FINESS, ETAB_SPACED, ETAB_GLOBAL, NER_ORG, EDS_HOPITAL
- ADRESSE ← ADRESSE, ADDR_FINESS, EDS_ADRESSE *(VILLE/NER_LOC always stay masked; they are outside the 7 toggles)*
- NIR ← NIR
- TEL ← TEL *(FAX always stays masked)*
- ADHERENT ← ADHERENT

---

### Task 1: Infrastructure, `disabled_kinds` + `_CATEGORY_OF` + audit filter (Tier 1)

**Files:** Modify `anonymizer_core_refactored_onnx.py` (add `_CATEGORY_OF`/`_category_of` near the placeholders, ~l.610; add the `disabled_kinds` kwarg to `process_pdf`, ~l.4973; inject it into `cfg` after ~l.5002; add the audit filter before the PDF write, ~l.5553). Test: `tests/unit/test_core_category_gating.py`.

- [ ] **Step 1: Failing test (audit filter + default-deny).** Create `tests/unit/test_core_category_gating.py`:

```python
import anonymizer_core_refactored_onnx as core


def test_category_of_maps_known_kinds():
    assert core._category_of("NOM_FORCE") == "NOM"
    assert core._category_of("NER_PER") == "NOM"
    assert core._category_of("EDS_HOPITAL") == "ETAB"
    assert core._category_of("ADDR_FINESS") == "ADRESSE"
    assert core._category_of("NIR") == "NIR"
    assert core._category_of("TEL") == "TEL"
    assert core._category_of("ADHERENT") == "ADHERENT"


def test_category_of_default_deny_for_unknown():
    # An unmapped kind must NEVER be filterable (it stays masked). Security.
    assert core._category_of("EMAIL") is None
    assert core._category_of("IBAN") is None
    assert core._category_of("VILLE") is None
    assert core._category_of("FAX") is None
    assert core._category_of("INCONNU_XYZ") is None


def test_filter_audit_drops_only_disabled_categories():
    PiiHit = core.PiiHit
    audit = [
        PiiHit(1, "NOM", "Dupont", "[NOM]"),
        PiiHit(1, "NIR", "1850574...", "[NIR]"),
        PiiHit(1, "EMAIL", "x@y.fr", "[EMAIL]"),
    ]
    kept = core._filter_audit_by_disabled(audit, {"NIR"})
    kinds = {h.kind for h in kept}
    assert "NIR" not in kinds  # NIR unchecked → removed
    assert "NOM" in kinds  # not unchecked → kept
    assert "EMAIL" in kinds  # not toggleable → always kept
```

- [ ] **Step 2: Run, expect FAIL** (`_category_of`/`_filter_audit_by_disabled` do not exist yet): `.venv/bin/pytest tests/unit/test_core_category_gating.py -v`.
- [ ] **Step 3: Implement.** In `anonymizer_core_refactored_onnx.py`, after the `PLACEHOLDERS`/`CRITICAL_PII_KEYS` block (~l.610), add:

```python
# --- Per-category gating (GUI toggles "Données à détecter") ----------------
# Maps each audit kind to one of the 7 toggleable categories. Any kind ABSENT
# from this table is NOT filterable (default-deny → stays masked). The
# non-toggleable categories (EMAIL/IBAN/IPP/VILLE/FAX/…) do not appear here.
_CATEGORY_OF: dict[str, str] = {
    "NOM": "NOM", "NOM_FORCE": "NOM", "NOM_GLOBAL": "NOM",
    "NOM_EXTRACTED": "NOM", "NOM_INITIAL": "NOM",
    "NER_PER": "NOM", "EDS_NOM": "NOM", "EDS_PRENOM": "NOM",
    "DATE_NAISSANCE": "DATE_NAISSANCE", "DATE_NAISSANCE_GLOBAL": "DATE_NAISSANCE",
    "ETAB": "ETAB", "ETAB_FINESS": "ETAB", "ETAB_SPACED": "ETAB",
    "ETAB_GLOBAL": "ETAB", "NER_ORG": "ETAB", "EDS_HOPITAL": "ETAB",
    "ADRESSE": "ADRESSE", "ADDR_FINESS": "ADRESSE", "EDS_ADRESSE": "ADRESSE",
    "NIR": "NIR",
    "TEL": "TEL",
    "ADHERENT": "ADHERENT",
}


def _category_of(kind: str) -> str | None:
    """Toggleable category of an audit kind, or None if it is not toggleable."""
    return _CATEGORY_OF.get(kind)


def _filter_audit_by_disabled(audit: list, disabled_kinds: set) -> list:
    """Drop from the audit the hits whose category is disabled (default-deny)."""
    if not disabled_kinds:
        return audit
    # Unmapped kinds yield None, which is never in the set, so they are kept.
    return [h for h in audit if _category_of(h.kind) not in disabled_kinds]
```

Add the kwarg to `process_pdf` (signature ~l.4973-4987): append `disabled_kinds: set | None = None,`. After `cfg = load_dictionaries(config_path)` (~l.5002), add:

```python
cfg["disabled_kinds"] = set(disabled_kinds or ())
```

Before the PDF-writing block (~l.5553, right before `if make_vector_redaction:`), add:

```python
# Tier 1: remove from the PDF deliverable the categories disabled by the user.
anon.audit = _filter_audit_by_disabled(anon.audit, cfg.get("disabled_kinds") or set())
```

(Adapt `anon.audit` to the actual audit variable name at that point: read the surrounding code; it is the list of `PiiHit` passed to `redact_pdf_vector`/`redact_pdf_raster`.)

- [ ] **Step 4: Run, expect PASS:** `.venv/bin/pytest tests/unit/test_core_category_gating.py -v`.
- [ ] **Step 5: Non-regression:** `.venv/bin/pytest tests/unit/ -q` (expect the prior pass count and 0 regressions: the default `disabled_kinds=None` ⇒ no behavior change).
- [ ] **Step 6: Commit:** `git add anonymizer_core_refactored_onnx.py tests/unit/test_core_category_gating.py && git commit -m "feat(core): per-category gating infrastructure + Tier 1 audit filter (P1-2)"`
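
To make the default-deny property of the Tier 1 filter concrete, here is a minimal standalone sketch. It does not import the engine: `PiiHit` is a stand-in `namedtuple` (only `.kind` is confirmed by the plan; the other field names are assumptions), `CATEGORY_OF` is a trimmed copy of the mapping, and `KIND_ADDED_LATER` is a hypothetical kind that nobody remembered to map. The point it illustrates is that an unmapped kind resolves to `None`, which is never in the disabled set, so the hit stays in the audit and is still burned.

```python
from collections import namedtuple

# Stand-in for the engine's PiiHit. Only `.kind` is read by the filter;
# the other field names are assumptions made for this sketch.
PiiHit = namedtuple("PiiHit", ["page", "kind", "original", "placeholder"])

# Trimmed copy of the category mapping, for illustration only.
CATEGORY_OF = {"NOM": "NOM", "NOM_FORCE": "NOM", "NIR": "NIR", "TEL": "TEL"}


def filter_audit(audit, disabled):
    """Drop hits whose category is disabled; unmapped kinds are always kept."""
    if not disabled:
        return audit
    return [h for h in audit if CATEGORY_OF.get(h.kind) not in disabled]


audit = [
    PiiHit(1, "NOM_FORCE", "Dupont", "[NOM]"),
    PiiHit(1, "NIR", "1850574...", "[NIR]"),
    PiiHit(1, "EMAIL", "x@y.fr", "[EMAIL]"),            # not toggleable
    PiiHit(2, "KIND_ADDED_LATER", "secret", "[MASK]"),  # hypothetical unmapped kind
]

kept = [h.kind for h in filter_audit(audit, {"NOM", "NIR"})]
print(kept)  # ['EMAIL', 'KIND_ADDED_LATER']
```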

---

### Task 2: Relaxing the residual rescan (NIR/TEL), security coupling D3

**Files:** Modify `anonymizer_core_refactored_onnx.py` (`_residual_pii_patterns` ~l.5453-5458 + INSEE-names branch ~l.5470-5490). Test: `tests/unit/test_core_category_gating.py` (extend).

- [ ] **Step 1: Failing test.** Add to `tests/unit/test_core_category_gating.py` a test asserting that the residual-pattern builder skips NIR/TEL when they are disabled. First read the code around l.5449-5519: the test targets a helper `_build_residual_patterns(disabled_kinds)` that does not exist yet (Step 3 refactors the inline list into it). Test:

```python
def test_residual_patterns_skip_disabled_nir_tel():
    labels_all = {lbl for _pat, lbl in core._build_residual_patterns(set())}
    assert {"NIR", "EMAIL", "IBAN", "TEL"} <= labels_all
    labels_no_nir = {lbl for _pat, lbl in core._build_residual_patterns({"NIR"})}
    assert "NIR" not in labels_no_nir
    # Non-toggleable patterns stay in place.
    assert "EMAIL" in labels_no_nir and "IBAN" in labels_no_nir
    labels_no_tel = {lbl for _pat, lbl in core._build_residual_patterns({"TEL"})}
    assert "TEL" not in labels_no_tel
```

- [ ] **Step 2: Run, expect FAIL.**
- [ ] **Step 3: Implement.** Refactor the inline `_residual_pii_patterns` (~l.5453-5458) into a module-level function `_build_residual_patterns(disabled_kinds: set) -> list[tuple]` that always includes EMAIL + IBAN, includes NIR only if `"NIR" not in disabled_kinds`, and includes TEL only if `"TEL" not in disabled_kinds`. Call it in the residual check with `cfg.get("disabled_kinds") or set()`. Additionally gate the opt-in INSEE-names branch (~l.5470) under `"NOM" not in disabled`.
- [ ] **Step 4: Run, expect PASS.**
- [ ] **Step 5: Non-regression:** `.venv/bin/pytest tests/unit/ -q`.
- [ ] **Step 6: Commit:** `git commit -m "feat(core): relax the residual rescan for unchecked NIR/TEL (P1-2/D3)"`
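
The builder described in Step 3 could look roughly like the sketch below. It is written under explicit assumptions: the regexes are simplified illustrations rather than the engine's real residual patterns, and the names `build_residual_patterns` and `residual_labels` are hypothetical (the plan only fixes the name `_build_residual_patterns` and the `(pattern, label)` tuple shape used by the test). What it does show is the intended split between patterns that are always on and patterns that follow the toggles.

```python
import re

# Illustrative regexes only; the real residual patterns live in the core module.
_ALWAYS_ON = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "EMAIL"),
    (re.compile(r"\bFR\d{2}(?: ?[0-9A-Z]{4}){5} ?[0-9A-Z]{3}\b"), "IBAN"),
]
_TOGGLEABLE = [
    (re.compile(r"\b[12] ?\d{2} ?\d{2} ?\d{2} ?\d{3} ?\d{3} ?\d{2}\b"), "NIR"),
    (re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"), "TEL"),
]


def build_residual_patterns(disabled_kinds):
    """EMAIL/IBAN are unconditional; NIR/TEL are dropped when their toggle is off."""
    patterns = list(_ALWAYS_ON)
    patterns += [(pat, lbl) for pat, lbl in _TOGGLEABLE if lbl not in disabled_kinds]
    return patterns


def residual_labels(text, disabled_kinds):
    """Labels of the residual patterns that still match `text`."""
    active = build_residual_patterns(disabled_kinds)
    return sorted({lbl for pat, lbl in active if pat.search(text)})


text = "NIR : 1 85 05 74 123 456 78 / Tél : 05 59 12 34 56 / x@y.fr"
print(residual_labels(text, set()))    # ['EMAIL', 'NIR', 'TEL']
print(residual_labels(text, {"NIR"}))  # ['EMAIL', 'TEL']
```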

---

### Task 3: Text gates (Tier 2 + Tier 3), detection passes + selective_rescan

**Files:** Modify `anonymizer_core_refactored_onnx.py` at the dispatcher sites listed below. Test: `tests/unit/test_core_category_gating_behavior.py` (behavioral, end-to-end on `anonymise_document_regex`).

**Sites to gate** (read each site before editing; pattern: fetch `disabled = cfg.get("disabled_kinds") or set()` at the top of the function, then skip the category's `.sub`/`PiiHit` sub-block if that category is disabled):

`_mask_line_by_regex` (~1670); `_kv_value_only_mask` (~2110, incl. the NOM/label subs at 2098-2106); the uppercase-PERSON block (~1942-2008 → NOM); `_apply_extracted_names` (~2809 → early-return `text` unchanged if NOM is disabled); `_mask_with_hf` (~3136 → per placeholder NOM/ETAB/ADRESSE); `_mask_with_eds_pseudo` (~3208 → same, via `EDS_LABEL_MAP`); `selective_rescan` (~4159 → DATE_NAISSANCE 4203, ADRESSE 4205-4207, ETAB 4229-4251, ADHERENT 4200-4201, TEL 4191-4193, NIR 4187-4188); the multiline phase-0 blocks for DATE_NAISSANCE (~3014) / NIR (~3034).

- [ ] **Step 1: Failing behavioral tests.** Create `tests/unit/test_core_category_gating_behavior.py`. For each of the 7 categories, build a minimal `pages_text` containing a clear instance of that category + one instance of a DIFFERENT category, run `anonymise_document_regex(pages_text, [], cfg)` with the category disabled, and assert that the disabled category's value is PRESENT (in clear text) in the output AND that the other category is still masked. Example (NIR + TEL); adapt the others by reading the real regexes to get realistic inputs:

```python
import anonymizer_core_refactored_onnx as core


def _cfg(disabled):
    cfg = core.load_dictionaries(None)
    cfg["disabled_kinds"] = set(disabled)
    return cfg


def test_disabling_nir_leaves_nir_clear_but_masks_tel():
    pages = ["NIR : 1 85 05 74 123 456 78\nTél : 05 59 12 34 56"]
    out, _audit = core.anonymise_document_regex(pages, [], _cfg({"NIR"}))[:2]
    text = "\n".join(out) if isinstance(out, list) else str(out)
    assert "1 85 05 74 123 456 78" in text  # NIR unchecked → in clear text
    assert "05 59 12 34 56" not in text  # TEL not unchecked → masked


def test_all_enabled_is_unchanged_baseline():
    pages = ["NIR : 1 85 05 74 123 456 78"]
    out, _audit = core.anonymise_document_regex(pages, [], _cfg(set()))[:2]
    text = "\n".join(out) if isinstance(out, list) else str(out)
    # Everything enabled → masked (non-regression).
    assert "1 85 05 74 123 456 78" not in text
```

(Write one analogous test for each remaining category: NOM, DATE_NAISSANCE, ETAB, ADRESSE, TEL, ADHERENT, using inputs that the real regexes detect. Read the regex definitions to craft valid inputs. Verify the exact return shape of `anonymise_document_regex` first.)

- [ ] **Step 2: Run, expect FAIL** (the categories are still masked because the text gates do not exist yet).
- [ ] **Step 3: Implement** the gates at each site listed above. Apply the same `if "CAT" in disabled: <skip this sub>` pattern everywhere. Work site by site; after each one, re-run the behavioral test for that category.
- [ ] **Step 4: Run, expect ALL PASS** (7 category tests + baseline).
- [ ] **Step 5: Non-regression + quality gate:** `.venv/bin/pytest tests/unit/ -q` and `.venv/bin/python scripts/evaluate_quality.py` (the score must stay A+ with defaults; the synthetic regression gate must pass).
- [ ] **Step 6: Commit:** `git commit -m "feat(core): per-category text gates (Tier 2/3) + selective_rescan (P1-2)"`
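
The gate pattern applied at each site can be sketched on a toy dispatcher. This is not the engine's `_mask_line_by_regex`: `mask_line`, the three regexes and the placeholder strings are assumptions made for the illustration. The shape is what matters: read `disabled_kinds` from `cfg` once at the top, wrap each toggleable category's substitution in its own `if`, and leave non-toggleable categories ungated so that no value in the set can switch them off.

```python
import re

# Illustrative patterns and placeholders; the engine's own regexes differ.
NIR_RE = re.compile(r"\b[12] ?\d{2} ?\d{2} ?\d{2} ?\d{3} ?\d{3} ?\d{2}\b")
TEL_RE = re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def mask_line(line, cfg):
    """Toy dispatcher: one gated sub-block per toggleable category."""
    disabled = cfg.get("disabled_kinds") or set()
    if "NIR" not in disabled:
        line = NIR_RE.sub("[NIR]", line)
    if "TEL" not in disabled:
        line = TEL_RE.sub("[TEL]", line)
    # EMAIL is not toggleable: deliberately no gate here.
    line = EMAIL_RE.sub("[EMAIL]", line)
    return line


line = "NIR : 1 85 05 74 123 456 78, Tél : 05 59 12 34 56, x@y.fr"
print(mask_line(line, {"disabled_kinds": {"NIR"}}))
# NIR : 1 85 05 74 123 456 78, Tél : [TEL], [EMAIL]
print(mask_line(line, {}))
# NIR : [NIR], Tél : [TEL], [EMAIL]
```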

---

### Task 4: Address guard in the PDF burn (`_search_pdf_address_lines`)

**Files:** Modify `anonymizer_core_refactored_onnx.py` (~l.4572 vector, ~l.4744 raster; `_search_pdf_address_lines` is called independently of the audit). Test: extend the behavioral test (or add a focused unit test on the redact function with ADRESSE disabled).

- [ ] **Step 1: Failing test:** assert that when ADRESSE is disabled, the independent address-line search is skipped (so addresses are not burned). Read `redact_pdf_vector`/`redact_pdf_raster` to find how `disabled_kinds` can reach them (pass `cfg["disabled_kinds"]` or the set as a parameter; the functions already receive `cfg`, or can be made to).
- [ ] **Step 2: Run, expect FAIL.**
- [ ] **Step 3: Implement:** guard both `_search_pdf_address_lines(page)` calls with `if "ADRESSE" not in disabled_kinds:`.
- [ ] **Step 4: Run, expect PASS.**
- [ ] **Step 5: Non-regression:** `.venv/bin/pytest tests/unit/ -q`.
- [ ] **Step 6: Commit:** `git commit -m "feat(core): address guard in the PDF burn when the category is unchecked (P1-2)"`
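
One way to make the Step 1 assertion testable without a real PDF is to inject the address search as a callable. The sketch below is hypothetical throughout: `collect_burn_rects` and `fake_search` are invented for illustration and are not the signatures of `redact_pdf_vector`/`redact_pdf_raster`. It only shows why the audit filter alone is not enough here: the address search contributes rectangles that never pass through the audit, so it needs its own guard.

```python
def collect_burn_rects(page, audit_rects, disabled_kinds, search_address_lines):
    """Rectangles to burn on one page: audit hits plus optional address lines."""
    rects = list(audit_rects)
    # The address search is independent of the audit, so Tier 1 cannot cover it.
    if "ADRESSE" not in disabled_kinds:
        rects.extend(search_address_lines(page))
    return rects


calls = []


def fake_search(page):
    """Hypothetical stand-in for the address-line search; records each call."""
    calls.append(page)
    return [("address-line", page)]


with_addr = collect_burn_rects("p1", [("nom", "p1")], set(), fake_search)
without_addr = collect_burn_rects("p1", [("nom", "p1")], {"ADRESSE"}, fake_search)
print(with_addr)     # [('nom', 'p1'), ('address-line', 'p1')]
print(without_addr)  # [('nom', 'p1')]
print(calls)         # ['p1']  (the search ran only once)
```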

---

### Task 5: GUI wiring, 7 booleans → engine

**Files:** Modify `gui_v6/config_state.py` (7 bool fields + mapping to `disabled_kinds`), `gui_v6/engine_bridge.py` (`EngineSettings` + `build_engine_kwargs`), `gui_v6/tabs/tab_config.py` (the 7 `_mini_toggle` calls ~l.351-357 → `variable` + `command` bound to `ConfigState`). Tests: `tests/unit/test_gui_v6_category_toggles.py`.

- [ ] **Step 1: Failing test.** Create `tests/unit/test_gui_v6_category_toggles.py`:

```python
from gui_v6.config_state import ConfigState


def test_default_all_categories_enabled_means_no_disabled_kinds():
    es = ConfigState().to_engine_settings()
    assert es.disabled_kinds == frozenset()


def test_unchecking_nir_and_etab_propagates_as_disabled_kinds():
    cs = ConfigState()
    cs.mask_nir = False
    cs.mask_etab = False
    es = cs.to_engine_settings()
    assert es.disabled_kinds == frozenset({"NIR", "ETAB"})


def test_build_engine_kwargs_passes_disabled_kinds():
    from gui_v6.engine_bridge import EngineSettings, build_engine_kwargs
    es = EngineSettings(disabled_kinds=frozenset({"TEL"}))
    kwargs = build_engine_kwargs(es)
    assert kwargs["disabled_kinds"] == frozenset({"TEL"})
```

- [ ] **Step 2: Run, expect FAIL.**
- [ ] **Step 3: Implement.**
  - `gui_v6/config_state.py`: add 7 bool fields (default `True`): `mask_noms, mask_ddn, mask_etab, mask_adresse, mask_nir, mask_tel, mask_adherent`. In `to_engine_settings`, build `disabled_kinds = frozenset(cat for field, cat in [(self.mask_noms,"NOM"),(self.mask_ddn,"DATE_NAISSANCE"),(self.mask_etab,"ETAB"),(self.mask_adresse,"ADRESSE"),(self.mask_nir,"NIR"),(self.mask_tel,"TEL"),(self.mask_adherent,"ADHERENT")] if not field)` and pass it to `EngineSettings`.
  - `gui_v6/engine_bridge.py`: add `disabled_kinds: frozenset = frozenset()` to `EngineSettings`; in `build_engine_kwargs`, add `kwargs["disabled_kinds"] = settings.disabled_kinds`.
  - `gui_v6/tabs/tab_config.py`: wire each of the 7 `_mini_toggle` calls to a `ctk.BooleanVar` bound to the matching `ConfigState` field, with a `command` that writes the value back. (Read the current `_mini_toggle` signature; follow the existing pattern used by the other wired toggles in this tab.)
- [ ] **Step 4: Run, expect PASS** + `.venv/bin/python Pseudonymisation_Gui_V6.py --self-test`.
- [ ] **Step 5: GUI non-regression:** `.venv/bin/pytest tests/unit/ -k gui_v6 -q`.
- [ ] **Step 6: Commit:** `git commit -m "feat(gui): wire the 7 category toggles to the engine (P1-2)"`
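
The state-to-engine mapping in Step 3 might be sketched as follows. These are stand-in dataclasses, not the real `gui_v6` classes: the actual `ConfigState` and `EngineSettings` presumably carry other fields, `build_engine_kwargs` builds more than one key, and `_FIELD_TO_CATEGORY` is a hypothetical helper table that replaces the inline list of tuples. Keeping the field-to-category pairs in one table means adding a toggle touches a single place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineSettings:
    """Stand-in: only the field added by this plan is modelled."""
    disabled_kinds: frozenset = frozenset()


# Hypothetical helper table: ConfigState field name → engine category.
_FIELD_TO_CATEGORY = {
    "mask_noms": "NOM",
    "mask_ddn": "DATE_NAISSANCE",
    "mask_etab": "ETAB",
    "mask_adresse": "ADRESSE",
    "mask_nir": "NIR",
    "mask_tel": "TEL",
    "mask_adherent": "ADHERENT",
}


@dataclass
class ConfigState:
    """Stand-in: the 7 toggles, all enabled by default."""
    mask_noms: bool = True
    mask_ddn: bool = True
    mask_etab: bool = True
    mask_adresse: bool = True
    mask_nir: bool = True
    mask_tel: bool = True
    mask_adherent: bool = True

    def to_engine_settings(self):
        # A category is disabled exactly when its toggle is unchecked.
        disabled = frozenset(
            cat for name, cat in _FIELD_TO_CATEGORY.items() if not getattr(self, name)
        )
        return EngineSettings(disabled_kinds=disabled)


def build_engine_kwargs(settings):
    """Stand-in: the real function builds the other engine kwargs as well."""
    return {"disabled_kinds": settings.disabled_kinds}


state = ConfigState(mask_nir=False, mask_etab=False)
print(sorted(build_engine_kwargs(state.to_engine_settings())["disabled_kinds"]))
# ['ETAB', 'NIR']
```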

---

## Self-review (coverage of spec P1-2 + map)

- T1 audit filter (Task 1) · rescan relax NIR/TEL (Task 2) · text gates incl. selective_rescan + NER paths + phase-0 (Task 3) · address burn guard (Task 4) · GUI wiring (Task 5). ✓
- Default-deny verified (Task 1 test `EMAIL/IBAN/VILLE/FAX → None`). EMAIL/IBAN/IPP/VILLE/FAX are always masked. ✓
- Baseline "everything enabled = non-regression" tested (Task 3) + `evaluate_quality` A+ gate. ✓
- **Risk**: a forgotten text site ⇒ the category stays masked in the text (a red test catches it), but there is NEVER a cross-category leak (default-deny). The PDF deliverable is guaranteed by T1 (the audit filter) alone.
- **Qwen review mandatory** on Tasks 1-4 (security core), before execution/after implementation.
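
As a possible complement to this checklist, the mapping itself can be checked mechanically against the spec's two lists. The sketch below is an assumption-laden illustration, not part of the plan: `mapping_problems` is a hypothetical helper, and `sample` is a reduced mapping with one kind per category. In a real test it would be called on `core._CATEGORY_OF`.

```python
TOGGLE_CATEGORIES = frozenset(
    {"NOM", "DATE_NAISSANCE", "ETAB", "ADRESSE", "NIR", "TEL", "ADHERENT"}
)
NEVER_TOGGLEABLE = frozenset({"EMAIL", "IBAN", "IPP", "VILLE", "FAX"})


def mapping_problems(category_of):
    """Return human-readable inconsistencies between a mapping and the spec lists."""
    problems = []
    targets = set(category_of.values())
    for cat in sorted(targets - TOGGLE_CATEGORIES):
        problems.append(f"unknown category: {cat}")
    for cat in sorted(TOGGLE_CATEGORIES - targets):
        problems.append(f"category with no kind: {cat}")
    for kind in sorted(NEVER_TOGGLEABLE & set(category_of)):
        problems.append(f"non-toggleable kind is mapped: {kind}")
    return problems


# Reduced mapping for the example: one kind per category.
sample = {cat: cat for cat in TOGGLE_CATEGORIES}
print(mapping_problems(sample))                    # []
print(mapping_problems({**sample, "FAX": "TEL"}))  # ['non-toggleable kind is mapped: FAX']
```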