feat(referentials): validation ATIH 2018 des codes médicaux

Ajoute une couche de validation post-extraction contre les référentiels
officiels de l'ATIH (Agence Technique de l'Information sur
l'Hospitalisation) pour 2018. Zéro tolérance sur les codes T2A : un
code invalide est flaggé, et une correction par plus proche voisin
(Levenshtein ≤ 1) est proposée.

Contenu :
- pipeline/referentials.py : API publique is_valid_{cim10,ccam,ghm,ghs},
  get_cim10_libelle, nearest_cim10, ghm_to_ghs. CLI --build/--test/--stats.
- pipeline/validation.py    : annote un JSON d'extraction avec un bloc
  `_validation` par page (codes valides/invalides + suggestions + cross-
  checks GHM↔GHS).
- referentials/sources/     : données brutes ATIH publiques (CIM-10 ClaML
  2019 substitut, CCAM v5 2018, GHM v2018, tarifs fév. 2018).
- referentials/atih_2018.sqlite : base SQLite prête à l'emploi
  (11 623 CIM-10 · 8 147 CCAM · 2 593 GHM · 5 329 couples GHM→GHS).
- tests/test_referentials.py : 11 tests unitaires (11/11 passent).
- annotate_validation.py    : script qui annote tous les JSONs V2 en
  place et produit validation_report.md.

Note CIM-10 : la version 2018 ATIH n'est publiée qu'en PDF, ClaML 2019
est utilisée en substitut (écart connu ≈ 60 codes / 11 600).

Gestion des suffixes PMSI : `*` (CMA exclue par le DP) et `+N`
(extension PMSI) sont strippés avant validation, le code racine seul
est comparé au référentiel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Dom
2026-04-24 15:06:01 +02:00
parent ed4d9bd765
commit 6df590ae95
17 changed files with 156052 additions and 0 deletions

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,283 @@
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY % rubric.simple "#PCDATA | Reference | Term">
<!ENTITY % rubric.complex "%rubric.simple; | Para | Include |
IncludeDescendants| Fragment | List | Table">
<!ELEMENT ClaML (
Meta*,
Identifier*,
Title,
Authors?,
Variants?,
ClassKinds,
UsageKinds?,
RubricKinds,
Modifier*,
ModifierClass*,
Class*)
>
<!ATTLIST ClaML
version CDATA #REQUIRED
>
<!ELEMENT Variants (Variant+)>
<!ELEMENT Variant (#PCDATA)>
<!ATTLIST Variant
name ID #REQUIRED
>
<!ELEMENT Meta EMPTY>
<!ATTLIST Meta
name CDATA #REQUIRED
value CDATA #REQUIRED
variants IDREFS #IMPLIED
>
<!ELEMENT Identifier EMPTY>
<!ATTLIST Identifier
authority NMTOKEN #IMPLIED
uid CDATA #REQUIRED
>
<!ELEMENT Title (#PCDATA)>
<!ATTLIST Title
name NMTOKEN #REQUIRED
version CDATA #IMPLIED
date CDATA #IMPLIED
>
<!ELEMENT Authors (Author* )>
<!ELEMENT Author (#PCDATA)>
<!ATTLIST Author
name ID #REQUIRED
>
<!ELEMENT ClassKinds (ClassKind+)>
<!ELEMENT RubricKinds (RubricKind+)>
<!ELEMENT UsageKinds (UsageKind+)>
<!ELEMENT ClassKind (Display*)>
<!ATTLIST ClassKind
name ID #REQUIRED
>
<!ELEMENT RubricKind (Display*)>
<!ATTLIST RubricKind
name ID #REQUIRED
inherited (true|false) "true"
>
<!ELEMENT UsageKind EMPTY>
<!ATTLIST UsageKind
name ID #REQUIRED
mark CDATA #REQUIRED
>
<!ELEMENT Display (#PCDATA)>
<!ATTLIST Display
xml:lang NMTOKEN #REQUIRED
variants IDREF #IMPLIED
>
<!ELEMENT Modifier (
Meta*,
SubClass*,
Rubric*,
History*)
>
<!ATTLIST Modifier
code NMTOKEN #REQUIRED
variants IDREFS #IMPLIED
>
<!ELEMENT ModifierClass (
Meta*,
SuperClass,
SubClass*,
Rubric*,
History*)
>
<!ATTLIST ModifierClass
modifier NMTOKEN #REQUIRED
code NMTOKEN #REQUIRED
usage IDREF #IMPLIED
variants IDREFS #IMPLIED
>
<!ELEMENT Class (
Meta*,
SuperClass*,
SubClass*,
ModifiedBy*,
ExcludeModifier*,
Rubric*,
History*)
>
<!ATTLIST Class
code CDATA #REQUIRED
kind IDREF #REQUIRED
usage IDREF #IMPLIED
variants IDREFS #IMPLIED
>
<!ELEMENT ModifiedBy (
Meta*,
ValidModifierClass*)
>
<!ATTLIST ModifiedBy
code NMTOKEN #REQUIRED
all (true|false) "true"
position CDATA #IMPLIED
variants IDREFS #IMPLIED
>
<!ELEMENT ExcludeModifier EMPTY>
<!ATTLIST ExcludeModifier
code NMTOKEN #REQUIRED
variants IDREFS #IMPLIED
>
<!ELEMENT ValidModifierClass EMPTY>
<!ATTLIST ValidModifierClass
code NMTOKEN #REQUIRED
variants IDREFS #IMPLIED
>
<!ELEMENT Rubric (
Label+,
History*)
>
<!ATTLIST Rubric
id ID #IMPLIED
kind IDREF #REQUIRED
usage IDREF #IMPLIED
>
<!ELEMENT Label (%rubric.complex;)*>
<!ATTLIST Label
xml:lang NMTOKEN #REQUIRED
xml:space (default|preserve) "default"
variants IDREFS #IMPLIED
>
<!ELEMENT History (#PCDATA)>
<!ATTLIST History
author IDREF #REQUIRED
date NMTOKEN #REQUIRED
>
<!ELEMENT SuperClass EMPTY>
<!ATTLIST SuperClass
code CDATA #REQUIRED
variants IDREFS #IMPLIED
>
<!ELEMENT SubClass EMPTY>
<!ATTLIST SubClass
code CDATA #REQUIRED
variants IDREFS #IMPLIED
>
<!ELEMENT Reference (#PCDATA)>
<!ATTLIST Reference
class CDATA #IMPLIED
authority NMTOKEN #IMPLIED
uid NMTOKEN #IMPLIED
code CDATA #IMPLIED
usage IDREF #IMPLIED
variants IDREFS #IMPLIED
>
<!ELEMENT Para (%rubric.simple;)*>
<!ATTLIST Para
class CDATA #IMPLIED
>
<!ELEMENT Fragment (%rubric.simple;)*>
<!ATTLIST Fragment
class CDATA #IMPLIED
usage IDREF #IMPLIED
type (item | list) "item"
>
<!ELEMENT Include EMPTY>
<!ATTLIST Include
class CDATA #IMPLIED
rubric IDREF #REQUIRED
>
<!ELEMENT IncludeDescendants EMPTY>
<!ATTLIST IncludeDescendants
code NMTOKEN #REQUIRED
kind IDREF #REQUIRED
>
<!ELEMENT List (ListItem+)>
<!ATTLIST List
class CDATA #IMPLIED
>
<!ELEMENT ListItem (
%rubric.simple;
| Para
| Include
| List
| Table)*
>
<!ATTLIST ListItem
class CDATA #IMPLIED
>
<!ELEMENT Table (
Caption?,
THead?,
TBody?,
TFoot?)
>
<!ATTLIST Table
class CDATA #IMPLIED
>
<!ELEMENT Caption (%rubric.simple;)*>
<!ATTLIST Caption
class CDATA #IMPLIED
>
<!ELEMENT THead (Row+)>
<!ATTLIST THead
class CDATA #IMPLIED
>
<!ELEMENT TBody (Row+)>
<!ATTLIST TBody
class CDATA #IMPLIED
>
<!ELEMENT TFoot (Row+)>
<!ATTLIST TFoot
class CDATA #IMPLIED
>
<!ELEMENT Row (Cell*)>
<!ATTLIST Row
class CDATA #IMPLIED
>
<!ELEMENT Cell (
%rubric.simple;
| Para
| Include
| List
| Table)*
>
<!ATTLIST Cell
class CDATA #IMPLIED
rowspan CDATA #IMPLIED
colspan CDATA #IMPLIED
>
<!ELEMENT Term (#PCDATA)>
<!ATTLIST Term
class CDATA #IMPLIED
>

File diff suppressed because it is too large Load Diff

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.