Synonym Norm Development

I. Requirements
Use normalization to aggressively map a term to its synonyms by abstracting away from

  • g: Genitive
  • rs: parenthetical plural forms (s), (es), (ies)
  • o: Punctuation
  • l: case
  • Ct: spelling variants and inflectional variants

  • duplicated spaces (removed)
  • leading and trailing spaces (trimmed)
  • duplicated results (removed)
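
The last three requirements (duplicated spaces, trimming, duplicated results) can be sketched in a few lines of Python. This is only an illustration of the clean-up step, not the LVG implementation; the function name clean_results is hypothetical.

```python
import re

def clean_results(results):
    """Collapse repeated whitespace, trim, and drop duplicates.

    Hypothetical helper for illustration only; the order of the first
    occurrences is preserved.
    """
    seen = set()
    cleaned = []
    for text in results:
        # Collapse any run of whitespace to one space, then trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned

print(clean_results(["clot  factor deficiency ", "clot factor deficiency", " high cea"]))
# ['clot factor deficiency', 'high cea']
```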

II. Developments

  • Approach 1 (Ct on input term):
    • Use lvg -f:g:rs:o:l:Ct
    • Ct gets the citation form of the input term as a whole
    • Fast performance
    • Lower coverage rate (98% of the approaches below)
    • Example 1:
      ID       | Term                                   | Norm term | Synonym substitutions | CUI
      KP102818 | CLOTTING FACTOR DEFICIENCY, CONGENITAL |           | ...                   | not found

  • Approach 2 (Ct on every word of the input term):
    • Use lvg -f:g:rs:o:l:Ct
    • Customize Ct to get the citation form of every word in the input term
    • More mutations and results: slower performance but a higher coverage rate
    • Example 1:
      ID       | Term                                   | Norm term                         | Synonym substitutions                    | CUI
      KP102818 | CLOTTING FACTOR DEFICIENCY, CONGENITAL | clot factor deficiency congenital | coagulation factor deficiency hereditary | C0272316
               |                                        |                                   | ...                                      |
    • However, this still misses some mappings when the citation form contains punctuation; for example, "carcino-embryonic" is the citation form of "carcinoembryonic"
    • Example 2:
      ID       | Term                              | Norm term                         | Synonym substitutions               | CUI
      KP194142 | Elevated carcinoembryonic antigen | elevate carcino-embryonic antigen | increase carcino-embryonic antigen  | C0549371
               |                                   |                                   | increased carcino-embryonic antigen |
               |                                   |                                   | high carcino-embryonic antigen      |
               |                                   |                                   | ...                                 |

  • Approach 3 (Move Ct before removing punctuation):
    • Use lvg -f:g:rs:Ct:l:o
    • Example 2:
      ID       | Term                              | Norm term                         | Synonym substitutions               | CUI
      KP194142 | Elevated carcinoembryonic antigen | elevate carcino embryonic antigen | increase carcino embryonic antigen  | C0549371
               |                                   |                                   | increased carcino embryonic antigen |
               |                                   |                                   | high carcino embryonic antigen      |
               |                                   |                                   | ...                                 |
               |                                   |                                   | elevate cea                         | C0742014
    • Also remove the genitive after Ct, since Ct can return a citation form that contains one:
      • E0000135|Addison's disease|Addisons disease
      • No citation form contains (s), (es), or (ies), so -f:rs is not needed after Ct
      • Use a database for CUI mapping to improve performance
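
The effect of moving Ct ahead of punctuation removal can be illustrated with a toy pipeline. The sketch below is not LVG: the two-entry CITATION dictionary is a hypothetical stand-in for the lexicon lookup, the function names are made up, and the g and rs steps are left out to keep the focus on step order.

```python
import string

# Hypothetical stand-in for the lexicon's citation-form lookup.
CITATION = {"elevated": "elevate", "carcinoembryonic": "carcino-embryonic"}

# Translation table that turns every ASCII punctuation mark into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def strip_punctuation(text):
    """Replace punctuation with spaces, then collapse repeated spaces."""
    return " ".join(text.translate(_PUNCT_TO_SPACE).split())

def citation_per_word(text):
    """Look up the citation form of each word; unknown words pass through."""
    return " ".join(CITATION.get(word.lower(), word) for word in text.split())

def punctuation_then_citation(term):
    # Same order as o:l:Ct: a hyphen introduced by Ct survives.
    return citation_per_word(strip_punctuation(term).lower())

def citation_then_punctuation(term):
    # Same order as Ct:l:o: a hyphen introduced by Ct is removed afterwards.
    return strip_punctuation(citation_per_word(term).lower())

term = "Elevated carcinoembryonic antigen"
print(punctuation_then_citation(term))  # elevate carcino-embryonic antigen
print(citation_then_punctuation(term))  # elevate carcino embryonic antigen
```

With punctuation removed last, the normalized term no longer carries the hyphen that the citation form introduced, which matches the norm terms shown in the two versions of Example 2.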

III. Comparisons

                          | Approach 1     | Approach 2     | Approach 3
                          | (Ct on term)   | (CuiMap)       | (Smt)
Performance               | Fast           | Slow           | Fast
  • KP                    | 27 min.        | 78 min.        | 22 min.
  • VA                    | 23 min.        | 321 min.       | 68 min.
Coverage-KP (26890 terms) |                |                |
  • CUI with Norm         | 12165 (45.24%) | 12165 (45.24%) | 12165 (45.24%)
  • CUI with 1 synonym    | 1673 (6.22%)   | 1692 (6.29%)   | 1692 (6.29%)
  • CUI with 2 synonyms   | 168 (0.62%)    | 174 (0.65%)    | 174 (0.65%)
  • No CUI found          | 12884 (47.91%) | 12859 (47.82%) | 12859 (47.82%)
  • Total term-CUIs found | 31643          | 31660          | 31661
Coverage-VA (21221 terms) |                |                |
  • CUI with Norm         | 16937 (79.81%) | 16937 (79.81%) | 16937 (79.81%)
  • CUI with 1 synonym    | 221 (1.04%)    | 228 (1.07%)    | 228 (1.07%)
  • CUI with 2 synonyms   | 12 (0.06%)     | 15 (0.07%)     | 15 (0.07%)
  • No CUI found          | 4051 (19.09%)  | 4041 (19.04%)  | 4041 (19.04%)
  • Total term-CUIs found | 27478          | 27498          | 27498
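
The coverage percentages are each category's count divided by the number of input terms. As a quick check, the sketch below recomputes the Approach 1 figures for KP from the counts reported above; the function name coverage_breakdown is hypothetical.

```python
def coverage_breakdown(counts, total_terms):
    """Return each category's share of total_terms as a percentage (2 decimals)."""
    if sum(counts.values()) != total_terms:
        raise ValueError("category counts do not add up to the number of terms")
    return {label: round(100 * n / total_terms, 2) for label, n in counts.items()}

# Counts for Approach 1 on the KP term set (26890 terms), taken from the comparison.
kp_approach_1 = {
    "CUI with Norm": 12165,
    "CUI with 1 synonym": 1673,
    "CUI with 2 synonyms": 168,
    "No CUI found": 12884,
}

print(coverage_breakdown(kp_approach_1, 26890))
# {'CUI with Norm': 45.24, 'CUI with 1 synonym': 6.22, 'CUI with 2 synonyms': 0.62, 'No CUI found': 47.91}
```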

IV. Notes
In practice, we normalize only the key of each synonym pair, which can make lookups non-symmetric. For example, the synonym pair impaired|abnormality is stored in the database table as follows:

normalized key | synonym
impair         | abnormality
abnormality    | impaired

The resulting lookups (term -> normalized key -> synonym) are not symmetric:

  • abnormality -> abnormality -> impaired
  • impair -> impair -> abnormality (not symmetric: abnormality maps back to impaired, not impair)
  • impaired -> impair -> abnormality
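
The asymmetry can be reproduced with a small in-memory table. This is a sketch under stated assumptions: the one-entry NORM dictionary is a hypothetical stand-in for the norm step, and a Python dict stands in for the database table.

```python
# Hypothetical stand-in for the norm step: only "impaired" changes.
NORM = {"impaired": "impair"}

def norm(term):
    return NORM.get(term, term)

def build_table(pairs):
    """Store each synonym pair in both directions, normalizing only the key."""
    table = {}
    for a, b in pairs:
        table.setdefault(norm(a), set()).add(b)  # key normalized, synonym stored as-is
        table.setdefault(norm(b), set()).add(a)
    return table

def lookup(table, term):
    """Normalize the query term, then return its stored synonyms."""
    return sorted(table.get(norm(term), set()))

table = build_table([("impaired", "abnormality")])
print(lookup(table, "abnormality"))  # ['impaired']
print(lookup(table, "impaired"))     # ['abnormality']
print(lookup(table, "impair"))       # ['abnormality']
```

Here "impair" reaches "abnormality", but "abnormality" only returns "impaired", never "impair", because the synonym side of each row is stored without normalization.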