Normalize

  • Short Description: Normalize the input text in a non-canonical way.

  • Full Description: This process abstracts away from case, inflection, citation, and word order. It also removes stop words, possessives, and parenthetic plural forms, strips diacritics, splits ligatures, normalizes non-ASCII Unicode to ASCII, and replaces punctuation in the input term with spaces. Specifically, this normalization is roughly equivalent to the combined flow options, applied in this order: -f:q0:g:rs:o:t:l:B:Ct:q7:q8:w. That is: map non-ASCII Unicode symbols and punctuation to ASCII, remove genitives, remove parenthetic plural forms, replace punctuation with spaces, remove stop words, lowercase, uninflect each word, take the citation form of each word, apply Unicode core normalization, strip or map any remaining non-ASCII Unicode, and finally sort the words in alphabetical order.

    This flow differs from the traditional norm flow (LuiNorm) in two respects. First, it retrieves all uninflected forms and then uses the citation form of each. Second, it does not map uninflected (citation) forms to canonical forms. This has several advantages: it does not require knowing the universe of words before indexing; it does not rely on an additional lookup into the canonical table; and it returns known ambiguity when appropriate.

    The -m option has no effect on this flow; "none" is added at the end of the output.

  • Difference:
    1. Differences arise from differences in the individual flow components
    2. Maps non-ASCII symbols and punctuation to ASCII
    3. Uses Unicode core norm to convert non-ASCII Unicode to ASCII, which performs the following (2008):
      • Map Unicode symbols and punctuation to ASCII
      • Map Unicode to ASCII
      • Split ligatures
      • Strip diacritics
    4. Adds removal of parenthetic plural forms before lowercasing (2007)
    5. Adds retrieval of the citation form for each base form before word-order sorting


  • Features:
    1. map Unicode symbols and punctuation to ASCII
    2. remove genitives
    3. remove parenthetic plural forms
    4. replace punctuation with spaces
    5. remove stop words
    6. lowercase
    7. uninflect each word
    8. retrieve a citation (first in the alphabetical order) for each uninflected word
    9. Unicode core norm to
      • map Unicode symbols and punctuation to ASCII
      • map Unicode to ASCII
      • split ligatures
      • strip diacritics
    10. strip or map non-ASCII Unicode characters
    11. sort words in alphabetical (ASCII) order


  • Symbol: N

  • Examples:
    
    shell> lvg -f:N
    left
    left|left|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    left|leave|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    Hodgkin's diseases, NOS
    Hodgkin's diseases, NOS|disease hodgkin|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    Down's Syndrome
    Down's Syndrome|down syndrome|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    Acetolyses
    Acetolyses|acetolyze|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    Acetolyses|acetolysis|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    Lung cancer
    Lung cancer|cancer lung|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    Cancer, lung
    Cancer, lung|cancer lung|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    Paget's disease-scapula
    Paget's disease-scapula|disease paget scapula|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    Scapula, Paget Disease
    Scapula, Paget Disease|disease paget scapula|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    Dysenterie amibienne (aiguë)
    Dysenterie amibienne (aiguë)|aigue amibienne dysenterie|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    Abdomen CT Adrenal Mass(es) Bilateral
    Abdomen CT Adrenal Mass(es) Bilateral|abdomen adrenal bilateral ct mass|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    sequelae of; injury, nerve, roots and plexus(es), spinal
    sequelae of; injury, nerve, roots and plexus(es), spinal|injury nerve plexus root sequela spinal|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    sequelae of; injury, nerve, roots and plexus(es), spinal|injury nerve plexus roots sequela spinal|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    proofread
    proofread|proofread|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    proof-read
    proof-read|proof read|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    proof read
    proof read|proof read|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    ɑ-Tech™
    ɑ-Tech™|alpha tech|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    “Quote”
    “Quote”|quote|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|
    
    
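
    The output lines above are pipe-delimited, so they are easy to post-process. The following is a minimal sketch of a parser, not part of the tool itself. The field names (input term, output term, categories, inflections, flow history, flow number) are an assumed reading of the columns shown in the examples, and LvgRecord, parse_line, and outputs_by_input are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class LvgRecord(NamedTuple):
    """One fielded output line; field names are assumptions, not official."""
    input_term: str
    output_term: str
    categories: int
    inflections: int
    flow_history: List[str]
    flow_number: int


def parse_line(line: str) -> LvgRecord:
    """Split one 'in|out|cat|infl|history|flow|' line into its fields."""
    fields = line.strip().split("|")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    return LvgRecord(
        input_term=fields[0],
        output_term=fields[1],
        categories=int(fields[2]),
        inflections=int(fields[3]),
        flow_history=fields[4].split("+"),  # e.g. ['q0', 'g', ..., 'w']
        flow_number=int(fields[5]),
    )


def outputs_by_input(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group normalized outputs by input term, keeping ambiguous results."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        if "|" in line:  # skip echoed input lines and blanks
            rec = parse_line(line)
            grouped[rec.input_term].append(rec.output_term)
    return dict(grouped)


sample = [
    "left",
    "left|left|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|",
    "left|leave|2047|1|q0+g+rs+o+t+l+B+Ct+q7+q8+w|1|",
]
print(outputs_by_input(sample))
```

    Grouping by input term keeps every normalized form when a term is ambiguous, as with "left", which yields both "left" and "leave".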

  • Implementation Logic:
    1. use flow component q0 to map non-ASCII Unicode symbols and punctuation to ASCII
    2. use flow component g to remove genitives
    3. use flow component rs to remove parenthetic plural forms
    4. use flow component o to replace punctuation with spaces
    5. use flow component t to strip stop words
    6. use flow component l to lowercase all characters
    7. use flow component B to uninflect words
    8. use flow component Ct to retrieve a citation form (first in alphabetical order) for each uninflected word
    9. use flow component q7 to normalize non-ASCII Unicode to ASCII
    10. use flow component q8 to strip or map non-ASCII Unicode characters
    11. use flow component w to sort words in ASCII order
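
    The order of the steps above can be illustrated with a toy pipeline. This is a simplified sketch, not the ToNormalize.java implementation: the stop-word list and punctuation map are small hypothetical stand-ins, and the lexicon-driven steps (B and Ct) are skipped, so inflected words such as "diseases" are left unchanged and only one output is produced per input.

```python
import re
import unicodedata

# Hypothetical stand-ins; the real tool uses its own data files.
STOP_WORDS = {"of", "and", "with", "for", "nos", "to", "in", "by", "on", "the"}
PUNCT_MAP = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'", "\u2013": "-"}


def normalize_sketch(term: str) -> str:
    """Approximate the N flow without the lexicon steps (B, Ct)."""
    # q0: map a few non-ASCII punctuation marks to ASCII
    text = "".join(PUNCT_MAP.get(ch, ch) for ch in term)
    # g: remove genitive 's
    text = re.sub(r"(\w)'s\b", r"\1", text)
    # rs: remove parenthetic plural forms such as (s), (es), (ies)
    text = re.sub(r"\((?:s|es|ies)\)", "", text)
    # o: replace remaining punctuation with spaces
    text = re.sub(r"[^\w\s]|_", " ", text)
    # t: strip stop words (matched case-insensitively)
    words = [w for w in text.split() if w.lower() not in STOP_WORDS]
    # l: lowercase
    words = [w.lower() for w in words]
    # B, Ct: uninflection and citation lookup need a lexicon; skipped here
    # q7: decompose characters and drop combining marks (diacritics)
    words = [
        "".join(c for c in unicodedata.normalize("NFKD", w)
                if not unicodedata.combining(c))
        for w in words
    ]
    # q8: strip anything that is still non-ASCII
    words = [w.encode("ascii", "ignore").decode("ascii") for w in words]
    # w: sort the surviving words (code-point order equals ASCII order here)
    return " ".join(sorted(w for w in words if w))


print(normalize_sketch("Cancer, lung"))
print(normalize_sketch("Dysenterie amibienne (aigu\u00eb)"))
```

    Because word order is normalized last, "Lung cancer" and "Cancer, lung" produce the same key in this sketch, which matches the behavior shown in the examples.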

  • Source Code: ToNormalize.java

  • Hierarchy: Object -> Transformation -> ToNormalize