• LuiNorm
  • Java


Introduction

LuiNorm is a version of Norm used in the UMLS Metathesaurus to collect strings into terms (represented by LUIs). Like Norm, it provides a representation that abstracts away from case, inflection, stop words, genitive markers, punctuation, diacritics, ligatures, symbols, and word order. It provides an additional level of abstraction by ignoring spelling variation as well. LuiNorm differs from Norm in returning a single uninflected output for any input. In lvg, uninflection that does not create multiple forms, even for ambiguously inflected input, is called canonicalization. The canonicalization process involves a level of inaccuracy that Norm avoids. For example, the canonicalized form of "left" is "leaf", an unrelated word. As another example, "abcess" and "abscessed" have no direct relationship; they fall into the same canonical class through "abscess". In other words, the output of LuiNorm is not necessarily one of the Norm outputs for the same input.

LuiNorm is more or less equivalent to the combined lvg flow options -f:q7:g:rs:o:t:l:B:C:q8:w. That is,

  1. q7: Unicode Core Norm
    • map Unicode symbols and punctuation to ASCII
    • map Unicode to ASCII
    • split ligatures
    • strip diacritics
  2. g: then remove genitives,
  3. rs: then remove parenthetic plural forms of (s), (es), (ies), (S), (ES), and (IES)
  4. o: then replace punctuation with space,
  5. t: then remove stop words,
  6. l: then lowercase,
  7. B: then uninflect each word,
  8. C: then take each of the uninflected words and map them to their canonical form,
  9. q8: then strip or map non-ASCII Unicode characters,
  10. w: and finally word order sort.
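
The string-level steps of this flow can be approximated with the Java standard library alone. The following is a minimal sketch, not the lvg implementation: the class name, the stop-word sample, and the regular expressions are assumptions made for illustration, and steps B and C are omitted because they depend on the lexicon.

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Toy approximation of the string-level steps of the flow; not the lvg implementation. */
final class LuiNormFlowSketch {

    // Assumption: a small illustrative stop-word sample, not the official lvg list.
    private static final Set<String> STOP_WORDS =
            Set.of("of", "and", "with", "for", "to", "in", "by", "on", "the");

    static String normalize(String input) {
        // q7 (rough stand-in): NFKD decomposition splits compatibility ligatures,
        // and removing combining marks strips diacritics.
        String s = Normalizer.normalize(input, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}+", "");
        // g (simplified): drop the genitive marker 's at the end of a word.
        s = s.replaceAll("'[sS]\\b", "");
        // rs (simplified): drop parenthetic plurals attached directly to a word.
        s = s.replaceAll("(?<=\\p{Alpha})\\((?:s|es|ies|S|ES|IES)\\)", "");
        // o: replace punctuation with a space.
        s = s.replaceAll("\\p{Punct}", " ");

        return Arrays.stream(s.split("\\s+"))
                // t: remove stop words (compared case-insensitively).
                .filter(w -> !STOP_WORDS.contains(w.toLowerCase(Locale.ROOT)))
                // l: lowercase.
                .map(w -> w.toLowerCase(Locale.ROOT))
                // B, C: uninflection and canonicalization need the lexicon; omitted here.
                // q8 (simplified): strip any remaining non-ASCII characters.
                .map(w -> w.replaceAll("[^\\x00-\\x7F]", ""))
                .filter(w -> !w.isEmpty())
                // w: sort the words.
                .sorted()
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Inj oth musc(s)/tend(s)"));               // inj musc oth tend
        System.out.println(normalize("Abdomen CT Adrenal Mass(es) Bilateral")); // abdomen adrenal bilateral ct mass
    }
}
```

Unlike LuiNorm, this sketch removes every parenthetic (s), (es), or (ies) that follows a letter, so it would mishandle chemical names and equations, and inflected words are left unchanged.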

The process of canonicalization pre-computes all uninflected forms and then arranges them into classes composed of terms that could be expanded to the same inflected form. The canonical form is an arbitrarily chosen member of the class and represents all of its members. For example, the terms "left", "leave", and "leaf" are all included in one such class, and the canonical form is "leaf". It is chosen by the following principles:

  1. representative member is chosen from the lexicon if possible
  2. representative member does not contain non-ASCII characters if possible
  3. shortest member of the class
  4. alphabetically first
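
One way to read these principles is as a chain of tie-breakers applied in the listed order. The sketch below assumes that reading; the class name, the method names, and the toy lexicon are hypothetical and are not part of lvg.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Illustrative selection of a class representative; assumes the principles are applied in order. */
final class CanonicalFormSketch {

    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    static String chooseCanonical(List<String> classMembers, Set<String> lexicon) {
        Comparator<String> byPrinciples = Comparator
                // 1. lexicon members first (false sorts before true)
                .comparing((String m) -> !lexicon.contains(m))
                // 2. ASCII-only members first
                .thenComparing(m -> !isAscii(m))
                // 3. shortest member
                .thenComparingInt(String::length)
                // 4. alphabetically first (code-point order, adequate for lowercase ASCII)
                .thenComparing(Comparator.naturalOrder());
        return classMembers.stream()
                .min(byPrinciples)
                .orElseThrow(() -> new IllegalArgumentException("empty canonical class"));
    }

    public static void main(String[] args) {
        // Assumption: all three words are treated as lexicon entries in this toy example.
        Set<String> lexicon = Set.of("left", "leave", "leaf");
        System.out.println(chooseCanonical(List.of("left", "leave", "leaf"), lexicon)); // leaf
    }
}
```

Under these assumptions, "left" and "leaf" tie on the first three criteria, and "leaf" wins alphabetically, which matches the example above.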

Earlier versions (before 2001) of Norm used this canonicalization approach to uninflection. LuiNorm provides backward compatibility with those earlier versions. It is useful when there is a need to ensure a one-to-one correspondence between an input term and an output term, as in the case of LUI assignment. Otherwise, Norm is the recommended normalization method.

Since 2004, LuiNorm abstracts away from spelling variation by using the lexical name (base form), rather than the citation form, as the uninflected form. For example, "coloring" and "colouring" are both uninflected as "color". In addition, in the 2004 release, LuiNorm was enhanced to strip diacritics, split ligatures, and return the synonyms of Unicode symbols if the character is not ASCII, a diacritic, or a ligature. UTF-8 is used for the input and output in Norm.

In the 2005 release, LuiNorm was enhanced to remove the parenthetic plural forms (s), (es), (ies), (S), (ES), and (IES). However, LuiNorm does not remove these patterns when they are not plural forms, such as in chemical terms, protein names, or mathematical equations. For example, "Inj oth musc(s)/tend(s)" is normalized to "inj musc oth tend" and "Abdomen CT Adrenal Mass(es) Bilateral" is normalized to "abdomen adrenal bilateral ct mass".

In the 2006 release, LuiNorm was enhanced to handle multi-word terms (with irregular inflectional variants). For example, "club feet" and "club-feet" are both normalized to "club foot".

In the 2007 release, LuiNorm was enhanced to produce ASCII-only output by returning the symbol name at the end of normalization if the result is not ASCII. Also, the canonicalization algorithm was enhanced to generate fewer canonical classes and a larger range of canonical neighborhood.

In the 2008 release, LuiNorm was enhanced to use Unicode core norm (-f:q7) to convert non-ASCII Unicode characters to ASCII. This operation includes mapping Unicode symbols and punctuation to ASCII, mapping Unicode to ASCII, splitting ligatures, and stripping diacritics. Another flow component (-f:q8) is used at the end of the normalization to strip or map any remaining non-ASCII Unicode characters, ensuring pure ASCII output.

Setup

Follow the installation instructions to install the Lexical Tools and run the luiNorm program. Check the following items only if you did not use the provided script to install the Lexical Tools.

  • CLASSPATH:
    1. include the Lexical Tools distribution jar file, ${LVG_DIR}/lib/lvg${YEAR}dist.jar, in your CLASSPATH
    2. include the lvg top directory, ${LVG_DIR}, in your CLASSPATH

  • Database: use the default DB, HSqlDb, or your own DB (requires reloading the tables).

  • Configuration File: assign the full path of the top directory of lvg${YEAR} to the variable LVG_DIR in the configuration file, ${LVG_DIR}/data/config/lvg.properties

Test Run

Output Format

LuiNorm copies each input line from standard input to standard output and appends the canonical normalized term. The output consists of:

  • Input line: This may be one or more fields.
  • Output term: This is the canonical normalized term from the input line.
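
The shape of an output line can be sketched as follows. This is an illustration only: it assumes that fields are separated by "|", and the class name, the termField parameter, and the normalizer argument are hypothetical stand-ins rather than part of the lvg API.

```java
import java.util.function.UnaryOperator;

/** Illustrates the output line layout only; the normalizer is a caller-supplied stand-in. */
final class OutputFormatSketch {

    // Assumption: fields are separated by a vertical bar.
    static final String FIELD_SEPARATOR = "|";

    /** Returns the unchanged input line followed by the normalized form of one of its fields. */
    static String appendTerm(String inputLine, int termField, UnaryOperator<String> normalizer) {
        String[] fields = inputLine.split("\\|", -1); // -1 keeps trailing empty fields
        String term = fields[termField];              // zero-based index in this sketch
        return inputLine + FIELD_SEPARATOR + normalizer.apply(term);
    }

    public static void main(String[] args) {
        // The lambda stands in for the real normalization of the chosen field.
        System.out.println(appendTerm("id1|club feet", 1, t -> "club foot")); // id1|club feet|club foot
    }
}
```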

Global Behavior Options

Please refer to the design document.

Input Field Options

Please refer to the design document.