

Introduction

Norm creates an abstract representation of text strings, allowing users to ignore alphabetic case, inflection, spelling variants, punctuation, genitive markers, stop words, diacritics, symbols, ligatures, and word order. The normalized string is a version of the original string in lower case, without punctuation, genitive markers, stop words, diacritics, or ligatures, with each word in its uninflected (citation) form and the words sorted in alphabetical order. Non-ASCII Unicode characters are normalized to ASCII by mapping punctuation and symbols to ASCII, mapping Unicode to ASCII, stripping diacritics, splitting ligatures, and stripping any remaining non-ASCII Unicode characters. Lexical variants that differ only in those ways have the same normalized form. Norm is used to create the normalized string and word indexes to the UMLS Metathesaurus and to access those indexes.

Normalization encapsulates the lvg flow options -f:q0:g:rs:o:t:l:B:Ct:q7:q8:w. That is,

  1. q0: map Unicode symbols and punctuation to ASCII,
  2. g: remove genitives,
  3. rs: then remove parenthetic plural forms of (s), (es), (ies), (S), (ES), and (IES),
  4. o: then replace punctuation with spaces,
  5. t: then remove stop words,
  6. l: then lowercase,
  7. B: then uninflect each word,
  8. Ct: then get citation form for each base form,
  9. q7: then apply Unicode core norm:
    • map Unicode symbols and punctuation to ASCII
    • map Unicode to ASCII
    • split ligatures
    • strip diacritics
  10. q8: then strip or map non-ASCII Unicode characters,
  11. w: and finally sort the words in alphabetic order.
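
The purely string-level steps of this flow can be illustrated with a minimal Java sketch. It is not the lvg implementation: it covers only g, o, t, l, and w, and it skips the steps that need the SPECIALIST lexicon or Unicode tables (q0, rs, B, Ct, q7, q8). The class name `NormFlowSketch`, the regular expressions, and the stop-word list are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class NormFlowSketch {
    // Assumption: a small illustrative stop-word list, not the list shipped with lvg.
    private static final Set<String> STOP_WORDS =
        Set.of("of", "and", "with", "for", "to", "in", "by", "on", "the");

    static String normalize(String term) {
        // g: remove genitive markers ("patient's" -> "patient", "boys'" -> "boys").
        String s = term.replaceAll("(?i)(\\w)'s\\b", "$1")
                       .replaceAll("(?i)(s)'(?!\\w)", "$1");
        // o: replace ASCII punctuation with spaces.
        s = s.replaceAll("\\p{Punct}", " ");
        return Arrays.stream(s.trim().split("\\s+"))
            .filter(w -> !w.isEmpty())
            // t: remove stop words (compared case-insensitively).
            .filter(w -> !STOP_WORDS.contains(w.toLowerCase(Locale.ROOT)))
            // l: lowercase each remaining word.
            .map(w -> w.toLowerCase(Locale.ROOT))
            // w: sort the words in alphabetic order.
            .sorted()
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Prints: arm left patient sided weakness
        System.out.println(normalize("The Patient's Left-Sided Weakness of the Arm"));
    }
}
```

Because no uninflection is performed here, the sketch leaves "sided" untouched; the real flow may reduce such words further using the lexicon.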

There may be more than one normalized form for a particular string. Some English inflected forms have more than one uninflected form. For example, "scleroses" could be the plural of the noun "sclerosis" or the third person singular of the verb "sclerose". In this version of Norm, multiple uninflected forms are returned for ambiguously inflected forms as recorded in the SPECIALIST lexicon.
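
As a rough illustration of how one input can yield several normalized forms, the Java sketch below expands every combination of uninflected forms and sorts the words of each combination. The lookup table `BASE_FORMS` is a hypothetical stand-in for the SPECIALIST lexicon and holds only the example from this section; the class and method names are likewise assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class AmbiguousNormSketch {
    // Hypothetical stand-in for a lexicon lookup: inflected word -> uninflected forms.
    private static final Map<String, List<String>> BASE_FORMS =
        Map.of("scleroses", List.of("sclerosis", "sclerose"));

    static Set<String> normalizedForms(String term) {
        // Each entry of 'partial' is one choice of base forms for the words seen so far.
        List<List<String>> partial = new ArrayList<>();
        partial.add(new ArrayList<>());
        for (String word : term.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            // Words missing from the table are kept unchanged.
            List<String> bases = BASE_FORMS.getOrDefault(word, List.of(word));
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : partial) {
                for (String base : bases) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(base);
                    next.add(extended);
                }
            }
            partial = next;
        }
        // Sort the words within each combination and collect the distinct results.
        Set<String> result = new TreeSet<>();
        for (List<String> words : partial) {
            List<String> sorted = new ArrayList<>(words);
            Collections.sort(sorted);
            result.add(String.join(" ", sorted));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [sclerose, sclerosis]
        System.out.println(normalizedForms("scleroses"));
    }
}
```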

In the 2004 release, Norm was enhanced to normalize spelling variants by returning the citation form of the uninflected base form instead of the base form itself. For example, both "coloring" and "colouring" are normalized to "color". Also in the 2004 release, Norm was enhanced to strip diacritics, split ligatures, and return the synonyms of Unicode symbols for characters that are not ASCII, diacritics, or ligatures. This feature was modified starting with the 2008 release (see below).

The minimum word size for normalization differs from the lvg default word size: in the normalization process, a word may be as short as one character.

In the 2006 release, Norm was enhanced to remove the parenthetic plural forms (s), (es), (ies), (S), (ES), and (IES). However, Norm does not remove these patterns when they are not plural forms, such as in chemical terms, proteins, or mathematical equations. For example, "Inj oth musc(s)/tend(s)" is normalized to "inj musc oth tend" and "Abdomen CT Adrenal Mass(es) Bilateral" is normalized to "abdomen adrenal bilateral ct mass". Norm was also enhanced to better handle terms with irregular inflectional variants. For example, "proofread", "proof-read", and "proof read" are all normalized to "proof read".
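
The removal of parenthetic plural markers can be approximated with a single regular expression, as in the Java sketch below. This is an assumption-laden illustration, not the rule Norm applies: it removes a marker whenever it directly follows a letter, and it does not model the exceptions for chemical terms, proteins, or mathematical equations described above. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class ParentheticPluralSketch {
    // Matches (s), (es), (ies), (S), (ES), or (IES) directly after an ASCII letter.
    // Assumption: "follows a letter" is only a rough proxy for "is a plural marker".
    private static final Pattern PLURAL =
        Pattern.compile("(?<=\\p{Alpha})\\((?:s|es|ies|S|ES|IES)\\)");

    static String removeParentheticPlurals(String term) {
        return PLURAL.matcher(term).replaceAll("");
    }

    public static void main(String[] args) {
        // Prints: Inj oth musc/tend
        System.out.println(removeParentheticPlurals("Inj oth musc(s)/tend(s)"));
        // Prints: Abdomen CT Adrenal Mass Bilateral
        System.out.println(removeParentheticPlurals("Abdomen CT Adrenal Mass(es) Bilateral"));
    }
}
```

Replacing the remaining punctuation with spaces, lowercasing, and sorting the words then gives the normalized forms quoted above for these two examples.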

Unicode has come into common use in recent years. UTF-8 has been the default input and output format in norm (lvg) since 2004, and UTF-8 has been used in the SPECIALIST Lexicon since then as well. A citation form can therefore contain non-ASCII characters, which would cause the output of norm to contain non-ASCII characters. For example, the citation form of "varon" is "varón". Norm was enhanced in 2007 to resolve this issue and produce ASCII-only output.

In the 2008 release, Norm was enhanced to use Unicode core norm (-f:q7) to convert non-ASCII Unicode characters to ASCII. This operation includes mapping Unicode symbols and punctuation to ASCII, mapping Unicode to ASCII, splitting ligatures, and stripping diacritics. Another flow component (-f:q8) is then applied to strip or map any remaining non-ASCII Unicode characters, ensuring pure ASCII output.
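
A comparable effect can be sketched with the JDK alone, as shown below. This is not how lvg implements q7 and q8, which rely on their own mapping tables: the sketch uses `java.text.Normalizer` for decomposition, a tiny hand-written table (`LIGATURES`, an assumption) for ligatures that have no Unicode decomposition, and simply drops whatever non-ASCII characters remain instead of mapping them. Non-ASCII characters are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
import java.text.Normalizer;
import java.util.Map;

final class AsciiFoldSketch {
    // Assumption: a minimal table for ligatures that NFKD does not decompose
    // (ae, AE, oe, OE); lvg uses its own, larger tables.
    private static final Map<Character, String> LIGATURES = Map.of(
        '\u00E6', "ae", '\u00C6', "AE", '\u0153', "oe", '\u0152', "OE");

    static String toAscii(String text) {
        // Split the ligatures listed in the table above.
        StringBuilder mapped = new StringBuilder();
        for (char c : text.toCharArray()) {
            mapped.append(LIGATURES.getOrDefault(c, String.valueOf(c)));
        }
        // NFKD separates base letters from combining marks and expands
        // compatibility ligatures such as U+FB01 (fi).
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFKD);
        // Strip combining marks (diacritics), then drop anything still outside ASCII.
        return decomposed.replaceAll("\\p{M}+", "").replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(String[] args) {
        // "var\u00F3n" is "varon" with an acute accent on the o. Prints: varon
        System.out.println(toAscii("var\u00F3n"));
    }
}
```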

Setup

Follow the installation instructions to install the Lexical Tools and run the norm program. Check the following items only if you did not use the provided script to install the Lexical Tools.

  • CLASSPATH:
    1. include the Lexical Tools distribution jar file, ${LVG_DIR}/lib/lvg${YEAR}dist.jar, in your CLASSPATH.
    2. include the lvg top directory, ${LVG_DIR}, in your CLASSPATH.

  • Database: use the default DB (HSqlDb) or your own DB (which requires loading data into DB tables).

  • Configuration File: assign the full path of the top directory of lvg${YEAR} to a variable named LVG_DIR in the configuration file, ${LVG_DIR}/data/config/lvg.properties.

Test Run

Output Format

Norm copies its input from standard input to standard output with the normalized term appended. Output consists of:

  • Input line: This may be one or more fields.
  • Output term: This is the normalized term from the input line.
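
A consumer of this output only needs to separate the appended term from the copied input. The Java sketch below assumes that fields are separated by '|'; the text above does not name the separator, so treat it as an assumption and adjust it to your configuration. The class and method names are hypothetical, and the sample line reuses a term and its normalized form from the introduction.

```java
final class NormOutputSketch {
    // Assumption: fields are separated by '|'; the documentation above does not say.
    private static final char SEPARATOR = '|';

    // The normalized term is the last field of the output line.
    static String normalizedTerm(String outputLine) {
        int idx = outputLine.lastIndexOf(SEPARATOR);
        return idx < 0 ? outputLine : outputLine.substring(idx + 1);
    }

    // Everything before the last separator is the original input line.
    static String inputPart(String outputLine) {
        int idx = outputLine.lastIndexOf(SEPARATOR);
        return idx < 0 ? "" : outputLine.substring(0, idx);
    }

    public static void main(String[] args) {
        String line = "Abdomen CT Adrenal Mass(es) Bilateral|abdomen adrenal bilateral ct mass";
        System.out.println(inputPart(line));      // Abdomen CT Adrenal Mass(es) Bilateral
        System.out.println(normalizedTerm(line)); // abdomen adrenal bilateral ct mass
    }
}
```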

Global Behavior Options

Please refer to the design document.

Input Field Options

Please refer to the design document.