Canonize Algorithm

Introduction:

Program, Canonize, creates the canonical or equivalent class forms, given a set of words. This program is run to create the luiNorm canonical forms when a new or updated set of words is added. Typically this program is used (run) once a year. The words are in the forthcoming UMLS Metathesaurus and Specialist Lexicon are the input words for this Canonize program.

A core LVG technique is to "uninflect" input terms to their base forms. This process occasionally results in two legitimate uninflected forms for the same inflected input. For example, left uninflects to both left and leave reflecting its ambiguity as either adjective or verb. A technique to manage this ambiguity produces only one "canonical" base form for any given input term. The process of canonicalization pre-computes all uninflected forms (of all bases and their spelling variants) and then arranges these into classes composed of terms that could be expanded to the same inflected form. The canonical form is an arbitrarily chosen member of this class and represents all the members of the class. For example, the terms left, leave, and leaf are all included in one such class, and the canonical form is leaf, known to lexicon first, pure ASCII, shortest, and then the alphabetical order. The member of the class is chosen to be a form from the lexicon if possible. This is an attempt to limit the number of word fragments that shows up as canonical representations of the class of terms.

In the Java implementation, spelling variants knowledge is also implemented in the algorithm. In other words, spelling variants will have same canonical form. For example, "dependent" and "dependant" are spelling variants and have same canonical form as "dependant".

The canonical form is, in reality, a representation of a class of terms, rather than a word. As such, its numeric representation of that class is just as valid.

Input

The input file includes a list of words, which are all words known in the target that you are going to. In our case, it's all the words from the Lexicon plus all the words from the Metathesaurus (atoms.data), put through wordInd -s:1, then sorted and uniqued. The sorting and uniquing make the canonize program run more efficiently.

Output:

A file includes all canonized data for all uninflected words from UMLS Metathesaurus and The Specialist Lexicon. The format of the output is as follows:

uninflected formcanonical formcanonical Id

For example:

leafleaf436283
leftleaf436283
leaveleaf436283

Algorithm:

I. Pre-Process

  • Lexicon: Generate an unique word list from lexicon by using tokenize flow.
    => get inflections (field 1) from infl.data
  • Metathesaurus: Get the word list file, atoms.data, from Metathesaurus.
    => get atoms.data (from OCCS)
  • Combine these two files into a file, tokenize words (from terms), lowercase words, and filter out duplicated words, and then sort it out alphabetically.
  • Get uninflected forms (base forms) of words from above file (baseList.data).
  • Get all base forms (baseFromLex.data, including spelling variants) from LEXICON (infl.data). This step is needed for adding multiple words base forms in the list. For examples, "proof-read" and "proof read" are tokenized into "proof" and "read" from previous steps and thus deleted.
  • Combine base forms from above two (baseList.data and baseFromLex.data).
  • Lowercased, sort and unique all uninflected forms (base forms) from above file.

II. Core-Process

  • Go through this process for all uninflected forms (uniqueBaseList.data) from the Pre-process
  • Update base list: base forms and spelling variants
    • Add base to base list
    • Get spelling Variants base (inflection=1) from lvg and add them to base list
    • Make sure it is unique in base list
  • Update inflectional forms of all terms of above base list
    • Update variable "classIdByBase_" by checking if base exist in DB canon.BASE table
    • Update inflection list
      • Get inflectional variants from lvg
      • Add base form (unique) to inflection list if no inflectional variants if found (words not in Lexicon may not have inflection forms).
      • Make sure it is unique in inflection list
  • Check if this new equivalent class is related to any existed equivalent class. An equivalent class is related to another equivalent class if
    1. one of their base forms is the same (existed classId in canon.BASE table)
    2. one of their inflectional forms is the same (existed classId in canon.INFLECTION table)

    Check if this new equivalent class is related to any existed equivalent class:

    • If not:
      • create a new equivalent class with base
      • add this new equivalent class to canon.CANON
      • Update canon.BASE
      • Update canon.INFLECTION
    • If yes:
      • If the base is related to two classes by inflections (ie. a base with multiple inflections and related to different unrelated classes)
        => Combine two classes by updating canon.CANON and update classId in canon.INFLECTION (use smaller ID)
      • Add the base form to the appropriated equivalent class in canon.CANON
      • Update classId and citation form(s) to canon.BASE
      • Update classId and inflection form(s) to canon.INFLECTION

    Notes:

    • Suggest to use Java class Hashtable( ) for canon.BASE, canon.INFLECTION, canon.CANON (instead of using DB) to speed up the searching process. The "base" and "class Id" will be the key and value for base hash table. The "inflection" and "class Id" are used to be the key and value for inflection hash table. Thus, use these two hash tables to decide and map the related equivalent class.
    • All input terms and terms during computations are lowercased.
    • Due to the size limit of JVM, database should be used to replace Hashtable(). Three database tables are used in this application.

    • canon.BASE: used to check if equivalent class are related by base and spelling variants base
      BASE
      NameTypePropertiesNotes
      basevarchar(100)Not null, primary key (Unique)base (spelling vars)
      classIdintNot null, indexEquivalent class id

    • canon.BASE: used to check if equivalent class are related by inflectional variants
      INFLECTION
      NameTypePropertiesNotes
      inflectionvarchar(100)Not null, primary key (Unique)Base
      classIdintNot null, indexEquivalent class id

    • canon.CANON: used to store equivalent class for canonical forms
      CANON
      NameTypePropertiesNotes
      basevarchar(100)Not null, indexBase
      classIdintNot null, indexEquivalent class id

III. Post-Process

  • Generate canonical form with canonical Id to canonical.data
  • Go through canon.CANON and print out each equivalent class with non-zero bases and reassign classNo. The canonical form is selected from the equivalent class as follows:
    • in LEXICON
    • pure ASCII
    • shortest (in length)
    • alphabetical order
  • Check the output of canon form:
    • Run option 7 in ${CANON_BIN}/1.RunCanonAll
      • check with ${CANON_DATA}/nonAscii.data
      • check with ${CANON_DATA}/canonical.nonAscii.char
      • check with ${CANON_DATA}/canonical.nonAscii.line

    • HSqlDb is used to replace MySql for handling Unicode correctly in 2009. Also, JavaDb was implemented and dropped due to its slow performance (100 time slower).

Data Structure of equivalent class:

  • Id
  • Base forms (uninflected form)

  • The canonical form is chosen from base forms:
    • exists in lexicon
    • term without non-ASCII characters first
    • then, shorter length from base list
    • then, alphabetical order from base list
    in the same equivalent class.

Comparison to the algorithm in C version:

The Java version uses a forward algorithm to compute all canonical forms. In other words, the Java version retrieve all citation forms for all base words (input); retrieve uninflected form for all citation forms; then compares citation forms and inflected forms to see if words are related.

The implementation in C uses reversed algorithm. Again, it define two based forms are related when they have same inflected forms. However, it limits the domain of all uninflected forms to the words list from Lexicon and Metathesaurus. In other words, it go through all words list to check if they have same base. If so, make the base form related.

The algorithm can be summarized as follows:

  • Use the words list as input (Lexicon and Metathesaurus).
  • Find base forms for all words
  • Use all base forms as key and put into a hash table (base table).
  • Initialize all value to 0 in base table
  • Go through words list again; find the base form for all words.
  • Mark value (class ID) by an incremental number if the key (base) is not found before.
  • Don't change value if it is not 0.
  • For those words with multiple base forms, choose the one with minimum value and make all other value to the minimum (related class).
  • The elements with same value (class ID) are related class.
  • Index on the class ID by creating another hash table.
  • Print out results.

The major differences between C and Java implementation are Java have wider range for a related equivalent class. This is caused by those words only exist in Metathesaurus and not in Lexicon. The inflectional forms (generated by rules) may make it related to a class which is known by Lexicon, and vice versa.