Classification Types

I. Introduction

The classification type (CT) is a proposed enhancement to the Lexicon, intended to improve the performance of NLP applications that use it. A classification type can be archaic, source, informal, or other.

  • Terms classified as archaic, such as cozen, colde and benight, are considered no longer in common use in modern corpora (such as MEDLINE). These terms may have modern equivalents in the same lexical record (colde for cold) or in separate ones (ye for the). Archaic terms can be excluded when dealing with modern corpora.
  • Spelling variants from non-US varieties of English need to be normalized to US English when the source text originates outside the US. For example, British English (analyse, leukaemia, tumour) can be normalized to US English (analyze, leukemia, tumor). These terms are classified as source.
  • Consumers often use informal language when they ask questions. For example, bomb for success and grandpa for grandfather are used primarily in colloquial contexts. Automated consumer question understanding could perform better if the Lexicon provided informal terms with their cross-referenced (CR) formal terms (synonyms). For example, with query expansion that substitutes formal synonyms for informal terms, grandpa (no concept found in UMLS) can be mapped to grandfather (C0337475).
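
The normalization and query-expansion ideas above can be sketched as a simple token lookup. This is a minimal illustration, not the Lexicon's implementation: the dictionaries, the function name normalize_tokens, and the drop_archaic flag are all hypothetical, and the entries are only the examples quoted in the text.

```python
# Hypothetical lookup tables; a real system would derive them from
# Lexicon records tagged with the corresponding classification types.
SOURCE_TO_US = {"analyse": "analyze", "leukaemia": "leukemia", "tumour": "tumor"}
INFORMAL_TO_FORMAL = {"grandpa": "grandfather"}
ARCHAIC = {"cozen", "colde", "benight", "ye"}


def normalize_tokens(tokens, drop_archaic=False):
    """Lowercase tokens, then map source and informal terms to US/formal forms."""
    result = []
    for token in tokens:
        key = token.lower()
        if drop_archaic and key in ARCHAIC:
            continue  # optionally exclude archaic terms for modern corpora
        key = SOURCE_TO_US.get(key, key)        # British -> US spelling
        key = INFORMAL_TO_FORMAL.get(key, key)  # informal -> formal synonym
        result.append(key)
    return result
```

With these tables, normalize_tokens(["My", "grandpa", "has", "a", "tumour"]) yields tokens in which grandpa and tumour are replaced by grandfather and tumor, so a downstream concept lookup sees only the formal, US-spelled forms.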

II. Design

III. Implementation

  • New records:
    NLM linguists add classification types to new lexical records during lexicon building through LexBuild, a Web-based tool. New GUI components for adding CTs, together with enhanced LexCheck software that validates the syntax and content of CTs, are integrated into LexBuild.

  • Existing records:
    CT tagging on existing lexical records is also done through LexBuild.
    • class_type=unassigned is added to all existing records and is removed from each record once that record has been tagged.
      • Back up (and remove suffix) all DB tables to ${LEX_BUILD}/data/DbTables/Tables.mmddyy
      • Inject class_type=unassigned into the DB Tables files
        • shell> cd ${LEX_BUILD}/Tools/LoadDb
        • shell> injectClassTypeUnassigned
          input directory: ${LEX_BUILD}/data/DbTables/Tables.mmddyy
          output directory: ${LEX_BUILD}/data/DbTables/Tables.mmddyy.ij
        • Link Tables.mmddyy.ij to Tables
      • Load the Tables files (with class_type=unassigned) into the DB
        • shell> DbScript
          6, 12 (drop tables, all)
          1, 12 (create tables, all)
          2, 12 (load tables, all)
    • Computer-aided features for retrieving records by specified patterns, such as suffix, prefix, substring, category, and EUI, are implemented to support systematic tagging.
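
Pattern-based retrieval for systematic tagging might look like the following sketch. It does not reflect LexBuild's actual code or data model: LexRecord, select_records, and tag_records are hypothetical names, the record is reduced to four fields, and the EUIs are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class LexRecord:
    """Hypothetical, simplified stand-in for a lexical record."""
    eui: str
    base: str
    category: str
    class_type: str = "unassigned"


def select_records(records, prefix=None, suffix=None, substring=None,
                   category=None, eui=None):
    """Return the records that satisfy every pattern that was supplied."""
    selected = []
    for rec in records:
        if prefix is not None and not rec.base.startswith(prefix):
            continue
        if suffix is not None and not rec.base.endswith(suffix):
            continue
        if substring is not None and substring not in rec.base:
            continue
        if category is not None and rec.category != category:
            continue
        if eui is not None and rec.eui != eui:
            continue
        selected.append(rec)
    return selected


def tag_records(records, class_type):
    """Overwrite the class type (e.g. the unassigned placeholder) on each record."""
    for rec in records:
        rec.class_type = class_type


# Made-up EUIs, for illustration only.
records = [
    LexRecord("E0000001", "tumour", "noun"),
    LexRecord("E0000002", "colour", "noun"),
    LexRecord("E0000003", "analyse", "verb"),
]
# Tag every noun ending in "our" as a source (British spelling) term.
tag_records(select_records(records, suffix="our", category="noun"), "source")
# Records still awaiting review keep the unassigned placeholder.
untagged = [rec.base for rec in records if rec.class_type == "unassigned"]
```

Combining filters with a logical AND keeps each batch small enough for a linguist to review before the tag is applied.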

  • Lexicon tables:
    Post-processing programs are implemented to generate new Lexicon tables for the annual Lexicon release, covering archaic terms, spelling variants with their sources, and informal terms with their formal synonyms.
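
As a rough sketch of such a post-processing step, tagged terms could be grouped into one table per classification type. Everything here is an assumption made for illustration: the TaggedTerm structure, the build_tables function, the pipe-delimited row layout, the column order, and the EUIs are hypothetical and are not taken from the released tables.

```python
from dataclasses import dataclass


@dataclass
class TaggedTerm:
    """Hypothetical flattened view of a term and its classification type."""
    eui: str
    term: str
    class_type: str   # "archaic", "source", "informal", or "other"
    detail: str = ""  # assumed: source name, or the formal synonym


def build_tables(terms):
    """Group terms into rows for three tables; the row layout is assumed."""
    tables = {"archaic": [], "source": [], "informal": []}
    for t in terms:
        if t.class_type == "archaic":
            tables["archaic"].append(f"{t.eui}|{t.term}")
        elif t.class_type in ("source", "informal"):
            # detail holds the source for spelling variants and the
            # formal synonym for informal terms
            tables[t.class_type].append(f"{t.eui}|{t.term}|{t.detail}")
        # terms tagged "other" are not written to any of these tables
    return tables


# Made-up EUIs, for illustration only.
tables = build_tables([
    TaggedTerm("E0000001", "colde", "archaic"),
    TaggedTerm("E0000002", "tumour", "source", "British"),
    TaggedTerm("E0000003", "grandpa", "informal", "grandfather"),
    TaggedTerm("E0000004", "cold", "other"),
])
```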