Sorting order of base forms (for citation form and spelling variants)

Official Announcement

I. Introduction
Base forms are the uninflected forms of a lexical term. They include the citation form (base=...) and spelling variants (spelling_variant=...). The citation form is not a preferred term. Before the 2013 release, it was an arbitrarily chosen base form of a lexical record. In 2014, an enhanced algorithm was implemented to choose the citation form uniquely, both to support the LexRecord cross-reference check task and to improve other NLP tasks. All base forms (the citation form and all spelling variants) are sorted automatically during the LexBuilding process, in the order described below. The citation form is then assigned to the first entry in the sorted list of base forms.

II. Sorting Order Details
The sorting order applied in the 2014 release by the Lexical System Group (LSG) is detailed below:

  1. Pure ASCII first
  2. No punctuation first
  3. Shorter length first
  4. Alphabetical order
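
The four criteria above can be read as a single composite sort key. The following Python sketch is illustrative only and is not the LSG Java API. The function names are hypothetical, and two details are assumptions that the list above does not specify: "punctuation" is taken to mean any character in a Unicode punctuation category, and alphabetical order is taken to be case-insensitive with the original string as the final tiebreaker.

```python
import unicodedata


def has_punctuation(term: str) -> bool:
    """True if any character falls in a Unicode punctuation category (P*).

    Assumption: the source does not define "punctuation"; this is one reading.
    """
    return any(unicodedata.category(ch).startswith("P") for ch in term)


def base_form_sort_key(term: str):
    """Composite key mirroring the four criteria, in priority order."""
    return (
        not term.isascii(),     # 1. pure ASCII first (False sorts before True)
        has_punctuation(term),  # 2. no punctuation first
        len(term),              # 3. shorter length first
        term.lower(),           # 4. alphabetical order (assumed case-insensitive)
        term,                   # final tiebreaker so the order is deterministic
    )


def sort_base_forms(base_forms):
    """Return the base forms sorted by the composite key."""
    return sorted(base_forms, key=base_form_sort_key)


def choose_citation_form(base_forms):
    """The citation form is the first entry of the sorted base forms."""
    return sort_base_forms(base_forms)[0]
```

Under these assumptions, sort_base_forms(["coöperate", "co-operate", "cooperate"]) places "cooperate" first, then "co-operate", then "coöperate", so "cooperate" would be chosen as the citation form.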

The Java API for this base sorting algorithm is available at:

IV. Impact Tests
In theory, results from Lexical Tools flow components that are associated with citation forms might differ, because the new sorting order might assign different base forms as citation forms. These changes are not considered errors, because citation forms were chosen arbitrarily in previous releases. To confirm this inference, we ran a number of tests on the 2013 release and compared results from Lexicons built with and without the new sorting order of base forms. Our observations:

  • Only 6 flow components associated with citation forms have different results
  • The proportion of differing results from these 6 flow components is very small, because only ~7.15% of LexRecords have a changed citation form under the new sorting order
  • No unexpected issues were found in these tests

The results of these tests are below:

  • Lexicon.2013:
    About 7.15% (= 33,598 / 469,992) of LexRecords have changed citation forms. Both Lexicons are available below:

  • Lexical Tools Unit Test:
    We ran all unit tests for Lexical Tools with database tables derived from the Lexicon with the base sorting order. The tests cover 62 flow components and 47 options. As expected, results from 6 flow components associated with citation forms have changed:
    • Flow components: 6 flow components have different results
      • -f:a, acronym expansions use citation forms as references fa.diff
      • -f:A, acronyms use citation forms as references fA.diff
      • -f:An, antiNorm includes the citation form flow component (-f:Ct) fAn.diff
      • -f:Bn, normalized uninflected words use citation forms fBn.diff
      • -f:Ct, citation forms are changed fCt.diff
      • -f:nom, nominalization uses citation forms as references fnom.diff
      • -f:N, normalize uses the citation form flow component (-f:Ct) fN.diff
    • Options: no option has different results

  • UMLS.2013AA:
    We tested Norm, with database tables derived from the Lexicon with the base sorting order, on MRCONSO.RRF to generate MRXNS_ENG.RRF and MRXNW_ENG.RRF. As expected, only a small portion of the terms and words associated with changed citation forms have different results: