Normalize Uninflected Words

  • Short Description: Normalize uninflected words for an input term.

  • Full Description:

    This flow component retrieves normalized uninflected words. It gets the lexical name (citation form) of each word in the input term and returns all combinations of these lexical names. In the 2014 release, an algorithm was developed to uniquely determine the citation form and the order of the associated spelling variants. If the lexical name of a word is not found in the lexicon, the first item (in alphabetical order) on the rule-based uninflected term list is used.

    One heuristic within the uninflection flow deserves mention: words that uninflect by rule to more than ten forms are treated differently. In such cases, the rule-generated forms are not used; only the input form is used as the uninflected form. The reasoning is that aggressive rule-generated forms, when not pruned, can produce an explosive number of irrelevant forms.
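
    The pruning heuristic above can be sketched in Java as follows. This is a minimal illustration, not the code in ToNormUninflectWords.java: the class RuleFormLimitSketch, the method pruneRuleForms, and the constant MAX_RULE_FORMS are hypothetical names, the threshold of ten is taken from the description above, and the sample rule forms are invented.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the "more than ten rule-generated forms" heuristic. */
class RuleFormLimitSketch {
    // Assumed threshold, taken from the "more than ten forms" wording.
    static final int MAX_RULE_FORMS = 10;

    /**
     * Returns the forms to keep for one word: the rule-generated forms when
     * there are at most MAX_RULE_FORMS of them, otherwise only the input word.
     */
    static List<String> pruneRuleForms(String inputWord, List<String> ruleForms) {
        if (ruleForms.size() > MAX_RULE_FORMS) {
            // Too many candidates: fall back to the input form itself.
            return Collections.singletonList(inputWord);
        }
        return ruleForms;
    }

    public static void main(String[] args) {
        // Invented rule output, for illustration only.
        System.out.println(pruneRuleForms("leaves", Arrays.asList("leaf", "leave")));
    }
}
```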

    An additional heuristic has been implemented within the inflectional morphology unit to limit spurious variants. If a term goes through an uninflectional morphology mutation and the term is not known to the lexicon, but its rule-generated form is known to the lexicon, the variant is thrown out because it is likely to be wrong.

    The results are sorted by length, then in case-insensitive alphabetical order.
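
    The sort order can be illustrated with a small comparator. This is a sketch only and is not the source of Util.LvgComparator: the class LengthThenAlphaSketch and its members are hypothetical names, and it assumes that shorter terms come first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of a length-then-alphabetical ordering. */
class LengthThenAlphaSketch {
    // Assumption: shorter terms first; ties broken case-insensitively.
    static final Comparator<String> BY_LENGTH_THEN_ALPHA =
        Comparator.comparingInt(String::length)
                  .thenComparing(String.CASE_INSENSITIVE_ORDER);

    /** Returns a sorted copy and leaves the caller's list untouched. */
    static List<String> sortResults(List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(BY_LENGTH_THEN_ALPHA);
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortResults(Arrays.asList("colour", "Color", "be", "color")));
    }
}
```

    Because List.sort is stable, terms that differ only in case keep their original relative order.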

    The -m flag option adds no flow-specific detail here: "none" is added at the end of the output.

  • Difference: None (new flow component)

  • Features:
    1. The input term is viewed as a sequence of words, and each word is used to find its lexical name(s). If a word is not found in the Lexicon, an uninflected term is returned by rule. The result contains all combinations of lexical names/uninflected words.
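
    The "all combinations" behaviour amounts to a Cartesian product over the forms found for each word. The following Java sketch shows one way to build it; the class CombinationSketch and the method combine are hypothetical names, and joining words with a single space is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: one output term per choice of form for each word. */
class CombinationSketch {
    /**
     * formsPerWord.get(i) holds the candidate forms of the i-th word.
     * Returns every term built by picking one form per word, in word order.
     */
    static List<String> combine(List<List<String>> formsPerWord) {
        List<String> results = new ArrayList<>();
        boolean first = true;
        for (List<String> forms : formsPerWord) {
            List<String> next = new ArrayList<>();
            if (first) {
                // The first word seeds the result list.
                next.addAll(forms);
                first = false;
            } else {
                // Extend every partial term with every form of this word.
                for (String prefix : results) {
                    for (String form : forms) {
                        next.add(prefix + " " + form);
                    }
                }
            }
            results = next;
        }
        return results;
    }
}
```

    With two forms for the first word and one for the second, combine returns two terms; the output size is the product of the per-word form counts.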


  • Symbol: Bn

  • Examples:
    
    shell> lvg -f:Bn
    coloring
    coloring|color|2047|1|Bn|1|
    
    colouring
    colouring|color|2047|1|Bn|1|
    
    glutamines
    glutamines|glutamin|2047|1|Bn|1|
    
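
    Each output line above is a pipe-delimited record, which a calling program can split into fields. The Java sketch below is illustrative only: the class LvgRecordSketch is hypothetical, and the field names (input, output, category, inflection, flow history, flow number) are an assumed reading of the six fields shown in the examples.

```java
/** Hypothetical holder for one pipe-delimited output line. */
class LvgRecordSketch {
    final String input;
    final String output;
    final long category;      // assumed meaning of the third field
    final long inflection;    // assumed meaning of the fourth field
    final String flowHistory; // e.g. "Bn"
    final int flowNumber;

    private LvgRecordSketch(String[] f) {
        input = f[0];
        output = f[1];
        category = Long.parseLong(f[2]);
        inflection = Long.parseLong(f[3]);
        flowHistory = f[4];
        flowNumber = Integer.parseInt(f[5]);
    }

    /** Splits a line such as "coloring|color|2047|1|Bn|1|" into its fields. */
    static LvgRecordSketch parse(String line) {
        // String.split drops the empty string after the trailing '|'.
        String[] fields = line.split("\\|");
        if (fields.length < 6) {
            throw new IllegalArgumentException("expected at least 6 fields: " + line);
        }
        return new LvgRecordSketch(fields);
    }

    public static void main(String[] args) {
        LvgRecordSketch r = parse("colouring|color|2047|1|Bn|1|");
        // prints "colouring -> color via Bn"
        System.out.println(r.input + " -> " + r.output + " via " + r.flowHistory);
    }
}
```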

  • Implementation Logic:
    1. Tokenize the input term into words by using StringTokenizer.
    2. Find the lexical name for each word.
      • Find lexical names from facts (database).
      • If facts give no result, find uninflected terms by rule (trie).
      • Filter out rule-generated terms that are in the database.
      • Return the first rule-based uninflected term in alphabetical order.
    3. Lowercase all lexical names/uninflected terms.
    4. Check whether the total number of combinations is greater than the output limit defined in the configuration file.
      • If so, use the input term as the output.
      • Otherwise, return all combinations for all forms of each word in the input term.
    5. Sort the result by length, then in case-insensitive alphabetical order (Util.LvgComparator).
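
    Steps 1 through 4 can be sketched in Java as shown below. This is not the code in ToNormUninflectWords.java. The class WordLookupSketch and its methods are hypothetical; factTable, ruleEngine, and database are stand-ins for the fact lookup, the rule trie, and the lexicon database; and falling back to the word itself when no rule form survives filtering is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical sketch of the per-word lookup (steps 1 to 4). */
class WordLookupSketch {
    /** Returns, for each word of the term, its lowercased candidate forms. */
    static List<List<String>> formsPerWord(String term,
            Map<String, List<String>> factTable,          // stand-in for facts
            Function<String, List<String>> ruleEngine,    // stand-in for the trie
            Set<String> database) {                       // stand-in for the lexicon
        List<List<String>> result = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(term);        // step 1
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            List<String> forms = new ArrayList<>();
            List<String> facts = factTable.get(word);
            if (facts != null && !facts.isEmpty()) {
                forms.addAll(facts);                               // step 2: facts
            } else {
                // TreeSet keeps rule forms in (case-sensitive) alphabetical order.
                TreeSet<String> byRule = new TreeSet<>(ruleEngine.apply(word));
                byRule.removeAll(database);                        // step 2: filter
                // Assumption: keep the word itself if nothing survives.
                forms.add(byRule.isEmpty() ? word : byRule.first());
            }
            List<String> lower = new ArrayList<>();
            for (String form : forms) {
                lower.add(form.toLowerCase(Locale.ROOT));          // step 3
            }
            result.add(lower);
        }
        return result;
    }

    /** Step 4: true when the number of combinations exceeds the limit. */
    static boolean exceedsLimit(List<List<String>> formsPerWord, long maxOutputs) {
        long total = 1;
        for (List<String> forms : formsPerWord) {
            total *= forms.size();
            if (total > maxOutputs) {
                return true;   // stop early; also avoids overflow
            }
        }
        return false;
    }
}
```

    When exceedsLimit returns true, the caller would emit the input term unchanged; otherwise it would build and sort the combinations.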

  • Source Code: ToNormUninflectWords.java

  • Hierarchy: Object -> Transformation -> ToNormUninflectWords