Canonicalize

  • Short Description: Retrieves the pre-computed canonical form of the input, which is assumed to be an uninflected form.

  • Full Description:

    A core LVG technique is to "uninflect" input terms to their base form. This process occasionally results in two or more legitimate uninflected forms for the same inflected input.

    For example, left uninflects to both left and leave, reflecting its ambiguity as an adjective or a verb. One technique for managing this ambiguity is to produce only one "canonical" base form for any given input term. Canonicalization pre-computes all uninflected forms and then arranges them into classes of terms that could be expanded to the same inflected form. The canonical form is an arbitrarily chosen member of this class that represents all of its members.

    For example, the terms left, leave, and leaf are all included in one such class, and the canonical form is leaf: members are ranked by length, shortest first, and then in alphabetical order. Additionally, the chosen member is a form from the lexicon, and pure ASCII, if possible. This is an attempt to limit the number of word fragments that show up as canonical representations of a class of terms.
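
    The selection rule described above can be sketched as a comparator. This is a minimal illustration, not the LVG implementation: the class name, method names, and the in-memory lexicon set are hypothetical, and the relative priority of the lexicon and ASCII preferences over length is an assumption, since the text does not state it.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not part of the LVG source code.
class CanonicalChoiceSketch {

    // True when every character is in the 7-bit ASCII range.
    static boolean isPureAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    // Picks one representative of a class of uninflected forms.
    // Assumed priority: lexicon members, then pure ASCII forms,
    // then the shortest form, then alphabetical order.
    static String chooseCanonical(List<String> members, Set<String> lexicon) {
        return members.stream()
            .min(Comparator
                .comparing((String s) -> !lexicon.contains(s))
                .thenComparing((String s) -> !isPureAscii(s))
                .thenComparingInt(String::length)
                .thenComparing(Comparator.<String>naturalOrder()))
            .orElseThrow(() -> new IllegalArgumentException("empty class"));
    }

    public static void main(String[] args) {
        List<String> members = List.of("left", "leave", "leaf");
        Set<String> lexicon = Set.of("left", "leave", "leaf");
        // left and leaf tie on length; leaf sorts first alphabetically.
        System.out.println(chooseCanonical(members, lexicon)); // prints "leaf"
    }
}
```

    Because the ordering is deterministic, every member of a class maps to the same representative regardless of the order in which the members are listed.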

    In addition, spelling variants return the same canonical form, which is obtained by using the citation form. For example, "analog" and "analogue" have the same canonical form, "analog". This flow component always returns exactly one record.

    A set of numbers is returned in the additional information output field when the -m option is specified. These numbers are the numeric form of the canonical forms. Please refer to the canonical form design documents for details.

  • Difference: None

  • Features:
    1. The input term is viewed as a sequence of words. Each word (assumed to be uninflected) is looked up to find its canonical form(s). What is returned is the first combination of these canonical form(s).


  • Symbol: C

  • Examples:
    
    shell> lvg -f:C
    being
    being|i|2047|16777215|C|1|
    
    shell> lvg -f:C -m
    being
    being|i|2047|16777215|C|1|206326|
    
    color
    color|color|2047|16777215|C|1|383975|
    
    colour
    colour|color|2047|16777215|C|1|383975|
    
    colored
    colored|color|2047|16777215|C|1|383975|
    
    coloured
    coloured|color|2047|16777215|C|1|383975|
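
    The records shown above are pipe-delimited, so the canonical form and the numeric value added by -m can be pulled out with a simple split. The sketch below is illustrative only: the class and method names are hypothetical, and the field positions (output term second, additional information seventh) are inferred from the sample records on this page rather than from a format specification.

```java
// Hypothetical sketch; field positions are inferred from the sample records.
class LvgRecordSketch {

    // Returns the second field, which holds the canonical form in the samples.
    static String outputTerm(String record) {
        return record.split("\\|")[1];
    }

    // Returns the seventh field (present only with -m), or "" when absent.
    static String additionalInfo(String record) {
        String[] fields = record.split("\\|");
        return fields.length > 6 ? fields[6] : "";
    }

    public static void main(String[] args) {
        String withM = "colour|color|2047|16777215|C|1|383975|";
        System.out.println(outputTerm(withM));     // prints "color"
        System.out.println(additionalInfo(withM)); // prints "383975"
    }
}
```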
    

  • Implementation Logic:
    1. Split the input term into words by using StringTokenizer.
    2. Look up the canonical form of each word in the Canonical table in the database.
    3. Return the combination of the canonical forms found for all words in the input term.
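
    The three steps above can be sketched as follows. This is not the code in ToCanonicalize.java: the class name is hypothetical, an in-memory Map stands in for the Canonical database table, and passing unknown words through unchanged is an assumption, since the behavior for words missing from the table is not described here.

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.StringTokenizer;

// Hypothetical sketch; a Map stands in for the Canonical database table.
class CanonicalizeSketch {

    static String canonicalize(String term, Map<String, String> canonicalTable) {
        // Step 1: split the input term into words on whitespace.
        StringTokenizer words = new StringTokenizer(term);
        StringJoiner result = new StringJoiner(" ");
        while (words.hasMoreTokens()) {
            String word = words.nextToken();
            // Step 2: look up the canonical form of each word.
            // Assumption: a word with no entry is kept as-is.
            result.add(canonicalTable.getOrDefault(word, word));
        }
        // Step 3: return the combined canonical forms as one term.
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> table = Map.of(
            "colour", "color",
            "color", "color",
            "analogue", "analog",
            "analog", "analog");
        System.out.println(canonicalize("colour analogue", table)); // prints "color analog"
    }
}
```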

  • Source Code: ToCanonicalize.java

  • Hierarchy: Object -> Transformation -> ToCanonicalize