Generate Derivational Variants

  • Short Description: Generate derivational variants

  • Full Description:

    Derivational variants are terms that are related by a derivational process. In linguistics, derivation forms a new word from an existing word by adding or removing an affix (prefix or suffix). Through this process, the meaning and/or the syntactic category might change. For example, "un-happy" and "happi-ness" are derived from "happy" by a prefix and a suffix, respectively.

    In Lexical Tools, derivational pairs define direct derivations. A derivational pair includes two derivationally related terms (each with its base form and category) and is bi-directional. Only one affix is allowed in a derivational pair. For example, "un-happi-ness" is not a direct derivation of "happy"; instead, it is a derivation of "unhappy". Accordingly, "un-happi-ness" and "happy" do not compose a derivational pair. Often, the derivational variant has a different syntactic category from the original term.

    Derivational variants are generated either by facts, a case-insensitive lookup in a database table of known derivations, or by rules, which add, change, or remove common suffixes (case sensitive). Since the 2012 release, facts include prefix derivations, suffix derivations, and zero derivations. Please refer to the derivational variants design documents for details.
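    The notion of a bi-directional, single-affix derivational pair can be illustrated with a minimal Java sketch. The class name, the in-memory list standing in for the database table, and the sample entries and categories are assumptions made for illustration; they are not the Lexical Tools API or its actual data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for the derivation table (not the Lexical Tools API).
class DerivationPairs {
    // One direct derivation: two base forms with their categories, one affix apart.
    static final class Pair {
        final String term1, cat1, term2, cat2;
        Pair(String term1, String cat1, String term2, String cat2) {
            this.term1 = term1;
            this.cat1 = cat1;
            this.term2 = term2;
            this.cat2 = cat2;
        }
    }

    // Illustrative facts built from the examples in the text.
    // There is deliberately no ("happy", "unhappiness") pair: that would be two affixes.
    static final List<Pair> FACTS = Arrays.asList(
        new Pair("happy", "adj", "unhappy", "adj"),
        new Pair("happy", "adj", "happiness", "noun"),
        new Pair("unhappy", "adj", "unhappiness", "noun"));

    // Pairs are bi-directional, so the input is matched against either side,
    // ignoring case as the fact lookup does.
    static List<String> variants(String input) {
        List<String> out = new ArrayList<>();
        for (Pair p : FACTS) {
            if (p.term1.equalsIgnoreCase(input)) out.add(p.term2 + "|" + p.cat2);
            if (p.term2.equalsIgnoreCase(input)) out.add(p.term1 + "|" + p.cat1);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("happy"));       // [unhappy|adj, happiness|noun]
        System.out.println(variants("Unhappiness")); // [unhappy|adj]
    }
}
```

    In this sketch "unhappiness" is reachable from "happy" only in two steps (through "unhappy"), which mirrors the one-affix restriction described above.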

    Derivational variants are generated by facts (a pre-computed derivational table) and morphology rules. Facts are stored in a database and retrieved by SQL queries. Rules are stored and retrieved through a trie mechanism. The Java version implements two new heuristic rules that filter out unrealistic derivational variants generated by rules. They are governed by:

    • Min. length of a term:
      If a term is too short (the default minimum length is 3), the word is usually an acronym or carries little meaning. Such terms can be filtered out by this rule.

    • Min. length of stem in trie tree:
      The stem length is the length of the word minus the length of the input suffix of the rule. If the stem is too short, the derivational variants generated from the rules are usually poor guesses and should be filtered out. The trie algorithm uses this value to filter out such cases.

      For example, the length of the input suffix (ic$) is 2. If the input term is "zoic", the length of the stem ("zo") is 2 (= 4 - 2). Accordingly, the rule-generated derivational variant "zoy" is filtered out from the derivational variants of "zoic" by this rule (with the default value of 3).

    The values of the above two variables are configurable in the configuration tool (${LVG_DIR}/data/config/). The default value is 3 for both the Min. length of a term (MIN_TERM_LENGTH) and the Min. length of stem in trie tree (DIR_TRIE_STEM_LENGTH).
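    The two heuristics can be sketched in Java as follows. This is an illustration under stated assumptions, not the tool's implementation: the class and method names are hypothetical, the output suffix "y" is inferred from the "zoic" -> "zoy" example, and the sketch assumes that a length strictly below the configured minimum is rejected while a length equal to it is accepted.

```java
// Hypothetical helper illustrating the two heuristics (not the Lexical Tools API).
class RuleHeuristics {
    static final int MIN_TERM_LENGTH = 3; // default for MIN_TERM_LENGTH
    static final int MIN_STEM_LENGTH = 3; // default for DIR_TRIE_STEM_LENGTH

    // Returns the rule-generated variant, or null when the rule does not
    // match or one of the heuristics rejects the candidate.
    static String applySuffixRule(String term, String inSuffix, String outSuffix) {
        // Heuristic 1: very short terms are usually acronyms (assumed comparison: <).
        if (term.length() < MIN_TERM_LENGTH) return null;
        // Rule matching is case sensitive, as String.endsWith is.
        if (!term.endsWith(inSuffix)) return null;
        // Heuristic 2: stem length = word length - input suffix length.
        int stemLength = term.length() - inSuffix.length();
        if (stemLength < MIN_STEM_LENGTH) return null;
        return term.substring(0, stemLength) + outSuffix;
    }

    public static void main(String[] args) {
        System.out.println(applySuffixRule("zoic", "ic", "y"));   // null (stem "zo" has length 2)
        System.out.println(applySuffixRule("ironic", "ic", "y")); // irony (stem "iron" has length 4)
    }
}
```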

    Results from both facts and rules are combined and sorted; results with the same output term, output category, and input category are then removed as duplicates. Finally, a derivation-flow-specific filter option (-kd:int) is applied. The options are: known to LEXICON only (1, the default), known to LEXICON or all (2), and all (3).
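    The combine, sort, and de-duplicate steps can be sketched in Java as below. The class, the Result type, and the sample data are hypothetical. The sketch also assumes that the duplicate key is compared exactly (case sensitive) and that the first occurrence is kept, with facts placed ahead of rules; the text does not specify either point. The -kd filter is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the combine / sort / de-duplicate steps (not the Lexical Tools API).
class CombineResults {
    static final class Result {
        final String outTerm, outCat, inCat, source;
        Result(String outTerm, String outCat, String inCat, String source) {
            this.outTerm = outTerm;
            this.outCat = outCat;
            this.inCat = inCat;
            this.source = source; // "fact" or "rule", kept only for illustration
        }
        // Duplicate key: same output term, output category, and input category.
        String key() { return outTerm + "|" + outCat + "|" + inCat; }
    }

    static List<Result> combine(List<Result> facts, List<Result> rules) {
        List<Result> all = new ArrayList<>(facts);
        all.addAll(rules);
        // List.sort is stable, so a fact stays ahead of a rule with the same output term.
        all.sort(Comparator.comparing((Result r) -> r.outTerm, String.CASE_INSENSITIVE_ORDER));
        Set<String> seen = new HashSet<>();
        List<Result> unique = new ArrayList<>();
        for (Result r : all) {
            if (seen.add(r.key())) unique.add(r); // keep the first occurrence only
        }
        return unique;
    }

    public static void main(String[] args) {
        List<Result> facts = Arrays.asList(new Result("happiness", "noun", "adj", "fact"));
        List<Result> rules = Arrays.asList(
            new Result("happiness", "noun", "adj", "rule"),
            new Result("happily", "adv", "adj", "rule"));
        for (Result r : combine(facts, rules)) {
            System.out.println(r.key() + " (" + r.source + ")");
        }
    }
}
```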

    The -m flag displays the additional information that can be retrieved with the derivation flow. This information consists of two parts: the fact or rule that generates the derivational variants, and the fact or rule that was applied to the derivational form to produce the output.

  • Difference:
    1. The Java version shows all variants from different rules, while the C version shows only one variant when different rules produce the same variant.
    2. Facts include wider coverage of zero derivations, prefix derivations, and suffix derivations since 2012.

  • Features:
    1. Facts: Find all derivational variants from the derivation table.
    2. Rules: Find all derivational variants from morphology rules.
    3. Assign category and inflection for all outputs.
    4. Remove duplicates.
    5. Filter outputs according to the restriction flag (-kd).
    6. Display outputs in alphabetical order.

  • Symbol: d

  • Examples:
    shell> lvg -f:d -m

  • Implementation Logic:
    • Use both facts and rules.
    • Facts:
      1. Performs a case insensitive search on the input term and term1 in the derivation table.
      2. Performs a case insensitive search on the input term and term2 in the derivation table.
      3. Checks if the input categories are legal.
      4. Assigns term, category, inflection (base) for both source and target.
    • Rules:
      1. Uses a persistent trie to apply rules (and check exceptions) on the input term.
      2. Assigns term, category, inflection (base) for both source and target.
    • Combine facts and rules
    • Sort outputs in case-insensitive alphabetical order
    • Remove duplicates (same output term, output category, and input category)
    • Filter results according to the restriction filter
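
    The rule step above (apply suffix rules through a trie and check exceptions) can be sketched with a small in-memory suffix trie in Java. This is only an approximation for illustration: the tool uses a persistent trie, whereas the class below, its method names, and the sample rules and exception are hypothetical. Rules are stored along the path spelled by the input suffix read backwards, so one walk from the end of the term finds every matching rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory suffix trie (not the persistent trie used by the tool).
class SuffixRuleTrie {
    static final class Rule {
        final String inSuffix, outSuffix;
        final Set<String> exceptions;
        Rule(String inSuffix, String outSuffix, String... exceptions) {
            this.inSuffix = inSuffix;
            this.outSuffix = outSuffix;
            this.exceptions = new HashSet<>(Arrays.asList(exceptions));
        }
    }

    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        final List<Rule> rules = new ArrayList<>();
    }

    private final Node root = new Node();

    // Insert a rule along the path spelled by its input suffix read backwards.
    void add(Rule rule) {
        Node node = root;
        for (int i = rule.inSuffix.length() - 1; i >= 0; i--) {
            node = node.children.computeIfAbsent(rule.inSuffix.charAt(i), c -> new Node());
        }
        node.rules.add(rule);
    }

    // Walk the term from its last character (case sensitive) and apply every
    // rule found on the way, skipping terms listed as exceptions of a rule.
    List<String> apply(String term) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (int i = term.length() - 1; i >= 0; i--) {
            node = node.children.get(term.charAt(i));
            if (node == null) break;
            for (Rule r : node.rules) {
                if (!r.exceptions.contains(term)) {
                    out.add(term.substring(0, i) + r.outSuffix);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SuffixRuleTrie trie = new SuffixRuleTrie();
        trie.add(new Rule("ic", "y", "music")); // "music" is a made-up exception
        trie.add(new Rule("iness", "y"));
        System.out.println(trie.apply("ironic"));    // [irony]
        System.out.println(trie.apply("happiness")); // [happy]
        System.out.println(trie.apply("music"));     // []
    }
}
```

    The sketch omits the minimum stem length check and the category assignment that the rule step also performs.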

  • Hierarchy: Object -> Transformation -> ToDerivation