Generate Derivational Variants

  • Short Description: Generate derivational variants

  • Full Description:

    Derivational variants are terms related by a derivational process. In linguistics, derivation forms a new word from an existing word by adding or removing an affix (prefix or suffix). The meaning, the syntactic category, or both may change in the process. For example, "un-happy" and "happi-ness" are derived from "happy" by a prefix and a suffix, respectively.

    In the Lexical Tools, derivational pairs define direct derivations. A derivational pair (dPair) consists of two derivationally related terms (each with a base form and category) that are exactly one derivational step apart. A dPair is bi-directional, and only one affix is allowed in it. For example, "un-happi-ness" is not a direct derivation of "happy"; it is a derivation of "unhappy". Accordingly, "un-happi-ness" and "happy" do not form a derivational pair. Derivations that are more than one derivational step apart, such as "happily" and "happiness" (happily -> happy -> happiness), cannot be retrieved by this flow. However, they can be retrieved by the recursive derivational flow (-f:R).

    Often, the derivational variant has a different syntactic category from the original term. Derivational variants are generated in two ways: by facts, a case-insensitive lookup in a database table of known derivations; and by rules, which add, change, or remove common suffixes (case sensitive).
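
    The one-step, bi-directional nature of a dPair can be illustrated with a small sketch. The pair table, the function names, and the category labels below are hypothetical stand-ins chosen for illustration; they are not the Lexical Tools API, and the recursive helper only shows the idea of following pairs repeatedly, not the exact behavior of -f:R.

```python
from collections import defaultdict

# Hypothetical toy pair table of (term, category) tuples.
# The real facts live in a database table; these four pairs are illustrative.
D_PAIRS = [
    (("happy", "adj"), ("unhappy", "adj")),
    (("happy", "adj"), ("happiness", "noun")),
    (("happy", "adj"), ("happily", "adv")),
    (("unhappy", "adj"), ("unhappiness", "noun")),
]


def build_index(pairs):
    """Index every pair in both directions, keyed by the lowercased term."""
    index = defaultdict(set)
    for left, right in pairs:
        index[left[0].lower()].add(right)
        index[right[0].lower()].add(left)
    return index


def direct_derivations(term, index):
    """Return the terms exactly one derivational step away from `term`."""
    return sorted(t for t, _cat in index.get(term.lower(), ()))


def recursive_derivations(term, index):
    """Follow pairs repeatedly until no new term is reached."""
    start = term.lower()
    seen, stack = {start}, [start]
    while stack:
        for nxt, _cat in index.get(stack.pop(), ()):
            if nxt.lower() not in seen:
                seen.add(nxt.lower())
                stack.append(nxt.lower())
    seen.discard(start)
    return sorted(seen)


INDEX = build_index(D_PAIRS)
print(direct_derivations("happily", INDEX))     # ['happy']
print(recursive_derivations("happily", INDEX))  # ['happiness', 'happy', 'unhappiness', 'unhappy']
```

    In this sketch "happiness" is not a direct derivation of "happily", because the two terms are two steps apart; it only appears once the pairs are followed recursively.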

    In 2012, a systematic approach was used to add prefix and zero (same spelling with a different category) derivations to facts. In 2013, a systematic approach was used to add suffix derivations to facts. In addition, filter options were added for derivation types (zeroD, suffixD, prefixD) and negations (negative derivational pairs, such as happy and unhappy, or effort and effortless). In 2014, 63 more prefixes were added to generate prefixD pairs. All dPairs from the earlier (pre-2012) facts were also validated and added if the dPair had EUIs, was valid, and was not a duplicate. In addition, the optimized SD-Rule set was integrated into the trie as the default, reaching precision and recall rates above 95%. In 2015, 2 new prefixes and 13 new SD-Rules were added to cover more derivations. Please refer to the derivational variants design documents for details. The default filter options for derivations are:

    Filter Options    Descriptions
    -kd:1             restriction to facts only
    -kdt:ZSP          derivation types are zeroD, suffixD, and prefixD
    -kdn:O            negations are otherwise (non-negative)
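
    The effect of the type and negation filters can be sketched as a simple predicate. The DPair tuple and the keep function are hypothetical names used only for illustration. The D-Type and negation codes (Z, S, P and N, O) follow the output examples on this page; treating "B" as "both" follows the -kdn:B example, while passing "N" to keep only negative pairs is an assumption.

```python
from typing import NamedTuple


class DPair(NamedTuple):
    """Hypothetical record holding the fields the filters look at."""
    source: str
    target: str
    d_type: str    # "Z" (zeroD), "S" (suffixD) or "P" (prefixD)
    negation: str  # "N" (negative) or "O" (otherwise)


def keep(pair, types="ZSP", negation="O"):
    """Mimic -kdt:<types> and -kdn:<negation>; "B" keeps both N and O."""
    type_ok = pair.d_type in types
    negation_ok = negation == "B" or pair.negation == negation
    return type_ok and negation_ok


# Values taken from the "happy" examples on this page.
PAIRS = [
    DPair("happy", "unhappy", "P", "N"),
    DPair("happy", "happily", "S", "O"),
    DPair("happy", "happiness", "S", "O"),
]

print([p.target for p in PAIRS if keep(p)])                # ['happily', 'happiness']
print([p.target for p in PAIRS if keep(p, negation="B")])  # ['unhappy', 'happily', 'happiness']
print([p.target for p in PAIRS if keep(p, types="P", negation="B")])  # ['unhappy']
```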

    Derivational variants are generated by facts (a pre-computed derivational table) and morphology rules. Facts are stored in a database and retrieved by SQL queries. Rules are stored in and retrieved through a trie. The Java version implements two new heuristic rules to filter out unrealistic derivational variants generated by rules. They are governed by:

    • Min. length of a term:
      If a term is too short (the default minimum length is 3), it is usually an acronym or carries little meaning. Such terms can be filtered out by this rule.

    • Min. length of stem in trie tree:
      The stem length is the length of the word minus the length of the input suffix in the rule. If the stem is too short, the derivational variants generated from the rules are usually poor guesses and should be filtered out. The trie algorithm uses this setting to filter out such cases.

      For example,

      RULE|ic$|adj|base|y$|noun|base

      The length of the input suffix (ic$) is 2. If the input term is "zoic", the length of the stem ("zo") is 2 (= 4 - 2). Because 2 is less than the default minimum stem length of 3, the rule-generated derivational variant "zoy" is filtered out from the derivational variants of "zoic" by this rule.

    The values of the above two variables are configurable in the configuration file (${LVG_DIR}/data/config/lvg.properties). The default value is 3 for both the min. length of a term (MIN_TERM_LENGTH) and the min. length of stem in trie tree (DIR_TRIE_STEM_LENGTH).
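
    The two heuristics can be sketched as follows, using the rule format and the "zoic" example given above. The function names are hypothetical, and this is not the trie implementation: it applies a single rule directly. Rejecting terms and stems that are strictly shorter than the configured minimum is an assumption consistent with the "zoic" example.

```python
MIN_TERM_LENGTH = 3       # default named in the text
DIR_TRIE_STEM_LENGTH = 3  # default named in the text


def parse_rule(rule_line):
    """Split 'RULE|ic$|adj|base|y$|noun|base' into suffixes and categories."""
    fields = rule_line.split("|")
    if len(fields) < 7 or fields[0] != "RULE":
        raise ValueError("not a rule line: " + rule_line)
    _tag, in_suffix, in_cat, _in_infl, out_suffix, out_cat, _out_infl = fields[:7]
    # The trailing '$' marks the end of the word; drop it for plain matching.
    return in_suffix.rstrip("$"), in_cat, out_suffix.rstrip("$"), out_cat


def apply_rule(term, rule_line,
               min_term=MIN_TERM_LENGTH, min_stem=DIR_TRIE_STEM_LENGTH):
    """Return (variant, category), or None if the rule is not applicable
    or one of the two length heuristics rejects the result."""
    if len(term) < min_term:            # heuristic 1: term too short
        return None
    in_suffix, _in_cat, out_suffix, out_cat = parse_rule(rule_line)
    if not term.endswith(in_suffix):    # rule matching is case sensitive
        return None
    stem = term[:len(term) - len(in_suffix)]
    if len(stem) < min_stem:            # heuristic 2: stem too short
        return None
    return stem + out_suffix, out_cat


RULE = "RULE|ic$|adj|base|y$|noun|base"
print(apply_rule("zoic", RULE))      # None
print(apply_rule("allergic", RULE))  # ('allergy', 'noun')
```

    Lowering min_stem to 2 in this sketch lets "zoy" through, which shows why the default of 3 is used to suppress such guesses.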

    Results from both facts and rules are combined and sorted, and then results with the same output term, output category, and input category are filtered out. Finally, the filter option specific to the derivational flow (-kd:int) is applied. The options are: known to LEXICON only (default: 1), known to LEXICON or all (2), and all (3).
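
    A minimal sketch of the combine, sort, and de-duplicate steps is shown below. The record layout, the combine function, and the sample rule results are hypothetical; only the duplicate key (output term, output category, input category) and the case-insensitive alphabetical sort come from this page. Listing facts before rules so that a fact wins over an identical rule result is an assumption, and the -kd restriction is left out.

```python
def combine(fact_results, rule_results):
    """Merge fact and rule results, sort them, and drop duplicates."""
    # sorted() is stable, so for equal output terms facts stay ahead of rules.
    merged = sorted(fact_results + rule_results,
                    key=lambda r: r["out_term"].lower())
    seen, unique = set(), []
    for record in merged:
        key = (record["out_term"], record["out_cat"], record["in_cat"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Category codes as they appear in the examples on this page (1, 128, 1024).
FACTS = [
    {"in_cat": 1024, "out_term": "helper", "out_cat": 128, "source": "FACT"},
    {"in_cat": 128, "out_term": "helpful", "out_cat": 1, "source": "FACT"},
    {"in_cat": 1024, "out_term": "help", "out_cat": 128, "source": "FACT"},
]
# Hypothetical rule results, invented to show a duplicate being dropped.
RULES = [
    {"in_cat": 128, "out_term": "helpful", "out_cat": 1, "source": "RULE"},
    {"in_cat": 128, "out_term": "helpless", "out_cat": 1, "source": "RULE"},
]

for record in combine(FACTS, RULES):
    print(record["out_term"], record["out_cat"], record["source"])
```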

    The -m flag displays the additional information that can be retrieved with the derivation flow. The additional information takes one of two forms, depending on whether the variant comes from a fact or a rule:

    • fact: FACT|D-1|CAT-1|EUI-1|D-2|CAT-2|EUI-2|D-Type|Negation|prefix|
    • rule: RULE|suffix-1|CAT-1|base|suffix-2|CAT-2|base|

    Please note that only suffix rules are applied in derivations.
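
    The two detail formats can be read back into named fields with a short parser. This is a sketch: the dictionary keys and the parse_detail function are hypothetical names, and the assumption that the FACT or RULE tag is the seventh field of a full output line is read off the examples on this page rather than taken from a format specification.

```python
FACT_FIELDS = ("d1", "cat1", "eui1", "d2", "cat2", "eui2",
               "d_type", "negation", "prefix")
RULE_FIELDS = ("suffix1", "cat1", "infl1", "suffix2", "cat2", "infl2")
LAYOUTS = {"FACT": FACT_FIELDS, "RULE": RULE_FIELDS}


def parse_detail(line):
    """Parse a full 'lvg -f:d -m' output line or a bare FACT/RULE detail."""
    if line.endswith("|"):
        line = line[:-1]  # drop only the single trailing separator
    parts = line.split("|")
    # Try offset 6 first (full output line), then offset 0 (bare detail),
    # so an input term that happens to be "FACT" is not mistaken for the tag.
    for offset in (6, 0):
        if len(parts) > offset and parts[offset] in LAYOUTS:
            names = LAYOUTS[parts[offset]]
            values = parts[offset + 1:offset + 1 + len(names)]
            if len(values) != len(names):
                raise ValueError("truncated detail: " + line)
            detail = {"kind": parts[offset]}
            detail.update(zip(names, values))
            return detail
    raise ValueError("no FACT or RULE detail found: " + line)


fact = parse_detail(
    "happy|unhappy|1|1|d|1|FACT|happy|1|E0030812|unhappy|1|E0063156|P|N|un|")
print(fact["d_type"], fact["negation"], fact["prefix"])  # P N un

rule = parse_detail("RULE|ic$|adj|base|y$|noun|base|")
print(rule["suffix1"], rule["suffix2"])  # ic$ y$
```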

  • Difference:
    1. The Java version shows all variants from different rules, while the C version shows only one variant when different rules produce the same variant.
    2. Since 2012, facts have been generated by a systematic methodology that gives wider coverage of zero derivations, prefix derivations, and suffix derivations.


  • Features:
    1. Fact: Find all derivational variants from derivation table.
    2. Rules: Find all derivational variants from morphology rules.
    3. Assign category and inflection for all outputs.
    4. Remove duplicates
    5. Filter outputs according to the restriction flag (-kd)
    6. Display outputs in alphabetical order


  • Symbol: d

  • Examples:
    
    shell> lvg -f:d -m
    multiple
    multiple|multiplicity|128|1|d|1|FACT|multiple|1|E0041326|multiplicity|128|E0041348|S|O|None|
    multiple|multiply|2|1|d|1|FACT|multiple|1|E0041326|multiply|2|E0041350|S|O|None|
    multiple|multiple|128|1|d|1|FACT|multiple|1|E0041326|multiple|128|E0041327|Z|O|None|
    multiple|multiply|1024|1|d|1|FACT|multiple|1|E0041326|multiply|1024|E0041349|S|O|None|
    multiple|pseudomultiple|1|1|d|1|FACT|multiple|1|E0041326|pseudomultiple|1|E0620850|P|O|pseudo|
    multiple|pseudo-multiple|1|1|d|1|FACT|multiple|1|E0041326|pseudo-multiple|1|E0620850|P|O|pseudo-|
    multiple|submultiple|128|1|d|1|FACT|multiple|128|E0041327|submultiple|128|E0224586|P|O|sub|
    multiple|multiple|1|1|d|1|FACT|multiple|128|E0041327|multiple|1|E0041326|Z|O|None|
    
    help
    help|helpful|1|1|d|1|FACT|help|128|E0031061|helpful|1|E0031066|S|O|None|
    help|helper|128|1|d|1|FACT|help|1024|E0031060|helper|128|E0031062|S|O|None|
    help|helping|128|1|d|1|FACT|help|1024|E0031060|helping|128|E0219271|S|O|None|
    help|help|1024|1|d|1|FACT|help|128|E0031061|help|1024|E0031060|Z|O|None|
    help|self-help|128|1|d|1|FACT|help|128|E0031061|self-help|128|E0055088|P|O|self-|
    help|help|128|1|d|1|FACT|help|1024|E0031060|help|128|E0031061|Z|O|None|
    
    happy
    happy|happily|2|1|d|1|FACT|happy|1|E0030812|happily|2|E0218480|S|O|None|
    happy|happiness|128|1|d|1|FACT|happy|1|E0030812|happiness|128|E0030811|S|O|None|
    
    shell> lvg -f:d -m -kdn:B
    happy
    happy|unhappy|1|1|d|1|FACT|happy|1|E0030812|unhappy|1|E0063156|P|N|un|
    happy|happily|2|1|d|1|FACT|happy|1|E0030812|happily|2|E0218480|S|O|None|
    happy|happiness|128|1|d|1|FACT|happy|1|E0030812|happiness|128|E0030811|S|O|None|
    

  • Implementation Logic:
    • Use both facts and rules.
    • Facts:
      1. Performs a case insensitive search on the input term and term1 in the derivation table.
      2. Performs a case insensitive search on the input term and term2 in the derivation table.
      3. Checks whether the input categories are legal.
      4. Assigns term, category, inflection (base) for both source and target.
    • Rules:
      1. Uses persistent trie to apply rules (and check exceptions) on the input term.
      2. Assigns term, category, inflection (base) for both source and target.
    • Combine facts and rules
    • Sort outputs by the case insensitive alphabetical order
    • Remove duplicates (same output term, output category, and input category)
    • Filter results according to the restriction filter

  • Source Code: ToDerivation.java

  • Hierarchy: Object -> Transformation -> ToDerivation