- Short Description: Reduce each word to its uninflected (citation, morphological base) form(s)
- Full Description:
Lvg can uninflect both words and terms. That is, it can make plural nouns into singular nouns, inflected verbs into their infinitive forms, and adjectives and adverbs into their positive forms.
The -m flag option has no effect on this flow; "none" is added at the end of the output.
There is a subtle difference between uninflecting the input as a term (-f:b) and uninflecting the input as a sequence of words (-f:B). When the input is viewed as one term, a quick lookup of the term is made, and if it is not found, rules are applied to create an uninflected form. When the input is viewed as a sequence of words, each word is looked up to find its uninflected form(s), and every combination of the uninflected words is returned rather than a single rule-generated form. As an example of this difference, take the term alpha beta, which is not in the lexicon as a term, although alpha and beta are each in the lexicon. When this term is run through the command
lvg -f:b -m
the result is:
alpha beta|alpha beta|1|1|b|1|RULE|$|adj|base|$|adj|base
alpha beta|alpha beta|2|1|b|1|RULE|$|adv|base|$|adv|base
alpha beta|alpha beta|128|1|b|1|RULE|$|noun|base|$|noun|base
alpha beta|alpha beta|1024|1|b|1|RULE|$|verb|base|$|verb|base
alpha beta|alpha beton|128|512|b|1|RULE|a$|noun|plural|on$|noun|singular
alpha beta|alpha betum|128|512|b|1|RULE|a$|noun|plural|um$|noun|singular
As can be seen, rules were applied to the end of the term (in this case, beta) to produce rule-generated uninflected forms for alpha beta. When the input is viewed as a sequence of words, however, the resulting uninflection is different. When the command
lvg -f:B
is used, the result is:
alpha beta|alpha beta|2047|1|B|1|
A heuristic within this uninflection flow should be pointed out: words that, by rule, uninflect to more than ten forms are treated differently. In such cases, the rule-generated forms are not used; only the input form is used as the uninflected form. For example, the nonsense term PIIA clA CuUM TIAA has only one uninflected form as a result of this heuristic, because each word of these terms generates three variants apiece, whereas the nonsense term PIIA clA cuUM produces nine normalized forms from the rule-generated uninflected forms. The reasoning behind the heuristic is that aggressive rule-generated forms, when not pruned, can produce an explosive number of irrelevant forms.
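The cap described above can be illustrated with a minimal Java sketch. It reads the heuristic as a limit on the total number of word combinations, which is an assumption; the class name, the method name, and the maxCombinations parameter are hypothetical and are not part of Lvg.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the Lvg implementation. */
class CombinationCapSketch {

    /**
     * Returns every combination of the per-word forms, or only the input
     * term when the number of combinations would exceed maxCombinations.
     * maxCombinations is a hypothetical stand-in for the configured limit.
     */
    static List<String> combine(String inputTerm,
                                List<List<String>> formsPerWord,
                                long maxCombinations) {
        // Count combinations first so an oversized product is never built.
        long total = 1;
        for (List<String> forms : formsPerWord) {
            total *= forms.size();
            if (total > maxCombinations) {
                return List.of(inputTerm);
            }
        }

        // Build the combinations one word position at a time.
        List<String> results = new ArrayList<>();
        results.add("");
        for (List<String> forms : formsPerWord) {
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String form : forms) {
                    next.add(prefix.isEmpty() ? form : prefix + " " + form);
                }
            }
            results = next;
        }
        return results;
    }
}
```

With the forms [left, leave] and [data, datum] and a limit of 10, this sketch returns four combinations; with a limit of 3, it returns only the input term.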
An additional heuristic has also been implemented within the uninflectional morphology unit to limit spurious variants. If a term goes through an uninflectional morphology mutation, and the term is not known to the lexicon, but its rule generated form is known to the lexicon, this variant is thrown out, because it is likely to be wrong.
The results are sorted by length and then in case-insensitive alphabetical order.
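The ordering can be sketched in Java with a standard Comparator. This is an illustration of "length first, then case-insensitive alphabetical" built from java.util classes only; the class and method names are hypothetical, and the sketch is not the Lvg comparator itself.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch only; not the Lvg comparator. */
class LengthThenAlphaSketch {

    // Shorter terms first; ties are broken alphabetically, ignoring case.
    static final Comparator<String> BY_LENGTH_THEN_ALPHA =
        Comparator.<String>comparingInt(String::length)
                  .thenComparing(String.CASE_INSENSITIVE_ORDER);

    /** Returns a sorted copy and leaves the input list untouched. */
    static List<String> sorted(List<String> terms) {
        List<String> copy = new ArrayList<>(terms);
        copy.sort(BY_LENGTH_THEN_ALPHA);
        return copy;
    }
}
```

Applied to the terms leave datum, left datum, left data, and leave data, the sketch yields left data, leave data, left datum, leave datum.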
- The inflection scheme is redesigned in the new version. A new database table of inflection facts was created. Accordingly, results are different.
- The Java version stores the case of each uninflected term in the database (IDB). In other words, the results are case sensitive.
- The Java version always shows results in dictionary order with length (length first, then alphabetical order, case-insensitive). Thus, results may differ from the previous version (sorted first by category frequency and then by dictionary order).
- The Java version is capable of handling punctuation. The old version stripped punctuation from the input term before uninflecting it.
For example: "isn't" should be uninflected as "be" not "isn t"
For example: "wasn't" should be uninflected as "be" not "wasn t"
For example: "doesn't" should be uninflected as "do" not "doesn t"
For example: "won't" should be uninflected as "will" not "win t"
For example: "Vit's" should be uninflected as "Vit" not "Vit s"
For example: "cit's" should be uninflected as "cit" not "cit s"
- The input term is viewed as a sequence of words, and each word is looked up to find its uninflected form(s). Every combination of uninflected words is returned, rather than a single rule-generated form.
shell> lvg -f:B
alpha beta
alpha beta|alpha beta|2047|1|B|1|
left
left|left|2047|1|B|1|
left|leave|2047|1|B|1|
data
data|data|2047|1|B|1|
data|datum|2047|1|B|1|
left data
left data|left data|2047|1|B|1|
left data|leave data|2047|1|B|1|
left data|left datum|2047|1|B|1|
left data|leave datum|2047|1|B|1|
- Tokenize the input term into words using StringTokenizer.
- Find all uninflected form(s) for each word.
  - Find uninflected terms from fact (Database).
  - If there is no result from fact, find uninflected terms from rule (Trie).
  - Filter out any rule-generated term that is in the Database.
  - Lowercase all uninflected terms.
- Check if the total number of permutations is greater than the output limit defined in the configuration file.
  - If so, use the input term as the output.
  - Otherwise, return all combinations of all forms of each word in the input term.
- Sort the results by length, then in case-insensitive alphabetical order (Util.LvgComparator).
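The per-word lookup steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Lvg code: the fact table is modeled as an in-memory Map keyed by lowercased word, the rule engine as a Function, and the lexicon as a Set of lowercased terms. All of these names and types are hypothetical stand-ins for the Database and Trie mentioned in the steps.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.function.Function;

/** Illustrative sketch only; facts, rules, and knownTerms are hypothetical stand-ins. */
class PerWordUninflectSketch {

    /** Returns, for each word of the input term, its lowercased uninflected forms. */
    static List<List<String>> formsPerWord(String inputTerm,
                                           Map<String, List<String>> facts,
                                           Function<String, List<String>> rules,
                                           Set<String> knownTerms) {
        List<List<String>> result = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(inputTerm);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();

            // Step 1: look the word up among the known facts.
            // Assumption: this toy map is keyed by lowercased word.
            List<String> candidates = facts.get(word.toLowerCase(Locale.ROOT));

            // Step 2: fall back to rules only when no fact exists.
            if (candidates == null || candidates.isEmpty()) {
                candidates = new ArrayList<>();
                for (String form : rules.apply(word)) {
                    // Step 3: drop rule-generated forms that are known terms.
                    if (!knownTerms.contains(form.toLowerCase(Locale.ROOT))) {
                        candidates.add(form);
                    }
                }
            }

            // Step 4: lowercase, removing duplicates while keeping order.
            Set<String> lowered = new LinkedHashSet<>();
            for (String form : candidates) {
                lowered.add(form.toLowerCase(Locale.ROOT));
            }
            result.add(new ArrayList<>(lowered));
        }
        return result;
    }
}
```

The returned per-word lists are the input to the remaining steps: the combination count check, the combination of forms, and the final sort. In this sketch a word can end up with an empty list if every rule-generated form is filtered out; how Lvg handles that case is not covered here.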