LVG Morphology

Summary

This document discusses inflectional and derivational morphology in LVG. It describes the rule and fact files used by LVG and gives their specifications. It then briefly discusses transmorph, the program that transforms these rule and fact files into a binary file used by the LVG modules. Next, it gives an overview of the structures and behavior of the functionality that relies on this data, and describes the variant generation interface. Finally, it presents some of LVG's added features.

Introduction

Inflectional morphology deals with the different word forms derivable from a given base. In English, inflection marks nouns for number (plural), verbs for tense, and adjectives and adverbs for their comparative and superlative forms. For example, the word "watch" has the noun inflection "watches", and the verb inflections "watches" (present), "watched" (past and past participle), and "watching" (present participle). An adjective such as "lively" has the inflections "livelier" (comparative) and "liveliest" (superlative). To the extent that Greco-Latin inflectional variation is productive in modern English, LVG can account for it. Variations include those illustrated by "bacterium" and "bacteria", "criterion" and "criteria", and "index" and "indices".

Derivational morphology links lexical items that are related grammatically by affixation, but these generally are not in the same word class. For example, "procedure" is a noun that is related derivationally to the adjective "procedural" by the suffix "-al." Derivational morphology is highly idiosyncratic in English and for that reason it is preferable to store these alternatives directly. However, when a particular alternative is missing from the database, rules of morphology can be used heuristically to identify the grammatical relationship between pairs of lexical items.

The morphology modules utilize a combination of rules, exceptions, and facts to generate inflectional and derivational variants of words for different syntactic categories. All control information is stated in text files that are easily modified. The module uses the generality afforded by the rules to produce potential inflections constrained by the exception list for that rule. Terms that do not satisfy any of the current rules but are morphologically related are stated in a separate "facts" file. This file can be modified or extended by the user. The content and formats of these files are discussed below.

Inflectional Morphology Rules

The rules of inflectional morphology are specified in a text file containing six fields separated by the "|" character. As shown in Figure 1, the fields consist of the input suffix, category, and inflection, followed by the output suffix, category and inflection. Immediately following a rule come the exceptions, if any. The exceptions are stated as pairs of terms separated by a "|" and terminated by a ";". Leading whitespace is required on the lines containing the exceptions. The input and output suffix fields specify the suffix to match on and the suffix to be generated respectively.

Input Suffix|Input Category|Input Inflection|Output Suffix|Output Category|Output Inflection
    Exception List (key|value)

ch$|noun|singular|ches$|noun|plural            [1]
Cy$|adj|positive|Cier$|adj|comparative         [2]
us$|noun|singular|i$|noun|plural               [3]
    antus|anti;

Figure 1
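
The following is a minimal sketch of how one line of such a rule file could be split into its six fields. It is not the transmorph implementation. The names ImRule and parse_im_rule are hypothetical, and the sketch assumes that anything after whitespace following the last field (such as the [1] labels in Figure 1) is an annotation to be ignored.

```c
#include <string.h>

#define IM_FIELDS 6
#define FIELD_MAX 64

/* Hypothetical container for one inflectional rule line.
   field[0] input suffix,  field[1] input category,  field[2] input inflection,
   field[3] output suffix, field[4] output category, field[5] output inflection */
typedef struct {
    char field[IM_FIELDS][FIELD_MAX];
} ImRule;

/* Split "a|b|c|d|e|f" into six fields.
   Returns 0 on success, -1 if the line does not have exactly six
   fields or a field is too long. Empty fields are preserved. */
int parse_im_rule(const char *line, ImRule *rule)
{
    const char *start = line;
    int n = 0;

    for (;;) {
        const char *bar = strchr(start, '|');
        /* The last field ends at the first whitespace or at end of line. */
        size_t len = bar ? (size_t)(bar - start)
                         : strcspn(start, " \t\r\n");

        if (n == IM_FIELDS || len >= FIELD_MAX)
            return -1;
        memcpy(rule->field[n], start, len);
        rule->field[n][len] = '\0';
        n++;

        if (!bar)
            break;
        start = bar + 1;
    }
    return n == IM_FIELDS ? 0 : -1;
}
```

Called on the first rule of Figure 1, the sketch fills field[0] with "ch$" and field[3] with "ches$". A four-field derivational rule is rejected because the field count differs.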

Lowercase letters in the input suffix field stand for themselves, while uppercase letters act like variables.

The letter "D" matches any digit 0-9 in the input. The letter "L" matches any letter a-z. Any uppercase vowel matches a vowel in the input, while uppercase consonants other than "D" or "L" match a consonant in the input. In the output suffix, lowercase letters generate themselves, while uppercase letters are unified with their respective instantiations in the input.

For example, the adjective "lively" matches rule [2] in Figure 1 and generates the comparative form "livelier", with the variable "C" bound to the consonant "l".
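
The matching and unification just described can be sketched as follows. This is an illustration under stated assumptions, not LVG's own code: the function names are hypothetical, "$" is assumed to mark the end of the term, the vowels are taken to be a, e, i, o, u (so "y" counts as a consonant), and a variable that occurs twice is assumed to require the same character both times.

```c
#include <ctype.h>
#include <string.h>

static int is_vowel(char c)
{
    return c != '\0' && strchr("aeiou", c) != NULL;
}

/* Does input character c satisfy the uppercase pattern variable v? */
static int var_accepts(char v, char c)
{
    if (v == 'D') return isdigit((unsigned char)c) != 0;   /* any digit  */
    if (v == 'L') return c >= 'a' && c <= 'z';             /* any letter */
    if (strchr("AEIOU", v)) return is_vowel(c);            /* any vowel  */
    return c >= 'a' && c <= 'z' && !is_vowel(c);           /* consonant  */
}

/* Apply one suffix rule to term. Returns 1 and writes the variant to
   out if the input suffix matches, 0 otherwise. */
int apply_suffix_rule(const char *in_suf, const char *out_suf,
                      const char *term, char *out, size_t out_size)
{
    char bind[26] = {0};                  /* bindings for variables A..Z */
    size_t plen = strcspn(in_suf, "$");   /* lengths without the "$"     */
    size_t olen = strcspn(out_suf, "$");
    size_t tlen = strlen(term);
    size_t stem, i;

    if (plen > tlen)
        return 0;
    stem = tlen - plen;
    if (stem + olen + 1 > out_size)
        return 0;

    /* Match the input suffix, binding variables as we go. */
    for (i = 0; i < plen; i++) {
        char p = in_suf[i], c = term[stem + i];
        if (isupper((unsigned char)p)) {
            char *slot = &bind[p - 'A'];
            if (!var_accepts(p, c) || (*slot && *slot != c))
                return 0;
            *slot = c;
        } else if (p != c) {
            return 0;
        }
    }

    /* Generate the output: literals copy through, variables are replaced
       by the characters they were bound to in the input. */
    memcpy(out, term, stem);
    for (i = 0; i < olen; i++) {
        char p = out_suf[i];
        if (isupper((unsigned char)p)) {
            if (!bind[p - 'A'])
                return 0;                 /* variable never bound */
            out[stem + i] = bind[p - 'A'];
        } else {
            out[stem + i] = p;
        }
    }
    out[stem + olen] = '\0';
    return 1;
}
```

With the rule "Cy$" to "Cier$", the term "lively" binds C to "l" and yields "livelier", while "play" is rejected because "a" is not a consonant.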

Inflections are not generated from a rule if the results are on its exception list. For example, for the input "antus", the third rule will not generate "anti", but "focus" will generate "foci".

The category field must be one of "adj", "adv", "noun", or "verb". The inflection field must be one of "base", "comparative", "superlative", "plural", "present" (for 3rd person singular), "ing" (for present participle), "past" (for simple past tense), or "pastpart" (for past participle). The following are recognized as aliases for "base": "singular", "positive", and "infinitive".

In the file, lines beginning with #include and followed by a file name in double quotes are treated as files to be included in place, analogous to the C-preprocessor. This is useful for inclusion of a specific set of rules without complicating the main file. Other lines that start with the "#" character are treated as comments and are ignored.

All rules are bi-directional; the reverse rules and their exceptions are automatically generated. Thus for the rules and exceptions shown in Figure 1, the rules (and exceptions) shown in Figure 2 are generated.

ches$|noun|plural|ch$|noun|singular
Cier$|adj|comparative|Cy$|adj|positive
i$|noun|plural|us$|noun|singular
    anti|antus;
Figure 2
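
Reversal amounts to swapping the input and output halves of a rule and the key and value of each exception. A minimal sketch follows; the Rule and ExceptionPair types and both function names are hypothetical and are not taken from the LVG sources.

```c
/* Hypothetical in-memory form of one inflectional rule. */
typedef struct {
    const char *in_suffix,  *in_cat,  *in_infl;
    const char *out_suffix, *out_cat, *out_infl;
} Rule;

/* Hypothetical in-memory form of one exception pair. */
typedef struct {
    const char *key;
    const char *value;
} ExceptionPair;

/* Build the reverse of a rule: the output side becomes the input side. */
Rule reverse_rule(Rule r)
{
    Rule rev;
    rev.in_suffix  = r.out_suffix;
    rev.in_cat     = r.out_cat;
    rev.in_infl    = r.out_infl;
    rev.out_suffix = r.in_suffix;
    rev.out_cat    = r.in_cat;
    rev.out_infl   = r.in_infl;
    return rev;
}

/* Build the reverse of an exception: key and value trade places. */
ExceptionPair reverse_exception(ExceptionPair e)
{
    ExceptionPair rev;
    rev.key   = e.value;
    rev.value = e.key;
    return rev;
}
```

Applied to rule [3] of Figure 1 and its exception, the sketch produces the third rule of Figure 2 and the exception anti|antus.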

Inflectional Morphology Facts

A fact links two terms directly when there is no rule to express this relation. Such explicit linkage is useful in capturing irregular variation, e.g., foot/feet. The format for the fact files is similar to the rule files except that instead of suffixes, the actual terms to be related are listed. Figure 3 shows a sample.

Input Term|Input Category|Input Inflection|Output Term|Output Category|Output Inflection

draw|verb|base|drew|verb|past
far|adj|positive|further|adj|comparative
child|noun|singular|children|noun|plural
Figure 3

The category and inflection fields can be left empty to signify all categories and inflections. The facts files also allow comments and file inclusion similar to the rules file. Like the rules, facts are also bi-directional.

The current distribution of lvg includes the following inflectional rule and fact files:

  • im.rul
  • plural.rul
  • verbinfl.rul
  • im.fct
  • lcim.fct

Heuristics Rules for Inflection Morphology

Derivational Morphology Rules

The derivational morphology rule and fact files are structured in a similar way to the inflectional morphology rule and fact files. However, there are no inflection fields. It is assumed that the input and output inflections are all base forms.

The derivational rules consist of a set of text files containing four fields separated by the "|" character. As shown in Figure 4, the fields consist of the input suffix and category followed by the output suffix and category. Immediately following a rule come the exceptions, if any. The exceptions are stated as pairs of terms separated by a "|" and terminated by a ";". Leading whitespace is required on the lines containing the exceptions. The input and output suffix fields specify the suffix to match on and the suffix to be generated respectively. Figure 4 shows an example:

Input Suffix|Input Category|Output Suffix|Output Category
    Exception List (key|value)

ability|noun|able|adj   [1]
Figure 4

The part of speech, or category, can be one of adj, adv, noun, or verb. For example, rule [1] in Figure 4, ability|noun|able|adj, says that a noun ending in "ability" generates an adjective with "able" replacing "ability", e.g., readability -> readable.
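
For a rule whose suffixes are purely literal, as in Figure 4, applying it reduces to a suffix replacement. The sketch below illustrates this. The function name replace_suffix is hypothetical, and the sketch ignores categories, variables, and exceptions.

```c
#include <string.h>

/* Replace the literal suffix `from` at the end of `term` with `to`.
   Returns 1 and writes the result to out on success, or 0 if term
   does not end in `from` or out is too small. */
int replace_suffix(const char *term, const char *from, const char *to,
                   char *out, size_t out_size)
{
    size_t tlen = strlen(term);
    size_t flen = strlen(from);
    size_t olen = strlen(to);
    size_t stem;

    if (flen > tlen || strcmp(term + tlen - flen, from) != 0)
        return 0;
    stem = tlen - flen;
    if (stem + olen + 1 > out_size)
        return 0;

    memcpy(out, term, stem);            /* copy the unchanged stem      */
    memcpy(out + stem, to, olen + 1);   /* append new suffix and '\0'   */
    return 1;
}
```

Calling replace_suffix with "readability", "ability", and "able" gives "readable". Swapping the two suffixes gives the reverse direction, from "readable" back to "readability".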

In the file, lines beginning with #include and followed by a file name in double quotes are treated as files to be included in place, analogous to the C-preprocessor. This is useful for including a specific set of rules without complicating the main file. Other lines that start with the "#" character are treated as comments and are ignored.

As with the inflectional rules and facts, the derivational rules are bi-directional; the reverse rules and their exceptions are automatically generated.

Derivational Morphology Facts

A fact links two terms directly when there is no rule to express this relation. Such explicit linkage is useful in capturing irregular derivational variation, e.g., shipment/ship. The format for the fact files is similar to the rule files except that instead of suffixes, the actual terms to be related are listed. Figure 5 shows a sample.

Input Term|Input Category|Output Term|Output Category

diagnostic|adj|diagnose|verb
Figure 5

As with the inflectional rules and facts, the derivational facts are bi-directional; the reverse facts are automatically generated. The facts files also allow comments and file inclusion similar to the rules files.

The current derivational rule and fact files include:

  • dm.rul
  • convers.fct
  • dm.fct
  • pd.fct
  • derive.fct
  • etc.fct
  • nomiz.fct

Fact and Rule Translation

For efficiency, the text rule and fact files are translated into an internal data structure and saved as a persistent binary file. This file is not distributed with the release. Instead, a program called transmorph creates it. The transmorph program is run as part of the LVG installation.

The transmorph program translates these rules and facts into an internal set of "tries" and applies several checks on the data. The rules and facts are first checked for the correct number of fields. If the content of a field is restricted, it is checked for validity (for example, the string in the category field has to refer to an allowed syntactic category). Terms in the exception list are checked to see if their suffixes correspond to those specified by the rule. Duplicates in the exception list are ignored. The translator prints appropriate messages to warn the user of errors and inhibits the creation of the translated rules file if there were any errors.
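
Two of these checks could be sketched as follows. This is an illustration only and not the transmorph source: the function names are hypothetical, the exception check handles only literal lowercase suffixes (no variables), and "$" is assumed to mark the end of the term.

```c
#include <string.h>

/* Is cat one of the allowed syntactic categories? */
int is_valid_category(const char *cat)
{
    static const char *const allowed[] = { "adj", "adv", "noun", "verb" };
    size_t i;

    for (i = 0; i < sizeof allowed / sizeof allowed[0]; i++)
        if (strcmp(cat, allowed[i]) == 0)
            return 1;
    return 0;
}

/* Does term end with the literal suffix (ignoring a trailing "$")? */
static int ends_with(const char *term, const char *suffix)
{
    size_t slen = strcspn(suffix, "$");
    size_t tlen = strlen(term);

    return slen <= tlen && strncmp(term + tlen - slen, suffix, slen) == 0;
}

/* Do both terms of an exception pair carry the suffixes of their rule? */
int exception_fits_rule(const char *in_suf, const char *out_suf,
                        const char *key, const char *value)
{
    return ends_with(key, in_suf) && ends_with(value, out_suf);
}
```

For the rule us$ to i$ in Figure 1, the exception antus|anti passes the check, whereas a pair such as child|children would be flagged.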

The Morphology Unit Functionality

The morphology unit, specifically the lsv_variants() API, is a key component of LVG's morphological processing. This module generates inflectional and derivational variants for a given input. In addition, the input term's categories and inflections can be specified, as can the desired categories and inflections of the output (inflections apply to inflectional morphology only).

The functionality behind this module relies upon the information provided by the facts and rule files.

The output of this variant generator is ordered by the length of the matching suffix. Longer matches come first. Facts are artificially given a large weight so that they automatically filter to the top of the output list.
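
The ordering can be pictured as a sort on a weight that equals the matched suffix length for rules and a large constant for facts. The sketch below assumes this reading. The Candidate type, the function names, and the value of FACT_WEIGHT are all hypothetical, since the actual weight is not stated here.

```c
#include <stdlib.h>

#define FACT_WEIGHT 1000   /* assumed value, larger than any suffix length */

/* Hypothetical record for one generated variant. */
typedef struct {
    const char *variant;
    int is_fact;       /* nonzero if produced by a fact               */
    int suffix_len;    /* length of the matched suffix, for rules     */
} Candidate;

static int weight(const Candidate *c)
{
    return c->is_fact ? FACT_WEIGHT : c->suffix_len;
}

/* qsort comparator: larger weights sort first. */
static int by_weight_desc(const void *a, const void *b)
{
    int wa = weight(a), wb = weight(b);
    return (wa < wb) - (wa > wb);
}

/* Order candidates so that facts come first, then longer suffix matches. */
void order_candidates(Candidate *v, size_t n)
{
    qsort(v, n, sizeof *v, by_weight_desc);
}
```

Note that qsort is not guaranteed to be stable, so candidates with equal weight may appear in any order in this sketch.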

The facts and rules are organized in a "trie". The trie is organized in reverse suffix order.

Suppose we have rules for the suffixes er, ers, est, or, and CEX, where C and X are wildcards for consonants and E is a wildcard for a vowel.

The trie will look like:

               root
            /   |  \    \           where the  +-+
           /    |   \    \                     +-+
          /     |    \    \
         r      s     t    C        is a trie node that has
        / \     |      \    \       rules on it.
       e   o    r       s    E
      +-+ +-+    \       \    \
      +-+ +-+     e       e    X
                 +-+     +-+  +-+
                 +-+     +-+  +-+

The nodes of the trie have the rules that get applied, and they have the exceptions on them as well.
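
A reverse-suffix trie of this kind can be sketched as follows. The sketch is a simplification and not LVG's data structure: it stores only literal lowercase suffixes (no wildcard nodes such as C, E, or X), records only whether a rule ends at a node rather than the rules and exceptions themselves, and uses hypothetical names throughout.

```c
#include <stdlib.h>
#include <string.h>

typedef struct TrieNode {
    struct TrieNode *child[26];   /* one branch per lowercase letter     */
    int has_rule;                 /* nonzero if a rule ends at this node */
} TrieNode;

TrieNode *trie_new(void)
{
    return calloc(1, sizeof(TrieNode));
}

void trie_free(TrieNode *node)
{
    int k;
    if (!node)
        return;
    for (k = 0; k < 26; k++)
        trie_free(node->child[k]);
    free(node);
}

/* Insert a lowercase suffix, walking its characters from last to first.
   Returns 0 on success, -1 on a bad character or allocation failure. */
int trie_insert(TrieNode *root, const char *suffix)
{
    size_t i = strlen(suffix);
    TrieNode *node = root;

    while (i-- > 0) {
        int k = suffix[i] - 'a';
        if (k < 0 || k >= 26)
            return -1;
        if (!node->child[k] && !(node->child[k] = trie_new()))
            return -1;
        node = node->child[k];
    }
    node->has_rule = 1;
    return 0;
}

/* Length of the longest stored suffix that ends term, or 0 if none. */
size_t trie_longest_suffix(const TrieNode *root, const char *term)
{
    size_t i = strlen(term), depth = 0, best = 0;
    const TrieNode *node = root;

    while (i-- > 0) {
        int k = term[i] - 'a';
        if (k < 0 || k >= 26 || !node->child[k])
            break;
        node = node->child[k];
        depth++;
        if (node->has_rule)
            best = depth;
    }
    return best;
}
```

After inserting er, ers, est, and or, a lookup on "sleepers" walks s, r, e from the end of the term and reports a longest match of length 3, while "sleeper" stops at the node for er with length 2.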

An Applications Interface

The application interface is provided by the function lsv_variants() whose prototype is shown below:

LsvOut *lsv_variants(input, nVarP)
LsvInp *input;   /* Input     the input Structure */
int *nVarP;      /* Output    Number of variants produced */

The data structures are defined in lsv.h. This function is embedded within the lsv.a library, created within the $NLS/morph/function/ directory.

The LsvInp Structure is defined as:


typedef struct _lsvInp {
    char    *term;
    char    *file;
    int     moduleType;   /* IM or DM                                         */
    lsv_t   inpCats;      /* categories of term, bit or'd together            */
    lsv_t   inpInfls;     /* inflections of term, bit or'd together           */
    lsv_t   outCats;      /* categories of output desired, bit or'd together  */
    lsv_t   outInfls;     /* inflections of output desired, bit or'd together */
    lsv_t   inpTypes;     /* acronym/abbreviation types                       */
    lsv_t   outTypes;     /* acronym/abbreviation types                       */
    int     debug;        /* if >0, then matching fact or rule is copied to output */
}   LsvInp;

/*  output structure for IM
*/
typedef struct _lsvImOut {
    char    *var;         /* The resulting term           */
    lsv_t   cat;          /* The category                 */
    lsv_t   infl;         /* The inflection               */
    char    *info;        /* matching rule/fact as string */
}   LsvImOut;

/*  output structure for DM
*/
typedef struct _lsvDmOut {
    char    *var;         /* The resulting term           */
    lsv_t   cat;          /* The category                 */
    char    *info;        /* matching rule/fact as string */
}   LsvDmOut;


The LsvOut Structure is defined as:


/*  generic output structure
*/
typedef struct _lsvOut {
    lsv_t   moduleType;   /* IM, DM, SYN or ACR */
    lsv_t   rfType;       /* rule or fact */
    union {
        LsvImOut    im;
        LsvDmOut    dm;
        LsvSynOut   syn;  /* Obsolete, no longer used */
        LsvAcrOut   acr;  /* Obsolete, no longer used */
    }   out;
}   LsvOut;



The Morphology Unit Test Program

A program called testmorph is distributed with lvg. It provides an example application of the morphology module lsv_variants. It also can be used to test the morphology unit directly. The testmorph program takes one optional argument, "-d", to indicate one is testing the derivational morphology functionality. By default, testmorph tests the inflectional morphology functionality. Testmorph expects input terms from the standard input stream. Testmorph produces each inflection or derivation and the fact or rule that applied.

The testmorph program expects to be run from the $NLS/morph/tst/ directory. The paths to the im.db and dm.db files are hardwired to "../".

An example:

./tstmorph
sleep
sleep -> sleeping ... FACT|sleep|verb|base|sleeping|verb|prespart
sleep -> slept ... FACT|sleep|verb|base|slept|verb|pastpart
sleep -> sleeps ... FACT|sleep|verb|base|sleeps|verb|pres3ps
sleep -> slept ... FACT|sleep|verb|base|slept|verb|past
sleep -> sleeper ... RULE||adj|base|er|adj|comparative
sleep -> sleepest ... RULE||adj|base|est|adj|superlative
sleep -> sleeped ... RULE||verb|base|ed|verb|past
sleep -> sleeping ... RULE||verb|base|ing|verb|prespart
sleep -> sleeper ... RULE||adv|base|er|adv|comparative
sleep -> sleepest ... RULE||adv|base|est|adv|superlative
sleep -> sleeps ... RULE||noun|base|s|noun|plural
sleep -> sleeps ... RULE||verb|base|s|verb|pres3ps

./tstmorph -d
sleep
sleep -> sleepless ... FACT|sleep|noun|sleepless|adj
sleep -> sleeplessness ... FACT|sleep|noun|sleeplessness|noun
sleep -> sleepably ... RULE||verb|ably|adv
sleep -> sleepable ... RULE||verb|able|adj
sleep -> sleepance ... RULE||verb|ance|noun
sleep -> sleepal ... RULE||noun|al|adj
sleep -> sleepant ... RULE||verb|ant|noun
sleep -> sleepary ... RULE||noun|ary|adj
sleep -> sleepation ... RULE||verb|ation|noun
sleep -> sleeped ... RULE||noun|ed|adj
sleep -> sleepant ... RULE||verb|ant|adj
sleep -> sleepable ... RULE||noun|able|adj
sleep -> sleepism ... RULE||noun|ism|noun
sleep -> sleepist ... RULE||noun|ist|noun
sleep -> sleepity ... RULE||adj|ity|noun
sleep -> sleeply ... RULE||adj|ly|adv
sleep -> sleepor ... RULE||verb|or|noun
sleep -> sleepous ... RULE||noun|ous|adj
sleep -> sleepy ... RULE||noun|y|adj
sleep -> sleepness ... RULE||adj|ness|noun
sleep -> sleepment ... RULE||verb|ment|noun
sleep -> sleeper ... RULE||verb|er|noun
sleep -> sleepic ... RULE||noun|ic|adj

LVG Functionality Built on top of the Morphology Unit

Several flows use the morphology unit. These include the inflectional flow (-i), the derivational flow (-d), the uninflect term flow (-b) and the uninflect word flow (-B).

The lsv_variants() module generates variants without regard to whether they are valid terms. LVG's -k option further constrains the generated variants using the lexicon. It can restrict the output to known terms only, restrict it to known terms when there are any and otherwise return the generated terms, or leave the generated terms unconstrained. By default, LVG's inflectional flow constrains the generated terms to those known to the lexicon unless none are found, in which case the generated variants are used. By default, LVG's derivational flow filters the generated terms to known forms from the lexicon.
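
The three behaviors can be sketched as a filter over the generated variants. This is an illustration only: the enum names, the function filter_known, and the toy lexicon are hypothetical stand-ins, and the sketch does not show the actual argument values accepted by -k.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for the three behaviors described above. */
typedef enum { KEEP_ALL, KNOWN_ONLY, KNOWN_IF_ANY } KnownMode;

typedef int (*LexiconLookup)(const char *term);

/* Stand-in for a real lexicon lookup; knows only a handful of terms. */
int toy_lexicon(const char *term)
{
    static const char *const known[] = { "sleep", "sleeps", "slept", "sleeping" };
    size_t i;

    for (i = 0; i < sizeof known / sizeof known[0]; i++)
        if (strcmp(term, known[i]) == 0)
            return 1;
    return 0;
}

/* Filter variants in place according to mode; returns the new count. */
size_t filter_known(const char **variants, size_t n,
                    KnownMode mode, LexiconLookup is_known)
{
    size_t i, known = 0;

    if (mode == KEEP_ALL)
        return n;

    for (i = 0; i < n; i++)
        if (is_known(variants[i]))
            known++;

    /* Nothing known: fall back to the generated terms, or drop them all. */
    if (known == 0)
        return mode == KNOWN_IF_ANY ? n : 0;

    /* Compact the array so that only known terms remain, in order. */
    for (i = 0, known = 0; i < n; i++)
        if (is_known(variants[i]))
            variants[known++] = variants[i];
    return known;
}
```

In this sketch, KNOWN_IF_ANY corresponds to the default described for the inflectional flow and KNOWN_ONLY to the default described for the derivational flow.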

LVG is distributed with a table, derived from the SPECIALIST Lexicon, of the form

inflected form|uninflected form

This table is used with LVG's uninflect flows. These flows do a table lookup first, and only if nothing is found in the table is lsv_variants() used to generate uninflected forms. The intent is to make uninflection both more accurate and faster.
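
The lookup-then-generate order can be sketched as below. All names are hypothetical, the table is searched linearly for brevity, and stub_generator merely stands in for a call into rule-based generation; it does not reflect how lsv_variants() is invoked.

```c
#include <string.h>

/* Hypothetical row of the inflected form|uninflected form table. */
typedef struct {
    const char *inflected;
    const char *uninflected;
} UninflectEntry;

typedef const char *(*RuleFallback)(const char *term);

/* Stand-in for rule-based generation; simply echoes the term. */
const char *stub_generator(const char *term)
{
    return term;
}

/* Look the term up in the table first; only when it is absent is the
   fallback generator consulted. Returns NULL if neither applies. */
const char *uninflect(const UninflectEntry *table, size_t n,
                      const char *term, RuleFallback fallback)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (strcmp(table[i].inflected, term) == 0)
            return table[i].uninflected;
    return fallback ? fallback(term) : NULL;
}
```

Because the table is consulted first, an irregular form such as "feet" resolves directly to "foot" without any rule being tried.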

Related Papers