STMT Tutorial

STMT provides subterm-related functions for NLP projects. This page describes that functionality by walking through an example. Please refer to the design documents for how the algorithm works.

  • Corpus
    A subterm is a subset of a term that is known to the corpus. The corpus is a collection of known terms. For example, if we want to find all subterms in the Lexicon, the corpus is all terms in the Lexicon (LexItem). In addition, if we want to know the EUI (Entry Unique Identifier) of each subterm, the corpus can include a term-to-EUI mapping (term|EUI). Accordingly, STMT provides two ways to specify the corpus in the configuration file:
    • CORPUS_FILE: term only
    • SYNONYM_FILE: term|mapping synonyms (such as EUI)

    The CORPUS_FILE is ignored if the SYNONYM_FILE is specified. The key (first field) of each SYNONYM_FILE entry is used as a term in the corpus.
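
    The relationship between the two corpus forms can be sketched in Java. This is only an illustration under stated assumptions, not the STMT loader: the class name CorpusSketch, the method parseSynonyms, and the choice to skip malformed lines are all hypothetical. It reads "term|synonym" entries and uses the first field as the corpus term, as described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not part of STMT.
class CorpusSketch {
    /** Parses "term|synonym" lines into a term -> synonyms map (insertion order kept). */
    static Map<String, List<String>> parseSynonyms(List<String> lines) {
        Map<String, List<String>> synonyms = new LinkedHashMap<>();
        for (String line : lines) {
            // Split on the first '|' only, so the mapping field may itself contain '|'.
            String[] fields = line.split("\\|", 2);
            if (fields.length < 2 || fields[0].isEmpty()) {
                continue; // assumption of this sketch: silently skip malformed lines
            }
            synonyms.computeIfAbsent(fields[0], k -> new ArrayList<>()).add(fields[1]);
        }
        return synonyms;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("dog|canine", "dog|puppy", "cat|feline");
        Map<String, List<String>> synonyms = parseSynonyms(lines);
        // The keys (first fields) double as the corpus terms.
        Set<String> corpus = synonyms.keySet();
        System.out.println(corpus);              // [dog, cat]
        System.out.println(synonyms.get("dog")); // [canine, puppy]
    }
}
```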

  • Normalization
    In NLP practice, normalization is used to match terms aggressively in order to increase recall. For example, case and punctuation are often abstracted away because they contribute little to meaning. STMT applies Lexical Tools APIs for normalization and provides the three most commonly used normalizations.

    In addition, users can create their own normalization through the abstract method of the Java StmtApi class: public abstract Vector Norm(String inStr);
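
    As a rough illustration of what a case- and punctuation-insensitive normalization might look like, here is a minimal Java sketch. It is not the Lexical Tools implementation and does not extend StmtApi: the class SimpleNormSketch, the lowercase method name norm, the Vector<String> element type, and the rule of replacing ASCII punctuation with spaces are all assumptions made for this example.

```java
import java.util.Locale;
import java.util.Vector;

// Hypothetical stand-in for a normalization; not the Lexical Tools behavior.
class SimpleNormSketch {
    /**
     * Lowercases the input, replaces ASCII punctuation with spaces,
     * and collapses runs of whitespace. Returns a Vector to mirror the
     * shape of the Norm(String) signature quoted above.
     */
    static Vector<String> norm(String inStr) {
        String normalized = inStr.toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}", " ") // punctuation -> space
                .trim()
                .replaceAll("\\s+", " ");      // collapse whitespace
        Vector<String> result = new Vector<>();
        result.add(normalized);                // one-to-one: exactly one output
        return result;
    }

    public static void main(String[] args) {
        System.out.println(norm("Dog, and CAT!")); // [dog and cat]
    }
}
```

    This sketch always returns exactly one normalized string per input, which is the one-to-one property that the prefix-related functions rely on.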

  • Example 1:
    Suppose we use LexItemNorm to ignore case and punctuation and have a simple (synonym) corpus as follows:

    Norm key       Synonym
    dog            canine
    dog            puppy
    canine         K9
    cat            feline
    feline         kitty
    dog and cat    pets

    The following examples illustrate basic functions of subterms:

    Input: Dog and cat

    Functions             Results
    In Corpus             true
    The Longest Prefix    dog and cat
    Prefixes
    • dog
    • dog and cat
    Subterms
    • cat
    • dog
    • dog and cat
    Subterm Synonym Substitutions
    • canine and cat
    • canine and feline
    • dog and feline
    • pets
    • puppy and cat
    • puppy and feline

    Please note that prefix-related functions require a one-to-one normalization, such as LexItemNorm, to work properly.
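
    The prefix and subterm results above can be reproduced with a brute-force scan over word spans. The Java sketch below is only an illustration: the class SubtermSketch and its method names are hypothetical, the input is assumed to be already normalized and space-separated, and the actual STMT algorithm (see the design documents) may work quite differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical brute-force sketch; not the STMT algorithm.
class SubtermSketch {
    /** Joins tokens[start, end) back into a space-separated candidate term. */
    private static String span(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    /** All corpus terms that start at the first word, shortest first. */
    static List<String> prefixes(String normTerm, Set<String> corpus) {
        String[] tokens = normTerm.split(" ");
        List<String> found = new ArrayList<>();
        for (int end = 1; end <= tokens.length; end++) {
            String candidate = span(tokens, 0, end);
            if (corpus.contains(candidate)) {
                found.add(candidate);
            }
        }
        return found;
    }

    /** The longest prefix found in the corpus, or null if there is none. */
    static String longestPrefix(String normTerm, Set<String> corpus) {
        List<String> found = prefixes(normTerm, corpus);
        return found.isEmpty() ? null : found.get(found.size() - 1);
    }

    /** All corpus terms at any word span, ordered by start then end position. */
    static List<String> subterms(String normTerm, Set<String> corpus) {
        String[] tokens = normTerm.split(" ");
        List<String> found = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int end = start + 1; end <= tokens.length; end++) {
                String candidate = span(tokens, start, end);
                if (corpus.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Norm keys of the example corpus.
        Set<String> corpus = new HashSet<>(
                Arrays.asList("dog", "canine", "cat", "feline", "dog and cat"));
        String normTerm = "dog and cat"; // normalized form of "Dog and cat"
        System.out.println(corpus.contains(normTerm));       // true
        System.out.println(longestPrefix(normTerm, corpus)); // dog and cat
        System.out.println(prefixes(normTerm, corpus));      // [dog, dog and cat]
        System.out.println(subterms(normTerm, corpus));      // [dog, dog and cat, cat]
    }
}
```

    The sketch lists subterms in position order rather than the alphabetical order shown in the table.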

  • Example 2: Subterm Synonym Substitutions

    Subterm synonym substitution is the most complicated operation in STMT. It consists of five steps, described below using the example above.

    Step        Results
    normTerm    dog and cat
    subterms
    • subterm[0]: dog|0|1
    • subterm[1]: dog and cat|0|3
    • subterm[2]: cat|2|3
    subterm patterns
    • Pattern[0]
      • dog|0|1|true
      • and|1|2|false
      • cat|2|3|false
    • Pattern[1]
      • dog|0|1|false
      • and|1|2|false
      • cat|2|3|true
    • Pattern[2]
      • dog|0|1|true
      • and|1|2|false
      • cat|2|3|true
    • Pattern[3]
      • dog and cat|0|3|true
    synonym patterns
    • Pattern[0]
      • dog|0|1|canine|puppy
      • and|1|2
      • cat|2|3
    • Pattern[1]
      • dog|0|1
      • and|1|2
      • cat|2|3|feline
    • Pattern[2]
      • dog|0|1|canine|puppy
      • and|1|2
      • cat|2|3|feline
    • Pattern[3]
      • dog and cat|0|3|pets
    synonym substitution permutations
    • canine and cat (1)
    • canine and feline (2)
    • dog and feline (1)
    • pets (1)
    • puppy and cat (1)
    • puppy and feline (2)
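
    The steps above can be approximated with a short recursive enumeration in Java. This is a sketch under assumptions, not the STMT implementation: the class SynonymSubstitutionSketch and its methods are hypothetical, the input is assumed to be normalized and space-separated, and the number attached to each result is assumed to be the count of subterms that were replaced. At each word position the code either keeps the word or replaces a corpus subterm starting there with one of its synonyms, which covers every combination of non-overlapping subterms.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch; not the STMT implementation.
class SynonymSubstitutionSketch {
    /** Maps each substituted term to the number of subterms replaced in it. */
    static SortedMap<String, Integer> substitutions(
            String normTerm, Map<String, List<String>> synonyms) {
        String[] tokens = normTerm.split(" ");
        SortedMap<String, Integer> results = new TreeMap<>();
        expand(tokens, 0, new ArrayDeque<>(), 0, synonyms, results);
        return results;
    }

    private static void expand(String[] tokens, int pos, Deque<String> parts, int replaced,
                               Map<String, List<String>> synonyms,
                               SortedMap<String, Integer> results) {
        if (pos == tokens.length) {
            if (replaced > 0) { // skip the unchanged input term
                results.putIfAbsent(String.join(" ", parts), replaced);
            }
            return;
        }
        // Option 1: keep the word at this position unchanged.
        parts.addLast(tokens[pos]);
        expand(tokens, pos + 1, parts, replaced, synonyms, results);
        parts.removeLast();
        // Option 2: replace a subterm starting here with each of its synonyms.
        for (int end = pos + 1; end <= tokens.length; end++) {
            String subterm = String.join(" ", Arrays.copyOfRange(tokens, pos, end));
            for (String synonym : synonyms.getOrDefault(subterm, Collections.emptyList())) {
                parts.addLast(synonym);
                expand(tokens, end, parts, replaced + 1, synonyms, results);
                parts.removeLast();
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("dog", Arrays.asList("canine", "puppy"));
        synonyms.put("canine", Arrays.asList("K9"));
        synonyms.put("cat", Arrays.asList("feline"));
        synonyms.put("feline", Arrays.asList("kitty"));
        synonyms.put("dog and cat", Arrays.asList("pets"));
        // Prints one "term (count)" line per substitution, in alphabetical order.
        substitutions("dog and cat", synonyms)
                .forEach((term, count) -> System.out.println(term + " (" + count + ")"));
    }
}
```

    Synonyms are not expanded recursively in this sketch: "canine" is substituted for "dog", but "K9" is never substituted for "canine", which matches the six results listed above.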