Strip Stop Words

  • Short Description: Strip stop words

  • Full Description:

    Strips stop words from the input term. By definition, a stop word must be:

    1. a high-frequency word, such as a preposition.
    2. a grammar word, which contributes little to the meaning of the sentence.

    The default stop words are listed in the file "data/misc/stopWords.data". They are "of", "and", "with", "for", "nos", "to", "in", "by", "on", "the", "(non mesh)". The list is configurable, and matching is case-insensitive.

    This flow has no effect on the -m option; "none" is added at the end of the output.

  • Difference:
    1. The Java version trims output terms (removes spaces at the beginning and end of the term).
    2. The Java version uses the Lvg.Util.StripToken class to handle tokens joined with punctuation.


  • Features:
    1. Remove stop words from the input terms.
    2. Stop words are defined in a file named stopWords.data.
    3. Lvg allows users to modify the stop words list by modifying this file.
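
    As an illustration of a configurable stop word list, the sketch below reads stop words from a text source. The file format is an assumption made for this example only (one stop word per line, blank lines and lines starting with "#" ignored); it is not taken from the actual stopWords.data file. The class name StopWordListSketch and its load method are hypothetical and are not part of Lvg.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Lvg: loads a stop word list. */
class StopWordListSketch {

    /**
     * Reads one stop word per line (assumed format). Entries are
     * lower-cased so that later lookups can ignore case.
     */
    static Set<String> load(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String entry = line.trim();
                // Skip blank lines and (assumed) comment lines.
                if (entry.isEmpty() || entry.startsWith("#")) {
                    continue;
                }
                words.add(entry.toLowerCase(Locale.ROOT));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a stop word file.
        String data = "# sample list\nof\nAND\n\n(non mesh)\n";
        System.out.println(load(new StringReader(data)));
    }
}
```

    Adding or removing a line in the source changes which words are stripped, which is the kind of customization item 3 describes.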


  • Symbol: t

  • Examples:
    
    shell> lvg -f:t
    Bacterial infection in conditions classified elsewhere and of unspecified site
    
    Bacterial infection in conditions classified elsewhere and of unspecified site|
    Bacterial infection conditions classified elsewhere unspecified site|2047|16777215|t|1|
    

  • Implementation Logic:
    1. Tokenize each word of the input term (using Lvg.Util.StripToken).
    2. Load the stop words from the flat file.
    3. Strip all single-word stop words.
    4. Clean up and recompose the term after stripping.
    5. Strip multi-word stop words (such as non mesh).
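
    The steps above can be sketched roughly as follows. This is a simplified illustration, not the ToStripStopWords source: plain whitespace splitting stands in for Lvg.Util.StripToken (so punctuation attached to tokens is not handled), the stop words are hard-coded from the default list above instead of being loaded from the flat file, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; not the actual Lvg implementation. */
class StopWordStripSketch {

    // Default stop words as listed in this document.
    static final List<String> DEFAULT_STOP_WORDS = Arrays.asList(
        "of", "and", "with", "for", "nos", "to", "in", "by", "on", "the",
        "(non mesh)");

    static String strip(String term, List<String> stopWords) {
        // Separate single-word entries from multi-word entries.
        Set<String> single = new HashSet<>();
        List<String[]> multi = new ArrayList<>();
        for (String stopWord : stopWords) {
            String[] parts = stopWord.trim().split("\\s+");
            if (parts.length == 1) {
                single.add(parts[0].toLowerCase(Locale.ROOT));
            } else {
                multi.add(parts);
            }
        }

        // Steps 1 and 3: tokenize on whitespace (a simplification) and
        // drop single-word stop words, ignoring case.
        List<String> tokens = new ArrayList<>();
        for (String token : term.trim().split("\\s+")) {
            if (!token.isEmpty()
                    && !single.contains(token.toLowerCase(Locale.ROOT))) {
                tokens.add(token);
            }
        }

        // Step 5: drop multi-word stop words such as "(non mesh)".
        for (String[] phrase : multi) {
            tokens = removePhrase(tokens, phrase);
        }

        // Step 4: recompose; joining with single spaces also trims the term.
        return String.join(" ", tokens);
    }

    // Removes every run of tokens that equals the phrase, ignoring case.
    static List<String> removePhrase(List<String> tokens, String[] phrase) {
        List<String> kept = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (matchesAt(tokens, i, phrase)) {
                i += phrase.length;
            } else {
                kept.add(tokens.get(i));
                i++;
            }
        }
        return kept;
    }

    static boolean matchesAt(List<String> tokens, int start, String[] phrase) {
        if (start + phrase.length > tokens.size()) {
            return false;
        }
        for (int k = 0; k < phrase.length; k++) {
            if (!tokens.get(start + k).equalsIgnoreCase(phrase[k])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(strip(
            "Bacterial infection in conditions classified elsewhere and of unspecified site",
            DEFAULT_STOP_WORDS));
    }
}
```

    Run on the input term from the Examples section, the sketch produces the same output term shown there. It recomposes the term once at the end, whereas the steps above recompose before stripping multi-word stop words; for whitespace-delimited tokens the result is the same.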

  • Source Code: ToStripStopWords.java

  • Hierarchy: Object -> Transformation -> ToStripStopWords