SPECIALIST Spelling Suggestion Tools The National Library of Medicine

A SPECIALIST NLP Tool


 

 

Table of Contents

1 Introduction

2 Package Download and Installation

2.1 Package Download

2.2 Package Installation

3 The Tools

3.1 GSpell

3.1.1 Usage

3.1.2 Options

3.1.3 Output

3.1.4 Tunable Parameters

3.1.5 API's

3.1.5.1 Create or update an index

3.1.5.2 Retrieve suggestions for terms

3.1.5.3 Methods from the Candidate class.

3.1.5.4 GSpell GlobalBehavior and Find Options

3.1.5.5 Example Applications

3.1.5.6 Constructors for Apps that cannot set the classpath

3.1.6 Example Scripts

3.1.7 Dictionaries

3.1.8 Known Issues

3.1.9 Future Directions

3.2 Bag-O-WordsPlus

3.2.1 Usage

3.2.2 Options

3.2.3 Output

3.2.4 Tunable Parameters

3.2.5 API's

3.2.5.1 Create an index

3.2.5.2 retrieve suggestions for phrases

3.2.5.3 methods from the Candidate class.

3.2.5.4 Constructors for Apps that cannot set the classpath

3.2.6 Dictionaries

3.3 NGrams

3.3.1 Usage

3.3.2 Options

3.3.3 Output

3.3.4 Tunable Parameters

3.3.5 API's

3.3.5.1 Create or update an index

3.3.5.2 Retrieve suggestions for terms

3.3.5.3 Methods from the Candidate class.

3.3.5.4 GSpell GlobalBehavior and Find Options

3.3.5.5 Example Applications

4 Version

4.1 V0.0.33

4.2 V0.0.32

4.3 V0.0.30

4.4 V0.0.29

4.5 V0.0.28

4.6 V0.0.27

 

SPECIALIST Spelling Suggestion Tools

1 Introduction

Spelling suggestion tools and techniques are useful in a variety of applications from spelling suggestion embedded behind queries of a search engine, to nearest neighbor suggestions to find close or related terms from an index. This package includes a suite of  such tools:

         The GSpell program is a spelling suggestion tool that uses a mix of algorithms to retrieve close neighbors. This application is best suited to applications that index at the word or term level of tokenization. The mix of algorithms includes the NGrams, metaphone, common misspellings, and homophone retrieval tools. Candidates are evaluated by the Levenshtein edit distance, and similar ranked candidates are re-ordered by use of word based corpus frequencies.

         BagOWordsPlus is a phrase retrieval tool. This tool is useful to retrieve closest matching phrases to data such as strings from the Metathesaurus. The retrieval mechanism, as the name implies, is a bag of words implementation. It is aided by the use of GSpell to augment those word tokens that are not initially found. This tool suggests the top phrase matches to the input phrase. The phrase evaluation is based on an edit distance between the query and each suggestion. 

         NGrams is a character based N-gram retrieval tool that anchors the initial retrieval sets on the first and last N-grams of the query term. By default the NGrams are bi-grams.

         MetaphoneRetrieve is a phonetic retrieval tool that uses Lawrence Philips Metaphone algorithm to normalize terms to a rough phonetic representation.

         Homophones is a lookup mechanism into a table of homophones.

         CommonMisspellings is lookup tool to retrieve correct spellings of common misspellings that we've gleaned.

2 Package Download and Installation

2.1 Package Download

Jar file

Description

Size

gspellProjectSrc.jar

Sources Release

Includes all the sources necessary to build Gspell, all the data, and build scripts.

173,453 KB

gspellProjectStd.jar

Standard Release
Includes an install script, all the necessary class and data

70,946 KB

gspellProject.jar

Standard Jar
Includes all the necessary class files to embed this application into another application. This jar also includes the NLS Java Repository packages utils, dbm, store, uncommon, jdbm, installUtils, and simpleCorpusStatistics along with the gspell package.

1,719 KB

gspell.jar

Essential jar
Includes only the gspell package class files, not the other possibly redundant NLS Java Repository packages: utils, dbm, store, uncommon, jdbm, simpleCorpusStatistics, and installUtils.
This gspell jar is useful when integrating with other NLS Java Repository projects, such as the MMTx project.

165 KB

 

2.2 Package Installation

          Un-jar the gspellProjectStd.jar into the location where you want to install gspell.
jar xvf gspellProjectStd.jar

         Change directory to the gspell directory that was just created from the previous step.

         Run either the install.bat or install.sh

This will create the following scripts in the gspell/bin directory:

GSpell.[bat|sh]

GSpellIndex .[bat|sh]

GSpellFind .[bat|sh]

GSpellUpdate .[bat|sh]

GSpellIndexCommon .[bat|sh]

GSpellUpdateCommon .[bat|sh]

bagOWordsPlusIndex.[bat|sh]

bagOWordsPlus.[bat|sh]

homophones.[bat|sh]

metaphone.[bat|sh]

ngrams.[bat|sh]

 

These scripts set up the proper java environment prior to invoking their respective java programs. These scripts accept the options mentioned above.

Invoke the gspell applications via these scripts.

         [Optional] If you are integrating this into other applications, or want to run the java program outside the scripts, set your $CLASSPATH environment to include gspell/lib/gspellProject.jar and the gspell/config  location. This package includes a gspell/dictionaries directory. The applications put the dictionaries in the dictionaries directory. If you want to use an alternative location, use the configuration options --dictionaryDir=  to specify the parent directory of where the dictionary directory will reside.

         Retrieve suggestions from the default dictionary created from the SPECIALIST Lexicon. (See the End User Examples) 

         [Optional]. Create a dictionary of your terms. (See the End User Examples) 

3 The Tools

3.1 GSpell

Gspell is a spelling suggestion tool that uses a mix of algorithms to retrieve close neighbors. This application is best suited to applications that index at the word or term level of tokenization.

Java API's are provided for developers to embed this application in their applications. Two java main programs are provided, GSpell, and CommonMisspellings. GSpell is the general spelling suggestion tool. The auxiliary CommonMispellings program is provided as the mechanism for inserting and updating common misspelling by correct spelling pairs. Once installed, the listed shell scripts around these two java programs are available.

See the output section for an explanation of the the output from the GSpell program.

3.1.1 Usage

GSpell.[sh|bat]

   or

java

  -cp "$TOP/gspell/lib/gspellProject.jar;

       $TOP/gspell/config"

  -ms100m -mx600m

    gov.nih.nlm.nls.gspell.GSpell

  

--index|--update|--find|--export 

   [--dictionaryDir=someFullPath]

    --dictionaryName=NameOfDictionary 

   [--inputFile=YYYYY] [--outputFile=ZZZZZ]

   [--fieldedText] [--termField=X]

 

   [--truncate=N] [--considerNCandidates=N]

   [--maxEditDistance=N] [--correctField=Y]

   [--wordLengthHeuristic=true|false]

   [--reportTime] [--help]

GSpellIndex.[sh|bat]

   [--dictionaryDir=someFullPath]

    --dictionaryName=NameOfDictionary 

   [--inputFile=YYYYY] [--outputFile=ZZZZZ]

   [--fieldedText] [--termField=X]

   [--reportTime] [--help]     

GSpellFind.[sh|bat]

   [--dictionaryDir=someFullPath]

    --dictionaryName=NameOfDictionary 

   [--inputFile=YYYYY] [--outputFile=ZZZZZ]

   [--fieldedText] [--termField=X]

 

   [--truncate=N] [--considerNCandidates=N]

   [--maxEditDistance=N] [--correctField=Y]

   [--wordLengthHeuristic=true|false]

   [--reportTime] [--help]

GSpellUpdate.[sh|bat]

[--dictionaryDir=someFullPath]

    --dictionaryName=NameOfDictionary 

   [--inputFile=YYYYY] [--outputFile=ZZZZZ]

   [--fieldedText] [--termField=X]

 

   [--truncate=N] [--considerNCandidates=N]

   [--maxEditDistance=N] [--correctField=Y]

   [--wordLengthHeuristic=true|false]

   [--reportTime] [--help]

 

3.1.2 Options

 

--index

Create a new dictionary. Take input from either the file specified by the --inputFile parameter, or from standard input. Input terms are expected to be of the format, one term per new line separated line. Lines starting with # are taken to be comments are are ignored. Existing dictionaries of the same name will be overwritten.

A note about memory. This application requires morethan the standard amount of memory when indexing. The total amount of memory used during indexing is reported with the --reportTime option.  It has been observed that 128MB was needed to load a set of 300,000 terms. Please use the java options -MX and -MS to run with the appropriate amount of memory when indexing.

--update

Add terms to an existing dictionary. This option is similar to --index option, except that if the dictionary exists, terms are added to it, rather than having the dictionary overwritten. Take input from either the file specified by the --inputFile parameter, or  from standard input. Input terms are expected to be of the format, one term per new line separated line. Lines starting with # are taken to be comments are are ignored.

--find

Retrieve close suggestions to input terms. Input is either taken from the standard input [the default], or from a file specified via the --inputFile. By default the input is to be in the format of one term per newline delimited line. However, the application can take fielded input (see the --fieldedText and -- termField options).

The output returned contains the input plus four fields. The first of these fields is the suggestion, the second is a ranking, the third field is the method that returned this suggestion, and the last field, will indicate whether or not a term was found, or was found to be correct (in a case insensitive way).

--export

Spew out to the contents of the indexes to standard output, or to the file specified by the -- outputFile option. Caution: this option spews out lots of data. This option is useful really only for debugging purposes. This option may disappear in future releases.

--unicode

Tell the application to expect UTF-8 formated input, and to write out UTF-8 data.

--dictionaryName=

The name of the dictionary to be created or to be referenced. Since a directory is created based on this name, it should be named compliant with the naming rules for directories.

Each dictionary is placed in the dictionaries directory. The dictionaries directory is specified by the --dictionaryDir=  path found in the $GSPELL_ROOT/config/gspellRegistry.cfg file. The $GSPELL_ROOT directory is the base directory where GSpell was installed.

--dictionaryDir=

Dictionary directories containing the indexes are placed in the in the $DictionaryDir. $DictionaryDir is specified in the $GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overuled by the --dictionaryDir=aFullPath command line option. The $GSPELL_ROOT directory is the base directory where GSpell was installed.

A note to those who would like to have dictionaries placed in a location other than $GSPELL_ROOT/dictionaries: specify  the --dictionaryDir= and --dictionaryName= commandline options.              

--inputFile=

Specify the (full path) name of an input file. The semantics of this file refer to being the contents of what is to be index when this option is used with the --index and --update options. The contents of this file is taken as input for retrieval when the --find option is used.

--outputFile=

Specify the name of the output file.

--fieldedText

Indicate that the input is Pipe delimited input, where one of the fields contains the significant term, and the rest of the terms are passed through. This option must be used with the --termField option to indicate the field to retrieve. This option is only relevant with the --find option.

--termField=X

Indicate the field to pick up the term to be looked up. The first field is taken to be 1, not zero.

--truncate=N

Return the top N suggestions.

--considerNCandidates=N

Limit the total number of candidates to be evaluated to N. This parameter has a direct relation to both the speed and accuracy of the application. The more candidates evaluated, the better the suggestions, but the slower the response is.

--maxEditDistance=N

Limit the returned suggestions to those who have edit distances that are less than or equal to N.

--correctField=Y

This option is for debugging and benchmarking purposes. When present, it indicates the field that contains the correct spelling. Further, benchmark statistics are computed when this option is present. Those statistics include the number of responses where the correct term was found in the first, the first five, the first ten, the first hundred suggestions along with how many responses completely missed the mark. The output stats file is labeled dictionaryName.stats and is placed in the directory where the application is kicked off.

--reportTime

This option reports status messages to standard error during processing. It is useful when indexing a large set of terms to determine how long it takes to process 10,000 terms, and how much memory was needed.

--wordLengthHeurstic=
true|false

Limit the candidates to be evaluated to those which are with +/-4 characters of the input term. The nGram index is segmented into bins by the size of input words.

When this heuristic is turned on, the number of candidates is effectively pruned to not even consider words that are known to be outside reasonable edit distances already. This allows additional candidates that might not have been able to be evaluated to be evaluated. This heuristic has two beneficial aspects to it. It speeds up suggestions, and it suggests additional close neighbors.

This heuristic does increase the size of the resulting index by about 50% over an index that did not use this heuristic. Making a non- wordlengthHeurstic index is not an option. This heuristic prunes multi word neighbors that may share a word in common.

The size of target words range from 1 to over 100 characters, and are distributed in a Zipf like distribution (at least for our dictionary), where there are a few 1 character words, some more two character words, a fair amount of three character words, and a huge number of words between four characters and ten characters. The numbers start trailing off slowly after that. There is a precipitous drop off in the number of words that have more than 20 characters. This option is turned on by default.

--version

--CVSVersion

--history

Retrieve the verion of this build

Retrieve the CVS tagged version of this build

Retreive the history notes of this build.

--help

Help

 

3.1.3 Output

GSpell's output for an input term contains the following fields:

Input term

Suggestion

Edit Distance

Rank from 0 to 1
(1.0 is an exact match)

[1]

Method used

Other messages
such as Correct

Frequency from corpus

[2]

Weighted rank

[3]

Here's an example of the output for the misspelled term anonomous.

anonomous

anonomous|anonymous|1.0|0.87|NGrams||402|1999598

anonomous|autonomous|2.0|0.58|NGrams||664|2999336

anonomous|allonomous|2.0|0.58|NGrams||-1|3000001

anonomous|analogous|3.0|0.3|NGrams||1008|3998992

anonomous|anomalous|3.0|0.3|NGrams||890|3999110

anonomous|anonymously|3.0|0.3|NGrams||54|3999946

anonomous|anadromous|3.0|0.3|NGrams||1|3999999

anonomous|anonymes|3.0|0.3|Metaphone||-1|4000001

anonomous|anonyms|3.0|0.3|Metaphone||-1|4000001

anonomous|Axonopus|3.0|0.3|NGrams||-1|4500001

 

Here is an example for the correctly spelled word disease

disease

disease|disease|0.0|1.0|NGrams|Correct|94306|905694

disease|diseases|1.0|0.87|NGrams||18485|1981515

disease|diseased|1.0|0.87|NGrams||855|1999145

disease|dispase|1.0|0.87|NGrams||27|1999973

disease|discase|1.0|0.87|NGrams||3|1999997

disease|diverse|2.0|0.58|NGrams||3134|2996866

disease|disuse|2.0|0.58|NGrams||84|2999916

disease|disperse|2.0|0.58|NGrams||66|2999934

disease|diastase|2.0|0.58|NGrams||25|2999975

disease|dispense|2.0|0.58|NGrams||24|2999976

 

[1] The rank is the normalized ranking with a value between 0 and 1 where 1 is an exact match and 0 is no match at all. The distribution for this normalized score is intended to have edit distances of .5 rank about the .90. The distribution curve is half the bell curve, and intended to allow developers relying on edit distance as the sorting metric to have a metric that ranks words beyond 2 edit distances as quickly getting worse as the curve extends out.

[2] The corpus frequency is taken from the 1999 Medline word frequencies. See the simpleCorpusStatistics notes for details on how to embed other corpus frequencies into this package.

[3] The weighted rank is currently the following:

     

      [ ( the edit distance * the method number)  * 1000 )  ]    + [ MAX FREQ - the frequency of the word found in the corpus]

 

 

    Where

          The edit distance is the Levenshtein edit distance

          The frequency of the word found in the corpus comes from the 1999 Medline word frequency list, modified to throw out words with frequencies less than 4 occurrences, as an efficiency to keep the list manageable.

          The MAX FREQ currently is a constant == 1,000,000.  This number was intended to be the number of occurrences of words in the corpus, but this proved hard to compute and pass on to the ranking method. A suitably large number was chosen as a surrogate.

          The method numbers were intended to weight homophones and corrections to common misspellings higher than other methods.

Method

Method Number

NGrams

1

Metaphone

1

Homophones

.5

Common Misspellings

.5

bagOWordsPlus

1

default

1

 

 

                 

3.1.4 Tunable Parameters

There are tuning parameters that are specific to each dictionary created. These parameters are stored in a configuration file in the dictionary directory. 

Performance tuning variables

--considerNCandidates

The number of candidates that can be evaluated for any given input. The higher the number the better the suggestions are, but the longer it takes to do. A reasonable rate on a sun is about 2000. This parameter alters the behavior of the NGrams algorithm.

--cacheSize

This is the number of grams to keep in memory during processing. The more grams in memory, the faster the response is. This makes a real difference for indexing, where if you can keep all the grams in memory, the faster indexing will go. The downside is more memory is used. For indexing, this may mean bumping up the memory to something along the lines of 100mb for dictionaries of 100,000 terms or more. This variable is used to keep the number of grams in memory to this size. The hash of grams does not grow beyond this number. Ideally, this size should not be set bigger than the number of grams you have. This parameter alters the behavior of the NGrams algorithm.

 

Variables that can be changed to give specific behaviors

--truncate

This variable limits the number of suggestions returned for a given input.

--maxEditDistance

This variable limits suggestions to those which have an edit distance equal to or less than this parameter.

--wordLengthHeuristic

This heuristic retrieves candidate terms that are +/- 4 characters in length larger or smaller than the query term. For example, if you put in a term that is two characters long, the longest term retrieved would be 6 characters.  This heuristic is turned on by default.  This heuristic only affects the behavior of the NGram algorithm.

 

3.1.5 API's

GSpell Package 

 

API builders should view the gspell package. It should be possible to embed this package into other applications using the following constructors and methods for the class GSpell.

Gspell instances can be embedded into single threaded applications with no problem. It can be embedded into threaded applications with the following advice:

Embedding read-only instances of GSpell objects is fine as long as there are no read/write instances present.

It is recommended that in cases where there are updates or read/writes, this object gets implemented as a singleton instance per dictionary, that is shared across threads and processes.

There is NO writer lock and unlock around the update/index method, nor around the cleanup or flush methods to insure that only one writer is alive at any given time, and that there are no readers alive. It is up to the application developer to insure that access is controlled around GSpell Object instances that do writes or updates.

3.1.5.1 Create or update an index

public GSpell(String args[]) throws Exception

public GSpell(GlobalBehavior pSettings) throws Exception
public GSpell(String pDictionaryName, int pMode) throws Exception
public int index( BufferedReader pFStream ) throws Exception
public void index( String pTerm ) throws Exception
public int update( BufferedReader pFStream ) throws Exception
public void update( String pTerm ) throws Exception

3.1.5.2 Retrieve suggestions for terms

public GSpell(GlobalBehavior pSettings) throws Exception

public GSpell(String pDictionaryName) throws Exception
public GSpell(String pDictionaryName, int pMode) throws Exception
public int find( BufferedReader pFStream, PrintWriter pPrintStream ) throws Exception
public Candidate[] find( String pTerm ) throws Exception
public Candidate[] find( FindOptions fOptions, String pTerm ) throws Exception

3.1.5.3 Methods from the Candidate class.

public String getName()
public double getEditDistance()

public double getEditDistance1000()
public double getRank()
public String getFromMethod()
public String getMessage()

public long getCorpusFrequency()

public long getSortValue()

public long getMaxFreq()
public String toString()

3.1.5.4 GSpell GlobalBehavior and Find Options

public GlobalBehavior( String pApplicationName, String pRegistryFile, String args[])


public FindOptions (GlobalBehavior pSettings)

public FindOptions()

public FindOptions (int pConsiderNCandidates, boolean pUseWordLengthHeuristic, int pTruncateSize, pMaxEditDistance)

3.1.5.5 Example Applications

 

Examples that illustrate embedding this in an application are provided in the examples directory.

         Index.java 

         IndexHarder.java 

         Suggest.java 

         SuggestHarder.java 

         Update.java 

 

3.1.5.6 Constructors for Apps that cannot set the classpath

For those applications that cannot set the classpath, use the the following constructor:

GSpell(java.lang.String pConfigPath,

      java.lang.String pDictionaryName,

       int pMode)


The pConfigPath parameter should specify the full path to the ___GSPELL_ROOT/config path.

The dictionaryName should reflect the name of a directory from the ___GSPELL_ROOT/dictionaries/dictionaryName directory, or it should be the full path to the dictionary to be opened.

This constructor is used for those applications that cannot set the classpath, such as java servlet apps.

3.1.6 Example Scripts

Indexing

         Create a 1000 word dictionary 

         Create a 100,000 word dictionary, report the status 

         Create the dictionary for the SPECIALIST Lexicon and create a common misspelling index for the SPECIALIST Lexicon 

Suggestions

         Suggest words from the SPECIALIST Lexicon, taking terms from standard input. 

         Find 1000 words from the SPECIALIST Lexicon, note the time.

3.1.7 Dictionaries

         Data

          The 2003 Lexicon Term List. 

          The 2004 Lexicon Term List

          Common misspellings that have been gathered from a variety of sources.

          List of Homophones

          Indexed Dictionaries

          2003Lexicon.jar

          2004Lexicon.jar

3.1.8 Known Issues

The application is driven by some heuristics that in general work well, but have consequences.

There is a consequence to the bathtub heuristic. If the spelling error occurs in the first or last character of the string, the suggestions, if any will more than likely be wrong. In a future release, two additional techniques may be employed to ameliorate this: a left and right truncation retrieval, and a conservative stemming heuristic. A centroid approach may be looked into as well, but that approach has a high cost of training associated with it.

The quality of the returned suggested terms are directly related to the number of candidates evaluated. The more candidates evaluated, the better, but the cost is the amount of time it takes to return suggestions.

Nick Ide has suggested several improvements that should be adopted:

         Have the suggestions be weighted by the likelihood that the transformation between the bad and good word was caused by some understandable heuristic. For instance it is common to see vowel transformations, where it is less common to see a vowel to consonant transformation. Likewise typo's caused by keyboard configuration are common.

This suggests instituting a confusion matrix a-la Tom Lasco within the edit distance. Tom Lasco augmented gspell with a confusion matrix two years ago for an OCR application.

         Institute a logging function to capture good/bad spelling pairs.

3.1.9 Future Directions

Performance enhancements

         Create threads for each of the spelling techniques. Each is independent of the other, and this would speed the application up by about 20%.

Quality Enhancements

         Investigate whether removing plural sembiants from incoming terms helps or hurts. I have observed that some noticeable portion of misses were caused by the incoming term containing a trailing s, but where only the singular form of the word was in the target.

         Employ a left and right truncation retrieval to enhance recall of "related" terms, such as everything that starts with the word "yellow". Left truncation would enhance recall for those terms that are prefixed by prefixes such as "un" "anti" "pro" "Con".

Misc. Enhancements

         Gather and distribute application specific dictionaries.

3.2 Bag-O-WordsPlus

BagOWordsPlus is a phrase retrieval tool. This tool is useful to retrieve closest matching phrases to data such as Metathesuarus strings. The retrieval mechanism, as the name implies, is a bag of words implementation. It is aided by the use of spelling suggestion to augment those word tokens that are not initially found. This is not an interactive tool that offers suggestions for each unmatched word. Instead, the top spelling suggestions are added to retrieve phrases that contain both the matched words and these suggestions. This tool suggests the top phrase matches to the input phrase. The phrase evaluation is based on an edit distance between the query and each suggestion.

Java API's are provided for developers to embed this application in their applications. Two java main programs are provided, BagOWordsPlus and BagOWordsPlusIndex. BagOWordsPlusIndex is the program that indexes the target and BagOWordsPlus is the program that does the retrieval. Once installed, the listed shell scripts around these two java programs are available.

See the output section for an explanation of the the output from the GSpell program.

3.2.1 Usage

 

BagOWordsPlus.[sh|bat]

   or

java

  -cp "$TOP/gspell/lib/gspellProject.jar;

       $TOP/gspell/config"

          gov.nih.nlm.nls.gspell.BagOWordsPlus

  

   [--dictionaryDir=someFullPath]

    --dictionaryName=NameOfDictionary 

   [--strict]

   [--truncate=N] [--considerNCandidates=N]

   [--maxEditDistance=N]

   [--wordLengthHeuristic=true|false]

   [--help]

BagOWordsPlusIndex.[sh|bat]

   or

java

  -cp "$TOP/gspell/lib/gspellProject.jar;

       $TOP/gspell/config"

          gov.nih.nlm.nls.gspell.BagOWordsPlusIndex

   [--dictionaryDir=someFullPath]

    --dictionaryName=NameOfDictionary 

   [--inputFile=YYYYY]

   [--help]     

 

3.2.2 Options

 

--dictionaryName=

The  name of the dictionary to be created or to be referenced. Since a directory is created based on this name, it should be named compliant with the naming rules for directories.

Each dictionary is placed in the dictionaries directory. The dictionaries directory is specified by the -- dictionaryDir=  path found in the $GSPELL_ROOT/config/gspellRegistry.cfg file. The $GSPELL_ROOT directory is the base directory where GSpell was installed. 

--dictionaryDir=

Dictionary directories containing the indexes are placed in the in the $DictionaryDir. $DictionaryDir is specified in the $GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overuled by the -- dictionaryDir=aFullPath command line option. The $GSPELL_ROOT directory is the base directory where GSpell was installed.

A note to those who would like to have dictionaries placed in a location other than $GSPELL_ROOT/dictionaries: specify  the --dictionaryDir= and --dictionaryName= commandline options.  

--inputFile=

Specify the (full path) name of an input file to be indexed.

--unicode

Tell the application to expect UTF-8 formated input, and to write out UTF-8 data.

--considerNCandidates=N

Limit the total number of candidates to be evaluated to N. This parameter has a direct relation to both the speed and accuracy of the application. The more candidates evaluated, the better the suggestions, but the slower the response is.

--maxEditDistance=N

Limit the returned suggestions to those who have edit distances that are less than or equal to N.

--strict

Turn off spelling suggestion

--version

--CVSVersion

--history

Retrieve the verion of this build

Retrieve the CVS tagged version of this build

Retreive the history notes of this build.

--help

 

 

3.2.3 Output

BagOWordsPlus has the same output format as that of GSpell. This output currently is the "toString()" output from each candidate

Input term

Suggestion

Edit Distance

Rank from 0 to 1
(1.0 is an exact match)

[1]

Method used

Other messages
such as Correct

Frequency from corpus

[2]

Weighted rank

[3]

 

sleepin' disease

sleepin' disease|sleeping disease|1.0|0.87|BagOWordsPlus||0|2000000

sleepin' disease|creeping disease|3.0|0.3|BagOWordsPlus||0|4000000

 

 

[1] The rank is the normalized ranking with a value between 0 and 1 where 1 is an exact match and 0 is no match at all. The distribution for this normalized score is intended to have edit distances of .5 rank about the .90. The distribution curve is half the bell curve, and intended to allow developers relying on edit distance as the sorting metric to have a metric that ranks words beyond 2 edit distances as quickly getting worse as the curve extends out.

[2] The corpus frequency is taken from the 1999 Medline word frequencies. This is not used directly within BagOWordsPlus, because the frequencies are not of the phrases that bagOWords is intended to index on. However, the gspell algorithm that is called upon within GSpell to make good suggestions does use the frequencies at the word level.

[3] The weighted rank is currently the following:

     

      [ ( the edit distance * the method number)  * 1000 )  ]    + [ MAX FREQ - the frequency of the word found in the corpus]

    Where

          The edit distance is the Levenshtein edit distance

          The frequency  is 0, since we don't have phrase frequencies.

          The MAX FREQ currently is a constant == 1,000,000.  This number was intended to be the number of occurrences of words in the corpus, but this proved hard to compute and pass on to the ranking method. A suitably large number was chosen as a surrogate.

          The method number = 1 for bagOWordsPlus.

 

                 

3.2.4 Tunable Parameters

There are tuning parameters that are specific to each dictionary created. These parameters are stored in a configuration file in the dictionary directory. 

Performance tuning variables

--considerNCandidates

The number of candidates that can be evaluated for any given input. The higher the number the better the suggestions are, but the longer it takes to do. A reasonable rate on a sun is about 2000. This parameter alters the behavior of the NGrams algorithm.

--cacheSize

This is the number of grams to keep in memory during processing. The more grams in memory, the faster the response is. This makes a real difference for indexing, where if you can keep all the grams in memory, the faster indexing will go. The downside is more memory is used. For indexing, this may mean bumping up the memory to something along the lines of 100mb for dictionaries of 100,000 terms or more. This variable is used to keep the number of grams in memory to this size. The hash of grams does not grow beyond this number. Ideally, this size should not be set bigger than the number of grams you have. This parameter alters the behavior of the NGrams algorithm.

 

Variables that can be changed to give specific behaviors

--strict

Turn off spelling suggestion when retrieving candidates. This has the effect of only using the correctly spelled words within the query phrase to retrieve candidates. This will make retrieval faster, but will limit accuracy when misspelled words are within the query.

--truncate

This variable limits the number of suggestions returned for a given input.

--maxEditDistance

This variable limits suggestions to those which have an edit distance equal to or less than this parameter.

--wordLengthHeuristic

This heuristic retrieves candidate terms that are +/- 4 characters in length larger or smaller than the query term. For example, if you put in a term that is two characters long, the longest term retrieved would be 6 characters.  This heuristic is turned on by default.  This heuristic only affects the behavior of the NGram algorithm.

 

3.2.5 API's

3.2.5.1 Create an index

public BagOWordsPlusIndex(String[] argv) throws GSpellException

 

public BagOWordsPlusIndex(GlobalBehavior pSettings) throws GSpellException
public void index( ) throws GSpellException
public void index( String pTerm ) throws GSpellException

3.2.5.2 retrieve suggestions for phrases

public BagOWordsPlus(String[] argv) throws Exception

public BagOWordsPlusIndex(String pDictionaryName) throws GSpellException
public String[] find( String pTerm ) throws Exception
public Candidate[] get( String pTerm ) throws Exception

3.2.5.3 methods from the Candidate class.

public String getName()
public double getEditDistance()

public double getEditDistance1000()
public double getRank()
public String getFromMethod()
public String getMessage()

public long getCorpusFrequency()

public long getSortValue()

public long getMaxFreq()
public String toString()

3.2.5.4 Constructors for Apps that cannot set the classpath

For those applications that cannot set the classpath, use the the following constructor:

BagOWordsPlus(java.lang.String pConfigPath,

             java.lang.String pDictionaryName,

             int pMode)


The pConfigPath parameter should specify the full path to the ___GSPELL_ROOT/config path.

The dictionaryName should reflect the name of a directory from the ___GSPELL_ROOT/dictionaries/dictionaryName directory, or it should be the full path to the dictionary to be opened.

This constructor is used for those applications that cannot set the classpath, such as java servlet apps.

3.2.6 Dictionaries

Unjar the selected dictionaries in the dictionaries directory.

 

Description

DictionaryName

Name of the Jar file

The 2004 Lexicon indexes

2004LexiconBagOWords

2004LexiconBagOWords.jar

The 2003 Lexicon indexes

2003LexiconBagOWords

2003LexiconBagOWords.jar

The 2004AA Metathesaurus indexes

meta2004AABagOWords

meta2004AABagOWords.jar

The 2003AC Metathesuarus indexes

meta2003ACBagOWords

meta2003ACBagOWords.jar

The 2003AB Metathesuarus indexes

meta2003ABBagOWords

meta2003ABBagOWords.jar

The 2003AA Metathesuarus indexes

meta2003AABagOWords

meta2003AABagOWords.jar

The 2002AD Metathesuarus indexes

meta2002ADBagOWords

meta2002ADBagOWords.jar

The 2002AC Metathesuarus indexes

meta2002ACBagOWords

meta2002ACBagOWords.jar

The 2002AB Metathesuarus indexes

meta2002ABBagOWords

meta2002ABBagOWords.jar

The 2002AA Metathesuarus indexes

meta2002AABagOWords

meta2002AABagOWords.jar

The 2002AA-2004AB Metathesaurus indexes

2002AA-2004ABBagOWords

2002AA-2004ABBagOWords.jar

The 2002AA-2004AC Metathesaurus indexes

2002AA-2004ACBagOWords

2002AA-2004ACBagOWords.jar

 

3.3 NGrams

NGrams is a character based n-gram spelling suggestion tool. By default, it uses bi-grams. It indexes a set of words or terms, and provides a retrieval mechanism to retrieve close suggestions to input terms. Java API's are provided for developers to embed this application in their applications.

Once a dictionary is created, suggestions can be gotten via using this tool with the --find option, along with the --dictionaryName= option.

A dictionary is created using the --index option along with the --dictionary=NameOfDictionary option. Input terms can be an input file, specified by the --inputFile= option. Otherwise, the input is expected to be from standard input. The input is expected to be in the form of a term per new line separated row. Lines starting with # are taken to be comments are are ignored.

Dictionary directories containing the indexes are placed in the in the $DictionaryDir.  $DictionaryDir is specified in the $GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overuled by the -- dictionaryDir=aFullPath command line option.

The $GSPELL_ROOT directory is the base directory where GSpell  was installed.

3.3.1 Usage

NGrams.[sh|bat]

   or

java

  -cp "$TOP/gspell/lib/gspellProject.jar;

       $TOP/gspell/config"

  -ms100m -mx600m

    gov.nih.nlm.nls.gspell.NGrams

  

--index|--update|--find|--export 

   [--dictionaryDir=someFullPath]

    --dictionaryName=NameOfDictionary 

   [--inputFile=YYYYY] [--outputFile=ZZZZZ]

   [--fieldedText] [--termField=X]

 

   [--truncate=N] [--considerNCandidates=N]

   [--maxEditDistance=N] [--correctField=Y]

   [--wordLengthHeuristic=true|false]

   [--reportTime] [--help]

 

3.3.2 Options

 

--index

Create a new dictionary. Take input from either the file specified by the --inputFile parameter, or from standard input. Input terms are expected to be of the format, one term per new line separated line. Lines starting with # are taken to be comments are are ignored. Existing dictionaries of the same name will be overwritten.

A note about memory. This application requires morethan the standard amount of memory when indexing. The total amount of memory used during indexing is reported with the --reportTime option.  It has been observed that 128MB was needed to load a set of 300,000 terms. Please use the java options -MX and -MS to run with the appropriate amount of memory when indexing.

--update

Add terms to an existing dictionary. This option is similar to --index option, except that if the dictionary exists, terms are added to it, rather than having the dictionary overwritten. Take input from either the file specified by the --inputFile parameter, or  from standard input. Input terms are expected to be of the format, one term per new line separated line. Lines starting with # are taken to be comments are are ignored.

--find

Retrieve close suggestions to input terms. Input is either taken from the standard input [the default], or from a file specified via the --inputFile. By default the input is to be in the format of one term per newline delimited line. However, the application can take fielded input (see the --fieldedText and -- termField options).

The output returned contains the input plus four fields. The first of these fields is the suggestion, the second is a ranking, the third field is the method that returned this suggestion, and the last field, will indicate whether or not a term was found, or was found to be correct (in a case insensitive way).

--export

Spew out to the contents of the indexes to standard output, or to the file specified by the -- outputFile option. Caution: this option spews out lots of data. This option is useful really only for debugging purposes. This option may disappear in future releases.

--dictionaryName=

The name of the dictionary to be created or to be referenced. Since a directory is created based on this name, it should be named compliant with the naming rules for directories.

Each dictionary is placed in the dictionaries directory. The dictionaries directory is specified by the --dictionaryDir=  path found in the $GSPELL_ROOT/config/gspellRegistry.cfg file. The $GSPELL_ROOT directory is the base directory where GSpell was installed.

--dictionaryDir=

Dictionary directories containing the indexes are placed in the in the $DictionaryDir. $DictionaryDir is specified in the $GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overuled by the --dictionaryDir=aFullPath command line option. The $GSPELL_ROOT directory is the base directory where GSpell was installed.

A note to those who would like to have dictionaries placed in a location other than $GSPELL_ROOT/dictionaries: specify  the --dictionaryDir= and --dictionaryName= commandline options.              

--inputFile=

Specify the (full path) name of an input file. The semantics of this file refer to being the contents of what is to be index when this option is used with the --index and --update options. The contents of this file is taken as input for retrieval when the --find option is used.

--outputFile=

Specify the name of the output file.

--fieldedText

Indicate that the input is Pipe delimited input, where one of the fields contains the significant term, and the rest of the terms are passed through. This option must be used with the --termField option to indicate the field to retrieve. This option is only relevant with the --find option.

--termField=X

Indicate the field to pick up the term to be looked up. The first field is taken to be 1, not zero.

--truncate=N

Return the top N suggestions.

--considerNCandidates=N

Limit the total number of candidates to be evaluated to N. This parameter has a direct relation to both the speed and accuracy of the application. The more candidates evaluated, the better the suggestions, but the slower the response is.

--maxEditDistance=N

Limit the returned suggestions to those who have edit distances that are less than or equal to N.

--correctField=Y

This option is for debugging and benchmarking purposes. When present, it indicates the field that contains the correct spelling. Further, benchmark statistics are computed when this option is present. Those statistics include the number of responses where the correct term was found in the first, the first five, the first ten, the first hundred suggestions along with how many responses completely missed the mark. The output stats file is labeled dictionaryName.stats and is placed in the directory where the application is kicked off.

--reportTime

This option reports status messages to standard error during processing. It is useful when indexing a large set of terms to determine how long it takes to process 10,000 terms, and how much memory was needed.

--wordLengthHeurstic=
true|false

Limit the candidates to be evaluated to those which are with +/-4 characters of the input term. The nGram index is segmented into bins by the size of input words.

When this heuristic is turned on, the number of candidates is effectively pruned to not even consider words that are known to be outside reasonable edit distances already. This allows additional candidates that might not have been able to be evaluated to be evaluated. This heuristic has two beneficial aspects to it. It speeds up suggestions, and it suggests additional close neighbors.

This heuristic does increase the size of the resulting index by about 50% over an index that did not use this heuristic. Making a non- wordlengthHeurstic index is not an option. This heuristic prunes multi word neighbors that may share a word in common.

The size of target words range from 1 to over 100 characters, and are distributed in a Zipf like distribution (at least for our dictionary), where there are a few 1 character words, some more two character words, a fair amount of three character words, and a huge number of words between four characters and ten characters. The numbers start trailing off slowly after that. There is a precipitous drop off in the number of words that have more than 20 characters. This option is turned on by default.

--version

--CVSVersion

--history

Retrieve the verion of this build

Retrieve the CVS tagged version of this build

Retreive the history notes of this build.

--help

Help

 

3.3.3 Output

NGram's output for an input term contains the following fields:

Input term

Suggestion

Edit Distance

Rank from 0 to 1
(1.0 is an exact match)

[1]

Method used

Other messages
such as Correct

Frequency from corpus

[2]

Weighted rank

[3]

Here's an example of the output for the misspelled term anonomous.

anonomous

anonomous|anonymous|1.0|0.87|NGrams||0|2000000

anonomous|allonomous|2.0|0.58|NGrams||0|3000000

anonomous|autonomous|2.0|0.58|NGrams||0|3000000

anonomous|axonopus|3.0|0.3|NGrams||0|4000000

anonomous|analogous|3.0|0.3|NGrams||0|4000000

anonomous|anomalous|3.0|0.3|NGrams||0|4000000

anonomous|anonymously|3.0|0.3|NGrams||0|4000000

anonomous|annosus|3.0|0.3|NGrams||0|4000000

anonomous|anadromous|3.0|0.3|NGrams||0|4000000

anonomous|anthonomus|3.0|0.3|NGrams||0|4000000

 

Here is an example for the correctly spelled word sleep

sleep

sleep|sleep|0.0|1.0|NGrams|Correct|0|1000000

sleep|sheep|1.0|0.87|NGrams||0|2000000

sleep|sweep|1.0|0.87|NGrams||0|2000000

sleep|steep|1.0|0.87|NGrams||0|2000000

sleep|sleepy|1.0|0.87|NGrams||0|2000000

sleep|sleeps|1.0|0.87|NGrams||0|2000000

sleep|spleep|1.0|0.87|NGrams||0|2000000

sleep|seep|1.0|0.87|NGrams||0|2000000

sleep|sweeps|2.0|0.58|NGrams||0|3000000

sleep|seeps|2.0|0.58|NGrams||0|3000000

 

[1] The rank is the normalized ranking with a value between 0 and 1 where 1 is an exact match and 0 is no match at all. The distribution for this normalized score is intended to have edit distances of .5 rank about the .90. The distribution curve is half the bell curve, and intended to allow developers relying on edit distance as the sorting metric to have a metric that ranks words beyond 2 edit distances as quickly getting worse as the curve extends out.

[2] The corpus frequency is taken from the 1999 Medline word frequencies. (This is only used with the GSpell application)

[3] The weighted rank is currently the following:

     

      [ ( the edit distance * the method number)  * 1000 )  ]    + [ MAX FREQ - the frequency of the word found in the corpus]

 

 

    Where

          The edit distance is the Levenshtein edit distance

          The frequency of the word found is 0 for NGrams

          The MAX FREQ currently is a constant == 1,000,000.  This number was intended to be the number of occurrences of words in the corpus, but this proved hard to compute and pass on to the ranking method. A suitably large number was chosen as a surrogate.

          The method numbers were intended to weight homophones and corrections to common misspellings higher than other methods. The method number for NGrams = 1.

 

 

                 

3.3.4 Tunable Parameters

There are tuning parameters that are specific to each dictionary created. These parameters are stored in a configuration file in the dictionary directory. 

Performance tuning variables

--considerNCandidates

The number of candidates that can be evaluated for any given input. The higher the number the better the suggestions are, but the longer it takes to do. A reasonable rate on a sun is about 2000. This parameter alters the behavior of the NGrams algorithm.

--cacheSize

This is the number of grams to keep in memory during processing. The more grams in memory, the faster the response is. This makes a real difference for indexing, where if you can keep all the grams in memory, the faster indexing will go. The downside is more memory is used. For indexing, this may mean bumping up the memory to something along the lines of 100mb for dictionaries of 100,000 terms or more. This variable is used to keep the number of grams in memory to this size. The hash of grams does not grow beyond this number. Ideally, this size should not be set bigger than the number of grams you have. This parameter alters the behavior of the NGrams algorithm.

 

Variables that can be changed to give specific behaviors

--truncate

This variable limits the number of suggestions returned for a given input.

--maxEditDistance

This variable limits suggestions to those which have an edit distance equal to or less than this parameter.

--wordLengthHeuristic

This heuristic retrieves candidate terms that are +/- 4 characters in length larger or smaller than the query term. For example, if you put in a term that is two characters long, the longest term retrieved would be 6 characters.  This heuristic is turned on by default.  This heuristic only affects the behavior of the NGram algorithm.

 

3.3.5 API's

GSpell Package (contains the NGrams methods)

 

API builders should view the gspell package. It should be possible to embed this package into other applications using the following constructors and methods for the class GSpell.

Gspell instances can be embedded into single threaded applications with no problem. It can be embedded into threaded applications with the following advice:

Embedding read-only instances of GSpell objects is fine as long as there are no read/write instances present.

It is recommended that in cases where there are updates or read/writes, this object gets implemented as a singleton instance per dictionary, that is shared across threads and processes.

There is NO writer lock and unlock around the update/index method, nor around the cleanup or flush methods to insure that only one writer is alive at any given time, and that there are no readers alive. It is up to the application developer to insure that access is controlled around GSpell Object instances that do writes or updates.

3.3.5.1 Create or update an index

public NGrams(GlobalBehavior pSettings) throws Exception

public NGrams(String args[]) throws Exception
public int index( BufferedReader pFStream ) throws Exception
public void index( String pTerm ) throws Exception
public int update( BufferedReader pFStream ) throws Exception
public void update( String pTerm ) throws Exception

3.3.5.2 Retrieve suggestions for terms

public GSpell(Options pOptions) throws Exception
public GSpell(String pDictionaryName, int pMode) throws Exception
public int find( BufferedReader pFStream, PrintWriter pPrintStream ) throws Exception
public Candidate[] find( String pTerm ) throws Exception
public Candidate[] find( FindOptions fOptions, String pTerm ) throws Exception

 

3.3.5.3 Methods from the Candidate class.

public String getName()
public double getEditDistance()

public double getEditDistance1000()
public double getRank()
public String getFromMethod()
public String getMessage()

public long getCorpusFrequency()

public long getSortValue()

public long getMaxFreq()
public String toString()

3.3.5.4 GSpell GlobalBehavior and Find Options

public GlobalBehavior( String pApplicationName, String pRegistryFile, String args[])


public FindOptions (GlobalBehavior pSettings)

public FindOptions()

public FindOptions (int pConsiderNCandidates, boolean pUseWordLengthHeuristic, int pTruncateSize, pMaxEditDistance)

3.3.5.5 Example Applications

 

Examples that illustrate embedding this in an application are provided in the examples directory.

          NGramsIndex.java 

          NGramsSuggest.java 

          NGramsUpdate.java 

4 Version

4.1 V0.0.33

-Fixed the following issues:

         The gspell Update was not saving all entries - This used to work. It broke when I tried to deal with utf issues. I was allocating two extra bytes per ngram name thinking that utf characters would be wider. A bad thought. When an update came in, the stuff in those extra bytes might not be the stuff in in the bytes during the previous proccess'es index. So, sometimes the gram names were not being matched as a consequence.

         Added a --unicode flag to flag input and output files in the UTF-8 format.

         The gspell and bagOwordsPlus index, now automatically create the common misspelling and homophone indexes for the dictionary.

         Switched the indexing from Russell Loane's tools to the Berkeley btree software - not because there were any problems with his tools, but because I had filled in some additional Berkeley Btree methods to peek into indexes that I had not done with Russell's implementation.

4.2 V0.0.32

Added (back) in a constructor for those who work in a classpathless environment to pass in the full path to the config file. This is primarily for those embedding this in java servlet applications where adding classpaths is difficult.

Updated the distribution so that it now contains all the sources needed to be compliled

4.3 V0.0.30

3/31/04

          A more robust build, with test hooks, and baseline

          JSP inteface added.

          Noticed that the bagOWordsPlus index was overwriting the gspell index.

          Made these two dictionaries. They cannot share the same space.

4.4 V0.0.29

2/11/04

          A more mature versioning added

4.5 V0.0.28

This version contains the following new features

          Homophones

          Medline frequencies to reorder candidates

          Java NLS Repository style config files and settings

          Update function restored

          Deprecated rather deleted some of the older public methods

4.6 V0.0.27

This version includes Russel Loane's store methods. This version was released Nov. 2003.

 

 


Contact us at: umlslex@nlm.nih.gov

The SPECIALIST NLP Tools

Updated: December 3, 2004


Lexical Systems Group
Cognitive Science Branch
Lister Hill National Center for Biomedical Communications
National Library of Medicine
National Institutes of Health
Department of Health & Human Services
Copyright - Privacy - Accessibility