2
Package Download and Installation
3.1.5.1
Create or update an index
3.1.5.2
Retrieve suggestions for terms
3.1.5.3
Methods from the Candidate class.
3.1.5.4
GSpell GlobalBehavior and Find Options
3.1.5.6
Constructors for Apps that cannot set the classpath
3.2.5.2
retrieve suggestions for phrases
3.2.5.3
methods from the Candidate class.
3.2.5.4
Constructors for Apps that cannot set the classpath
3.3.5.1
Create or update an index
3.3.5.2
Retrieve suggestions for terms
3.3.5.3
Methods from the Candidate class.
3.3.5.4
GSpell GlobalBehavior and Find Options
SPECIALIST Spelling Suggestion Tools
Spelling suggestion tools and techniques are useful in a
variety of applications from spelling suggestion embedded behind queries of a
search engine, to nearest neighbor suggestions to find close or related terms
from an index. This package includes a suite of such tools:
· The GSpell program is a spelling suggestion tool that uses a mix of algorithms to retrieve close neighbors. This application is best suited to applications that index at the word or term level of tokenization. The mix of algorithms includes the NGrams, metaphone, common misspellings, and homophone retrieval tools. Candidates are evaluated by the Levenshtein edit distance, and similar ranked candidates are re-ordered by use of word based corpus frequencies.
· BagOWordsPlus is a phrase retrieval tool. This tool is useful to retrieve closest matching phrases to data such as strings from the Metathesaurus. The retrieval mechanism, as the name implies, is a bag of words implementation. It is aided by the use of GSpell to augment those word tokens that are not initially found. This tool suggests the top phrase matches to the input phrase. The phrase evaluation is based on an edit distance between the query and each suggestion.
· NGrams is a character based N-gram retrieval tool that anchors the initial retrieval sets on the first and last N-grams of the query term. By default the NGrams are bi-grams.
· MetaphoneRetrieve is a phonetic retrieval tool that uses Lawrence Philips Metaphone algorithm to normalize terms to a rough phonetic representation.
· Homophones is a lookup mechanism into a table of homophones.
· CommonMisspellings is lookup tool to retrieve correct spellings of common misspellings that we've gleaned.
2 Package
Download and Installation
Jar file |
Description |
Size |
Sources
Release Includes
all the sources necessary to build Gspell, all the data, and build scripts. |
173,453
KB |
|
Standard Release |
70,946
KB |
|
Standard Jar |
1,719
KB |
|
Essential jar |
165
KB |
·
Un-jar the gspellProjectStd.jar into the
location where you want to install gspell.
jar xvf
gspellProjectStd.jar
· Change directory to the gspell directory that was just created from the previous step.
· Run either the install.bat or install.sh
This will create the following scripts in the gspell/bin
directory:
GSpell.[bat|sh] |
GSpellIndex .[bat|sh] |
GSpellFind .[bat|sh] |
GSpellUpdate .[bat|sh] |
GSpellIndexCommon .[bat|sh] |
GSpellUpdateCommon .[bat|sh] |
bagOWordsPlusIndex.[bat|sh] |
bagOWordsPlus.[bat|sh] |
homophones.[bat|sh] |
metaphone.[bat|sh] |
ngrams.[bat|sh] |
These scripts set up the proper java environment prior to
invoking their respective java programs. These scripts accept the options
mentioned above.
Invoke the gspell applications via these scripts.
· [Optional] If you are integrating this into other applications, or want to run the java program outside the scripts, set your $CLASSPATH environment to include gspell/lib/gspellProject.jar and the gspell/config location. This package includes a gspell/dictionaries directory. The applications put the dictionaries in the dictionaries directory. If you want to use an alternative location, use the configuration options --dictionaryDir= to specify the parent directory of where the dictionary directory will reside.
· Retrieve suggestions from the default dictionary created from the SPECIALIST Lexicon. (See the End User Examples)
· [Optional]. Create a dictionary of your terms. (See the End User Examples)
Gspell is a spelling suggestion tool that uses a mix of
algorithms to retrieve close neighbors. This application is best suited to
applications that index at the word or term level of tokenization.
Java API's are provided for developers to embed this
application in their applications. Two java main programs are provided, GSpell,
and CommonMisspellings. GSpell is the general spelling suggestion tool. The
auxiliary CommonMispellings program is provided as the mechanism for inserting
and updating common misspelling by correct spelling pairs. Once installed, the
listed shell scripts around these two java programs are available.
See the output section for an explanation of the the output
from the GSpell program.
GSpell.[sh|bat]
or java
-cp
"$TOP/gspell/lib/gspellProject.jar;
$TOP/gspell/config" -ms100m
-mx600m
gov.nih.nlm.nls.gspell.GSpell
|
--index|--update|--find|--export
[--dictionaryDir=someFullPath]
--dictionaryName=NameOfDictionary
[--inputFile=YYYYY] [--outputFile=ZZZZZ]
[--fieldedText] [--termField=X]
[--truncate=N] [--considerNCandidates=N]
[--maxEditDistance=N] [--correctField=Y]
[--wordLengthHeuristic=true|false]
[--reportTime] [--help] |
GSpellIndex.[sh|bat] |
[--dictionaryDir=someFullPath]
--dictionaryName=NameOfDictionary
[--inputFile=YYYYY] [--outputFile=ZZZZZ]
[--fieldedText] [--termField=X]
[--reportTime] [--help] |
GSpellFind.[sh|bat] |
[--dictionaryDir=someFullPath]
--dictionaryName=NameOfDictionary
[--inputFile=YYYYY] [--outputFile=ZZZZZ]
[--fieldedText] [--termField=X]
[--truncate=N] [--considerNCandidates=N]
[--maxEditDistance=N] [--correctField=Y]
[--wordLengthHeuristic=true|false]
[--reportTime] [--help] |
GSpellUpdate.[sh|bat] |
[--dictionaryDir=someFullPath]
--dictionaryName=NameOfDictionary
[--inputFile=YYYYY] [--outputFile=ZZZZZ]
[--fieldedText] [--termField=X]
[--truncate=N] [--considerNCandidates=N]
[--maxEditDistance=N] [--correctField=Y]
[--wordLengthHeuristic=true|false]
[--reportTime] [--help] |
--index |
Create
a new dictionary. Take input from either the file specified by the
--inputFile parameter, or from standard input. Input terms are expected to be
of the format, one term per new line separated line. Lines starting with #
are taken to be comments are are ignored. Existing dictionaries of the same
name will be overwritten. A
note about memory. This application requires morethan the standard amount of
memory when indexing. The total amount of memory used during indexing is
reported with the --reportTime option. It has been observed that 128MB
was needed to load a set of 300,000 terms. Please use the java options -MX
and -MS to run with the appropriate amount of memory when indexing. |
--update |
Add terms to an existing dictionary. This option is
similar to --index option, except that if the dictionary exists, terms are
added to it, rather than having the dictionary overwritten. Take input from
either the file specified by the --inputFile parameter, or from
standard input. Input terms are expected to be of the format, one term per
new line separated line. Lines starting with # are taken to be comments are
are ignored. |
--find |
Retrieve close suggestions to input terms. Input is
either taken from the standard input [the default], or from a file specified
via the --inputFile. By default the input is to be in the format of one term
per newline delimited line. However, the application can take fielded input
(see the --fieldedText and -- termField options). The output returned contains the input plus four
fields. The first of these fields is the suggestion, the second is a ranking,
the third field is the method that returned this suggestion, and the last
field, will indicate whether or not a term was found, or was found to be
correct (in a case insensitive way). |
--export |
Spew out to the contents of the indexes to standard
output, or to the file specified by the -- outputFile option. Caution: this
option spews out lots of data. This option is useful really only for
debugging purposes. This option may disappear in future releases. |
--unicode |
Tell the application to expect UTF-8 formated input, and to write out UTF-8 data. |
--dictionaryName= |
The name of the dictionary to be created or to be
referenced. Since a directory is created based on this name, it should be
named compliant with the naming rules for directories. Each dictionary is placed in the dictionaries directory.
The dictionaries directory is specified by the --dictionaryDir= path
found in the $GSPELL_ROOT/config/gspellRegistry.cfg file. The $GSPELL_ROOT
directory is the base directory where GSpell was installed. |
--dictionaryDir= |
Dictionary directories containing the indexes are
placed in the in the $DictionaryDir. $DictionaryDir is specified in the
$GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overuled by the
--dictionaryDir=aFullPath command line option. The $GSPELL_ROOT directory is
the base directory where GSpell was installed. A note to those who would like to have dictionaries
placed in a location other than $GSPELL_ROOT/dictionaries: specify the
--dictionaryDir= and --dictionaryName= commandline
options.
|
--inputFile= |
Specify the (full path) name of an input file. The
semantics of this file refer to being the contents of what is to be index
when this option is used with the --index and --update options. The contents
of this file is taken as input for retrieval when the --find option is used. |
--outputFile= |
Specify the name of the output file. |
--fieldedText |
Indicate that the input is Pipe delimited input,
where one of the fields contains the significant term, and the rest of the
terms are passed through. This option must be used with the --termField
option to indicate the field to retrieve. This option is only relevant with
the --find option. |
--termField=X |
Indicate the field to pick up the term to be looked
up. The first field is taken to be 1, not zero. |
--truncate=N |
Return the top N suggestions. |
--considerNCandidates=N |
Limit the total number of candidates to be
evaluated to N. This parameter has a direct relation to both the speed and
accuracy of the application. The more candidates evaluated, the better the
suggestions, but the slower the response is. |
--maxEditDistance=N |
Limit the returned suggestions to those who have
edit distances that are less than or equal to N. |
--correctField=Y |
This option is for debugging and benchmarking
purposes. When present, it indicates the field that contains the correct
spelling. Further, benchmark statistics are computed when this option is
present. Those statistics include the number of responses where the correct
term was found in the first, the first five, the first ten, the first hundred
suggestions along with how many responses completely missed the mark. The
output stats file is labeled dictionaryName.stats and is placed in the
directory where the application is kicked off. |
--reportTime |
This option reports status messages to standard
error during processing. It is useful when indexing a large set of terms to
determine how long it takes to process 10,000 terms, and how much memory was
needed. |
--wordLengthHeurstic= |
Limit the candidates to be evaluated to those which
are with +/-4 characters of the input term. The nGram index is segmented into
bins by the size of input words. When this heuristic is turned on, the number of
candidates is effectively pruned to not even consider words that are known to
be outside reasonable edit distances already. This allows additional
candidates that might not have been able to be evaluated to be evaluated.
This heuristic has two beneficial aspects to it. It speeds up suggestions,
and it suggests additional close neighbors. This heuristic does increase the size of the
resulting index by about 50% over an index that did not use this heuristic.
Making a non- wordlengthHeurstic index is not an option. This heuristic
prunes multi word neighbors that may share a word in common. The size of target words range from 1 to over 100
characters, and are distributed in a Zipf like distribution (at least for our
dictionary), where there are a few 1 character words, some more two character
words, a fair amount of three character words, and a huge number of words
between four characters and ten characters. The numbers start trailing off
slowly after that. There is a precipitous drop off in the number of words
that have more than 20 characters. This option is turned on by default. |
--version --CVSVersion --history |
Retrieve
the verion of this build Retrieve
the CVS tagged version of this build Retreive
the history notes of this build. |
--help |
Help |
GSpell's output for an input term contains the following
fields:
Input term |
Suggestion |
Edit Distance |
Rank from 0 to 1 [1] |
Method used |
Other messages |
Frequency from corpus [2] |
Weighted rank [3] |
Here's an example of the output for the misspelled term anonomous.
anonomous
anonomous|anonymous|1.0|0.87|NGrams||402|1999598
anonomous|autonomous|2.0|0.58|NGrams||664|2999336
anonomous|allonomous|2.0|0.58|NGrams||-1|3000001
anonomous|analogous|3.0|0.3|NGrams||1008|3998992
anonomous|anomalous|3.0|0.3|NGrams||890|3999110
anonomous|anonymously|3.0|0.3|NGrams||54|3999946
anonomous|anadromous|3.0|0.3|NGrams||1|3999999
anonomous|anonymes|3.0|0.3|Metaphone||-1|4000001
anonomous|anonyms|3.0|0.3|Metaphone||-1|4000001
anonomous|Axonopus|3.0|0.3|NGrams||-1|4500001
Here is an example for the correctly spelled word disease
disease
disease|disease|0.0|1.0|NGrams|Correct|94306|905694
disease|diseases|1.0|0.87|NGrams||18485|1981515
disease|diseased|1.0|0.87|NGrams||855|1999145
disease|dispase|1.0|0.87|NGrams||27|1999973
disease|discase|1.0|0.87|NGrams||3|1999997
disease|diverse|2.0|0.58|NGrams||3134|2996866
disease|disuse|2.0|0.58|NGrams||84|2999916
disease|disperse|2.0|0.58|NGrams||66|2999934
disease|diastase|2.0|0.58|NGrams||25|2999975
disease|dispense|2.0|0.58|NGrams||24|2999976
[1] The
rank is the normalized ranking with a value between 0 and 1 where 1 is an exact
match and 0 is no match at all. The distribution for this normalized score is
intended to have edit distances of .5 rank about the .90. The distribution
curve is half the bell curve, and intended to allow developers relying on edit
distance as the sorting metric to have a metric that ranks words beyond 2 edit
distances as quickly getting worse as the curve extends out.
[2] The
corpus frequency is taken from the 1999 Medline word frequencies. See the
simpleCorpusStatistics notes for details on how to embed other corpus
frequencies into this package.
[3] The
weighted rank is currently the following:
[ ( the edit distance * the method number) * 1000 )
] + [ MAX FREQ - the frequency of the word found in the corpus]
Where
·
The
edit distance is the Levenshtein edit distance
·
The
frequency of the word found in the corpus comes from the 1999 Medline word
frequency list, modified to throw out words with frequencies less than 4
occurrences, as an efficiency to keep the list manageable.
·
The
MAX FREQ currently is a constant == 1,000,000. This number was intended
to be the number of occurrences of words in the corpus, but this proved hard to
compute and pass on to the ranking method. A suitably large number was chosen
as a surrogate.
·
The
method numbers were intended to weight homophones and corrections to common
misspellings higher than other methods.
Method |
Method
Number |
NGrams |
1 |
Metaphone |
1 |
Homophones |
.5 |
Common
Misspellings |
.5 |
bagOWordsPlus |
1 |
default |
1 |
There are tuning parameters that are specific to each
dictionary created. These parameters are stored in a configuration file in the
dictionary directory.
Performance tuning variables
--considerNCandidates |
The number of candidates that can be evaluated for
any given input. The higher the number the better the suggestions are, but
the longer it takes to do. A reasonable rate on a sun is about 2000. This
parameter alters the behavior of the NGrams algorithm. |
--cacheSize |
This is the number of grams to keep in memory
during processing. The more grams in memory, the faster the response is. This
makes a real difference for indexing, where if you can keep all the grams in
memory, the faster indexing will go. The downside is more memory is used. For
indexing, this may mean bumping up the memory to something along the lines of
100mb for dictionaries of 100,000 terms or more. This variable is used to
keep the number of grams in memory to this size. The hash of grams does not grow
beyond this number. Ideally, this size should not be set bigger than the
number of grams you have. This parameter alters the behavior of the NGrams
algorithm. |
Variables that can be changed to give specific
behaviors
--truncate |
This variable limits the number of suggestions
returned for a given input. |
--maxEditDistance |
This variable limits suggestions to those which
have an edit distance equal to or less than this parameter. |
--wordLengthHeuristic |
This heuristic retrieves candidate terms that are +/-
4 characters in length larger or smaller than the query term. For example, if
you put in a term that is two characters long, the longest term retrieved
would be 6 characters. This heuristic is turned on by default.
This heuristic only affects the behavior of the NGram algorithm. |
API builders should view the gspell package. It should be
possible to embed this package into other applications using the following
constructors and methods for the class GSpell.
Gspell instances can be embedded into single threaded
applications with no problem. It can be embedded into threaded applications
with the following advice:
Embedding read-only instances of GSpell objects is fine as
long as there are no read/write instances present.
It is recommended that in cases where there are updates or
read/writes, this object gets implemented as a singleton instance per
dictionary, that is shared across threads and processes.
There is NO writer lock and
unlock around the update/index method, nor around the cleanup or flush methods
to insure that only one writer is alive at any given time, and that there are
no readers alive. It is up to the application developer to insure that access
is controlled around GSpell Object instances that do writes or updates.
3.1.5.1 Create or update an index
public GSpell(String
args[]) throws Exception
public
GSpell(GlobalBehavior pSettings) throws Exception
public
GSpell(String pDictionaryName, int pMode) throws Exception
public
int index( BufferedReader pFStream ) throws Exception
public
void index( String pTerm ) throws Exception
public
int update( BufferedReader pFStream ) throws Exception
public
void update( String pTerm ) throws Exception
3.1.5.2 Retrieve suggestions for terms
public
GSpell(GlobalBehavior pSettings) throws Exception
public GSpell(String
pDictionaryName) throws Exception
public
GSpell(String pDictionaryName, int pMode) throws Exception
public
int
find( BufferedReader pFStream, PrintWriter pPrintStream ) throws
Exception
public
Candidate[]
find( String pTerm ) throws Exception
public
Candidate[] find( FindOptions fOptions, String pTerm ) throws Exception
3.1.5.3 Methods from the Candidate
class.
public String getName()
public
double getEditDistance()
public
double getEditDistance1000()
public
double getRank()
public
String getFromMethod()
public
String getMessage()
public
long getCorpusFrequency()
public
long getMaxFreq()
public
String toString()
3.1.5.4 GSpell GlobalBehavior and Find
Options
public
GlobalBehavior( String pApplicationName, String pRegistryFile, String args[])
public
FindOptions (GlobalBehavior pSettings)
Examples that illustrate embedding this in an application
are provided in the examples directory.
3.1.5.6 Constructors for Apps that
cannot set the classpath
For those
applications that cannot set the classpath, use the the following constructor:
GSpell(java.lang.String
pConfigPath,
java.lang.String
pDictionaryName,
int pMode)
The pConfigPath parameter should specify the full path to the ___GSPELL_ROOT/config
path.
The dictionaryName should reflect the name of a directory
from the ___GSPELL_ROOT/dictionaries/dictionaryName directory, or it should be
the full path to the dictionary to be opened.
This constructor is used for those applications that cannot
set the classpath, such as java servlet apps.
Indexing
·
Create a 1000 word
dictionary
·
Create a
100,000 word dictionary, report the status
·
Create the
dictionary for the SPECIALIST Lexicon and create a common misspelling index for the
SPECIALIST Lexicon
Suggestions
·
Suggest words from the SPECIALIST
Lexicon, taking terms from standard input.
·
Find 1000 words
from the SPECIALIST Lexicon, note the time.
· Data
· Common misspellings that have been gathered from a variety of sources.
·
Indexed
Dictionaries
The application is driven by some heuristics that in general
work well, but have consequences.
There is a consequence to the bathtub heuristic. If the spelling
error occurs in the first or last character of the string, the suggestions, if
any will more than likely be wrong. In a future release, two additional
techniques may be employed to ameliorate this: a left and right truncation
retrieval, and a conservative stemming heuristic. A centroid approach may be
looked into as well, but that approach has a high cost of training associated
with it.
The quality of the returned suggested terms are directly
related to the number of candidates evaluated. The more candidates evaluated,
the better, but the cost is the amount of time it takes to return suggestions.
Nick Ide has suggested several improvements that should be
adopted:
· Have the suggestions be weighted by the likelihood that the transformation between the bad and good word was caused by some understandable heuristic. For instance it is common to see vowel transformations, where it is less common to see a vowel to consonant transformation. Likewise typo's caused by keyboard configuration are common.
This suggests instituting a confusion matrix a-la Tom Lasco
within the edit distance. Tom Lasco augmented gspell with a confusion matrix
two years ago for an OCR application.
· Institute a logging function to capture good/bad spelling pairs.
Performance enhancements
· Create threads for each of the spelling techniques. Each is independent of the other, and this would speed the application up by about 20%.
Quality Enhancements
· Investigate whether removing plural sembiants from incoming terms helps or hurts. I have observed that some noticeable portion of misses were caused by the incoming term containing a trailing s, but where only the singular form of the word was in the target.
· Employ a left and right truncation retrieval to enhance recall of "related" terms, such as everything that starts with the word "yellow". Left truncation would enhance recall for those terms that are prefixed by prefixes such as "un" "anti" "pro" "Con".
Misc. Enhancements
· Gather and distribute application specific dictionaries.
BagOWordsPlus is a phrase retrieval tool. This tool is
useful to retrieve closest matching phrases to data such as Metathesuarus
strings. The retrieval mechanism, as the name implies, is a bag of words
implementation. It is aided by the use of spelling suggestion to augment those
word tokens that are not initially found. This is not an interactive tool that
offers suggestions for each unmatched word. Instead, the top spelling
suggestions are added to retrieve phrases that contain both the matched words
and these suggestions. This tool suggests the top phrase matches to the input
phrase. The phrase evaluation is based on an edit distance between the query
and each suggestion.
Java API's are provided for developers to embed this application
in their applications. Two java main programs are provided, BagOWordsPlus and
BagOWordsPlusIndex. BagOWordsPlusIndex is the program that indexes the target
and BagOWordsPlus is the program that does the retrieval. Once installed, the
listed shell scripts around these two java programs are available.
See the output section
for an explanation of the the output from the GSpell program.
BagOWordsPlus.[sh|bat]
or java
-cp
"$TOP/gspell/lib/gspellProject.jar;
$TOP/gspell/config"
gov.nih.nlm.nls.gspell.BagOWordsPlus
|
[--dictionaryDir=someFullPath]
--dictionaryName=NameOfDictionary
[--strict]
[--truncate=N] [--considerNCandidates=N]
[--maxEditDistance=N]
[--wordLengthHeuristic=true|false]
[--help] |
BagOWordsPlusIndex.[sh|bat]
or java
-cp
"$TOP/gspell/lib/gspellProject.jar;
$TOP/gspell/config"
gov.nih.nlm.nls.gspell.BagOWordsPlusIndex |
[--dictionaryDir=someFullPath]
--dictionaryName=NameOfDictionary
[--inputFile=YYYYY]
[--help] |
--dictionaryName= |
The name of the dictionary to be created or
to be referenced. Since a directory is created based on this name, it should
be named compliant with the naming rules for directories. Each dictionary is placed in the dictionaries
directory. The dictionaries directory is specified by the --
dictionaryDir= path found in the $GSPELL_ROOT/config/gspellRegistry.cfg
file. The $GSPELL_ROOT directory is the base directory where GSpell was
installed. |
--dictionaryDir= |
Dictionary directories containing the indexes are
placed in the in the $DictionaryDir. $DictionaryDir is specified in the
$GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overuled by the --
dictionaryDir=aFullPath command line option. The $GSPELL_ROOT directory is
the base directory where GSpell was installed. A note to those who would like to have dictionaries
placed in a location other than $GSPELL_ROOT/dictionaries: specify the
--dictionaryDir= and --dictionaryName= commandline options. |
--inputFile= |
Specify the (full path) name of an input file to be
indexed. |
--unicode |
Tell the application to expect UTF-8 formated input, and to write out UTF-8 data. |
--considerNCandidates=N |
Limit the total number of candidates to be
evaluated to N. This parameter has a direct relation to both the speed and
accuracy of the application. The more candidates evaluated, the better the
suggestions, but the slower the response is. |
--maxEditDistance=N |
Limit the returned suggestions to those who have
edit distances that are less than or equal to N. |
--strict |
Turn off spelling suggestion |
--version --CVSVersion --history |
Retrieve
the verion of this build Retrieve
the CVS tagged version of this build Retreive
the history notes of this build. |
--help |
|
BagOWordsPlus has the same output format as that of GSpell.
This output currently is the "toString()" output from each candidate
Input term |
Suggestion |
Edit Distance |
Rank from 0 to 1 [1] |
Method used |
Other messages |
Frequency from corpus [2] |
Weighted rank [3] |
sleepin'
disease
sleepin'
disease|sleeping disease|1.0|0.87|BagOWordsPlus||0|2000000
sleepin'
disease|creeping disease|3.0|0.3|BagOWordsPlus||0|4000000
[1] The
rank is the normalized ranking with a value between 0 and 1 where 1 is an exact
match and 0 is no match at all. The distribution for this normalized score is
intended to have edit distances of .5 rank about the .90. The distribution
curve is half the bell curve, and intended to allow developers relying on edit
distance as the sorting metric to have a metric that ranks words beyond 2 edit
distances as quickly getting worse as the curve extends out.
[2] The
corpus frequency is taken from the 1999 Medline word frequencies. This is not
used directly within BagOWordsPlus, because the frequencies are not of the
phrases that bagOWords is intended to index on. However, the gspell algorithm
that is called upon within GSpell to make good suggestions does use the
frequencies at the word level.
[3] The
weighted rank is currently the following:
[ ( the edit distance * the method number) * 1000 )
] + [ MAX FREQ - the frequency of the word found in the
corpus]
Where
·
The
edit distance is the Levenshtein edit distance
·
The
frequency is 0, since we don't have phrase frequencies.
·
The
MAX FREQ currently is a constant == 1,000,000. This number was intended
to be the number of occurrences of words in the corpus, but this proved hard to
compute and pass on to the ranking method. A suitably large number was chosen
as a surrogate.
·
The
method number = 1 for bagOWordsPlus.
There are tuning parameters that are specific to each dictionary
created. These parameters are stored in a configuration file in the dictionary
directory.
Performance tuning variables
--considerNCandidates |
The number of candidates that can
be evaluated for any given input. The higher the number the better the
suggestions are, but the longer it takes to do. A reasonable rate on a sun is
about 2000. This parameter alters the behavior of the NGrams algorithm. |
--cacheSize |
This is the number of grams to keep in memory
during processing. The more grams in memory, the faster the response is. This
makes a real difference for indexing, where if you can keep all the grams in
memory, the faster indexing will go. The downside is more memory is used. For
indexing, this may mean bumping up the memory to something along the lines of
100mb for dictionaries of 100,000 terms or more. This variable is used to
keep the number of grams in memory to this size. The hash of grams does not
grow beyond this number. Ideally, this size should not be set bigger than the
number of grams you have. This parameter alters the behavior of the NGrams
algorithm. |
Variables that can be changed to give specific
behaviors
--strict |
Turn off spelling suggestion when retrieving
candidates. This has the effect of only using the correctly spelled words within
the query phrase to retrieve candidates. This will make retrieval faster, but
will limit accuracy when misspelled words are within the query. |
--truncate |
This variable limits the number of suggestions
returned for a given input. |
--maxEditDistance |
This variable limits suggestions to those which
have an edit distance equal to or less than this parameter. |
--wordLengthHeuristic |
This heuristic retrieves candidate terms that are
+/- 4 characters in length larger or smaller than the query term. For example,
if you put in a term that is two characters long, the longest term retrieved
would be 6 characters. This heuristic is turned on by default.
This heuristic only affects the behavior of the NGram algorithm. |
public BagOWordsPlusIndex(String[]
argv) throws GSpellException
public BagOWordsPlusIndex(GlobalBehavior
pSettings) throws GSpellException
public
void index( ) throws GSpellException
public
void index( String pTerm ) throws GSpellException
3.2.5.2 retrieve
suggestions for phrases
public
BagOWordsPlus(String[] argv) throws Exception
public BagOWordsPlusIndex(String
pDictionaryName) throws GSpellException
public
String[] find( String pTerm ) throws Exception
public
Candidate[] get( String pTerm ) throws Exception
3.2.5.3 methods from the Candidate
class.
public String getName()
public
double getEditDistance()
public
double getEditDistance1000()
public
double getRank()
public
String getFromMethod()
public
String getMessage()
public
long getCorpusFrequency()
public
long getMaxFreq()
public
String toString()
3.2.5.4 Constructors for Apps that
cannot set the classpath
For those
applications that cannot set the classpath, use the the following constructor:
BagOWordsPlus(java.lang.String
pConfigPath,
java.lang.String
pDictionaryName,
int
pMode)
The pConfigPath parameter should specify the full path to the
___GSPELL_ROOT/config path.
The dictionaryName should reflect the name of a directory
from the ___GSPELL_ROOT/dictionaries/dictionaryName directory, or it should be
the full path to the dictionary to be opened.
This constructor is used for those applications that cannot
set the classpath, such as java servlet apps.
Unjar the
selected dictionaries in the dictionaries directory.
Description |
DictionaryName |
Name of the Jar file |
The 2004 Lexicon indexes |
2004LexiconBagOWords |
|
The 2003 Lexicon indexes |
2003LexiconBagOWords |
|
The 2004AA Metathesaurus indexes |
meta2004AABagOWords |
|
The 2003AC Metathesuarus indexes |
meta2003ACBagOWords |
|
The 2003AB Metathesuarus indexes |
meta2003ABBagOWords |
|
The 2003AA Metathesuarus indexes |
meta2003AABagOWords |
|
The 2002AD Metathesuarus indexes |
meta2002ADBagOWords |
|
The 2002AC Metathesuarus indexes |
meta2002ACBagOWords |
|
The 2002AB Metathesuarus indexes |
meta2002ABBagOWords |
|
The 2002AA Metathesuarus indexes |
meta2002AABagOWords |
|
The
2002AA-2004AB Metathesaurus indexes |
2002AA-2004ABBagOWords |
|
The
2002AA-2004AC Metathesaurus indexes |
2002AA-2004ACBagOWords |
NGrams is a character based n-gram spelling suggestion tool.
By default, it uses bi-grams. It indexes a set of words or terms, and provides
a retrieval mechanism to retrieve close suggestions to input terms. Java API's
are provided for developers to embed this application in their applications.
Once a dictionary is created, suggestions can be gotten via
using this tool with the --find option, along with the --dictionaryName=
option.
A dictionary is created using the --index option along with
the --dictionary=NameOfDictionary option. Input terms can be an input file,
specified by the --inputFile= option. Otherwise, the input is expected to be
from standard input. The input is expected to be in the form of a term per new
line separated row. Lines starting with # are taken to be comments are are
ignored.
Dictionary directories containing the indexes are placed in
the in the $DictionaryDir. $DictionaryDir is specified in the
$GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overuled by the --
dictionaryDir=aFullPath command line option.
The $GSPELL_ROOT directory is the base directory where
GSpell was installed.
NGrams.[sh|bat]
or java
-cp
"$TOP/gspell/lib/gspellProject.jar;
$TOP/gspell/config" -ms100m
-mx600m
gov.nih.nlm.nls.gspell.NGrams
|
--index|--update|--find|--export
[--dictionaryDir=someFullPath]
--dictionaryName=NameOfDictionary
[--inputFile=YYYYY] [--outputFile=ZZZZZ]
[--fieldedText] [--termField=X]
[--truncate=N] [--considerNCandidates=N]
[--maxEditDistance=N] [--correctField=Y]
[--wordLengthHeuristic=true|false]
[--reportTime] [--help] |
--index |
Create
a new dictionary. Take input from either the file specified by the
--inputFile parameter, or from standard input. Input terms are expected to be
of the format, one term per new line separated line. Lines starting with #
are taken to be comments are are ignored. Existing dictionaries of the same
name will be overwritten. A
note about memory. This application requires morethan the standard amount of
memory when indexing. The total amount of memory used during indexing is
reported with the --reportTime option. It has been observed that 128MB
was needed to load a set of 300,000 terms. Please use the java options -MX
and -MS to run with the appropriate amount of memory when indexing. |
--update |
Add terms to an existing dictionary. This option is
similar to --index option, except that if the dictionary exists, terms are
added to it, rather than having the dictionary overwritten. Take input from
either the file specified by the --inputFile parameter, or from
standard input. Input terms are expected to be of the format, one term per
new line separated line. Lines starting with # are taken to be comments are
are ignored. |
--find |
Retrieve close suggestions to input terms. Input is
either taken from the standard input [the default], or from a file specified
via the --inputFile. By default the input is to be in the format of one term
per newline delimited line. However, the application can take fielded input
(see the --fieldedText and -- termField options). The output returned contains the input plus four
fields. The first of these fields is the suggestion, the second is a ranking,
the third field is the method that returned this suggestion, and the last
field, will indicate whether or not a term was found, or was found to be
correct (in a case insensitive way). |
--export |
Spew out to the contents of the indexes to standard
output, or to the file specified by the -- outputFile option. Caution: this
option spews out lots of data. This option is useful really only for
debugging purposes. This option may disappear in future releases. |
--dictionaryName= |
The name of the dictionary to be created or to be
referenced. Since a directory is created based on this name, it should be named
compliant with the naming rules for directories. Each dictionary is
placed in the dictionaries directory. The dictionaries directory is specified
by the --dictionaryDir= path found in the
$GSPELL_ROOT/config/gspellRegistry.cfg file. The $GSPELL_ROOT directory is
the base directory where GSpell was installed. |
--dictionaryDir= |
Dictionary directories containing the indexes are
placed in the in the $DictionaryDir. $DictionaryDir is specified in the
$GSPELL_ROOT/config/gspellRegistry.cfg file, but can be overuled by the
--dictionaryDir=aFullPath command line option. The $GSPELL_ROOT directory is
the base directory where GSpell was installed. A note to those who would like to have dictionaries
placed in a location other than $GSPELL_ROOT/dictionaries: specify the
--dictionaryDir= and --dictionaryName= commandline
options.
|
--inputFile= |
Specify the (full path) name of an input file. The
semantics of this file refer to being the contents of what is to be index
when this option is used with the --index and --update options. The contents
of this file is taken as input for retrieval when the --find option is used. |
--outputFile= |
Specify the name of the output file. |
--fieldedText |
Indicate that the input is Pipe delimited input,
where one of the fields contains the significant term, and the rest of the
terms are passed through. This option must be used with the --termField
option to indicate the field to retrieve. This option is only relevant with
the --find option. |
--termField=X |
Indicate the field to pick up the term to be looked
up. The first field is taken to be 1, not zero. |
--truncate=N |
Return the top N suggestions. |
--considerNCandidates=N |
Limit the total number of candidates to be
evaluated to N. This parameter has a direct relation to both the speed and
accuracy of the application. The more candidates evaluated, the better the
suggestions, but the slower the response is. |
--maxEditDistance=N |
Limit the returned suggestions to those who have
edit distances that are less than or equal to N. |
--correctField=Y |
This option is for debugging and benchmarking
purposes. When present, it indicates the field that contains the correct
spelling. Further, benchmark statistics are computed when this option is
present. Those statistics include the number of responses where the correct
term was found in the first, the first five, the first ten, the first hundred
suggestions along with how many responses completely missed the mark. The
output stats file is labeled dictionaryName.stats and is placed in the directory
where the application is kicked off. |
--reportTime |
This option reports status messages to standard
error during processing. It is useful when indexing a large set of terms to
determine how long it takes to process 10,000 terms, and how much memory was
needed. |
--wordLengthHeurstic= |
Limit the candidates to be evaluated to those which
are with +/-4 characters of the input term. The nGram index is segmented into
bins by the size of input words. When this heuristic is turned on, the number of
candidates is effectively pruned to not even consider words that are known to
be outside reasonable edit distances already. This allows additional
candidates that might not have been able to be evaluated to be evaluated.
This heuristic has two beneficial aspects to it. It speeds up suggestions,
and it suggests additional close neighbors. This heuristic does increase the size of the
resulting index by about 50% over an index that did not use this heuristic.
Making a non- wordlengthHeurstic index is not an option. This heuristic
prunes multi word neighbors that may share a word in common. The size of target words range from 1 to over 100
characters, and are distributed in a Zipf like distribution (at least for our
dictionary), where there are a few 1 character words, some more two character
words, a fair amount of three character words, and a huge number of words
between four characters and ten characters. The numbers start trailing off
slowly after that. There is a precipitous drop off in the number of words
that have more than 20 characters. This option is turned on by default. |
--version --CVSVersion --history |
Retrieve
the verion of this build Retrieve
the CVS tagged version of this build Retreive
the history notes of this build. |
--help |
Help |
NGram's output for an input term contains the following
fields:
Input term |
Suggestion |
Edit Distance |
Rank from 0 to 1 [1] |
Method used |
Other messages |
Frequency from corpus [2] |
Weighted rank [3] |
Here's an example of the output for the misspelled term anonomous.
anonomous
anonomous|anonymous|1.0|0.87|NGrams||0|2000000
anonomous|allonomous|2.0|0.58|NGrams||0|3000000
anonomous|autonomous|2.0|0.58|NGrams||0|3000000
anonomous|axonopus|3.0|0.3|NGrams||0|4000000
anonomous|analogous|3.0|0.3|NGrams||0|4000000
anonomous|anomalous|3.0|0.3|NGrams||0|4000000
anonomous|anonymously|3.0|0.3|NGrams||0|4000000
anonomous|annosus|3.0|0.3|NGrams||0|4000000
anonomous|anadromous|3.0|0.3|NGrams||0|4000000
anonomous|anthonomus|3.0|0.3|NGrams||0|4000000
Here is an example for the correctly spelled word sleep
sleep
sleep|sleep|0.0|1.0|NGrams|Correct|0|1000000
sleep|sheep|1.0|0.87|NGrams||0|2000000
sleep|sweep|1.0|0.87|NGrams||0|2000000
sleep|steep|1.0|0.87|NGrams||0|2000000
sleep|sleepy|1.0|0.87|NGrams||0|2000000
sleep|sleeps|1.0|0.87|NGrams||0|2000000
sleep|spleep|1.0|0.87|NGrams||0|2000000
sleep|seep|1.0|0.87|NGrams||0|2000000
sleep|sweeps|2.0|0.58|NGrams||0|3000000
sleep|seeps|2.0|0.58|NGrams||0|3000000
[1] The
rank is the normalized ranking with a value between 0 and 1 where 1 is an exact
match and 0 is no match at all. The distribution for this normalized score is
intended to have edit distances of .5 rank about the .90. The distribution
curve is half the bell curve, and intended to allow developers relying on edit
distance as the sorting metric to have a metric that ranks words beyond 2 edit
distances as quickly getting worse as the curve extends out.
[2] The
corpus frequency is taken from the 1999 Medline word frequencies. (This is only
used with the GSpell application)
[3] The
weighted rank is currently the following:
[ ( the edit distance * the method number) * 1000 )
] + [ MAX FREQ - the frequency of the word found in the
corpus]
Where
·
The
edit distance is the Levenshtein edit distance
·
The
frequency of the word found is 0 for NGrams
·
The
MAX FREQ currently is a constant == 1,000,000. This number was intended
to be the number of occurrences of words in the corpus, but this proved hard to
compute and pass on to the ranking method. A suitably large number was chosen
as a surrogate.
·
The
method numbers were intended to weight homophones and corrections to common
misspellings higher than other methods. The method number for NGrams = 1.
There are tuning parameters that are specific to each
dictionary created. These parameters are stored in a configuration file in the
dictionary directory.
Performance tuning variables
--considerNCandidates |
The number of candidates that can be evaluated for
any given input. The higher the number the better the suggestions are, but
the longer it takes to do. A reasonable rate on a sun is about 2000. This
parameter alters the behavior of the NGrams algorithm. |
--cacheSize |
This is the number of grams to keep in memory
during processing. The more grams in memory, the faster the response is. This
makes a real difference for indexing, where if you can keep all the grams in
memory, the faster indexing will go. The downside is more memory is used. For
indexing, this may mean bumping up the memory to something along the lines of
100mb for dictionaries of 100,000 terms or more. This variable is used to
keep the number of grams in memory to this size. The hash of grams does not
grow beyond this number. Ideally, this size should not be set bigger than the
number of grams you have. This parameter alters the behavior of the NGrams
algorithm. |
Variables that can be changed to give specific
behaviors
--truncate |
This variable limits the number of suggestions returned
for a given input. |
--maxEditDistance |
This variable limits suggestions to those which
have an edit distance equal to or less than this parameter. |
--wordLengthHeuristic |
This heuristic retrieves candidate terms that are
+/- 4 characters in length larger or smaller than the query term. For
example, if you put in a term that is two characters long, the longest term
retrieved would be 6 characters. This heuristic is turned on by
default. This heuristic only affects the behavior of the NGram algorithm. |
GSpell
Package (contains the
NGrams methods)
API builders should view the gspell package. It should be
possible to embed this package into other applications using the following
constructors and methods for the class GSpell.
Gspell instances can be embedded into single threaded
applications with no problem. It can be embedded into threaded applications
with the following advice:
Embedding read-only instances of GSpell objects is fine as
long as there are no read/write instances present.
It is recommended that in cases where there are updates or
read/writes, this object gets implemented as a singleton instance per
dictionary, that is shared across threads and processes.
There is NO writer lock and
unlock around the update/index method, nor around the cleanup or flush methods
to insure that only one writer is alive at any given time, and that there are
no readers alive. It is up to the application developer to insure that access
is controlled around GSpell Object instances that do writes or updates.
3.3.5.1 Create or update an index
public NGrams(GlobalBehavior
pSettings) throws Exception
public NGrams(String
args[]) throws Exception
public
int index( BufferedReader pFStream ) throws Exception
public
void index( String pTerm ) throws Exception
public
int update( BufferedReader pFStream ) throws Exception
public
void update( String pTerm ) throws Exception
3.3.5.2 Retrieve suggestions for terms
public GSpell(Options
pOptions) throws Exception
public
GSpell(String pDictionaryName, int pMode) throws Exception
public
int
find( BufferedReader pFStream, PrintWriter pPrintStream ) throws
Exception
public
Candidate[]
find( String pTerm ) throws Exception
public
Candidate[] find( FindOptions fOptions, String pTerm ) throws Exception
3.3.5.3 Methods from the Candidate
class.
public String getName()
public
double getEditDistance()
public
double getEditDistance1000()
public
double getRank()
public
String getFromMethod()
public
String getMessage()
public
long getCorpusFrequency()
public
long getMaxFreq()
public
String toString()
3.3.5.4 GSpell GlobalBehavior and Find
Options
public
GlobalBehavior( String pApplicationName, String pRegistryFile, String args[])
public
FindOptions (GlobalBehavior pSettings)
Examples that illustrate embedding this in an application
are provided in the examples directory.
·
The
gspell Update was not saving all entries - This used to work. It broke when I
tried to deal with utf issues. I was allocating two extra bytes per ngram name thinking that utf characters
would be wider. A bad thought. When an update came in, the stuff in those extra bytes might not be
the stuff in in the bytes during the previous proccess'es index.
So, sometimes the gram names were not being matched as a consequence.
·
Added a
--unicode flag to flag input and output files in the UTF-8 format.
·
The
gspell and bagOwordsPlus index, now automatically create the common misspelling
and homophone indexes for the dictionary.
·
Switched
the indexing from Russell Loane's tools to the
Added
(back) in a constructor for those who work in a classpathless environment to
pass in the full path to the config file. This is primarily for those embedding
this in java servlet applications where adding classpaths is difficult.
Updated the
distribution so that it now contains all the sources needed to be compliled
·
A
more robust build, with test hooks, and baseline
·
JSP
inteface added.
·
Noticed
that the bagOWordsPlus index was overwriting the gspell index.
·
Made
these two dictionaries. They cannot share the same space.
·
A
more mature versioning added
This
version contains the following new features
·
Homophones
·
Medline
frequencies to reorder candidates
·
Java
NLS Repository style config files and settings
·
Update
function restored
·
Deprecated
rather deleted some of the older public methods
This
version includes Russel Loane's store methods. This version was released Nov.
2003.
Contact us at: umlslex@nlm.nih.gov
Updated:
Lexical Systems Group
Cognitive
Science Branch
Lister Hill National Center for Biomedical
Communications
National Library of Medicine
National Institutes of Health
Department of Health & Human Services
Copyright - Privacy - Accessibility