Word Tokenizer Algorithm (Java)
Word Tokenizer is used to tokenize and filter out words and characters in the TI (title) and AB (abstract) fields of citations. The algorithm used in the Java version is slightly different from the one in the Lisp version. Please see the TI report and AB report for details.
The procedures and criteria are as follows:
- Remove matched string (case sensitive)
Beginning string | Ending string | References
---|---|---
[correction | ] | ?
(abstracts were not | included) | 0289-10695616
[J. Neuroimmunol. 104, | 85-91] | ?
(abstracts presented at recent scientific meetings | package inserts) | 0306-11261533
(Japanese Association of Intellectual Copyright | #130,591) | 0306-11276498
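
The case-sensitive span removal can be sketched roughly as follows. This is only an illustration, not the tool's actual code: the class and method names are hypothetical, only two of the listed pairs are included, and it assumes that everything from the beginning string through the ending string (inclusive) is removed.

```java
/** Hypothetical sketch; names are illustrative and not taken from the Word Tokenizer source. */
final class SpanRemoverSketch {
    // Two of the beginning/ending pairs from the table (subset chosen for brevity).
    static final String[][] RULES = {
        {"(abstracts were not", "included)"},
        {"[J. Neuroimmunol. 104,", "85-91]"},
    };

    /** Removes the first span per rule, from the beginning string through the ending string. */
    static String removeSpans(String text) {
        String result = text;
        for (String[] rule : RULES) {
            String begin = rule[0];
            String end = rule[1];
            // indexOf is case sensitive, which matches the rule's stated behavior.
            int start = result.indexOf(begin);
            if (start < 0) continue;
            // The ending string must appear after the beginning string.
            int stop = result.indexOf(end, start + begin.length());
            if (stop < 0) continue;
            // Assumption: the whole span, markers included, is dropped.
            result = result.substring(0, start) + result.substring(stop + end.length());
        }
        return result;
    }
}
```

Under these assumptions, `SpanRemoverSketch.removeSpans` leaves text untouched when the beginning string differs only in letter case, since no case folding is applied.
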
- remove matched ending string (case sensitive)
Beginning string | References
---|---
CopyrightCopyright | ?
Copyright Copyright | ?
Copyright | ?
.Copyright | ?
)Copyright | ?
(abstract | ?
(ABSTRACT | ?
? Copyright | ?
) Copyright | ?
Copyright 2001 Wiley-Liss, Inc. | 0310-11391771
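
A minimal sketch of this trailing-string rule is shown below. Several points are assumptions rather than documented behavior: the text from the last occurrence of a marker to the end of the field is dropped, markers are tried in the order given in the array (longer variants first so that a preceding `.` or `)` is removed too), and only a subset of the listed markers is used. The class and method names are hypothetical.

```java
/** Hypothetical sketch; names and marker order are assumptions, not the tool's actual code. */
final class TrailingNoticeSketch {
    // Subset of the markers from the table, longer variants first.
    static final String[] MARKERS = {
        ".Copyright", ")Copyright", "(abstract", "(ABSTRACT", "Copyright",
    };

    /** Cuts the text at the last occurrence of the first marker that is found (case sensitive). */
    static String removeTrailingNotice(String text) {
        for (String marker : MARKERS) {
            int idx = text.lastIndexOf(marker); // case-sensitive search
            if (idx >= 0) {
                return text.substring(0, idx);
            }
        }
        return text; // no marker present
    }
}
```

With this ordering, a field ending in `.Copyright 2001 Wiley-Liss, Inc.` loses the period as well, while lowercase `copyright` inside a sentence is left alone.
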
- remove matched ending string (case insensitive)
Beginning string | Ending string | Exceptions | References
---|---|---|---
[ | ] | [ | ?
[ | .] | [These syndromes can be a contributory | 0408-10199143
[published erratum | ] | None | ?
[forensic science international | ] | None | ?
(abstract truncated | ) | None | ?
(published erratum | ) | None | ?
(comments | )] | None | ?
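
One possible reading of this rule is sketched below. The exact semantics of the Exceptions column are not spelled out here, so the sketch assumes that a rule is skipped when the whole text starts with the exception string, and that otherwise the text from the last occurrence of the beginning string up to the matched ending is removed. All names are hypothetical.

```java
/** Hypothetical sketch of one interpretation; not the tool's actual code. */
final class EndingSpanSketch {
    /** Case-insensitive search backwards from position {@code from}; returns -1 if absent. */
    static int lastIndexOfIgnoreCase(String text, String marker, int from) {
        for (int i = Math.min(from, text.length() - marker.length()); i >= 0; i--) {
            if (text.regionMatches(true, i, marker, 0, marker.length())) return i;
        }
        return -1;
    }

    /** Removes a trailing begin...end span unless the text starts with the exception (may be null). */
    static String removeEndingSpan(String text, String begin, String end, String exception) {
        int endStart = text.length() - end.length();
        // The text must end with the ending string, ignoring case.
        if (endStart < 0 || !text.regionMatches(true, endStart, end, 0, end.length())) return text;
        // Assumption: an exception applies when the text itself starts with it.
        if (exception != null && text.regionMatches(true, 0, exception, 0, exception.length())) return text;
        int start = lastIndexOfIgnoreCase(text, begin, endStart - begin.length());
        return start < 0 ? text : text.substring(0, start);
    }
}
```

For example, `removeEndingSpan(text, "(published erratum", ")", null)` would strip a trailing erratum note regardless of capitalization, while the second table row keeps an abstract that opens with `[These syndromes can be a contributory`.
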
- remove exact matched ending string (case insensitive)
Match string | References
---|---
[see comments] | ?
(see comments) | ?
[seecomments] | ?
[ see comments] | ?
[in process citation] | ?
(in process citation) | ?
[corrected] | ?
[correction of artistic] | ?
(letter) | ?
(letter)] | ?
(editorial)] | ?
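
The exact-match rule is the simplest of the four and can be sketched as a case-insensitive suffix check. The class and method names are hypothetical, only a few of the listed strings are included, and the sketch assumes that at most one ending is removed and that surrounding whitespace is left as is.

```java
/** Hypothetical sketch; names are illustrative and not taken from the Word Tokenizer source. */
final class ExactEndingSketch {
    // Subset of the match strings from the table.
    static final String[] ENDINGS = {
        "[see comments]", "(see comments)", "[in process citation]", "(letter)",
    };

    /** Removes the first listed ending that the text ends with, ignoring case. */
    static String removeExactEnding(String text) {
        for (String ending : ENDINGS) {
            int start = text.length() - ending.length();
            // regionMatches(true, ...) compares the suffix case-insensitively.
            if (start >= 0 && text.regionMatches(true, start, ending, 0, ending.length())) {
                return text.substring(0, start);
            }
        }
        return text;
    }
}
```

Because only the suffix is checked, a string such as `[see comments]` in the middle of the field is not affected by this sketch.
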
- remove [title]
- remove non-alphanumeric characters from the beginning and end of every word
- expand contractions
- replace punctuation with a space
- remove words with fewer than 3 characters
- remove words that begin with a digit
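
The last five word-level steps can be sketched as a single pass over whitespace-separated words. This is a rough illustration under several assumptions: the contraction table is hypothetical (the original list is not given here), the steps run in the listed order, and "alphanumeric" and "punctuation" are taken to mean the ASCII classes `\p{Alnum}` and `\p{Punct}`. The `[title]` step is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; names and the contraction table are assumptions. */
final class WordFilterSketch {
    // Illustrative contraction table, not the tool's actual list.
    static final Map<String, String> CONTRACTIONS = Map.of(
        "can't", "cannot",
        "don't", "do not",
        "it's", "it is");

    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Strip non-alphanumeric (ASCII) characters from both ends of the word.
            String word = raw.replaceAll("^[^\\p{Alnum}]+|[^\\p{Alnum}]+$", "");
            // Expand a known contraction; the lookup key is lower-cased.
            word = CONTRACTIONS.getOrDefault(word.toLowerCase(Locale.ROOT), word);
            // Replace remaining punctuation with a space, which may split the word.
            for (String part : word.replaceAll("\\p{Punct}", " ").split(" +")) {
                if (part.length() < 3) continue;                 // fewer than 3 characters
                if (Character.isDigit(part.charAt(0))) continue; // begins with a digit
                words.add(part);
            }
        }
        return words;
    }
}
```

Under these assumptions, a hyphenated token such as `5-HT2` is split into `5` and `HT2`, and only `HT2` survives the length and leading-digit filters.
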