About Text Categorization
This version of Text Categorization is implemented entirely in Java and handles UTF-8 text. It includes five command-line tools along with Java APIs.
- Command line tools:
  - mlt (MEDLINE Tokenizer)
  - jdi (Journal Descriptor Indexing)
  - sti (Semantic Type Indexing)
  - stri (Semantic Type Indexing, Real-time)
  - stWsd (ST based WSD)
JDI is used as an automatic indexing method to supplement or substitute for manual indexing. It is also used in several NLM NLP projects to increase accuracy by identifying citations. JDI has been extended to perform Semantic Type (ST) indexing. STI uses JDI as its basis: it ranks each ST by the similarity between the Journal Descriptor (JD) indexing of the target text and the JD indexing of that ST's document. An ST document is the set of UMLS Metathesaurus concepts assigned to an ST.
STI is applied to Word Sense Disambiguation (WSD). If the senses of an ambiguous word are expressed as STs, STI can be run on the context surrounding the word (phrase, sentence, or paragraph), with the expectation that the correct ST for the word will rank higher in the ST indexing of that context than the other candidate STs for the word. The stWsd tool is built on STI.
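The ranking idea described above can be illustrated with a minimal sketch. This is not the package's API: the class `StRankSketch`, its methods, the JD labels, and all scores are hypothetical, and cosine similarity is assumed only because the text does not name the similarity measure. Each JD indexing is modeled as a map from a JD label to a score, and the candidate STs for an ambiguous word are ranked by how similar their ST-document JD vector is to the JD vector of the surrounding context.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; class and method names are hypothetical,
// not part of the Text Categorization package.
class StRankSketch {

    // Cosine similarity between two sparse JD score vectors (JD label -> score).
    // Assumption: the actual similarity measure used by STI may differ.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            double v = e.getValue();
            normA += v * v;
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += v * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Rank candidate STs, best match first, by similarity between the JD
    // vector of the context and the JD vector of each ST document.
    static List<String> rankCandidates(Map<String, Double> contextJd,
                                       Map<String, Map<String, Double>> stDocJd,
                                       List<String> candidateSts) {
        Map<String, Double> scores = new HashMap<>();
        for (String st : candidateSts) {
            Map<String, Double> stVec = stDocJd.get(st);
            scores.put(st, stVec == null ? 0.0 : cosine(contextJd, stVec));
        }
        List<String> ranked = new ArrayList<>(candidateSts);
        ranked.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return ranked;
    }

    // Toy data: the word "cold" in a clinical context, with two candidate STs.
    // All JD labels and scores below are invented for illustration.
    static List<String> demo() {
        Map<String, Double> context = new HashMap<>();
        context.put("pulmonary", 0.8);
        context.put("virology", 0.6);

        Map<String, Double> dsyn = new HashMap<>(); // Disease or Syndrome
        dsyn.put("pulmonary", 0.7);
        dsyn.put("virology", 0.5);
        dsyn.put("climate", 0.1);

        Map<String, Double> npop = new HashMap<>(); // Natural Phenomenon or Process
        npop.put("climate", 0.9);
        npop.put("virology", 0.1);

        Map<String, Map<String, Double>> stDocs = new HashMap<>();
        stDocs.put("dsyn", dsyn);
        stDocs.put("npop", npop);

        List<String> candidates = new ArrayList<>();
        candidates.add("npop");
        candidates.add("dsyn");
        return rankCandidates(context, stDocs, candidates);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [dsyn, npop]
    }
}
```

In this toy case the context shares most of its JD weight with the "dsyn" document, so that ST ranks first and would be chosen as the sense of the ambiguous word. A real run would take its JD vectors from JDI output rather than hand-written maps.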