SPECIALIST dTagger
1. Introduction
The dTagger is a Part of Speech (POS) tagger.  A POS tagger assigns an unambiguous part of speech, such as noun, adjective, or adverb, to the words or terms within a text.  Such tags are a necessary component in determining phrase boundaries and head assignment, as commonly done within noun phrase extractors.  Taggers in general, and this tagger in particular, can tag only after some training.  The sources to train from include text where the parts of speech have already been assigned (an annotated corpus), a lexicon of words and their potential parts of speech, and, optionally, a large amount of plain text from the genre you plan to use the tagger on.  The dTagger is distributed with a model trained on the MedPost Corpus, a corpus of Medline abstracts in the genomics field hand-annotated with parts of speech. This corpus is also redistributed with the dTagger. The dTagger includes the SPECIALIST Lexicon as well.
Figure 1: An Abstract Highlighted with Parts of Speech
2. Motivation
Even though there are several publicly available POS taggers, we had needs that motivated us to write our own.  We wanted a tagger that worked specifically with the SPECIALIST Lexicon and natively used the tag set used within it.  That said, we wanted a tagger where the tag set was not hard-coded, so that other tag sets could be used.  We wanted a tagger that included the trainer and could be trained on untagged text.  We wanted a tagger that tokenized text into single words but, more importantly, could tokenize text into multi-word terms, the same granularity as that of the SPECIALIST Lexicon.  The SPECIALIST TextTools already include this kind of tokenization. We would also like this tagger to be flexible enough to be adapted to different languages.
3. Download and Install
Download the package from here:
Package Name (240 MB)
    • Prerequisites
      • Java or greater
      • 2 GB of hard disk space
      • 300 MB of memory; the more the better.
    • Installation Instructions
      • Un-jar dtaggerDist.jar into the location where you want to install the nls projects. When unarchived, an nls/dtagger directory will exist.
        > jar xvf dtaggerDist.jar
      • Change directories to the nls/dtagger directory
        > cd nls/dtagger
      • Invoke the install.[bat|sh]
The install will create the following scripts in the nls/dtagger/bin directory. These are the scripts to kick off each of the applications:
    • Optional Post Installation Actions
      • Add the nls/dtagger/bin directory to the $PATH environment variable. This will enable these programs to be run from any directory.
      • Add nls/dtagger/lib/dtaggerProject.jar and nls/dtagger/config to the $CLASSPATH environment variable. This will enable applications that have these tools embedded in them to find the classes and data.
4. References
Notes on the Hidden Markov Model used within dTagger:
Browne AC, McCray AT, Srinivasan S. The SPECIALIST Lexicon. Technical Report, 6/2000.
Smith L, Rindflesch T, Wilbur WJ. MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics. 2004 Sep 22;20(14):2320-1.
Manning CD, Schütze H. Foundations of Statistical Natural Language Processing. Massachusetts Institute of Technology, 2003. Chapter 10.
Cutting D, Kupiec J, Pedersen J, Sibun P. A Practical Part-of-Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, 1992.
5. Hidden Markov Model
See related topics and documents
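As a rough illustration of how a Hidden Markov Model tagger picks one tag per word, the sketch below runs Viterbi decoding over a toy model. The class name ToyViterbi, the two-tag set, and every probability in it are invented for this example; they are assumptions and do not reflect dTagger's actual model, tag set, or API.

```java
import java.util.*;

// Toy first-order HMM decoder. All names and numbers are hypothetical.
class ToyViterbi {
    static final String[] TAGS = {"noun", "verb"};
    // START[j]: probability that a sentence starts with TAGS[j]
    static final double[] START = {0.7, 0.3};
    // TRANS[i][j]: probability that TAGS[j] follows TAGS[i]
    static final double[][] TRANS = {{0.4, 0.6}, {0.8, 0.2}};
    // EMIT.get(word)[j]: probability of the word given TAGS[j]
    static final Map<String, double[]> EMIT = new HashMap<>();
    // Fallback used for words that are missing from the toy lexicon
    static final double[] UNKNOWN = {0.5, 0.5};
    static {
        EMIT.put("patients", new double[]{0.9, 0.1});
        EMIT.put("report", new double[]{0.4, 0.6});
        EMIT.put("pain", new double[]{0.8, 0.2});
    }

    static double[] emit(String word) {
        return EMIT.getOrDefault(word.toLowerCase(), UNKNOWN);
    }

    // Returns the most probable tag sequence for the given words.
    static List<String> tag(String[] words) {
        int n = words.length, k = TAGS.length;
        if (n == 0) return new ArrayList<>();
        double[][] score = new double[n][k]; // best log probability ending in tag j at position t
        int[][] back = new int[n][k];        // previous tag on that best path
        for (int j = 0; j < k; j++) {
            score[0][j] = Math.log(START[j]) + Math.log(emit(words[0])[j]);
        }
        for (int t = 1; t < n; t++) {
            double[] e = emit(words[t]);
            for (int j = 0; j < k; j++) {
                int best = 0;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < k; i++) {
                    double s = score[t - 1][i] + Math.log(TRANS[i][j]);
                    if (s > bestScore) { bestScore = s; best = i; }
                }
                score[t][j] = bestScore + Math.log(e[j]);
                back[t][j] = best;
            }
        }
        // Pick the best final tag, then follow the back pointers.
        int last = 0;
        for (int j = 1; j < k; j++) {
            if (score[n - 1][j] > score[n - 1][last]) last = j;
        }
        String[] out = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            out[t] = TAGS[last];
            last = back[t][last];
        }
        return Arrays.asList(out);
    }

    public static void main(String[] args) {
        // prints [noun, verb, noun]
        System.out.println(tag(new String[]{"patients", "report", "pain"}));
    }
}
```

With these toy numbers, "report" is tagged as a verb because the noun-to-verb transition and the verb emission together outweigh the noun reading. Working in log probabilities avoids numeric underflow on longer sentences.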
6. Tagger Components
The dTagger project includes a tagger and three kinds of training:
  • Training when you have annotated text,
  • A tagger that uses a model created by some prior training,
  • Updating a previously trained model using untagged text,
  • Training when all you have is untagged text.
There are some additional components:
  • A tag set
  • A tool to convert the SPECIALIST Lexicon's LRAGR table to a .lex file

All the tagger components assume a lexicon filled with tags. When training, it is best to build a model using an annotated set of documents or corpus; the more annotated text, the better.  We realize that building an annotated corpus is a large task in and of itself.  If there is not an abundance of tagged text to train on, it is fruitful to create an initial model with a small amount of tagged text, then update the model by running the update with a large amount of untagged text. If it is impossible to find even a few annotated sentences to train on, the ability to train using just untagged text exists.
Train With Annotated Text
See related topics and documents
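One common way to train from annotated text is to count tag-to-tag transitions and tag-to-word emissions, then turn the counts into relative frequencies. The sketch below shows that idea only. The class name ToyTrainer, the word/tag input format, and the "<s>" sentence-start marker are assumptions made for this example, not the dTagger trainer's actual interface or file format.

```java
import java.util.*;

// Hypothetical sketch: relative-frequency estimates from tagged sentences.
class ToyTrainer {
    // transitions.get(prevTag).get(tag) = times tag followed prevTag
    final Map<String, Map<String, Integer>> transitions = new HashMap<>();
    // emissions.get(tag).get(word) = times word was seen with tag
    final Map<String, Map<String, Integer>> emissions = new HashMap<>();

    // Each token is written word/tag; this input format is an assumption.
    void addSentence(String tagged) {
        String prev = "<s>"; // hypothetical sentence-start marker
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) continue; // skip tokens without a word/tag pair
            String word = token.substring(0, slash).toLowerCase();
            String tag = token.substring(slash + 1);
            bump(transitions, prev, tag);
            bump(emissions, tag, word);
            prev = tag;
        }
    }

    static void bump(Map<String, Map<String, Integer>> counts, String a, String b) {
        counts.computeIfAbsent(a, x -> new HashMap<>()).merge(b, 1, Integer::sum);
    }

    // Relative frequency of b given a; 0.0 if a was never observed.
    static double prob(Map<String, Map<String, Integer>> counts, String a, String b) {
        Map<String, Integer> row = counts.get(a);
        if (row == null) return 0.0;
        int total = 0;
        for (int c : row.values()) total += c;
        return row.getOrDefault(b, 0) / (double) total;
    }

    public static void main(String[] args) {
        ToyTrainer trainer = new ToyTrainer();
        trainer.addSentence("patients/noun report/verb pain/noun");
        trainer.addSentence("pain/noun persists/verb");
        System.out.println(prob(trainer.emissions, "noun", "pain"));
        System.out.println(prob(trainer.transitions, "noun", "verb"));
    }
}
```

Unsmoothed relative frequencies assign zero probability to anything unseen in the annotated text, so a real trainer would need some smoothing and a lexicon to constrain the candidate tags.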
Update with Untagged Text
See related topics and documents
Train with Untagged Text
See related topics and documents
Utility to convert LRAGR to a .lex file
See related topics and documents
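The core of such a conversion is collecting, for each inflected form, every syntactic category the Lexicon lists for it. The sketch below assumes that LRAGR rows are pipe-delimited with the inflected form in the second field and the category in the third; verify that layout against the SPECIALIST Lexicon documentation. The class name ToyLragrReader and the sample rows (including their entry identifiers) are invented, and the layout of the .lex file is deliberately not shown because it is not described here.

```java
import java.util.*;

// Hypothetical sketch: group syntactic categories by inflected form.
class ToyLragrReader {
    // Assumed row layout: id|inflected form|category|agreement|citation form|base form|
    static Map<String, Set<String>> tagsByWord(List<String> lragrLines) {
        Map<String, Set<String>> lexicon = new TreeMap<>();
        for (String line : lragrLines) {
            String[] fields = line.split("\\|");
            if (fields.length < 3) continue; // skip malformed rows
            String word = fields[1].toLowerCase();
            String category = fields[2];
            lexicon.computeIfAbsent(word, w -> new TreeSet<>()).add(category);
        }
        return lexicon;
    }

    public static void main(String[] args) {
        // Invented sample rows, not real Lexicon entries.
        List<String> rows = Arrays.asList(
            "E0000000|report|noun|count(thr_sing)|report|report|",
            "E0000001|report|verb|infinitive|report|report|",
            "E0000002|pain|noun|count(thr_sing)|pain|pain|");
        System.out.println(tagsByWord(rows));
    }
}
```

Words that map to more than one category, such as "report" in the sample rows, are exactly the ones the tagger has to disambiguate.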
Utility to create adjs from Verbs
See related topics and documents
Morphology Discovery
See related topics and documents
7. Team Members
This was a collaborative effort initially involving Destinee Nace, who provided the background material needed to comprehend tagger technologies.
Russell Loane provided the first two iterations of the Hidden Markov Model code used, and provides much-needed support in the way of brainstorming, figuring out anomalies, and the like.
Allen Browne has provided the linguistic expertise needed.
Guy Divita took Russell's ideas and code, merged them into a form that is compatible with the TextTools, and (hopefully) made improvements and additional contributions to them.
8. About the Name
The dTagger started out as a collaboration with Destinee Nace.  It was initially named "The Tagger Destinee", but this became just too much to write when referring to it, and consequently got shortened to dTagger.
9. Version/History
Version 0.0.1
     Released November 11, 2006