Applications: Automatic Glossary Indexing

It would be an interesting practice project to apply STMT to automatic glossary indexing (the question you brought up at the end of our last meeting). Here are the main steps for this task:

  1. Collect all terms in your glossary database
    • A term could consist of multiple words and include spaces, punctuation, etc.
    • All these terms are stored in a file (this is the corpus file)
      You might need to normalize the terms in this file if you want to ignore case, punctuation, etc. If so, you will also need to normalize the inTerm in step 2 in the same way.
    • Each term should have an associated URL#Name page
  2. Use STMT to find all terms (included in glossary) in your web pages:
    • Basically, use STMT to find every occurrence of a glossary term (with starting and ending positions) in each of your web pages.
    • Use SubtermApi.FindSubterms(inTerm, corpus)
      • inTerm:
        the text of the page you are working on (you will need to go through all your pages). It needs to be normalized if you normalized your corpus file above.
      • corpus:
        needs to be instantiated from the Corpus class (from STMT). Use the corpus file above as input.
      • Output:
        A collection of all found Subterms. A Subterm is a Java object that holds a term (one that is included in the glossary) with its starting and ending positions in the inTerm.
  3. Final Markup:
    • Write a program to modify the page by adding links (<a href="URL#Name">Term in glossary</a>) automatically, based on the above output.
  4. Overlap Issues:
    • It might get complicated if there are overlaps between glossary terms. For example, if “Term” and “Term in glossary” are both in your corpus file, both terms will be found at overlapping positions. You will need to set your own rules to handle this issue.
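
As a rough illustration of steps 1, 3 and 4, here is a minimal Java sketch. Everything in it is an assumption made for illustration and not part of STMT: the class GlossaryLinker, the Match class (a hypothetical stand-in for STMT's Subterm, assuming an inclusive start and an exclusive end position), the term-to-URL map, and the "longest match wins" overlap rule, which is only one of the rules you could choose. The call to SubtermApi.FindSubterms is left out on purpose; you would copy its results into Match objects yourself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class GlossaryLinker {

    /** Hypothetical stand-in for an STMT Subterm: a glossary term and its
     *  position in the page text (assumed: start inclusive, end exclusive). */
    public static final class Match {
        final String term;
        final int start;
        final int end;
        public Match(String term, int start, int end) {
            this.term = term; this.start = start; this.end = end;
        }
        int length() { return end - start; }
    }

    /** Step 1: one possible normalization. Letters and digits are lowercased
     *  and every other character becomes a space, one for one, so positions
     *  in the normalized text still line up with the original page. */
    public static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            sb.append(Character.isLetterOrDigit(c) ? Character.toLowerCase(c) : ' ');
        }
        return sb.toString();
    }

    /** Step 4: one possible overlap rule. The longest match wins, ties go to
     *  the leftmost match. Returns the surviving matches in page order. */
    public static List<Match> resolveOverlaps(List<Match> found) {
        List<Match> byPriority = new ArrayList<>(found);
        byPriority.sort(Comparator.comparingInt(Match::length).reversed()
                .thenComparingInt(m -> m.start));
        List<Match> kept = new ArrayList<>();
        for (Match candidate : byPriority) {
            boolean overlaps = false;
            for (Match k : kept) {
                if (candidate.start < k.end && k.start < candidate.end) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) kept.add(candidate);
        }
        kept.sort(Comparator.comparingInt(m -> m.start));
        return kept;
    }

    /** Step 3: wrap each surviving match in <a href="URL#Name">...</a>. */
    public static String addLinks(String page, List<Match> found,
                                  Map<String, String> urlByTerm) {
        StringBuilder out = new StringBuilder();
        int cursor = 0;
        for (Match m : resolveOverlaps(found)) {
            String url = urlByTerm.get(m.term);
            if (url == null) continue; // no URL#Name known: leave text unlinked
            out.append(page, cursor, m.start);
            out.append("<a href=\"").append(url).append("\">")
               .append(page, m.start, m.end)
               .append("</a>");
            cursor = m.end;
        }
        out.append(page, cursor, page.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "See Term in glossary here.";
        // In practice these would be built from the STMT output of step 2.
        List<Match> found = Arrays.asList(
                new Match("Term", 4, 8),
                new Match("Term in glossary", 4, 20));
        Map<String, String> urls = Collections.singletonMap(
                "Term in glossary", "glossary.html#term-in-glossary");
        System.out.println(addLinks(page, found, urls));
        // prints: See <a href="glossary.html#term-in-glossary">Term in glossary</a> here.
    }
}
```

Because normalize keeps the text length unchanged, positions found in the normalized page can be applied directly to the original page, so the links wrap the original wording. If your pages are HTML and not plain text, you will also want a rule that skips matches falling inside tags or existing links; the sketch does not handle that.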