Problems of Tokenization & Punctuation

III. Problems for Tokenization with Punctuation

Tokenization and punctuation handling are simple functions on their own and cause no problems. However, most applications use tokenization as the front end of their algorithm. Typically, text-handling applications perform three steps:

  1. Apply tokenization to split an input string into a series of tokens.
  2. Perform modifications on the tokens, such as stripping certain tokens.
  3. Compose the modified tokens back into a string as the output (reverse tokenization).
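
The three steps above can be sketched as follows. This is only a minimal illustration in Python, not the implementation of any particular tool: the function names and the stopword list are hypothetical, and tokenization here splits on spaces only, which sidesteps the punctuation questions discussed in this section.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "in", "on"}


def tokenize(text):
    """Step 1: split the input string into tokens (on spaces only)."""
    return text.split()


def strip_stopwords(tokens):
    """Step 2: modify the tokens, here by dropping stopwords."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def compose(tokens):
    """Step 3: join the modified tokens back into a string."""
    return " ".join(tokens)


def process(text):
    """Run the three steps in order."""
    return compose(strip_stopwords(tokenize(text)))


print(process("This is a book"))  # This is book
```

As long as the input contains no punctuation, step 3 is a plain join. The difficulties start once punctuation has to be classified in step 1 and restored in step 3.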

In such text-handling applications, punctuation may cause many problems, as described below:

  1. Which punctuation is a delimiter?

    Space " " is definitely a delimiter. Some punctuation marks are delimiters, such as ",", ".", "(", and ")", while others are not, such as "@" and "-".

    < For example >

    * This is a book.
    => space and "." are delimiters

    * Sports (tennis, baseball, and basketball) are professional sports.
    => space, "(", ")", ",", and "." are delimiters

    * My E-mail address is: lu@nlm.nih.gov
    => space and ":" are delimiters, while "-" and "@" are not delimiters

  2. Which punctuation should be stripped or kept when composing (reverse tokenization)?

    If a punctuation mark is considered a delimiter during tokenization, it may be either stripped or kept during reverse tokenization.

    < For example >

    * Strip stopwords on (in, the, top, left)
    => Strip stopwords (, , top, left)
    => Strip stopwords (top, left)
    ------------------------------------------------
    => Space, "(", ",", and ")" are delimiters
    => Spaces are always kept (and trimmed) during reverse tokenization
    => "(", "," and ")" are also kept during reverse tokenization
    => Sometimes "," is stripped during reverse tokenization, as in the second result above
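
The two questions above can be made concrete with a small Python sketch. Everything here is an assumption made for illustration: the delimiter set, the stopword list, and the helper names are hypothetical, and the comma rule is only one possible policy. The tokenizer keeps each delimiter as its own token so that reverse tokenization can decide what to keep. Note that `drop_empty_commas` only handles empty slots at the start or in the middle of a list, not a trailing one.

```python
import re

# Assumed delimiter set: space plus , . ( ) :
# "@" and "-" are deliberately left out.
DEFAULT_DELIMITERS = " ,.():"
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"on", "in", "the"}


def tokenize(text, delimiters=DEFAULT_DELIMITERS):
    """Split on the delimiters, keeping each delimiter as its own token."""
    pattern = "([" + re.escape(delimiters) + "])"
    return [t for t in re.split(pattern, text) if t]


def strip_stopwords(tokens):
    """Drop stopword tokens; delimiter tokens are left untouched."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def drop_empty_commas(tokens):
    """Drop each "," (and the space after it) whose list slot holds no word."""
    out = []
    slot_has_word = False  # has a word appeared since the last "(" or ","?
    skip_space = False     # drop the space that followed a dropped comma
    for tok in tokens:
        if tok == " " and skip_space:
            skip_space = False
            continue
        skip_space = False
        if tok == ",":
            if not slot_has_word:
                skip_space = True
                continue
            slot_has_word = False
        elif tok == "(":
            slot_has_word = False
        elif tok != " ":
            slot_has_word = True
        out.append(tok)
    return out


def compose(tokens):
    """Reverse tokenization: join tokens, collapse runs of spaces, trim."""
    return re.sub(r" {2,}", " ", "".join(tokens)).strip()


tokens = strip_stopwords(tokenize("Strip stopwords on (in, the, top, left)"))
print(compose(tokens))                     # Strip stopwords (, , top, left)
print(compose(drop_empty_commas(tokens)))  # Strip stopwords (top, left)

# With "." in the delimiter set, the e-mail address is split apart;
# with only space and ":" as delimiters it stays whole.
print(tokenize("lu@nlm.nih.gov"))                   # ['lu@nlm', '.', 'nih', '.', 'gov']
print(tokenize("lu@nlm.nih.gov", delimiters=" :"))  # ['lu@nlm.nih.gov']
```

The last two calls show why a single fixed delimiter set is hard to choose: "." works as a delimiter at the end of a sentence but not inside an e-mail address.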