Map Symbols & Punctuation to ASCII

  • Introduction:

    Unicode symbols and punctuation are confusing, not only because many of them look alike and are defined more than once (in different Unicode blocks), but also because text editing software automatically changes them during editing and transfer. For example, quotation marks are defined in several Unicode blocks, including Basic Latin and General Punctuation: QUOTATION MARK ( " ), APOSTROPHE ( ' ), GRAVE ACCENT ( ` ), ACUTE ACCENT ( ´ ), LEFT SINGLE QUOTATION MARK ( ‘ ), RIGHT SINGLE QUOTATION MARK ( ’ ), LEFT DOUBLE QUOTATION MARK ( “ ), RIGHT DOUBLE QUOTATION MARK ( ” ), etc. ASCII was designed to support the very restricted typographic style available to typewriter users, so it offers only the straight QUOTATION MARK ( " ) and APOSTROPHE ( ' ) as quotes. In recent years, text editing software has begun changing ASCII (dumb) quotes to smart quotes automatically. “Smart quotes” refers to the automatic replacement of the straight quotes you type (' and ") with the correct typographic quote characters (‘ or ’ and “ or ”). It does not refer to the curved quotes themselves.

    In addition, the X Window System fonts, MS software, and some other software automatically make the following punctuation substitutions:

    • APOSTROPHE ( ' ) and GRAVE ACCENT ( ` ) for APOSTROPHE ( ' )
    • LEFT SINGLE QUOTATION MARK ( ‘ ) and RIGHT SINGLE QUOTATION MARK ( ’ ) for APOSTROPHE ( ' )
    • LEFT DOUBLE QUOTATION MARK ( “ ) and RIGHT DOUBLE QUOTATION MARK ( ” ) for QUOTATION MARK ( " )
    • EN DASH ( – ) for HYPHEN-MINUS ( - )
    • EM DASH ( — ) for double HYPHEN-MINUS ( -- )
    • HORIZONTAL ELLIPSIS ( … ) for triple FULL STOP ( ... )

    Because these substitutions alter the characters of the original text, converting Unicode punctuation and symbols back to ASCII punctuation and symbols is imperative in NLP for preserving the content of the original documents.

  • Algorithm:
    A table-mapping method is used to convert Unicode symbols and punctuation to ASCII. The mapping is straightforward: each Unicode symbol or punctuation character is replaced with its assigned ASCII string. A configurable mapping table is used for this purpose. The default symbols & punctuation mapping table provided by the Lexical Tools is located at ${LVG}/data/Unicode/symbolMap.data. Its format is as follows:

    Unicode   Mapped ASCII   Char   Unicode Name
    U+02BA    "              ʺ      MODIFIER LETTER DOUBLE PRIME

    Please note:

    • Field 1 must be a non-ASCII Unicode character, given as its Unicode hexadecimal value (e.g. U+02BA)
    • Field 2 must be an ASCII string
    • Fields 3 and 4 are the Unicode character and the Unicode name of field 1. They serve as annotation only and are not used by the program.
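
    The field rules above can be sketched as a small parser. This is only an illustration, not the Lexical Tools code: the class name SymbolMapEntry is hypothetical, and the '|' field delimiter is an assumption, since the delimiter used in symbolMap.data is not stated here. Field 2 is deliberately not trimmed, in case a mapped ASCII string consists of a space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One row of the mapping table (hypothetical helper, not part of the Lexical Tools API). */
final class SymbolMapEntry {
    // ASSUMPTION: fields are separated by '|'; adjust to the real file's delimiter.
    private static final String DELIMITER = "\\|";
    // Field 1 notation as shown in the sample row, e.g. U+02BA.
    private static final Pattern CODE_POINT = Pattern.compile("U\\+([0-9A-Fa-f]{4,6})");

    final int codePoint; // field 1: the non-ASCII Unicode character
    final String ascii;  // field 2: the ASCII replacement string

    private SymbolMapEntry(int codePoint, String ascii) {
        this.codePoint = codePoint;
        this.ascii = ascii;
    }

    /** Parses one table line and enforces the rules for fields 1 and 2. */
    static SymbolMapEntry parse(String line) {
        String[] fields = line.split(DELIMITER, -1);
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected at least 2 fields: " + line);
        }
        Matcher m = CODE_POINT.matcher(fields[0].trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Field 1 is not in U+XXXX form: " + fields[0]);
        }
        int cp = Integer.parseInt(m.group(1), 16);
        // Field 1 must be a valid, non-ASCII code point.
        if (cp <= 0x7F || !Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("Field 1 must be non-ASCII: " + fields[0]);
        }
        // Field 2 must contain ASCII characters only.
        String ascii = fields[1];
        if (!ascii.chars().allMatch(c -> c <= 0x7F)) {
            throw new IllegalArgumentException("Field 2 must be ASCII: " + ascii);
        }
        // Fields 3 and 4 (character and name) are annotation only and are ignored.
        return new SymbolMapEntry(cp, ascii);
    }
}
```

    Under the delimiter assumption, SymbolMapEntry.parse("U+02BA|\"|\u02BA|MODIFIER LETTER DOUBLE PRIME") would yield code point 0x02BA mapped to the ASCII string ", while a row whose first field is an ASCII character such as U+0041 would be rejected.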

  • Samples:
    Quotation marks, dashes, and hyphens are the most commonly seen cases in this normalization.

  • Java Code Implementation:
    • Perform mapping if the character is in the punctuation & symbols mapping table
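
    A minimal sketch of this step, assuming the mapping table has already been loaded into memory. The class SymbolAsciiMapper and its methods are hypothetical names, not the Lexical Tools API, and the sample entries are illustrative only: they simply reverse the substitutions listed in the introduction and add the U+02BA example row, so the contents of the real symbolMap.data may differ. Characters that are not in the table are left unchanged.

```java
import java.util.HashMap;
import java.util.Map;

/** Table-driven symbol/punctuation mapper (hypothetical sketch). */
final class SymbolAsciiMapper {
    // Key: Unicode code point (field 1); value: mapped ASCII string (field 2).
    private final Map<Integer, String> table = new HashMap<>();

    void put(int codePoint, String ascii) {
        table.put(codePoint, ascii);
    }

    /** Replaces every character found in the table; all others pass through. */
    String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate by code point so characters outside the BMP are handled whole.
        text.codePoints().forEach(cp -> {
            String mapped = table.get(cp);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    /** ILLUSTRATIVE entries only; not the contents of the real mapping file. */
    static SymbolAsciiMapper withSampleEntries() {
        SymbolAsciiMapper m = new SymbolAsciiMapper();
        m.put(0x2018, "'");   // LEFT SINGLE QUOTATION MARK
        m.put(0x2019, "'");   // RIGHT SINGLE QUOTATION MARK
        m.put(0x201C, "\"");  // LEFT DOUBLE QUOTATION MARK
        m.put(0x201D, "\"");  // RIGHT DOUBLE QUOTATION MARK
        m.put(0x2013, "-");   // EN DASH
        m.put(0x2014, "--");  // EM DASH
        m.put(0x2026, "..."); // HORIZONTAL ELLIPSIS
        m.put(0x02BA, "\"");  // MODIFIER LETTER DOUBLE PRIME
        return m;
    }
}
```

    With these sample entries, SymbolAsciiMapper.withSampleEntries().toAscii("\u201CIt\u2019s here\u2026\u201D") returns the ASCII string "It's here..." enclosed in straight double quotes. Because field 2 is a string rather than a single character, one input character may expand into several output characters, as with the ellipsis.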

  • References: