Strip or Map Unicode

  • Introduction:

    Most Unicode characters have no semantic or graphic similarity to any ASCII character. In other words, these characters cannot be logically normalized to ASCII because there is no ASCII character to map them to. These Unicode characters are:

    • not ASCII (code point > 127, i.e. above U+007F)
    • not normalized by the Unicode normalization decomposition algorithm
    • not included in the Unicode symbols and punctuation mapping tables
    • not included in the Unicode mapping tables
    • not normalized by stripping diacritics
    • not normalized by splitting ligatures

    In NLP, these Unicode characters are either stripped (when irrelevant) or mapped to an ASCII string (according to the user's preferences). Accordingly, one of two processes is followed:

    • Mapped to an ASCII string if found in the (user-defined) mapping table
      For example, Greek letters such as α, β, π, and ω are converted to ASCII to preserve the semantic meaning of the document.

    • Stripped if not found in the mapping table (semantically irrelevant Unicode)
      For example, Unicode symbols such as ™, ©, and ® are stripped because they can be treated as stopwords for indexing purposes in NLP.

    This method can be used after Unicode Core Norm to ensure that the result of normalization is pure ASCII, by stripping all irrelevant non-ASCII characters.

  • Algorithm:
    • If the character is ASCII (< 128)
      • Return input character
    • else
      • Map it to an ASCII string if it is in the non-strip mapping table
      • Otherwise, strip it

    A table-mapping method is applied to convert Unicode characters into ASCII strings for the cases listed above. The mapping is straightforward: it replaces a Unicode character with its assigned ASCII string. The character is stripped if it is not found in the non-strip mapping table. A configurable mapping table is used for this purpose. This table is located at ${LVG}/data/Unicode/. This file is the default Unicode non-strip mapping table provided by the Lexical Tools. The format is as follows:

    Unicode | Mapped ASCII | Char | Unicode Name

    Please note:

    • Field 1 must be a non-ASCII Unicode character (given as its Unicode hex value)
    • Field 2 must be an ASCII string
    • Fields 3 and 4 are the Unicode character and the Unicode name of field 1. They serve as annotation only and are not used by the program.
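
    The field rules above can be illustrated with a minimal Java sketch that loads such a table into a map. Several details here are assumptions rather than facts taken from the Lexical Tools source: the class name NonStripTableSketch is hypothetical, fields are assumed to be separated by "|", the hex value is assumed to allow an optional "U+" prefix, and lines starting with "#" are assumed to be comments. The sample mappings ("alpha", "pi") are illustrative only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Lexical Tools API.
final class NonStripTableSketch {

    // Parses lines assumed to look like: hexValue|asciiString|char|name
    static Map<Integer, String> parse(List<String> lines) {
        Map<Integer, String> table = new HashMap<>();
        for (String line : lines) {
            // Assumption: blank lines and lines starting with '#' carry no data.
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split("\\|", -1);
            if (fields.length < 2) {
                throw new IllegalArgumentException("Too few fields: " + line);
            }
            // Field 1: Unicode hex value, with an optional "U+" prefix (assumed).
            String hex = fields[0].trim();
            if (hex.startsWith("U+")) {
                hex = hex.substring(2);
            }
            int codePoint = Integer.parseInt(hex, 16);
            if (codePoint < 128) {
                throw new IllegalArgumentException("Field 1 must be non-ASCII: " + line);
            }
            // Field 2: must contain ASCII characters only.
            String mapped = fields[1];
            if (!mapped.chars().allMatch(c -> c < 128)) {
                throw new IllegalArgumentException("Field 2 must be ASCII: " + line);
            }
            // Fields 3 and 4 are annotation only and are ignored here.
            table.put(codePoint, mapped);
        }
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, String> table = parse(Arrays.asList(
            "U+03B1|alpha|\u03B1|GREEK SMALL LETTER ALPHA",
            "U+03C0|pi|\u03C0|GREEK SMALL LETTER PI"));
        System.out.println(table.get(0x03B1)); // prints "alpha"
    }
}
```

    Keying the map by code point rather than by char keeps the lookup correct for characters outside the Basic Multilingual Plane.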

  • Java Code Implementation:
    • If the input character is ASCII (< 128, i.e. below U+0080)
      • Return the input character
    • else
      • if the character is in the Unicode non-strip mapping table
        • Map the character
      • else
        • Strip the character
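
    The steps above can be sketched in Java as follows. This is an illustrative sketch under stated assumptions, not the actual Lexical Tools code: the class name StripMapUnicodeSketch and the method stripOrMap are hypothetical, the non-strip mapping table is assumed to be already loaded into a Map keyed by code point, and the sample mappings ("alpha", "pi") are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class, not part of the Lexical Tools API.
final class StripMapUnicodeSketch {

    // Returns the input with every non-ASCII character either mapped or stripped.
    static String stripOrMap(String input, Map<Integer, String> nonStripTable) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            if (cp < 128) {
                // ASCII: return the input character unchanged.
                out.appendCodePoint(cp);
            } else {
                String mapped = nonStripTable.get(cp);
                if (mapped != null) {
                    // Found in the non-strip mapping table: map it.
                    out.append(mapped);
                }
                // Not found: strip it by appending nothing.
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Illustrative mappings only (assumed, not the default table).
        Map<Integer, String> table = new HashMap<>();
        table.put(0x03B1, "alpha"); // GREEK SMALL LETTER ALPHA
        table.put(0x03C0, "pi");    // GREEK SMALL LETTER PI

        // U+2122 (TRADE MARK SIGN) is not in the table, so it is stripped.
        System.out.println(stripOrMap("\u03B1-helix\u2122 and \u03C0", table));
        // prints "alpha-helix and pi"
    }
}
```

    Because every code point of 128 or above is either replaced by an ASCII string or dropped, the output of this sketch is pure ASCII whenever the table values are ASCII.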

  • References: