Normalize Unicode to ASCII

I. Introduction:
Unicode is designed to be a universal character set that includes all of the major scripts of the world. It allows data to be transported through many different systems without corruption, which makes it very useful for multilingual NLP. Non-ASCII Unicode characters, such as diacritics, ligatures, punctuation, and symbols, are common even in English documents. For example, © is used for copyright and ® for the registered sign. Accordingly, UTF-8 has been the default input and output format for the Lexical Tools since 2005.

ASCII (American Standard Code for Information Interchange) is the most commonly used standard code for information interchange and communication between data processing systems. The ASCII character set contains 128 7-bit coded characters, including alphabetic, numeric, control, and graphic characters. Even though Unicode is widely used these days, many NLP projects still handle only ASCII. Thus, the Lexical Tools need to provide users a way to convert characters from Unicode (UTF-8) to ASCII (7-bit).

II. Definition:
The goal is to provide a tool that converts non-ASCII characters from Unicode (UTF-8) to ASCII (code point values below 128, that is, U+0000 through U+007F). The normalized result should not change the meaning of the original Unicode character. The normalization algorithm and results are described as follows.
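The ASCII boundary in this definition can be checked directly. The following is a minimal sketch in Python; the helper names `non_ascii_chars` and `is_ascii` are hypothetical and are not part of the Lexical Tools.

```python
def non_ascii_chars(text: str) -> list:
    """Return (index, character) pairs for every character above U+007F."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0x7F]


def is_ascii(text: str) -> bool:
    """True when every character has a code point value below 128."""
    return not non_ascii_chars(text)


if __name__ == "__main__":
    sample = "Copyright \u00A9 2005"
    # Report each character that a Unicode to ASCII tool would have to normalize.
    for index, ch in non_ascii_chars(sample):
        print(f"position {index}: U+{ord(ch):04X} [{ch}]")
```

A string passes the check only when nothing is left to normalize, so the same test can serve as the stopping condition for a normalization loop.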

III. Norm Guidelines:
Two fundamental principles guide the normalization of non-ASCII Unicode to ASCII:

  • Similar semantic representation: the result represents the same meaning
  • Similar graphic representation: the result has a similar graphic appearance

    The table below illustrates examples for the combinations of the above two guidelines. Please note that different applications might apply different normalization guidelines.

  • Examples: similar in both semantic and graphic representations
    Semantic   Graphic    Norm    Example                 Notes
    Similar    Similar    Yes     U+0100: [Ā] to [A]      Strip diacritics
    Similar    Similar    Yes     U+00BD: [½] to [1/2]    Split ligature
    Similar    Similar    Yes     U+201C: [“] to ["]      Punctuation mapping
    Similar    Similar    Yes     U+0406: [І] to [I]      Alphabet mapping
    Similar    Similar    Yes     U+FF2B: [Ｋ] to [K]     Fullwidth letters

  • Examples: similar in semantic; not in graphic representations
    Semantic   Graphic    Norm    Example                     Notes
    Similar    Different  Yes     U+00AB: [«] to ["]          Punctuation mapping
    Similar    Different  Yes     U+00A9: [©] to [(c)]        Symbol mapping
    Similar    Different  Yes     U+00B0: [°] to [(degree)]   Symbol mapping
    Similar    Different  Yes     U+03B1: [α] to [(alpha)]    Alphabet mapping
    Different  Similar    Yes     U+00D7: [×] to [*]          Symbol mapping
    Similar    Different  Yes     U+03BC: [μ] to [(mu)]       Alphabet mapping

  • Examples: similar in graphic; not in semantic representations
    Semantic   Graphic    Norm    Example                       Notes
    Different  Similar    Yes     U+2190: [←] to [<-]           Commonly used
    Different  Similar    Yes/No  U+00B5: [µ] to [u]            "ul" is used for microliter
    Different  Similar    Yes/No  U+2022: [•] to [*]            Use * for bullet?
    Different  Similar    Yes/No  U+03BC: [μ] to U+00B5: [µ]    Commonly used synonym or typo?
    Different  Similar    Yes/No  U+00DF: [ß] to U+03B2: [β]    Commonly used synonym or typo?
    Different  Similar    Yes/No  U+00B6: [¶] to U+03C0: [π]    Commonly used synonym or typo?
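To see why explicit mapping tables are needed for most of these examples, it helps to try a generic approach first. The sketch below uses only Python's standard `unicodedata` module: it applies NFKD decomposition and drops combining marks. This is an illustration under that assumption, not how the Lexical Tools implement normalization, and `nfkd_strip` is a hypothetical helper name.

```python
import unicodedata


def nfkd_strip(text: str) -> str:
    """Apply NFKD decomposition, then drop combining marks (diacritics)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


if __name__ == "__main__":
    # A few characters from the first two groups of examples.
    for ch in ["\u0100", "\u00BD", "\u201C", "\u0406", "\uFF2B", "\u00A9"]:
        out = nfkd_strip(ch)
        status = "ASCII" if out.isascii() else "still non-ASCII"
        print(f"U+{ord(ch):04X} [{ch}] -> [{out}] ({status})")
```

With this generic approach only U+0100 and U+FF2B come out as ASCII. U+00BD decomposes into the digits 1 and 2 around U+2044 (FRACTION SLASH), which is itself non-ASCII, and U+201C, U+0406, and U+00A9 have no decomposition at all, so they need dedicated punctuation, alphabet, and symbol mappings.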

    Normalization based on the semantic or graphic similarity principle is the core operation of Unicode to ASCII normalization. This is called core normalization and is performed by applying:

    • Symbols and punctuation mapping
    • Unicode mapping
    • Split ligatures
    • Strip diacritics
    recursively until no further normalized results can be found. Core normalization is the most robust operation and is provided as the lvg flow component Unicode Core Norm (-f:q7) in Lexical Tools.
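The recursive application of the four operations can be sketched as a loop that runs until the text stops changing. The code below is a minimal illustration in Python, not the Lexical Tools implementation: the mapping tables `SYMBOL_MAP` and `UNICODE_MAP` are tiny hypothetical samples built from the examples above, the function names are invented for this sketch, and the order in which the operations are tried simply follows the list.

```python
import unicodedata

# Hypothetical sample tables; a real tool would use much larger ones.
SYMBOL_MAP = {
    "\u00A9": "(c)", "\u00B0": "(degree)", "\u2190": "<-",
    "\u201C": '"', "\u201D": '"', "\u00AB": '"', "\u2044": "/",
}
UNICODE_MAP = {"\u0406": "I", "\u03B1": "(alpha)"}


def map_symbols(ch: str) -> str:
    return SYMBOL_MAP.get(ch, ch)


def map_unicode(ch: str) -> str:
    return UNICODE_MAP.get(ch, ch)


def split_ligature(ch: str) -> str:
    """Expand tagged (compatibility) decompositions such as <fraction> or <wide>."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
    return ch


def strip_diacritics(ch: str) -> str:
    """Decompose canonically and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def normalize_char(ch: str) -> str:
    """Apply the first operation that changes a non-ASCII character."""
    if ord(ch) < 0x80:
        return ch
    for op in (map_symbols, map_unicode, split_ligature, strip_diacritics):
        out = op(ch)
        if out != ch:
            return out
    return ch


def core_norm(text: str, max_rounds: int = 10) -> str:
    """Repeat the per-character step until the text is ASCII or stops changing."""
    for _ in range(max_rounds):
        if text.isascii():
            break
        new_text = "".join(normalize_char(ch) for ch in text)
        if new_text == text:
            break  # nothing more can be normalized with these tables
        text = new_text
    return text
```

The repetition matters for characters such as U+00BD: the first round splits it into 1, U+2044, and 2, and the second round maps U+2044 to "/". Characters with no entry in the sample tables and no decomposition, such as U+00DF, are left unchanged in this sketch.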

IV. Norm Operations: