Normalize Unicode to ASCII
Unicode is designed to be a universal character set that includes all of the major scripts of the world. It allows data to be transported through many different systems without corruption, which makes it very useful for multilingual NLP. Non-ASCII Unicode characters, such as diacritics, ligatures, punctuation, and symbols, are commonly seen even in English documents. For example, © is used for copyright and ® for the registered sign. Accordingly, UTF-8 has been the default standard input and output format for the Lexical tools since 2005.
ASCII (American Standard Code for Information Interchange) is the most commonly used standard code for information interchange and communication between data processing systems. The ASCII character set contains 128 7-bit coded characters, including alphabetic, numeric, control, and graphic characters. Even though Unicode is widely used these days, many NLP projects still deal only with ASCII. Thus, the Lexical tools need to provide users with a way to convert characters from Unicode (UTF-8) to ASCII (7-bit).
The objective is to provide a tool that converts non-ASCII characters from Unicode (UTF-8) to ASCII (code point value < 128, that is, no higher than U+007F). The normalized result should not change the meaning of the original Unicode character. The normalization algorithm and results are described as follows.
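As a minimal sketch of the "value < 128" criterion, the following Python helpers test whether a string is already pure ASCII and report which characters would need normalization. The function names `is_ascii` and `non_ascii_chars` are hypothetical and are not part of the Lexical tools.

```python
def is_ascii(text):
    """Return True if every character has a code point below 128 (U+0000..U+007F)."""
    return all(ord(ch) < 128 for ch in text)


def non_ascii_chars(text):
    """List each non-ASCII character of `text` as 'U+XXXX', in order of appearance."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) >= 128]
```

For instance, `non_ascii_chars("a©b®")` reports the copyright and registered signs mentioned above as `U+00A9` and `U+00AE`.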
III. Norm Guidelines:
Two fundamental principles guide the normalization of non-ASCII Unicode to ASCII:
- Similar Semantic representation: represents the same meaning
- Similar Graphic representation: similar graphic appearance
The tables below illustrate examples for combinations of the above two guidelines. Please note that different applications might apply different normalization guidelines.
- Examples: similar in both semantic and graphic representations
| Semantic | Graphic | Norm? | Example | Notes |
|---|---|---|---|---|
| Similar | Similar | Yes | U+0100: [Ā] to [A] | Strip diacritics |
| Similar | Similar | Yes | U+00BD: [½] to [1/2] | Split ligature |
| Similar | Similar | Yes | U+201C: [“] to ["] | Punctuation mapping |
| Similar | Similar | Yes | U+0406: [І] to [I] | Alphabet mapping |
| Similar | Similar | Yes | U+FF2B: [Ｋ] to [K] | Fullwidth letters |
- Examples: similar in semantic; not in graphic representations
| Semantic | Graphic | Norm? | Example | Notes |
|---|---|---|---|---|
| Similar | Different | Yes | U+00AB: [«] to ["] | Punctuation mapping |
| Similar | Different | Yes | U+00A9: [©] to [(c)] | Symbol mapping |
| Similar | Different | Yes | U+00B0: [°] to [(degree)] | Symbol mapping |
| Similar | Different | Yes | U+03B1: [α] to [(alpha)] | Alphabet mapping |
| Similar | Different | Yes | U+00D7: [×] to [*] | Symbol mapping |
| Similar | Different | Yes | U+03BC: [μ] to [(mu)] | Alphabet mapping |
- Examples: similar in graphic; not in semantic representations
| Semantic | Graphic | Norm? | Example | Notes |
|---|---|---|---|---|
| Different | Similar | Yes | U+2190: [←] to [<-] | Commonly used |
| Different | Similar | Yes/No | U+00B5: [µ] to [u] | "ul" is used for microliter |
| Different | Similar | Yes/No | U+2022: [•] to [*] | Use * for bullet? |
| Different | Similar | Yes/No | U+03BC: [μ] to U+00B5: [µ] | Commonly used synonym or typo? |
| Different | Similar | Yes/No | U+00DF: [ß] to U+03B2: [β] | Commonly used synonym or typo? |
| Different | Similar | Yes/No | U+00B6: [¶] to U+03C0: [π] | Commonly used synonym or typo? |
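The look-alike pairs in the last table can be inspected with Python's standard `unicodedata` module. The sketch below is only an illustration and is not how the Lexical tools decide on synonyms: it prints the official name of a character and checks whether two characters become identical under Unicode compatibility normalization (NFKC). The helper names `describe` and `compat_equivalent` are hypothetical.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def compat_equivalent(a, b):
    """Return True if `a` and `b` are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

Under this check, the micro sign (U+00B5) and Greek small letter mu (U+03BC) are compatibility equivalents, whereas ß (U+00DF) and β (U+03B2) are unrelated characters that merely look alike, so treating them as synonyms remains an application-level decision.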
Normalization based on the semantic or graphic similarity principle is the core operation of Unicode-to-ASCII normalization. It is called core normalization and can be performed by:
- Symbols and punctuation mapping
- Unicode mapping
- Split ligatures
- Strip diacritics
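The four steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Lexical tools implementation: the three mapping tables are hypothetical and contain only a few entries taken from the example tables, and diacritics are stripped by canonical decomposition (NFD) followed by removal of combining marks, which is one common technique.

```python
import unicodedata

# Hypothetical, tiny mapping tables for illustration only.
SYMBOL_MAP = {"\u201C": '"', "\u201D": '"', "\u00AB": '"',
              "\u00A9": "(c)", "\u00B0": "(degree)"}
UNICODE_MAP = {"\u0406": "I", "\u03B1": "(alpha)", "\u03BC": "(mu)"}
LIGATURE_MAP = {"\u00BD": "1/2", "\u00C6": "AE", "\uFB01": "fi"}


def strip_diacritics(ch):
    """Decompose `ch` (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def core_norm_char(ch):
    """Normalize one character: table lookups first, then diacritic stripping."""
    if ord(ch) < 128:
        return ch  # already ASCII: no operation
    for table in (SYMBOL_MAP, UNICODE_MAP, LIGATURE_MAP):
        if ch in table:
            return table[ch]
    # May still return a non-ASCII character if nothing applies.
    return strip_diacritics(ch)


def core_norm(text):
    """Apply the per-character core normalization to a whole string."""
    return "".join(core_norm_char(ch) for ch in text)
```

With these tables, `core_norm("“résumé”")` yields `"resume"` in straight double quotes. Characters covered by neither a table nor diacritic stripping, such as fullwidth letters here, pass through unchanged in this sketch.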
IV. Norm Operations:
- Basic Norm Operations
There are 7 basic Norm operations used in the Lexical tools for normalizing non-ASCII Unicode characters to ASCII. Most Unicode-to-ASCII normalization can be achieved by combining different basic operations in different orders. The following table shows the lvg flows and other information for these seven basic Norm operations.
| Lvg Flow | Descriptions | Abbreviation |
|---|---|---|
| | No Operation (for ASCII) | NO |
| -f:q | Strip Diacritics | SD |
| -f:q0 | Map Symbols & Punctuation to ASCII | MS |
| -f:q1 | Map Unicode to ASCII | MU |
| -f:q2 | Split Ligatures | SL |
| -f:q3 | Get Unicode Name | UN |
| -f:q4 | Get Unicode Synonym | US |
| -f:q8 | Strip or Map Unicode | SM |
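To illustrate what a "Get Unicode Name" style operation returns, the sketch below uses Python's standard `unicodedata.name` as a stand-in. It is an assumption that this resembles the information the UN operation provides; the actual lvg output format is not described here, and the helper names are hypothetical.

```python
import unicodedata


def unicode_name(ch, default="UNKNOWN"):
    """Return the official Unicode name of `ch`, or `default` if it has none."""
    return unicodedata.name(ch, default)


def names_for_non_ascii(text):
    """Map each non-ASCII character in `text` to its Unicode name."""
    return {ch: unicode_name(ch) for ch in text if ord(ch) >= 128}
```

For example, `unicode_name("©")` returns `COPYRIGHT SIGN`, which is the kind of name-based fallback that can stand in for a character when no closer ASCII mapping exists.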
- Combined Norm Operations
Different combinations of the above basic Norm operations can be used for different NLP tasks. The Lexical tools provide the 3 most commonly used combined operations, as shown in the following table:
| Lvg Flow | Descriptions | Combined Flows |
|---|---|---|
| -f:q7 | Unicode Core Norm (based on semantic & graphic similarity) | -f:q0:q1:q2:q, recursively |
| -f:q5 | Norm Unicode to ASCII | -f:q7:q3 |
| -f:q6 | Norm Unicode to ASCII with synonym option | -f:q4:q7:q3 |
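The idea of chaining basic operations and applying the chain "recursively" can be sketched as function composition repeated until the text stops changing. This is an illustrative reading of the term, not the lvg implementation: the helper names, the two-entry ligature table, and the `max_rounds` safety limit are all assumptions.

```python
import unicodedata

# Hypothetical ligature table, for illustration only.
LIGATURES = {"\u00E6": "ae", "\u00C6": "AE"}


def split_ligatures(text):
    """Replace each known ligature with its ASCII expansion."""
    return "".join(LIGATURES.get(ch, ch) for ch in text)


def strip_diacritics(text):
    """Decompose (NFD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def compose(*steps):
    """Chain operations left to right, like flows joined by ':' in a combined flow."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


def apply_recursively(step, text, max_rounds=10):
    """Re-apply `step` until the text no longer changes (or max_rounds is hit)."""
    for _ in range(max_rounds):
        result = step(text)
        if result == text:
            break
        text = result
    return text


one_pass = compose(split_ligatures, strip_diacritics)
```

Repetition matters for characters such as ǽ (U+01FD): a single pass only strips the accent and leaves æ, while `apply_recursively(one_pass, "ǽ")` runs a second pass that splits the ligature and returns `ae`.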