Strip or Map Unicode to ASCII
- Short Description: Convert input Unicode characters to ASCII characters by stripping or mapping non-ASCII Unicode characters.
- Full Description:
  This flow converts Unicode characters to ASCII characters. Some Unicode characters cannot be converted to ASCII by the other Unicode normalization algorithms, such as strip diacritics, split ligatures, symbol mapping, or Unicode mapping. These characters are either:
  - stripped, because they are symbols or typos (meaningless in NLP), or
  - mapped to ASCII characters, because they are known Unicode characters in users' NLP projects.

  When the -m flag is specified, the detailed mutate operations for each character of the input string are added after the standard set of lvg output fields. There are three basic mutate operations in this flow, as shown in the following table:

  Operations | Descriptions | Example |
  ---|---|---|
  NO | No operation | A -> A |
  MP | Table lookup mapping | ɑ -> alpha |
  SP | Stripped | ™ -> (empty) |
- Difference: None.
- Features:
  - Convert Unicode characters in the input term to ASCII by stripping and mapping.
- Symbol: q8
- Examples:
  shell> lvg -f:q8 -m
  ɑ-Best™
  ɑ-Best™|alpha-Best|2047|16777215|q8|1|MP|NO|NO|NO|NO|NO|SP|
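The per-character operations in the -m output can be paired back with the input characters. The following is a minimal Python sketch, not part of lvg. The function name `parse_q8_line` is hypothetical, and the layout (six leading fields, followed by one operation code per input character and a trailing `|`) is an assumption read off the example line above.

```python
def parse_q8_line(line):
    """Split one 'lvg -f:q8 -m' output line into its parts.

    Assumes six leading fields followed by one mutate operation per
    input character, as in the example line; terms that themselves
    contain '|' are not handled.
    """
    fields = line.rstrip("\n").split("|")
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty piece after the trailing '|'
    source, output = fields[0], fields[1]
    ops = fields[6:]  # everything after the six leading fields
    # Pair each input character with its operation code.
    return source, output, list(zip(source, ops))


source, output, pairs = parse_q8_line(
    "ɑ-Best™|alpha-Best|2047|16777215|q8|1|MP|NO|NO|NO|NO|NO|SP|"
)
for ch, op in pairs:
    print(ch, op)
```

For the example line this pairs ɑ with MP, each of the characters in "-Best" with NO, and ™ with SP.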
- Algorithm:
  - Check if the character is ASCII:
    - if yes, return the original input character
    - if no, check if the character is in the non-strip mapping table:
      - if yes, return the mapped ASCII character
      - if no, strip the non-ASCII Unicode character
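The per-character decision described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the lvg implementation: the function name `strip_or_map` and the one-entry table `NON_STRIP_MAP` are hypothetical, and the real non-strip mapping table is not reproduced here.

```python
# Hypothetical stand-in for the non-strip mapping table; it holds only
# the one mapping shown in the example (ɑ -> alpha).
NON_STRIP_MAP = {"ɑ": "alpha"}


def strip_or_map(text, table=NON_STRIP_MAP):
    """Return (ascii_text, ops) with one operation code per input character."""
    out, ops = [], []
    for ch in text:
        if ord(ch) < 128:
            # ASCII character: return it unchanged.
            out.append(ch)
            ops.append("NO")
        elif ch in table:
            # Known non-ASCII character: table lookup mapping.
            out.append(table[ch])
            ops.append("MP")
        else:
            # Any other non-ASCII character: strip it.
            ops.append("SP")
    return "".join(out), ops


result, ops = strip_or_map("ɑ-Best™")
print(result)         # alpha-Best
print("|".join(ops))  # MP|NO|NO|NO|NO|NO|SP
```

Recording the operation codes alongside the output mirrors what the -m flag reports, so the sketch reproduces both the output term and the operation list from the example.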