Strip Diacritics in Unicode

  • Introduction
    Diacritics are common in languages such as Spanish and French, and they also appear in English documents. Stripping diacritics is a common step when normalizing non-ASCII Unicode characters to ASCII.

  • Algorithm:
    Since 2008, Lexical Tools has stripped diacritics with an enhanced algorithm that uses the following processes:

    • Table mapping (configurable):
      The table mapping method overrides the default diacritics stripping result. It is a straightforward method that replaces a non-ASCII Unicode diacritic character with its assigned ASCII character, as defined in a configurable mapping table. This table is located at ${LVG}/data/Unicode/ and is the default strip-diacritics mapping table provided by Lexical Tools. It is used for those Unicode diacritic characters that cannot be stripped by the Unicode normalization algorithm. The format is as follows:

      Unicode    Mapped ASCII    Char    Unicode Name

      Please note:

      • Field 1 must be a non-ASCII Unicode character (given as its hexadecimal Unicode value)
      • Field 2 must be an ASCII character
      • Fields 3 and 4 are the Unicode character and the Unicode name of field 1. They serve as annotations only and are not used by the program
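
To illustrate the format above, the following sketch loads such a table into a lookup map. The `|` field separator, the optional `U+` prefix, the `#` comment convention, the class name `DiacriticTable`, and the sample row are all assumptions made for this example; none of them is taken from the Lexical Tools distribution.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical loader for a strip-diacritics mapping table.
// Assumptions: fields are separated by '|', field 1 is a hex code point
// (with or without a "U+" prefix), and lines starting with '#' are comments.
final class DiacriticTable {
    private final Map<Integer, Character> mapping = new HashMap<>();

    DiacriticTable(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = trimmed.split("\\|", -1);
            if (fields.length < 2) {
                throw new IllegalArgumentException("Expected at least 2 fields: " + line);
            }
            String hex = fields[0].trim();
            if (hex.regionMatches(true, 0, "U+", 0, 2)) {
                hex = hex.substring(2); // drop the optional "U+" prefix
            }
            int codePoint = Integer.parseInt(hex, 16);
            String ascii = fields[1];
            // Field 1 must be non-ASCII; field 2 must be a single ASCII character.
            if (codePoint <= 0x7F) {
                throw new IllegalArgumentException("Field 1 is ASCII: " + line);
            }
            if (ascii.length() != 1 || ascii.charAt(0) > 0x7F) {
                throw new IllegalArgumentException("Field 2 is not one ASCII char: " + line);
            }
            // Fields 3 and 4 are annotations only, so they are ignored here.
            mapping.put(codePoint, ascii.charAt(0));
        }
    }

    // Returns the mapped ASCII character, or null if the code point is not in the table.
    Character lookup(int codePoint) {
        return mapping.get(codePoint);
    }

    public static void main(String[] args) {
        DiacriticTable table = new DiacriticTable(Arrays.asList(
            "U+00D8|O|\u00D8|LATIN CAPITAL LETTER O WITH STROKE"));
        System.out.println(table.lookup(0x00D8)); // prints O
    }
}
```

U+00D8 is used as the sample row because it has no canonical decomposition, so it is the kind of character that needs a table entry.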

    • Unicode normalization D algorithm (D: Canonical Decomposition, not followed by Canonical Composition):
      As discussed in Unicode Normalization, Unicode Normalization Form D (NFD) can be used for stripping diacritics. NFD decomposes a diacritic character into a base character followed by one or more combining diacritical marks. The combining marks can then be stripped by keeping only the leading base character. This algorithm works well on most diacritics, such as characters in the Latin-1 Supplement, Latin Extended-A, and Latin Extended-B blocks.
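
The decomposition step can be inspected with a small sketch. It uses the JDK's built-in `java.text.Normalizer` rather than ICU4J so that it runs without extra libraries; the class name `NfdDemo` and the helper `describeNfd` are hypothetical names introduced for this example.

```java
import java.text.Normalizer;

// Sketch: show the code points produced by canonical decomposition (NFD).
// Uses java.text.Normalizer from the JDK, not the ICU4J Normalizer.
final class NfdDemo {
    // Returns the NFD form as space-separated code points, e.g. "U+0065 U+0301".
    static String describeNfd(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // LATIN SMALL LETTER E WITH ACUTE: base letter plus combining acute accent
        System.out.println(describeNfd("\u00E9")); // prints U+0065 U+0301
        // LATIN CAPITAL LETTER O WITH STROKE: no canonical decomposition
        System.out.println(describeNfd("\u00D8")); // prints U+00D8
    }
}
```

The second call shows why a mapping table is still needed: a character without a canonical decomposition comes back from NFD unchanged.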

  • Java Code Implementation:
    • Download icu4j from the internet
    • Include icu4j.jar in the Java CLASSPATH
    • import*;
    • If the character is in the diacritic mapping table
      • Perform mapping
    • else
      • String normStr = Normalizer.normalize(inChar, Normalizer.NFD);
      • Strip the combining diacritical marks from normStr
        • If the length of normStr is more than 1 (diacritical character)
        • If normStr contains combining diacritical mark characters (exclude non-Latin-based characters)
          => Keep the first character (the base character always comes first)
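
The steps above can be sketched end to end as follows. This is a minimal illustration, not the Lexical Tools implementation: it swaps in the JDK's `java.text.Normalizer` for the ICU4J `Normalizer` so that it is self-contained, it takes the mapping table as a plain `Map`, and it interprets "contains combining diacritical marks" as "contains a code point from the Combining Diacritical Marks block (U+0300 to U+036F)". The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Collections;
import java.util.Map;

// Sketch of the two-step procedure: table mapping first, then NFD stripping.
// Assumption: java.text.Normalizer stands in for the ICU4J Normalizer.
final class StripDiacritics {
    // Strips one character (given as a code point).
    static String stripChar(int codePoint, Map<Integer, Character> table) {
        // Step 1: the mapping table overrides the default result.
        Character mapped = table.get(codePoint);
        if (mapped != null) {
            return String.valueOf(mapped);
        }
        // Step 2: canonical decomposition (NFD).
        String in = new String(Character.toChars(codePoint));
        String norm = Normalizer.normalize(in, Normalizer.Form.NFD);
        // Keep only the leading base character when the decomposition
        // is longer than 1 and contains a combining diacritical mark.
        if (norm.length() > 1 && hasCombiningDiacritic(norm)) {
            return new String(Character.toChars(norm.codePointAt(0)));
        }
        return in; // nothing to strip
    }

    // True if any code point lies in the Combining Diacritical Marks block.
    static boolean hasCombiningDiacritic(String s) {
        return s.codePoints().anyMatch(cp ->
            Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.COMBINING_DIACRITICAL_MARKS);
    }

    // Applies stripChar to every code point of the input text.
    static String strip(String text, Map<Integer, Character> table) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> out.append(stripChar(cp, table)));
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Character> table = Collections.singletonMap(0x00D8, 'O');
        System.out.println(strip("caf\u00E9", table)); // prints cafe
        System.out.println(strip("\u00D8", table));    // prints O
    }
}
```

Because the check looks only for marks in that one block, a Hangul syllable (which decomposes into jamo) passes through unchanged. The sketch does not test whether the base character is Latin, so that part of the exclusion rule is left out.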

  • References: