Strip Punctuations, Enhanced

  • Short Description: Strip punctuation, enhanced

  • Full Description:

    Strip punctuation from the input term, except where the punctuation is between or just before numbers. This enhancement tries to avoid breaking up tokens such as floating-point numbers, negative numbers, dates, telephone numbers, or catalog numbers, for example "1.25", "-1", "301-435-3170", "10/12/97", or "10-12-00".

    Punctuation is defined by the general categories of the Java Character class and includes:

    • DASH_PUNCTUATION (20): -
    • START_PUNCTUATION (21): ( { [
    • END_PUNCTUATION (22): ) } ]
    • CONNECTOR_PUNCTUATION (23): _
    • OTHER_PUNCTUATION (24): ! @ # % & * \ : ; " ' , . ? /
    • MATH_SYMBOL (25): ~ + = | < >
    • CURRENCY_SYMBOL (26): $
    • MODIFIER_SYMBOL (27): ` ^

    No effect on the -m option. "none" is added at the end of the output.
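
    The category list above maps directly onto Character.getType(). The following is a minimal sketch of such a check, not the LVG source; the class name PunctuationCategories and the method isPunctuation are hypothetical names chosen for illustration.

```java
// Sketch only: PunctuationCategories is a hypothetical helper, not LVG code.
final class PunctuationCategories {

    /** Returns true if c belongs to one of the eight categories listed above. */
    static boolean isPunctuation(char c) {
        switch (Character.getType(c)) {
            case Character.DASH_PUNCTUATION:      // 20: -
            case Character.START_PUNCTUATION:     // 21: ( { [
            case Character.END_PUNCTUATION:       // 22: ) } ]
            case Character.CONNECTOR_PUNCTUATION: // 23: _
            case Character.OTHER_PUNCTUATION:     // 24: ! @ # % & * \ : ; " ' , . ? /
            case Character.MATH_SYMBOL:           // 25: ~ + = | < >
            case Character.CURRENCY_SYMBOL:       // 26: $
            case Character.MODIFIER_SYMBOL:       // 27: ` ^
                return true;
            default:
                return false;
        }
    }
}
```

    Letters, digits, and spaces fall outside these categories, so isPunctuation('a'), isPunctuation('1'), and isPunctuation(' ') all return false.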

  • Differences from the C version:
    1. The Java version trims output terms (removes spaces at the beginning and end of the term).
    2. The C version replaces "-" with a space even when the "-" is adjacent to a space, while the Java version simply strips it.
    3. Results differ when testing diacritics, such as \345\346... in the unit test.
    4. The apostrophe ("'") in a genitive is stripped. This behavior may need further discussion.


  • Features:

    Strip a character from the input term if it belongs to the punctuation list above, except in the following cases.

    1. Floating number: such as "1.25" and "-23.38".
    2. Negative integer: such as "-23".
    3. Date: such as "10/12/97" or "10-12-00".
    4. Telephone: such as "301-435-3170" or "301.435.3170".
    5. Catalog: such as "007.12.1234.07" or "007-12-1234-07".


  • Symbol: P

  • Examples:
    
    shell> lvg -f:P
    -12.3|-12.3|2047|16777215|P|1|
    10/12/97|10/12/97|2047|16777215|P|1|
    301-435-2134|301-435-2134|2047|16777215|P|1|
    St. John's|St Johns|2047|16777215|P|1|
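
    Each output line above is a pipe-delimited record. The sketch below splits such a line so the input and output terms can be compared programmatically. It assumes the six fields are, in order, input term, output term, category, inflection, flow history, and flow number; that layout is not stated in this section, and the class LvgRecord and its field names are hypothetical.

```java
// Sketch only: LvgRecord and its field names are assumptions, not part of LVG.
final class LvgRecord {
    final String input;        // field 1: term as given
    final String output;       // field 2: term after the flow
    final long categories;     // field 3: assumed category bit mask
    final long inflections;    // field 4: assumed inflection bit mask
    final String flowHistory;  // field 5: flow symbol(s), e.g. "P"
    final int flowNumber;      // field 6: assumed flow number

    private LvgRecord(String[] f) {
        input = f[0];
        output = f[1];
        categories = Long.parseLong(f[2]);
        inflections = Long.parseLong(f[3]);
        flowHistory = f[4];
        flowNumber = Integer.parseInt(f[5]);
    }

    /** Parses one line such as "St. John's|St Johns|2047|16777215|P|1|". */
    static LvgRecord parse(String line) {
        // String.split drops the empty field after the trailing "|".
        String[] f = line.split("\\|");
        if (f.length < 6) {
            throw new IllegalArgumentException("expected 6 fields: " + line);
        }
        return new LvgRecord(f);
    }

    /** True when the flow left the term untouched. */
    boolean unchanged() {
        return input.equals(output);
    }
}
```

    With this sketch, LvgRecord.parse("-12.3|-12.3|2047|16777215|P|1|").unchanged() is true, while the "St. John's" line yields an output of "St Johns" and unchanged() is false.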
    

  • Implementation Logic:
    1. Tokenize each word from the input term.
    2. Check if words are floating-point numbers.
    3. Check if words are dates.
    4. Check if words are telephone numbers.
    5. Check if words are catalog numbers.
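
    The steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in ToStripPunctuationEnhanced.java: the class name, method names, and all four regular expressions are hypothetical, words are assumed to be separated by whitespace, and a word is kept intact only when the whole word matches one of the patterns. A number with attached punctuation, such as "(1.25)", is therefore not protected by this sketch.

```java
import java.util.regex.Pattern;

// Sketch only: names and patterns are assumptions, not the LVG implementation.
final class StripPunctuationSketch {
    // Step 2: floats and negative integers, e.g. "1.25", "-23.38", "-23".
    private static final Pattern FLOAT = Pattern.compile("-?\\d+(\\.\\d+)?");
    // Step 3: dates, e.g. "10/12/97", "10-12-00" (same separator twice).
    private static final Pattern DATE =
        Pattern.compile("\\d{1,2}([/-])\\d{1,2}\\1\\d{2,4}");
    // Step 4: telephone numbers, e.g. "301-435-3170", "301.435.3170".
    private static final Pattern PHONE =
        Pattern.compile("\\d{3}([.-])\\d{3}\\1\\d{4}");
    // Step 5: catalog numbers, e.g. "007.12.1234.07", "007-12-1234-07".
    private static final Pattern CATALOG = Pattern.compile("\\d+([.-]\\d+)+");

    /** Java general categories 20 (DASH_PUNCTUATION) to 27 (MODIFIER_SYMBOL). */
    static boolean isPunctuation(char c) {
        int type = Character.getType(c);
        return type >= Character.DASH_PUNCTUATION
            && type <= Character.MODIFIER_SYMBOL;
    }

    static boolean isProtected(String word) {
        return FLOAT.matcher(word).matches()
            || DATE.matcher(word).matches()
            || PHONE.matcher(word).matches()
            || CATALOG.matcher(word).matches();
    }

    static String removePunctuation(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (char c : word.toCharArray()) {
            if (!isPunctuation(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Step 1 tokenizes on whitespace; steps 2-5 decide which words to keep. */
    static String strip(String term) {
        StringBuilder out = new StringBuilder();
        for (String word : term.trim().split("\\s+")) {
            String kept = isProtected(word) ? word : removePunctuation(word);
            if (kept.isEmpty()) {
                continue; // the word was nothing but punctuation
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(kept);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("St. John's"));   // prints "St Johns"
        System.out.println(strip("301-435-2134")); // prints "301-435-2134"
    }
}
```

    Matching whole words keeps the sketch short, and it reproduces the four examples shown for this flow. Joining the kept words with single spaces also yields a trimmed result, in line with the trimming noted for the Java version.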

  • Source Code: ToStripPunctuationEnhanced.java

  • Hierarchy: Object -> Transformation -> ToStripPunctuationEnhanced