Ending Punctuation Splitter

  • Description:
    This splitter adds a space after ending punctuation when a token contains ending punctuation. Ending punctuation includes: . ? ! , : ; & ) ] }

  • Features:
    Splits a token at ending punctuation by inserting a space after it.

  • Examples:

    File Name    Input                 Output
    10023.txt    down.please           down. please
    10286.txt    ...my                 ... my
    10004.txt    cancer?if             cancer? if
    11186.txt    ?pls                  ? please
    97.txt       suggestions?thanks    suggestions? thanks
    53.txt       hello!can             hello! can
    11186.txt    ,she                  , she
    16823.txt    :by                   : by
    22.txt       ;syrinx               ; syrinx
    2.txt        )why                  ) why

  • Implementation Logic:
    • Recursively perform the following process:
    • Convert the input word to a coreTerm by stripping leading and ending punctuation, spaces, and digits.
    • Check whether the coreTerm contains ending punctuation. If yes:
      • Find the last ending punctuation
      • Check whether the coreTerm matches the exceptions for that ending punctuation. If not:
        • Add space before the ending punctuation
    • Check whether the prefix ends with ending punctuation. If yes:
      • Add space after the ending punctuation
    • Check whether the suffix contains ending punctuation. If yes:
      • Find the last ending punctuation
      • Check whether the suffix matches the exceptions for that ending punctuation. If not:
        • Add space before the ending punctuation
    • Convert the updated coreTerm back to the output term if a split happened in the coreTerm, prefix, or suffix.
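
    The recursive process above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the actual source code: the class name EndingPunctuationSplitSketch, the method split, and the isException predicate are hypothetical. The core term is approximated as the token minus its leading and trailing non-letter characters, the space is inserted after the punctuation (as in the Examples table), and suffix processing is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndingPunctuationSplitSketch {
    // Ending punctuation listed in the Description section.
    static final String ENDING_PUNCTUATION = ".?!,:;&)]}";

    // Assumption: prefix = leading non-letters, suffix = trailing non-letters,
    // coreTerm = everything in between.
    static final Pattern CORE_TERM =
        Pattern.compile("^([^\\p{L}]*)(.*?)([^\\p{L}]*)$", Pattern.DOTALL);

    // Index of the last ending punctuation in s, or -1 if there is none.
    static int lastEndingPunctuation(String s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            if (ENDING_PUNCTUATION.indexOf(s.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    // Splits a token recursively until no further split occurs.
    // isException is a hypothetical hook for the per-punctuation filters.
    static String split(String token, Predicate<String> isException) {
        Matcher m = CORE_TERM.matcher(token);
        if (!m.matches()) {
            return token;
        }
        String prefix = m.group(1);
        String core = m.group(2);
        String suffix = m.group(3);
        if (core.isEmpty()) {
            return token; // only digits, spaces, or punctuation: leave as is
        }
        // Core term: split after the last ending punctuation unless excepted.
        int idx = lastEndingPunctuation(core);
        if (idx >= 0 && !isException.test(core)) {
            core = core.substring(0, idx + 1) + " " + core.substring(idx + 1);
        }
        // Prefix: add a space after it when it ends with ending punctuation.
        if (!prefix.isEmpty()
                && ENDING_PUNCTUATION.indexOf(prefix.charAt(prefix.length() - 1)) >= 0) {
            prefix = prefix + " ";
        }
        String updated = prefix + core + suffix;
        if (updated.equals(token)) {
            return token; // no split happened, recursion ends
        }
        // Recurse on each piece; every piece is shorter than the token.
        List<String> pieces = new ArrayList<>();
        for (String piece : updated.split(" ")) {
            pieces.add(split(piece, isException));
        }
        return String.join(" ", pieces);
    }

    public static void main(String[] args) {
        System.out.println(split("down.please", s -> false));
        System.out.println(split("...my", s -> false));
        System.out.println(split("AT&T", s -> s.matches("[A-Z]+&[A-Z]+")));
    }
}
```

    Passing s -> false disables all exceptions. A fuller version would plug the per-punctuation filters into that predicate.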

  • Notes:
    • Baseline source code: PreProcSentence.java
    • Enhancement:
      • Do not use a dictionary
      • Add [:] as an ending punctuation
      • Remove hard-coded patterns of [NUM], [EMAIL], [URL]
      • Remove the leading punctuation [/] and [-] to increase precision
      • Implement exceptions separately for each ending punctuation
      • Use coreTermObj to split into prefix, coreTerm, suffix
      • Recursively split until no further split occurs
    • The punctuation marks @ and * might qualify as ending punctuation; this needs further analysis.
    • Action: Redesigned and implemented
    • Apply the non-dictionary splitter model with matchers and filters, using regular expressions for each ending punctuation. They are described in the following tables:
      Broader Generic Matchers
      • Contains Ending Punctuation
        Regular expression: ^.*[\\.\\?!,;:&\\)\\]\\}].*$
      • Email (false)
        Regular expression: ^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$
        Examples:
        • abc@gmail.com
        • !!@gamil.com
        • abc@123.net
      • Url (false)
        Regular expression: ^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$
        Examples:
        • http://www.yahoo.com
        • yahoo.com
        • yahoo.com?test=1%20try%20abc
      • Pure digit or punctuation (false)
        Regular expression: ^([\\W_\\d&&\\S]+)$
        Examples:
        • 123.500
        • 12-35-00
        • 12.35.00
        • !@#123$%^
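
      As a rough illustration, the four generic matchers can be compiled with java.util.regex and combined into one check. The class name GenericMatcherSketch and the method isSplitCandidate are hypothetical, and reading "(false)" as "the token must not match this pattern" is an assumption about the table, not something the notes state.

```java
import java.util.regex.Pattern;

class GenericMatcherSketch {
    // Regular expressions copied from the Broader Generic Matchers table.
    static final Pattern CONTAINS_ENDING_PUNCTUATION =
        Pattern.compile("^.*[\\.\\?!,;:&\\)\\]\\}].*$");
    static final Pattern EMAIL = Pattern.compile(
        "^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$");
    static final Pattern URL = Pattern.compile(
        "^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$");
    // Java character-class intersection: (non-word, underscore, or digit) AND non-space.
    static final Pattern PURE_DIGIT_OR_PUNCTUATION =
        Pattern.compile("^([\\W_\\d&&\\S]+)$");

    // Assumed combination: the token contains ending punctuation and is
    // not an email, not a URL, and not pure digits/punctuation.
    static boolean isSplitCandidate(String token) {
        return CONTAINS_ENDING_PUNCTUATION.matcher(token).matches()
            && !EMAIL.matcher(token).matches()
            && !URL.matcher(token).matches()
            && !PURE_DIGIT_OR_PUNCTUATION.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] tokens = {"down.please", "abc@gmail.com", "yahoo.com", "123.500", "hello"};
        for (String token : tokens) {
            System.out.println(token + " -> " + isSplitCandidate(token));
        }
    }
}
```

      Only down.please passes all four checks in this sketch; the other tokens are rejected by one of the "(false)" matchers or contain no ending punctuation.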

      Filters (Specific Exceptions for Each Ending Punctuation)
      • Period [.]
        1. Plural form
           Regular expression: (.*\\.s)
           Examples:
           • Dr.s
           • Mr.s
        2. Surrounded by digits
           Pattern: [char]*[digit].[digit][char]*
           Regular expression: ((\\w*\\d\\.\\d\\w*)+)
           Examples:
           • 16q22.1
           • 123.2
           • 123.234.4567
           • 1c3.2d4.4e6
        3. Surrounded by single characters
           Pattern: [single non-digit].[single non-digit]?
           Regular expression: ((\\D\\.)+\\D?)
           Examples:
           • D.C.A.B.
           • D.C.A.B
           • d.c.a.
           • d.c.a
           • D.c
        4. Followed by a hyphen
           Pattern: [word]*.-[word]*
           Regular expression: (\\w*\\.-\\w*)
           Examples:
           • St.-John
           • 123.-John
        5. Followed by a quote
           Pattern: [char]*.['"]
           Regular expression: (.*\\.['\"])
           Examples:
           • Mucinosis."
      • Question Mark [?]
        1. Followed by a quote
           Pattern: [char]*?['"]
           Regular expression: (.*\\?['\"])
           Examples:
           • ulcers?'
           • ulcers?"
      • Exclamation Mark [!]
        1. Followed by a quote
           Pattern: [char]*!['"]
           Regular expression: (.*!['\"])
           Examples:
           • ulcers!'
           • ulcers!"
      • Comma [,]
        1. Digit group separator
           Pattern: [digit]+,[digit]{3}
           Regular expression: (\\d+(,[\\d]{3})+)
           Examples:
           • 12,345
           • 1,234,567
      • Colon [:]
        1. Ratio
           Pattern: [digit]+:[digit]+
           Regular expression: (\\d+:\\d+)
           Examples:
           • 1:2
      • Semicolon [;]
        1. No exceptions found
           Regular expression: $^
           Examples: None
      • Ampersand [&]
        1. Abbreviations
           Pattern: [A-Z]+&[A-Z]+
           Regular expression: [A-Z]+&[A-Z]+
           Examples:
           • AT&T
           • R&D
      • Right Parenthesis [)]
        1. Single char surrounded by parentheses
           Pattern: [non-space]*([+char])[non-space]*
           Regular expression: ((\\S)*\\([+\\w]\\)(\\S)*)
           Examples:
           • homocyst(e)ine
           • NAD(P)H
           • RS(3)PE
           • D(+)HUS
        2. Chars surrounded by parentheses and followed by a hyphen
           Pattern: [non-space]*(char+)-[non-space]*
           Regular expression: ((\\S)*\\([+\\w]+\\)-(\\S)*)
           Examples:
           • Ca(2+)-ATPase
           • beta(2)-microglobulin
           • (Si)-synthase
           • (ADP)-ribose
        3. Digits surrounded by parentheses
           Pattern: [non-space]*(digit+)[non-space]*
           Regular expression: ((\\S)*\\(\\d+\\)(\\S)*)
           Examples:
           • VO(2)max
           • δ(18)O
           • (123)I-mIBG
           • (131)I
      • Right Square Bracket []]
        1. [digit]+[Upper] surrounded by []
           Pattern: [non-space]*[[digit]+[Upper]][non-space]*
           Regular expression: (\\S*\\[\\d+[A-Z]\\]\\S*)
           Examples:
           • [11C]MeG
           • [3H]-thymidine
           • [3H]tyrosine
        2. [lower] surrounded by []
           Pattern: [non-space]*[[lower]][non-space]*
           Regular expression: (\\S*\\[[a-z]\\]\\S*)
           Examples:
           • benzo[a]pyrene
           • B[e]P
      • Right Curly Brace [}]
        1. No exceptions found
           Regular expression: $^
           Examples: None
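
      The per-punctuation filters can be sketched as a lookup from each ending punctuation to its exception patterns. The class name ExceptionFilterSketch, the helper add, and the method isException are hypothetical. The regular expressions are copied from the table above, and whole-token matching via Matcher.matches() is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class ExceptionFilterSketch {
    // Ending punctuation -> exception patterns from the Filters table.
    static final Map<Character, List<Pattern>> FILTERS = new HashMap<>();

    // Registers the exception patterns for one ending punctuation.
    static void add(char punctuation, String... regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        FILTERS.put(punctuation, patterns);
    }

    static {
        add('.', "(.*\\.s)", "((\\w*\\d\\.\\d\\w*)+)", "((\\D\\.)+\\D?)",
                 "(\\w*\\.-\\w*)", "(.*\\.['\"])");
        add('?', "(.*\\?['\"])");
        add('!', "(.*!['\"])");
        add(',', "(\\d+(,[\\d]{3})+)");
        add(':', "(\\d+:\\d+)");
        add(';', "$^"); // no exceptions: never matches a non-empty token
        add('&', "[A-Z]+&[A-Z]+");
        add(')', "((\\S)*\\([+\\w]\\)(\\S)*)", "((\\S)*\\([+\\w]+\\)-(\\S)*)",
                 "((\\S)*\\(\\d+\\)(\\S)*)");
        add(']', "(\\S*\\[\\d+[A-Z]\\]\\S*)", "(\\S*\\[[a-z]\\]\\S*)");
        add('}', "$^"); // no exceptions: never matches a non-empty token
    }

    // True when the whole term matches any exception for the punctuation,
    // meaning the term should not be split at that punctuation.
    static boolean isException(char punctuation, String term) {
        List<Pattern> patterns =
            FILTERS.getOrDefault(punctuation, Collections.<Pattern>emptyList());
        for (Pattern pattern : patterns) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isException('.', "16q22.1"));
        System.out.println(isException('&', "AT&T"));
        System.out.println(isException('.', "down.please"));
    }
}
```

      Keeping one pattern list per punctuation mirrors the note that exceptions are implemented separately for each ending punctuation, so a filter for one character can be tuned without affecting the others.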

  • Source Code: