Ending Punctuation Splitter
- Description:
- Features:
- Examples:
- Implementation Logic:
- Recursively perform the following process:
  - Convert the input word to a coreTerm by stripping off leading and trailing punctuation, spaces, and digits.
  - Check whether the coreTerm contains ending punctuation. If yes:
    - Find the last ending punctuation.
    - Check whether the coreTerm matches an exception for that ending punctuation. If not:
      - Add a space after the ending punctuation.
  - Check whether the prefix ends with ending punctuation. If yes:
    - Add a space after the ending punctuation.
  - Check whether the suffix contains ending punctuation. If yes:
    - Find the last ending punctuation.
    - Check whether the suffix matches an exception for that ending punctuation. If not:
      - Add a space before the ending punctuation.
  - Convert the updated coreTerm back to the output term if a split happened in the coreTerm, prefix, or suffix.
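The process above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the original implementation: the class `EndingPunctSplitSketch` and all of its method names are hypothetical, the coreTerm is approximated as the span between the first and last letter of the token, the exception check is reduced to one sample pattern (abbreviations such as AT&T), the space is inserted after the punctuation as in the examples given for this splitter, and suffix handling is left out.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

/** Hypothetical sketch of the recursive split; not the original source code. */
class EndingPunctSplitSketch {
    // Ending punctuation named in this document: . ? ! , : ; & ) ] }
    static final String ENDING_PUNCT = ".?!,:;&)]}";

    // Sample exception only (assumption): abbreviations such as AT&T or R&D.
    static final Pattern AMPERSAND_ABBREVIATION = Pattern.compile("[A-Z]+&[A-Z]+");

    /** Returns {prefix, coreTerm, suffix}; prefix and suffix contain no letters. */
    static String[] decompose(String word) {
        int start = 0;
        int end = word.length();
        while (start < end && !Character.isLetter(word.charAt(start))) start++;
        while (end > start && !Character.isLetter(word.charAt(end - 1))) end--;
        return new String[] {
            word.substring(0, start), word.substring(start, end), word.substring(end)
        };
    }

    /** Index of the last ending punctuation in s, or -1 if there is none. */
    static int lastEndingPunct(String s) {
        for (int i = s.length() - 1; i >= 0; i--) {
            if (ENDING_PUNCT.indexOf(s.charAt(i)) >= 0) return i;
        }
        return -1;
    }

    /** Stand-in for the per-punctuation exception filters. */
    static boolean isException(String coreTerm, char punct) {
        return punct == '&' && AMPERSAND_ABBREVIATION.matcher(coreTerm).matches();
    }

    /** One pass over prefix and coreTerm; returns the input unchanged if nothing splits. */
    static String splitOnce(String word) {
        String[] parts = decompose(word);
        String prefix = parts[0];
        String core = parts[1];
        String suffix = parts[2];
        if (core.isEmpty()) return word; // only digits, spaces, or punctuation
        int idx = lastEndingPunct(core);
        if (idx >= 0 && !isException(core, core.charAt(idx))) {
            core = core.substring(0, idx + 1) + " " + core.substring(idx + 1);
        }
        if (!prefix.isEmpty() && ENDING_PUNCT.indexOf(prefix.charAt(prefix.length() - 1)) >= 0) {
            prefix = prefix + " ";
        }
        return prefix + core + suffix; // suffix is passed through untouched in this sketch
    }

    /** Repeats splitOnce on every resulting piece until no piece changes. */
    static String split(String word) {
        String once = splitOnce(word);
        if (once.equals(word)) return word;
        StringJoiner joined = new StringJoiner(" ");
        for (String piece : once.split(" ")) {
            joined.add(split(piece)); // each piece is shorter than word, so this terminates
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("down.please")); // prints "down. please"
        System.out.println(split("...my"));       // prints "... my"
        System.out.println(split("AT&T"));        // prints "AT&T"
    }
}
```

The sketch assumes input tokens that contain no spaces of their own; the recursion step rejoins pieces with single spaces.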
- Notes:
- Baseline source code: PreProcSentence.java
- Enhancement:
  - Does not use a dictionary
  - Adds [:] as an ending punctuation
  - Removes the hard-coded patterns [NUM], [EMAIL], and [URL]
  - Removes the leading punctuation [/] and [-] to increase precision
  - Implements exceptions separately for each ending punctuation
  - Uses coreTermObj to split a token into prefix, coreTerm, and suffix
  - Splits recursively until no further split occurs
- The punctuation marks @ and * might qualify as ending punctuation; this needs further analysis.
- Action: Redesigned and implemented
- Applies the non-dictionary splitter model with matchers and filters, using a regular expression for each ending punctuation. They are described below:
Broader Generic Matchers
- Contains Ending Punctuation
  - Regular expression: ^.*[\\.\\?!,;:&\\)\\]\\}].*$
- Email (false)
  - Regular expression: ^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$
  - Examples:
    - abc@gmail.com
    - !!@gamil.com
    - abc@123.net
- Url (false)
  - Regular expression: ^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$
  - Examples:
    - http://www.yahoo.com
    - yahoo.com
    - yahoo.com?test=1%20try%20abc
- Pure digit or punctuation (false)
  - Regular expression: ^([\\W_\\d&&\\S]+)$
  - Examples:
    - 123.500
    - 12-35-00
    - 12.35.00
    - !@#123$%^
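As a rough illustration of how the broader matchers could be combined, the Java sketch below compiles the four regular expressions listed above and applies them in order. The class name `BroadMatcherSketch` and the method `isSplitCandidate` are hypothetical, and the reading of "(false)" as "a match means the token is not split" is an assumption rather than something the listing states outright.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the broader generic matchers; not the original source code. */
class BroadMatcherSketch {
    static final Pattern HAS_ENDING_PUNCT =
        Pattern.compile("^.*[\\.\\?!,;:&\\)\\]\\}].*$");
    static final Pattern EMAIL =
        Pattern.compile("^[\\w!#$%&'*+-/=?^_`{|}~]+@(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net)))$");
    static final Pattern URL =
        Pattern.compile("^((ftp|http|https|file)://)?(\\w+(\\.\\w+)*(\\.(gov|com|org|edu|mil|net|uk)).*)$");
    static final Pattern DIGIT_OR_PUNCT =
        Pattern.compile("^([\\W_\\d&&\\S]+)$");

    /**
     * Returns true when the token contains ending punctuation and is not an
     * email address, a URL, or a pure digit/punctuation string (assumed semantics).
     */
    static boolean isSplitCandidate(String token) {
        if (!HAS_ENDING_PUNCT.matcher(token).matches()) return false;
        if (EMAIL.matcher(token).matches()) return false;
        if (URL.matcher(token).matches()) return false;
        if (DIGIT_OR_PUNCT.matcher(token).matches()) return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSplitCandidate("down.please"));   // prints "true"
        System.out.println(isSplitCandidate("abc@gmail.com")); // prints "false"
        System.out.println(isSplitCandidate("yahoo.com"));     // prints "false"
        System.out.println(isSplitCandidate("12.35.00"));      // prints "false"
    }
}
```

A token that passes this check would still need to be tested against the exceptions for its specific ending punctuation before any space is inserted.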
Filters (Specific Exceptions for Each Ending Punctuation)
Ending Punctuation | Filter (Exception) | Regular Expression | Examples |
---|---|---|---|
Period [.] | 1. Plural form | (.*\\.s) | Dr.s; Mr.s |
Period [.] | 2. Surrounded by digits: [char]*[digit].[digit][char]* | ((\\w*\\d\\.\\d\\w*)+) | 16q22.1; 123.2; 123.234.4567; 1c3.2d4.4e6 |
Period [.] | 3. Surrounded by single characters: [single non-digit].[single non-digit]? | ((\\D\\.)+\\D?) | D.C.A.B.; D.C.A.B; d.c.a.; d.c.a; D.c |
Period [.] | 4. Followed by a hyphen: [word]*.-[word]* | (\\w*\\.-\\w*) | St.-John; 123.-John |
Period [.] | 5. Followed by a quote: [char]*.['"] | (.*\\.['\"]) | Mucinosis." |
Question Mark [?] | 1. Followed by a quote: [char]*?['"] | (.*\\?['\"]) | ulcers?'; ulcers?" |
Exclamation Mark [!] | 1. Followed by a quote: [char]*!['"] |  | ulcers!'; ulcers!" |
Comma [,] | 1. Digit group separator: [digit]+,[digit]{3} | (\\d+(,[\\d]{3})+) | 12,345; 1,234,567 |
Colon [:] | 1. Ratio: [digit]+:[digit]+ | (\\d+:\\d+) | 1:2 |
Semicolon [;] | 1. No exceptions found | $^ | None |
Ampersand [&] | 1. Abbreviations: [A-Z]+&[A-Z]+ | [A-Z]+&[A-Z]+ | AT&T; R&D |
Right Parenthesis [)] | 1. Single char surrounded by parentheses: [non-space]*([+char])[non-space]* | ((\\S)*\\([+\\w]\\)(\\S)*) | homocyst(e)ine; NAD(P)H; RS(3)PE; D(+)HUS |
Right Parenthesis [)] | 2. Chars surrounded by parentheses and followed by a hyphen: [non-space]*(char+)-[non-space]* | ((\\S)*\\([+\\w]+\\)-(\\S)*) | Ca(2+)-ATPase; beta(2)-microglobulin; (Si)-synthase; (ADP)-ribose |
Right Parenthesis [)] | 3. Digits surrounded by parentheses: [non-space]*(digit+)[non-space]* | ((\\S)*\\(\\d+\\)(\\S)*) | VO(2)max; δ(18)O; (123)I-mIBG; (131)I |
Right Square Bracket []] | 1. [digit]+[Upper] surrounded by []: [non-space]*[[digit]+[Upper]][non-space]* | (\\S*\\[\\d+[A-Z]\\]\\S*) | [11C]MeG; [3H]-thymidine; [3H]tyrosine |
Right Square Bracket []] | 2. [lower] surrounded by []: [Upper]+ | (\\S*\\[[a-z]\\]\\S*) | benzo[a]pyrene; B[e]P |
Right Curly Brace [}] | 1. No exceptions found | $^ | None |
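One way to organize the per-punctuation filters is a lookup from each ending punctuation to its list of exception patterns, as in the Java sketch below. The regular expressions are copied from the filter listing above; the class `EndingPunctFilterSketch`, the helper `register`, and the method `isException` are hypothetical names chosen for illustration, and full-token matching with `matches()` is an assumption. No pattern is registered for the exclamation mark because its regular expression is not recoverable from the listing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch of the exception filters; not the original source code. */
class EndingPunctFilterSketch {
    private static final Map<Character, List<Pattern>> EXCEPTIONS = new HashMap<>();

    /** Compiles and stores the exception patterns for one ending punctuation. */
    private static void register(char punct, String... regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String regex : regexes) {
            patterns.add(Pattern.compile(regex));
        }
        EXCEPTIONS.put(punct, patterns);
    }

    static {
        register('.', "(.*\\.s)", "((\\w*\\d\\.\\d\\w*)+)", "((\\D\\.)+\\D?)",
                 "(\\w*\\.-\\w*)", "(.*\\.['\"])");
        register('?', "(.*\\?['\"])");
        // '!' is intentionally not registered: its regular expression is missing above.
        register(',', "(\\d+(,[\\d]{3})+)");
        register(':', "(\\d+:\\d+)");
        register(';', "$^"); // no exceptions: never matches a non-empty token
        register('&', "[A-Z]+&[A-Z]+");
        register(')', "((\\S)*\\([+\\w]\\)(\\S)*)", "((\\S)*\\([+\\w]+\\)-(\\S)*)",
                 "((\\S)*\\(\\d+\\)(\\S)*)");
        register(']', "(\\S*\\[\\d+[A-Z]\\]\\S*)", "(\\S*\\[[a-z]\\]\\S*)");
        register('}', "$^"); // no exceptions: never matches a non-empty token
    }

    /** True when the whole token matches any exception pattern for the given punctuation. */
    static boolean isException(String token, char punct) {
        List<Pattern> patterns = EXCEPTIONS.get(punct);
        if (patterns == null) return false;
        for (Pattern pattern : patterns) {
            if (pattern.matcher(token).matches()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isException("16q22.1", '.'));     // prints "true"
        System.out.println(isException("down.please", '.')); // prints "false"
        System.out.println(isException("12,345", ','));      // prints "true"
    }
}
```

Keeping the patterns in a map keyed by punctuation mirrors the note that exceptions are implemented separately for each ending punctuation, and makes it easy to add a pattern for one mark without touching the others.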
- Source Code:
Description: This splitter inserts a space after an ending punctuation mark when a token contains one. Ending punctuation includes: .?!,:;&)]}
Features: Splits a token after its ending punctuation.
Examples:
File Name | Input | Output |
---|---|---|
10023.txt | down.please | down. please |
10286.txt | ...my | ... my |
10004.txt | cancer?if | cancer? if |
11186.txt | ?pls | ? please |
97.txt | suggestions?thanks | suggestions? thanks |
53.txt | hello!can | hello! can |
11186.txt | ,she | , she |
16823.txt | :by | : by |
22.txt | ;syrinx | ; syrinx |
2.txt | )why | ) why |