Remove parenthetic plural forms: (s), (es), and (ies)

I. Introduction

There are terms include (s), (es), and (ies) to represent plural form of noun. This becomes a very challenge issue for normalization due to various ways of representation. This study is try to normalize (s), (es), and (ies) for all terms in Metathesaurus by removing them or replacing them with space.

II. Problem Definition

In regular English, we could normalize terms with plural forms of (s), (es), (ies) by removing them. For example:

Original termNormalized term
Abuse, drug(s), heroinAbuse, drug, heroin
Abdomen CT Adrenal Mass(es) Bilateral Abdomen CT Adrenal Mass Bilateral
Donor pneumonectomy(ies) with preparation and maintenance of allograft (cadaver) Donor pneumonectomy with preparation and maintenance of allograft (cadaver)

In our research, all (es) and (ies) in English terms should be removed without any exceptions. However, patterns of (s) appear in chemical, protein, and other expressions do not mean plural forms. In other words, (s) should not be removed in such cases. For example:

  • 1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine
  • 9(s)-erythromycylamine
  • anatoxin-b(s)
  • Ap(s)pCHClpp(s)A
  • Bacillus phage rho11(s)
  • Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe
  • EAV G(s) glycoprotein
  • G(s), alpha Subunit
  • Histone H1(s)
  • J(s)(b) ANTIBODY
  • N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer
  • natoxin-a(s)
  • Salmonella II 6,7:(g),m,(s),t:1,5
  • (s)-(+)-citreofuran
  • su(s) protein, Drosophila
  • XLalpha(s) protein

In addition, there are some cases that (s) should be replaced with a space instead of removed. For example:

Original termNormalized term
[X]O spontn disrptn/lig(s)knee [X]O spontn disrptn/lig knee
O spontn disrptn/lig(s)knee O spontn disrptn/lig knee

III. Governing Patterns

Fortunately, there are certain patterns we can govern and derived. First of all, the patterns of (s) and (es) we are dealing with are the plural forms of nouns. In other words, the word before this (s) pattern should be a noun. Thus, if we observe the word (characters) in front of (s), there are certain rules we can derived. Table below shows derived rules from above examples:

Sample termPattern in front of (s)Distance
1-N-(s)-4-amino-2-hydroxybutyryl-3'4'-deoxyneamine punctuation - 1
9(s)-erythromycylamine arabic number 9 1
anatoxin-b(s) punctuation - 2
Ap(s)pCHClpp(s)A
  • size of front word Ap <= 2
  • special pattern pp
1
Bacillus phage rho11(s) arabic number 11 1
Cbz-AAPhepsi((s)-CH(OH)CH2)GlyVV-OMe punctuation ( 1
EAV G(s) glycoprotein
  • size of front word J <= 2
  • space " "
2
G(s), alpha Subunit size of front word G <= 2n/A
Histone H1(s)
  • arabic number 1
  • size of front word H1 <= 2
1
J(s)(b) ANTIBODY size of front word J <= 2n/A
N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer
  • size of front word J <= 2
  • space " "
1
natoxin-a(s) punctuation - 2
Salmonella II 6,7:(g),m,(s),t:1,5 punctuation , 1
(s)-(+)-citreofuran size of front word <= 2n/A
su(s) protein, Drosophila size of front word su <= 2n/A
XLalpha(s) protein greek letters alpha1

As for the issue of remove (s) or replace (s) with a space, it is determined by the character following (s). If there is a word follows (s), then it should be replaced by a space. If not, it can be simply removed. Legal English word usually starts with a letter. Thus following rules is derived for cases where we replace (s) with space (" "):

Sample termCharacter after (s)
[X]O spontn disrptn/lig(s)knee k

IV. Wild Card Set

Let define some wild card character set before we solve the problem:

  • ^: start, where the word before (s) starts
  • $: end, where the word before (s) ends
  • C: any character
  • D: digit [0-9]
  • L any letter[a-z]
  • P: punctuation: [- ( ,]
  • S: space: [ ]

  • special patter: pp
  • Greek letter: alpha, beta, gamma, delta, epsilon, zeta, eta, theta, iota, kappa, lamda, mu, nu, xi, omikron, pi, rho, sigma, tau, upsilon, phi, chi, psi, omega

V. Rules Representation

RulesExamples
D$9(s)-erythromycylamine
P$Salmonella II 6,7:(g),m,(s),t:1,5
^$(s)-(+)-citreofuran
PC$N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer
SC$EAV G(s) glycoprotein
^C$J(s)(b) ANTIBODY
SCC$Histone H1(s)
^CC$su(s) protein, Drosophila
pp$Ap(s)pCHClpp(s)A
alpha$XLalpha(s) protein
beta$ 
gamma$ 
delta$ 
epsilon$ 
zeta$ 
eta$ 
theta$ 
iota$ 
kappa$ 
lamda$ 
mu$ 
nu$ 
xi$ 
omikron$ 
pi$ 
rho$ 
sigma$ 
tau$ 
upsilon$ 
phi$ 
chi$ 
psi$ 
omega$ 

Please notes that rule S$ is not used even it is good for example of

  • N(alpha)-benzoylarginineamide monohydrochloride, (s)-isomer

However, there are cases that this rule is not applicable. Such as
  • Brief psychotic disorder, with marked stressor (s)
  • Closed fracture of metacarpal bone (s), site unspecified
  • Paralysis extraocular muscle (s)

V. Algorithm

Below is the algorithm:

  • Locate (es) and (ies).
  • Remove (es) and (ies).

  • Locate (s)
  • Go through the reversed trie (formed from rules)
    • If match exceptional pattern, keep input term the same
    • If doesn't, check the character follows (s):
      • If it is a letter, replace (s) with a space " ".
      • If not, remove (s)

VI. Reversed trie of rules