DPairs by Spelling Variants

I. Introduction

From a linguistic standpoint, all spelling variants (SpVars) of a record are equivalent. Accordingly, every combination of SpVars between the two records of a dPair is itself a valid dPair. For example, the following dPairs are all valid:

  • color|noun|E0017902|colorful|adj|E0017909|O|S|None
  • color|noun|E0017902|colourful|adj|E0017909|O|S|None
  • colour|noun|E0017902|colorful|adj|E0017909|O|S|None
  • colour|noun|E0017902|colourful|adj|E0017909|O|S|None
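The four dPairs above are the Cartesian product of the SpVars of the two records. A minimal Python sketch of that expansion follows; the function name `expand_dpair` and its parameters are hypothetical, and the trailing `O|S|None` fields are treated as opaque tags that are copied unchanged.

```python
from itertools import product

def expand_dpair(spvars_1, spvars_2, cat_1, eui_1, cat_2, eui_2, tags="O|S|None"):
    """Pair every spelling variant of record 1 with every spelling variant of record 2.

    Hypothetical helper for illustration only; the field order follows the
    example lines in this document.
    """
    return [
        f"{w1}|{cat_1}|{eui_1}|{w2}|{cat_2}|{eui_2}|{tags}"
        for w1, w2 in product(spvars_1, spvars_2)
    ]

pairs = expand_dpair(
    ["color", "colour"], ["colorful", "colourful"],
    "noun", "E0017902", "adj", "E0017909",
)
for line in pairs:
    print(line)
```

With two SpVars on each side, the sketch yields the same four lines listed above.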

In other words,

  • the dPair tag should be the same for all dPairs (SpVar combinations) between the same 2 records
    # 17549|space|noun|E0056852|spacey|adj|E0234312|no
    # 33938|space|noun|E0056852|spacy|adj|E0234312|yes
  • the negation tag should likewise be the same for all dPairs between the same 2 records
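A consistency check for this rule can be sketched as follows. It assumes the record layout of the two `space` entries above (id|word1|cat1|EUI1|word2|cat2|EUI2|tag) and reads the last field as the tag to compare; the function name `find_inconsistent_tags` is hypothetical.

```python
from collections import defaultdict

def find_inconsistent_tags(lines):
    """Group records by (EUI1, EUI2) and return the groups that carry more than one tag.

    Assumed layout per line: id|word1|cat1|eui1|word2|cat2|eui2|tag
    """
    tags_by_pair = defaultdict(set)
    for line in lines:
        _id, _w1, _c1, eui_1, _w2, _c2, eui_2, tag = line.split("|")
        tags_by_pair[(eui_1, eui_2)].add(tag)
    return {pair: tags for pair, tags in tags_by_pair.items() if len(tags) > 1}

records = [
    "17549|space|noun|E0056852|spacey|adj|E0234312|no",
    "33938|space|noun|E0056852|spacy|adj|E0234312|yes",
]
conflicts = find_inconsistent_tags(records)
print(conflicts)
```

Under that reading, the two `space` entries are reported as a mismatch because they share the same EUI pair but carry different tags.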

II. DPairs by SpVars without matching characters

DPairs caused by spelling variants can have different starting and ending characters and thus violate the definitions of the dTypes zeroD, prefixD, and suffixD. For example, aestheticity|noun|E0604308|esthetic|adj|E0355942 is a valid dPair that arises, through a spelling variant, from the suffixD dPair estheticity|noun|E0604308|esthetic|adj|E0355942. This type of dPair is called a dPair by SpVars. They are commonly seen in nomD. Below is a summary of how we handle this type of dPair:

  • Linguistic standpoint:
    • Nominalization pairs caused by spVars are nominalizations
    • dPairs caused by spVars are dPairs
    • dPairs are related by lexical records
    • LRNOM (nominalization table): includes all nominalization pairs caused by SpVars
  • Lexical Tools:
    • Only directly related dPairs (those whose characters match up) are generated.
      • zeroD must have the same spelling
        • Some zeroDs that have no SpVars are excluded as a result, such as:
          lower-class|adj|E0038116|lower class|noun|E0038110|O|Z|None
          short-circuit|verb|E0055706|short circuit|noun|E0055703|O|Z|None
          water-ski|verb|E0065141|water ski|noun|E0330701|O|Z|None
          A better dType algorithm is needed to identify zeroD (from orgD and nomD) in SpVars cases.
      • prefixD must have the same ending characters (prefix + substring = prefixString)
      • suffixD must have the same starting characters (>= 2, with some exceptions) and the same number of " " and "-". For example, re-order|verb|E0312476|reordering|noun|E0500038 is excluded. This also excludes some nomD that have no SpVars at all, such as low density|noun|E0038126|lowdensity|adj|E0038140. A better dType algorithm is needed to identify suffixD (from nomD) in SpVars cases.
        • Some nomD do not have perfectly matching starting characters, such as usability|noun|E0063734|useable|adj|E0063735
        • A more sophisticated algorithm is needed to generate nomD from LEXICON (not from LRNOM).
    • dPairs caused by SpVars can be retrieved by combining the derivation flow and the spelling variants flow
      => This is similar to using the recursive derivation flow to get multiple levels of derivationally related variants.
    • Accordingly, only direct dPairs are included in the database tables (derivation.data and DM.DB in LEXICON)
    • All dPairs caused by SpVars will be removed from the above two tables!
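The suffixD rule and the retrieval of dPairs by SpVars described above can be sketched in Python. This is a simplified illustration, not the Lexical Tools implementation: all function names are hypothetical, the "with some exceptions" clause is ignored, the counts of " " and "-" are compared separately (an assumption), and the SpVar table contains only the spellings that appear in this document.

```python
from itertools import product

def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def is_direct_suffix_d(w1, w2, min_prefix=2):
    """Simplified suffixD test: at least min_prefix shared starting characters
    and the same number of spaces and of hyphens on both sides."""
    same_separators = (w1.count(" "), w1.count("-")) == (w2.count(" "), w2.count("-"))
    return common_prefix_len(w1, w2) >= min_prefix and same_separators

def expand_with_spvars(direct_pair, spvars_by_eui):
    """Combine a direct dPair with the spelling variants of both records."""
    w1, eui_1, w2, eui_2 = direct_pair
    variants_1 = spvars_by_eui.get(eui_1, [w1])
    variants_2 = spvars_by_eui.get(eui_2, [w2])
    return [(a, eui_1, b, eui_2) for a, b in product(variants_1, variants_2)]

# Only the direct pair is stored; the SpVar pair is derived on demand.
spvars_by_eui = {
    "E0604308": ["estheticity", "aestheticity"],
    "E0355942": ["esthetic"],
}
direct = ("estheticity", "E0604308", "esthetic", "E0355942")
expanded = expand_with_spvars(direct, spvars_by_eui)

print(is_direct_suffix_d("estheticity", "esthetic"))     # direct pair
print(is_direct_suffix_d("aestheticity", "esthetic"))    # dPair by SpVars
print(is_direct_suffix_d("re-order", "reordering"))      # hyphen count differs
print(is_direct_suffix_d("low density", "lowdensity"))   # space count differs
print(expanded)
```

In this sketch only estheticity|esthetic passes the direct test, while aestheticity|esthetic is recovered by the expansion step, so it does not need to be stored in the tables.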

III. Impacts

  • 2013 data:
    • prefixD: none
    • suffixD: none
    • zeroD: < 0.03% (4/15,037)
  • 2014 data:
    • nomD (32,932):
      • Z: 355
      • S: 32,001
      • P: 0

      • ZS: 116 (exclude from zeroD)
      • SS: 460 (exclude from suffixD)
      • PS: 0
    • orgD (3,928), for those that have EUIs:
      • P: 4
      • S: 3,549
      • Z: 220

      • PS: 0
      • SS: 47 (exclude from suffixD)
      • ZS: 17 (exclude from zeroD)

      • U: 91
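The figures above can be cross-checked with a few lines of Python. The dictionaries simply restate the counts listed in this section; the variable names are illustrative only.

```python
# 2013 data: share of zeroD affected (4 out of 15,037), as a percentage.
zero_d_2013_pct = 4 / 15_037 * 100

# 2014 data: per-category counts as listed above.
nomd_total = 32_932
nomd_counts = {"Z": 355, "S": 32_001, "P": 0, "ZS": 116, "SS": 460, "PS": 0}

orgd_total = 3_928
orgd_counts = {"P": 4, "S": 3_549, "Z": 220, "PS": 0, "SS": 47, "ZS": 17, "U": 91}

print(f"2013 zeroD share: {zero_d_2013_pct:.4f}%")
print("nomD counts sum to total:", sum(nomd_counts.values()) == nomd_total)
print("orgD counts sum to total:", sum(orgd_counts.values()) == orgd_total)
```

The per-category counts add up to the stated totals for both nomD and orgD, and the 2013 zeroD share stays below the 0.03% bound.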