Matcher - SpVar Pattern

I. Introduction

From our observation, if n-grams in the n-gram set match a spelling variant (spVar) pattern of a term, they are good candidates for MWEs. Spelling variants usually look alike and sound alike, as shown in the following examples:

EUI | Examples | Difference
E0055858 | sideeffect, side effect, side-effect | Spaces and hyphens
E001353 | bloodpressure, blood pressure, blood-pressure | Spaces and hyphens
E0000237 | Alzheimers disease, Alzheimer's disease, Alzheimers' disease | Genitive
E0000862 | Behcets disease, Behcet's disease, Behçets disease, Behçet's disease | Genitive and Unicode
E0000679 | BM, B.M. | Punctuation
E0008797 | analyse, analyze | Spelling
E0017903 | color, colour | Spelling
E0000857 | Beduin, beduin, Bedouin, bedouin | Cases and spelling
E0223708 | type 2 diabetes, type II diabetes, type-2 diabetes, type-II diabetes | Spaces, hyphens, and numbers

A spVar model covering more than 10 spelling variant types has been developed to identify spVars in a corpus. The model was tested on LRSPL (recall) and on the Lexicon (recall and precision), and then applied to MEDLINE.
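
Several of the variant types in the table above (spaces, hyphens, punctuation, genitive apostrophes, case, and accented characters) disappear once terms are reduced to a common key. The following Python sketch illustrates that idea only. It is not the project's SpVarNorm.java; the function names `spvar_norm_key` and `group_by_norm` and the exact normalization rules are assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def spvar_norm_key(term):
    """Hypothetical normalization key (an assumption, not SpVarNorm.java).

    Terms that differ only in case, spaces, punctuation, or diacritics
    map to the same key.
    """
    # Decompose accented characters, e.g. "ç" becomes "c" + combining cedilla.
    decomposed = unicodedata.normalize("NFKD", term)
    # Drop combining marks, keep only letters and digits, and lowercase them.
    return "".join(
        ch.lower()
        for ch in decomposed
        if not unicodedata.combining(ch) and ch.isalnum()
    )


def group_by_norm(terms):
    """Group terms that share the same normalization key."""
    groups = defaultdict(list)
    for term in terms:
        groups[spvar_norm_key(term)].append(term)
    return dict(groups)
```

A key of this kind cannot relate true spelling differences such as color/colour or analyse/analyze, nor number variants such as 2/II, which is why the models in the next section add further grouping steps.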

II. Models
The following algorithms are used to retrieve n-grams that match spVar patterns:

  • AMIA.2016 paper (May, 2016)
    • The distilled MEDLINE n-gram set, WC > 150

    • Group SpVar ByNorm
    • Group by MES: maxEditDist = 2
    • Group by ES: maxEditDist = 1
    • Group by MES: maxEditDist = 3
    • Group by ES: maxEditDist = 2
    • Group by MES: maxEditDist = 4

    • This model works well. However, further development is needed to improve its processing time and its performance (precision and recall on the Lexicon).
  • HealthInf.2017 paper
    • The distilled MEDLINE n-gram set, WC > 150

    • Group SpVar ByNorm
    • Group by M2CES: maxEditDist = 2

    • Lower the WC,
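
The "maxEditDist" values in the lists above bound how far apart two terms may be and still fall into the same group. The Python sketch below illustrates only that thresholding idea, assuming plain Levenshtein distance and single-link grouping. Both are assumptions: the metaphone and sorted-distance components named elsewhere on this page are not modeled, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def group_by_edit_distance(terms, max_edit_dist):
    """Hypothetical single-link grouping: terms within max_edit_dist are merged."""
    parent = list(range(len(terms)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison; the cost grows quadratically with the term count.
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if edit_distance(terms[i], terms[j]) <= max_edit_dist:
                parent[find(j)] = find(i)

    groups = {}
    for i, term in enumerate(terms):
        groups.setdefault(find(i), []).append(term)
    return list(groups.values())
```

For example, `group_by_edit_distance(["analyse", "analyze", "color", "colour", "bedouin"], 1)` places analyse/analyze and color/colour in two groups and leaves bedouin on its own.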

III. Processes

  • directory: ${MULTIWORDS_DIR}/bin
  • Run program: shell> ./08.MatcherSpVar ${YEAR}
  • Processes:

    Step | Description | Inputs | Outputs | Notes - Examples
    Unit Tests for SpVar software components
    Step 1: Unit Test: SpVar Norm
    SpVarNorm.java
    Inputs: None
    Outputs: normStr of sample Str
    Notes: Unit test on spVar norm
    • No need to run for candidate list
    Step 2: Unit Test: Metaphone
    Metaphone.java
    Inputs: None
    Outputs: metaphone of sample Str
    Notes: Unit test on metaphone
    • No need to run for candidate list
    Step 3: Unit Test: Edit Distance
    EditDistance.java
    Inputs: None
    Outputs: edit distance of 2 sample Strs
    Notes: Unit test on edit distance
    • No need to run for candidate list
    Step 4: Unit Test: Sorted Distance
    SortedDistance.java
    Inputs: None
    Outputs: sorted distance between sample Strs in a set of input terms
    Notes: Unit test on sorted distance
    • No need to run for candidate list
    Utility Tests for SpVar software components
    Step 10: SpVar on norm|Metaphone|Edit distance ..
    TestSpVarOnNormMpEd.java
    Inputs: None
    Outputs: norm|Metaphone|EditDistance
    Notes: Show results of two inStrs
    • Not needed for candidate list
    Step 11: GroupSpVarByNorm ..
    GetStdAndSpVarsFromLRSPL.java
    GroupSpVarByNorm.java
    • ./inData/LRSPL
    • LRSPL.std
      SpVar-1|SpVar-2|SpVar-3|...
    • LRSPL.data
      SpVar-1
      SpVar-2
      SpVar-3
      ...

    • ./unitTest/LRSPL.data.1.byNorm.out
    Retrieve LMW from LRSPL by spVarNorm
    • Not needed for candidate list
    Step 12: GroupSpVarByMES ..
    GroupSpVarByMES.java
    • ./unitTest/LRSPL.data.1.byNorm.out
    • ./unitTest/LRSPL.data.2.byMES.2.out
    Retrieve LMW from results of step-11 by MES
    • Not needed for candidate list
    Step 13: GroupSpVarByES ..
    GroupSpVarByES.java
    • ./unitTest/LRSPL.data.2.byMES.2.out
    • ./unitTest/LRSPL.data.3.byES.1.out
    Retrieve LMW from results of step-12 by ES
    • Not needed for candidate list
    Step 14: PrintOutSpVars ..
    PrintOutSpVars.java
    • ./unitTest/LRSPL.data.3.byES.1.out
    Split the spVar classes into singles and spVars, and print them out
    • ./unitTest/LRSPL.data.3.byES.1.out.std (= single + spVars)
    • ./unitTest/LRSPL.data.3.byES.1.out.notSpVars (single without spVar)
    • ./unitTest/LRSPL.data.3.byES.1.out.spVars (have spVars)
    • Not needed for candidate list
    Step 15: GetNormSpVarsTable ..
    GetNormSpVarsTable.java
    • ./unitTest/LRSPL.data
    • ./unitTest/LRSPL.data.notSpVar
    • ./unitTest/LRSPL.data.spVar
    Get a norm spVar|spVars table from a file (terms or n-grams)
    • Not needed for candidate list
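
Step 14 above writes three files: .std (all classes), .notSpVars (single terms without a spVar), and .spVars (classes that have spVars). The Python sketch below shows one way such a split could work, assuming each input line is one class in the `SpVar-1|SpVar-2|...` form shown for LRSPL.std. The function name and the in-memory return value are assumptions, not the behavior of PrintOutSpVars.java.

```python
def split_spvar_classes(lines, sep="|"):
    """Hypothetical split of spVar classes into std, notSpVars, and spVars lists."""
    std, not_spvars, spvars = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        members = line.split(sep)
        std.append(line)  # std = single + spVars
        if len(members) > 1:
            spvars.append(line)      # class has at least one spelling variant
        else:
            not_spvars.append(line)  # single term without a spVar
    return std, not_spvars, spvars
```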
    Analysis: Test on LRSPL
    Step 20: Convert file format from LRSP ..
    Step 21: Get SpVar from LRSPL - ByNorm ..
    Step 22: Get SpVar from LRSPL - ByMES (ED:2) ..
    Step 23: Get SpVar from LRSPL - ByES (ED:1) ..
    Step 24: Get SpVar from LRSPL - ByMES (ED:3) ..
    Step 25: Get SpVar from LRSPL - ByES (ED:2) ..
    Step 26: Get SpVar from LRSPL - ByMES (ED:4) ..
    Step 27: Print result of step 26 - ByMES (ED:4) ..
    Step 28: Test SpVar matcher on LRSPL - (Steps: 21-26) ..
    Step 29: Analysis: GetSpVarTypeFromLRSPL ..
    GetSpVarTypeFromLRSPL.java
    • ./inData/LRSPL
    • spVars.type
      = GENITIVE + NON_GENITIVE

    • spVars.type.GENITIVE
    • spVars.type.NON_GENITIVE
    • spVars.type.NON_GENITIVE.GENITIVE
    Analyze types of spVars:
      Type                Examples
      SVT_SPACE           lookup|look up
      SVT_CASE            AcG|ACG
      SVT_PUNC_DASH       lookup|look-up
      SVT_PUNC_PERIOD     AAMD|A.A.M.D.
      SVT_PUNC_OTHERS     anti-HB(s)|antiHBs
      SVT_GENITIVE        Addisons|Addison's
      SVT_GENITIVE_S      Alzheimer|Alzheimer's
      SVT_GENITIVE_P      Alzheimer|Alzheimers'
      SVT_GENITIVE_SS     Alzheimer|Alzheimer'S
      SVT_GENITIVE_PP     Alzheimer|AlzheimerS'
      SVT_NUMBER          3|three
      SVT_RANK            2nd|second
      SVT_SYNONYM         St.|Saint
      SVT_SPVAR           antitumour|antitumor
      SVT_UNICODE         æcidium|aecidium
      SVT_TBD             advertize|advertise

    • No need to run for candidate list
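
The simplest types in the step 29 table can be recognized by removing one kind of character and comparing what remains. The Python sketch below covers only four of the listed types (SVT_SPACE, SVT_CASE, SVT_PUNC_DASH, SVT_PUNC_PERIOD) and returns None for anything else. The function name, the order of the checks, and the None fallback are assumptions; the actual logic of GetSpVarTypeFromLRSPL.java is not described on this page.

```python
def classify_spvar_pair(a, b):
    """Hypothetical classifier for four of the spVar types listed above."""
    if a == b:
        return None  # identical strings are not variants
    if a.replace(" ", "") == b.replace(" ", ""):
        return "SVT_SPACE"        # differ only in spaces
    if a.lower() == b.lower():
        return "SVT_CASE"         # differ only in letter case
    if a.replace("-", "") == b.replace("-", ""):
        return "SVT_PUNC_DASH"    # differ only in hyphens
    if a.replace(".", "") == b.replace(".", ""):
        return "SVT_PUNC_PERIOD"  # differ only in periods
    return None  # genitive, number, rank, and other types are not covered here
```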
    Analysis: Test on Lexicon (inflVars.data) for AMIA paper Table-2
    Step 30: Get inflectional SpVar from Lexicon
    • GetInflSpVarsFromLexicon.java
    Step 31: Get gold standard for SpVars from Lexicon
    • GetGoldStdFromLex.java
    Step 32: Get SpVar from Lex - ByNorm
    • GroupSpVarByNorm.java
    Step 33: Get SpVar from Lex - ByMES (ED:2)
    • GroupSpVarByMES.java
    • ED = 2
    • Time: hr.
    Steps 33A, B, C, D, E: Get SpVar from Lex - ByM2ES, M3ES, C2ES, M2CES, M3CES (ED:2)
    • GroupSpVarByXXXX.java
    • ED = 2
    • Time: ~ 2 hr.
    Step 34: Get SpVar from Lex - ByES (ED:1)
    • GroupSpVarByES.java
    • ED = 1
    • Time: hr.
    Step 35: Get SpVar from Lex - ByMES (ED:3)
    • GroupSpVarByMES.java
    • ED = 3
    • Time: hr.
    Step 36: Get SpVar from Lex - ByES (ED:2)
    • GroupSpVarByES.java
    • ED = 2
    • Time: hr.
    Step 37: Get SpVar from Lex - ByMES (ED:4)
    • GroupSpVarByMES.java
    • ED = 4
    • Time: hr.
    Step 37: Get PRF for above tests (steps 31-36 must be completed first)
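
The PRF step above compares the spVar classes found by each grouping method with the gold standard from step 31. This page does not say how the scores are computed, so the Python sketch below makes an explicit assumption: every class is expanded into unordered term pairs, and precision, recall, and F1 are computed over those pairs. The function names are hypothetical.

```python
from itertools import combinations


def class_pairs(classes):
    """Expand spVar classes into a set of unordered term pairs (assumed scoring unit)."""
    pairs = set()
    for members in classes:
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs


def precision_recall_f1(predicted_classes, gold_classes):
    """Pair-based precision, recall, and F1; returns 0.0 where a ratio is undefined."""
    predicted = class_pairs(predicted_classes)
    gold = class_pairs(gold_classes)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```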
    Pre-Process
    Step 40: Group nGrams by core-term
    GroupByCoreTerm.java
    Same as using option 11 of 6.NGramUtil
    • ${NGRAM_DIR}distilledNGram.${YEAR}
    • ./Candidates/distilledNGram.${YEAR}.core
    • ./Candidates/distilledNGram.${YEAR}.detail
    • Group distilled n-grams by core-term
    • The distilled n-gram set must be finished first
    Step 41: Get terms from nGramSet (filtered by WC, sorted)
    NGramWcTermFilter.java
    • distilledNGram.${YEAR}.core
    • distilledNGram.${YEAR}.core.${WC}.sort
    • distilledNGram.${YEAR}.core.${WC}.sort.term
    Filter by WC (default 150) and sort
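
Step 41 keeps only the n-grams whose word count (WC) reaches a threshold and sorts what is left. The Python sketch below illustrates this under stated assumptions: each line is pipe-delimited with the WC in field 1 and the term in field 2 (this page only says the term is field 2), the cutoff is inclusive, and the sort is by descending WC. None of this is taken from NGramWcTermFilter.java, and the function name is hypothetical.

```python
def filter_by_wc(lines, min_wc=150, sep="|"):
    """Hypothetical WC filter: keep (wc, term) pairs with wc >= min_wc, sorted."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # not in the assumed "WC|term" layout
        try:
            wc = int(fields[0])
        except ValueError:
            continue  # field 1 is not a count
        if wc >= min_wc:
            kept.append((wc, fields[1]))
    # Highest counts first; ties are broken alphabetically by term.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return kept
```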
    Process: Apply spVar on MEDLINE
    Step 50: Apply SpVar on Medline - ByNorm
    • 1 min (AMIA.init, WC: 150)
    • 2 min (HealthInf, WC: 100)
    • 3 min (HealthInf, WC: 50)
    • 5 min (HealthInf, WC: 30)
    • distilledNGram.${YEAR}.core.WC.sort (manually remove lines with counts below WC)
    • distilledNGram.${YEAR}.core.WC.sort.term (field 2 of the above file)
    ./${YEAR}/outData/08.MatcherSpVar/Medline/
    • medline.1.byNorm.out.150
    • medline.1.byNorm.out.100
    • medline.1.byNorm.out.50
    • medline.1.byNorm.out.30
    Step 51: Apply SpVar on Medline - ByM2CES
    • 2 hr (Amia, WC:150)
    • 23 hr (2015, WC:100)
    • 4 days 18 hr (2015, WC:50)
    • 14 days 8 hr (2015, WC:30)
    • 16 days 23 hr (2016, WC:30)
    ./${YEAR}/outData/08.MatcherSpVar/Medline/
    • medline.2.byM2CES.2.out.150
    • medline.2.byM2CES.2.out.100
    • medline.2.byM2CES.2.out.50
    • medline.2.byM2CES.2.out.30
    Step 51A: Print out SpVarClass results of Step-51 to files
    • 1 min (HealthInf)
    ./${YEAR}/outData/08.MatcherSpVar/Medline/
    • medline.2.byM2CES.2.out.30.notSpVars
    • medline.2.byM2CES.2.out.30.spVars
    • medline.2.byM2CES.2.out.30.std
    Step 52: Apply SpVar on Medline - ByMES (ED:2)
    • 11 hr (AMIA-lexdev1)
    Step 53: Apply SpVar on Medline - ByES (ED:1)
    • 5 days 6 hr (AMIA-lexdev1)
    Step 54: Apply SpVar on Medline - ByMES (ED:3)
    • 30 min. (AMIA-lexdev1)
    Step 55: Apply SpVar on Medline - ByES (ED:2)
    • 5 days 5 hr (AMIA-lexdev1)
    Step 56: Apply SpVar on Medline - ByMES (ED:4)
    • 20 min. (AMIA-lexdev1)
    Results
    Step 57: Apply SpVar on Medline (Steps 51-56)
    • 11~12 days

    GetSpVarClassByTerm.java
    • distilledNGram.${YEAR}.core.${WC}.sort.term
    It takes 12 days to generate the spVar classes (WC >= 150), which are the LMW candidates:
    • norm
    • MES 1: maxEditDist=2
    • ES 1: maxEditDist=1
    • MES 2: maxEditDist=3
    • ES 2: maxEditDist=2
    • MES 3: maxEditDist=4
    • Print out results
    Step 58: Print out SpVarClass results to files
    • distilledNGram.${YEAR}.core.${WC}.sort.term.std
    • distilledNGram.${YEAR}.core.${WC}.sort.term.notSpVars
    • distilledNGram.${YEAR}.core.${WC}.sort.term.spVars
    Results
    Process: Generate LMW candidates from SpVar, Cui, Wc
    Step 60: Get LMW candidates from spVar file with CUI
    • GetCanFromSpVarCui.java
    • 90 min.
    • ./Medline/medline.2.byM2CES.2.out.30.spVars
    • ${MED_DIR}/distilledNGram.2015.core.30 (for WC)
    • ${STMT_DIR}/data/Config/smt.properties (for SMT to get CUI)
    • ./Candidates/medline.2.byM2CES.2.out.30.spVars.cui
    • All fields are tokenized into terms.
    • Check if the term has CUI information from SMT
    • Filter out terms without CUI in the spVar class
    • Remove the spVar class if only one term in the class
    • Retrieve WC information
    Step 61: Apply WC to LMW candidates from above spVar-CUI file
    • ApplyWcToSpVarCuiCanFile.java
    • ./Candidates/medline.2.byM2CES.2.out.30.spVars.cui
    • WC_BASE
    • UP_LIMIT
    • DOWN_LIMIT
    • ./Candidates/medline.2.byM2CES.2.out.30.spVars.cui.raw
    • ./*.spVars.cui.raw.WC_BASE.UP_LIMIT.DOWN_LIMIT
    Step 61A: Apply 5 WC sets to LMW candidates
    • ApplyWcToSpVarCuiCanFile.java
    • 15 sec.
    • medline.2.byM2CES.2.out.30.spVars.cui
    • *.raw.100.0.500
    • *.raw.1000.0.500
    • *.raw.10000.0.500
    • *.raw.100000.0.500
    • *.raw.1000000.0.500
    • WC: 1000K, 100K, 10K, 1K, 100
    • UP_LIMIT: 0
    • DOWN_LIMIT: 500
    Step 62: Auto tag spVar-CUI-WC file
    • TagSpVarWcCandidateFile.java
    • Use 62A instead
    Step 62A: Auto tag 5 WC sets of spVar-CUI files
    • TagSpVarWcCandidateFile.java
    • 15 sec.
    • *.raw.100.0.500
    • *.raw.100.50.50.tag.${YEAR}
    • *.raw.100.50.50.tag.${YEAR}.no
    • *.raw.100.50.50.tag.${YEAR}.tbd
    • *.raw.100.50.50.tag.${YEAR}.yes
    • *.raw.100.50.50.tag.${YEAR}.yesNo
    • *.raw.100.50.50.tag.${YEAR}.can (candidate file)
      => Same as the tag file except:
      • remove the tag if the tag is TBD
      • remove the spVar class if all candidates are auto-tagged
    • WC:1000K, 100K, 10K, 1K, 100
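
Step 60 above filters each spVar class in two passes: terms without a CUI are dropped, and a class is removed if only one term remains. The Python sketch below illustrates those two rules. The `has_cui` argument is a hypothetical callable standing in for the SMT lookup, whose real interface is not shown on this page, and the function name is an assumption.

```python
def filter_classes_by_cui(classes, has_cui):
    """Hypothetical version of the step 60 filter.

    classes: list of spVar classes, each a list of terms.
    has_cui: callable(term) -> bool, a stand-in for the SMT CUI lookup.
    """
    kept = []
    for members in classes:
        # Filter out terms that have no CUI information.
        with_cui = [term for term in members if has_cui(term)]
        # Remove the class if only one term (or none) is left.
        if len(with_cui) > 1:
            kept.append(with_cui)
    return kept
```

In a quick trial the lookup can be faked with a set of terms known to have a CUI, for example `filter_classes_by_cui(classes, known_terms.__contains__)`.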
    Post-Process: TBD
    Step 70: Add word count to SpVar class
    AddWcToSpVarClass.java
    • distilledNGram.${YEAR}.core.${WC}.sort.term.spVars
    • distilledNGram.${YEAR}.core.${WC}.sort
    • distilledNGram.${YEAR}.core.${WC}.sort.term.spVars.wc
    Add WC back to spVar1|spVar2|..
    Step 71: Sort by reversed string ... (requested by Lynn)
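
Step 71 asks for a sort by reversed string. One reading, assumed here, is that terms are compared from their last character backwards, so that terms with the same ending end up next to each other. The Python sketch below shows that reading; the function name is hypothetical and no program for this step is named on this page.

```python
def sort_by_reversed_string(terms):
    """Sort terms by their reversed spelling (assumed meaning of step 71)."""
    return sorted(terms, key=lambda term: term[::-1])
```

With the input `["antitumor", "antitumour", "color", "colour"]`, the terms ending in "-or" are placed before the terms ending in "-our".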