Spelling Variant Patterns - Test on Lexicon-LRSPL

Norm, MES, and ES are applied sequentially to retrieve as many spelling variant groups as possible. This model is tested on the Lexicon (inflVars.data) for recall, precision, F1, and accuracy. The results are as follows:

  • Results:

    2015 (Used in AMIA paper submission)

    Step | Methods      | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1     | Accuracy
    0    | Lexicon.2015 | N/A           | 867,728    | 363,217 | 0         | 0          | 504,511      | 1.0000    | 1.0000 | 1.0000 | 1.0000
    1    | Norm         | N/A           | 867,728    | 306,387 | 19,374    | 56,830     | 485,137      | 0.9405    | 0.8435 | 0.8894 | 0.9122
    2    | MES          | 2             | 867,728    | 355,423 | 173,647   | 7,794      | 330,864      | 0.6718    | 0.9785 | 0.7967 | 0.7909
    3    | ES           | 1             | 867,728    | 360,599 | 286,932   | 2,618      | 217,579      | 0.5569    | 0.9928 | 0.7135 | 0.6663
    4    | MES          | 3             | 867,728    | 360,956 | 301,097   | 2,261      | 203,414      | 0.5452    | 0.9938 | 0.7041 | 0.6504
    5    | ES           | 2             | 867,728    | 362,082 | 353,512   | 1,135      | 150,999      | 0.5060    | 0.9969 | 0.6713 | 0.5913
    6    | MES          | 4             | 867,728    | 362,159 | 356,156   | 1,058      | 148,355      | 0.5042    | 0.9971 | 0.6697 | 0.5883

  • Discussion:

    Step 6 gives the final results we use for the matcher. We use it as an example to check the calculations:

    Check Item       | Check numbers
    Total sample no. | 867,728 = 362,159 + 356,156 + 1,058 + 148,355
    Precision        | 0.5042 = 362,159 / (362,159 + 356,156)
    Recall           | 0.9971 = 362,159 / (362,159 + 1,058)
    F1               | 0.6697 = (2 * 0.5042 * 0.9971) / (0.5042 + 0.9971)
    Accuracy         | 0.5883 = (362,159 + 148,355) / 867,728
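
    The check above can be reproduced with a few lines of Python. This is a minimal sketch, assuming ret-rel, ret-irrel, notRet-rel, and notRet-irrel correspond to true positives, false positives, false negatives, and true negatives; the function name confusion_metrics is illustrative and not part of the original tooling.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return (precision, recall, F1, accuracy) from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Step 6 counts from the results table, in the assumed order:
# ret-rel (TP), ret-irrel (FP), notRet-rel (FN), notRet-irrel (TN)
p, r, f1, acc = confusion_metrics(362_159, 356_156, 1_058, 148_355)
print(f"precision={p:.4f} recall={r:.4f} F1={f1:.4f} accuracy={acc:.4f}")
```

    The same function can be applied to any other row of the results table to cross-check its four derived columns.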

  • Discussion:
    • Recall reaches 99.71%, but precision, F1, and accuracy are relatively low. Runtime performance is also very poor: we had to reduce the size of the n-gram set by increasing the WC from 30 to 150, and even so the entire process took more than 14 days to run. A better model with improved performance, precision, F1, and accuracy should therefore be developed.

  • Enhancement:
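    Step                         | Methods                                 | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1     | Accuracy
    Baseline (step-2 from above) | MES                                     | 2             | 867,728    | 355,423 | 173,647   | 7,794      | 330,864      | 0.6718    | 0.9785 | 0.7967 | 0.7909
    1                            | Double Metaphone (10)                   | 2             | 867,728    | 356,375 | 178,698   | 6,842      | 325,813      | 0.6660    | 0.9812 | 0.7935 | 0.7862
    2                            | Metaphone (10), Caverphone              | 2             | 867,728    | 354,790 | 151,028   | 8,427      | 353,483      | 0.7014    | 0.9768 | 0.8165 | 0.8162
    3                            | Metaphone (60), Caverphone              | 2             | 867,728    | 352,911 | 115,531   | 10,306     | 388,980      | 0.7534    | 0.9716 | 0.8487 | 0.8550

    Enhanced SpVarNorm
    Step                         | Methods                                 | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1     | Accuracy
    Baseline (step-1 from above) | Norm                                    | N/A           | 867,728    | 306,387 | 19,374    | 56,830     | 485,137      | 0.9405    | 0.8435 | 0.8894 | 0.9122
    New Baseline                 | Norm                                    | N/A           | 867,728    | 304,831 | 3,973     | 58,386     | 500,538      | 0.9871    | 0.8393 | 0.9072 | 0.9281
    4                            | Metaphone (60), Caverphone              | 2             | 867,728    | 352,826 | 114,271   | 10,391     | 390,240      | 0.7554    | 0.9714 | 0.8499 | 0.8563
    5                            | Metaphone (60), Caverphone, GrecoLatin  | 2             | 867,728    | 352,675 | 105,623   | 10,542     | 398,888      | 0.7695    | 0.9710 | 0.8586 | 0.8661

    New GoldStandard - with inflectional Spelling Variants
    Step                         | Methods                                 | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1     | Accuracy
    6.0                          | Norm                                    | 2             | 867,728    | 305,329 | 3,475     | 74,447     | 484,477      | 0.9887    | 0.8040 | 0.8868 | 0.9102
    6.1??                        | Metaphone (60), Caverphone, GrecoLatin  | 2             | 867,728    | 369,200 | 97,897    | 10,576     | 390,055      | 0.7904    | 0.9722 | 0.8719 | 0.8750
    6.1                          | Metaphone (60), Caverphone, Greco-Latin | 1             | 867,728    | 369,049 | 89,249    | 10,727     | 398,703      | 0.8053    | 0.9718 | 0.8807 | 0.8845
    6.2                          | Metaphone (60), Caverphone, Greco-Latin | 2             | 867,728    | 369,049 | 89,249    | 10,727     | 398,703      | 0.8053    | 0.9718 | 0.8807 | 0.8845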

  • Enhancement logs:

    Tried:

    • LVG Metaphone (1.0)
    • Double Metaphone (2.0)
    • Soundex
    • Refined Soundex
    • Caverphone 1.0
    • Caverphone 2.0
    • Cologne Phonetic
    • Step 1: Use Double Metaphone to increase recall on MES:
      Example | Term         | Metaphone 1 | Metaphone 2 | Notes
      1       | meagreness   | MKRNS       | MKRNS       | increase TP to increase recall
              | meagerness   | MJRNS       | MKRNS       |
      2       | abkhasian    | ABKHXN      | APKSN       | increase TP to increase recall
              | abkhazian    | ABKHSN      | APKSN       |
      3       | toxic edema  | TKSSTM      | TKSKTM      | increase TP to increase recall
              | toxic oedema | TKSKTM      | TKSKTM      |
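
      The effect of the second code in Step 1 can be illustrated with a small sketch. It assumes that a pair counts as a candidate match when either of its two codes agrees; the codes are copied from the Step 1 table rather than computed, and the helper names match_metaphone1 and match_either are hypothetical, not the matcher's actual API.

```python
# Codes copied from the Step 1 table: (Metaphone 1, Metaphone 2).
CODES = {
    "meagreness": ("MKRNS", "MKRNS"),
    "meagerness": ("MJRNS", "MKRNS"),
    "abkhasian": ("ABKHXN", "APKSN"),
    "abkhazian": ("ABKHSN", "APKSN"),
    "toxic edema": ("TKSSTM", "TKSKTM"),
    "toxic oedema": ("TKSKTM", "TKSKTM"),
}


def match_metaphone1(a, b):
    """Candidate match using the Metaphone 1 column only."""
    return CODES[a][0] == CODES[b][0]


def match_either(a, b):
    """Candidate match if the pair agrees on Metaphone 1 or on Metaphone 2."""
    return any(x == y for x, y in zip(CODES[a], CODES[b]))


PAIRS = [
    ("meagreness", "meagerness"),
    ("abkhasian", "abkhazian"),
    ("toxic edema", "toxic oedema"),
]
for a, b in PAIRS:
    print(a, "/", b, match_metaphone1(a, b), match_either(a, b))
```

      Under this assumption, all three pairs are missed by the first code alone and recovered once the second code is considered, which is the recall gain the notes describe.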

    • Step 2: Add Caverphone to increase precision on MES
      Example | Term                     | Metaphone 1 | Metaphone 2 | Caverphone 2.0 | Notes
      1       | zymographical            | SMKRFKL     | SMKRFKL     | SMKRFKA111     | reduce FP to increase precision
              | zymographically          | SMKRFKL     | SMKRFKL     | SMKRFKLA11     |
      2       | absorption test          | ABSRPXNTST  | APSRPXNTST  | APSPSNTST1     | reduce FP to increase precision
              | absorption tests         | ABSRPXNTST  | APSRPXNTST  | APSPSNTSTS     |
      3       | bacterial culture media  | BKTRLKLTRM  | PKTRLKLTRM  | PKTRKTRMTA     | reduce FP to increase precision
              | bacterial culture medium | BKTRLKLTRM  | PKTRLKLTRM  | PKTRKTRMTM     |
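
      One way to read Step 2 is that a pair is kept only when every phonetic code agrees, so Caverphone acts as a second filter on top of Metaphone. The sketch below assumes exactly that rule; the codes are copied from the Step 2 table (Metaphone 2 and Caverphone 2.0 columns), and same_group is a hypothetical helper name.

```python
# Codes copied from the Step 2 table.
CODES = {
    "zymographical": {"metaphone": "SMKRFKL", "caverphone": "SMKRFKA111"},
    "zymographically": {"metaphone": "SMKRFKL", "caverphone": "SMKRFKLA11"},
    "absorption test": {"metaphone": "APSRPXNTST", "caverphone": "APSPSNTST1"},
    "absorption tests": {"metaphone": "APSRPXNTST", "caverphone": "APSPSNTSTS"},
    "bacterial culture media": {"metaphone": "PKTRLKLTRM", "caverphone": "PKTRKTRMTA"},
    "bacterial culture medium": {"metaphone": "PKTRLKLTRM", "caverphone": "PKTRKTRMTM"},
}


def same_group(a, b, keys=("metaphone", "caverphone")):
    """Keep a pair only if all of the listed codes agree (assumed rule)."""
    return all(CODES[a][k] == CODES[b][k] for k in keys)


pair = ("absorption test", "absorption tests")
print(same_group(*pair, keys=("metaphone",)))  # Metaphone alone
print(same_group(*pair))                       # Metaphone and Caverphone
```

      With Metaphone alone the pair is accepted (a false positive, since the two terms are inflectional rather than spelling variants); requiring Caverphone agreement as well rejects it.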

    • Step 3: Increase maxLength in Metaphone to 60 to increase precision for long words on MES (the max length of a LexItem Metaphone code is 54, while the max length of n-grams is 50)

      Example | Term                                 | Metaphone (10) | Metaphone (60) | Notes
      1       | 2-item patient health questionnair   | TMPTNTL0KS     | ITMPTN0L0KSXNR | reduce FP to increase precision
              | 2-item patient health questionnaires | TMPTNTL0KS     | TMPTNTL0KSXNRS |
      2       | bacterial culture media              | PKTRLKLTRM     | PKTRLKLTRMT    | reduce FP to increase precision
              | bacterial culture medium             | PKTRLKLTRM     | PKTRLKLTRMTM   |
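
      The truncation effect behind Step 3 can be shown directly on the codes from the table. This sketch assumes that the Metaphone (10) code is simply the first 10 characters of the Metaphone (60) code, which is consistent with the "bacterial culture" rows; truncate is a hypothetical helper, not a library call.

```python
def truncate(code, max_length):
    """Mimic a maximum code length by cutting the code (assumed behaviour)."""
    return code[:max_length]


# Metaphone (60) codes copied from the Step 3 table.
media = "PKTRLKLTRMT"    # bacterial culture media
medium = "PKTRLKLTRMTM"  # bacterial culture medium

collide_at_10 = truncate(media, 10) == truncate(medium, 10)
collide_at_60 = truncate(media, 60) == truncate(medium, 60)
print(collide_at_10, collide_at_60)
```

      At length 10 both terms collapse to PKTRLKLTRM and are wrongly grouped; at length 60 the codes stay distinct, which removes the false positive.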

    • Step 4: For a noun marked metareg, the plural form is generated as x -> x's. These forms should not be normalized in SpVarNorm.

      Example | Singular       | Plural           | Notes
      1       | aan            | aan's            | reduce FP to increase precision
      2       | dcmp deaminase | dcmp deaminase's | reduce FP to increase precision
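
      A minimal sketch of the Step 4 check, assuming the only thing to detect is the x -> x's pattern on an already-identified noun pair. The function name is hypothetical, and the sketch does not consult the Lexicon's metareg flag.

```python
def is_apostrophe_s_plural(singular, plural):
    """True if `plural` is `singular` followed by 's (the x -> x's pattern)."""
    return plural == singular + "'s"


# Pairs from the Step 4 table, plus one unrelated pair for contrast.
print(is_apostrophe_s_plural("aan", "aan's"))
print(is_apostrophe_s_plural("dcmp deaminase", "dcmp deaminase's"))
print(is_apostrophe_s_plural("fimbria", "fimbriae"))
```

      Pairs for which the function returns True would be left out of spelling variant normalization.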

    • Step 5: For a noun, if the pair matches a GrecoLatin singular-plural pattern, do not add it to spVar (exclude it).

      Example | Term                   | Metaphone (60) | Caverphone 2.0 | Greco-Latin | Notes
      1       | acroscleroses          | AKRSKRSS       | AKRSKLRSS1     | plural      | reduce FP to increase precision
              | acrosclerosis          | AKRSKRSS       | AKRSKLRSS1     | singular    |
      2       | ammon's horn scleroses | AMNSRNSKRSS    | AMNSNSKLRS     | plural      | reduce FP to increase precision
              | ammon's horn sclerosis | AMNSRNSKRSS    | AMNSNSKLRS     | singular    |
      3       | fimbria                | FMPR           | FMPRA11111     | singular    | reduce FP to increase precision
              | fimbriae               | FMPR           | FMPRA11111     | plural      |
      4       | infraorbital foramen   | ANFRRPTLFRMN   | ANFRPTFRMN     | singular    | reduce FP to increase precision
              | infraorbital foramina  | ANFRRPTLFRMN   | ANFRPTFRMN     | plural      |
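
      The Step 5 exclusion can be sketched as a suffix check. The three suffix pairs below are inferred only from the examples in the table (-is/-es, -a/-ae, -en/-ina) and are an assumption; the actual GrecoLatin rule set is not listed here and is likely larger. The names GRECO_LATIN_SUFFIXES and is_greco_latin_pair are hypothetical.

```python
# (singular suffix, plural suffix) pairs inferred from the Step 5 examples.
GRECO_LATIN_SUFFIXES = [("is", "es"), ("a", "ae"), ("en", "ina")]


def is_greco_latin_pair(a, b):
    """True if a and b share a stem and differ by one of the suffix pairs."""
    for sing, plur in GRECO_LATIN_SUFFIXES:
        # Try both orders, since either term may be the singular.
        for s, p in ((a, b), (b, a)):
            if (s.endswith(sing) and p.endswith(plur)
                    and s[:-len(sing)] == p[:-len(plur)]):
                return True
    return False


print(is_greco_latin_pair("acroscleroses", "acrosclerosis"))
print(is_greco_latin_pair("fimbria", "fimbriae"))
print(is_greco_latin_pair("infraorbital foramen", "infraorbital foramina"))
print(is_greco_latin_pair("meagreness", "meagerness"))
```

      Pairs that match are treated as inflectional variants and excluded, while a true spelling variant pair such as meagreness/meagerness passes through.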

    • Step 6: If the leading characters are numbers, they must be the same. Otherwise terms such as the following receive the same Metaphone code:
      • 2-item patient health questionnair -> TMPTNTL0KSXNR
      • 4-item patient health questionnair -> TMPTNTL0KSXNR
      • 15-item patient health questionnair -> TMPTNTL0KSXNR
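
      A minimal sketch of the leading-number rule, assuming "leading characters" means the run of digits at the very start of the term. The helper names are hypothetical.

```python
import re

_LEADING_DIGITS = re.compile(r"^\d+")


def leading_number(term):
    """Return the run of digits at the start of the term, or None."""
    m = _LEADING_DIGITS.match(term)
    return m.group() if m else None


def leading_numbers_agree(a, b):
    """True if both terms start with the same number, or neither has one."""
    return leading_number(a) == leading_number(b)


print(leading_numbers_agree("2-item patient health questionnair",
                            "4-item patient health questionnair"))
print(leading_numbers_agree("2-item patient health questionnair",
                            "2-item patient health questionnaires"))
```

      The first pair is rejected even though both terms share the code TMPTNTL0KSXNR, because their leading numbers differ.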

  • Step 6: TBD

    Example | Term         | Metaphone (60) | Caverphone 2.0 | Greco-Latin | Notes
    1       | zygomycetes  | SKMSTS         | SKMSTS1111     | ?           | TBD?
            | zygomycetous | SKMSTS         | SKMSTS1111     | ?           |

  • Step 7: Lexicon error (can be used in LexBuild for approximate match)

    Example | Term                | Metaphone (60) | Caverphone 2.0 | Greco-Latin | Notes
    1       | zygapophyseal joint | TBD            | TBD            | ?           | TBD?
            | zygapophysial joint | TBD            | TBD            | ?           |
    2       | zuclomifene         | TBD            | TBD            | ?           | TBD?
            | zuclomiphene        | TBD            | TBD            | ?           |