Spelling Variant Patterns - Test on Lexicon-LRSPL
Norm, MES, and ES are applied in sequential order to retrieve as many spelling variant groups as possible. This model is tested on the Lexicon (inflVars.data) for recall, precision, F1, and accuracy. The results are shown as follows:
- Results:
2015 (Used in AMIA paper submission)
| Step | Methods | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lexicon.2015 | N/A | 867,728 | 363,217 | 0 | 0 | 504,511 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| 1 | Norm | N/A | 867,728 | 306,387 | 19,374 | 56,830 | 485,137 | 0.9405 | 0.8435 | 0.8894 | 0.9122 |
| 2 | MES | 2 | 867,728 | 355,423 | 173,647 | 7,794 | 330,864 | 0.6718 | 0.9785 | 0.7967 | 0.7909 |
| 3 | ES | 1 | 867,728 | 360,599 | 286,932 | 2,618 | 217,579 | 0.5569 | 0.9928 | 0.7135 | 0.6663 |
| 4 | MES | 3 | 867,728 | 360,956 | 301,097 | 2,261 | 203,414 | 0.5452 | 0.9938 | 0.7041 | 0.6504 |
| 5 | ES | 2 | 867,728 | 362,082 | 353,512 | 1,135 | 150,999 | 0.5060 | 0.9969 | 0.6713 | 0.5913 |
| 6 | MES | 4 | 867,728 | 362,159 | 356,156 | 1,058 | 148,355 | 0.5042 | 0.9971 | 0.6697 | 0.5883 |

- Discussion:
Step 6 gives the final results used for the matcher. It is used here as the example for the calculation check:
| Check Item | Check numbers |
|---|---|
| Total sample no | 867,728 = 362,159 + 356,156 + 1,058 + 148,355 |
| Precision | 0.5042 = 362,159 / (362,159 + 356,156) |
| Recall | 0.9971 = 362,159 / (362,159 + 1,058) |
| F1 | 0.6697 = (2 * 0.5042 * 0.9971) / (0.5042 + 0.9971) |
| Accuracy | 0.5883 = (362,159 + 148,355) / 867,728 |

- Discussion:
- The recall reaches 99.71%, while precision, F1, and accuracy are relatively low. The run-time performance is also very poor: we had to reduce the n-gram set by increasing the WC from 30 to 150, and even so the entire process took more than 14 days to run. Thus, a better model with improved performance, precision, F1, and accuracy should be developed.
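The calculation check above can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original test suite: the function name `confusion_metrics` and its parameter names are assumptions that mirror the table columns (ret-rel, ret-irrel, notRet-rel, notRet-irrel). It does not guard against zero denominators.

```python
def confusion_metrics(ret_rel, ret_irrel, not_ret_rel, not_ret_irrel):
    """Return (precision, recall, f1, accuracy) from the four table counts.

    Hypothetical helper: the names follow the table columns, not any real API.
    """
    total = ret_rel + ret_irrel + not_ret_rel + not_ret_irrel
    precision = ret_rel / (ret_rel + ret_irrel)      # retrieved items that are relevant
    recall = ret_rel / (ret_rel + not_ret_rel)       # relevant items that are retrieved
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (ret_rel + not_ret_irrel) / total     # correct decisions over all samples
    return precision, recall, f1, accuracy


# Counts for Step 6 (MES, edit distance 4) from the results table.
step6 = confusion_metrics(362_159, 356_156, 1_058, 148_355)
print(" ".join(f"{value:.4f}" for value in step6))
```

The same function can be applied to any other row of the results table to check its four metric columns against the raw counts.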
- Enhancement:
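| Step | Methods | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (step-2 from above) | MES | 2 | 867,728 | 355,423 | 173,647 | 7,794 | 330,864 | 0.6718 | 0.9785 | 0.7967 | 0.7909 |
| 1 | Double Metaphone (10) | 2 | 867,728 | 356,375 | 178,698 | 6,842 | 325,813 | 0.6660 | 0.9812 | 0.7935 | 0.7862 |
| 2 | Metaphone (10), Caverphone | 2 | 867,728 | 354,790 | 151,028 | 8,427 | 353,483 | 0.7014 | 0.9768 | 0.8165 | 0.8162 |
| 3 | Metaphone (60), Caverphone | 2 | 867,728 | 352,911 | 115,531 | 10,306 | 388,980 | 0.7534 | 0.9716 | 0.8487 | 0.8550 |

Enhanced SpVarNorm

| Step | Methods | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (step-1 from above) | Norm | N/A | 867,728 | 306,387 | 19,374 | 56,830 | 485,137 | 0.9405 | 0.8435 | 0.8894 | 0.9122 |
| New Baseline | Norm | N/A | 867,728 | 304,831 | 3,973 | 58,386 | 500,538 | 0.9871 | 0.8393 | 0.9072 | 0.9281 |
| 4 | Metaphone (60), Caverphone | 2 | 867,728 | 352,826 | 114,271 | 10,391 | 390,240 | 0.7554 | 0.9714 | 0.8499 | 0.8563 |
| 5 | Metaphone (60), Caverphone, GrecoLatin | 2 | 867,728 | 352,675 | 105,623 | 10,542 | 398,888 | 0.7695 | 0.9710 | 0.8586 | 0.8661 |

New GoldStandard - with inflectional Spelling Variants

| Step | Methods | Edit Distance | Sample No. | ret-rel | ret-irrel | notRet-rel | notRet-irrel | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6.0 | Norm | 2 | 867,728 | 305,329 | 3,475 | 74,447 | 484,477 | 0.9887 | 0.8040 | 0.8868 | 0.9102 |
| 6.1?? | Metaphone (60), Caverphone, GrecoLatin | 2 | 867,728 | 369,200 | 97,897 | 10,576 | 390,055 | 0.7904 | 0.9722 | 0.8719 | 0.8750 |
| 6.1 | Metaphone (60), Caverphone, Greco-Latin | 1 | 867,728 | 369,049 | 89,249 | 10,727 | 398,703 | 0.8053 | 0.9718 | 0.8807 | 0.8845 |
| 6.2 | Metaphone (60), Caverphone, Greco-Latin | 2 | 867,728 | 369,049 | 89,249 | 10,727 | 398,703 | 0.8053 | 0.9718 | 0.8807 | 0.8845 |

- Enhancement logs: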
Tried:
- LVG Metaphone (1.0)
- Double Metaphone (2.0)
- Soundex
- Refined Soundex
- Caverphone 1.0
- Caverphone 2.0
- Cologne Phonetic
- Step 1: Use double metaphone to increase recall on MES:
| Example | Term | Metaphone 1 | Metaphone 2 | Notes |
|---|---|---|---|---|
| 1 | meagreness | MKRNS | MKRNS | increase TP to increase recall |
| | meagerness | MJRNS | MKRNS | |
| 2 | abkhasian | ABKHXN | APKSN | increase TP to increase recall |
| | abkhazian | ABKHSN | APKSN | |
| 3 | toxic edema | TKSSTM | TKSKTM | increase TP to increase recall |
| | toxic oedema | TKSKTM | TKSKTM | |

- Step 2: Add Caverphone to increase precision on MES
| Example | Term | Metaphone 1 | Metaphone 2 | Caverphone 2.0 | Notes |
|---|---|---|---|---|---|
| 1 | zymographical | SMKRFKL | SMKRFKL | SMKRFKA111 | reduce FP to increase precision |
| | zymographically | SMKRFKL | SMKRFKL | SMKRFKLA11 | |
| 2 | absorption test | ABSRPXNTST | APSRPXNTST | APSPSNTST1 | reduce FP to increase precision |
| | absorption tests | ABSRPXNTST | APSRPXNTST | APSPSNTSTS | |
| 3 | bacterial culture media | BKTRLKLTRM | PKTRLKLTRM | PKTRKTRMTA | reduce FP to increase precision |
| | bacterial culture medium | BKTRLKLTRM | PKTRLKLTRM | PKTRKTRMTM | |

- Step 3: increase maxLength in Metaphone to 60 to increase precision for long words on MES (the max length of LexItem Metaphone is 54, while the max length of n-grams is 50)
| Example | Term | Metaphone (10) | Metaphone (60) | Notes |
|---|---|---|---|---|
| 1 | 2-item patient health questionnair | TMPTNTL0KS | ITMPTN0L0KSXNR | reduce FP to increase precision |
| | 2-item patient health questionnaires | TMPTNTL0KS | TMPTNTL0KSXNRS | |
| 2 | bacterial culture media | PKTRLKLTRM | PKTRLKLTRMT | reduce FP to increase precision |
| | bacterial culture medium | PKTRLKLTRM | PKTRLKLTRMTM | |

- Step 4: for a noun whose inflection is metareg, x -> x's is generated as its plural form. Such pairs should not be normalized in SpVarNorm
| Example | Singular | Plural | Notes |
|---|---|---|---|
| 1 | aan | aan's | reduce FP to increase precision |
| 2 | dcmp deaminase | dcmp deaminase's | reduce FP to increase precision |
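The Step 4 rule can be illustrated as a filter over candidate pairs. This is a rough sketch under a strong assumption: it looks only at the surface form (one term equals the other plus `'s`) and does not consult the Lexicon inflection type, so it would also drop any genuine pair that happens to have that shape. The names `is_possessive_plural_pair` and `filter_candidates` are hypothetical.

```python
def is_possessive_plural_pair(singular: str, plural: str) -> bool:
    """True if `plural` is just `singular` followed by 's (surface check only)."""
    return plural == singular + "'s"


def filter_candidates(pairs):
    """Drop candidate pairs that look like x -> x's plurals, in either order."""
    return [
        (a, b)
        for a, b in pairs
        if not (is_possessive_plural_pair(a, b) or is_possessive_plural_pair(b, a))
    ]


# Two pairs from the table above plus one genuine spelling-variant pair.
candidates = [
    ("aan", "aan's"),
    ("dcmp deaminase", "dcmp deaminase's"),
    ("meagreness", "meagerness"),
]
kept = filter_candidates(candidates)
print(kept)
```

Only the meagreness/meagerness pair survives the filter, which is the intended effect: fewer false positives without touching real spelling variants.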
- Step 5: for a noun, if a pair matches a GrecoLatin singular-plural pattern, do not add it to spVar (exclude it).
| Example | Term | Metaphone (60) | Caverphone 2.0 | Greco-Latin | Notes |
|---|---|---|---|---|---|
| 1 | acroscleroses | AKRSKRSS | AKRSKLRSS1 | plural | reduce FP to increase precision |
| | acrosclerosis | AKRSKRSS | AKRSKLRSS1 | singular | |
| 2 | ammon's horn scleroses | AMNSRNSKRSS | AMNSNSKLRS | plural | reduce FP to increase precision |
| | ammon's horn sclerosis | AMNSRNSKRSS | AMNSNSKLRS | singular | |
| 3 | fimbria | FMPR | FMPRA11111 | singular | reduce FP to increase precision |
| | fimbriae | FMPR | FMPRA11111 | plural | |
| 4 | infraorbital foramen | ANFRRPTLFRMN | ANFRPTFRMN | singular | reduce FP to increase precision |
| | infraorbital foramina | ANFRRPTLFRMN | ANFRPTFRMN | plural | |

- Step 6: if the leading characters are numbers, they must be the same
- 2-item patient health questionnair -> TMPTNTL0KSXNR
- 4-item patient health questionnair -> TMPTNTL0KSXNR
- 15-item patient health questionnair -> TMPTNTL0KSXNR
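The three terms above share one Metaphone code even though their leading numbers differ, which is why a separate check is needed. A minimal sketch of such a check follows. The helper names are hypothetical, and two behaviours are assumptions rather than anything stated in the source: terms with no leading number always pass, and a term with a leading number never matches one without.

```python
import re

# Matches a run of digits at the very start of a term.
_LEADING_NUMBER = re.compile(r"^\d+")


def leading_number(term: str):
    """Return the leading digit string of `term`, or None if there is none."""
    match = _LEADING_NUMBER.match(term)
    return match.group(0) if match else None


def same_leading_number(a: str, b: str) -> bool:
    """True if both terms start with the same number, or neither starts with one."""
    return leading_number(a) == leading_number(b)


print(same_leading_number("2-item patient health questionnair",
                          "4-item patient health questionnair"))
print(same_leading_number("2-item patient health questionnair",
                          "2-item patient health questionnaires"))
```

The first call rejects the 2-item/4-item pair, while the second accepts the pair that differs only in its ending.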
- Step 6: TBD
| Example | Term | Metaphone (60) | Caverphone 2.0 | Greco-Latin | Notes |
|---|---|---|---|---|---|
| 1 | zygomycetes | SKMSTS | SKMSTS1111 | ? | TBD? |
| | zygomycetous | SKMSTS | SKMSTS1111 | ? | |

- Step 7: Lexicon error (can be used in LexBuild for approximate match)
| Example | Term | Metaphone (60) | Caverphone 2.0 | Greco-Latin | Notes |
|---|---|---|---|---|---|
| 1 | zygapophyseal joint | TBD | TBD | ? | TBD? |
| | zygapophysial joint | TBD | TBD | ? | |
| 2 | zuclomifene | TBD | TBD | ? | TBD? |
| | zuclomiphene | TBD | TBD | ? | |
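One possible way to flag such near-identical pairs for approximate matching is a plain Levenshtein edit distance. The sketch below is only an illustration: it is not the LexBuild method, and the edit distance used by ES and MES in the tables above may be defined differently. The function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("zygapophyseal joint", "zygapophysial joint"))
print(edit_distance("zuclomifene", "zuclomiphene"))
```

The two example pairs are 1 and 2 edits apart respectively, so a small distance threshold would surface both as candidates for review.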