This page describes the algorithm for real-word correction. In general, CSpell detects and corrects real-word errors on the fly, based on the context score, the word frequency score, and other heuristic rules. No confusion set is used, and no assumption is made about the number of real-word errors.
II. Results on the Training Set
Different methods were tested on the gold standard from the training set, with real-word errors included.
|Method||TP|Retrieved|Relevant||Precision|Recall|F1|
|Ensemble (Use Non-Word on Real-Word)||556|825|964||0.6739|0.5768|0.6216|
|CSpell: NW + RW_Merge||619|742|964||0.8342|0.6421|0.7257|
|CSpell: NW + RW_Split||611|737|964||0.8290|0.6338|0.7184|
|CSpell: NW + RW_1To1||614|740|964||0.8297|0.6369|0.7207|
|CSpell: NW + RW_Merge + RW_Split||621|747|964||0.8313|0.6442|0.7259|
|CSpell: NW + RW_Merge + RW_Split + RW_1To1||626|756|964||0.8280|0.6494|0.7279|
- RW_M and RW_S: ~1 min.
- RW_1: ~4 min.
- RW_M_S: ~1 min.
- RW_A: ~4.5 min.
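The last three values in each results row can be reproduced from the first three counts, assuming those counts are true positives, retrieved corrections, and relevant (gold-standard) corrections. The following minimal sketch checks that reading against the last row; the function name `prf` is illustrative and is not part of CSpell.

```python
def prf(true_positives, retrieved, relevant):
    """Return (precision, recall, F1) from raw counts."""
    precision = true_positives / retrieved
    recall = true_positives / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the row "CSpell: NW + RW_Merge + RW_Split + RW_1To1".
p, r, f = prf(626, 756, 964)
print(f"{p:.4f} {r:.4f} {f:.4f}")  # 0.8280 0.6494 0.7279
```

The same arithmetic reproduces the Ensemble row (556, 825, 964), which supports the assumed meaning of the three count columns.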
|ID|Input|Output|Notes|
|M-1|on set|on set|No merge|
|M-2|based on set criteria|based on set criteria|No merge|
|M-3|early on set|early onset|Merged|
|M-4|on set dementia|onset dementia|Merged|
|M-5|dianosed early on set deminita|diagnosed early onset dementia|Merged with other NW corrections|
- Whether "on set" is merged to "onset" depends on the context. In Example M-5, "dianosed" and "deminita" are also corrected to "diagnosed" and "dementia", respectively, by the non-word functions before the real-word merge is applied.
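The context-dependent merge decision in Examples M-1 to M-4 can be illustrated with a small sketch. It is not CSpell's implementation: the lexicon, the bigram weights, and the function names below are hypothetical, and the toy bigram sum only stands in for the real context score.

```python
def merge_candidates(tokens, lexicon):
    """Yield token lists in which one adjacent pair is joined into a known word."""
    for i in range(len(tokens) - 1):
        merged = tokens[i] + tokens[i + 1]
        if merged in lexicon:
            yield tokens[:i] + [merged] + tokens[i + 2:]


def best_merge(tokens, lexicon, context_score):
    """Keep the original tokens unless a merge candidate scores strictly higher."""
    best, best_score = list(tokens), context_score(tokens)
    for candidate in merge_candidates(tokens, lexicon):
        score = context_score(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical toy data: a stand-in for a real context model.
TOY_LEXICON = {"onset"}
TOY_BIGRAMS = {
    ("on", "set"): 0.1,
    ("based", "on"): 1.0,
    ("set", "criteria"): 1.0,
    ("early", "onset"): 1.0,
    ("onset", "dementia"): 1.0,
}


def toy_context_score(tokens):
    """Sum the toy weights of all adjacent word pairs."""
    return sum(TOY_BIGRAMS.get(pair, 0.0) for pair in zip(tokens, tokens[1:]))


for text in ["on set", "based on set criteria", "early on set", "on set dementia"]:
    result = best_merge(text.split(), TOY_LEXICON, toy_context_score)
    print(text, "->", " ".join(result))
```

With these toy weights the sketch leaves "on set" and "based on set criteria" unchanged and merges the other two inputs, matching M-1 to M-4. The strict comparison means that a tie keeps the original text.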
|ID|Input|Output|Notes|
|S-1|along|along|No split|
|S-2|for along time|for a long time|Split|
|S-3|He is along|He is along|No split|
|S-4|He is a long with me|He is along with me|No split - Merge|
- Google does not correct S-2 and S-4!!
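A first step for the split case is to enumerate the ways a word can be divided into two known words. The sketch below covers only that candidate-generation step, under the assumption that a lexicon lookup is available; the function name and toy lexicon are hypothetical, and choosing between "along" and "a long" would still require a context score, as Examples S-1 to S-3 show.

```python
def split_candidates(word, lexicon):
    """Return every (left, right) split of `word` whose halves are both in `lexicon`."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in lexicon and word[i:] in lexicon
    ]


# Hypothetical toy lexicon for illustration only.
toy_lexicon = {"a", "long", "along", "time", "for"}
print(split_candidates("along", toy_lexicon))  # [('a', 'long')]
print(split_candidates("time", toy_lexicon))   # []
```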
- Spelling (1-to-1):
|ID|Input|Output|
|1-1|foul small|foul smell|
|1-2|bad small|bad smell|
|1-3|small an odor|smell an odor|
|1-4|sense of small|sense of smell|
|1-5|taste and small|taste and smell|
|1-6|smell size|small size|
|1-7|smell amount|small amount|
|1-8|a smell sip of water|a small sip of water|
|1-9|smell intestine|small intestine|
|1-10|very smell|very small|
|1-11|relatively smell|relatively small|
- Google does not correct 1-3, 1-5, 1-10 and 1-11!!
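The 1-to-1 examples above swap one valid word for another that is a single edit away ("small" and "smell"), with the direction decided by the neighboring words. The following sketch shows one way to express that idea: it generates dictionary words within one edit of a token and keeps a replacement only if a context score improves. All names, the lexicon, and the bigram weights are hypothetical, and the toy scorer is a stand-in for CSpell's actual context and frequency scores.

```python
import string


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct_token(tokens, index, lexicon, context_score):
    """Replace tokens[index] by a known neighbor only if the context score rises."""
    best, best_score = tokens[index], context_score(tokens)
    candidates = (edits1(tokens[index]) & lexicon) - {tokens[index]}
    for candidate in sorted(candidates):
        trial = tokens[:index] + [candidate] + tokens[index + 1:]
        score = context_score(trial)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical toy data for illustration only.
TOY_LEXICON = {"small", "smell", "stall", "foul", "amount"}
TOY_BIGRAMS = {("foul", "smell"): 1.0, ("small", "amount"): 1.0}


def toy_context_score(tokens):
    """Sum the toy weights of all adjacent word pairs."""
    return sum(TOY_BIGRAMS.get(pair, 0.0) for pair in zip(tokens, tokens[1:]))


print(correct_token(["foul", "small"], 1, TOY_LEXICON, toy_context_score))    # smell
print(correct_token(["smell", "amount"], 0, TOY_LEXICON, toy_context_score))  # small
```

Because the original token is kept on ties, a word is changed only when the surrounding context clearly favors the alternative, which mirrors the symmetric behavior in Examples 1-1 and 1-7.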