Real-word Correction
This page describes the algorithm for real-word correction. In CSpell, real-word errors are detected and corrected on the fly, based on the context score, the word frequency score, and other heuristic rules. No confusion sets are used, and no assumption is made about the number of real-word errors.
I. Functions
II. Results on the Training Set
Different methods were tested on the gold standard from the training set, with real-word errors included.
Methods | Raw data (TP / retrieved / relevant) | Performance (precision / recall / F1) |
---|---|---|
Ensemble (Use Non-Word on Real-Word) | 556 / 825 / 964 | 0.6739 / 0.5768 / 0.6216 |
Ensemble (Real-Word) | 517 / 718 / 964 | 0.7201 / 0.5363 / 0.6147 |
CSpell: NW | 609 / 731 / 964 | 0.8331 / 0.6317 / 0.7186 |
CSpell: NW + RW_Merge | 619 / 742 / 964 | 0.8342 / 0.6421 / 0.7257 |
CSpell: NW + RW_Split | 611 / 737 / 964 | 0.8290 / 0.6338 / 0.7184 |
CSpell: NW + RW_1To1 | 614 / 740 / 964 | 0.8297 / 0.6369 / 0.7207 |
CSpell: NW + RW_Merge + RW_Split | 621 / 747 / 964 | 0.8313 / 0.6442 / 0.7259 |
CSpell: NW + RW_Merge + RW_Split + RW_1To1 | 626 / 756 / 964 | 0.8280 / 0.6494 / 0.7279 |
Approximate running time:
- RW_M and RW_S: ~1 min.
- RW_1: ~4 min.
- RW_M_S: ~1 min.
- RW_A: ~4.5 min.
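The performance figures in the table can be reproduced from the raw counts. The sketch below assumes the three raw numbers in each row are, in order, the correct corrections (true positives), the corrections returned, and the errors in the gold standard; that reading is inferred from the arithmetic and is not stated on this page. The function name `prf` is hypothetical.

```python
def prf(true_positives: int, retrieved: int, relevant: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1), each rounded to 4 decimals."""
    precision = true_positives / retrieved
    recall = true_positives / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 4), round(recall, 4), round(f1, 4)


# Raw counts copied from two rows of the table above.
print(prf(556, 825, 964))  # Ensemble (Use Non-Word on Real-Word): (0.6739, 0.5768, 0.6216)
print(prf(619, 742, 964))  # CSpell: NW + RW_Merge: (0.8342, 0.6421, 0.7257)
```

Under this reading, adding the real-word functions to CSpell: NW raises recall (more errors are fixed) while precision stays roughly level, which is why F1 improves only slightly.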
III. Examples
- Merge:
ID | Input | Output | Notes |
---|---|---|---|
M-1 | on set | on set | No merge |
M-2 | based on set criteria | based on set criteria | No merge |
M-3 | early on set | early onset | Merged |
M-4 | on set dementia | onset dementia | Merged |
M-5 | dianosed early on set deminita | diagnosed early onset dementia | Merged with other NW corrections |
  - Whether "on set" is merged to "onset" depends on the context. In Example M-5, "dianosed" and "deminita" are also corrected to "diagnosed" and "dementia", respectively, by the non-word functions before the real-word merge.
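The context dependence in the merge examples can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the lexicon, the pair scores in `CONTEXT`, and the rule of summing scores over adjacent pairs are hypothetical and are not CSpell's actual resources or scoring method. The sketch only shows the general idea that a merge is kept when it fits the surrounding words better than the unmerged text.

```python
# Hypothetical toy lexicon and context table; CSpell's real resources differ.
LEXICON = {"on", "set", "onset", "early", "dementia", "based", "criteria"}
CONTEXT = {
    ("early", "onset"): 5.0,
    ("onset", "dementia"): 5.0,
    ("based", "on"): 5.0,
    ("set", "criteria"): 4.0,
    ("on", "set"): 1.0,
}


def context_score(tokens):
    """Sum the toy scores over adjacent token pairs (unknown pairs score 0)."""
    return sum(CONTEXT.get(pair, 0.0) for pair in zip(tokens, tokens[1:]))


def merge_real_words(text):
    """Merge two adjacent words when the merged word is in the lexicon
    and the merged sentence scores higher in context than the best so far."""
    tokens = text.split()
    best = tokens
    for i in range(len(tokens) - 1):
        merged = tokens[i] + tokens[i + 1]
        if merged not in LEXICON:
            continue
        candidate = tokens[:i] + [merged] + tokens[i + 2:]
        if context_score(candidate) > context_score(best):
            best = candidate
    return " ".join(best)


print(merge_real_words("on set"))                 # on set
print(merge_real_words("based on set criteria"))  # based on set criteria
print(merge_real_words("early on set"))           # early onset
print(merge_real_words("on set dementia"))        # onset dementia
```

With these made-up scores, the sketch reproduces M-1 through M-4: the same two words are merged or left alone depending only on their neighbors.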
- Split:
ID | Input | Output | Notes |
---|---|---|---|
S-1 | along | along | No split |
S-2 | for along time | for a long time | Split |
S-3 | He is along | He is along | No split |
S-4 | He is a long with me | He is along with me | No split; merged |
  - Google does not correct S-2 and S-4.
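Before a split can be chosen, the possible splits have to be enumerated. The following minimal sketch lists every two-way split of a word whose halves are both known words. The lexicon and the function name `split_candidates` are hypothetical, and the sketch deliberately stops at candidate generation: as S-1 through S-3 show, whether a candidate is accepted depends on the context.

```python
# Hypothetical toy lexicon; a real system would use a full dictionary.
LEXICON = {"a", "long", "along", "for", "time", "he", "is", "with", "me"}


def split_candidates(word, lexicon):
    """Return every two-way split of `word` whose halves are both in `lexicon`."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in lexicon and word[i:] in lexicon
    ]


print(split_candidates("along", LEXICON))  # [('a', 'long')]
print(split_candidates("time", LEXICON))   # []
```

Because "along" is itself a valid word, the single candidate "a long" competes with leaving the word unchanged, which is what makes this a real-word problem rather than a non-word one.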
- Spelling (1-to-1):
ID | Input | Output |
---|---|---|
1-1 | foul small | foul smell |
1-2 | bad small | bad smell |
1-3 | small an odor | smell an odor |
1-4 | sense of small | sense of smell |
1-5 | taste and small | taste and smell |
1-6 | smell size | small size |
1-7 | smell amount | small amount |
1-8 | a smell sip of water | a small sip of water |
1-9 | smell intestine | small intestine |
1-10 | very smell | very small |
1-11 | relatively smell | relatively small |
  - Google does not correct 1-3, 1-5, 1-10, and 1-11.
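In the 1-to-1 examples, "small" and "smell" differ by a single letter. One simple way to propose such replacements is to collect lexicon words within one edit of the input word. This is a sketch under that assumption only: the page does not say how CSpell generates its 1-to-1 candidates, and the lexicon and function names below are hypothetical. Choosing among the candidates would still depend on context, as the table shows.

```python
def within_one_edit(a, b):
    """True if b is reachable from a by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]  # exactly one extra character in `b`


def one_to_one_candidates(word, lexicon):
    """Return lexicon words, other than `word` itself, within one edit of it."""
    return sorted(w for w in lexicon if w != word and within_one_edit(word, w))


# Hypothetical toy lexicon.
LEXICON = {"small", "smell", "smile", "stall", "foul", "size"}

print(one_to_one_candidates("small", LEXICON))  # ['smell', 'stall']
print(one_to_one_candidates("smell", LEXICON))  # ['small']
```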