Training Set
- brat format: Training Set (brat), 107 KB (see the reading sketch after this list)
- text format: Training Set (text), 215 KB
- OrgData.471: the 471 original health-related questions asked by consumers and received by NLM
- GoldStd-NonWord: gold standard for non-word errors
- GoldStd-RealWord: gold standard for real-word errors
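The brat-format files follow the brat standoff convention, pairing each question's text file with an .ann annotation file. The sketch below is a minimal reader for that layout, assuming standard brat standoff formatting and hypothetical file names; the concrete tag names and the way corrections are encoded are defined by the gold standard files themselves and are not reproduced here.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class BratSpan:
        ann_id: str   # e.g. "T1"
        label: str    # annotation tag; concrete tag names are defined by the gold standard
        start: int    # character offsets into the paired question text file
        end: int
        text: str     # surface form covered by the span

    def read_ann(ann_path: str) -> List[BratSpan]:
        """Parse the text-bound ("T") annotations from a brat standoff .ann file."""
        spans = []
        with open(ann_path, encoding="utf-8") as fh:
            for line in fh:
                line = line.rstrip("\n")
                if not line.startswith("T"):
                    continue  # skip notes (#), attributes (A), relations (R), etc.
                ann_id, type_and_offsets, text = line.split("\t", 2)
                label, offsets = type_and_offsets.split(" ", 1)
                # discontinuous spans separate fragments with ';' -- keep the outer bounds
                numbers = offsets.replace(";", " ").split()
                spans.append(BratSpan(ann_id, label, int(numbers[0]), int(numbers[-1]), text))
        return spans

    # Hypothetical usage: pair a question's .txt with its .ann annotations.
    # question = open("train/10001.txt", encoding="utf-8").read()
    # for span in read_ann("train/10001.ann"):
    #     print(span.label, repr(question[span.start:span.end]))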
II. Description
We used both the training set and the test set from the Ensemble method as the training set for developing CSpell. This combined training set is summarized as follows:
- Summary statistics:
  Consumer health questions: 471*
  Tokens: 24,837
  Annotation tags: 1,008
  Instances of non-word corrections: 774
  Instances of real-word corrections: 964
  Word count per question: 5 - 328
  Average word count per question: 52.49
  Errors per question: 0 - 27
  Average errors per question: 2.14
  Error rate (errors per token): 0.04 (= 964/24,837)
*One question (11199.txt) is removed from the Ensemble method data because it contains too many non-English words.
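The two derived figures in the summary follow directly from the headline counts; restating that arithmetic (values copied from the table above; note that the listed error rate divides the real-word correction count by the token count, as the table's own formula shows):

    # Headline ratios implied by the counts above (values as reported in the table).
    questions       = 471
    tokens          = 24_837
    annotation_tags = 1_008
    real_word_corr  = 964     # instances of real-word corrections

    avg_errors_per_question = annotation_tags / questions  # 1,008 / 471  ~= 2.14
    error_rate_per_token    = real_word_corr / tokens      # 964 / 24,837 ~= 0.04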
III. Distribution of Errors in the Training Set
- Stats on file size and error tags (per question file)
    Count        Minimum   Maximum   Average
    Characters        34      1985    296.37
    Words              5       328     52.49
    Error tags         0        27      2.14
- Error types and corrections
    Correction needed    non-word    real-word      ND    Multiple    Total
    Spelling                  348          153     113         N/A      614
    Merge                      10           38       0         N/A       48
    Split                      24           10     281         N/A      315
    Multiple                  N/A          N/A     N/A          31       31
    Total                     382          201     394          31     1008
    Percentage             37.90%       19.94%  39.09%       3.08%  100.00%
  where:
- ND: errors that do not need a dictionary for correction
- Multiple: errors that combine several types and require multiple corrections
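The Total and Percentage figures in the table can be re-derived from the individual counts; a short sketch of that arithmetic (counts copied from the table above):

    # Re-derive the column totals and the Percentage row of the error-type table.
    counts = {
        #            non-word  real-word    ND
        "Spelling": (348,       153,       113),
        "Merge":    (10,         38,         0),
        "Split":    (24,         10,       281),
    }
    multiple = 31  # errors combining several types, counted only in the Multiple column

    column_totals = [sum(col) for col in zip(*counts.values())]  # [382, 201, 394]
    grand_total = sum(column_totals) + multiple                  # 1,008 annotation tags

    for name, total in zip(("non-word", "real-word", "ND", "Multiple"),
                           column_totals + [multiple]):
        print(f"{name}: {total} ({total / grand_total:.2%})")
    # -> non-word 37.90%, real-word 19.94%, ND 39.09%, Multiple 3.08%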
IV. Other Components
- Issues with the original gold standard of the training set
- Training set data analysis (on the original gold standard)
- Gold standard annotation revision
V. Performance Tests