Training Set

I. Download the Training Set

  • brat format: Training Set (brat), 107 KB
  • text format: Training Set (text), 215 KB
    • 5.3 MB
    • OrgData.471: the 471 original health-related questions that consumers asked NLM
    • GoldStd-NonWord: non-word gold standard
    • GoldStd-RealWord: real-word gold standard

II. Description

We used both the training set and the test set from the Ensemble method as our training set to develop CSpell. The training set is summarized as follows:

  • Summary statistics:
    Consumer health questions              471*
    Tokens                                 24,837
    Annotation tags                        1,008
    Instances of non-word corrections      774
    Instances of real-word corrections     964
    Word count per question                5 - 328
    Average word count per question        52.49
    Error per question                     0 - 27
    Average error per question             2.14
    Error rate (error per token)           0.04 (= 964/24,837)

*One question (11199.txt) was removed from the Ensemble method data because it contains too many non-English words.
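
The two derived figures in the summary can be reproduced from the raw counts. The sketch below is only an arithmetic check: the variable names are illustrative, and it assumes that the average error per question is the number of annotation tags divided by the number of questions (which matches the reported 2.14), while the error rate uses the 964/24,837 ratio exactly as stated above.

```python
# Raw counts copied from the summary statistics.
questions = 471
tokens = 24_837
annotation_tags = 1_008
real_word_instances = 964

# Assumption: average error per question = annotation tags / questions.
avg_error_per_question = annotation_tags / questions

# Error rate as given in the summary: 964 / 24,837.
error_rate = real_word_instances / tokens

print(f"Average error per question: {avg_error_per_question:.2f}")
print(f"Error rate (error per token): {error_rate:.2f}")
```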

III. Distribution of Errors in the Training Set

  • Stats on file size and error tags
    Count        Minimum    Maximum    Average
    Character    34         1985       296.37
    Word         5          328        52.49
    Error Tag    0          27         2.14

  • Error types and corrections
    Correction needed    non-word    real-word    ND        Multiple    Total
    Spelling             348         153          113       N/A         614
    Merge                10          38           0         N/A         48
    Split                24          10           281       N/A         315
    Multiple             N/A         N/A          N/A       31          31
    Total                382         201          394       31          1008
    Percentage           37.90%      19.94%       39.09%    3.08%       100.00%

    where:

    • ND: errors that do not need a dictionary for correction
    • Multiple: errors that combine several types and require multiple corrections
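
The totals and percentages in the error-type table follow directly from the per-cell counts. As a quick consistency check, the sketch below recomputes them; the dictionary layout and variable names are illustrative choices, not part of the CSpell distribution.

```python
# Per-cell counts copied from the error-type table.
counts = {
    "Spelling": {"non-word": 348, "real-word": 153, "ND": 113},
    "Merge":    {"non-word": 10,  "real-word": 38,  "ND": 0},
    "Split":    {"non-word": 24,  "real-word": 10,  "ND": 281},
}
multiple = 31  # errors that require multiple corrections

# Column totals across the three single-correction types.
column_totals = {
    col: sum(row[col] for row in counts.values())
    for col in ("non-word", "real-word", "ND")
}
column_totals["Multiple"] = multiple

# Row totals for each correction type.
row_totals = {name: sum(row.values()) for name, row in counts.items()}

grand_total = sum(column_totals.values())

# Share of each column in the grand total, as a percentage string.
percentages = {
    col: f"{100 * n / grand_total:.2f}%" for col, n in column_totals.items()
}

print(row_totals)
print(column_totals, grand_total)
print(percentages)
```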

IV. Other Components

V. Performance Tests