Test Set from NER Collection

This page describes the process of generating a spelling correction test set from the NER (Named Entity Recognition) collection.

  • I. Source
    The original data from the NER collection are in the CHQA-NER-Corpus_1.0 directory. It includes 3100 files, as shown in the following table:

    Type             File Extension    No.
    Configuration    *.conf            2
    Text             *.txt             1548
    Annotation       *.ann             1548

  • II. Formats and Retrieval Data
    The test set is retrieved from the 1548 text files.
    • Data are retrieved only from the 1128 *.xml.txt files.
      The other 420 *.txt files are excluded (they are already annotated in the baseline gold standard).
    • The *.xml.txt files come from different sources, such as email and web inquiries, and data from different sources are stored in different formats. The table below describes the data retrieved from each format. In general, the format can be identified by the key pattern [XXX:] in the first line of the file. Key patterns of the form [XXX:] are used on each line of a file. "http:" and "https:" are not treated as key patterns.

      Key (first line)    No.   Retrieved Fields                  Example/Notes
      SUBJECT:            963   • SUBJECT:                        • 1-118268098.xml.txt
                                • MESSAGE:                        • 1-118259395.xml.txt
                                                                  • 12626.xml.txt
      None (Plain Text)   144   • Plain Text (142)                • 11901.xml.txt
                                • MESSAGE: (1)                    • 13247.xml.txt
                                • "Subject:" and "Message:" (1)   • 1-118316905.xml.txt
                                                                  • 1-135889572.xml.txt
                                                                  • 1-123082816.xml.txt ("MESSAGE:")
                                                                  • 1-135050116.xml.txt ("Subject:" and "Message:")
      EMAIL:              14    • MESSAGE: (13)                   • 1-118275165.xml.txt
                                • Plain Text (1)                  • 1-120103542.xml.txt
                                                                  • 1-122955272.xml.txt
                                                                  • 1-123818745.xml.txt
                                                                  • 11433.xml.txt (plain text)
      Name:               6     • "Message Body:"                 • 1-130899901.xml.txt
                                                                  • 1-131195919.xml.txt
                                                                  • 1-131297375.xml.txt
                                                                  • 1-131417291.xml.txt
                                                                  • 1-131503031.xml.txt
                                                                  • 1-132136861.xml.txt
      From:               1     • "Subject:"                      • 1-133488182.xml.txt
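
    The key detection described in this section might be sketched in Java as follows. This is only an illustration, not the project's code: the class name KeyPatternSketch, the method keyOf, and the regular expression are assumptions. In particular, the sketch assumes a key is a run of letters (spaces allowed, as in "Message Body:") at the start of a line followed by a colon, so a plain sentence that happens to contain an early colon would be misread as a key. A real implementation may instead check against a fixed list of known keys.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KeyPatternSketch {
    // Assumption: a key is letters (and spaces) at the start of the line,
    // immediately followed by a colon, e.g. "SUBJECT:" or "Message Body:".
    private static final Pattern KEY = Pattern.compile("^([A-Za-z][A-Za-z ]*):");

    /** Returns the key of a line including its colon, or empty for plain text. */
    static Optional<String> keyOf(String line) {
        Matcher m = KEY.matcher(line.trim());
        if (!m.find()) {
            return Optional.empty();
        }
        String key = m.group(1);
        // "http:" and "https:" are URL schemes, so they are not treated as keys.
        if (key.equalsIgnoreCase("http") || key.equalsIgnoreCase("https")) {
            return Optional.empty();
        }
        return Optional.of(key + ":");
    }

    public static void main(String[] args) {
        System.out.println(keyOf("SUBJECT: dosage question"));
        System.out.println(keyOf("http://example.org/page"));
    }
}
```

    Applying keyOf to the first line of a file gives the format key in the first column of the table, with an empty result corresponding to the "None (Plain Text)" row.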

  • III. Retrieve Relevant Data
    Relevant data are retrieved and stored in ChrText.out in the following format:
    File Name    Text (retrieved data)
    • A period is added to the contents of "SUBJECT:" or "Subject:" if no sentence-ending punctuation (.!/) is found.
    • Each newline in the contents is replaced with a space.
    • Contents are trimmed (leading and trailing spaces are removed).
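
    The three clean-up rules above could look roughly like this in Java. The class and method names (RetrievedTextSketch, closeSubject, flatten, toText) are hypothetical. The sketch also makes two assumptions that the text does not confirm: the punctuation check looks only at the last character of the subject, and the subject and message are joined with a single space. The set of ending characters is copied as listed in the rule above.

```java
final class RetrievedTextSketch {
    // Sentence-ending characters, copied as listed in the rule above.
    private static final String ENDINGS = ".!/";

    /** Appends a period when the subject does not end with one of ENDINGS. */
    static String closeSubject(String subject) {
        String s = subject.trim();
        if (s.isEmpty()) {
            return s;
        }
        char last = s.charAt(s.length() - 1);
        return ENDINGS.indexOf(last) >= 0 ? s : s + ".";
    }

    /** Replaces each line break with a space, then trims both ends. */
    static String flatten(String contents) {
        return contents.replaceAll("\\r?\\n", " ").trim();
    }

    /** Assumption: subject and message are joined with a single space. */
    static String toText(String subject, String message) {
        return flatten(closeSubject(subject) + " " + message);
    }

    public static void main(String[] args) {
        System.out.println(toText("Dosage question", "How much\nshould I take"));
    }
}
```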

  • IV. Frequency (word count)
    Word counts (WC) are calculated from ChrText.out and saved in ChrText.wc.coreLc.out:
    • Each text is tokenized by space/tab ("\\s+")
    • Each token is lowercased
    • The CoreTerm of each token is used (unnecessary leading or trailing punctuation is removed)
    • Each token is trimmed
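
    A minimal Java sketch of the counting steps above is shown here. WordCountSketch and its methods are hypothetical names. The coreTerm method is a simplified stand-in that strips all leading and trailing punctuation, whereas the CoreTerm used by the project removes only unnecessary punctuation, so results can differ for tokens such as drug(s).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    /** Simplified stand-in for CoreTerm: strips leading and trailing punctuation. */
    static String coreTerm(String token) {
        return token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    /** Counts lowercased core terms over all texts. */
    static Map<String, Integer> wordCount(Iterable<String> texts) {
        Map<String, Integer> wc = new TreeMap<>();
        for (String text : texts) {
            // Tokenize on runs of whitespace (spaces, tabs).
            for (String token : text.trim().split("\\s+")) {
                String term = coreTerm(token.toLowerCase(Locale.ROOT)).trim();
                if (!term.isEmpty()) {
                    wc.merge(term, 1, Integer::sum);
                }
            }
        }
        return wc;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(java.util.Arrays.asList("The dose, the DOSE!")));
    }
}
```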

  • V. Retrieve Candidates of Spelling Error Words
    Words that are both low frequency and OOV (out of vocabulary) are considered candidates for spelling error words. They are retrieved to errWordCandidates.out by the following algorithm:
    • CoreTerm is used (input is in the form of coreTerm)
    • Low frequency (WC <= 5)
    • OOV (not in the dictionary; the dictionary uses Lexicon element words and numbers, which gives slightly more coverage than the baseline dictionary)
      • Handles possessives (e.g. wife's is converted to wife, then checked)
      • Handles parenthetic plural forms (e.g. drug(s) is converted to drug, then checked)
      • Handles multiple terms connected by a slash (e.g. CASE/TEST is split into the two words CASE and TEST, which are then checked individually)
    • Not purely digits (e.g. 123.50)
    • Not purely punctuation
    • Not a combination of digits and punctuation
    • Not a measurement (e.g. 120mg/10Kg, but not 120mg)
    • Not a URL
    • Not an email address
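
    The candidate filter above might be sketched in Java as follows. All names (ErrCandidateSketch, lookupForms, isOov, isExcluded, isCandidate) are hypothetical, the dictionary is assumed to be a set of lowercased words, and the URL and email patterns are simplified guesses. A slash-connected term is assumed to be OOV when any of its parts is missing from the dictionary. The measurement check is left out because it depends on the unit list, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ErrCandidateSketch {
    static final int MAX_WC = 5; // low-frequency threshold (WC <= 5)

    /** Expands a core term into the words that are looked up in the dictionary. */
    static List<String> lookupForms(String coreTerm) {
        List<String> forms = new ArrayList<>();
        for (String part : coreTerm.split("/")) {            // CASE/TEST -> CASE, TEST
            String w = part.toLowerCase(Locale.ROOT);
            if (w.endsWith("'s")) {
                w = w.substring(0, w.length() - 2);          // wife's -> wife
            }
            if (w.endsWith("(s)")) {
                w = w.substring(0, w.length() - 3);          // drug(s) -> drug
            }
            if (!w.isEmpty()) {
                forms.add(w);
            }
        }
        return forms;
    }

    /** Assumption: a term is OOV when any of its lookup forms is not in the dictionary. */
    static boolean isOov(String coreTerm, Set<String> dictionary) {
        for (String w : lookupForms(coreTerm)) {
            if (!dictionary.contains(w)) {
                return true;
            }
        }
        return false;
    }

    /** Digits, punctuation, a mix of both, URLs, and email addresses are skipped. */
    static boolean isExcluded(String term) {
        return term.matches("[\\p{Punct}\\d]+")
            || term.matches("(?i)(https?://|www\\.)\\S+")       // simplified URL pattern
            || term.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");    // simplified email pattern
    }

    static boolean isCandidate(String term, int wc, Set<String> dictionary) {
        return wc <= MAX_WC && !isExcluded(term) && isOov(term, dictionary);
    }

    public static void main(String[] args) {
        Set<String> dict = new java.util.HashSet<>(java.util.Arrays.asList("wife", "drug"));
        System.out.println(isCandidate("diabetis", 2, dict));
        System.out.println(isCandidate("wife's", 1, dict));
    }
}
```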

  • VI. Generate NER Test Set (TestSetTextObj.java)
    • Inputs:
      • ChrText.out
      • errWordCandidates.out
      • lexNumDic.data
      • unit.data
      • maxErrNo (1000)

    • Algorithm:
      • Go through all files and count the OOV_LWC and OOV words in each
      • Sort by OOV_LWC, then OOV, then Text
      • Print each file while the running total of OOV_LWC is less than maxErrNo

    • Outputs:
      The test set is generated in three formats:
      • testSet.out (text format, used for tagging)
      • testSet.out.vtt (VTT format, which provides visual tagging to ease manual tagging)
        It is read into VTT and then saved as a PDF file (testSetTag.pdf)
      • testSet.out.all (for all files)

      File Format:

      Source file name    OOV_LWC    OOV    Text

      Results Stats:

      • File No: 226
      • OOV_LWC No: 1002
      • OOV No: 1073
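
    The count, sort, and print steps of the algorithm in this section could be sketched in Java as below. This is not TestSetTextObj.java itself: the class TestSetSelectionSketch, the Row type, and the select method are hypothetical. The sketch assumes that the two counts sort in descending order and the text in ascending order, and that a file is printed whenever the running OOV_LWC total is still below maxErrNo before that file is added, which would let the final total slightly exceed the limit.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TestSetSelectionSketch {
    /** One output row: source file name, OOV_LWC count, OOV count, text. */
    static final class Row {
        final String fileName;
        final int oovLwc;
        final int oov;
        final String text;

        Row(String fileName, int oovLwc, int oov, String text) {
            this.fileName = fileName;
            this.oovLwc = oovLwc;
            this.oov = oov;
            this.text = text;
        }
    }

    /** Returns the rows to print, in sorted order. */
    static List<Row> select(List<Row> rows, int maxErrNo) {
        // Assumption: counts are sorted in descending order, text in ascending order.
        Comparator<Row> order = Comparator.comparingInt((Row r) -> r.oovLwc).reversed()
                .thenComparing(Comparator.comparingInt((Row r) -> r.oov).reversed())
                .thenComparing((Row r) -> r.text);
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(order);

        List<Row> selected = new ArrayList<>();
        int total = 0;
        for (Row r : sorted) {
            if (total >= maxErrNo) {
                break; // stop once the running total has reached the limit
            }
            selected.add(r);
            total += r.oovLwc;
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("a.txt", 1, 2, "x"));
        rows.add(new Row("b.txt", 3, 3, "y"));
        for (Row r : select(rows, 1000)) {
            System.out.println(r.fileName + " " + r.oovLwc + " " + r.oov + " " + r.text);
        }
    }
}
```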

  • VII. Annotate NER Test Set (Brat)
    • Non-English customer queries are removed during Brat annotation:
      • 1-133262975.xml.txt: Spanish
      • 14030.txt: Spanish