Distilled MEDLINE N-Gram Set

I. Introduction

The MEDLINE n-gram set includes many invalid LMWs that are not needed for most NLP research. LSG developed a set of exclusive filters to remove them. Applying these filters removes about 2/3 of the n-grams in a MEDLINE n-gram set release. The resulting filtered set is called the distilled MEDLINE n-gram set.

II. Precision and Recall

Compared with the MEDLINE n-gram set, the distilled MEDLINE n-gram set has higher precision and similar recall in terms of valid multiwords. LSG tests the accuracy of every exclusive filter by applying it to the Lexicon, whose entries are valid LMWs. The minimum passing rate is 99.99%. In other words, the filters remove invalid LMWs while removing almost no valid LMWs. A simple calculation follows:

  • The n-gram set includes N valid LMWs (TP0) and M invalid LMWs (FP0)
  • A series of filters removes X valid LMWs (TP1) and Y invalid LMWs (FP1)
  • The distilled n-gram set has (N-X) valid LMWs (TP2) and (M-Y) invalid LMWs (FP2)
  • If the passing rate in the accuracy test is very high (99.99%), then
    • X is a very small number (almost 0)
    • Y is a large number (almost 2/3 of N+M)

  • The precision
    = (retrieved and relevant)/(total retrieved)
    = TP2/(TP2+FP2)
    = (N-X)/(N-X + M-Y)
    ≈ N/(N+M-Y) (> N/(N+M)), because X ≈ 0

  • The recall
    = (retrieved and relevant)/(total relevant)
    = TP2/(TP2 + FN2), where FN0 is the number of valid LMWs missing from the n-gram set and FN2 = FN0 + X
    = (N-X)/(N + FN0)
    ≈ N/(N + FN0) (= TP0/(TP0 + FN0)), because X ≈ 0
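
As a rough numeric illustration of the calculation above, the following sketch plugs made-up counts into the precision and recall formulas. The values of N, M, X, Y, and FN0 are hypothetical and are not measurements from any MEDLINE release; the function names precision and recall are also only for illustration. Valid LMWs removed by the filters are counted as additional false negatives, so the total number of relevant LMWs (N + FN0) stays fixed.

```shell
#!/bin/sh
# precision TP FP -> TP/(TP+FP), printed with 4 decimal places
precision() {
    awk -v tp="$1" -v fp="$2" 'BEGIN { printf "%.4f\n", tp / (tp + fp) }'
}

# recall TP FN -> TP/(TP+FN), printed with 4 decimal places
recall() {
    awk -v tp="$1" -v fn="$2" 'BEGIN { printf "%.4f\n", tp / (tp + fn) }'
}

# Hypothetical counts (NOT real MEDLINE numbers):
N=300000      # valid LMWs in the n-gram set (TP0)
M=700000      # invalid LMWs in the n-gram set (FP0)
X=10          # valid LMWs removed by the filters (TP1), almost 0
Y=650000      # invalid LMWs removed by the filters (FP1), close to 2/3 of N+M
FN0=100000    # valid LMWs missing from the n-gram set

echo "precision before: $(precision "$N" "$M")"
echo "precision after:  $(precision $((N - X)) $((M - Y)))"
echo "recall before:    $(recall "$N" "$FN0")"
# Valid LMWs removed by the filters become additional false negatives.
echo "recall after:     $(recall $((N - X)) $((FN0 + X)))"
```

With these made-up counts, precision rises sharply after filtering while recall is unchanged to four decimal places, which is the pattern the calculation predicts when X is almost 0.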

III. Conclusion

The distilled MEDLINE n-gram Set vs. MEDLINE n-gram Set

  • All exclusive filters have an accuracy rate of at least 99.99% (tested on the Lexicon)
  • Smaller data set (about 1/3 of the original size)
  • Better precision
  • Similar recall
  • Can be used as a baseline for further analysis

IV. Release Process

  • Dir: ${MULTIWORD_DIR}
  • Script: manually add the n-gram count for ${YEAR} to ${MULTIWORD_DIR}/bin/05.ApplyFilters. To get the count:
    shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/02.NGram/nGrams
    shell>wc -l nGramSet.${YEAR}.30

    Year    nGram Number
    2014    17,023,819
    2015    18,148,692
    2016    19,325,338
    2017    21,963,037
    2018    23,171,133

    This count is required to compute the correct pass rate (percentage).
  • Input Data:
    All of the following files must be set up before running the program (05.ApplyFilters).
    For the lead and end term files, 03.LeadEndTerm ${YEAR} should be run to update the data; however, reusing the previous year's data is acceptable.
    • n-gram.${YEAR} (Step 1):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/05.ApplyFilters
      shell>ln -sf ../02.NGram/nGrams/nGramSet.${YEAR}.30 nGram.${YEAR}
    • NRVAR (Step 13):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
      shell>ln -sf nfsvol/lex/Lu/Backup/Releases/UMLS/${YEAR}_AA_release/LEX/NUMBERS/NRVAR NRVAR
    • stopWords.data (Step 14):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
      shell>cp -p ../../${PREV_YEAR}/inData/stopWords.data.${PREV_YEAR} stopWords.data.${YEAR}
      shell>ln -sf ./stopWords.data.${YEAR} stopWords.data
    • unit.data (Step 24):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
      shell>cp -p ../../${PREV_YEAR}/inData/unit.data.${PREV_YEAR} unit.data.${YEAR}
      shell>ln -sf ./unit.data.${YEAR} unit.data
    • invalidLeadTerms.data.abs (Step 30):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidLeadTerms.data.${PREV_YEAR} invalidLeadTerms.data.${YEAR}
      shell>ln -sf ./invalidLeadTerms.data.${YEAR} invalidLeadTerms.data.abs
    • invalidEndTerms.data.abs (Step 31):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidEndTerms.data.${PREV_YEAR} invalidEndTerms.data.${YEAR}
      shell>ln -sf ./invalidEndTerms.data.${YEAR} invalidEndTerms.data.abs
    • invalidLeadEndTermCandidates.data (Step 32):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidLeadEndTermCandidates.data .
      If you run 03.LeadEndTerm, this file may come out the same as the previous year's copy.
    • validLeadTerms.data.pat (Step 33):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/validLeadTerms.data.pat.${PREV_YEAR} validLeadTerms.data.pat.${YEAR}
      shell>ln -sf ./validLeadTerms.data.pat.${YEAR} validLeadTerms.data.pat
    • validEndTerms.data.pat (Step 34):
      shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
      shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/validEndTerms.data.pat.${PREV_YEAR} validEndTerms.data.pat.${YEAR}
      shell>ln -sf ./validEndTerms.data.pat.${YEAR} validEndTerms.data.pat
  • Run Program:
    • shell>cd ${MULTIWORD_DIR}/bin
      shell>05.ApplyFilters ${YEAR}
      1
      10-14
      20-25
      30-34
      40

      or

    • shell>cd 05.ApplyFiltersAll
    • shell>runApplyFilersAll ${YEAR}
  • Output Data:
    • Dir: ${MULTIWORD_DIR}/data/${YEAR}/outData/05.ApplyFilters
    • ApplyFilters.rpt (use this file to update log file)
    • nGram.${YEAR}.${STEP}.${NAME}
    • nGram.${YEAR}.${STEP}.${NAME}.exp
    • nGram.${YEAR}.${STEP}.${NAME}.trap

    • Use nGram.${YEAR}.34.invEndTermPat (the last filtered file) as the distilled n-gram set
    • Distribute it
    • Back it up
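
Most of the input data steps above repeat one pattern: copy the previous year's file under a year-stamped name, then point an unversioned link at the new copy. The sketch below wraps that pattern in a helper. The function name carry_forward and its argument order are hypothetical and are not part of the LSG scripts; the file names in the usage comments are taken from the steps above.

```shell
#!/bin/sh
# carry_forward DIR PREV_DIR BASE LINK YEAR PREV_YEAR
#   Copies PREV_DIR/BASE.PREV_YEAR to DIR/BASE.YEAR (cp -p keeps timestamps),
#   then makes DIR/LINK a symbolic link to ./BASE.YEAR.
# The body runs in a subshell, so the cd and the variables do not leak
# into the calling shell.
carry_forward() (
    dir=$1 prev_dir=$2 base=$3 link=$4 year=$5 prev_year=$6
    cp -p "$prev_dir/$base.$prev_year" "$dir/$base.$year" &&
        cd "$dir" &&
        ln -sf "./$base.$year" "$link"
)

# Example usage (commented out; assumes MULTIWORD_DIR, YEAR and PREV_YEAR are set):
#   in=${MULTIWORD_DIR}/data/${YEAR}/inData
#   prev_in=${MULTIWORD_DIR}/data/${PREV_YEAR}/inData
#   carry_forward "$in" "$prev_in" stopWords.data stopWords.data "$YEAR" "$PREV_YEAR"
#   carry_forward "$in" "$prev_in" unit.data unit.data "$YEAR" "$PREV_YEAR"
#
#   le=${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
#   prev_le=${MULTIWORD_DIR}/data/${PREV_YEAR}/outData/03.LeadEndTerm
#   carry_forward "$le" "$prev_le" invalidLeadTerms.data invalidLeadTerms.data.abs "$YEAR" "$PREV_YEAR"
#   carry_forward "$le" "$prev_le" validLeadTerms.data.pat validLeadTerms.data.pat "$YEAR" "$PREV_YEAR"
```

The link name is a separate argument because some links differ from the copied file's base name, for example invalidLeadTerms.data.abs pointing at invalidLeadTerms.data.${YEAR}.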

V. Release Logs

VI. Run the Test data on the Lexicon

  • Must run 03.LeadEndTerm ${YEAR} first to obtain lexWords.data
  • Run 04.TestFilters ${YEAR} to update the test result for each filter
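
A minimal sketch of how a filter's test result could be checked against the 99.99% minimum passing rate. It assumes the passing rate is the percentage of Lexicon terms that a filter keeps; that definition, the function names, and the counts in the usage comment are assumptions for illustration and are not taken from 04.TestFilters.

```shell
#!/bin/sh
# passing_rate PASSED TOTAL -> percentage of terms kept, 4 decimal places
passing_rate() {
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.4f\n", 100 * p / t }'
}

# meets_minimum PASSED TOTAL [MIN]
#   Exit status 0 if the passing rate is at least MIN percent (default 99.99),
#   non-zero otherwise.
meets_minimum() {
    awk -v p="$1" -v t="$2" -v min="${3:-99.99}" \
        'BEGIN { exit !(100 * p / t >= min) }'
}

# Example usage (hypothetical counts, commented out):
#   if meets_minimum 999950 1000000; then
#       echo "filter passes: $(passing_rate 999950 1000000)%"
#   else
#       echo "filter fails"
#   fi
```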