The MEDLINE N-gram Set Specifications

This page describes the specifications of the MEDLINE n-gram set by LSG:

  • How many grams?
    First, we need to decide the range of n-grams (N). We assessed all terms (valid words) in the Lexicon, under the assumption that the Lexicon is a representative subset of general English. The results show that terms of up to 5 words cover 99.47% of the Lexicon, as shown in the following table. Thus, we decided to generate 1- to 5-grams.

    The Lexicon contains 875,090 words (terms):

    • Single words: 457,335 (52.2615%)
    • Multiwords: 417,755 (47.7385%)

    N     Word Count            Cumulative Word Count
    1     457,335 (52.2615%)    457,335 (52.2615%)
    2     281,857 (32.2089%)    739,192 (84.4704%)
    3     93,011 (10.6287%)     832,203 (95.0991%)
    4     29,905 (3.4174%)      862,108 (98.5165%)
    5     8,358 (0.9551%)       870,466 (99.4716%)
    6     2,846 (0.3252%)       873,312 (99.7968%)
    7     1,211 (0.1384%)       874,523 (99.9352%)
    8     390 (0.0446%)         874,913 (99.9798%)
    9     104 (0.0119%)         875,017 (99.9917%)
    10    29 (0.0033%)          875,046 (99.9950%)
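
    The cumulative column can be re-derived from the per-N counts. The following is a minimal Python sketch of that arithmetic, not the procedure LSG used; the names counts, TOTAL, and cumulative_coverage are illustrative.

```python
# Word counts per term length N (number of words), as listed in the table.
counts = {1: 457_335, 2: 281_857, 3: 93_011, 4: 29_905, 5: 8_358,
          6: 2_846, 7: 1_211, 8: 390, 9: 104, 10: 29}
TOTAL = 875_090  # all terms in the Lexicon, including those longer than 10 words


def cumulative_coverage(counts, total):
    """Return a list of (n, cumulative_count, cumulative_percent) rows."""
    rows, running = [], 0
    for n in sorted(counts):
        running += counts[n]
        rows.append((n, running, 100.0 * running / total))
    return rows


rows = cumulative_coverage(counts, TOTAL)
for n, cum, pct in rows:
    print(f"{n:>2}  {cum:>7,}  {pct:.4f}%")
```

    The N = 5 row yields 870,466 terms, which is the 99.47% coverage that motivates the choice of 1- to 5-grams.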

  • Contents
    Titles and abstracts from MEDLINE.2014

  • Tokenizer
    • Tokenize titles and abstracts into sentences
      • Each title is treated as a separate sentence (by appending a period and a space)
      • 126,612,705 sentences are tokenized
      • The sentence tokenizer reported 14,314 unrecognized-pattern warnings
    • Tokenize all sentences into words
      • Space and tab are used as word boundaries
      • 2,610,209,406 words are tokenized
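
    The word-level steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the LSG tokenizer: the function names are hypothetical, empty tokens from repeated separators are assumed to be dropped, and n-grams are assumed not to cross sentence boundaries. Sentence splitting of abstracts is not shown because its rules are not specified here.

```python
import re


def title_as_sentence(title):
    """Treat a title as its own sentence by appending a period and a space."""
    return title + ". "


def split_words(sentence):
    """Split a sentence into words, using only space and tab as boundaries."""
    # Assumption: empty strings produced by leading/trailing separators are dropped.
    return [w for w in re.split(r"[ \t]+", sentence) if w]


def ngrams(words, max_n=5):
    """Yield every contiguous n-gram (joined by a space) for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


words = split_words("heart\tattack risk")
grams = list(ngrams(words, max_n=2))
```

    Because only space and tab are boundaries, punctuation stays attached to its word in this sketch (for example, the period appended to a title).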

  • Other Information
    Word count and document count are calculated for each n-gram
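
    One way to compute the two counts is sketched below in Python. It assumes that word count means the total number of occurrences of an n-gram and that document count means the number of documents containing it at least once; these definitions, the input layout, and the name count_ngrams are assumptions for illustration.

```python
from collections import Counter


def count_ngrams(documents, max_n=5):
    """Count n-grams over documents.

    documents: iterable of documents; each document is a list of sentences,
    and each sentence is a list of words.
    Returns (word_count, doc_count) as Counters keyed by the n-gram string.
    """
    word_count, doc_count = Counter(), Counter()
    for sentences in documents:
        seen = set()  # n-grams observed in this document
        for words in sentences:
            for n in range(1, max_n + 1):
                for i in range(len(words) - n + 1):
                    gram = " ".join(words[i:i + n])
                    word_count[gram] += 1
                    seen.add(gram)
        doc_count.update(seen)  # each n-gram counts once per document
    return word_count, doc_count


docs = [
    [["heart", "attack"], ["heart"]],  # document 1: two sentences
    [["heart", "attack"]],             # document 2: one sentence
]
wc, dc = count_ngrams(docs)
```

    Here "heart" occurs three times across two documents, so its word count is 3 and its document count is 2.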

  • Filters
    • length of terms (<= 50)
      Using Lexicon.2014 as an example:
      • shortest word: 1
        • a
        • ...
      • longest word: 103
        • matrix-assisted laser desorption/ionization Fourier-transform ion cyclotron resonance mass spectrometry
        • ...
      • Words of 50 characters or fewer cover 99.5508% (871,159/875,090)
    • word count (>= 30)
    • document count (>= 1)
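
    The three filters can be combined into a single predicate. The Python sketch below uses the thresholds listed above and measures length in characters; the function name and parameter names are hypothetical.

```python
def passes_filters(term, word_count, doc_count,
                   max_len=50, min_wc=30, min_dc=1):
    """Return True if an n-gram satisfies all three filters."""
    return (len(term) <= max_len       # length of term, in characters
            and word_count >= min_wc   # word count threshold
            and doc_count >= min_dc)   # document count threshold


longest = ("matrix-assisted laser desorption/ionization Fourier-transform "
           "ion cyclotron resonance mass spectrometry")
keep_short = passes_filters("heart attack", word_count=120, doc_count=45)
keep_long = passes_filters(longest, word_count=120, doc_count=45)
keep_rare = passes_filters("heart attack", word_count=29, doc_count=12)
```

    In this sketch the 103-character Lexicon term is rejected by the length filter regardless of its counts, and an n-gram seen only 29 times is rejected by the word count filter.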