The MEDLINE.2017 N-gram Set

This page describes how n-grams (n = 1-5) were generated from MEDLINE.2017 using a split, combine, and filter algorithm.
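
As a rough illustration of what an n-gram is in this context, the sketch below counts word n-grams in sentences with a sliding window. It is a minimal example, not the generator program used for this set: the whitespace tokenizer, the `word_ngrams` and `count_ngrams` names, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count n-grams over sentences; n-grams never cross sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        # Assumption: plain whitespace tokenization, for illustration only.
        counts.update(word_ngrams(sentence.split(), n))
    return counts


# Hypothetical sample sentences, not taken from MEDLINE.
sample = ["heart failure is common", "chronic heart failure"]
bigram_counts = count_ngrams(sample, 2)
```

Here `bigram_counts["heart failure"]` is 2, because the bigram occurs once in each sample sentence.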

Program | N | Approx. Time (Hr.)
Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split:
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group:
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3.FilterCombine:
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 3.FilterCombine:
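
The steps listed above (split, group, filter and combine, sort) can be sketched in memory on toy data. This is only an illustration under stated assumptions, not the programs named above: the real pipeline works on files, while here each split is a `Counter`; the `group_bounds` letters, the helper names, and the sort key (descending word count, then term) are hypothetical stand-ins, since the exact boundary rule and the meaning of dwt and tdw are not spelled out on this page.

```python
from bisect import bisect_right
from collections import Counter


def group_splits(splits, group_bounds):
    """Merge per-split counts into alphabetic groups keyed by the n-gram string.

    group_bounds is a sorted list of boundary strings; an n-gram falls into
    group i when exactly i boundaries are <= the n-gram.
    """
    groups = [Counter() for _ in range(len(group_bounds) + 1)]
    for split in splits:
        for ngram, wc in split.items():
            groups[bisect_right(group_bounds, ngram)][ngram] += wc
    return groups


def filter_and_combine(groups, min_wc):
    """Keep n-grams whose total word count (WC) is >= min_wc, across all groups."""
    combined = {}
    for group in groups:
        combined.update({g: wc for g, wc in group.items() if wc >= min_wc})
    return combined


def sort_ngrams(combined):
    """Sort by descending word count, then by term (an assumed ordering)."""
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Toy splits: counts for the same n-gram can be spread over several splits.
splits = [Counter({"heart failure": 20, "of the": 25}),
          Counter({"heart failure": 15, "Zika virus": 3, "of the": 40})]
groups = group_splits(splits, group_bounds=["a", "k"])
result = sort_ngrams(filter_and_combine(groups, min_wc=30))
```

In this toy run neither split alone holds 30 occurrences of "heart failure", but the grouped total is 35, so it survives the filter. That is presumably why grouping comes before filtering: every occurrence of a given n-gram lands in the same alphabetic group, so its total count can be computed one group at a time.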
Preprocess 1
  • ~ 1 hr.
  • PmidTiAbS17: 1-892
    
unigrams | n=1 | 0.5 hr.
  • param: 10,1, (150000000)
  • 20 min.

  • Documents: 26,759,399
  • Sentences: 163,021,640
  • Tokens: 3,386,661,350

  • split: 1, no split
  • 1-gram (not unique): 27,261,960
    (already unique because there is no split)

  • Files:
    • nGram.out.1.heap.50.s01.0001-0892 (440 Mb)
  • param:
    • 11,1,01,NO,NO
  • 1 min.

  • Group Alphabetically
  • 1-gram (unique): 27,261,960

  • Files:
    • ${NGram}.g01.NO-NO (440 Mb)
  • param: 12, 1, 30
  • 1 min.

  • 1-gram (WC >= 30): 976,872

  • File:
    • 1-gram.${YEAR}.30 (16 Mb)
  • param: 13, 1, 30
  • 1 min.

  • 1-gram (sorted): 976,872

  • File:
    • 1-gram.${YEAR}.30.dwt (16 Mb)
bigrams | n=2 | 1.5 hr.
  • param: 10,2, (150000000)
  • 1.1 hr.

  • split: 2
  • 2-gram (not unique): 300,462,134

  • Files:
    • s01.0001-0580 (3.1 Gb)
    • s02.0581-0892 (3.0 Gb)
  • param: see file names below
    • 11,2,01,NO,a
    • 11,2,02,a,NO
  • 20 min.

  • Group Alphabetically
  • 2-gram (unique, use wc -l): 258,150,841

  • Files:
    • ${NGram}.g01.NO-a (1.9 Gb)
    • ${NGram}.g02.a-NO (3.4 Gb)
  • param: 12, 2, 30
  • 2 min.

  • 2-gram (WC >= 30): 5,722,210

  • File:
    • 2-gram.${YEAR}.30 (122 Mb)
  • param: 13, 2, 30
  • 1 min.

  • 2-gram (sorted): 5,722,210

  • File:
    • 2-gram.${YEAR}.30.dwt (122 Mb)
trigrams | n=3 | 7 hr.
  • param: 10,3, (150000000)
  • 2.4 hr.

  • split: 9
  • 3-gram (not unique): 1,260,815,630

  • Files:
    • s01.0001-0207 (3.8 Gb)
    • s02.0208-0322 (3.7 Gb)
    • s03.0323-0420 (3.8 Gb)
    • s04.0421-0557 (3.8 Gb)
    • s05.0558-0644 (3.8 Gb)
    • s06.0645-0727 (3.8 Gb)
    • s07.0728-0800 (3.8 Gb)
    • s08.0801-0870 (3.8 Gb)
    • s09.0871-0892 (1.4 Gb)
  • param: see file names below
  • 4.5 hr.

  • Group Alphabetically
  • 3-gram (unique): 887,664,290

  • Files:
    • g01.NO-U (4.9 Gb)
    • g02.U-d (4.3 Gb)
    • g03.d-k (4.2 Gb)
    • g04.k-re (4.3 Gb)
    • g05.re-NO (4.8 Gb)
  • param: 12, 3, 30
  • 10 min.

  • 3-gram (WC >= 30): 8,096,532

  • File:
    • 3-gram.${YEAR}.30 (207 Mb)
  • param: 13, 3, 30
  • 1 min.

  • 3-gram (sorted): 8,096,532

  • File:
    • 3-gram.${YEAR}.30.dwt (207 Mb)
fourgrams | n=4 | 10.5 hr.
  • param: 10,4, (130000000)
  • 3 hr.

  • split: 17
  • 4-gram (not unique): 2,127,650,711

  • Files:
    • s01.0001-0076 (3.9 Gb)
    • s02.0077-0199 (3.9 Gb)
    • s03.0200-0270 (3.9 Gb)
    • s04.0271-0317 (3.9 Gb)
    • s05.0318-0370 (4.0 Gb)
    • s06.0371-0417 (4.0 Gb)
    • s07.0418-0515 (4.0 Gb)
    • s08.0516-0557 (3.9 Gb)
    • s09.0558-0604 (4.0 Gb)
    • s10.0605-0648 (4.0 Gb)
    • s11.0649-0695 (4.0 Gb)
    • s12.0696-0734 (4.0 Gb)
    • s13.0735-0772 (4.0 Gb)
    • s14.0773-0810 (4.0 Gb)
    • s15.0811-0847 (4.0 Gb)
    • s16.0848-0885 (4.0 Gb)
    • s17.0886-0892 (816 Mb)
  • param: see file names below
  • 7.2 hr.

  • Group Alphabetically
  • 4-gram (unique): 1,650,912,612

  • Files:
    • g01.NO-F (4.8 Gb)
    • g02.F-ab (4.9 Gb)
    • g03.ab-b (4.6 Gb)
    • g04.b-d (4.5 Gb)
    • g05.d-fq (4.7 Gb)
    • g06.fq-is (4.7 Gb)
    • g07.is-o (4.2 Gb)
    • g08.o-pm (4.4 Gb)
    • g09.pm-si (4.4 Gb)
    • g10.si-th (2.5 Gb)
    • g11.th-u (4.3 Gb)
    • g12.u-NO (2.9 Gb)
  • param: 12, 4, 30
  • 20 min.

  • 4-gram (WC >= 30): 5,044,153

  • File:
    • 4-gram.${YEAR}.30 (152 Mb)
  • param: 13, 4, 30
  • 1 min.

  • 4-gram (sorted): 5,044,153

  • File:
    • 4-gram.${YEAR}.30.dwt (152 Mb)
fivegrams | n=5 | 17 hr.
  • param: 10,5, (120000000)
  • 3.6 hr.

  • split: 21
  • 5-gram (not unique): 2,448,680,409

  • Files:
    • s01.0001-0064 (4.3 Gb)
    • s02.0065-0112 (4.4 Gb)
    • s03.0113-0232 (4.3 Gb)
    • s04.0233-0279 (4.4 Gb)
    • s05.0280-0316 (4.4 Gb)
    • s06.0317-0360 (4.4 Gb)
    • s07.0361-0398 (4.4 Gb)
    • s08.0399-0481 (4.3 Gb)
    • s09.0482-0523 (4.4 Gb)
    • s10.0524-0557 (4.4 Gb)
    • s11.0558-0596 (4.4 Gb)
    • s12.0597-0632 (4.4 Gb)
    • s13.0633-0668 (4.4 Gb)
    • s14.0669-0704 (4.4 Gb)
    • s15.0705-0735 (4.4 Gb)
    • s16.0736-0766 (4.4 Gb)
    • s17.0767-0797 (4.4 Gb)
    • s18.0798-0827 (4.4 Gb)
    • s19.0828-0857 (4.4 Gb)
    • s20.0858-0888 (4.4 Gb)
    • s21.0889-0892 (619 Mb)
  • param: see file names below
  • 13 hr.

  • Group Alphabetically
  • 5-gram (unique): 2,138,854,513

  • Files:
    • g01.NO-A (3.8 Gb)
    • g02.A-M (4.5 Gb)
    • g03.M-a (4.6 Gb)
    • g04.a-and (3.6 Gb)
    • g05.and-b (4.7 Gb)
    • g06.b-cf (3.5 Gb)
    • g07.cf-d (3.3 Gb)
    • g08.d-em (3.5 Gb)
    • g09.em-g (4.3 Gb)
    • g10.g-inc (4.6 Gb)
    • g11.inc-m (4.2 Gb)
    • g12.m-o (4.1 Gb)
    • g13.o-on (3.7 Gb)
    • g14.on-pp (3.9 Gb)
    • g15.pp-s (4.6 Gb)
    • g16.s-t (4.5 Gb)
    • g17.t-thf (4.8 Gb)
    • g18.thf-v (3.8 Gb)
    • g19.v-NO (3.9 Gb)
  • param: 12, 5, 30
  • 25 min.

  • 5-gram (WC >= 30): 1,812,223

  • File:
    • 5-gram.${YEAR}.30 (64 Mb)
  • param: 13, 5, 30
  • 1 min.

  • 5-gram (sorted): 1,812,223

  • File:
    • 5-gram.${YEAR}.30.dwt (64 Mb)
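
The approximate total time given for each n can be cross-checked by summing the four stage times in its row (the 1 hr. preprocessing step is not included). The sketch below does that arithmetic; the minute values are transcribed from this page, and rounding to the nearest half hour is an assumption about how the totals were approximated.

```python
# Per-stage times in minutes, transcribed from the rows above
# (options 10, 11, 12, 13 in that order).
stage_minutes = {
    1: [20, 1, 1, 1],
    2: [66, 20, 2, 1],      # 1.1 hr. = 66 min.
    3: [144, 270, 10, 1],   # 2.4 hr., 4.5 hr.
    4: [180, 432, 20, 1],   # 3 hr., 7.2 hr.
    5: [216, 780, 25, 1],   # 3.6 hr., 13 hr.
}


def total_hours(minutes):
    """Sum stage times and round to the nearest half hour."""
    return round(sum(minutes) / 60 * 2) / 2


totals = {n: total_hours(m) for n, m in stage_minutes.items()}
for n, hours in totals.items():
    print(f"n={n}: about {hours} hr.")
```

The grouping step (option 11) dominates the run time from n = 3 upward, which is consistent with the growing number of split and group files listed for the larger n-grams.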