The MEDLINE.2018 N-gram Set

This page describes the details of generating n-grams (n = 1-5) from MEDLINE.2018 using the split, combine, and filter algorithm.
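As a rough illustration of what generating n-grams for n = 1-5 involves, the sketch below counts word n-grams over sentences that have already been split out per document. It is not the code behind GetNGramFromSentenceFiles: the whitespace tokenization, the function names, and the reading of MAX_CL = 50 as a maximum character length per n-gram are all assumptions made for this example. It tracks a word count (total occurrences) and a document count for each n-gram.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Set, Tuple

MAX_CL = 50  # assumed meaning: maximum character length of one n-gram


def ngrams(tokens: List[str], n: int, max_cl: int = MAX_CL) -> Iterator[str]:
    """Yield space-joined word n-grams that are at most max_cl characters long."""
    for i in range(len(tokens) - n + 1):
        gram = " ".join(tokens[i:i + n])
        if len(gram) <= max_cl:
            yield gram


def count_ngrams(
    sentences: Iterable[Tuple[str, str]], n: int
) -> Tuple[Counter, Counter]:
    """Return (word_count, doc_count) for an iterable of (doc_id, sentence)."""
    word_count: Counter = Counter()
    docs_seen: Dict[str, Set[str]] = {}
    for doc_id, sentence in sentences:
        # Hypothetical tokenizer: plain whitespace split.
        for gram in ngrams(sentence.split(), n):
            word_count[gram] += 1
            docs_seen.setdefault(gram, set()).add(doc_id)
    doc_count = Counter({g: len(ids) for g, ids in docs_seen.items()})
    return word_count, doc_count


# Made-up sample data: two documents, three sentences.
sample = [
    ("1", "heart failure is common"),
    ("1", "heart failure is treatable"),
    ("2", "chronic heart failure"),
]
wc, dc = count_ngrams(sample, 2)
print(wc["heart failure"], dc["heart failure"])
```

Holding every count in one in-memory table like this only works for small inputs. With billions of tokens, the table for n >= 2 is presumably too large, which is the motivation for the split and group steps described below.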

Program, N, Approx. Time (Hr.)

Option 1.1
  • GenPmidTiAbSentenceFromXmls
  • pubmed{YY}n{DDDD}.xml
  • PmidTiAbSentences{YY}n{DDDD}.txt
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split:
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group:
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3.FilterCombine:
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 3.FilterCombine:
Preprocess 1
  • ~ 2 hr.
  • PmidTiAbS18: 1-928
    
unigrams (n=1), 1.0 hr.
  • param: 10,1, (150000000)
  • 49 min.

  • Documents: 27,837,540
  • Sentences: 174,395,209
  • Tokens: 3,585,789,820

  • split: 1, no split
  • 1-grams (not unique): 30,229,399
    (it is unique because there is no split; counted with wc -l)

  • Files:
    • nGram.out.1.heap.50.s01.0001-0928 (491 Mb, use ls -alh)
  • param:
    • 11,1,01,NO,NO
  • 1 min.

  • Group Alphabetically
  • 1-gram (unique): 30,229,399

  • Files:
    • ${NGram}.g01.NO-NO (491 Mb)
  • param: 12, 1, 30
  • 1 min

  • 1-gram (WC >= 30): 1,022,412

  • File:
    • 1-gram.${YEAR}.30 (17 Mb)
  • param: 13, 1, 30
  • 1 min.

  • 1-gram (sorted): 1,022,412

  • File:
    • 1-gram.${YEAR}.30.dwt (17 Mb)
bigrams (n=2), 2.2 hr.
  • param: 10,2, (150000000)
  • 1.5 hr.

  • split: 2
  • 2-gram (not unique): 335,654,488

  • Files:
    • s01.0001-0584 (3.1 Gb)
    • s02.0585-0889 (3.0 Gb)
    • s03.0890-0928 (706 Mb)
  • param: see file names below
    • 11,2,01,NO,a
    • 11,2,02,a,NO
  • 29 min.

  • Group Alphabetically
  • 2-gram (unique, use wc -l): 273,853,147

  • Files:
    • ${NGram}.g01.NO-a (2.0 Gb)
    • ${NGram}.g02.a-NO (3.6 Gb)
  • param: 12, 2, 30
  • 5 min.

  • 2-gram (WC >= 30): 6,000,100

  • File:
    • 2-gram.${YEAR}.30 (128 Mb)
  • param: 13, 2, 30
  • 1 min.

  • 2-gram (sorted): 6,000,100

  • File:
    • 2-gram.${YEAR}.30.dwt (128 Mb)
trigrams (n=3), 11.0 hr.
  • param: 10,3, (150000000)
  • 3.1 hr.

  • split: 9
  • 3-gram (not unique): 1,334,003,772

  • Files:
    • s01.0001-0209 (3.8 Gb)
    • s02.0210-0324 (3.7 Gb)
    • s03.0325-0422 (3.8 Gb)
    • s04.0423-0559 (3.7 Gb)
    • s05.0560-0647 (3.8 Gb)
    • s06.0648-0729 (3.8 Gb)
    • s07.0730-0801 (3.8 Gb)
    • s08.0802-0871 (3.8 Gb)
    • s09.0872-0928 (3.2 Gb)
  • param: see file names below
  • 7.8 hr.

  • Group Alphabetically
  • 3-gram (unique): 935,932,433

  • Files:
    • g01.NO-U (5.2 Gb)
    • g02.U-d (4.5 Gb)
    • g03.d-k (4.4 Gb)
    • g04.k-re (4.5 Gb)
    • g05.re-NO (5.1 Gb)
  • param: 12, 3, 30
  • 18 min.

  • 3-gram (WC >= 30): 8,534,524

  • File:
    • 3-gram.${YEAR}.30 (219 Mb)
  • param: 13, 3, 30
  • 2 min.

  • 3-gram (sorted): 8,534,524

  • File:
    • 3-gram.${YEAR}.30.dwt (219 Mb)
fourgrams (n=4), 15.5 hr.
  • param: 10,4, (130000000)
  • 4 hr.

  • split: 17
  • 4-gram (not unique): 2,251,050,930

  • Files:
    • s01.0001-0077 (4.0 Gb)
    • s02.0078-0204 (3.9 Gb)
    • s03.0205-0272 (3.9 Gb)
    • s04.0273-0319 (3.9 Gb)
    • s05.0320-0372 (4.0 Gb)
    • s06.0373-0419 (4.0 Gb)
    • s07.0420-0517 (4.0 Gb)
    • s08.0518-0559 (3.9 Gb)
    • s09.0560-0606 (4.0 Gb)
    • s10.0607-0650 (4.0 Gb)
    • s11.0651-0697 (4.0 Gb)
    • s12.0698-0736 (4.0 Gb)
    • s13.0737-0774 (4.0 Gb)
    • s14.0775-0812 (4.0 Gb)
    • s15.0813-0849 (4.0 Gb)
    • s16.0850-0887 (4.1 Gb)
    • s17.0888-0924 (4.0 Gb)
    • s18.0925-0928 (494 Mb)
  • param: see file names below
  • 11.1 hr.

  • Group Alphabetically
  • 4-gram (unique): 1,740,156,534

  • Files:
    • g01.NO-F (5.1 Gb)
    • g02.F-ab (5.2 Gb)
    • g03.ab-b (4.9 Gb)
    • g04.b-d (4.7 Gb)
    • g05.d-fq (4.9 Gb)
    • g06.fq-is (5.0 Gb)
    • g07.is-o (4.4 Gb)
    • g08.o-pm (4.6 Gb)
    • g09.pm-si (4.6 Gb)
    • g10.si-th (2.6 Gb)
    • g11.th-u (4.5 Gb)
    • g12.u-NO (3.1 Gb)
  • param: 12, 4, 30
  • 30 min.

  • 4-gram (WC >= 30): 5,348,132

  • File:
    • 4-gram.${YEAR}.30 (162 Mb)
  • param: 13, 4, 30
  • 2 min.

  • 4-gram (sorted): 5,348,132

  • File:
    • 4-gram.${YEAR}.30.dwt (162 Mb)
fivegrams (n=5), 22.5 hr.
  • param: 10,5, (120000000)
  • 4.3 hr.

  • split: 18
  • 5-gram (not unique): 2,588,773,641

  • Files:
    • s01.0001-0064 (4.3 Gb)
    • s02.0065-0112 (4.3 Gb)
    • s03.0113-0233 (4.3 Gb)
    • s04.0234-0279 (4.3 Gb)
    • s05.0280-0316 (4.4 Gb)
    • s06.0317-0360 (4.4 Gb)
    • s07.0361-0398 (4.4 Gb)
    • s08.0399-0482 (4.4 Gb)
    • s09.0483-0524 (4.4 Gb)
    • s10.0525-0558 (4.4 Gb)
    • s11.0559-0597 (4.4 Gb)
    • s12.0598-0633 (4.4 Gb)
    • s13.0634-0671 (4.4 Gb)
    • s14.0672-0705 (4.4 Gb)
    • s15.0706-0736 (4.4 Gb)
    • s16.0737-0767 (4.4 Gb)
    • s17.0768-0798 (4.4 Gb)
    • s18.0799-0828 (4.4 Gb)
    • s19.0829-0858 (4.4 Gb)
    • s20.0859-0889 (4.5 Gb)
    • s21.0890-0919 (4.5 Gb)
    • s22.0920-0928 (1.3 Gb)
  • param: see file names below
  • 17.5 hr.

  • Group Alphabetically
  • 5-gram (unique): 2,255,702,245

  • Files:
    • g01.NO-A (4.0 Gb)
    • g02.A-M (4.7 Gb)
    • g03.M-a (4.8 Gb)
    • g04.a-and (3.8 Gb)
    • g05.and-b (5.0 Gb)
    • g06.b-cf (3.7 Gb)
    • g07.cf-d (3.5 Gb)
    • g08.d-em (3.6 Gb)
    • g09.em-g (4.5 Gb)
    • g10.g-inc (4.9 Gb)
    • g11.inc-m (4.5 Gb)
    • g12.m-o (4.3 Gb)
    • g13.o-on (3.9 Gb)
    • g14.on-pp (4.1 Gb)
    • g15.pp-s (4.9 Gb)
    • g16.s-t (4.8 Gb)
    • g17.t-thf (5.0 Gb)
    • g18.thf-v (4.0 Gb)
    • g19.v-NO (4.2 Gb)
  • param: 12, 5, 30
  • 42 min.

  • 5-gram (WC >= 30): 2,265,965

  • File:
    • 5-gram.${YEAR}.30 (80 Mb)
  • param: 13, 5, 30
  • 1 min.

  • 5-gram (sorted): 2,265,965

  • File:
    • 5-gram.${YEAR}.30.dwt (80 Mb)
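
The steps reported for each n (split, group alphabetically, filter by WC and combine, sort) can be pictured with the small in-memory sketch below. It is a toy built on assumptions, not the actual GetNGramFromSentenceFiles, GroupSpliteNGrams, FilterWcCombineNGrams, or NGramFilter programs. The function names, the `max_entries` cap (standing in for the parenthesized value in the option 10 params), the boundary keys, and the sort order are all hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def split_count(grams: Iterable[str], max_entries: int) -> List[Counter]:
    """Step 1: count n-grams, starting a new partial table when one is full."""
    parts: List[Counter] = []
    current: Counter = Counter()
    for gram in grams:
        if gram not in current and len(current) >= max_entries:
            parts.append(current)  # table is full: close this split
            current = Counter()
        current[gram] += 1
    if current:
        parts.append(current)
    return parts


def group_alphabetically(parts: List[Counter], boundaries: List[str]) -> List[Counter]:
    """Step 2: merge the partial tables into alphabetic groups.

    boundaries must be sorted; group i holds n-grams g with
    boundaries[i-1] <= g < boundaries[i], open-ended at both ends.
    """
    groups = [Counter() for _ in range(len(boundaries) + 1)]
    for part in parts:
        for gram, count in part.items():
            groups[bisect_right(boundaries, gram)][gram] += count
    return groups


def filter_combine(groups: List[Counter], min_wc: int) -> Dict[str, int]:
    """Step 3: keep n-grams with word count >= min_wc and combine the groups."""
    combined: Dict[str, int] = {}
    for group in groups:
        for gram, count in group.items():
            if count >= min_wc:
                combined[gram] = count
    return combined


def sort_by_count(combined: Dict[str, int]) -> List[Tuple[str, int]]:
    """Step 4: sort by descending count, then by term (assumed order)."""
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up input: six bigram occurrences, three distinct bigrams.
grams = ["a b", "c d", "a b", "e f", "a b", "c d"]
parts = split_count(grams, max_entries=2)
groups = group_alphabetically(parts, boundaries=["c"])
result = sort_by_count(filter_combine(groups, min_wc=2))
print(len(parts), sum(len(p) for p in parts), result)
```

In this toy run the same bigram lands in more than one partial table, so the summed table sizes exceed the number of distinct bigrams. This may be the same effect as the gap between the "not unique" counts after splitting and the "unique" counts after grouping reported above. Because each alphabetic group covers a disjoint key range, the groups can be filtered independently and then simply concatenated.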