The MEDLINE.2015 N-gram Set

This page describes how n-grams (n = 1-5) were generated from MEDLINE.2015 using the split-and-combine algorithm.

Program             Preprocess  unigrams  bigrams  trigrams  fourgrams  fivegrams
N                               n=1       n=2      n=3       n=4        n=5
Approx. Time (Hr.)  1           0.4       1.4      4.1       18.7       25.5
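
The approximate totals can be cross-checked against the per-step times listed under Options 10-13 on this page. The sketch below is only that arithmetic; the names `STEP_MINUTES` and `total_hours` are hypothetical and are not part of the original programs.

```python
# Per-step times in minutes, copied from the Option 10, 11, 12 and 13
# entries on this page (split, group, filter/combine, sort).
STEP_MINUTES = {
    1: [20, 1, 1, 1],
    2: [55, 16, 3, 1],
    3: [110, 125, 10, 1],
    4: [343, 760, 18, 1],
    5: [660, 845, 25, 1],
}


def total_hours(n):
    """Sum the four step times for n-grams of size n and convert to hours."""
    return sum(STEP_MINUTES[n]) / 60


for n in sorted(STEP_MINUTES):
    print(f"n={n}: {total_hours(n):.2f} hr")
```

The sums for n = 1, 3, 4 and 5 round to the listed figures. The bigram steps sum to 1.25 hr against the listed 1.4, so that figure may include time not itemized on this page.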
Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
  • < 1 hr.
  • PmidTiAbS15: 1-779
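
As an illustration of the file naming pattern above, the sketch below enumerates the 779 sentence file names. It assumes that {YY} is the two-digit year and {DDDD} a zero-padded file number; both readings are assumptions, and the function names are hypothetical.

```python
def sentence_file_name(year, file_no):
    """Build a name following the PmidTiAbSentences{YY}n{DDDD}.txt pattern.

    Assumption: YY is the two-digit year, DDDD a zero-padded file number.
    """
    return f"PmidTiAbSentences{year % 100:02d}n{file_no:04d}.txt"


def sentence_files(year=2015, first=1, last=779):
    """List the file names for the range first..last, inclusive."""
    return [sentence_file_name(year, i) for i in range(first, last + 1)]


names = sentence_files()
print(names[0], names[-1], len(names))
```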
     
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split:
 
  • param: 10,1
  • 20 min.

  • split: 1, no split
  • Documents: 23,343,329
  • Sentences: 134,834,507
  • Tokens: 2,786,085,158
  • n-grams (unique tokens): 22,779,973

  • nGram.out.1.heap.50.s01.0001-0779
  • param: 10,2
  • 55 min.

  • split: 2
  • 2-gram (not unique): 252,869,058

  • ${NGram}.s01.0001-0591
  • ${NGram}.s02.0592-0779
  • param: 10,3
  • 1 hr. 50 min.

  • split: 7
  • 3-gram (not unique): 1,018,482,231

  • s01.0001-0298
  • s02.0299-0403
  • s03.0404-0492
  • s04.0493-0573
  • s05.0574-0651
  • s06.0652-0725
  • s07.0726-0779
  • param: 10,4
  • 5 hr. 43 min.

  • split: 12
  • 4-gram (not unique): 1,717,419,118

  • s01.0001-0238
  • s02.0239-0315
  • s03.0316-0376
  • s04.0377-0431
  • s05.0432-0482
  • s06.0483-0532
  • s07.0533-0580
  • s08.0581-0627
  • s09.0628-0673
  • s10.0674-0718
  • s11.0719-0762
  • s12.0763-0779
  • param: 10,5
  • 11 hr.

  • split: 14
  • 5-gram (not unique): 1,991,428,282

  • s01.0001-0226
  • s02.0227-0295
  • s03.0296-0350
  • s04.0351-0399
  • s05.0400-0444
  • s06.0445-0488
  • s07.0489-0531
  • s08.0532-0573
  • s09.0574-0614
  • s10.0615-0655
  • s11.0656-0695
  • s12.0696-0734
  • s13.0735-0772
  • s14.0773-0779
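
The split step can be pictured with a minimal sketch. It is not the code of GetNGramFromSentenceFiles: whitespace tokenization, n-grams that stay inside one sentence, and reading MAX_CL as a cap on the character length of an n-gram are all assumptions, and the function names are hypothetical.

```python
from collections import Counter

MAX_CL = 50  # value from this page; assumed to cap n-gram character length


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens, joined by single spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def count_ngrams(docs, n, max_cl=MAX_CL):
    """Count n-grams over docs, where each doc is a list of sentence strings.

    Returns (word_count, doc_count): total occurrences of each n-gram and
    the number of documents that contain it at least once.
    """
    word_count, doc_count = Counter(), Counter()
    for sentences in docs:
        seen = set()
        for sentence in sentences:
            tokens = sentence.split()  # assumption: whitespace tokenization
            for gram in ngrams(tokens, n):
                if len(gram) <= max_cl:
                    word_count[gram] += 1
                    seen.add(gram)
        doc_count.update(seen)  # each document counts once per n-gram
    return word_count, doc_count
```

Under this reading, each split (s01, s02, ...) would hold the counts for one range of sentence files, so the same n-gram can appear in several splits. That would account for the "not unique" totals reported above.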
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group:
 
  • param:
    • 11,1,01,NO,NO
  • 1 min.

  • ${NGram}.g01.NO-NO

  • 1-gram: 22,779,973
  • size: 367 Mb
  • param:
    • 11,2,01,NO,a
    • 11,2,02,a,NO
  • 16 min.

  • ${NGram}.g01.NO-a
  • ${NGram}.g02.a-NO

  • 2-gram: 217,447,811
  • size: 4.4 Gb
  • param:
    • 11,3,01,NO,Z
    • 11,3,02,Z,e
    • 11,3,03,e,k
    • 11,3,04,k,s
    • 11,3,05,s,NO
  • 2 hr. 5 min.

  • g01.NO-Z
  • g02.Z-e
  • g03.e-k
  • g04.k-s
  • g05.s-NO

  • 3-gram: 744,721,406
  • size: 19 Gb
  • param:
    • g01.NO-F
    • g02.F-a
    • g03.a-c
    • g04.c-e
    • g05.e-f
    • g06.f-k
    • g07.k-p
    • g08.p-s
    • g09.s-u
    • g10.u-NO
  • 12 hr. 40 min.

  • 4-gram: 1,375,850,664
  • size: 42 Gb
  • param:
    • g01.NO-A
    • g02.A-M
    • g03.M-a
    • g04.a-b
    • g05.b-d
    • g06.d-f
    • g07.f-i
    • g08.i-l
    • g09.l-n
    • g10.n-p
    • g11.p-r
    • g12.r-s
    • g13.s-t
    • g14.t-u
    • g15.u-NO
  • 14 hr. 5 min.

  • 5-gram: 1,772,937,004
  • size: 64 Gb
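
The grouping step can be sketched as follows. This is an illustration, not the original program: it assumes that "NO" marks an open bound, that the lower bound is inclusive and the upper bound exclusive, and that n-grams are compared as plain strings (so uppercase letters sort before lowercase ones, which matches boundaries such as Z-e). The function names are hypothetical.

```python
from collections import Counter


def in_range(term, lower, upper):
    """True if lower <= term < upper; "NO" means no bound on that side."""
    above = lower == "NO" or term >= lower
    below = upper == "NO" or term < upper
    return above and below


def group_splits(split_counts, lower, upper):
    """Merge the counts from every split for terms inside [lower, upper)."""
    merged = Counter()
    for counts in split_counts:
        for term, count in counts.items():
            if in_range(term, lower, upper):
                merged[term] += count
    return merged


# Two toy splits, grouped with the trigram boundaries listed above.
splits = [
    Counter({"Zinc finger": 2, "cell cycle": 3}),
    Counter({"cell cycle": 1, "the cell": 4}),
]
bounds = [("NO", "Z"), ("Z", "e"), ("e", "k"), ("k", "s"), ("s", "NO")]
for lower, upper in bounds:
    print(f"{lower}-{upper}", dict(group_splits(splits, lower, upper)))
```

Because every split is scanned for each range, an n-gram that occurs in several splits ends up as a single merged entry. That is one way to read the drop from the "not unique" totals of the split step to the counts listed here.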
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3.FilterCombine:
 
  • param: 12, 1, 30
  • 1 min.

  • 1-gram.2015.30: 843,206
  • param: 12, 2, 30
  • 3 min.

  • 2-gram.2015.30: 4,845,965
  • param: 12, 3, 30
  • 10 min.

  • 3-gram.2015.30: 6,702,194
  • param: 12, 4, 30
  • 18 min.

  • 4-gram.2015.30: 4,082,612
  • param: 12, 5, 30
  • 25 min.

  • 5-gram.2015.30: 1,674,715
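
A minimal sketch of the filter-and-combine step, assuming that WC is the word count of an n-gram and that the threshold is inclusive (WC >= 30). Neither point is stated on this page, and `filter_and_combine` is a hypothetical name.

```python
def filter_and_combine(groups, min_wc=30):
    """Drop n-grams below min_wc and concatenate the groups in order.

    groups: list of dicts mapping n-gram -> word count, already arranged
    in alphabetic group order (g01, g02, ...).
    """
    combined = []
    for group in groups:
        for term in sorted(group):
            if group[term] >= min_wc:  # assumption: threshold is inclusive
                combined.append((term, group[term]))
    return combined


groups = [
    {"Zinc finger": 12, "cell cycle": 45},
    {"the cell": 30, "tumor cells": 29},
]
print(filter_and_combine(groups))
```

Since the groups cover disjoint alphabetic ranges, concatenating them in group order keeps the combined list sorted without a global sort.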
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 4.Sort:
 
  • param: 13, 1, 30
  • 1 min.

  • 1-gram.2015.30.dwt: 843,206
  • param: 13, 2, 30
  • 1 min.

  • 2-gram.2015.30.dwt: 4,845,965
  • param: 13, 3, 30
  • 1 min.

  • 3-gram.2015: 6,702,194
  • param: 13, 4, 30
  • 1 min.

  • 4-gram.2015: 4,082,612
  • param: 13, 5, 30
  • 1 min.

  • 5-gram.2015: 1,674,715
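
The two sort orders can be sketched as below. The page does not define "dwt" and "tdw"; the sketch assumes d = document count, w = word count and t = term, with dwt sorted by descending counts and tdw sorted by term. These readings and the function names are assumptions.

```python
def sort_dwt(records):
    """Order (doc_count, word_count, term) records by count, highest first.

    Ties on both counts fall back to the term in ascending order.
    """
    return sorted(records, key=lambda r: (-r[0], -r[1], r[2]))


def sort_tdw(records):
    """Reorder fields to (term, doc_count, word_count) and sort by term."""
    return sorted((t, d, w) for d, w, t in records)


records = [
    (5, 40, "cell cycle"),
    (9, 60, "the cell"),
    (5, 55, "breast cancer"),
]
print(sort_dwt(records))
print(sort_tdw(records))
```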