The MEDLINE.2014 N-gram Set

This page describes how n-grams (n = 1-5) were generated from MEDLINE.2014 using the split-and-combine algorithm.

Columns: Program | Preprocess | unigrams (n=1) | bigrams (n=2) | trigrams (n=3) | fourgrams (n=4) | fivegrams (n=5)
Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
  • < 1 hr.
  • PmidTiAbS14: 1-746
     
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split
 
  • Documents: 22,356,869
  • Sentences: 126,612,705
  • Tokens: 2,610,209,406 (100%)

  • split = 1, no split
  • n-grams (unique tokens): 21,530,469

  • nGram.out.1.heap.50.s1.1

  • split = 4
  • n-grams (not unique): 28,877,339
  • split = 4
  • 2-gram (not unique): 270,862,934

  • nGram.out.2.heap.50.c4.1
  • ...
  • nGram.out.2.heap.50.c4.4
  • split = 8
  • 3-gram (not unique): 952,453,940

  • nGram.out.3.heap.50.c8.1
  • ...
  • nGram.out.3.heap.50.c8.8
  • split = 20
  • 4-gram (not unique): 1,659,414,636

  • nGram.out.4.heap.50.c20.1
  • ...
  • nGram.out.4.heap.50.c20.20
  • split = 20
  • 5-gram (not unique): 1,882,559,441

  • nGram.out.5.heap.50.s20.1
  • ...
  • nGram.out.5.heap.50.s20.20
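The split step above can be illustrated with a minimal sketch. It is not the GetNGramFromSentenceFiles program itself: whitespace tokenization, the function names, and the reading of MAX_CL = 50 as a maximum character length per n-gram are all assumptions made for illustration. It shows why the per-split totals are labelled "not unique": an n-gram that occurs in more than one split is counted once in each split file.

```python
from collections import Counter

MAX_CL = 50  # assumption: maximum character length of an n-gram (mirrors MAX_CL = 50)


def ngrams(tokens, n):
    """Yield every contiguous n-token window as a space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def count_split_ngrams(sentences, n, splits, max_cl=MAX_CL):
    """Count n-grams separately in each of `splits` contiguous chunks of sentences.

    Returns one Counter per chunk, standing in for one split output file.
    """
    chunk = max(1, -(-len(sentences) // splits))  # ceiling division, at least 1
    parts = []
    for start in range(0, len(sentences), chunk):
        counts = Counter()
        for sentence in sentences[start:start + chunk]:
            for gram in ngrams(sentence.split(), n):  # assumption: whitespace tokens
                if len(gram) <= max_cl:
                    counts[gram] += 1
        parts.append(counts)
    return parts


# Hypothetical toy input: four sentences, bigrams, two splits.
sentences = ["a b c", "a b d", "a b c", "x y"]
parts = count_split_ngrams(sentences, n=2, splits=2)
not_unique = sum(len(p) for p in parts)   # entries summed over split files
unique = len(set().union(*parts))         # distinct n-grams overall
```

In this toy input the summed per-split count is larger than the number of distinct n-grams, which is the same relationship as between the "not unique" totals here and the smaller totals reported after grouping.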
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group
 
  • 1-gram: 21,530,469
  • size: 348 Mb

  • g1.NO-A
  • g2.A-Z
  • g3.Z-NO
  • 2-gram: 205,868,398
  • size: 4.2 Gb

  • g1.NO-A
  • g2.A-Z
  • g3.Z-NO
  • 3-gram: 703,148,136
  • size: 18 Gb

  • g1.NO-A
  • g2.A-Z
  • g3.Z-g
  • g4.g-s
  • g5.s-NO
  • 4-gram: 1,295,096,308
  • size: 40 Gb

  • g01.NO-A
  • g02.A-M
  • g03.M-Z
  • g04.Z-c
  • g05.c-e
  • g06.e-k
  • g07.k-p
  • g08.p-s
  • g09.s-v
  • g10.v-NO
  • 5-gram: 1,665,248,566
  • size: 61 Gb

  • g01.NO-A
  • g02.A-M
  • g03.M-Z
  • g04.Z-b
  • g05.b-d
  • g06.d-e
  • g07.e-i
  • g08.i-n
  • g09.n-p
  • g10.p-q
  • g11.q-s
  • g12.s-t
  • g13.t-u
  • g14.u-NO
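The grouping step can be sketched as follows. This is an illustration rather than the GroupSpliteNGrams program: it assumes that a group name such as g2.A-Z means "n-grams that sort from A up to Z in plain character order", that NO marks an open end, and that the counts of an n-gram found in several split files are summed. The helper names and the inclusive or exclusive handling of boundaries are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def group_name(index, bounds):
    """Build a name like 'g2.A-Z'; 'NO' marks an open lower or upper end."""
    lo = "NO" if index == 0 else bounds[index - 1]
    hi = "NO" if index == len(bounds) else bounds[index]
    return f"g{index + 1}.{lo}-{hi}"


def group_and_merge(parts, bounds):
    """Route each n-gram to its alphabetic range and sum counts across splits.

    `parts` is a list of Counters (one per split file); `bounds` is a sorted
    list of boundary strings, e.g. ["A", "Z"] for three groups.
    """
    groups = [Counter() for _ in range(len(bounds) + 1)]
    for counts in parts:
        for gram, wc in counts.items():
            # assumption: an n-gram equal to a boundary goes to the upper group
            groups[bisect_right(bounds, gram)][gram] += wc
    return {group_name(i, bounds): g for i, g in enumerate(groups)}


# Hypothetical toy input: two split files and the three-group layout.
parts = [Counter({"a b": 2, "5 mg": 1}), Counter({"a b": 1, "Acid test": 4})]
grouped = group_and_merge(parts, ["A", "Z"])
```

Because every occurrence of a given n-gram lands in the same group, each group can be merged independently, which keeps the working set small for the larger n-gram files.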
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30, 50

  • 3.FilterCombine
 
  • 1-gram.2014.30 (804,382)
  • 1-gram.2014.50 (564,244)
  • 2-gram.2014.30 (4,587,349)
  • 2-gram.2014.50 (2,979,558)
  • 3-gram.2014.30 (6,287,536)
  • 3-gram.2014.50 (3,691,583)
  • 4-gram.2014.30 (3,799,377)
  • 4-gram.2014.50 (2,039,445)
  • 5-gram.2014.30 (1,545,175)
  • 5-gram.2014.50 (773,277)
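The filter-and-combine step can be sketched in a few lines. This is a hedged illustration, not the FilterWcCombineNGrams program: it assumes WC is a minimum word count and that the comparison is inclusive (count >= WC); the function name and data layout are hypothetical.

```python
def filter_and_combine(groups, min_wc):
    """Keep n-grams whose word count is at least `min_wc` and merge all groups.

    `groups` maps a group name to a {n-gram: word count} mapping. Each n-gram
    is assumed to appear in exactly one group, so merging cannot collide.
    """
    combined = {}
    for counts in groups.values():
        for gram, wc in counts.items():
            if wc >= min_wc:  # assumption: the WC threshold is inclusive
                combined[gram] = wc
    return combined


# Hypothetical toy groups with made-up counts.
groups = {
    "g1.NO-A": {"5 mg": 12},
    "g2.A-Z": {"Acid test": 55, "DNA damage": 30},
    "g3.Z-NO": {"cell line": 49},
}
wc30 = filter_and_combine(groups, 30)
wc50 = filter_and_combine(groups, 50)
```

Under this reading the WC = 50 set is always a subset of the WC = 30 set, which agrees with the smaller counts listed for each .50 file.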
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 4.nGram
 
  • 1-gram.2014 (804,382)
  • 2-gram.2014 (4,587,349)
  • 3-gram.2014 (6,287,536)
  • 4-gram.2014 (3,799,377)
  • 5-gram.2014 (1,545,175)
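The sort step can be sketched as below. The page does not define "dwt" and "tdw"; this sketch assumes they name sort orders over three fields, document count (d), word count (w), and term (t), with dwt sorting by the counts in descending order and tdw sorting by term. Those meanings, the record layout, and the function names are assumptions, not the NGramFilter program's documented behavior.

```python
def sort_dwt(records):
    """Sort by document count, then word count (both descending), then term."""
    return sorted(records, key=lambda r: (-r[0], -r[1], r[2]))


def sort_tdw(records):
    """Sort by term, then document count, then word count (all ascending)."""
    return sorted(records, key=lambda r: (r[2], r[0], r[1]))


# Hypothetical records: (document count, word count, term), made-up numbers.
records = [
    (3, 10, "cell line"),
    (5, 12, "amino acid"),
    (3, 15, "beta cell"),
]
by_dwt = sort_dwt(records)
by_tdw = sort_tdw(records)
```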