N-gram Set by Split, Group, Filter, Combine Algorithm

This page describes the details of generating n-grams (n = 1-5) from MEDLINE using the split and combine algorithm. Because the full set of n-grams is too large for the Java HashMap limitation, the n-gram retrieval process can be split (by

I. Basic N-gram Set
The n-grams (unigrams, bigrams, trigrams, four-grams, five-grams) are retrieved from MEDLINE as follows:

II. Split (MEDLINE):

Split the total input MEDLINE files into N portions.

  • For the 2014 release, there are 746 files (PmidTiAbS14n0001.txt - PmidTiAbS14n0746.txt)
  • The program can automatically split the files into N portions
  • The output file is: nGram.out.N.heap.MAX_CL.sS.C
    • N: n-gram
    • MAX_CL: max. character length (all n-grams longer than MAX_CL are filtered out)
    • S: no. of splits
    • C: current split no.

    For example: nGram.out.5.heap.50.s20.15
    • 5-gram (N = 5) with max. character length of 50 (MAX_CL), split into 20 portions (S), current portion 15 (C, ~ PmidTiAbS14n0519.txt - PmidTiAbS14n0555.txt)
    • S = 20, 746/20 ≈ 37 files per portion (rounded down)
    • C = 15, first file: 37 x (15-1) + 1 = 519; last file: 37 x 15 = 555
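
The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function names portion_file_range and split_file_name are hypothetical, and the handling of leftover files (746 is not a multiple of 20) is an assumption, since the page does not say where they go.

```python
def portion_file_range(total_files: int, splits: int, current: int) -> tuple:
    """Return the 1-based (first, last) file numbers covered by one portion."""
    if not 1 <= current <= splits:
        raise ValueError("current must be between 1 and splits")
    per_portion = total_files // splits       # 746 // 20 = 37
    first = per_portion * (current - 1) + 1   # 37 x (15-1) + 1 = 519
    last = per_portion * current              # 37 x 15 = 555
    # Assumption (not stated in the source): the last portion
    # also takes any files left over after the integer division.
    if current == splits:
        last = total_files
    return first, last


def split_file_name(n: int, max_cl: int, splits: int, current: int) -> str:
    """Build an output name following nGram.out.N.heap.MAX_CL.sS.C."""
    return f"nGram.out.{n}.heap.{max_cl}.s{splits}.{current}"


first, last = portion_file_range(746, 20, 15)
print(split_file_name(5, 50, 20, 15))
print(f"PmidTiAbS14n{first:04d}.txt - PmidTiAbS14n{last:04d}.txt")
```

For portion 15 of 20 this reproduces the file range given in the example (0519 to 0555).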

III. Group (by alphabetic order):

Group the n-grams from all split files by a specified range of starting characters. Once the n-grams are grouped (sorted and combined alphabetically), each group is independent of the others. The characters are in the following order:

NO, ..., 0-9, ..., >, ?, @, A, B, C, ..., X, Y, Z, [, \, ], ^, _, `, a, b, c, ..., x, y, z, {, |, }, ..., NO
  • The program allows users to specify the range of starting and ending characters
  • The output file is: nGram.out.N.heap.MAX_CL.sN.gS.SC-EC
    • MAX_CL: max. character length (all n-grams longer than MAX_CL are filtered out)
    • N: n-gram (the number after "s" is the no. of splits from the Split step)
    • S: serial number of the group
    • SC: starting character (included)
    • EC: ending character (not included)

    For example: nGram.out.5.heap.50.s20.g05.b-d
    • 5-gram with max. character length of 50 and 20 splits; group no. 5 contains all n-grams starting with b or c (the ending character d is not included).
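
The grouping rule (starting character included, ending character not included) can be illustrated with a short Python sketch. The names in_group and group_file_name are hypothetical, and treating "NO" as "no bound on this side" is an assumption drawn from how the group names are written. Python compares single characters by code point, which matches the character order listed above for ASCII text.

```python
from typing import Optional


def in_group(ngram: str, start: Optional[str], end: Optional[str]) -> bool:
    """True if the first character of ngram lies in [start, end).

    None stands for the "NO" marker, assumed here to mean no bound.
    """
    if not ngram:
        return False
    first = ngram[0]
    if start is not None and first < start:
        return False  # sorts before the starting character
    if end is not None and first >= end:
        return False  # ending character itself is not included
    return True


def group_file_name(n: int, max_cl: int, splits: int, serial: int,
                    start: Optional[str], end: Optional[str],
                    width: int = 2) -> str:
    """Build a name following nGram.out.N.heap.MAX_CL.sN.gS.SC-EC."""
    sc = "NO" if start is None else start
    ec = "NO" if end is None else end
    return f"nGram.out.{n}.heap.{max_cl}.s{splits}.g{serial:0{width}d}.{sc}-{ec}"


print(group_file_name(5, 50, 20, 5, "b", "d"))
for text in ["blood pressure", "cell line", "dose response"]:
    print(text, in_group(text, "b", "d"))
```

The width parameter only controls zero padding of the serial number (g05 versus g5) and is likewise an illustrative choice.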

IV. Filter (by WC) and Combine:

Combine the grouped n-grams and filter them by word count (WC). N-grams with a low WC, which make up the largest portion of the higher-order n-grams, are filtered out.
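
A minimal Python sketch of this step, under stated assumptions: the same n-gram may appear in several split files, so its counts are summed first, and an n-gram is kept only if its total WC reaches the threshold. The function name combine_and_filter, the in-memory (n-gram, count) pairs, and the inclusive comparison (WC >= threshold) are all illustrative choices; the page does not describe the file format or the exact comparison.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def combine_and_filter(parts: Iterable[Iterable[Tuple[str, int]]],
                       min_wc: int) -> Dict[str, int]:
    """Sum the counts of identical n-grams, then drop those below min_wc."""
    totals = Counter()
    for part in parts:
        for ngram, wc in part:
            totals[ngram] += wc  # combine counts across split files
    # Keep n-grams in sorted order; the threshold is assumed inclusive.
    return {ngram: wc for ngram, wc in sorted(totals.items()) if wc >= min_wc}


# Two made-up portions of one alphabetic group (not real MEDLINE counts).
part_1 = [("blood pressure", 20), ("cell line", 40)]
part_2 = [("blood pressure", 15), ("rare phrase", 3)]

print(combine_and_filter([part_1, part_2], min_wc=30))
print(combine_and_filter([part_1, part_2], min_wc=50))
```

Filtering only after the counts are combined matters: in the made-up data, "blood pressure" is below 30 in each portion but passes once the two counts are summed.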

V. Example Walk-through (MEDLINE.2014):

Program | Preprocess | unigrams (n=1) | bigrams (n=2) | trigrams (n=3) | fourgrams (n=4) | fivegrams (n=5)
Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
  • ~45 min.
  • PmidTiAbS14: 1-746
     
Option 3
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50
  • 1.Split
  • Documents: 22,356,869
  • Sentences: 126,612,705
  • Tokens: 2,610,209,406 (100%)
  • unigrams (n=1):
    • split = 1, no split
    • n-grams (unique tokens): 21,530,469
    • nGram.out.1.heap.50.s1.1
    • split = 4
    • n-grams (not unique): 28,877,339
  • bigrams (n=2):
    • split = 4
    • 2-gram (not unique): 270,862,934
    • nGram.out.2.heap.50.c4.1
    • ...
    • nGram.out.2.heap.50.c4.4
  • trigrams (n=3):
    • split = 8
    • 3-gram (not unique): 952,453,940
    • nGram.out.3.heap.50.c8.1
    • ...
    • nGram.out.3.heap.50.c8.8
  • fourgrams (n=4):
    • split = 20
    • 4-gram (not unique): 1,659,414,636
    • nGram.out.4.heap.50.c20.1
    • ...
    • nGram.out.4.heap.50.c20.20
  • fivegrams (n=5):
    • split = 20
    • 5-gram (not unique): 1,882,559,441
    • nGram.out.5.heap.50.s20.1
    • ...
    • nGram.out.5.heap.50.s20.20
Option 4
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams
  • 2.Group
  • 1-gram: 21,530,469
    • size: 348 MB
    • g1.NO-A
    • g2.A-Z
    • g3.Z-NO
  • 2-gram: 205,868,398
    • size: 4.2 GB
    • g1.NO-A
    • g2.A-Z
    • g3.Z-NO
  • 3-gram: 703,148,136
    • size: 18 GB
    • g1.NO-A
    • g2.A-Z
    • g3.Z-g
    • g4.g-s
    • g5.s-NO
  • 4-gram: 1,295,096,308
    • size: 40 GB
    • g01.No-A
    • g02.A-M
    • g03.M-Z
    • g04.Z-c
    • g05.c-e
    • g06.e-k
    • g07.k-p
    • g08.p-s
    • g09.s-v
    • g10.v-NO
  • 5-gram: 1,665,248,566
    • size: 61 GB
    • g01.NO-A
    • g02.A-M
    • g03.M-Z
    • g04.Z-b
    • g05.b-d
    • g06.d-e
    • g07.e-i
    • g08.i-n
    • g09.n-p
    • g10.p-q
    • g11.q-s
    • g12.s-t
    • g13.t-u
    • g14.u-NO
Option 5
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30, 50
  • 3.FilterCombine
  • 1-gram.2014.30 (804,382)
  • 1-gram.2014.50 (564,244)
  • 2-gram.2014.30 (4,587,349)
  • 2-gram.2014.50 (2,979,558)
  • 3-gram.2014.30 (6,287,536)
  • 3-gram.2014.50 (3,691,583)
  • 4-gram.2014.30 (3,799,377)
  • 4-gram.2014.50 (2,039,445)
  • 5-gram.2014.30 (1,545,175)
  • 5-gram.2014.50 (773,277)
Option 6, 7
  • Sort n-grams by dwt, tdw
  • NGramFilter
  • 4.nGram
  • 1-gram.2014 (804,382)
  • 2-gram.2014 (4,587,349)
  • 3-gram.2014 (6,287,536)
  • 4-gram.2014 (3,799,377)
  • 5-gram.2014 (1,545,175)