N-gram Set by Split, Group, Filter, Combine Algorithm

This page describes how n-grams (n = 1-5) are generated from MEDLINE using the split and combine algorithm. Because the full n-gram set is too large for the Java HashMap limit, the n-gram retrieval process can be split (by

I. Basic N-gram Set
The n-grams (uniGram, biGram, triGram, fourthGram, fifthGram) are retrieved from MEDLINE as follows:

II. Split (MEDLINE):

Split the complete set of input MEDLINE files into S portions.

  • For the 2014 release, there are 746 files (PmidTiAbS14n0001.txt - PmidTiAbS14n0746.txt)
  • The program can split the files into S portions automatically
  • The output file is: nGram.out.N.heap.MAX_CL.sS.C
    • N: n-gram
    • MAX_CL: max. character length (any n-gram longer than MAX_CL is filtered out)
    • S: no. of splits
    • C: current split no.

    For example: nGram.out.5.heap.50.s20.15
    • 5-gram (N = 5) with a max. character length of 50 (MAX_CL), split into 20 portions (S), current portion 15 (C, ~ PmidTiAbS14n0519.txt - PmidTiAbS14n0555.txt)
    • S = 20, 746/20 = 37 (rounded down) files per portion
    • C = 15, 37 x (15-1) + 1 = 519; 37 x 15 = 555
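
The arithmetic in the example above can be sketched in Java as follows. This is only an illustration: the class and method names (SplitRange, fileRange, fileName) are hypothetical, and it assumes each portion holds totalFiles / S files by integer division. How the leftover files (746 - 37 x 20 = 6) are assigned is not stated on this page, so the sketch does not handle them.

```java
public class SplitRange {
    // Returns {firstFileNo, lastFileNo} (1-based, inclusive) for portion c of s.
    // Assumption: every portion holds totalFiles / s files (integer division).
    static int[] fileRange(int totalFiles, int s, int c) {
        int perPortion = totalFiles / s;          // 746 / 20 = 37
        int first = perPortion * (c - 1) + 1;     // 37 x (15-1) + 1 = 519
        int last = perPortion * c;                // 37 x 15 = 555
        return new int[] {first, last};
    }

    // Builds a 2014-release file name such as PmidTiAbS14n0519.txt
    static String fileName(int fileNo) {
        return String.format("PmidTiAbS14n%04d.txt", fileNo);
    }

    public static void main(String[] args) {
        int[] r = fileRange(746, 20, 15);
        // prints: PmidTiAbS14n0519.txt - PmidTiAbS14n0555.txt
        System.out.println(fileName(r[0]) + " - " + fileName(r[1]));
    }
}
```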

III. Group (by alphabetic order):

Group the n-grams from all split files by a specified range of starting characters. Once the n-grams are grouped (sorted and combined alphabetically), the groups are independent of one another. The characters are in the following order:

NO, ... 0-9, ... >, ?, @, A, B, C, ..., X, Y, Z, [, \, ], ^, _, `, a, b, c, ..., x, y, z, {, |, }, ... NO
  • The program allows users to specify the range of starting and ending characters
  • The output file is: nGram.out.N.heap.MAX_CL.sN.gS.SC-EC
    • MAX_CL: max. character length (any n-gram longer than MAX_CL is filtered out)
    • N: n-gram
    • S: serial number
    • SC: starting character (included)
    • EC: ending character (not included)

    For example: nGram.out.5.heap.50.s20.g05.b-d
    • 5-grams with a max. character length of 50, from 20 splits, group no. 5, which contains all n-grams starting with b or c (the range ends at d, which is not included).
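
The half-open range (SC included, EC not included) can be illustrated with a short Java sketch. The class and method names (GroupRange, inGroup) are hypothetical, and the sketch assumes that an n-gram is assigned to a group by comparing the char value of its first character, which for ASCII characters matches the order listed in this section.

```java
public class GroupRange {
    // True if the first character of the n-gram falls in [sc, ec):
    // sc is included, ec is not included.
    // Assumption: characters are compared by their char values (ASCII order).
    static boolean inGroup(String nGram, char sc, char ec) {
        if (nGram.isEmpty()) {
            return false;
        }
        char first = nGram.charAt(0);
        return first >= sc && first < ec;
    }

    public static void main(String[] args) {
        // Group b-d: n-grams starting with b or c belong, d does not.
        System.out.println(inGroup("blood pressure", 'b', 'd'));   // true
        System.out.println(inGroup("cell line", 'b', 'd'));        // true
        System.out.println(inGroup("drug therapy", 'b', 'd'));     // false
    }
}
```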

IV. Filter (by WC) and Combine:

Combine the grouped n-grams and filter them by WC. The n-grams filtered out by WC make up the largest portion of the higher-order n-grams.
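
A minimal Java sketch of this step is given below, under several assumptions that this page does not confirm: WC is treated as a per-n-gram count, each split result is modeled as an in-memory map from n-gram to count, and the filter drops n-grams whose combined count is below a threshold. The names CombineFilter, combineAndFilter, and minWc are hypothetical, and the actual file format and threshold are not described here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombineFilter {
    // Sums the counts of identical n-grams across all parts, then removes
    // every n-gram whose combined count is below minWc.
    // Assumption: minWc is a hypothetical threshold, not a documented value.
    static Map<String, Integer> combineAndFilter(
            List<Map<String, Integer>> parts, int minWc) {
        Map<String, Integer> total = new TreeMap<>();   // keeps keys sorted
        for (Map<String, Integer> part : parts) {
            for (Map.Entry<String, Integer> e : part.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        total.entrySet().removeIf(e -> e.getValue() < minWc);
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> part1 = new HashMap<>();
        part1.put("blood pressure", 3);
        part1.put("cell line", 1);
        Map<String, Integer> part2 = new HashMap<>();
        part2.put("blood pressure", 2);
        // With minWc = 2, "cell line" (count 1) is dropped.
        System.out.println(combineAndFilter(List.of(part1, part2), 2));
    }
}
```

Because the groups from section III do not share n-grams, a step like this could be run on one group at a time, which keeps each in-memory map small.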

V. Example: