N-gram Set by Prediction Filter

A new approach of prediction filter is developed to resolve issues of limited memory. This approache retrieve an approximate n-Gram as an alternative approach. However, this approach is not comprehensive and should be replace by a more thorough approach.

I. Prediction Filter:

Use the frequency (NWC) of normalized (n)-gram terms as filter to generate (n+1)-gram terms:

  • Only high frequency n-gram terms are used to generate multiwords for our applications (to add in to Lexicon). Low frequency n-gram terms are usually error prone multiwords (e.g. typos).
  • Among these n-gram terms, most of them has very low frequency (as the long tail of dinosaur shown on the single words frequency spectrum)
  • The frequency (WC) of normalized n-gram terms is used to reflect the real usage.
  • If a n-gram term has low NWC, the associated (n+1)-gram terms has lower (or equals) NWC. For example, unigram term "zytochemical" has NWC of 2. The associated bigram of "zytochemical" is "zytochemical findings" is also 2.
  • Let says the n-gram term is "n-gram term", the associated (n+1)-gram terms must be:
    • "prefix n-gram term"
    • "n-gram term suffix"
  • Use the frequency (NWC) of normalized (n)-gram terms as filter to generate (n+1)-gram terms to generate high frequency (n+1) gram terms that fit into to our computing resources (60Gb, which stores about 230 million of n-gram terms)a In other words, this approach retrieve the top NWC (frequency) 230 million of n-gram terms (for n = 1 ~ 5, respectively).

II. N-gram Set with Prediction Filter:

  • Get the nGram (n=1, uniGram) from section I.
  • Run the normalization on nGram and retrieve uniGram with normalized terms|word count
  • Manually review results and select a threshold (wc >= 50)
  • Retrieve a table with all original terms with normalized terms above the selected threshold
  • Generate biGram (n+1) as follows:
    • Filter out the biGram ((n+1)-gram) term if either the starting or ending words in biGram terms are not in the unigram (n-gram) threshold table (below the threshold, because the frequency is too low). These low frequency term will be filtered out after they are generated (even we don't filter them out at this stage anyway).

  • Repeats the above processes for n = 2 ~ 5

III. Example Walk-through (MEDLINE.2014):

 Preprocessunigramsbigramstrigramsfourgramsfivegrams
N n=1n=2n=3n=4n=5
Step 1
PmidTiAbSentences{YY}n{DDDD}.txt
  • 45 min.
  • PmidTiAbS14: 1-746
     
Step 2
Gen uniGram, sorted
 
  • 25 min.
  • Documents: 22,361,238
  • Sentences: 126,681,740
  • Tokens:2,611,996,832 (100%)
  • n-grams (unique tokens): 21,710,680
    
Step 3
Norm (n-1)-gram
  
  • (n-1)=1, 4 min.
  • (n-1)=2, 55 min.
  • (n-1)=3, 3.5 hr.
  • (n-1)=4, 3 hr.
Step 4
Gen (n-1)-gram for threshold on NWC
  
  • (n-1)=1
  • th=50, 1 min.
    • Norm n-grams: 353,831
    • n-grams: 5,030,318
  • (n-1)=2
  • th=7500, 4 min.
    • Norm n-grams: 32,400
    • n-grams: 748,220
  • (n-1)=3
  • th=1000, 5 min.
    • Norm n-grams: 139,662
    • n-grams: 1,175,570
  • (n-1)=4
  • th=100, 5 min.
    • Norm n-grams: 778,320
    • n-grams: 2,929,856
Step 5
Gen n-Gram
 
  • th=0 (no threshold used), 25 min.
  • Documents:22,361,238
  • Sentences: 126,681,740
  • Tokens:2,611,996,832
  • Filtered unigrams: 2,611,996,832 (100%)
  • Unique unigrams: 21,710,680
  • th=50, 5 hr.
  • Documents:22,361,238
  • Sentences: 126,681,740
  • Tokens:2,485,317,128
  • Filtered bigrams: 2,482,775,482 (99.89%)
  • Unique bigrams: 204,020,203
  • th=7,500, 22 hr.
  • Documents:22,361,238
  • Sentences: 126,681,740
  • Tokens: 2,359,078,026
  • Filtered trigrams: 1,542,069,455 (65.37%)
  • Unique trigrams: 225,891,864
  • th=1,000, 9 hr.
  • Documents:22,361,238
  • Sentences: 126,681,740
  • Tokens: 2,233,430,789
  • Filtered fourgrams: 811,458,826 (36.33%)
  • Unique fourgrams: 204,623,839
  • th=100, 6 hr.
  • Documents:22,361,238
  • Sentences: 126,681,740
  • Tokens: 2,108,489,103
  • Filtered fivegrams: 464,595,426 (22.03%)
  • Unique fivegrams: 186,219,982
Step 6
Sort n-Gram
  33 min.45 min.40 min.33 min.