The MEDLINE.2016 N-gram Set

This page describes how n-grams (n = 1-5) were generated from MEDLINE.2016 using a split-and-combine algorithm.

Program, N, Approx. Time (Hr.)
Option 1
  • GenPmidTiAbSentenceFiles
  • PmidTiAbSentences{YY}n{DDDD}.txt
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split:
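
The split step can be illustrated with a minimal sketch. It is not the GetNGramFromSentenceFiles implementation: the function names and the `max_per_split` parameter are hypothetical, whitespace tokenization is assumed, MAX_CL is read here as a maximum character length per n-gram (an assumption), and only word counts are tracked.

```python
from collections import Counter
from typing import Iterable, Iterator

MAX_CL = 50  # assumed meaning: maximum character length of an n-gram


def ngrams(tokens: list[str], n: int) -> Iterator[str]:
    """Yield every run of n consecutive tokens, joined by single spaces."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def split_ngrams(sentences: Iterable[str], n: int,
                 max_per_split: int) -> Iterator[Counter]:
    """Count n-grams, emitting a partial table ("split") whenever the
    in-memory table reaches max_per_split distinct n-grams."""
    counts: Counter = Counter()
    for sentence in sentences:
        for gram in ngrams(sentence.split(), n):
            if len(gram) > MAX_CL:
                continue  # skip over-long n-grams (assumed role of MAX_CL)
            counts[gram] += 1
            if len(counts) >= max_per_split:
                yield counts
                counts = Counter()
    if counts:
        yield counts  # last, usually smaller, split
```

Because each split is counted independently, the same n-gram can appear in several splits. This is why the totals after this step are reported as "not unique".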
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group:
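
A minimal sketch of the grouping step follows. It is not the GroupSpliteNGrams implementation: the names are hypothetical, "NO" in the group file names is assumed to mean an open bound (represented as `None`), ranges are assumed to be lower-inclusive and upper-exclusive, and ordering is plain code-point order, which places uppercase before lowercase.

```python
from collections import Counter
from typing import Optional


def in_range(gram: str, lo: Optional[str], hi: Optional[str]) -> bool:
    """True when lo <= gram < hi; None stands for an open bound."""
    return (lo is None or gram >= lo) and (hi is None or gram < hi)


def group_splits(splits: list[Counter],
                 bounds: list[Optional[str]]) -> list[Counter]:
    """Merge per-split counts into one table per alphabetic range.

    Range i covers bounds[i] .. bounds[i + 1], so len(bounds) - 1
    groups are produced.
    """
    groups = [Counter() for _ in range(len(bounds) - 1)]
    for split in splits:
        for gram, wc in split.items():
            for i in range(len(groups)):
                if in_range(gram, bounds[i], bounds[i + 1]):
                    groups[i][gram] += wc  # sum counts across splits
                    break
    return groups
```

After grouping, every n-gram lives in exactly one group with its full count, so the group totals are unique counts.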
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3.FilterCombine:
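
The filter-and-combine step can be sketched as below. It is an illustration only, not the FilterWcCombineNGrams implementation: the function name and data layout are hypothetical, and the output order within a group is an assumption.

```python
from collections import Counter

WC_MIN = 30  # threshold stated on this page (WC = 30)


def filter_and_combine(groups: list[Counter],
                       wc_min: int = WC_MIN) -> list[tuple[str, int]]:
    """Keep n-grams whose word count is >= wc_min and concatenate
    the alphabetic groups, in order, into a single list."""
    combined = []
    for group in groups:
        for gram in sorted(group):
            if group[gram] >= wc_min:
                combined.append((gram, group[gram]))
    return combined
```

Filtering is safe only after grouping: before that point an n-gram's count may be spread over several split files, and a per-split threshold would discard n-grams whose total reaches 30.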
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 3.FilterCombine:
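
A possible reading of the two sort orders is sketched below. The page does not define "dwt" and "tdw"; the sketch assumes they abbreviate the field order (document count, word count, term) and (term, document count, word count), with counts sorted in descending order. The record type and function names are hypothetical.

```python
from typing import NamedTuple


class NGramRecord(NamedTuple):
    term: str
    doc_count: int
    word_count: int


def sort_dwt(records: list[NGramRecord]) -> list[NGramRecord]:
    """Assumed 'dwt' order: document count (desc), word count (desc), term."""
    return sorted(records, key=lambda r: (-r.doc_count, -r.word_count, r.term))


def sort_tdw(records: list[NGramRecord]) -> list[NGramRecord]:
    """Assumed 'tdw' order: term, then document count and word count (desc)."""
    return sorted(records, key=lambda r: (r.term, -r.doc_count, -r.word_count))
```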
Preprocess 1
  • ~ 1 hr.
  • PmidTiAbS15: 1-812
    
unigrams, n=1, 0.5 hr.
  • param: 10,1, (150000000)
  • 23 min.

  • Documents: 24,358,442
  • Sentences: 143,471,776
  • Tokens: 2,971,013,236

  • split: 1, no split
  • 1-grams (not unique): 24,121,470
    (already unique because there is no split)

  • Files:
    • nGram.out.1.heap.50.s01.0001-0812 (389 Mb)
  • param:
    • 11,1,01,NO,NO
  • 1 min.

  • Group Alphabetically
  • 1-gram (unique): 24,121,470

  • Files:
    • ${NGram}.g01.NO-NO (389 Mb)
  • param: 12, 1, 30
  • 1 min

  • 1-gram (WC >= 30): 883,287

  • File:
    • 1-gram.2016.30 (15 Mb)
  • param: 13, 1, 30
  • 1 min.

  • 1-gram (sorted): 883,287

  • File:
    • 1-gram.2016.30.dwt (15 Mb)
bigrams, n=2, 1.5 hr.
  • param: 10,2, (150000000)
  • 1 hr.

  • split: 2
  • 2-gram (not unique): 267,460,094

  • Files:
    • s01.0001-0591 (3.1 Gb)
    • s02.0592-0812 (2.4 Gb)
  • param: see file names below
    • 11,2,01,NO,a
    • 11,2,02,a,NO
  • 19 min.

  • Group Alphabetically
  • 2-gram (unique): 229,691,126

  • Files:
    • ${NGram}.g01.NO-a (1.7 Gb)
    • ${NGram}.g02.a-NO (3.1 Gb)
  • param: 12, 2, 30
  • 4 min.

  • 2-gram (WC >= 30): 5,114,547

  • File:
    • 2-gram.2016.30 (109 Mb)
  • param: 13, 2, 30
  • 1 min.

  • 2-gram (sorted): 5,114,547

  • File:
    • 2-gram.2016.30.dwt (109 Mb)
trigrams, n=3, 5 hr.
  • param: 10,3, (150000000)
  • 2 hr.

  • split: 8
  • 3-gram (not unique): 1,092,117,562

  • Files:
    • s01.0001-0298 (3.8 Gb)
    • s02.0299-0403 (3.7 Gb)
    • s03.0404-0492 (3.7 Gb)
    • s04.0493-0573 (3.7 Gb)
    • s05.0574-0651 (3.8 Gb)
    • s06.0652-0725 (3.7 Gb)
    • s07.0726-0797 (3.8 Gb)
    • s08.0798-0812 (953 Mb)
  • param: see file names below
  • 2.5 hr.

  • Group Alphabetically
  • 3-gram (unique): 788,417,523

  • Files:
    • g01.NO-U (4.2 Gb)
    • g02.U-d (3.9 Gb)
    • g03.d-k (3.7 Gb)
    • g04.k-re (3.9 Gb)
    • g05.re-NO (4.3 Gb)
  • param: 12, 3, 30
  • 10 min.

  • 3-gram (WC >= 30): 7,134,807

  • File:
    • 3-gram.2016.30 (182 Mb)
  • param: 13, 3, 30
  • 1 min.

  • 3-gram (sorted): 7,134,807

  • File:
    • 3-gram.2016.30.dwt (182 Mb)
fourgrams, n=4, 10 hr.
  • param: 10,4, (130000000)
  • 2.5 hr.

  • split: 15
  • 4-gram (not unique): 1,855,460,574

  • Files:
    • s01.0001-0226 (4.0 Gb)
    • s02.0227-0296 (3.9 Gb)
    • s03.0297-0351 (3.9 Gb)
    • s04.0352-0400 (4.0 Gb)
    • s05.0401-0446 (4.0 Gb)
    • s06.0447-0490 (3.9 Gb)
    • s07.0491-0533 (4.0 Gb)
    • s08.0534-0574 (4.0 Gb)
    • s09.0575-0614 (3.9 Gb)
    • s10.0615-0654 (4.0 Gb)
    • s11.0655-0693 (4.0 Gb)
    • s12.0694-0731 (4.0 Gb)
    • s13.0732-0769 (4.0 Gb)
    • s14.0770-0806 (4.0 Gb)
    • s15.0807-0812 (653 Mb)
  • param: see file names below
  • 7 hr.

  • Group Alphabetically
  • 4-gram (unique): 1,460,588,176

  • Files:
    • g01.NO-F (4.2 Gb)
    • g02.F-ab (4.2 Gb)
    • g03.ab-b (4.1 Gb)
    • g04.b-d (4.0 Gb)
    • g05.d-fq (4.1 Gb)
    • g06.fq-is (4.2 Gb)
    • g07.is-o (3.7 Gb)
    • g08.o-pm (3.9 Gb)
    • g09.pm-si (3.9 Gb)
    • g10.si-th (2.2 Gb)
    • g11.th-u (3.8 Gb)
    • g12.u-NO (2.5 Gb)
  • param: 12, 4, 30
  • 20 min.

  • 4-gram (WC >= 30): 4,380,474

  • File:
    • 4-gram.2016.30 (132 Mb)
  • param: 13, 4, 30
  • 1 min.

  • 4-gram (sorted): 4,380,474

  • File:
    • 4-gram.2016.30.dwt (132 Mb)
fivegrams, n=5, 18 hr.
  • param: 10,5, (120000000)
  • 3 hr.

  • split: 18
  • 5-gram (not unique): 2,143,203,249

  • Files:
    • s01.0001-0208 (4.3 Gb)
    • s02.0209-0268 (4.3 Gb)
    • s03.0269-0317 (4.3 Gb)
    • s04.0318-0359 (4.3 Gb)
    • s05.0360-0398 (4.4 Gb)
    • s06.0399-0435 (4.4 Gb)
    • s07.0436-0470 (4.4 Gb)
    • s08.0471-0505 (4.4 Gb)
    • s09.0506-0539 (4.4 Gb)
    • s10.0540-0572 (4.4 Gb)
    • s11.0573-0605 (4.4 Gb)
    • s12.0606-0638 (4.5 Gb)
    • s13.0639-0670 (4.4 Gb)
    • s14.0671-0702 (4.5 Gb)
    • s15.0703-0733 (4.4 Gb)
    • s16.0734-0764 (4.5 Gb)
    • s17.0765-0794 (4.4 Gb)
    • s18.0795-0812 (2.6 Gb)
  • param: see file names below
  • 12 hr.

  • Group Alphabetically
  • 5-gram (unique): 1,885,969,537

  • Files:
    • g01.NO-A (3.3 Gb)
    • g02.A-M (3.9 Gb)
    • g03.M-a (3.9 Gb)
    • g04.a-and (3.2 Gb)
    • g05.and-b (4.1 Gb)
    • g06.b-cf (3.1 Gb)
    • g07.cf-d (2.9 Gb)
    • g08.d-em (3.1 Gb)
    • g09.em-g (3.8 Gb)
    • g10.g-inc (4.1 Gb)
    • g11.inc-m (3.8 Gb)
    • g12.m-o (3.6 Gb)
    • g13.o-on (3.3 Gb)
    • g14.on-pp (3.4 Gb)
    • g15.pp-s (4.1 Gb)
    • g16.s-t (4.0 Gb)
    • g17.t-thf (4.2 Gb)
    • g18.thf-v (3.3 Gb)
    • g19.v-NO (3.4 Gb)
  • param: 12, 5, 30
  • 30 min.

  • 5-gram (WC >= 30): 1,812,223

  • File:
    • 5-gram.2016.30 (64 Mb)
  • param: 13, 5, 30
  • 1 min.

  • 5-gram (sorted): 1,812,223

  • File:
    • 5-gram.2016.30.dwt (64 Mb)
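
The number of split files (s01, s02, ...) listed for each n is consistent with one reading of the option-10 parameters: if the value in parentheses is taken as the maximum number of n-grams written per split file (an assumption, not stated on this page), then the file count equals the not-unique total divided by that maximum, rounded up. The check below is a sketch under that assumption; `RUNS` and `expected_splits` are hypothetical names, and the figures are copied from this page.

```python
# n -> (not-unique n-gram count, value in parentheses in the option-10
#       param, number of s-files listed on this page)
RUNS = {
    1: (24_121_470, 150_000_000, 1),
    2: (267_460_094, 150_000_000, 2),
    3: (1_092_117_562, 150_000_000, 8),
    4: (1_855_460_574, 130_000_000, 15),
    5: (2_143_203_249, 120_000_000, 18),
}


def expected_splits(total: int, max_per_split: int) -> int:
    """Ceiling division in integer arithmetic."""
    return (total + max_per_split - 1) // max_per_split


for n, (total, cap, listed) in RUNS.items():
    status = "matches" if expected_splits(total, cap) == listed else "differs"
    print(f"{n}-gram: expected {expected_splits(total, cap)} splits, "
          f"{listed} listed ({status})")
```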