The MEDLINE N-gram Set 2014: by Split, Group, Filter, and Combine Algorithm
The MEDLINE n-gram set - 2014 (generated by the split, group, filter, and combine algorithm) is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into tokens (words use space as the word boundary). Finally, n-grams are generated by filtering out terms that have more than 50 characters or a total word count (frequency) of less than 30. The specifications for generating these n-grams are as follows:
- MEDLINE: 2014 - TI and AB (from PmidTiAbS14nXXXX.txt: 1 ~ 746)
- Method: Split, Group, Filter, and Combine Algorithm
- Max. Character Size: 50
- Min. word count: 30
- Min. document count: 1
- Total document count: 22,356,869
- Total sentence count: 126,612,705
- Total token count: 2,610,209,406
- N-gram files
- File format - 3 fields:
  Document count | Word count | N-gram
- Sorted by document count, word count, then alphabetic order of n-grams.
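The pipeline above (combine title and abstract, split into sentences, tokenize on spaces, group counts per n-gram, then filter) can be sketched as follows. This is a minimal illustration with hypothetical helper names and a deliberately crude sentence splitter, not the actual NLM implementation; the real input is the PmidTiAbS14nXXXX.txt files.

```python
# Minimal sketch of the split, group, filter, and combine algorithm.
from collections import defaultdict

MAX_CHARS = 50   # max. character size of an n-gram
MIN_WC = 30      # min. word count (corpus frequency)
MIN_DC = 1       # min. document count

def ngrams_from_sentence(tokens, n_max=5):
    """Yield all 1-grams through n_max-grams of a token list."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def build_ngram_set(docs):
    """docs: iterable of combined title + abstract strings, one per record.
    Returns {ngram: (document count, word count)} after filtering."""
    word_count = defaultdict(int)
    doc_count = defaultdict(int)
    for doc in docs:
        seen = set()
        # split: crude sentence tokenization (the real tool is more careful)
        for sentence in doc.split(". "):
            tokens = sentence.split()  # words use space as the word boundary
            for gram in ngrams_from_sentence(tokens):
                word_count[gram] += 1
                seen.add(gram)
        for gram in seen:              # group: count each n-gram once per document
            doc_count[gram] += 1
    # filter: drop n-grams over 50 characters or below the count cutoffs
    return {g: (doc_count[g], word_count[g])
            for g in word_count
            if len(g) <= MAX_CHARS
            and word_count[g] >= MIN_WC
            and doc_count[g] >= MIN_DC}
```

With the cutoffs above, an n-gram must occur at least 30 times across the corpus to survive, which is why the released files contain far fewer n-grams than the 2.6 billion raw tokens would suggest.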
Download:
| N-grams | File | Zip Size | Actual Size | No. of n-grams |
| --- | --- | --- | --- | --- |
| Unigrams | 1-gram.2014.tgz | 5.4MB | 14MB | 804,382 |
| Bigrams | 2-gram.2014.tgz | 33MB | 98MB | 4,587,349 |
| Trigrams | 3-gram.2014.tgz | 49MB | 160MB | 6,287,536 |
| Four-grams | 4-gram.2014.tgz | 33MB | 114MB | 3,799,377 |
| Five-grams | 5-gram.2014.tgz | 15MB | 54MB | 1,545,175 |
| N-gram Set | nGramSet.2014.30.tgz | 170MB | 437MB | 17,023,819 |
| Distilled N-gram Set | distilledNGram.2014.tgz | 51MB | 164MB | 6,351,392 |
