The MEDLINE N-gram Set 2020: by Split, Group, Filter, and Combine Algorithm
The MEDLINE n-gram set (generated by the split, group, filter, and combine algorithm) is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into tokens (words, using space as the word boundary). Finally, n-grams are generated by filtering out terms with more than 50 characters or a total word count (corpus frequency) of less than 30. The specifications used to generate these n-grams are as follows:
- MEDLINE: 2020 - TI and AB (from MEDLINE Baseline Repository - MBR, pubmed20nXXXX.xml -> PmidTiAbS20nXXXX.txt: 1 ~ 1015)
- Method: Split, Group, Filter, and Combine Algorithm
- Max. Character Size: 50
- Min. word count: 30
- Min. document count: 1
- Total document count: 30,420,660
- Total sentence count: 196,566,513
- Total token count: 4,080,670,967
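The generation step described above can be sketched in Python. This is a minimal illustration, not the actual LSG implementation: the sentence splitter is a crude regex stand-in, and all function names here are hypothetical. It tokenizes each record's combined title and abstract, generates 1- through 5-grams, tallies document and word (frequency) counts, and applies the two filters from the specification (max. 50 characters, min. word count 30).

```python
from collections import defaultdict
import re

MAX_CHARS = 50       # max n-gram character length (from the spec above)
MIN_WORD_COUNT = 30  # min corpus frequency (word count) to keep an n-gram
MAX_N = 5            # unigrams through five-grams

def sentences(text):
    # Crude sentence splitter for illustration; the real pipeline
    # uses a proper sentence tokenizer.
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

def ngrams_up_to(tokens, max_n):
    # Yield all 1..max_n grams as space-joined strings.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

def build_ngram_set(docs):
    """docs: iterable of (title, abstract) pairs.
    Returns {ngram: (document_count, word_count)} after filtering."""
    counts = defaultdict(lambda: [0, 0])  # [doc count, word count]
    for title, abstract in docs:
        seen = set()
        for sent in sentences(title + ' ' + abstract):
            tokens = sent.split()  # space is the word boundary
            for g in ngrams_up_to(tokens, MAX_N):
                counts[g][1] += 1  # word count: every occurrence
                seen.add(g)
        for g in seen:
            counts[g][0] += 1      # document count: once per record
    # Filter: drop n-grams over 50 characters or seen fewer than 30 times.
    return {g: (dc, wc) for g, (dc, wc) in counts.items()
            if len(g) <= MAX_CHARS and wc >= MIN_WORD_COUNT}
```

Note that the word count is incremented for every occurrence, while the document count is incremented at most once per record; both fields appear in the output files described below.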
- N-gram files
- File format - 3 fields: document count, word count, n-gram
- Each n-gram file is sorted by document count, word count, then alphabetic order of n-grams. The combined n-gram set is not sorted; it can be sorted by the nGramUtil package.
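A sketch of reading and sorting lines in this 3-field format is shown below. The `|` field separator and the descending direction of the count sort are assumptions for illustration; check them against the actual files (or use the nGramUtil package, which is the supported tool).

```python
def parse_line(line, sep='|'):
    # Fields per the spec: document count, word count, n-gram.
    # The '|' separator is an assumption; verify against the real files.
    dc, wc, gram = line.rstrip('\n').split(sep, 2)
    return int(dc), int(wc), gram

def sort_ngrams(lines, sep='|'):
    """Order lines by document count, then word count (descending counts
    are an assumption), then alphabetically by n-gram."""
    rows = [parse_line(line, sep) for line in lines]
    rows.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [f"{dc}{sep}{wc}{sep}{g}" for dc, wc, g in rows]
```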
- Download:
| N-grams | File | Zip Size | Actual Size | No. of n-grams |
|---|---|---|---|---|
| Unigrams | 1-gram.2020.tgz | 7.6 MB | 19 MB | 1,126,766 |
| Bigrams | 2-gram.2020.tgz | 49 MB | 143 MB | 6,702,698 |
| Trigrams | 3-gram.2020.tgz | 75 MB | 248 MB | 9,677,700 |
| Four-grams | 4-gram.2020.tgz | 54 MB | 187 MB | 6,154,320 |
| Five-grams | 5-gram.2020.tgz | 26 MB | 94 MB | 2,649,324 |
| N-gram Set | nGramSet.2020.30.tgz | 210 MB | 689 MB | 26,310,808 |
| Distilled N-gram Set | distilledNGram.2020.tgz | 84 MB | 271 MB | 10,354,021 |
