Lexicon - The MEDLINE N-Gram Set
The MEDLINE n-gram set is used to retrieve multiwords for building the SPECIALIST Lexicon. The Lexical Systems Group (LSG) shares this n-gram set (n = 1 to 5) with the NLP/MLP community. The set for each year can be downloaded from the following links.
| Year | Document Count | Sentence Count | Word Count | N-grams | Distilled N-grams | DNg/Ng % | Download |
|---|---|---|---|---|---|---|---|
| 2021 | 31,850,051 | 209,685,517 | 4,365,354,060 | 28,103,252 | 11,127,802 | 39.60% | The MEDLINE n-gram set 2021 |
| 2020 | 30,420,660 | 196,566,513 | 4,080,670,967 | 26,310,808 | 10,354,021 | 39.35% | The MEDLINE n-gram set 2020 |
| 2019 | 29,138,919 | 185,619,887 | 3,824,268,997 | 24,666,816 | 9,595,606 | 38.90% | The MEDLINE n-gram set 2019 |
| 2018 | 27,837,540 | 174,395,209 | 3,585,789,820 | 23,171,133 | 8,979,895 | 38.75% | The MEDLINE n-gram set 2018 |
| 2017 | 26,759,399 | 163,021,640 | 3,386,661,350 | 21,963,037 | 8,461,972 | 38.53% | The MEDLINE n-gram set 2017 |
| 2016 | 24,358,442 | 143,471,776 | 2,971,013,236 | 19,325,338 | 7,402,848 | 38.31% | The MEDLINE n-gram set 2016 |
| 2015 | 23,343,329 | 134,834,507 | 2,786,085,158 | 18,148,692 | 6,793,561 | 37.43% | The MEDLINE n-gram set 2015 |
| 2014 | 22,356,869 | 126,612,705 | 2,610,209,406 | 17,023,819 | 6,351,392 | 37.31% | The MEDLINE n-gram set 2014 |
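The table's columns can be related with a short sketch. It enumerates contiguous word n-grams for n = 1 to 5 and recomputes the DNg/Ng % column as distilled n-grams divided by n-grams. The function names `word_ngrams` and `distilled_pct`, the sample phrase, and the whitespace tokenization are assumptions made for illustration. This is not the LSG pipeline, and the filters that produce the distilled set are not shown here.

```python
def word_ngrams(tokens, max_n=5):
    """Return all contiguous word n-grams for n = 1..max_n."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def distilled_pct(ngram_count, distilled_count):
    """Distilled n-grams as a percentage of all n-grams, to two decimals."""
    return round(100.0 * distilled_count / ngram_count, 2)


# Whitespace tokenization is an assumption for illustration only.
tokens = "magnetic resonance imaging of the brain".split()
grams = word_ngrams(tokens)

# Counts copied from the 2021 and 2014 rows of the table above.
pct_2021 = distilled_pct(28_103_252, 11_127_802)
pct_2014 = distilled_pct(17_023_819, 6_351_392)
print(len(grams), pct_2021, pct_2014)
```

The recomputed percentages can be compared against the DNg/Ng % column as a sanity check after downloading a set and counting its entries.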
References:
- Design Documents
- Papers:
- Generating A Distilled N-Gram Set: Effective Lexical Multiword Building in the SPECIALIST Lexicon
The 10th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2017), Vol(5): HEALTHINF, Porto, Portugal, February 21-23, 2017, p. 77-87
- Generating the MEDLINE N-gram Set
AMIA 2015 Annual Symposium, San Francisco, CA, November 14-18, 2015, p. 1569
