Consumer Health Corpus

I. Introduction

A corpus relevant to consumer health data should increase the performance of CSpell. Accordingly, we established a consumer health corpus by collecting health-related articles from 16 web sites that were used for answering consumer health questions:

Consumer Health Corpus (from Ashutosh's Crawler, 10.09.17)

Source | Abbreviation | Web Site Base URL | No. of Articles
Genetic and Rare Diseases - Diseases | gard | https://rarediseases.info.nih.gov/gard/ | 6484
Genetics Home Reference - Conditions | ghr | https://ghr.nlm.nih.gov/condition | 1215
Genetics Home Reference - Genes | ghrgenes | https://ghr.nlm.nih.gov/gene | 1439
MedlinePlus - Drugs | mplusdrugs | https://medlineplus.gov/druginfo/ | 1383
MedlinePlus - Medical Encyclopedia | mplusencyclopedia | https://medlineplus.gov/ency/ | 4425
MedlinePlus - All Health Topics | mplushealthtopics | https://www.nlm.nih.gov/medlineplus/all_healthtopics.html | 1013
MedlinePlus - Herbs and Supplements | mplusherbssupplements | https://www.nlm.nih.gov/medlineplus/druginfo/herb_All.html | 153/177
National Eye Institute | nei | https://nei.nih.gov/health | 36
National Heart, Lung, and Blood Institute | nhlbi | http://www.nhlbi.nih.gov/health/health-topics/by-alpha | 141
National Institute of Allergy and Infectious Diseases | niaid | https://www.niaid.nih.gov/diseases-conditions/all | 53
National Institute of Arthritis and Musculoskeletal and Skin Diseases | niams | https://www.niams.nih.gov/health-topics/all-diseases | 55
National Institute of Child Health and Human Development | nichd | https://www.nichd.nih.gov/health/topics/Pages/index.aspx | 81
National Institute on Deafness and Other Communication Disorders | nidcd | https://www.nidcd.nih.gov/health/hearing-ear-infections-deafness | 13/15
National Institute of Diabetes and Digestive and Kidney Diseases | niddk | https://www.niddk.nih.gov/health-information | 181/185
National Institute of Mental Health | nimh | http://www.nimh.nih.gov/health/topics/index.shtml | 25/26
National Institute of Neurological Disorders and Stroke | ninds | https://www.ninds.nih.gov/Disorders/All-Disorders | 439
Centers for Disease Control and Prevention | cdc | https://www.cdc.gov/ | TBD
National Cancer Institute | cancerGov | https://www.cancer.gov/types | TBD
National Institute on Aging | nia | https://www.nia.nih.gov | TBD
National Institutes of Health - Office of Research on Women's Health | womenhealth | https://orwh.od.nih.gov/ | TBD

II. Algorithm

  • A crawler was developed to collect consumer health-related articles. The output is stored in XML format.
  • These articles are converted to plain text.
  • An n-gram algorithm is applied to the text.
  • The raw unigrams are grouped by their lowercased core terms for word counting.
  • The results (WC|unigram) are used as word frequency data for CSpell.
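
The counting steps above can be illustrated with a minimal Python sketch. It is not the tool used to build this corpus. The function names are hypothetical, and it assumes that a core term is a token with leading and trailing punctuation stripped, which may differ from the definition CSpell uses.

```python
import re
from collections import Counter
from typing import List

# Assumption: a "core term" is a token with leading and trailing
# non-word characters removed. CSpell's own rule may differ.
_EDGE_PUNCT = re.compile(r"^\W+|\W+$")


def core_term_lc(token: str) -> str:
    """Strip punctuation from both ends of a token and lowercase it."""
    return _EDGE_PUNCT.sub("", token).lower()


def unigram_counts(text: str) -> Counter:
    """Count whitespace-separated unigrams, grouped by lowercased core term."""
    counts = Counter()
    for token in text.split():
        term = core_term_lc(token)
        if term:  # skip tokens that were punctuation only
            counts[term] += 1
    return counts


def format_word_freq(counts: Counter) -> List[str]:
    """Render counts as WC|unigram lines, highest count first, ties A-Z."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{wc}|{term}" for term, wc in ordered]
```

For example, format_word_freq(unigram_counts(text)) returns one WC|unigram string per distinct core term, which could then be written to a file, one entry per line.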

III. Consumer Health Corpus

  • Articles: 17,139
  • Sentences: 550,193
  • Tokens: 10,228,699
  • Unique Words: 192,818
  • Unique CoreTerm.Lc: 109,175
  • Dic Words in Corpus: 48,690|8.5886%
  • Dic Words WC: 9,979,195|97.6123%

IV. Notes

  • The special codes, such as [NUM], [EMAIL], and [URL], need to be consistent with the input data for CSpell. For example, the development set used [CONTACT] for both telephone numbers and email addresses, which results in lower precision in context ranking. The pre-processing step that tags the corpus needs a cleanup.
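
One way to spot this kind of inconsistency is to compare the special codes that occur in the corpus with those that occur in the input data. The sketch below is only an illustration. It assumes that special codes are upper-case words in square brackets, and the function names are hypothetical.

```python
import re

# Assumption: special codes look like [NUM], [EMAIL], [URL], [CONTACT],
# i.e. upper-case letters enclosed in square brackets.
TAG_PATTERN = re.compile(r"\[[A-Z]+\]")


def special_codes(text):
    """Return the set of special codes found in a text."""
    return set(TAG_PATTERN.findall(text))


def code_mismatch(corpus_text, input_text):
    """List the codes that appear on one side only."""
    corpus_codes = special_codes(corpus_text)
    input_codes = special_codes(input_text)
    return {
        "corpus_only": sorted(corpus_codes - input_codes),
        "input_only": sorted(input_codes - corpus_codes),
    }
```

A non-empty "input_only" list, for instance one containing [CONTACT], would indicate a code that the corpus word frequencies do not cover.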