Consumer Health Corpus

I. Introduction

A corpus drawn from consumer health text should improve the performance of CSpell. Accordingly, we built a consumer health corpus by collecting health-related articles from 16 web sites that were used for answering consumer health questions:

Consumer Health Corpus (from Ashutosh's Crawler, 10.09.17)

Sources | Abbreviation | Web site Base URL | Article No.
Genetic and Rare Diseases - Diseases | gard | |
Genetics Home Reference - Conditions | ghr | |
Genetics Home Reference - Genes | ghrgenes | |
MedlinePlus - Drugs | mplusdrugs | |
MedlinePlus - Medical Encyclopedia | mplusencyclopedia | |
MedlinePlus - All Health Topics | mplushealthtopics | |
MedlinePlus - Herbs and Supplements | mplusherbssupplements | |
National Eye Institute | nei | |
National Heart, Lung, and Blood Institute | nhlbi | |
National Institute of Allergy and Infectious Diseases | niaid | |
National Institute of Arthritis and Musculoskeletal and Skin Diseases | niams | |
National Institute of Child Health and Human Development | nichd | |
National Institute on Deafness and Other Communication Disorders | nidcd | |
National Institute of Diabetes and Digestive and Kidney Diseases | niddk | |
National Institute of Mental Health | nimh | |
National Institute of Neurological Disorders and Stroke | ninds | |
Centers for Disease Control and Prevention | cdc | |
National Cancer Institute | cancerGov | |
National Institute on Aging | nia | https://www.nia.nih.gov | TBD
National Institutes of Health - Office of Research on Women's Health | womenhealth | |

II. Algorithm

  • A crawler was developed to search for consumer health related articles. Its output is stored in XML format.
  • These articles are converted to plain text format.
  • An n-gram algorithm is applied to the text to extract unigrams.
  • The raw unigrams are grouped by their lowercased core term, and the word count is computed for each group.
  • The results (WC|unigram) are used as word frequency data for CSpell.
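
The counting steps above can be illustrated with a minimal Python sketch. It is not the CSpell implementation: it assumes that a core term is simply the token with leading and trailing punctuation stripped, that tokens are separated by whitespace, and the function names (core_term_lc, count_unigrams, format_frequency_lines) are hypothetical.

```python
import string
from collections import Counter


def core_term_lc(token: str) -> str:
    """Return an assumed 'core term': edge punctuation stripped, lowercased."""
    return token.strip(string.punctuation).lower()


def count_unigrams(text: str) -> Counter:
    """Count whitespace-separated unigrams, grouped by lowercased core term."""
    counts = Counter()
    for token in text.split():
        key = core_term_lc(token)
        if key:  # skip tokens that were punctuation only
            counts[key] += 1
    return counts


def format_frequency_lines(counts: Counter) -> list:
    """Format counts as 'WC|unigram' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{wc}|{word}" for word, wc in ordered]
```

For example, count_unigrams("Diabetes, diabetes and (Insulin).") groups "Diabetes," and "diabetes" under the same key, so the formatted output starts with the line 2|diabetes. Note that stripping all edge punctuation would also strip the brackets from special codes such as [NUM]; a real pipeline would need to treat those separately.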

III. Consumer Health Corpus

  • Articles: 17,139
  • Sentences: 550,193
  • Tokens: 10,228,699
  • Unique Words: 192,818
  • Unique CoreTerm.Lc: 109,175
  • Dictionary Words in Corpus: 48,690|8.5886%
  • Dictionary Words WC: 9,979,195|97.6123%

IV. Notes

  • The special codes, such as [NUM], [EMAIL], and [URL], need to be consistent with the input data for CSpell. For example, the development set used [CONTACT] for both telephone numbers and email addresses, which results in lower precision in context ranking. The pre-processing step that tags the corpus needs a cleanup.
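
One way to detect this kind of mismatch is to scan a text for bracketed codes that are not in the expected set. The sketch below is only an illustration: the expected set contains just the three codes named above (the full list used by CSpell may be longer), the pattern assumes codes are uppercase letters in square brackets, and find_unexpected_tags is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: only the three codes named in the note are expected.
EXPECTED_TAGS = {"[NUM]", "[EMAIL]", "[URL]"}

# Assumption: a special code is a run of uppercase letters in square brackets.
TAG_PATTERN = re.compile(r"\[[A-Z]+\]")


def find_unexpected_tags(text: str, expected=EXPECTED_TAGS) -> dict:
    """Return a count of each bracketed code in text that is not expected."""
    found = Counter(TAG_PATTERN.findall(text))
    return {tag: n for tag, n in found.items() if tag not in expected}
```

Running this over both the corpus and the input data and comparing the results would show codes, such as [CONTACT], that appear in one but not in the other.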