PreProcess - JDI, Phase I

I. Word-Jdid-Wc-Dc
The JDI training data set is obtained from MEDLINE records. Below is the detailed procedure for obtaining the training data set:

  1. Retrieve the designated years of MEDLINE records:
    For the 2004 data, we retrieve the 1999, 2000, and 2001 MEDLINE records.

  2. Add Journal Descriptors to retrieved records:
    This step adds JDs to the retrieved records by joining each record with the Journal Descriptors of its journal.

  3. Retrieve words by filtering the records into 3 files, one per field: TI (title), AB (abstract), JD (Journal Descriptor)

  4. Calculate the word count and document count for each word vs. JD pair (w-jd-wc-dc)

  5. Calculate words and normalized total word count (w-signal.lw.gt1.l)

  6. Calculate the document count for each JD (jd-dc.gt1.l)
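
The counting in steps 4 and 6 can be illustrated with a minimal sketch. It is not the actual JDI preprocessing code: the in-memory record layout, the tokenizer, and the names `tokenize` and `count_word_jd` are assumptions made for illustration only, and the normalization in step 5 is not covered.

```python
import re
from collections import defaultdict

# Hypothetical records: the real pipeline reads filtered MEDLINE files instead.
records = [
    {"TI": "Heart failure therapy", "AB": "Therapy of heart disease.", "JD": ["Cardiology"]},
    {"TI": "Heart imaging", "AB": "", "JD": ["Cardiology", "Radiology"]},
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def count_word_jd(records):
    """Return word counts and document counts per (word, JD), and document counts per JD."""
    wc = defaultdict(int)     # (word, jd) -> total occurrences of word in records with that JD
    dc = defaultdict(int)     # (word, jd) -> number of records with that JD containing word
    jd_dc = defaultdict(int)  # jd -> number of records carrying that JD
    for rec in records:
        words = tokenize(rec.get("TI", "") + " " + rec.get("AB", ""))
        for jd in rec.get("JD", []):
            jd_dc[jd] += 1
            for w in words:
                wc[(w, jd)] += 1
            for w in set(words):  # each record counts once per distinct word
                dc[(w, jd)] += 1
    return wc, dc, jd_dc

wc, dc, jd_dc = count_word_jd(records)
print(wc[("heart", "Cardiology")], dc[("heart", "Cardiology")], jd_dc["Cardiology"])
```

A record with several JDs contributes its words to every one of them in this sketch; whether the real pipeline does the same is not stated here.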

II. Mh-Jdid-Dc & Sh-Jdid-Dc
The training data set is calculated from the MeSH headings (MH) and subheadings (SH) of MEDLINE records. Below is the detailed procedure for obtaining the training data set:

  1. Retrieve the designated years of MEDLINE records:
    For the 2004 data, we retrieve the 1999, 2000, and 2001 MEDLINE records.

  2. Get all Journal Descriptors for all journals

  3. Add Journal Descriptors to retrieved records:
    This step adds JDs to the retrieved records by combining the results of the above two steps.

  4. Retrieve MeSH headings and subheadings (with star) and calculate their counts by filtering the records into 2 files based on the field: MH
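
The last step can be sketched in the same spirit. This is an illustration under stated assumptions, not the actual JDI code: it assumes MH values in the usual MEDLINE form `Heading/subheading/subheading`, with a leading `*` marking a starred term, and it keeps the star as part of the counted term. The record layout and the names `split_mh` and `count_mh_sh` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records with MH field values and the JDs added in step 3.
records = [
    {"MH": ["Heart Failure/*drug therapy", "Humans"], "JD": ["Cardiology"]},
    {"MH": ["*Heart Failure/mortality/drug therapy"], "JD": ["Cardiology"]},
]

def split_mh(value):
    """Split one MH value into its heading and its list of subheadings."""
    heading, *subs = value.split("/")
    return heading.strip(), [s.strip() for s in subs]

def count_mh_sh(records):
    """Return document counts per (heading, JD) and per (subheading, JD)."""
    mh_dc = defaultdict(int)
    sh_dc = defaultdict(int)
    for rec in records:
        headings, subheadings = set(), set()
        for value in rec.get("MH", []):
            heading, subs = split_mh(value)
            headings.add(heading)       # star, if any, stays attached to the term
            subheadings.update(subs)
        for jd in rec.get("JD", []):
            for h in headings:          # each record counts once per distinct term
                mh_dc[(h, jd)] += 1
            for s in subheadings:
                sh_dc[(s, jd)] += 1
    return mh_dc, sh_dc

mh_dc, sh_dc = count_mh_sh(records)
print(mh_dc[("Heart Failure", "Cardiology")], sh_dc[("*drug therapy", "Cardiology")])
```

Because the star is kept, `Heart Failure` and `*Heart Failure` are counted as separate terms here; stripping the star before counting would merge them.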