PreProcess - JDI, phase III
This page describes the automated pre-process tasks that generate input files for JDI (Journal Descriptor Indexing). The JDI pre-process has three phases:
- Phase I:
generate all files in Java input format from Lisp files. This data set is tested by comparing it to all Lisp files and the results of file.9801, and was used in tc2006.
- Phase II:
use Java programs to generate files from the original data (MEDLINE) and Lisp files. This data set is tested by comparing it to all files from phase I and the results of file.9801, and was used in tc2007.
- Phase III:
use a Java program to generate files from scratch (MEDLINE, Meta-Thesaurus, etc.). This data set is tested by comparing it to the final files from phase II by similarity (test suite), and has been used since tc2008.
The detailed procedures of the phase III approach are described below:
- Top directory
${TC_PRE_2008}/data/${YEAR}/Jdi/
- ${TC_PRE_2008}: the version of the pre-process software
- ${YEAR}: the version of the tc release, tc${YEAR}
- Input files
Required files to generate the training set:
- contractions.txt (Contractions, copy from previous year)
- stopWords.txt (stopWords, copy from previous year)
- shs.txt (SubHeadings, copy from previous year)
- lsi.xml (List of Serials Indexed file, get it from NLM)
- jds.txt (Journal Descriptors)
- MEDLINE (MedLine citations, /nfsvol/indaux/MEDLINE_baseline/2009)
- MedLineYear.txt (file name of MedLine Year collections, manually edit for 06~08)
- MedLineFiles.txt (file name of MedLine citations, ./bin/0.GetMedLineFiles)
- MRCON (Meta-Thesaurus release, from ash:/u03/umls/Releases/2008AB/Full/ORF/META/MRCON)
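Before running the generation step, it can help to confirm that the required inputs are in place. The following is a minimal sketch, not part of the pre-process software. It assumes the listed files sit directly in one directory supplied by the caller (the actual layout under ${TC_PRE_2008}/data/${YEAR}/Jdi/ may differ), and the name `missing_inputs` is hypothetical. MEDLINE and MRCON are left out because the list above gives them as external locations.

```python
from pathlib import Path

# File names taken from the input list above; their location is an assumption.
REQUIRED_INPUTS = [
    "contractions.txt",
    "stopWords.txt",
    "shs.txt",
    "lsi.xml",
    "jds.txt",
    "MedLineYear.txt",
    "MedLineFiles.txt",
]


def missing_inputs(input_dir):
    """Return the required file names that are not present in input_dir."""
    base = Path(input_dir)
    return [name for name in REQUIRED_INPUTS if not (base / name).is_file()]
```

An empty list from `missing_inputs(...)` means every file named above was found in that directory.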
- Generates files
shell> cd ${TC_PRE_2008}/bin
shell> 1.GenJdi 2009 11
- Static files
These files are static or only occasionally modified by hand:
- contractions.txt (Contractions)
- stopWords.txt (stopWords)
- shs.txt (SubHeadings)
- Derived from lsi${YEAR}.xml
Get the list of the latest Journal Descriptors (JDs) and associated Journal IDs (JIDs).
- jidTaJds.txt (JID-TA-JDs)
- jds.txt (Journal Descriptors)
- Derived from MEDLINE
Retrieve titles, abstracts, JIDs, starred MeSH main headings, and starred MeSH subheadings from MEDLINE citations (training set).
- Inter-files for verifications
- ./PMIDJD/pmidJd.${NUM}.txt (PMID-TI-AB-TA-JID-RNs-MHs-JDs)
- ./TI/uiTiW.${NUM}.txt (PMID-TI)
- ./AB/uiAbW.${NUM}.txt (PMID-AB)
- ./JD/uiJidJds.${NUM}.txt (PMID-JID-JDs)
- ./Mesh/mhStarJd.txt (MH-MH_DC-JDs-JD_DC)
- ./Mesh/shStarJd.txt (SH-SH_DC-JDs-JD_DC)
- Inter-files for training set
- ./TIAB/uiTiAbWords.${NUM}.txt (PMID-TIAB)
- wordWcDcGt1.txt (WORD-WC-DC Gt1: total word count and document count)
- mhDc.txt (MH-DC)
- mhJdidDc.txt (MH-JDID-DC)
- shDc.txt (SH-DC)
- shJdidDc.txt (SH-JDID-DC)
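The WORD-WC-DC file pairs each word with its total word count (WC) and its document count (DC). The sketch below shows one way such counts could be computed. It assumes each document is already a whitespace-separated string, that words are lowercased, and that "Gt1" means only words with a word count greater than 1 are kept (the list above does not say which count the threshold applies to). The function name is hypothetical and the actual Java program may work differently.

```python
from collections import Counter


def word_wc_dc(documents, min_wc_exclusive=1):
    """Map each word to (word count, document count), keeping words with WC > threshold."""
    wc = Counter()  # total occurrences across all documents
    dc = Counter()  # number of documents that contain the word
    for text in documents:
        words = text.lower().split()  # lowercasing is an assumption
        wc.update(words)
        dc.update(set(words))  # count each word once per document
    return {w: (wc[w], dc[w]) for w in wc if wc[w] > min_wc_exclusive}
```

Counting DC over `set(words)` is what separates it from WC: a word repeated within one title or abstract raises WC each time but DC only once.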
- Inter-files for verifications
- Derived from Meta-Thesaurus, MRCON
- restrictWordsGt1.txt (restrictWords Gt1)
- word-Jdid-Wc-Dc table
- wordSignalWcDcGt1.txt (Word-Signal-Wc-Dc Gt1)
- wordJdidWcDcGt1.txt (Word-Jdid-Wc-Dc Gt1)
- jdidDcNFactor.txt (Jdid-Dc-NFactor)
- WordJdidWcDctable.txt (word-Jdid-Wc-Dc table)
- MeSH: Mh-Jdid-Dc & Sh-Jdid-Dc tables
- MhJdidDcTable.txt (Mh-Jdid-Dc table)
- ShJdidDcTable.txt (Sh-Jdid-Dc table)
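The Mh-Jdid-Dc and Sh-Jdid-Dc tables hold a document count for each pairing of a MeSH heading with a journal descriptor. The following is a rough sketch of that counting under stated assumptions: each citation is given as a pair of (starred headings, JDs of its journal), and a citation contributes one document to every heading/JD combination. The input shape and the function name are hypothetical, and JD names stand in for JD IDs.

```python
from collections import Counter
from itertools import product


def heading_jd_dc(citations):
    """Count documents for each (heading, JD) pair.

    citations: iterable of (headings, jds) pairs, one per citation.
    """
    table = Counter()
    for headings, jds in citations:
        # set() ensures a citation is counted once per distinct pair
        for pair in product(set(headings), set(jds)):
            table[pair] += 1
    return table
```

The same function would serve for main headings and subheadings alike, since only the heading list passed in changes.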