TC Package - PreProcess Procedures
This page describes the preprocess procedures for generating files for JDI, STI, and STRI.
Please refer to the PreProcess Design & Requirements section for details.
- Preparation:
It is a good idea to rerun every step from scratch to confirm that the whole process still works.
- copy the ${PRE_YEAR} data to ${YEAR} data
- Generate all JDI, STI, and STRI data
- Load to TC database
- JDI:
- shell> cd ${TC_DIR}/preProcess/tcPre2008
- shell> cd data/${YEAR}/Jdi/Input
- Update the following files:
- lsi.xml -> ./lsi${YEAR}.xml
- Copy from ftp://ftp.nlm.nih.gov/online/journals (new version posted in mid-Feb.)
- Manually delete the two declarations at the top of the file (the XML declaration and the DOCTYPE, which wraps onto a second line):
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SerialsSet PUBLIC "-//NLM//DTDSERIALS, 1st January 2010//EN"
"http://www.nlm.nih.gov/databases/dtd/nlmserials_100101.dtd">
- POC: Esther Baldinger (NLM/DLO/TSD)
- MEDLINE -> MEDLINE_baseline/${YEAR}
- /net/indfiler/vol/vol3/aux/MEDLINE_baseline/${YEAR}
- POC: Alan Aronson (CgSB)
- MRCON -> ./MRCON.${YEAR}AC
- ash:/u03/umls/Releases/${YEAR}${VERSION}/Full/ORF/META/MRCON
- POC: Kin Wah Fung (CgSB), SA: Dwayne McCully
- contractions.txt -> copy from previous version
- stopWords.txt -> copy from previous version (with modifications)
- shs.txt -> copy from previous version
- jds.txt -> copy from previous version
- MedLineFiles.txt -> get the list of all MEDLINE files
- shell> cd ${TC}/preProcess/tcPre2008/bin
- shell> 0.GetMedLineFiles
2010
2010
- MedLineYears.txt -> specify the three years of MEDLINE to be used
- 2007, 2008, 2009 for 2010 TC release
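The manual edit of lsi${YEAR}.xml described above (removing the XML declaration and the DOCTYPE) could also be scripted. The following is only a minimal sketch: the function name strip_prolog is hypothetical, and it assumes the DOCTYPE has no internal subset (no ">" inside the declaration), as in the example lines shown above.

```python
import re

# Matches an optional XML declaration followed by an optional DOCTYPE
# declaration at the very start of the text. The DOCTYPE may span several
# lines; this sketch assumes it contains no '>' before its closing bracket.
_PROLOG = re.compile(r"\A\s*(?:<\?xml[^>]*\?>\s*)?(?:<!DOCTYPE[^>]*>\s*)?")


def strip_prolog(xml_text: str) -> str:
    """Return xml_text without its leading XML declaration and DOCTYPE."""
    return _PROLOG.sub("", xml_text, count=1)
```

To apply it, read the downloaded file as text, pass the contents through strip_prolog, and write the result to lsi${YEAR}.xml. Text that has no prolog is returned unchanged.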
- Generating files:
- shell> 1.GenJdi
${YEAR}
11
=> It takes about 8 hours to complete
- Moving files:
- shell> cd ${TC_DIR}/tcPre2008/data/2010/Jdi
- shell> cp -rp Output Output.tc${YEAR}
- shell> 2.DeployJdiFilesToTc
${YEAR}
- Load files to database:
- shell> cd ${TC}/tc${YEAR}/
- update ${TC_DIR} in 1.CreateDb
- update ${TC}/tc${YEAR}/data/Config/tc.properties
- shell> ./bin/loadDb/1.CreateDb
- update hsqldb.cache_file_scale=8 in ${TC}/tc${YEAR}/data/HSqlDb/tc${YEAR}.properties
- shell> 2.AnalyzeInFiles ${YEAR}
This checks that each column in the DB tables is long enough for the data:
- 1) Word-Jd Scores
- 2) Mh-Jd Scores
- 3) Sh-Jd Scores
- shell> ./bin/loadDb/3.LoadDb
- 1) Word-Jd Scores
- 2) Mh-Jd Scores
- 3) Sh-Jd Scores
=> It takes less than 1 hour to complete
- update readonly=true in ${TC}/tc${YEAR}/data/HSqlDb/tc${YEAR}.properties
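The property updates in the steps above (hsqldb.cache_file_scale=8 before loading, readonly=true after loading) are simple key=value edits in tc${YEAR}.properties. The following is a minimal sketch of such an edit; the function name set_property is hypothetical, and it assumes one key=value pair per line with "#" comment lines.

```python
def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with key set to value (appended if absent)."""
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments and anything that is not a key=value line.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"
```

For example, set_property(text, "readonly", "true") rewrites an existing readonly line and leaves all other lines untouched.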
- Determine Max. Signal:
The Max. signal is chosen from the file wordSignalWcDcGt1.txt. It should:
- include: "cancer", "blood", "risk", "therapy"
- exclude: "function", "case"
- Suggestion from Susanne: use "cancer" as the upper limit, since it is not a stopword
- History of Max. signal selection, 2007 to 2010:
word | 2007 | 2008 | 2009 | 2010
---|---|---|---|---
risk | 464482 | 805636 | 876271 | 934014
cancer | 388950 | 645291 | 705814 | 754647
blood | 510753 | 608233 | 629776 | 644743
therapy | 444975 | 645880 | 682715 | 695532
function | 513072 | 757149 | 807189 | 837714
case | 430815 | 648134 | 699541 | 723212
Max.Signal | 510754 | 645881 | 705815 | 754648
- Test Jdi Tables by Similarity (to previous year)
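In the Max. signal history table above, the 2009 and 2010 values equal the value for "cancer" plus one. The following sketch reproduces that arithmetic. The "+1" rule is inferred from the table and is not stated on this page; the function names are hypothetical, and the format of wordSignalWcDcGt1.txt is not assumed, so the values are entered by hand.

```python
from typing import List, Mapping

# Values copied from the history table above.
SIGNALS_2009 = {"risk": 876271, "cancer": 705814, "blood": 629776,
                "therapy": 682715, "function": 807189, "case": 699541}
SIGNALS_2010 = {"risk": 934014, "cancer": 754647, "blood": 644743,
                "therapy": 695532, "function": 837714, "case": 723212}


def max_signal(signals: Mapping[str, int], upper_word: str = "cancer") -> int:
    """Assumed rule: the limit is one above the upper-limit word's value."""
    return signals[upper_word] + 1


def words_below(signals: Mapping[str, int], limit: int) -> List[str]:
    """Return the words whose value is strictly below the limit, sorted."""
    return sorted(word for word, value in signals.items() if value < limit)
```

Listing the words below a candidate limit with words_below is a quick way to compare that limit against the include and exclude lists before passing it to 3.GenSti.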
- STRI (with 1st run StDocument):
- shell> cd ${TC_DIR}/preProcess/tcPre2008
- shell> cd data/${YEAR}/Sti/Input
- Update the following files:
- MRSTY -> ./MRSTY.${YEAR}
- ash:/u03/umls/Releases/${YEAR}${VERSION}/Full/ORF/META/MRSTY
- POC: Kin Wah Fung (CgSB), SA: Dwayne McCully
- MRCONSO.RRF -> ./MRCONSO.RRF.${YEAR}
- ash:/u03/umls/Releases/2008AB/Full/RRF/META/MRCONSO.RRF
- POC: Kin Wah Fung (CgSB), SA: Dwayne McCully
- SRDEF.txt
- download from http://semanticnetwork.nlm.nih.gov/Download/RelationalFiles/SRDEF
- change the file name to SRDEF.txt
- stGroups.txt
- download from http://semanticnetwork.nlm.nih.gov/SemGroups/SemGroups.txt
- change (or link) the file name to stGroups.txt
From JDI:
- wordJdidWcDcTable.txt (generated from pre-JDI processes)
- jds.txt
- Generating files:
- make sure the "ROOT_DIR" in ${TC_DIR}/data/Config/tc.properties is correct!
- shell> 3.GenSti
${TC_VERSION}
${DATA_YEAR}
10
Max. signal (the value determined in the JDI step)
=> It takes about 10 min. to complete
- shell> 4.Deploy1stRunStriFilesToTc
- stJdTable.txt (1st Run)
- sts.txt
- STI & STRI (with refined StDocuments):
- shell> cd ${TC_DIR}/preProcess/tcPre2008
- Generating files:
- shell> 3.GenSti
${YEAR}
11
=> It takes about 2 hours to complete
Some error messages appear for refined-document-2; these are expected and can be ignored.
- shell> 7.DeployStrStriFilesToTc:
- WordStTable.txt
- sts.txt
- stJdTable.txt
- Load files to database:
- shell> cd ${TC}/tc${YEAR}/bin/loadDb/
- shell> 2.AnalyzeInFiles ${YEAR}
- update readonly=false in ${TC}/tc${YEAR}/data/HSqlDb/tc${YEAR}.properties
- shell> cd ${TC}/tc${YEAR}/
- shell> ./bin/loadDb/3.LoadDb
4) Word-St Scores
=> It takes about 4 hours to complete
- update readonly=true in ${TC}/tc${YEAR}/data/HSqlDb/tc${YEAR}.properties
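The 2.AnalyzeInFiles step above checks that the DB columns are long enough for the input files. The following is only a rough sketch of that kind of check and not the actual script: it assumes delimited text input (the "|" default is an assumption, as is the function name) and reports the longest value seen in each column.

```python
from typing import Iterable, List


def max_column_widths(lines: Iterable[str], delimiter: str = "|") -> List[int]:
    """Return the longest field length seen in each column position."""
    widths: List[int] = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        for i, field in enumerate(fields):
            # Grow the list the first time a new column position appears.
            if i == len(widths):
                widths.append(0)
            widths[i] = max(widths[i], len(field))
    return widths
```

An open file object can be passed directly as lines. The resulting widths can then be compared with the column sizes declared for the corresponding table.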
- TC.2010 precision on NLM WSD collection, 100 instances