TC Package - PreProcess Procedures
This preprocess should be performed after the baseline software of the new release is completed. Please follow the annual release procedures for a new release. This page details the preprocess procedures for generating the JDI, STI, and STRI files. Please refer to the PreProcess Design & Requirements section for design details.
- It is a good idea to rerun everything on last year's data first to make sure everything is OK!
- Procedure Summary:
- complete the baseline for new release
- copy the ${PRE_YEAR} data to ${YEAR} data
- Generate all Jdi, Stri, stDoc (1st run) data
- Load to TC database and TC package
- Refine stDoc
- Load to TC package
- Copy dataset:
- shell> cd ${TC_DIR}/preProcess/tcPre2008
- shell> cp -rp ${PRE_YEAR} ${YEAR}
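The copy step above can be sketched as a small guarded helper. This is only an illustration: `copy_year_data` is a hypothetical function, not one of the TC scripts, and it assumes that `${PRE_YEAR}` and `${YEAR}` name directories.

```shell
# Hypothetical helper (not part of the TC scripts): copy last year's
# data directory to the new year, refusing to overwrite an existing one.
copy_year_data() {
  src=$1
  dst=$2
  if [ ! -d "$src" ]; then
    echo "missing source directory: $src" >&2
    return 1
  fi
  if [ -e "$dst" ]; then
    echo "target already exists: $dst" >&2
    return 1
  fi
  # -r copies the whole tree; -p keeps timestamps and permissions
  cp -rp "$src" "$dst"
}
```

It would be called as `copy_year_data "${PRE_YEAR}" "${YEAR}"` from the directory that holds the yearly data. The overwrite guard is a design choice, so that a half-updated `${YEAR}` directory is never silently merged with last year's files.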
- JDI:
- shell> cd ${TC_DIR}/preProcess/tcPre2008
- shell> cd data/${YEAR}/Jdi/Input
- I. Update following files:
- lsi.xml -> ./lsi${YEAR}.xml
- Copy from ftp://ftp.nlm.nih.gov/online/journals (new version posted in mid-Feb.)
- Manually delete the first two lines
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE SerialsSet PUBLIC "-//NLM//DTDSERIALS, 1st January 2010//EN" "http://www.nlm.nih.gov/databases/dtd/nlmserials_100101.dtd">
- POC: Esther Baldinger (NLM/DLO/TSD)
- MEDLINE -> MEDLINE_baseline/${YEAR}
- /nfsvol/indaux/MEDLINE_baseline/${YEAR}
- /nfsvol/nls/MEDLINE_baseline/${YEAR}
- POC: Alan Aronson, Jim Mork (CgSB)
- MRCON -> ./MRCON.${YEAR}AX
- ash:/u03/umls/Releases/${YEAR}${VERSION}/Full/ORF/META/MRCON
- POC: Kin Wah Fung (CgSB), SA: Dwayne McCully
- contractions.txt -> copy from previous version
- stopWords.txt -> copy from previous version (with modifications)
- shs.txt -> copy from previous version
- jds.txt -> copy from previous version
- MedLineFiles.txt & MedLineYears.txt
- shell> cd ${TC}/preProcess/tcPre2008/bin
- shell> 0.GetMedLineFiles
2010
2010
- MedLineYears.txt -> the (3) specified years of MEDLINE to be used
- ${YEAR}-3, ${YEAR}-2, ${YEAR}-1 for ${YEAR} TC release
- MedLineFiles.txt -> get the list of all MEDLINE files
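The three-year MEDLINE window described above (${YEAR}-3, ${YEAR}-2, ${YEAR}-1) is simple arithmetic. A minimal sketch, where `medline_years` is a hypothetical helper and nothing is assumed about the actual layout of MedLineYears.txt:

```shell
# Hypothetical helper: print the three MEDLINE years used for a TC release.
medline_years() {
  year=$1
  echo "$((year - 3)) $((year - 2)) $((year - 1))"
}

medline_years 2011   # prints: 2008 2009 2010
```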
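For the lsi${YEAR}.xml step above, deleting the first two lines by hand can also be done with `sed`. The sketch below is an assumption-laden illustration: `strip_xml_prolog` is a hypothetical helper, and it only strips when line 1 looks like the XML declaration and line 2 like the DOCTYPE, as in the example shown in this section.

```shell
# Hypothetical helper: write a copy of an XML file without its first two
# lines, but only if they are the XML declaration and the DOCTYPE.
strip_xml_prolog() {
  src=$1
  dst=$2
  if sed -n '1p' "$src" | grep -q '^<?xml' &&
     sed -n '2p' "$src" | grep -q '^<!DOCTYPE'; then
    sed '1,2d' "$src" > "$dst"
  else
    echo "unexpected header in $src; nothing written" >&2
    return 1
  fi
}
```

A call such as `strip_xml_prolog lsi.xml "lsi${YEAR}.xml"` leaves the downloaded file untouched, which makes it easy to rerun if a newer version is posted.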
- II. Generating files:
- shell> 1.GenJdi
${YEAR}
11
=> It takes about 3 hours to complete
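1.GenJdi prompts for its inputs (here ${YEAR} and 11) and then runs for hours, so it may be convenient to supply the answers up front. The sketch below assumes the script reads its prompts from standard input, which has not been verified; `run_with_answers` is a hypothetical helper.

```shell
# Hypothetical helper: feed prompt answers to an interactive script on
# stdin, one answer per line.
# ASSUMPTION: the target script reads its prompts from standard input.
# Usage: run_with_answers <command> <answer>...
run_with_answers() {
  cmd=$1
  shift
  printf '%s\n' "$@" | "$cmd"
}
```

For example, `run_with_answers ./1.GenJdi "${YEAR}" 11` would send the two answers shown above. If the script reads from the terminal directly instead of stdin, this approach does not apply and the answers must be typed in.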
- III. Moving files:
- shell> cd ${TC_DIR}/tcPre2008/data/${YEAR}/Jdi
- shell> cp -rp Output Output.tc${YEAR}
- shell> 2.DeployJdiFilesToTc
${YEAR}
- IV. Load files to database:
- shell> cd ${TC}/tc${YEAR}/
- update ${YEAR} in ${TC}/tc${YEAR}/bin/loadDb/1.CreateDb
- update ${TC}/tc${YEAR}/data/Config/tc.properties
- shell> ./bin/loadDb/1.CreateDb
- shell> 2.AnalyzeInFiles ${YEAR}
To make sure the length of each column in the DB tables is big enough for:
- 1) Word-Jd Scores
- 2) Mh-Jd Scores
- 3) Sh-Jd Scores
- shell> ./bin/loadDb/3.LoadDb
- 1) Word-Jd Scores
- 2) Mh-Jd Scores
- 3) Sh-Jd Scores
=> It takes less than 1 hour to complete
- update readonly=true in ${TC}/tc${YEAR}/data/HSqlDb/tc${YEAR}.properties
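The readonly flag in tc${YEAR}.properties is switched by hand before and after loading. A minimal sketch of doing this from the shell, assuming the file contains a line of the form `readonly=true` or `readonly=false`; `set_readonly` is a hypothetical helper, not one of the loadDb scripts.

```shell
# Hypothetical helper: set the readonly flag in a properties file.
# ASSUMPTION: the file has a line starting with "readonly=".
set_readonly() {
  file=$1
  value=$2
  case $value in
    true|false) ;;
    *) echo "value must be true or false" >&2; return 1 ;;
  esac
  if ! grep -q '^readonly=' "$file"; then
    echo "no readonly= line in $file" >&2
    return 1
  fi
  # Write to a temporary file first so a failed sed never truncates the original
  tmpfile="$file.tmp.$$"
  sed "s/^readonly=.*/readonly=$value/" "$file" > "$tmpfile" && mv "$tmpfile" "$file"
}
```

For example, `set_readonly "${TC}/tc${YEAR}/data/HSqlDb/tc${YEAR}.properties" true` after 3.LoadDb finishes.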
- V. Determine Max. Signal:
The Max. signal is chosen from file wordSignalWcDcGt1.txt. It should:
- include: "cancer", "blood", "risk", "therapy"
- exclude: "function", "case"
- Suggestion from Susanne: use "cancer" as upper limit since it is not a stopword
- History of Max. signal selection, 2007 ~ 2011

| word | 2007 | 2008 | 2009 | 2010 | 2011 |
|---|---|---|---|---|---|
| risk | 464482 | 805636 | 876271 | 934014 | 977065 |
| cancer | 388950 | 645291 | 705814 | 754647 | 792053 |
| blood | 510753 | 608233 | 629776 | 644743 | 671190 |
| therapy | 444975 | 645880 | 682715 | 695532 | 713875 |
| function | 513072 | 757149 | 807189 | 837714 | 859487 |
| case | 430815 | 648134 | 699541 | 723212 | 756545 |
| Max. Signal | 510754 | 645881 | 705815 | 754648 | 792054 |

- Change the Max. signal in ${TC_SRC}/FilterApi/LegalWordsOption.java
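In the history table above, each recorded Max. Signal is one more than the signal of one of the listed words (blood in 2007, therapy in 2008, cancer from 2009 to 2011). The lookup can be sketched as below. The file layout is an assumption: the sketch treats each line as `word|signal`, which may not match the real wordSignalWcDcGt1.txt, and `max_signal_for` is a hypothetical helper.

```shell
# Hypothetical helper: print (signal of the given word) + 1.
# ASSUMPTION: each line of the file is "word|signal"; adjust -F and the
# field numbers to the real layout of wordSignalWcDcGt1.txt.
# Usage: max_signal_for <file> <word>
max_signal_for() {
  awk -F'|' -v w="$2" '
    $1 == w { print $2 + 1; found = 1; exit }
    END { if (!found) exit 1 }
  ' "$1"
}
```

With the 2010 value from the table (cancer = 754647), `max_signal_for wordSignalWcDcGt1.txt cancer` would print 754648, provided the file follows the assumed layout. The helper returns a non-zero status if the word is not present.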
- VI. Test Jdi Tables by Similarity (to previous year)
- link files to ${Test}/TC/TrainSetTest/data/Input/${YEAR}/Jdi
- jds.txt
- shJdidDcTable.txt
- wordJdidWcDcTable.txt
- mhJdidDcTable.txt
- shell> cd ${Test}/TC/TrainSetTest/bin
- shell> 0.LinkFiles
- Similarity between releases
- shell> 1.TestJdi (15 min.)
previous year
current year
7
JDI Similarity Test Results:

| Releases | WordJdidWc | WordJdidDc | MhJdidDc | ShJdidDc |
|---|---|---|---|---|
| 2008~2009 | 97.08% | 97.69% | 99.04% | 99.99% |
| 2009~2010 | 96.37% | 97.01% | 98.64% | 99.82% |
| 2010~2011 | 96.49% | 97.10% | 98.68% | 99.76% |
- STRI (with 1st run StDocument):
- shell> cd ${TC_DIR}/preProcess/tcPre2008
- shell> cd data/${YEAR}/Sti/Input
- I. Update following files:
- MRSTY -> ./MRSTY.${YEAR}
- ash:/u03/umls/Releases/${YEAR}${VERSION}/Full/ORF/META/MRSTY
- POC: Kin Wah Fung (CgSB), SA: Dwayne McCully
- MRCONSO.RRF -> ./MRCONSO.RRF.${YEAR}
- ash:/u03/umls/Releases/2008AB/Full/RRF/META/MRCONSO.RRF
- POC: Kin Wah Fung (CgSB), SA: Dwayne McCully
- SRDEF.txt -> ./SRDEF.${YEAR}
- download from http://semanticnetwork.nlm.nih.gov/Download/RelationalFiles/SRDEF
- Edit the file SRDEF.${YEAR}
- link SRDEF.txt to SRDEF.${YEAR}
- stGroups.txt
- download from http://semanticnetwork.nlm.nih.gov/SemGroups/SemGroups.txt
- change (or link) the file name to stGroups.txt
From JDI:
- wordJdidWcDcTable.txt (generated from pre-JDI processes)
- jds.txt
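The "download, then link" steps for SRDEF.txt and stGroups.txt above can be sketched as follows. `link_year_file` is a hypothetical helper; the `curl` call in the comment uses the URL given above and is left commented out because the server may no longer serve that path.

```shell
# Hypothetical helper: point a stable file name at the year-specific copy.
# Usage: link_year_file <year-specific file> <stable name>
link_year_file() {
  target=$1   # e.g. SRDEF.2011
  name=$2     # e.g. SRDEF.txt
  if [ ! -f "$target" ]; then
    echo "missing $target" >&2
    return 1
  fi
  # -s makes a symbolic link; -f replaces last year's link if present
  ln -sf "$target" "$name"
}

# Download step (not run here):
# curl -fL -o "SRDEF.${YEAR}" http://semanticnetwork.nlm.nih.gov/Download/RelationalFiles/SRDEF
```

Run from data/${YEAR}/Sti/Input, `link_year_file "SRDEF.${YEAR}" SRDEF.txt` keeps the year-specific file as the record of what was downloaded while the scripts keep reading the stable name.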
- II. Generating files:
- make sure the "ROOT_DIR" in ${TC_DIR}/data/Config/tc.properties is correct (not AUTO_MODE)!
- shell> 3.GenSti
${TC_VERSION}
${DATA_YEAR}
10
input Max Signal
=> It takes about 10 min. to complete
- III. Moving files:
- shell> 4.Deploy1stRunStriFilesToTc ${YEAR}
- stJdTable.txt (1st Run)
- sts.txt
- STI & STRI (with refined stDocuments):
- shell> cd ${TC_DIR}/preProcess/tcPre2008
- I. Generating files:
- shell> 3.GenSti
${TC_YEAR}
${DATA_YEAR}
11
=> It takes about 2 hours to complete
Some error messages appear for refined-document-2, which is OK!
- Manual fix
- One error message appears for refined-document-1 (humn) and results in no words for humn in both "stDocument1.txt.refine" and "stDocument.txt.combine". The following steps provide a manual fix:
- shell> 5.RefineStDoc
${YEAR}
${YEAR}
1
1
2
humn
1
1
...
-- RefineStDocuments.RefineStDocuments(): humn, word Size: 28
1. applicant|humn|T016|8|0.5825909|false|0.6438478(0.8119695-0.16812167)
2. applicants|humn|T016|4|0.57193226|false|0.67198503(0.82519585-0.15321079)
3. delegate|humn|T016|6|0.52379584|false|0.61971915(0.77649516-0.15677604)
4. descendent|humn|T016|88|0.32893714|false|0.51979005(0.64032584-0.12053579)
5. human|humn|false
6. human|humn|false
7. human|humn|false
8. human|humn|false
9. human|humn|false
10. humans|humn|T016|93|0.44167516|false|0.676633(0.82196045-0.14532742)
11. individual|humn|T016|65|0.6035321|false|0.76930225(0.9201443-0.15084207)
12. individual|humn|T016|65|0.6035321|false|0.76930225(0.9201443-0.15084207)
13. individual|humn|T016|65|0.6035321|false|0.76930225(0.9201443-0.15084207)
14. interviewee|humn|T016|24|0.4677229|false|0.61210704(0.80709076-0.19498374)
15. invoker|humn|false
16. man|humn|T016|94|0.3168362|false|0.6577292(0.8419048-0.18417563)
17. man|humn|T016|94|0.3168362|false|0.6577292(0.8419048-0.18417563)
18. man|humn|T016|94|0.3168362|false|0.6577292(0.8419048-0.18417563)
19. owner|humn|T016|4|0.63383675|false|0.75085104(0.8832318-0.13238078)
20. owner|humn|T016|4|0.63383675|false|0.75085104(0.8832318-0.13238078)
21. producer|humn|T016|92|0.2529137|false|0.7038802(0.8824603-0.17858009)
22. recipient|humn|T016|105|0.11818392|false|0.39545894(0.4790922-0.083633274)
23. resident|humn|T016|69|0.5160713|false|0.630488(0.7693282-0.1388402)
24. sponsor|humn|T016|65|0.36086074|false|0.60545766(0.75192285-0.14646521)
25. swimmer|humn|T016|2|0.62622374|false|0.7654458(0.86221415-0.09676831)
26. swimmer|humn|T016|2|0.62622374|false|0.7654458(0.86221415-0.09676831)
27. swimmer|humn|T016|2|0.62622374|false|0.7654458(0.86221415-0.09676831)
28. user|humn|T016|39|0.25950813|false|0.7680406(0.8928807-0.1248401)
...
- Manually add "applicants owner owner swimmer swimmer swimmer" to humn in "stDocument1.txt.refine", since these words are within the top 5 (and also within 2 StdDev).
- Rerun steps 7, 8, 9 in 3.GenSti
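The words chosen in the manual fix above can be picked out of the 5.RefineStDoc listing mechanically. The sketch below rests on an inference, not on documented behavior: it treats the fourth `|`-separated field as the rank, so "within the top 5" becomes "field 4 <= 5". That reading reproduces the six words listed above, but it does not check the 2 StdDev condition. `top_ranked_words` is a hypothetical helper.

```shell
# Hypothetical helper: read a 5.RefineStDoc listing on stdin and print the
# words of one semantic type whose 4th field is at most a given value.
# ASSUMPTION: field 4 is the rank; lines without it (e.g. "human|humn|false")
# are skipped. The 2 StdDev condition is NOT checked here.
# Usage: top_ranked_words <semantic type> <max rank> < listing
top_ranked_words() {
  awk -F'|' -v st="$1" -v max="$2" '
    NF >= 4 && $2 == st && $4 + 0 <= max + 0 {
      word = $1
      sub(/^[ \t]*[0-9]+\.[ \t]*/, "", word)   # drop the "12. " prefix
      printf "%s%s", sep, word
      sep = " "
    }
    END { print "" }
  '
}
```

Saving the listing to a file and running `top_ranked_words humn 5 < listing.txt` gives a space-separated word list that can be pasted into "stDocument1.txt.refine". Repeated lines are kept on purpose, since the manual fix above also repeats "owner" and "swimmer".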
- II. Moving files:
- shell> 7.DeployStrStriFilesToTc:
- WordStTable.txt
- sts.txt
- stJdTable.txt
- III. Load files to database:
- shell> cd ${TC}/tc${YEAR}/bin/loadDb/
- shell> 2.AnalyzeInFiles ${YEAR}
- update readonly=false in ${TC}/tc${YEAR}/data/HSqlDb/tc${YEAR}.properties
- shell> cd ${TC}/tc${YEAR}/
- shell> ./bin/loadDb/3.LoadDb
4) Word-St Scores
=> It takes about 4 hours to complete
- update readonly=true in ${TC}/tc${YEAR}/data/HSqlDb/tc${YEAR}.properties
- IV. Test: TC.2011 precision on NLM WSD collection, 100 instances
- Preparation
- install tc to ${PROJECTS}/tc/${YEAR}
- Add tc.properties.${YEAR} to ${PROJECTS}/tc/${YEAR}/data/Config
- shell> cd ${TEST}/TC/WsdTest/
- update ./lib/tc${YEAR}dist.jar
- update ./build.xml
- project.year
- recompile
- shell> ant clean
- shell> ant
- Test it (can be skipped, replaced by test all)
- shell> ${TEST}/TC/WsdTest/bin/2.TestWsd
- shell> ${TEST}/TC/WsdTest/bin/3.TestWsdStats
- Test & results
- shell> ${TEST}/TC/WsdTest/bin/4.TestAll
This script tests all cases for ${YEAR}: 100 instances, all ambiguous words, all 3 test cases, and all 5 score types.
ST WSD Collections Tests (both train and test sets):

| TC Version | Ambiguous Sentence DC | Ambiguous Sentence WC | Ambiguous Sentence CS | Ambiguous Sentences DC | Ambiguous Sentences WC | Ambiguous Sentences CS | Ti-AB DC | Ti-AB WC | Ti-AB CS |
|---|---|---|---|---|---|---|---|---|---|
| 2007 | 74.61% | 75.00% | 74.91% | 74.95% | 75.39% | 75.05% | 74.05% | 74.32% | 74.32% |
| 2008 | 73.81% | 74.93% | 74.36% | 74.30% | 75.00% | 74.77% | 73.52% | 74.44% | 74.01% |
| 2009 | 77.37% | 77.11% | 76.91% | 76.79% | 76.72% | 76.62% | 76.13% | 76.65% | 76.12% |
| 2010 | 76.62% | 77.36% | 77.27% | 75.96% | 76.59% | 76.73% | 74.85% | 76.38% | 75.24% |
| 2011 | 77.11% | 77.53% | 77.24% | 76.00% | 77.10% | 76.49% | 74.82% | 76.81% | 75.55% |
- V. Test2: TC.2011 precision on MSH WSD set, data collection, 203 ambiguous words, 37,888 instances
- Preparation
- install tc to ${PROJECTS}/tc/${YEAR}
- Add tc.properties.${YEAR} to ${PROJECTS}/tc/${YEAR}/data/Config
- shell> cd ${TEST}/TC/WsdTest2/
- update ./lib/tc${YEAR}dist.jar
- update ./build.xml
- project.year
- recompile
- shell> ant clean
- shell> ant
- Test it (can be skipped, replaced by test all)
- shell> ${TEST}/TC/WsdTest2/bin/2.TestWsd
- shell> ${TEST}/TC/WsdTest2/bin/3.TestWsdStats
- Test & results
- shell> ${TEST}/TC/WsdTest2/bin/4.TestAll
This script tests all cases for ${YEAR}: 37,888 instances, all ambiguous words, all 3 test cases, and all 5 score types.
MSH WSD Set Tests:
The precision excludes cases where the answer cannot be found by StWSD:
- None: no answer is found when the ambiguous word size is less than 2, which results in no legal words
- None: some of the mapped STs are not legal in the tested TC release
- Multiple CUIs mapped to the same ST
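Precision/Weighted Precision Test for MSH WSD set (both ambiguous abbreviations and ambiguous terms). Each cell shows precision / weighted precision:

| TC Version | Ambiguous Sentence DC | Ambiguous Sentence WC | Ambiguous Sentence CS | Ambiguous Sentences DC | Ambiguous Sentences WC | Ambiguous Sentences CS | Ti-AB DC | Ti-AB WC | Ti-AB CS |
|---|---|---|---|---|---|---|---|---|---|
| 2007 | 70.66% / 71.90% | 70.58% / 72.19% | 70.70% / 72.13% | 70.56% / 71.34% | 70.59% / 71.40% | 70.58% / 71.56% | 70.79% / 70.84% | 70.76% / 71.31% | 70.79% / 70.98% |
| 2008 | 70.42% / 70.88% | 70.49% / 71.33% | 70.48% / 71.02% | 69.85% / 70.63% | 70.09% / 71.27% | 70.06% / 71.08% | 69.54% / 69.79% | 69.30% / 69.67% | 69.23% / 69.57% |
| 2009 | 66.63% / 66.91% | 66.21% / 66.83% | 66.44% / 66.72% | 66.46% / 67.14% | 65.79% / 66.47% | 64.23% / 66.74% | 66.93% / 66.96% | 66.36% / 66.56% | 66.78% / 66.81% |
| 2010 | 65.86% / 65.62% | 65.69% / 66.05% | 65.72% / 65.92% | 65.62% / 65.96% | 65.42% / 65.93% | 65.58% / 66.03% | 66.12% / 65.73% | 65.83% / 65.85% | 66.05% / 65.83% |
| 2011 | 67.09% / 66.64% | 66.76% / 66.93% | 67.00% / 66.76% | 66.90% / 66.43% | 66.89% / 67.21% | 66.64% / 66.55% | 67.20% / 66.35% | 67.06% / 66.67% | 67.05% / 66.34% |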