Frequency Analysis on 5 WC ranges: 100, 1K, 10K, 100K, 1M
I. Introduction
A frequency strategy is important for LMW acquisition. It is applied to LMW candidates obtained from filters and matchers to improve precision. This page describes a frequency analysis over 5 word count (WC) ranges (100, 1K, 10K, 100K, 1M).
II. Details
- Directory:
${MULTIWORDS}/bin/08.MatcherSpVar
${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/
${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/Candidates/tag.2017.good
- Model:
- Input Data: 2015 Distilled MEDLINE N-gram Set
- Process:
- Step 51: Use SpVar model of M2CES to get SpVar List
medline.2.byM2CES.2.out.30.spVars (min_ed >= 2, WC >= 30)
- Step 60: Apply CUI filter
medline.2.byM2CES.2.out.30.spVars.cui
- Step 61A: Retrieve 500 LMW candidates at each of the 5 WC ranges
The algorithm counts only multiwords, taking 500 of them below each WC threshold:
  - 100
  - 1000
  - 10000
  - 100000
  - 1000000
- Tag them:
Tag | Description |
---|---|
AUTO_YES | Automatically tagged by computer if term is in Lexicon |
AUTO_NO | Automatically tagged by computer if term is in invalid LMW List |
Y | Manually tagged by linguists if term is LMW, then add to Lexicon |
N | Manually tagged by linguists if term is not LMW, then add to invalid LMW List |
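The candidate selection and automatic tagging described in the process above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the in-memory `(term, wc)` representation, the strict "below" comparison, and the highest-WC-first ordering are all assumptions. It also assumes that AUTO_NO means the term is already on the invalid LMW list.

```python
from typing import Iterable, List, Optional, Set, Tuple

# WC thresholds and sample size as listed in Step 61A.
WC_THRESHOLDS = [100, 1_000, 10_000, 100_000, 1_000_000]
SAMPLE_SIZE = 500


def is_multiword(term: str) -> bool:
    """Assumption: a multiword is any term with more than one whitespace-separated token."""
    return len(term.split()) > 1


def select_candidates(
    ngrams: Iterable[Tuple[str, int]],
    threshold: int,
    sample_size: int = SAMPLE_SIZE,
) -> List[Tuple[str, int]]:
    """Return up to `sample_size` multiwords whose WC is below `threshold`.

    Only multiwords count toward the sample; single words are skipped.
    Ordering by descending WC is an assumption made for this sketch.
    """
    below = [(term, wc) for term, wc in ngrams if wc < threshold and is_multiword(term)]
    below.sort(key=lambda pair: pair[1], reverse=True)
    return below[:sample_size]


def auto_tag(term: str, lexicon: Set[str], invalid_lmws: Set[str]) -> Optional[str]:
    """Return AUTO_YES or AUTO_NO when the term is already known, else None.

    None means the term still needs a manual Y/N tag from a linguist.
    """
    if term in lexicon:
        return "AUTO_YES"
    if term in invalid_lmws:
        return "AUTO_NO"
    return None


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the MEDLINE n-gram set.
    toy_ngrams = [("heart attack", 90), ("aspirin", 95), ("cell line", 50)]
    for term, wc in select_candidates(toy_ngrams, WC_THRESHOLDS[0]):
        print(term, wc, auto_tag(term, {"heart attack"}, set()))
```

Terms for which `auto_tag` returns `None` are the ones that would go to linguists for manual Y/N tagging.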
III. Results
Frequency | Precision (New Terms) | Precision (Total Terms) |
---|---|---|
100 | 19.81% (= 104/525) | 21.60% (= 116/537) |
1K | 36.77% (= 196/533) | 42.42% (= 249/587) |
10K | 47.73% (= 263/551) | 67.56% (= 604/894) |
100K | 35.72% (= 384/1075) | 68.38% (= 1516/2217) |
1M | 36.77% (= 556/1512) | 71.16% (= 2396/3367) |
Total precision increases as the frequency increases, from 21.60% at WC 100 to 71.16% at WC 1M, while precision on new terms peaks at 10K (47.73%). Thus, we should acquire LMWs from the highest-frequency n-grams.
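Each precision figure in the results table is a simple ratio of valid terms to tagged terms. The short sketch below recomputes the percentages from the counts in the table and checks that total precision rises with each WC range. The variable and function names are illustrative assumptions.

```python
from typing import Dict, Tuple


def precision(valid: int, tagged: int) -> float:
    """Precision as a percentage: valid LMWs divided by all tagged candidates."""
    if tagged <= 0:
        raise ValueError("tagged must be positive")
    return 100.0 * valid / tagged


# (valid, tagged) counts copied from the results table.
NEW_TERMS: Dict[str, Tuple[int, int]] = {
    "100": (104, 525),
    "1K": (196, 533),
    "10K": (263, 551),
    "100K": (384, 1075),
    "1M": (556, 1512),
}
TOTAL_TERMS: Dict[str, Tuple[int, int]] = {
    "100": (116, 537),
    "1K": (249, 587),
    "10K": (604, 894),
    "100K": (1516, 2217),
    "1M": (2396, 3367),
}

if __name__ == "__main__":
    for wc in NEW_TERMS:
        new_p = precision(*NEW_TERMS[wc])
        total_p = precision(*TOTAL_TERMS[wc])
        print(f"{wc:>4}  new={new_p:.2f}%  total={total_p:.2f}%")

    # Total precision should rise monotonically across the WC ranges.
    totals = [precision(*TOTAL_TERMS[wc]) for wc in TOTAL_TERMS]
    print("monotonic:", all(a < b for a, b in zip(totals, totals[1:])))
```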
Detailed data are available at:
${MULTIWORDS}/data/2015/outData/08.MatcherSpVar/Candidates/tag.2017.good/*.rpt