Procedures for preparing Lexical Tools files

Scripts and programs generate the Lexical Tools data files automatically. The details are as follows:

I. Location:

  • "${LVG_COMPONENTS}/PreDataBase/bin/"

II. Inputs:

  • "${LEXICON}/data/${YEAR}/tables/"
  • "${LVG_COMPONENTS}/PreDataBase/data/${YEAR}/data/"

III. Outputs:

  • "${LVG_COMPONENTS}/PreDataBase/data/${YEAR}/data/"

IV. Detailed procedures:

  • shell> 1.LoadLexiconFiles ${YEAR} ${YEAR}-1

    This script copies the original source files to the dataOrg directory ($LVG_COMPONENTS/PreDataBase/data/${YEAR}/dataOrg/).

    Steps | Notes | Source | Target
    1 | Copy inflection variables file | $Lexicon/${YEAR}/tables/inflVars.data | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data
    2 | Copy & modify acronyms file | $Lexicon/${YEAR}/tables/LRABR | $PreDataBase/data/${YEAR}/dataOrg/acronyms; $PreDataBase/data/${YEAR}/dataOrg/acr_exp
    3 | Copy & modify proper file | $Lexicon/${YEAR}/tables/LRPRP | $PreDataBase/data/${YEAR}/dataOrg/proper
    4 | Copy nominalization file | $Lexicon/${YEAR}/tables/LRNOM | $PreDataBase/data/${YEAR}/dataOrg/LRNOM
    5 | Copy synonyms file | $PreDataBase/${YEAR}-1/data/synonyms.data | $PreDataBase/data/${YEAR}/dataOrg/synonyms.data
    6 | Copy derivation files | None (derivation.data has its own script to generate after 2013) | None
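
    Before running this step, it can help to confirm that the Lexicon source files named in the table are present. The following is a minimal sketch only: check_lexicon_inputs is a hypothetical helper, not one of the Lexical Tools scripts, and it covers only the four files that come from the Lexicon tables directory.

```shell
# Hypothetical helper (not part of the Lexical Tools scripts): report any
# Lexicon source file that is missing from the given tables directory.
check_lexicon_inputs() {
    tables_dir=$1
    missing=0
    for f in inflVars.data LRABR LRPRP LRNOM; do
        if [ ! -r "${tables_dir}/${f}" ]; then
            echo "missing: ${tables_dir}/${f}" >&2
            missing=1
        fi
    done
    # Exit status 0 when every file is readable, 1 otherwise
    return "$missing"
}
```

    shell> check_lexicon_inputs "${LEXICON}/data/${YEAR}/tables"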

  • shell> 2.GenerateLexiconFiles ${YEAR}

    This script generates the final lvg files in the data directory from the files in the dataOrg directory.

    Steps | Notes | Source | Target
    1 | Copy inflection variables file | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data | $PreDataBase/data/${YEAR}/data/infl.data
    2 | Copy & modify acronyms file | $PreDataBase/data/${YEAR}/dataOrg/acronyms | $PreDataBase/data/${YEAR}/data/acronym.data
    3 | Copy proper file | $PreDataBase/data/${YEAR}/dataOrg/proper | $PreDataBase/data/${YEAR}/data/properNoun.data
    4 | Copy nominalization file | $PreDataBase/data/${YEAR}/dataOrg/LRNOM | $PreDataBase/data/${YEAR}/data/nominalization.data
    5 | Copy synonyms file | $PreDataBase/data/${YEAR}/dataOrg/synonyms.data | $PreDataBase/data/${YEAR}/data/synonyms.data
    6 | Copy derivation files | $Derivation/5.All/data/${YEAR}/data/derivation.data | $PreDataBase/data/${YEAR}/data/derivation.data

  • shell> 3.MoveLexiconFiles ${YEAR}

    This script copies/moves the final lvg files from the data directory to the ${LVG_DIR}/data/tables directory.

    Steps | Notes | Source | Target
    1 | Copy infl.data file | $PreDataBase/data/${YEAR}/data/infl.data | ${LVG_DIR}/data/tables/infl.data
    2 | Copy acronym.data file | $PreDataBase/data/${YEAR}/data/acronym.data | ${LVG_DIR}/data/tables/acronym.data
    3 | Copy properNoun.data file | $PreDataBase/data/${YEAR}/data/properNoun.data | ${LVG_DIR}/data/tables/properNoun.data
    4 | Copy nominalization.data file | $PreDataBase/data/${YEAR}/data/nominalization.data | ${LVG_DIR}/data/tables/nominalization.data
    5 | Copy synonyms.data file | $PreDataBase/data/${YEAR}/data/synonyms.data | ${LVG_DIR}/data/tables/synonyms.data
    6 | Copy derivation.data files | $PreDataBase/data/${YEAR}/data/derivation.data | ${LVG_DIR}/data/tables/derivation.data
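
    After the move, one way to confirm that the six files arrived intact is a byte-for-byte comparison of source and target. This is a sketch under assumptions: verify_tables is a hypothetical helper, not part of 3.MoveLexiconFiles, and it only applies if the script copies the files (if it moves them, the sources no longer exist to compare against).

```shell
# Hypothetical helper: compare each lvg table file in the source directory
# with its copy in the target directory.
verify_tables() {
    src_dir=$1
    dst_dir=$2
    status=0
    for f in infl.data acronym.data properNoun.data \
             nominalization.data synonyms.data derivation.data; do
        # cmp -s is silent and returns non-zero if the files differ
        # or if either file cannot be read
        if ! cmp -s "${src_dir}/${f}" "${dst_dir}/${f}" 2>/dev/null; then
            echo "differs or missing: ${f}" >&2
            status=1
        fi
    done
    return "$status"
}
```

    shell> verify_tables "${LVG_COMPONENTS}/PreDataBase/data/${YEAR}/data" "${LVG_DIR}/data/tables"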

  • shell> 4.AnalyzeLvgFiles ${YEAR}

    Analyze the files to find the maximum length of each field, then check the results against the database design for each field of each table.

    Steps | Notes | Source | Table
    1 | AnalyzeInflection | ${LVG_DIR}/data/tables/infl.data | Inflection
    2 | AnalyzeAcronym | ${LVG_DIR}/data/tables/acronym.data | Acronym
    3 | AnalyzeProperNoun | ${LVG_DIR}/data/tables/properNoun.data | ProperNoun
    4 | AnalyzeNominalization | ${LVG_DIR}/data/tables/nominalization.data | Nominalization
    5 | AnalyzeSynonym | ${LVG_DIR}/data/tables/synonyms.data | LexSynonym
    6 | AnalyzeDerivation | ${LVG_DIR}/data/tables/derivation.data | Derivation

    • Check the max. field length; if it exceeds the size in the database design, change the source code to fit
    • Recompile if the source code is changed
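
    The field-length check can also be approximated from the command line. The sketch below is not the Analyze* programs themselves: max_field_lengths is a hypothetical helper, and it assumes the data files are pipe-delimited ("|"), which this page does not state. Lengths are counted as awk counts them in the current locale, so non-ASCII text may be measured in bytes or in characters.

```shell
# Hypothetical helper: print "<field number> <max length>" for every field
# of a file whose fields are assumed to be separated by "|".
max_field_lengths() {
    awk -F'|' '
        {
            for (i = 1; i <= NF; i++) {
                if (length($i) > max[i]) max[i] = length($i)
            }
            # Remember the widest record so every field is reported
            if (NF > nf) nf = NF
        }
        END {
            for (i = 1; i <= nf; i++) print i, max[i] + 0
        }
    ' "$1"
}
```

    shell> max_field_lengths ${LVG_DIR}/data/tables/infl.data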

  • Load data from Lexicon files to Lvg database
    Load these data into the HSqlDb database
    • shell> cd ${LVG_DIR}/loadDb/bin
    • shell> LoadDb ${YEAR}
    • Choose Db (HSqlDb)
      PS. Make sure the property value is "readonly=false" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
    • Choose tables option 10) to load Lexicon tables (1 ~ 6)
    • Change the property value back to "readonly=true" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties

  • Generate canonical data
    Generate canonical data for luiNorm
    • Make sure the above files have been reloaded into the Db on ${LVG_DEV}
    • Make sure to recompile (ant dist) on ${LVG_DEV}

    • Make directories
      shell> cd ${CANON_GEN}/data/
      shell> mkdir ${YEAR}
      shell> cd ${YEAR}
      shell> mkdir dataOrg
      shell> mkdir data
      shell> mkdir output
      
    • Get atoms.data (from OCCS) and save it as $CANON_GEN/data/${YEAR}/dataOrg/atoms.data.mmddyy
    • shell> ln -s atoms.data.mmddyy atoms.org
    • shell> cd $LVG_Components/CanonGenerator/bin
    • shell> 0.ModifyAtoms ${YEAR}

      Steps | Notes | Source | Target
      1 | Get ENG entry from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG
      2 | Get SPA entry from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.SPA
      3 | Generate atoms.data file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG | $CANON_GEN/data/${YEAR}/dataOrg/atoms.data

    • shell> cd ${CANON_GEN}/data/
    • shell> rm -rf HSqlDb
    • Make sure the "varchar(110)" is big enough in
      • "base varchar(110)" in CanonDbBaseForms.CreateBaseTable( );
      • "base varchar(110)" in CanonDbCanon.CreateCanonTable( );
      • "inflection varchar(110)" in CanonDbInflection.CreateInflectionTable( );

    • Update the variable ${LVG_DIR} in ${LVG_DIR}/data/config/lvg.properties to the full path of lvg (it can't be AUTO_MODE)
    • shell> 1.RunCanonAll ${YEAR}
      
      --------------------------------------
      Which Program ?
      --------------------------------------
      1) Generate terms list
      2) Generate words list
      3) Generate unique words list
      4) Generate base forms list
      5) Generate unique base forms list
      6) Generate canoncal forms
      7) Check non-ASCII canon
      8) All (default)
      9) Generate canoncal forms from test
      ----------
      
      

      Steps | Notes | Source | Target
      1 | Get terms list | ${LVG_DIR}/data/tables/infl.data; $CANON_GEN/data/${YEAR}/dataOrg/atoms.data | $CANON_GEN/data/${YEAR}/data/termList.data
      2 | Get words list | $CANON_GEN/data/${YEAR}/data/termList.data | $CANON_GEN/data/${YEAR}/data/wordList.data
      3 | Sort and unify words list | $CANON_GEN/data/${YEAR}/data/wordList.data | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data
      4 | Get base forms of unique words list | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data | $CANON_GEN/data/${YEAR}/data/baseList.data
      5 | Combine bases (spelling variants) from infl.vars with baseList.data; normalize non-ASCII characters; sort and unify bases list | $CANON_GEN/data/${YEAR}/data/baseList.data | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data
      6 | Generate canonical forms | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data | $CANON_GEN/data/${YEAR}/data/canonical.data
      7 | Check/modify non-ASCII in canonical forms | $CANON_GEN/data/${YEAR}/data/canonical.data | $CANON_GEN/data/${YEAR}/data/notKnownUnicode.data; $CANON_GEN/data/${YEAR}/data/nonAscii.data
    • It takes about 1.0 hr. to run the above steps for 2010 using HSqlDb.2.0.0.0
    • It takes about 140 min. to run the above steps for 2012 using HSqlDb.2.2.5
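
    The "sort and unify" operation used for the words list and the bases list amounts to sorting the lines and dropping duplicates. The sketch below illustrates that idea only and is not the CanonGenerator implementation: unique_lines is a hypothetical helper, it assumes one entry per line, and it uses byte-order (C locale) sorting, which may differ from the ordering the real program produces.

```shell
# Hypothetical helper: write the sorted, de-duplicated lines of the
# input file to the output file.
# Usage: unique_lines <input file> <output file>
unique_lines() {
    # LC_ALL=C gives a locale-independent, byte-order sort
    LC_ALL=C sort -u "$1" > "$2"
}
```

    shell> unique_lines wordList.data uniqueWordList.check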

  • shell> 5.Generate2Files ${YEAR}

    Generate lvg files by running lvg. The lvg used is the one in ${DEV_DIR}.
    Make sure the variable ${LVG_DIR} in the lvg config file, lvg.properties, uses the full path of lvg (not AUTO_MODE).
    shell> cd ${RPRE_DATABASE}/bin
    shell> 5.Generate2Files <year>

    Steps | Notes | Source | Target | Run Time
    1 | Generate fruitful variants | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/fruitful.data | 1 hr.
    2 | Generate AntiNorm | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/antiNorm.data | 1 hr.
    3 | Copy canonical data | $CanonGenerator/data/${YEAR}/data/canonical.data | $PreDataBase/data/${YEAR}/data/canonical.data | 2 hr.

    PS. GenerateAntiNorm must be recompiled with the new lvg${YEAR}dist.jar

  • shell> 6.Move2Files ${YEAR}

    This script copies/moves the final lvg-generated files from the data directory to the ${LVG_DIR}/data/tables directory.

    Steps | Notes | Source | Target
    2 | Copy fruitful.data file | $PreDataBase/data/${YEAR}/data/fruitful.data | ${LVG_DIR}/data/tables/fruitful.data
    3 | Copy antiNorm.data file | $PreDataBase/data/${YEAR}/data/antiNorm.data | ${LVG_DIR}/data/tables/antiNorm.data
    4 | Copy canonical.data file | $PreDataBase/data/${YEAR}/data/canonical.data | ${LVG_DIR}/data/tables/canonical.data

  • shell> 7.Analyze2Files ${YEAR}

    Analyze the files to find the maximum length of each field, then check the results against the database design for each field of each table.

    Steps | Notes | Source | Table
    1 | AnalyzeFruitful | ${LVG_DIR}/data/tables/fruitful.data | Fruitful
    2 | AnalyzeAntiNorm | ${LVG_DIR}/data/tables/antiNorm.data | AntiNorm
    3 | AnalyzeCanon | ${LVG_DIR}/data/tables/canonical.data | Canonical

  • Load data from 2 files to Lvg database
    Load these data into the HSqlDb database
    • shell> cd ${LVG_DIR}/loadDb/bin
    • shell> LoadDb ${YEAR}
    • Choose Db (HSqlDb & MySql)
      PS. Make sure the property value is "readonly=false" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
    • Choose tables option 11) to load 2 tables
    • After the load is done, change the property value back to "readonly=true" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties