Validate and Fix LEXICON

  • Step 0-1: The input file: LEXICON
      shell> cp -p LEXICON ${LEXICON}/data/${YEAR}/data/LEXICON.mmddyy
    • Make a symbolic link in the development machine (lexlx1)
      shell> cd ${LEXICON}/data/${YEAR}/data
      shell> ln -sf ./LEXICON.mmddyy LEXICON.freeze
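The copy-and-link in Step 0-1 can be sketched as one small shell function. The function name `freeze_lexicon`, its arguments, and the use of `date +%m%d%y` to build the `mmddyy` suffix are assumptions made for illustration; the procedure itself only gives the two commands above.

```shell
# Sketch only: freeze_lexicon and its arguments are hypothetical helpers.
# Usage: freeze_lexicon <input LEXICON file> <target data directory> [mmddyy]
freeze_lexicon() {
  src=$1
  dir=$2
  stamp=${3:-$(date +%m%d%y)}   # assumed way to build the mmddyy suffix
  # -p keeps the timestamps and permissions of the input file
  cp -p "$src" "$dir/LEXICON.$stamp" || return 1
  # Create the relative link from inside the directory, as in the step above
  (cd "$dir" && ln -sf "./LEXICON.$stamp" LEXICON.freeze)
}
```

Passing the suffix explicitly as the third argument is useful when the freeze date differs from the day the command is run.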

  • Step 0-2: Trim extra spaces
    • If extra spaces are found, trim them in LexBuild and go back to the previous step
      shell> fgrep "  " LEXICON.freeze | wc -l
      => Should be 0; LexBuild trims all extra spaces automatically
      If not, the data in LexBuild must be fixed as well
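A minimal sketch of the same check as a reusable function, assuming `grep -F` (the standard spelling of `fgrep`) is available. The name `count_extra_space` is hypothetical.

```shell
# Sketch only: count_extra_space is a hypothetical helper name.
# Prints the number of lines that contain two consecutive spaces.
# Usage: count_extra_space LEXICON.freeze
count_extra_space() {
  # grep -c exits with status 1 when the count is 0, so ignore the status
  grep -cF "  " "$1" || true
}
```

Unlike the `fgrep ... | wc -l` pipeline, `grep -c` prints the count directly; both count matching lines, not individual occurrences.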

  • Step 1: Remove annotations & signatures
    • shell> ${LEXICON}/bin/1.FinalizeLexicon <year>
      Make sure the Java version in the script is correct!
      • Input: LEXICON.freeze
      • Operations:
        • Remove annotations & signatures from freeze version to generate LEXICON.freeze.removeAnnotation
        • Check and fix non-compliant non-ASCII characters between HTML and Unicode (U+0080 ~ U+009F), and send the output to LEXICON.release
        • Correct the illegal non-ASCII characters in LEXICON.release
      • Output:
        • LEXICON.freeze.removeAnnotation
        • LEXICON.freeze.removeAnnotation.nonAscii
        • LEXICON.freeze.removeAnnotation.nonAscii.Stat
        • LEXICON.release.1.NoAnnotationNoIllegalNonAscii
        • LEXICON.release.nonAscii
        • LEXICON.release.nonAscii.Stat

        • LEXICON.release (this is the file name used in the process after this step)
          => This file is the same as ./LEXICON.release.1.NoAnnotationNoIllegalNonAscii
        • mv ./LEXICON.release LEXICON.release.log.1.noAnno
        • Link LEXICON.release to LEXICON.release.log.1.noAnno for the next step
        • ln -sf ./LEXICON.release.log.1.noAnno LEXICON.release

        No need to go through detail of output at this point.
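The "save the current file under a log name, then link LEXICON.release to it" pattern at the end of Step 1 recurs in later steps. A minimal sketch follows; the helper name `snapshot_and_link` and the guard against moving an existing link are assumptions, not part of the original scripts.

```shell
# Sketch only: snapshot_and_link is a hypothetical helper name.
# Usage (run inside the data directory):
#   snapshot_and_link LEXICON.release LEXICON.release.log.1.noAnno
snapshot_and_link() {
  current=$1
  snapshot=$2
  # Refuse to move a symbolic link: that would rename the link, not the data
  if [ -L "$current" ]; then
    echo "$current is already a symbolic link" >&2
    return 1
  fi
  mv "$current" "$snapshot" || return 1
  ln -sf "./$snapshot" "$current"
}
```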
  • Step 2-1: Validate EUI, syntax, content, cross-reference, and illegal non-ASCII characters:
    • Check errors for syntax, content, cross-ref, etc.
    • Send the list to LexBuilders to fix lexRecords through LexBuild
    • Fix step by step and rerun the program using the new LEXICON.release (link) until no errors are found (except for exceptions).
    • This step takes about 1 week to complete (between fixes in LB, LEXICON.release, and rerunning the program)
    • shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
      Go through the log.2 file and check the following items
      1. Check EUI
        => Make sure the total number of EUIs is correct
        shell> fgrep "entry=" LEXICON.release | wc -l
        => Make sure no EUI is E0000000
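Both EUI checks can be combined in one function. This is a sketch under the assumption that each record carries exactly one line containing `entry=` followed by its EUI; the name `check_eui` is hypothetical.

```shell
# Sketch only: check_eui is a hypothetical helper; it assumes one
# "entry=<EUI>" line per record.
# Usage: check_eui LEXICON.release
check_eui() {
  total=$(grep -cF "entry=" "$1" || true)
  placeholder=$(grep -cF "entry=E0000000" "$1" || true)
  echo "total EUIs: $total"
  echo "E0000000 EUIs: $placeholder"
  # Succeed only when no record uses the placeholder EUI
  [ "$placeholder" -eq 0 ]
}
```

The total still has to be compared by hand against the expected number of records; the function only reports it.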

      2. Check Syntax
        LexCheck.ValidateSyntaxFromTextFile
        => Update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        shell> cd ${DEV_DIR}/LC/Proc/bin
        shell> GetPrePosition
        Use LexAccessLb to get the preposition from the latest Lexicon
        shell> ln -sf ./preposition.data.${YEAR} preposition.data
        shell> cp -rp ${LEX_CHECK}/data/Files/dupRecExceptions.data.${PRE_YEAR} dupRecExceptions.data.${YEAR}
        shell> cp -rp ${LEX_CHECK}/data/Files/irregExceptions.data.${PRE_YEAR} irregExceptions.data.${YEAR}

        => If errors found, fix LEXICON.release and rerun the script
        => Make sure "No error found"
        => The final fixed copy is saved as LEXICON.release.2.2.syntaxFix

      3. Check Contents
        LexCheck.ValidateContentFromTextFile
        => Update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        => Update the ${LEX_CHECK}/data/Files/irregExceptions.data.${YEAR}
        => If errors found, fix LEXICON.release and rerun the script
        => Make sure "Total error: 0"
        => The final fixed copy is saved as LEXICON.release.2.fixContent

        => Use ./LEXICON.release.2.fixContent for the next steps (if it is different from the input)
        ln -sf ./LEXICON.release.log.2.3.contentFix LEXICON.release

      4. Check Cross-Ref
        LexCheck.LexCrossCheck
        => Update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        => Update the ${LEX_CHECK}/data/Files/dupRecExceptions.data.${YEAR}
        => If errors are found, fix LEXICON.release, then link, and rerun the script
        => Fix errors in the same order as the reports
        => Errors are shown as Content Err in the log.2 file.
        => Go to the end of the log.2 file to see the final stats.
        => This step is very time-consuming. It takes about 1-2 weeks if everything goes smoothly!
        1. dup EUI: must be fixed (manually)
        2. dup LexRecord: partially fixed manually; update dupRecExceptions.data
          • Send ${OUT_FILE}.fixCrossCheck.dupRec to linguists to tag "N|C":
            • N: not duplicate and no change
              => add all "N" to dupRecExceptions.data.${YEAR}
              => Manually remove [N] tag
            • C: change (delete or merge duplicate records)
              => Records tagged "C" need to be corrected in LB and will be updated in the next release; no correction is needed for this release.
            • Re-run the program with the updated dupRecExceptions.data.${YEAR} until the following two numbers are the same:
              • the number of ${OUT_FILE}.dupRec (found in the stats at the end of the Step 3 section in the log.2 file)
              • the number of ${OUT_FILE}.dupRec.cTag (which contains only C tags)

              At that point, all the N tags (exceptions, not duplicates) are eliminated by dupRecExceptions.data.${YEAR}, and the C tags will be updated in the next release.

              Year | DupRec | N    | C   | Notes
              2014 | 137    | 69   | 68  | Only multiword (137/1184) are tagged due to limited resource and due date. The rest (abbreviations or acronyms) are updated in the next release.
              2015 | 1183   | 1042 | 141 | Changes are updated in LB and fixed for next release
              2016 | 67     | 62   | 5   | Changes are updated in LB and fixed for next release
              2017 | 69     | 63   | 6   | Changes are updated in LB and fixed for next release
              2018 | 55     | 48   | 7   | Changes are updated in LB and fixed for next release
              2019 | 11     | 6    | 5   | Changes are updated in LB and fixed for next release
        3. no EUI:
          This is auto-fixed together with items 4 (wrong citation (spVar)) and 5 (wrong citation (spVar), duplicates).
          • Auto-fix for the current release: remove the EUI (for EUIs that do not exist)
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          • Manual fix for future releases by linguists:
            shell> fgrep " no EUI (" log.2 > 2.4.03.noEui
            Send 2.4.03.noEui to linguists for the following actions:
          • Explanation:
            • These are cross-ref terms used for abbreviation, acronym, and nominalization for which the program cannot find the associated citation and EUI in the cross-ref check.
            • Cross-ref terms must be citations (not spVars).
            • Ideally, each citation (legit LMW) should have an associated lexRecord.
            • Cross-ref terms can be invalid LMWs (this does not happen often) when they are expansions of abbreviations or acronyms
            • Cross-ref terms must be valid LMWs for nominalization.
            • Ignore the suggestion from the computer report at the end of each line ("=> remove EUI")
          • Actions:
            Check the cross-ref terms
            Check whether the associated EUI or citation exists in Lexicon:
            • If it is misspelled: correct it
              Make sure you correct it to the citation form, not the spVar.
            • If it is correctly spelled:
              => If no associated record/EUI is found, add a new record for this citation
              => If it has an associated record but is not in the record, add it to the record as a spVar. Also, correct the cross-ref term to the citation.
              => If it is in an associated record as a spVar, correct the cross-ref term to the citation.
            • If it is correctly spelled but is not a legit citation (LMW):
              => Please let Chris know
              => Add to notBaseForm.data.${YEAR} (this happens, but not often).
          • Synchronization:
            These issues are temporarily auto-fixed by removing the EUI for the current release. However, the data are permanently fixed in LexBuild, so the same issues are not expected in future releases.
          • Log:
            Yearno EUI No. notBaseForm No.
            2017224
            201842
            2019630
        4. wrong citation (spVar):
          • Auto-fix for the current release: replace the spVar with the correct citation
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          • Manual fix for future releases by linguists:
            shell> fgrep " wrong citation (spVar) (" log.2 | fgrep -v " wrong citation (spVar), duplicates (" > 2.4.04.wrongCitSpVar
            Send 2.4.04.wrongCitSpVar to linguists for the following actions:
            • These are citations in abb, acr, or nom that are spVars (not citations); they are auto-fixed by the program
            • Replace the spVar with the correct citation
          • Synchronization:
            These issues are auto-fixed by replacing the spVar with the correct citation for the current release. However, the data are permanently fixed in LexBuild, so the same issue is not expected in future releases.
          • Init Log:
            Year | wrong citation (spVar) No.
            2017 | 71
            2018 | 0
            2019 | 59
        5. wrong citation (spVar), duplicates:
          • Auto-fix for the current release: remove the spVar attribute
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          • Manual fix for future releases by linguists:
            shell> fgrep " wrong citation (spVar), duplicates (" log.2 > 2.4.05.wrongCitSpVarDup
            Send 2.4.05.wrongCitSpVarDup to linguists for the following actions:
            • These are citations in nom that are spVars (not citations); after being replaced by the correct citation, they become duplicates and are thus removed (auto-fixed) by the program
            • Remove the nom with the spVar
          • Synchronization:
            These issues are auto-fixed by removing the nom attribute with the spVar for the current release. However, the data are permanently fixed in LexBuild, so the same issue is not expected in future releases.
          • Init Log:
            Year | wrong citation (spVar), duplicates No.
            2017 | 12
            2018 | 0
            2019 | 2

          Items 3, 4, and 5 are auto-fixed at the same time when the validation program runs. So, use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          shell> cp -p ./LEXICON.release.3.fixCrossCheck LEXICON.release.3.fixCrossCheck.2.5.cit
          shell> ln -sf ./LEXICON.release.log.${No}.fixCrossRed LEXICON.release

          Rerun 2.ValidateLexicon ${YEAR} > log.2
          Check everything carefully, because the auto-fixes in different steps might cause new issues, such as an added EUI that causes duplicates. Rerun this until no error is found!
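The three lists for items 3-5 can be pulled out of log.2 in one pass. A minimal sketch follows: the search strings are the ones quoted in the items above, while the function name `extract_crosscheck_lists` and the uniform `2.4.0N.*` output names are assumptions.

```shell
# Sketch only: extract_crosscheck_lists is a hypothetical helper; the
# output file names are assumed to follow the 2.4.0N pattern.
# Usage: extract_crosscheck_lists log.2
extract_crosscheck_lists() {
  log=$1
  # grep exits with status 1 when nothing matches, which is fine here
  grep -F " no EUI (" "$log" > 2.4.03.noEui || true
  grep -F " wrong citation (spVar) (" "$log" \
    | grep -vF " wrong citation (spVar), duplicates (" > 2.4.04.wrongCitSpVar || true
  grep -F " wrong citation (spVar), duplicates (" "$log" > 2.4.05.wrongCitSpVarDup || true
  # Print the size of each list so it can be recorded in the yearly logs
  wc -l 2.4.03.noEui 2.4.04.wrongCitSpVar 2.4.05.wrongCitSpVarDup
}
```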

        6. missing EUI: auto-fix
          shell> fgrep "missing EUI (" log.2 > 2.6.missingEui
          Send to linguists to fix (add the EUI as suggested)

          => use LEXICON.release.3.fixCrossCheck and rerun
          shell> cp -r LEXICON.release.3.fixCrossCheck LEXICON.release.log.${no}.missEuiFix
          shell> ln -sf ./LEXICON.release.log.${no}.missEuiFix LEXICON.release
          Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${no}.missEuiFix (link to LEXICON.release) and rerun this step

        7. wrong EUI: must be fixed manually
          • wrong EUI: shell> fgrep "wrong EUI" log.2 > 2.4.7.wrongEui.nom
          • Send the list to linguists to:
            • Confirm the correct EUI
            • Fix lexRecords in LexBuild

            shell> cp -p LEXICON.release.3.fixCrossCheck LEXICON.release.log.${No}.wrongEuiFix
            shell> ln -sf ./LEXICON.release.log.${No}.wrongEuiFix LEXICON.release
          • Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.wrongEuiFix (link to LEXICON.release) and rerun this step
        8. missing EUIs: must be fixed manually
        9. wrong EUIs: must be fixed manually
        10. symmetric citation: must be fixed manually
        11. symmetric category: must be fixed manually
        12. symmetric none: must be fixed manually
          • This feature checks the symmetric issue in nominalization
          • All nominalizations should be symmetric, that is, nominalization and nominalization_of.
          • nom: shell> fgrep " symmetric none @ [" log.2 > 2.12.symNone
          • Send the list to linguists:
            • Fix lexRecords in LexBuild:
              => if the nominalization is correct, add nominalizations
              => if the nominalization is not correct, delete nominalizations
              => if the fix involves more than adding or deleting nominalizations (a complicated fix that changes or adds other LexRecords), notify Chris and tell him the details of the fix.
          • Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.symNoneFix
          • Manually fix LEXICON.release.log.${No}.symNoneFix by synchronizing those fixed records in LB
          • Link LEXICON.release.log.${No}.symNoneFix to LEXICON.release
          • re-run the program until:
            • The number of log.2 for "12. symmetric none:" is 0
            • the input (LEXICON.release) and fixed output (LEXICON.release.3.fixCrossCheck) are the same
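The stopping condition above (no remaining "symmetric none" lines, and input identical to the auto-fixed output) could be checked with a sketch like this one. The function name `crosscheck_converged` and its messages are hypothetical.

```shell
# Sketch only: crosscheck_converged is a hypothetical helper name.
# Usage: crosscheck_converged log.2 LEXICON.release LEXICON.release.3.fixCrossCheck
crosscheck_converged() {
  log=$1
  input=$2
  fixed=$3
  remaining=$(grep -cF " symmetric none @ [" "$log" || true)
  if [ "$remaining" -ne 0 ]; then
    echo "$remaining symmetric-none issue(s) left in $log"
    return 1
  fi
  # cmp -s is silent and follows the LEXICON.release link to the real file
  if ! cmp -s "$input" "$fixed"; then
    echo "$input and $fixed still differ"
    return 1
  fi
  echo "converged"
}
```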

        13. new EUI:
          • shell> fgrep " new EUI (" log.2 > 2.4.13.fixCrossRef-newEui
            => the line count should be the same as error count in log.2
          • nom: shell> fgrep "nominalizations - new EUI (" log.2 > 2.13.newEui.nom
            • This file includes all issues with nominalization: new EUI and non-symmetrical (2.13)
            • Send to linguists to fix in LB and then fix manually by comparing to LB (similar step as in 2.13).

          • acr: shell> fgrep "acronyms - new EUI (" log.2 > 2.13.newEui.acr
          • abb: shell> fgrep "abbreviations - new EUI (" log.2 > 2.13.newEui.abb
            • These two files are used as the LMW candidate list for adding multiwords to Lexicon
            • The expansions of acr/abb are good candidates for LexMultiwords
            • Not-base-form terms from previous releases are stored in ${LEX_CHECK}/data/Files/notBaseForm.data.
            • This file is used to exclude false-positive error messages.
            • This file is updated between releases as described below.
            • The updates must be completed in the LexCheck pre-process before running the next release.
            • Ideally, all terms in these two files are either:
              • valid LW (will be added to Lexicon by the next release)
              • invalid LW (will be added to notBaseForm)

              So, all errors should disappear once these post-procedures are done.

              Post-Procedures:

              • Send the list to linguists to tag Y|I|N:
              • [Y]: a valid citation or base form
                => A new lexRecord should be added
              • [I]: a valid inflectional form
                => A new lexRecord should be added
                => The associated lexRecord might need to change from inflectional form to citation form
              • [N]: Neither of the above two tags; not a valid Lexicon word form for a citation, spelling variant, or inflectional form (such as plural form, past tense, etc.)
                => This list is used to exclude exceptions for future releases (we assume an invalid base form won't become a valid base form over time).
              • During this process, LexBuilders might need to delete invalid expansions, modify records, or add new records. However, the program does not need this detailed information.

                (This is the post-process that needs to be done for the current release, before the next release)

              • Inputs: save tagged files to:
                • ${LEXICON}/data/${YEAR}/data/Tags/2.Validation/2.4.13.newEuis.abb.tagged.txt
                • ${LEXICON}/data/${YEAR}/data/Tags/2.Validation/2.4.13.newEuis.acr.tagged.txt
              • Get expansion|POS with "N" tag
                • shell> cd ${DEV}/LC/Proc/bin
                • shell> AnalyzeNewEui ${YEAR}
                  1
                  2
              • Outputs:
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.n (N tag)
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.o (Other) => should not exist; resend to linguists for tagging
                  Should be 0; these might be a wrong case or a missed tag
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.i (I tag)
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.y (Y tag)
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.notBase (expansion|POS) - include I and N tags
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.notLmw (expansion|POS) - include N tag
              • Update ${LEX_CHECK}/data/Files/notBaseForm.data:
                • Append ${LEXICON}/data/${YEAR}/data/Tags/2.Validation/2.13.newEuis.*.notBase to ${LEX_CHECK}/data/Files/notBaseForm.data
                • Update ${LEX_CHECK}/data/Files/notLmw.${YEAR}.data by appending ${LEXICON}/data/${YEAR}/data/Tags/2.Validation/2.13.newEuis.*.notLmw (for LMW project and Lexicon release), then re-run this program.
                  • The [N] and [I] tags are excluded because of the updates to notBaseForm.data.
                    => Updated in the current release since 2018+
                  • The [Y] tags are excluded by adding the new records to the Lexicon for the next release.
                    => Updated in the next release
                    Copy ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.y to 2.13.newEuis.*.y.${YEAR}
                    Send the ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.y.${YEAR} to linguists as LMW candidates

                • Go through purified program (TBD):
                  • Filter out valid terms by the latest LEXICON (inflVars.data)
                    There is about a 1-month gap between the frozen Lexicon and this step (so not that many).
                    => Auto-tag: [C]:citation, [B]:base, [I]:inflection
                  • Filter out by invalid LMW
                    => Auto-tag: [N]
                  • Send the rest to Linguist to tag
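The notBaseForm.data and notLmw updates described above are plain appends. A minimal sketch follows; the helper name `append_new_lines` is hypothetical, and skipping lines that are already present is an added safeguard for re-runs, not something the procedure requires.

```shell
# Sketch only: append_new_lines is a hypothetical helper; de-duplication
# is an assumption added so the append can be repeated safely.
# Usage: append_new_lines 2.13.newEuis.abb.notBase notBaseForm.data
append_new_lines() {
  src=$1
  dest=$2
  touch "$dest" || return 1
  new=$(mktemp) || return 1
  # -x whole-line match, -F fixed strings, -f patterns from file, -v invert;
  # sort -u also drops duplicates inside the source file itself
  sort -u "$src" | grep -vxFf "$dest" > "$new" || true
  cat "$new" >> "$dest"
  rm -f "$new"
}
```

Because only missing lines are appended, running the function twice with the same source leaves the destination unchanged the second time.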
        • Make sure "Total error: 0" or that it equals the sum of (2. dupLexRecord + 13.newEUI)
          • Must fix the issues in items 3-12 to minimize the number of issues in 13.newEUI.
          • 13.newEUI: No. of errors = 2.13.newEui.abb.tagged.txt.y + 2.13.newEui.acr.tagged.txt.y
          • Make sure "new 2" is 0
            shell> fgrep " New 2" log.2 > 2.13.newEui2
            These are issues caused by the issues in items 3-12. This number should be 0 once the issues in items 3-12 are fixed.
        • Send 2.4.2. dup LexRecord to linguists to be deleted/fixed for future releases
        • Send 2.4.13. new EUI to linguists to add for future releases
          • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.abb.y.${YEAR}
          • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.acr.y.${YEAR}

          Ideally, LEXICON.release should be identical to LEXICON.release.3.fixCrossCheck

      5. Check non-ASCII characters
        • Check whether newly appearing non-ASCII chars are legal
          • Compare to the previous year on all nonAscii.char
            • The program compares the files LEXICON.release.NonAscii.line and LEXICON.release.NonAscii.char to the previous release and sends the differences to LEXICON.release.NonAscii.Char.1.3.diff.
            • Go through all new non-ASCII characters listed in LEXICON.release.NonAscii.Char.1.3.diff, find them in LEXICON.release, and manually check and modify them if needed
          • Find illegal non-ASCII chars
            • Some non-ASCII Unicode characters look the same as ASCII characters. However, they are different when read by a machine and cause issues downstream.
            • Go through the LEXICON.release.NonAscii.char file to see whether any illegal non-ASCII characters listed in the following table exist (use the U+ value). If so, fix them.
            • For U+03BC and U+00B5: compare the counts
            • Send to linguists to fix in LB if illegal non-ASCII chars are found

            Name | Letter 1 | Letter 2 (Illegal non-ASCII) | Notes
            apostrophe | [']-(APOSTROPHE, U+0027) | [‘]-(LEFT SINGLE QUOTATION MARK, U+2018); [’]-(RIGHT SINGLE QUOTATION MARK, U+2019) | Replace illegal non-ASCII
            hyphen | [-]-(HYPHEN-MINUS, U+002D) | [‑]-(NON-BREAKING HYPHEN, U+2011); [–]-(EN DASH, U+2013) | Replace illegal non-ASCII
            beta | [β]-(GREEK SMALL LETTER BETA, U+03B2) | [ß]-(LATIN SMALL LETTER SHARP S, U+00DF) | Replace illegal non-ASCII
            mu/micro | [μ]-(GREEK SMALL LETTER MU, U+03BC) | [µ]-(MICRO SIGN, U+00B5) | Both could be legal. Check the records to make sure the right chars are used.
            Y/UPSILON | [Y]-(LATIN CAPITAL LETTER Y, U+0059) | [Υ]-(GREEK CAPITAL LETTER UPSILON, U+03A5) | Both could be legal. Check the records to make sure the right chars are used.
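The look-alike characters marked "Replace illegal non-ASCII" in the table can be located directly in the release file. This is a minimal sketch, not the validation program itself; the function names are hypothetical, and it assumes both the script and the Lexicon file are UTF-8 and that the installed grep supports `-o` (GNU and BSD grep do).

```shell
# Sketch only: find_illegal_nonascii and count_mu_micro are hypothetical helpers.
# The characters are written literally, so this script must be saved as UTF-8.
# Usage: find_illegal_nonascii LEXICON.release
find_illegal_nonascii() {
  # U+2018, U+2019, U+2011, U+2013, U+00DF from the table above;
  # prints matching lines with line numbers, exit status 1 means none found
  grep -nF -e "‘" -e "’" -e "‑" -e "–" -e "ß" "$1"
}

# Usage: count_mu_micro LEXICON.release
count_mu_micro() {
  # grep -o prints one line per occurrence, so wc -l counts occurrences
  mu=$(grep -oF "μ" "$1" | wc -l)
  micro=$(grep -oF "µ" "$1" | wc -l)
  echo "U+03BC (Greek mu): $((mu))"
  echo "U+00B5 (micro sign): $((micro))"
}
```

Since ß can also be a legitimate letter in some words, each hit should still be reviewed against the record rather than replaced blindly.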

      => The final fixed copy is saved as LEXICON.release.log.2.5.nonAsciiFix
    • Re-run the program with the new LEXICON.release (link) until everything is OK:
      shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2

  • Step 2-2: Check TradeMark:
    TradeMark
    • shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
    • Check the word count (wc) of output files (tradeMark.data)
      Should be 0 because there is no annotation.

  • Step 2-3: Check Irreg Base:
    • Skip - already checked in Check Content after 2014+ release.
    • Old version of Check Irreg

  • Step 2-4: Check cross-Ref:
    • Skip - already checked in the Step of Cross-Ref after 2014+ release.
    • Cross-Ref: An enhanced cross-reference check program was implemented after 2014, and the check was thus removed from LexBuild (web tool). So, this step has to be checked. If issues are found, fix them in LexBuild and go back to Step 0.
    • Old version of Check cross-Ref

    Clean up files and logs: move all logs and files to ./${year}