Validate and Fix LEXICON

  • Step 0-1: The input file: LEXICON
      shell> cp -p LEXICON ${LEXICON}/data/${YEAR}/data/LEXICON.mmddyy
    • Make a symbolic link on the development machine (lexlx1)
      shell> cd ${LEXICON}/data/${YEAR}/data
      shell> ln -sf ./LEXICON.mmddyy LEXICON.freeze

  • Step 0-2: Trim extra spaces
    • If extra spaces are found, trim them in LexBuild and go back to the previous step
      shell> fgrep "  " LEXICON.freeze | wc -l
      => Should be 0; LexBuild takes care of all extra spaces automatically
      If not, the data in LexBuild needs to be fixed as well
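The check above can be wrapped in a small guard so the count is tested rather than read by eye. This is a minimal sketch, not part of the release scripts: the helper name count_double_space_lines and the throwaway demo file are assumptions made for illustration.

```shell
# count_double_space_lines is a hypothetical helper name.
# It prints the number of lines in file $1 that contain two consecutive spaces.
count_double_space_lines() {
    # grep -c prints the count; it exits 1 when the count is 0, hence "|| true"
    grep -c -F "  " "$1" || true
}

# Demo on a throwaway file standing in for LEXICON.freeze
demo=$(mktemp)
printf '%s\n' 'line one' 'line  two' 'line three' > "$demo"
extra=$(count_double_space_lines "$demo")
if [ "$extra" -ne 0 ]; then
    echo "extra spaces on $extra line(s): fix in LexBuild, then redo Step 0-1"
fi
rm -f "$demo"
```

In real use the argument would be LEXICON.freeze, and a non-zero count means going back to LexBuild.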

  • Step 1: Remove annotations & signatures
    • shell> ${LEXICON}/bin/1.FinalizeLexicon <year>
      Make sure the Java version in the script is correct!
      • Input: LEXICON.freeze
      • Operations:
        • Remove annotations & signatures from the freeze version to generate LEXICON.freeze.removeAnnotation
        • Check and fix noncompliant non-ASCII characters between HTML and Unicode (U+0080 ~ U+009F), and send the output to LEXICON.release
      • Output:
        • LEXICON.freeze.removeAnnotation
        • LEXICON.freeze.removeAnnotation.nonAscii
        • LEXICON.freeze.removeAnnotation.nonAscii.Stat
        • LEXICON.release.1.NoAnnotationNoIllegalNonAscii
        • LEXICON.release.nonAscii
        • LEXICON.release.nonAscii.Stat

        • LEXICON.release (this is the file name used in the process after this step)
          => This file is the same as ./LEXICON.release.1.NoAnnotationNoIllegalNonAscii
        • Link LEXICON.release to LEXICON.release.log.1.noAnno for the next step
        • cp -p ./LEXICON.release.1.NoAnnotationNoIllegalNonAscii LEXICON.release.log.1.noAnno
        • ln -sf ./LEXICON.release.log.1.noAnno LEXICON.release

        No need to go through the details of the output at this point.
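The copy-then-relink pattern at the end of Step 1 recurs after every fix in Step 2, so it may help to see it as one function. A minimal sketch only: snapshot_and_link is a hypothetical name, and the demo runs in a scratch directory with a dummy file.

```shell
# snapshot_and_link is a hypothetical helper: copy a fixed file to a log name,
# then repoint the LEXICON.release link (in the current directory) at that copy.
snapshot_and_link() {
    src=$1      # e.g. ./LEXICON.release.1.NoAnnotationNoIllegalNonAscii
    snap=$2     # e.g. ./LEXICON.release.log.1.noAnno
    cp -p "$src" "$snap" && ln -sf "$snap" LEXICON.release
}

# Demo in a scratch directory with a dummy input file
work=$(mktemp -d)
cd "$work" || exit 1
echo "sample" > LEXICON.release.1.NoAnnotationNoIllegalNonAscii
snapshot_and_link ./LEXICON.release.1.NoAnnotationNoIllegalNonAscii ./LEXICON.release.log.1.noAnno
link_target=$(readlink LEXICON.release)
echo "LEXICON.release -> $link_target"
```

Because the link target is relative, the function has to be run from the data directory that holds the files.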
  • Step 2-1: Validate EUI, syntax, content, cross-reference, and illegal non-ASCII characters:
    • Check errors for syntax, content, cross-ref, etc.
    • Send the list to LexBuilders to fix lexRecords through LexBuild
    • Fix step by step and rerun the program using the new LEXICON.release (link) until no errors are found (except for known exceptions).
    • This step takes about 1 week to complete (between fixes in LB, LEXICON.release, and rerunning the program)
    • shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
      Go through the log.2 file and verify each of the following checks
      1. Check EUI
        => Make sure the total number of EUIs is correct
        => Make sure no EUI is E0000000

      2. Check Syntax
        LexCheck.ValidateSyntaxFromTextFile
        => Update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        shell> cd ${DEV_DIR}/LC/Proc/bin
        shell> GetPrePosition
        Use LexAccessLb to get the prepositions from the latest Lexicon
        shell> ln -sf ./preposition.data.${YEAR} preposition.data
        shell> cp -rp ${LEX_CHECK}/data/Files/dupRecExceptions.data.${PRE_YEAR} dupRecExceptions.data.${YEAR}
        shell> cp -rp ${LEX_CHECK}/data/Files/irregExceptions.data.${PRE_YEAR} irregExceptions.data.${YEAR}

        => If errors are found, fix LEXICON.release and rerun the script
        => Make sure "No error found"
        => The final fixed copy is saved as LEXICON.release.1.NoAnnotationNoIllegalNonAscii

      3. Check Contents
        LexCheck.ValidateContentFromTextFile
        => Update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        => Update the ${LEX_CHECK}/data/Files/irregExceptions.data.${YEAR}
        => If errors are found, fix LEXICON.release and rerun the script
        => Make sure "Total error: 0"
        => The final fixed copy is saved as LEXICON.release.2.fixContent

        => Use ./LEXICON.release.2.fixContent for the next steps (if it differs from the input)
        shell> ln -sf ./LEXICON.release.log.${No}.contentFix LEXICON.release

      4. Check Cross-Ref
        LexCheck.LexCrossCheck
        => Update the ${LEX_CHECK}/data/Files/preposition.data.${YEAR}
        => Update the ${LEX_CHECK}/data/Files/dupRecExceptions.data.${YEAR}
        => If errors are found, fix LEXICON.release, then link, and rerun the script
        => Fix errors in the same order as the reports
        => Errors are shown as Content Err in the log.2 file.
        => Go to the end of the log.2 file to see the final stats.
        => This step is very time-consuming. It takes about 1-2 weeks if everything goes smoothly!
        1. dup EUI: must be fixed (manually)
        2. dup LexRecord: partially fixed manually, with updates to dupRecExceptions.data
          • Send ${OUT_FILE}.fixCrossCheck.dupRec to linguists to tag "N|C":
            • N: not a duplicate, no change
              => Add all "N" records to dupRecExceptions.data.${YEAR}
              => Manually remove the [N] tag
            • C: change (delete or merge duplicate records)
              => Records tagged "C" need to be corrected in LB and will be updated in the next release; there is no need to correct them for this release.
            • Re-run the program with the updated dupRecExceptions.data.${YEAR} until the following 2 numbers are the same:
              • the number of ${OUT_FILE}.dupRec (can be found in the stats at the end of the Step 3 section in the log.2 file)
              • the number of ${OUT_FILE}.dupRec.cTag (which contains only C tags)

              At that point, all N tags have been eliminated by dupRecExceptions.data.${YEAR}, and the C tags will be updated in the next release.

              Year | DupRec | N    | C   | Notes
              2014 | 137    | 69   | 68  | Only multiword (137/1184) are tagged due to limited resource and due date. The rest (abbreviations or acronyms) are updated in the next release.
              2015 | 1183   | 1042 | 141 | Changes are updated in LB and fixed for next release
              2016 | 67     | 62   | 5   | Changes are updated in LB and fixed for next release
              2017 | 69     | 63   | 6   | Changes are updated in LB and fixed for next release
              2018 | 55     | 48   | 7   | Changes are updated in LB and fixed for next release
        3. no EUI:
          • Auto-fix for the current release: by removing the EUI
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          • Manual fix for future releases by linguists:
            shell> fgrep " no EUI (" log.2 > 2.4.03.noEui
            Send 2.4.03.noEui to linguists for the following actions:
          • Explanation:
            • These are citations in abbreviations, acronyms, and nominalizations for which the program cannot find the associated EUI in the cross-ref check. Ideally, a citation (legit LMW) should have an associated lexRecord. There might be exceptions for abbreviations or acronyms (but not for nominalizations).
            • Ignore the note from the computer report at the end of each line ("=> remove EUI")
          • Actions:
            Check whether the associated EUI or citation exists in the Lexicon:
            • If the citation is misspelled: correct the citation
              Make sure you correct it to the citation form, not the spVar.
            • If the citation is spelled correctly and it is a legit citation (LMW)
              => If the associated EUI is correct, add the citation as a spVar to the record
              => If no associated EUI is found, add a new record for this citation
            • If the citation is spelled correctly and it is not a legit citation (LMW)
              => Please let Chris know
            • Add to notBaseForm.data.${YEAR} (this happens when the lexRecord was deleted because it is not an LMW).
          • Synchronization:
            These issues are temporarily auto-fixed by removing the EUI for the current release. However, the data are permanently fixed in LexBuild, so the same issues are not expected in future releases.
          • Log:
            Yearno EUI No. notBaseForm No.
            2017224
            201842
        4. wrong citation (spVar):
          • Auto-fix for the current release: by replacing with the correct citation
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          • Manual fix for future releases by linguists:
            shell> fgrep " wrong citation (spVar) (" log.2 > 2.4.wrongCitSpVar
            Send 2.4.wrongCitSpVar to linguists for the following actions:
            • These are cases where the citations in abb, acr, or nom are spVars (not citations); they are auto-fixed by the program
            • Replace the spVar with the correct citation
          • Synchronization:
            These issues are auto-fixed by replacing the spVar with the correct citation for the current release. However, the data are permanently fixed in LexBuild, so the same issue is not expected in future releases.
          • Init Log:
            Year | wrong citation (spVar) No.
            2017 | 71
            2018 | 0
        5. wrong citation (spVar), duplicates:
          • Auto-fix for the current release: by removing the spVar attribute
            => use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          • Manual fix for future releases by linguists:
            shell> fgrep " wrong citation (spVar), duplicates (" log.2 > 2.5.wrongCitSpVarDup
            Send 2.5.wrongCitSpVarDup to linguists for the following actions:
            • These are cases where the citations in nom are spVars (not citations); after being replaced by the correct citation, they become duplicates and are thus removed (auto-fixed) by the program
            • Remove the nom with the spVar
          • Synchronization:
            These issues are auto-fixed by removing the nom attribute with the spVar for the current release. However, the data are permanently fixed in LexBuild, so the same issue is not expected in future releases.
          • Init Log:
            Year | wrong citation (spVar), duplicates No.
            2017 | 12
            2018 | 0

          Steps 3, 4, and 5 are auto-fixed at the same time when the validation program runs. So, use LEXICON.release.3.fixCrossCheck as LEXICON.release (link) and rerun
          shell> cp -p ./LEXICON.release.3.fixCrossCheck LEXICON.release.3.fixCrossCheck.2.5.cit
          shell> ln -sf ./LEXICON.release.log.${No}.crossCheckFix LEXICON.release

          Rerun: 2.ValidateLexicon ${YEAR} > log.2
          Check everything carefully, because the auto-fixes in different steps might cause new issues, such as an added EUI that causes duplicates. Rerun until no errors are found!

        6. missing EUI: auto-fix
          shell> fgrep "missing EUI (" log.2 > 2.6missingEui
          Send to linguists to fix (add the EUI as suggested)

          => use LEXICON.release.3.fixCrossCheck and rerun
          shell> cp -r LEXICON.release.3.fixCrossCheck LEXICON.release.log.${No}.missEuiFix
          shell> ln -sf ./LEXICON.release.log.${No}.missEuiFix LEXICON.release
          Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.missEuiFix (link to LEXICON.release) and rerun this step

        7. wrong EUI: must be fixed manually
          • wrong EUI: shell> fgrep "wrong EUI" log.2 > 2.4.7.wrongEui.nom
          • Send the list to linguists to:
            • Confirm the correct EUI
            • Fix lexRecords in LexBuild

            shell> cp -p LEXICON.release.3.fixCrossCheck LEXICON.release.log.${No}.wrongEuiFix
            shell> ln -sf ./LEXICON.release.log.${No}.wrongEuiFix LEXICON.release
          • Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.wrongEuiFix (link to LEXICON.release) and rerun this step
        8. missing EUIs: must be fixed manually
        9. wrong EUIs: must be fixed manually
        10. symmetric citation: must be fixed manually
        11. symmetric category: must be fixed manually
        12. symmetric none: must be fixed manually
          • This feature checks the symmetry issue in nominalizations
          • All nominalizations should be symmetric.
          • nom: shell> fgrep " symmetric none @ [" log.2 > 2.12.symNone
          • Send the list to linguists:
            • Fix lexRecords in LexBuild:
              => if the nominalization is correct, add nominalizations
              => if the nominalization is not correct, delete nominalizations
              => if the fix involves more than adding or deleting nominalizations (a complicated fix with changes or additions in other LexRecords), notify Chris and give him the details of the fix.
          • Save LEXICON.release.3.fixCrossCheck as LEXICON.release.log.${No}.symNoneFix
          • Manually fix LEXICON.release.log.${No}.symNoneFix by synchronizing the records fixed in LB
          • Link LEXICON.release.log.${No}.symNoneFix to LEXICON.release
          • Re-run the program until:
            • The count for "12. symmetric none:" in log.2 is 0
            • The input (LEXICON.release) and the fixed output (LEXICON.release.3.fixCrossCheck) are the same
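The second stop condition (input and fixed output are the same) can be tested with a byte-for-byte comparison. A minimal sketch: is_converged is a hypothetical helper name, and the scratch files only stand in for LEXICON.release and LEXICON.release.3.fixCrossCheck.

```shell
# is_converged is a hypothetical helper name. It succeeds (exit 0) when the two
# files are byte-identical, i.e. the validator had nothing left to auto-fix.
is_converged() {
    cmp -s "$1" "$2"
}

# Demo with scratch files standing in for the input and the fixed output
work=$(mktemp -d)
printf 'record A\nrecord B\n' > "$work/input"
printf 'record A\nrecord B\n' > "$work/fixed_same"
printf 'record A\nrecord B2\n' > "$work/fixed_changed"

if is_converged "$work/input" "$work/fixed_same"; then same=yes; else same=no; fi
if is_converged "$work/input" "$work/fixed_changed"; then changed=yes; else changed=no; fi
echo "identical pair converged: $same; differing pair converged: $changed"
```

cmp follows symbolic links, so passing the LEXICON.release link compares the file it points to.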

        13. new EUI:
          • shell> fgrep " new EUI (" log.2 > 2.13.newEui
            => the line count should be the same as the error count in log.2
          • nom: shell> fgrep "nominalizations - new EUI (" log.2 > 2.13.newEui.nom
            • This file includes all issues with nominalization: new EUI and non-symmetrical (2.13)
            • Send to linguists to fix in LB and then fix manually by comparing to LB (similar step as in 2.13).

          • acr: shell> fgrep "acronyms - new EUI (" log.2 > 2.13.newEui.acr
          • abb: shell> fgrep "abbreviations - new EUI (" log.2 > 2.13.newEui.abb
            • These two files are used as the LMW candidate list for adding multiwords to the Lexicon
            • The expansions of acr/abb are good candidates for LexMultiwords
            • The not-base-form terms from previous releases are stored in ${LEX_CHECK}/data/Files/notBaseForm.data.
            • This file is used to exclude false-positive (FP) error messages.
            • This file is updated between releases as described below:
            • The updates must be completed in the LexCheck pre-process before running the next release.
            • Ideally, all terms in these two files are either:
              • valid LW (will be added to the Lexicon by the next release)
              • invalid LW (will be added to notBaseForm)

              So, all errors should disappear once these post-procedures are done.

              Post-Procedures:

              • Go through purified program (TBD):
                • Filter out valid terms using the latest LEXICON (inflVars.data)
                  There is about a 1-month gap between the frozen Lexicon and this step (not that many).
                  => Auto-tag: [C]:citation, [B]:base, [I]:inflection
                • Filter out invalid LMWs
                  => Auto-tag: [N]
                • Send the rest to linguists to tag

              • Send the list to linguists to tag Y|I|N:
              • [Y]: a valid citation or base form
                => A new lexRecord should be added
              • [I]: a valid inflectional form
                => A new lexRecord should be added
                => The associated lexRecord might need to change from the inflectional form to the citation form
              • [N]: none of the above; not a valid Lexicon word form for a citation, spelling variant, or inflectional form (such as a plural form, past tense, etc.)
                => This list is used to exclude exceptions in future releases (we assume an invalid base form won't become a valid base form over time).
              • During this process, LexBuilders might need to delete invalid expansions, modify records, or add new records. However, the program does not need this detailed information.

                (This is the post-process that needs to be done for the current release, before the next release)

              • Inputs: save tagged files to:
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.abb.tagged.txt
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.acr.tagged.txt
              • Get expansion|POS with "N" tag
                • shell> cd ${DEV}/LC/Proc/bin
                • shell> AnalyzeNewEui ${YEAR}
                  1
                  2
              • Outputs:
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.n (N tag)
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.o (Other)
                  Should be 0; these might be wrong case or missed tags
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.i (I tag)
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.y (Y tag)
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.notBase (expansion|POS) - includes I and N tags
                • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.notLmw (expansion|POS) - includes N tag
              • Update ${LEX_CHECK}/data/Files/notBaseForm.data:
                • Append ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.notBase to ${LEX_CHECK}/data/Files/notBaseForm.data
                • Update ${LEX_CHECK}/data/Files/notLmw.${YEAR}.data (for the LMW project and Lexicon release) to re-run this program.
                  • The [N] and [I] tags are excluded because of the updates to notBaseForm.data.
                    => Updated in the current release since 2018+
                  • The [Y] tags are excluded by adding the new records to the Lexicon for the next release.
                    => Updated in the next release
                    Send the ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.*.y to linguists as LMW candidates
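The append to notBaseForm.data can be sketched as below. This is an illustration under assumptions, not the actual update script: append_new_lines is a hypothetical helper, the sample expansion|POS entries are made up, and skipping lines that are already present is a choice added here (the procedure only says "append").

```shell
# append_new_lines is a hypothetical helper: append to master file $1 only those
# lines of file $2 that are not already in it. Skipping lines that are already
# present is an assumption of this sketch; the procedure only says "append".
append_new_lines() {
    master=$1
    new=$2
    tmp=$(mktemp)
    if [ -s "$master" ]; then
        # -x whole-line match, -F fixed strings, -f patterns from file, -v invert
        grep -v -x -F -f "$master" "$new" > "$tmp"
    else
        cp "$new" "$tmp"
    fi
    # sort -u also drops repeats inside the new file itself
    sort -u "$tmp" >> "$master"
    rm -f "$tmp"
}

# Demo with made-up expansion|POS entries
master=$(mktemp)
new=$(mktemp)
printf 'term one|noun\n' > "$master"
printf 'term one|noun\nterm two|verb\nterm two|verb\n' > "$new"
append_new_lines "$master" "$new"
cat "$master"
```

Existing lines in the master file keep their order; only the newly appended lines are sorted.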
        • Make sure "Total error: 0", or that the total equals the sum of (2. dupLexRecord + 13.newEUI)
          • Must fix the issues in 3-12 to minimize the number of issues in 13.newEUI.
          • Make sure "new 2" is 0
            shell> fgrep "- New 2" log.2 > 2.13.newEui2
            These are issues that result from the issues in steps 3-12. This number should be 0 once the issues in 3-12 are fixed.
        • 2. dup LexRecord: send to linguists to be fixed for future releases
        • 13. new EUI: send to linguists to add for future releases
          • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.abb.y
          • ${LEXICON}/data/${YEAR}/data/Tags/2.13.newEuis.acr.y
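To see at a glance which cross-ref categories still have entries, the fgrep patterns used in the steps above can be counted in one pass over log.2. A minimal sketch: the helper names count_in_log and summarize_log are hypothetical, and the three demo log lines are invented stand-ins, not real validator output.

```shell
# count_in_log is a hypothetical helper: number of lines in file $1 that
# contain the fixed string $2 (grep -F, like the fgrep calls above).
count_in_log() {
    grep -c -F -e "$2" "$1" || true
}

# Print one "label  count" row per category for log file $1. The patterns are
# the fgrep strings quoted in the steps above; the labels are display only.
summarize_log() {
    while IFS='|' read -r label pattern; do
        printf '%-38s %s\n' "$label" "$(count_in_log "$1" "$pattern")"
    done <<'EOF'
3. no EUI| no EUI (
4. wrong citation (spVar)| wrong citation (spVar) (
5. wrong citation (spVar), duplicates| wrong citation (spVar), duplicates (
6. missing EUI|missing EUI (
7. wrong EUI|wrong EUI
12. symmetric none| symmetric none @ [
13. new EUI| new EUI (
new 2|- New 2
EOF
}

# Demo on an invented three-line log (not real validator output)
log=$(mktemp)
cat > "$log" <<'EOF'
abbreviations - no EUI (sample one) => remove EUI
acronyms - new EUI (sample two)
nominalizations - new EUI (sample three)
EOF
summarize_log "$log"
new_eui=$(count_in_log "$log" " new EUI (")
```

The counts are line counts, so they match the per-category files produced by the fgrep redirects.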

      5. Check non-ASCII characters
        • Check whether newly appearing non-ASCII chars are legal
          • Compare to the previous year on all nonAscii.char
            • The program compares the files LEXICON.release.NonAscii.line and LEXICON.release.NonAscii.char to the previous release and sends the differences to LEXICON.release.NonAscii.Char.1.3.diff.
            • Go through all new non-ASCII characters listed in LEXICON.release.NonAscii.Char.1.3.diff, find them in LEXICON.release, and manually check and modify if needed
          • Find illegal non-ASCII chars
            • Some non-ASCII Unicode characters look the same as ASCII. However, they are different when read by a machine and cause issues downstream.
            • Go through the LEXICON.release.NonAscii.char file to see whether any illegal non-ASCII characters listed in the following table exist (use the U+ value). If so, fix them.
            • For U+03BC and U+00B5: compare the counts
            • Send to linguists to fix in LB if illegal non-ASCII chars are found

            Name | Letter 1 | Letter 2 (Illegal non-ASCII) | Notes
            apostrophe | [']-(APOSTROPHE, U+0027) | [‘]-(LEFT SINGLE QUOTATION MARK, U+2018); [’]-(RIGHT SINGLE QUOTATION MARK, U+2019) | Replace illegal non-ASCII
            hyphen | [-]-(HYPHEN-MINUS, U+002D) | [‑]-(NON-BREAKING HYPHEN, U+2011); [–]-(EN DASH, U+2013) | Replace illegal non-ASCII
            beta | [β]-(GREEK SMALL LETTER BETA, U+03B2) | [ß]-(LATIN SMALL LETTER SHARP S, U+00DF) | Replace illegal non-ASCII
            mu/micro | [μ]-(GREEK SMALL LETTER MU, U+03BC) | [µ]-(MICRO SIGN, U+00B5) | Both could be legal. Check the records to make sure the right chars are used.
            Y/UPSILON | [Y]-(LATIN CAPITAL LETTER Y, U+0059) | [Υ]-(GREEK CAPITAL LETTER UPSILON, U+03A5) | Both could be legal. Check the records to make sure the right chars are used.
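The always-illegal characters in the table can also be searched for directly by their UTF-8 bytes, which avoids relying on how the terminal or editor displays them. A minimal sketch: count_lines_with and scan_illegal are hypothetical helper names, the file is assumed to be UTF-8, and the demo input is made up.

```shell
# count_lines_with is a hypothetical helper: number of lines in file $1 that
# contain the byte sequence $2, written as printf octal escapes.
count_lines_with() {
    LC_ALL=C grep -c -F -e "$(printf "$2")" "$1" || true
}

# Report the code points from the table that are always illegal.
# Each row is: code point, UTF-8 encoding as octal escapes.
scan_illegal() {
    while read -r codepoint octal; do
        n=$(count_lines_with "$1" "$octal")
        if [ "$n" -gt 0 ]; then echo "$codepoint: $n line(s)"; fi
    done <<'EOF'
U+2018 \342\200\230
U+2019 \342\200\231
U+2011 \342\200\221
U+2013 \342\200\223
U+00DF \303\237
EOF
}

# Demo: one line with U+2019, one with U+2013, one plain ASCII line
sample=$(mktemp)
printf 'it\342\200\231s\n' > "$sample"
printf 'a\342\200\223b\n' >> "$sample"
printf 'plain ascii\n' >> "$sample"
report=$(scan_illegal "$sample")
echo "$report"
```

The mu/micro and Y/upsilon pairs are left out on purpose, since the table says both forms can be legal and need a manual check.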

      => The final fixed copy is saved as LEXICON.release.log.2.5.nonAsciiFix
    • Re-run the program with the new LEXICON.release (link) until everything is OK:
      shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2

  • Step 2-2: Check TradeMark:
    TradeMark
    • shell> ${LEXICON}/bin/2.ValidateLexicon <year> > log.2
    • Check the word count (wc) of output files (tradeMark.data)
      Should be 0 because there is no annotation.

  • Step 2-3: Check Irreg Base:
    • Skip - already checked in Check Content after 2014 release.
    • Old version of Check Irreg

  • Step 2-4: Check cross-Ref:
    • Skip - already checked in the Cross-Ref step after the 2014 release.
    • Cross-Ref: An enhanced cross-reference check program was implemented after 2014, and the check was therefore removed from LexBuild. So, this step has to be checked. If issues are found, fix them in LexBuild and go back to Step 0.
    • Old version of Check cross-Ref

    Clean up files and logs: move all logs and files to ./${year}