Derivations - Final Integration

Derivation pairs generated from nominalizations, prefixes, suffixes, and conversions (zero derivations) are used as facts for the derivation table in Lexical Tools after passing through the following two steps:

I. Affix Filter
Every derivation pair is generated by adding or removing a prefix or a suffix, or by conversion. By definition, a derivation applies only one affix (prefix or suffix). Two-step derivations are not allowed in Lexical Tools derivation generation; instead, they can be found by recursive derivation generation.
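
How two-step derivations fall out of repeated one-step lookups can be sketched as follows. This is a minimal illustration, not the Lexical Tools implementation: the `ONE_STEP` table and the `build_index` and `derive` functions are hypothetical names, the word pairs are made up for the example rather than taken from the derivation table, and the repeated lookup is written as a breadth-first traversal instead of literal recursion.

```python
from collections import deque

# Hypothetical one-step derivation table (illustrative pairs only,
# not taken from the Lexical Tools data). Each pair differs by one affix.
ONE_STEP = [
    ("happy", "unhappy"),        # prefix: un-
    ("happy", "happiness"),      # suffix: -ness
    ("unhappy", "unhappiness"),  # suffix: -ness
]


def build_index(pairs):
    """Index pairs in both directions, since an affix can be added or removed."""
    index = {}
    for a, b in pairs:
        index.setdefault(a, set()).add(b)
        index.setdefault(b, set()).add(a)
    return index


def derive(term, index, max_steps):
    """Return {derived term: number of one-step derivations needed to reach it}."""
    steps = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if steps[current] == max_steps:
            continue  # do not expand beyond the requested depth
        for nxt in sorted(index.get(current, ())):
            if nxt not in steps:
                steps[nxt] = steps[current] + 1
                queue.append(nxt)
    del steps[term]  # the input term is not its own derivation
    return steps


index = build_index(ONE_STEP)
print(derive("happy", index, max_steps=1))
print(derive("happy", index, max_steps=2))
```

With a depth of 1 only the direct derivations are returned; with a depth of 2 the traversal also reaches terms that need a second affix.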

  • Description:
    Compare the beginning and ending characters of each derivation pair and filter out possibly invalid pairs.

  • To run:
    shell> cd ${DERIVATIONS}/All/bin
    shell> GetAllD ${YEAR}
    5

  • Input files:
    • derivation.data: derivation pairs to be checked
    • derivation.tagYes.txt: derivation pairs tagged "yes" (valid) even though their beginning and ending characters differ
    • LRSPL: latest spelling variants file from LEXICON

  • Output files:
    • derivation.pattern${NoChar}.rpt: possibly invalid derivation pairs whose beginning and ending characters differ

  • Associated Java files:
    • CheckDerivationByAffix.java

  • Algorithm:
    • If term_1 is a substring of term_2 => return true (prefix or suffix)
    • If term_2 is a substring of term_1 => return true (prefix or suffix)

    • If the beginning or the ending N (3) characters of the two terms are the same => return true; that is, if one of the following comparisons matches:
      • Compare the beginning N characters
      • Compare the ending N characters

    • If it is a known valid derivation pair, checked in both directions => return true
      Some valid derivation pairs have different beginning and ending N (3) characters; they are listed in derivation.tagYes.txt. This exception list contains 20 valid derivation pairs in lvg.2012. For example:
      able|adj|E0006510|ability|noun|E0006490
      clear|adj|E0017285|clarity|noun|E0017210
      depth|noun|E0021875|deep|adj|E0021131
      die|verb|E0022536|death|noun|E0020918
      fly|verb|E0028337|flight|noun|E0028154
      give|verb|E0029785|gift|noun|E0029737
      high|adj|E0031612|height|noun|E0031035
      icy|adj|E0033252|iciness|noun|E0033232
      long|adj|E0038005|length|noun|E003721
      proud|adj|E0050636|pride|noun|E004997
      sale|noun|E0054164|sell|verb|E0055117
      think|verb|E0060653|thought|noun|E0060732
      use|verb|E0063738|usage|noun|E0063736

    • If spelling variants of the two terms have the same beginning or ending N (3) characters => return true
      Some valid derivation pairs have different beginning and ending characters only because of spelling variants. For example:
      aetiological|adj|E0356079|etiology|noun|E0007648
      caesarian|adj|E0016115|cesarean|noun|E0016116
      cozy|adj|E0019509|cosiness|noun|E0019263
      co-chromatograph|verb|E0019511|cochromatography|noun|E0017647
      coexistence|noun|E0017703|co-existent|adj|E0019518
      coinduction|noun|E0581568|co-induce|verb|E0581567
      co-ordinate|verb|E0018943|coordination|noun|E0018947
      deendothelization|noun|E0568700|de-endothelize|verb|E0568699
      dysmaturity|noun|E0438962|dismature|adj|E0333934
      eukaryoticity|noun|E0605921|eucaryotic|adj|E0026345
      haemagglutination|noun|E0030669|hemagglutinate|verb|E0031079
      hijacker|noun|E0588965|high-jack|verb|E0031665
      re-enforcement|noun|E0590229|reenforce|verb|E0539131

    • Else, it is a possibly invalid derivation pair => send to output (derivation.pattern${N}.rpt)
      The results of the step above (derivation.pattern${N}.rpt) are then reviewed by experts:
      • Linguists manually validate the possibly invalid derivations
      • If the derivation is valid: add it to ../dataOrg/derivation.tagYes.txt
      • If the derivation is invalid: add it to ../${TYPE}/data/${YEAR}/dataOrg/${TYPE}.tagNo.txt.data
        => ${NOMINALIZATIONS}/data/${YEAR}/dataOrg/nomD.tagNo.txt
      • Then rerun the whole process until nothing is reported in derivation.pattern${N}.rpt

II. Final Integration

The final derivation table for Lexical Tools is generated by the following steps:

  1. Copy the original derivation file: orgD.data
  2. Reformat nomD.yes.data.${YEAR} to nomD.data
  3. Reformat prefixD.yes.data.${YEAR} to prefixD.data
  4. Reformat zeroD.yes.data.${YEAR} to zeroD.data
  5. Combine the above four files into allD.data
  6. Sort, unify (remove duplicates), and write the result to derivation.data

  • To run:
    shell> cd ${DERIVATIONS}/All/bin
    shell> GetAllD ${YEAR}
    6
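
The combine, sort, and unify steps amount to a merge with duplicate removal, which can be sketched as below. This is an assumption-laden illustration rather than what GetAllD does internally: the `integrate` function is a hypothetical name, the reformatting steps are not modeled because the input formats are not described here, the sample lines reuse records from the examples in this document without claiming which file they actually come from, and a plain string sort is assumed.

```python
def integrate(sources):
    """Combine several lists of derivation lines, drop duplicates, and sort."""
    combined = [
        line.strip()
        for lines in sources
        for line in lines
        if line.strip()  # skip blank lines
    ]
    return sorted(set(combined))


# Hypothetical contents standing in for orgD.data and nomD.data.
org_d = [
    "use|verb|E0063738|usage|noun|E0063736",
    "able|adj|E0006510|ability|noun|E0006490",
]
nom_d = [
    "use|verb|E0063738|usage|noun|E0063736",  # duplicate of a line in org_d
    "die|verb|E0022536|death|noun|E0020918",
]

for line in integrate([org_d, nom_d]):
    print(line)
```

The duplicated record appears once in the output, and the lines come out in sorted order.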