SD-RULEs

I. Introduction
There are several hundred derivational suffix rules (SD-Rules). Lexical Tools use the following procedures to derive SD-Rules and SD pairs from the LEXICON.
SD-Rules come from two sources:

  • written manually by linguists (the original rules from Lexical Tools)
  • derived automatically from known suffixD pairs (facts) in Lexical Tools: nomD and orgD.

Rules from both sources must be validated for precision (exceptions) and frequency through the following process:
  • Retrieve all possible suffixD pairs for SD-Rules from the LEXICON

  • Derive possible SD-Rules from known suffixD pairs:
    There are two major sources of suffixD pairs:
    • nomD: almost all nomD pairs are suffixD (a few of them are zeroD)
    • orgD: the original FACTS in Lexical Tools

    Possible SD-Rules can be identified by stripping the shared leading characters from both members of each known suffixD pair; what remains on each side is the suffix. For example:
    location|noun|locate|verb|S|None
    infusion|noun|infuse|verb|S|None
    ...
    => Derive SD-Rule: ion$|noun|e$|verb|1694
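    The stripping step can be sketched as follows. This is a minimal illustration, not the Lexical Tools implementation: the function names `derive_sd_rule` and `count_sd_rules` are hypothetical, and the input is assumed to be the pipe-delimited pair lines shown above.

```python
from collections import Counter
import os


def derive_sd_rule(line):
    """Derive a candidate SD-Rule from one pair line: base1|cat1|base2|cat2|..."""
    base1, cat1, base2, cat2 = line.split("|")[:4]
    # Number of leading characters the two bases have in common.
    shared = len(os.path.commonprefix([base1, base2]))
    # Whatever remains after the shared part is the suffix on each side.
    return f"{base1[shared:]}$|{cat1}|{base2[shared:]}$|{cat2}"


def count_sd_rules(lines):
    """Count how many known pairs support each candidate rule."""
    return Counter(derive_sd_rule(line) for line in lines)


pairs = [
    "location|noun|locate|verb|S|None",
    "infusion|noun|infuse|verb|S|None",
]
for rule, freq in count_sd_rules(pairs).items():
    print(f"{rule}|{freq}")  # prints: ion$|noun|e$|verb|2
```

    Run over the full set of known pairs, the count in the last field corresponds to the frequency attached to each derived rule.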

  • Analyze derived SD-Rules:
    A good SD-Rule should have:
    • few exceptions (high precision)
    • high frequency

    Such a rule can be used in LVG to generate derivations automatically.
    We apply these two criteria to verify and refine all derived SD-Rules:
    • Retrieve all possible suffixD pairs from the LEXICON by applying the SD-Rules
    • Verify exceptions (precision) and frequency
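    Retrieving candidate pairs by applying a rule might look like the sketch below. It assumes the lexicon is available as a set of (base, category) tuples, which is a stand-in for the real LEXICON access; `apply_sd_rule` and the toy word list are hypothetical.

```python
def apply_sd_rule(rule, lexicon):
    """Return candidate (base1, cat1, base2, cat2) pairs produced by one rule.

    `lexicon` is a set of (base, category) tuples standing in for the LEXICON.
    """
    suffix1, cat1, suffix2, cat2 = rule.split("|")[:4]
    suffix1, suffix2 = suffix1.rstrip("$"), suffix2.rstrip("$")
    candidates = []
    for base, cat in sorted(lexicon):
        if cat != cat1 or not base.endswith(suffix1):
            continue
        # Swap the first suffix for the second and look the result up.
        stem = base[: len(base) - len(suffix1)]
        target = stem + suffix2
        if stem and (target, cat2) in lexicon:
            candidates.append((base, cat1, target, cat2))
    return candidates


# Toy lexicon for illustration only.
lexicon = {
    ("location", "noun"), ("locate", "verb"),
    ("infusion", "noun"), ("infuse", "verb"),
    ("ration", "noun"), ("rate", "verb"),
    ("onion", "noun"),
}
for candidate in apply_sd_rule("ion$|noun|e$|verb", lexicon):
    print("|".join(candidate))
```

    The rule retrieves every pair that fits its shape, including pairs such as ration/rate in this toy data, which is why each candidate still has to be tagged before precision can be measured.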

  • Decompose derived possible SD-Rules:
    The possible SD-Rules identified above must be analyzed further. If a rule has too many exceptions, it should be decomposed, using linguistic knowledge, into more fine-grained SD-Rules. The example above can be decomposed into the following more detailed rules:
    ation$|noun|ate$|verb|1547
    sion$|noun|se$|verb|77
    ution$|noun|ute$|verb|37
    etion$|noun|ete$|verb|22
    otion$|noun|ote$|verb|6
    ition$|noun|ite$|verb|4
    cion$|noun|ce$|verb|1
    --------------------
    Total: 1694
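    One mechanical way to propose finer-grained candidates is to keep some of the shared characters as context in front of the suffix. The sketch below is only an approximation: the split listed above was made with linguistic knowledge and mixes context lengths (for example sion$ and ation$), and the `context` parameter, the function name, and the sample pairs are hypothetical.

```python
from collections import Counter
import os


def derive_rule_with_context(base1, cat1, base2, cat2, context=0):
    """Derive a suffix rule, keeping `context` shared characters before the suffix."""
    shared = len(os.path.commonprefix([base1, base2]))
    cut = max(shared - context, 0)
    return f"{base1[cut:]}$|{cat1}|{base2[cut:]}$|{cat2}"


# Illustrative pairs only.
pairs = [
    ("location", "noun", "locate", "verb"),
    ("infusion", "noun", "infuse", "verb"),
    ("pollution", "noun", "pollute", "verb"),
    ("completion", "noun", "complete", "verb"),
]
coarse = Counter(derive_rule_with_context(*p) for p in pairs)
fine = Counter(derive_rule_with_context(*p, context=2) for p in pairs)

print(dict(coarse))  # one coarse rule: ion$|noun|e$|verb
print(dict(fine))    # ation$/ate$, usion$/use$, ution$/ute$, etion$/ete$

# The finer rules partition the pairs of the coarse rule, so their counts
# add up to the coarse count, just as the listed counts add up to 1694.
assert sum(fine.values()) == coarse["ion$|noun|e$|verb"]
assert sum([1547, 77, 37, 22, 6, 4, 1]) == 1694
```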
    

  • Compile all SD-Rules
    All rules (from FACTs and original rules) are then sorted and combined together for validation.

  • Validate SD-Rules
    All possible SD pairs are retrieved from the LEXICON based on the combined rules.
    • Add the tag "yes" if they are known SD pairs from facts.
    • Send the rest to linguists for tagging.
    • Apply the frequency and exception criteria above to finalize the SD-Rules.
      • The precision (exceptions) and frequency must meet the requirements.
      • If the SD-Rule is valid:
        • yes: go to SD-FACTs
        • no: add to exception list
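    The final check can be sketched as below. The thresholds are placeholders, since this section does not state the required precision or frequency, and frequency is assumed here to mean the number of pairs tagged "yes"; `evaluate_rule` and its parameters are hypothetical.

```python
def evaluate_rule(tagged_pairs, min_precision=0.95, min_frequency=10):
    """Summarize one rule from its tagged candidate pairs.

    tagged_pairs: list of (pair, tag) tuples, where tag is "yes" or "no".
    The threshold defaults are placeholders, not values from Lexical Tools.
    """
    facts = [pair for pair, tag in tagged_pairs if tag == "yes"]
    exceptions = [pair for pair, tag in tagged_pairs if tag == "no"]
    total = len(facts) + len(exceptions)
    precision = len(facts) / total if total else 0.0
    return {
        "frequency": len(facts),
        "precision": precision,
        "valid": precision >= min_precision and len(facts) >= min_frequency,
        "facts": facts,            # pairs tagged "yes"
        "exceptions": exceptions,  # pairs tagged "no"
    }


# Toy tags for illustration only.
tagged = [
    (("location", "locate"), "yes"),
    (("infusion", "infuse"), "yes"),
    (("ration", "rate"), "no"),
]
summary = evaluate_rule(tagged, min_precision=0.6, min_frequency=2)
print(summary["frequency"], round(summary["precision"], 2), summary["valid"])
```

    With real data, a rule that falls short on either measure is a candidate for the decomposition step described earlier rather than for direct use.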

II. Procedures TBD...

  • Prepare input files
    • ${DERIVATIONS}/Nominalizations/data/${YEAR}/dataOrg/LRNOM
      The latest nominalization file (LRNOM) from lexicon.${YEAR}
    • ${DERIVATIONS}/Nominalizations/data/${YEAR}/dataOrg/prepositions.data
      The latest prepositions from lexicon.${YEAR}. This file is used in the latest LexCheck package.
    • ${DERIVATIONS}/Nominalizations/data/${YEAR}/dataOrg/nomD.tagNo.txt
      A file listing all invalid derivations from nominalizations that need to be fixed in the LEXICON. The entries in this list do not follow the pattern noun + particle|verb.
  • Run the program
    shell> cd ${DERIVATIONS}/Nominalizations/bin
    shell> GetNomD ${YEAR}
    3
    
    The following iterative steps are needed:
    • update nomD.tagNo.txt

III. Program Details (GetSdRules) TBD...

  1. Generate derivations from nominalizations
    • Descriptions:
      Retrieve all possible derivation pairs from the nominalization file and convert them to the derivation format
    • Input files:
      • /dataOrg/LRNOM: nominalization file
    • Output files:
      • ./data/nomD.raw.data: raw data of possible nominalization derivation pairs
        Base 1|Cat 1|EUI 1|Base 2|Cat 2|EUI 2
    • Associated Java files:
      • GetNomDFromNomFile.java

  2. Generate the nominalization derivation meta file (nomD.meta.data) from nomD.raw.data, then split it into two files, nomD.yes.data and nomD.no.data:
    • Descriptions:
      Go through all pairs in "nomD.raw.data" and add tag information to "nomD.meta.data" using the following algorithm:
      • yes: all valid derivations from "nomD.raw.data"
      • no: all invalid derivations from "nomD.raw.data"
        • Pattern Filter: the pair matches an invalid pattern (checked in both directions)
          The most common way to nominalize a verb is by adding an affix. However, not every nominalization occurs that way. Thus, not every nominalization will be a derivation. For example, verb particles are not affixes. Four patterns of nominalization with verb particles are identified as invalid derivations. Derivation pairs are filtered out if they fall into these four patterns.
          • baseParticle|noun|eui 1|base|verb|eui 2
            Examples:
            backup|noun|E0321419|back|verb|E0011649|no
            cleanup|noun|E0319808|clean|verb|E0017272|no
            closeout|noun|E0587816|close|verb|E001744|no
            lineup|noun|E0521627|line|verb|E0037599|no
            lookup|noun|E0222422|look|verb|E003804|no
            setup|noun|E0320336|set|verb|E0055458|no
            takeover|noun|E0059818|take|verb|E0059816|no
            washout|noun|E0065084|wash|verb|E0065081|no
            ...

          • base-Particle|noun|eui 1|base|verb|eui 2
            Examples:
            cut-through|noun|E0588311|cut|verb|E0020215|no
            face-off|noun|E0588571|face|verb|E0027103|no
            fade-out|noun|E0587854|fade|verb|E0027177|no
            pull-up|noun|E0576246|pull|verb|E0051064|no
            phase-in|noun|E0588069|phase|verb|E0047185|no
            set-aside|noun|E0587818|set|verb|E0055458|no
            shake-up|noun|E0575525|shake|verb|E0055539|no
            warm-up|noun|E0586553|warm|verb|E0065055|no
            write-off|noun|E0587702|write|verb|E0065685|no
            ...

          • inflParticle|noun|eui 1|base|verb|eui 2
            Examples:
            grownup|noun|E0030484|grow|verb|E0030480|no
            ...

          • infl-Particle|noun|eui 1|base|verb|eui 2
            Examples:
            grown-up|noun|E0030484|grow|verb|E0030480|no
            salting-in|noun|E0587997|salt|verb|E0054234|no
            ...

            Note that the above four patterns do not apply when:

            • the preposition is "per", and
            • the noun ends with "pper".

            The following examples are valid derivations:
            chopper|noun|E0343361|chop|verb|E0016729|yes
            ripper|noun|E0360460|rip|noun|E0053656|yes
            shipper|noun|E0360483|ship|noun|E0055655|yes
            shopper|noun|E0354647|shop|verb|E0055686|yes
            snapper|noun|E0346235|snap|verb|E0056428|yes
            worshipper|noun|E0554172|worship|verb|E0065637|yes

          Note that the following example is a valid derivation because it does not match any of the above patterns:
          run-on|noun|E0338312|run on|verb|EUI 2|yes
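          The four patterns and the "pper" exception can be sketched as one check. This is an illustration under assumptions: `is_particle_nominalization` is a hypothetical name, the particle set is a small sample rather than the contents of prepositions.data, and `inflections` stands in for a real inflection lookup.

```python
def is_particle_nominalization(noun, verb, particles, inflections=()):
    """Return True if `noun` matches one of the four verb-particle patterns.

    `inflections` lists inflected forms of `verb` (for example "grown" for
    "grow"); it is a stand-in for a real inflection lookup.
    """
    for stem in (verb, *inflections):
        if not noun.startswith(stem):
            continue
        rest = noun[len(stem):]
        # Accept both the joined and the hyphenated spelling.
        particle = rest[1:] if rest.startswith("-") else rest
        if particle not in particles:
            continue
        # Exception from the text: particle "per" on a noun ending in "pper".
        if particle == "per" and noun.endswith("pper"):
            continue
        return True
    return False


# Sample particles for illustration only.
particles = {"up", "out", "over", "off", "in", "on", "through", "aside", "per"}

print(is_particle_nominalization("backup", "back", particles))             # baseParticle
print(is_particle_nominalization("cut-through", "cut", particles))         # base-Particle
print(is_particle_nominalization("grownup", "grow", particles, ["grown"]))  # inflParticle
print(is_particle_nominalization("chopper", "chop", particles))            # "pper" exception
print(is_particle_nominalization("run-on", "run on", particles))           # verb includes particle
```

          Pairs for which the check returns True would be tagged "no"; the last two calls return False, matching the valid examples above.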

        • Invalid Derivations: if it is in invalid nomD list (nomD.tagNo.txt)
          Derivations are bi-directional. For example, if A is a derivation of B, then B is a derivation of A. Conversely, if A is not a derivation of B, then B is not a derivation of A. The file nomD.tagNo.txt lists each known invalid derivation pair in only one direction, so the reversed direction of each pair must also be checked. Pairs matching either direction are filtered out.
          • this list does not include pairs covered by the pattern filter described above
          • it lists all invalid derivations known to linguists
          • 22 exceptions found in lvg.2012
          • Examples:
            face-saving|noun|E0027112|save|verb|E0054430|no
            decision-making|noun|E0021045|make|verb|E0038623|no
            merry-making|noun|E0039645|make|verb|E0038623|no
            lovemaking|noun|E0502721|make|verb|E0038623|no
            warm|noun|E0065054|warmed-up|adj|E0588482|no
            instability|noun|E0034830|unstable|adj|E0063378|no
            irradiation|noun|E0035884|nonirradiated|adj|E0042869|no
            ...
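          The reversed-direction check can be sketched by storing every listed pair in both orders. The function names are hypothetical, and the line format (six pipe-delimited fields, optionally followed by a tag) is assumed from the examples above.

```python
def load_invalid_pairs(lines):
    """Build a lookup that holds each invalid pair in both directions."""
    invalid = set()
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        first, second = tuple(fields[0:3]), tuple(fields[3:6])
        invalid.add((first, second))
        invalid.add((second, first))  # the file lists one direction only
    return invalid


def is_known_invalid(first, second, invalid):
    """True if the pair (in the given order) is a known invalid derivation."""
    return (first, second) in invalid


invalid = load_invalid_pairs([
    "instability|noun|E0034830|unstable|adj|E0063378|no",
])
a = ("instability", "noun", "E0034830")
b = ("unstable", "adj", "E0063378")
print(is_known_invalid(a, b, invalid), is_known_invalid(b, a, invalid))  # True True
```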
    • Input files:
      • ./data/nomD.raw.data

      • ./dataOrg/nomD.tagNo.txt: nomD with "no" tag, invalid nomD.
        Base 1|Cat 1|EUI 1|Base 2|Cat 2|EUI 2
      • ./dataOrg/prepositions.data: prepositions
        • This file lists all particles (prepositions) found in LEXICON
        • 198 prepositions found in lvg.2012
        • This file is generated by the LEXICON program and should be updated annually (it is also used in LexCheck)
    • Output files:
      • ./data/nomD.meta.data: meta data with "yes", "no", "tbd" tags.
        Base 1|Cat 1|EUI 1|Base 2|Cat 2|EUI 2|tag
      • ./data/nomD.yes.data: valid nomD pairs
        Base 1|Cat 1|EUI 1|Base 2|Cat 2|EUI 2
      • ./data/nomD.no.data: filtered out invalid nomD pairs
    • Associated Java files:
      • GetNomDMetaFile.java
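
    The split of the meta data into the yes and no files can be sketched as follows. This is not the logic of GetNomDMetaFile.java: `split_meta` is a hypothetical name, the tag is assumed to be the last pipe-delimited field, and dropping the tag from the "no" lines is an assumption, since only the format of nomD.yes.data is given above.

```python
def split_meta(meta_lines):
    """Split tagged meta lines into untagged "yes" and "no" lines."""
    yes_lines, no_lines = [], []
    for line in meta_lines:
        line = line.strip()
        if not line:
            continue
        # The tag is taken to be everything after the last "|".
        pair, _, tag = line.rpartition("|")
        if tag == "yes":
            yes_lines.append(pair)
        elif tag == "no":
            no_lines.append(pair)
        # Any other tag (for example "tbd") stays only in the meta data.
    return yes_lines, no_lines


meta = [
    "chopper|noun|E0343361|chop|verb|E0016729|yes",
    "backup|noun|E0321419|back|verb|E0011649|no",
]
yes_lines, no_lines = split_meta(meta)
print(yes_lines)  # ['chopper|noun|E0343361|chop|verb|E0016729']
print(no_lines)   # ['backup|noun|E0321419|back|verb|E0011649']
```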