Testing Data: UMLS-Core

This list of terms comes from the UMLS-Core project. It is used as the gold standard for testing in this project.

I. General Information

  • UMLS-Core: SCTMap_withCUI_201302 (provided by Dr. K.W. Fung)
  • In MS Excel format, with columns:
    Term Id | Local Term | SNOMED CID | SNOMED FSN | UMLS CUI
  • Contains 15,487 terms with valid mapped CUI (used as gold standard)
  • Contains 13,077 unique terms
  • 1,492 terms are duplicated under different IDs (sources)
  • 35 terms have multiple CUIs (ambiguous)
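Counts like the ones above can be derived from the term rows with a few set operations. The following is a minimal sketch, not the project's actual tooling: the function name `summarize`, the tuple layout, and the sample IDs and CUIs are made up for illustration, and it assumes terms are compared as exact, case-sensitive strings.

```python
from collections import defaultdict


def summarize(rows):
    """Summarize (term_id, local_term, cui) rows.

    Hypothetical helper: assumes exact, case-sensitive term comparison.
    """
    ids_by_term = defaultdict(set)   # local term -> set of term IDs
    cuis_by_term = defaultdict(set)  # local term -> set of mapped CUIs
    total = 0
    for term_id, term, cui in rows:
        total += 1
        ids_by_term[term].add(term_id)
        cuis_by_term[term].add(cui)
    return {
        "total": total,
        "unique_terms": len(ids_by_term),
        # terms that appear under more than one ID (source)
        "duplicated_terms": sum(1 for ids in ids_by_term.values() if len(ids) > 1),
        # terms mapped to more than one CUI
        "ambiguous_terms": sum(1 for cuis in cuis_by_term.values() if len(cuis) > 1),
    }


# Illustrative rows only; the IDs and CUIs are placeholders, not real data.
sample_rows = [
    ("HA1", "asthma", "C0000001"),
    ("KP7", "asthma", "C0000001"),
    ("MA3", "cold", "C0000002"),
    ("NU4", "cold", "C0000003"),
    ("RI5", "gout", "C0000004"),
]
stats = summarize(sample_rows)
```

In the sample, "asthma" and "cold" count as duplicated terms, and "cold" also counts as ambiguous because it carries two different CUIs.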

II. Data Process

  • Convert from Excel to CSV format
  • Convert from CSV to pipe-separated format (gov.nih.nlm.nls.stmt.Lib.FromCsvToPipeFile)
  • Filter out duplicated terms so that each term|CUI pair appears only once

  • For testing input: Retrieve field 2
    Local Term

  • For gold standard: Retrieve fields 2,5
    Local Term | UMLS CUI
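The steps above can be sketched in a few lines of Python. This is an illustration only, not the implementation of FromCsvToPipeFile: the function names `read_rows` and `extract` are hypothetical, the CSV is assumed to have a header row with the five columns listed in section I, fields are assumed not to contain the "|" character, and the sample values are placeholders.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into lists of fields, skipping the assumed header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # assumption: the first row is a header
    return [row for row in reader]


def extract(rows, field_numbers):
    """Join the given 1-based fields with '|' and drop repeated lines.

    The first occurrence of each line is kept, in input order.
    """
    seen = set()
    lines = []
    for row in rows:
        line = "|".join(row[n - 1].strip() for n in field_numbers)
        if line not in seen:
            seen.add(line)
            lines.append(line)
    return lines


# Placeholder data: the IDs, SNOMED codes, and CUIs are made up.
csv_text = (
    "Term Id,Local Term,SNOMED CID,SNOMED FSN,UMLS CUI\n"
    '1,"pain, chest",100,Chest pain (finding),C0000001\n'
    '2,"pain, chest",100,Chest pain (finding),C0000001\n'
    "3,gout,200,Gout (disorder),C0000002\n"
)
rows = read_rows(csv_text)
test_input = extract(rows, [2])        # field 2: Local Term
gold_standard = extract(rows, [2, 5])  # fields 2,5: Local Term|UMLS CUI
```

Using a CSV parser rather than splitting on commas keeps quoted terms such as "pain, chest" intact as a single field.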

III. Source of UMLS-Core data

  • Problem list terminologies (local terms) from 6 (8) institutions
    • HA: Hong Kong Hospital Authority
    • IH: Intermountain Healthcare
    • KP: Kaiser Permanente
    • MA: Mayo Clinic
    • NU: University of Nebraska Medical Center
    • RI: Regenstrief Institute
  • A problem list is a complete list of a patient's problems
  • The data in the original paper:
    • 76,237 terms and their usage frequencies in 14 million patients were submitted from six institutions
    • 65,678 terms were unique across institutions
    • mapping from the local problem list terms to standard terminologies (ICD-9-CM, SNOMED CT) if available
    • 14,395 terms covered 95% of usage in each institution (10,081 terms unique across institutions)
    • 13,261 terms were successfully mapped to 6,776 UMLS concepts
      • UMLS mapping - 2008AA: 10,812 (75%)
        • exact match - case-insensitive: 8,102 (56%)
        • normalized match: 2,035 (14%)
        • synonym substitution: 576 (5%)

      • local maps to standard terminologies: 1,007 (7%)
        • automatically mapped - if labeled as exact match
        • manually reviewed for exact match - if not labeled as exact match

      • manual mapping using the RRF browser: 1,442 (10%)

      • unmapped: 1,134 (8%)
        • Highly specific: 53%
        • Very general: 11%
        • Administrative: 7%
        • Laterality: 7%
        • Negative finding: 3%
        • Composite concept: 3%
        • Meaning unclear (ambiguous): 2%
        • Miscellaneous: 13%
  • References: UMLS-Core Project
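The top-level mapping breakdown in this section can be checked with simple arithmetic. The sketch below uses only the four category counts listed above; the variable names are illustrative, and the percentages are assumed to be shares of the 14,395 terms that covered 95% of usage.

```python
# Category counts taken from the breakdown above.
categories = {
    "UMLS mapping (2008AA)": 10812,
    "local maps to standard terminologies": 1007,
    "manual mapping (RRF browser)": 1442,
    "unmapped": 1134,
}

total = sum(categories.values())             # all terms in the breakdown
mapped = total - categories["unmapped"]      # terms mapped by any method

# Share of each category, rounded to a whole percent.
shares = {name: round(100 * n / total) for name, n in categories.items()}

for name, n in categories.items():
    print(f"{name}: {n:,} ({shares[name]}%)")
print(f"mapped: {mapped:,} of {total:,}")
```

The four categories add up to 14,395, and the three mapped categories add up to 13,261, which agrees with the rounded percentages shown in the list (75%, 7%, 10%, and 8%).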