Journal Descriptor (JD) Indexing

Background:

As part of the research underlying the Indexing Initiative, we are investigating a novel approach to fully-automated indexing (Humphrey 1998; Humphrey 1999) based on NLM's practice of maintaining a subject index to journal titles using terms corresponding to specialties associated with biomedicine. This journal descriptor (JD) indexing is meant to complement the methods described earlier in this report.

JD's are a set of 141 preferred MeSH terms designated for use in indexing Medline journals by subject. Of the 4,000 Medline serials, 3,330 are assigned JD's; of these, about 74% are assigned a single JD, 21% just two JD's, and the remainder up to five JD's. For example, the American Journal of Cardiology has the JD Cardiology.

Since all citations inherit the JD's of their respective journals, JD's can be thought of as indexing terms for documents as well as for journals, and, in fact, JD's have often been used indirectly by professional searchers. For example, to retrieve literature on neurotransmitters in the field of cardiology, the search term Neurotransmitters may be intersected with the JD Cardiology, which can only be searched by specifying the title abbreviations or codes for journals with this JD.

There are several impediments to using JD's as currently implemented for Medline retrieval. The considerable difficulty in accessing JD's in available retrieval systems and the fact that some journals have no JD seem not to be research issues. JD indexing research addresses the problem that retrieval based on searching a JD as an inherited descriptor is restricted to the particular journals having this JD. For example, a cardiology citation in the New England Journal of Medicine inherits only the JD assigned to this journal, which is Medicine, and thus cannot be retrieved by the JD Cardiology. The research discussed here investigates the feasibility of automatically assigning JD's to all Medline citations, as appropriate, regardless of the JD's currently assigned.

Relying on the intellectual effort of NLM catalogers who maintain current information about NLM serials, JD indexing has the potential to supplement other forms of automatic indexing by providing powerful access points in certain types of searching, much as is done by search engines on the Web that organize documents in very general categories, thereby subdividing the Web into domain-specific information sources. Further applications being investigated include using JD indexing as an aid to resolving word-sense ambiguity in natural language processing.

Methodology:

The methodology for using JD's as indexing terms is based on the characterization of a word by a "JD profile." The JD profile associates the JD's of journals with words commonly occurring in titles and abstracts of papers in these journals and is computed from a training set of Medline citations to these papers. Once the JD profile has been computed for each word occurring in the training set, it can be used as the basis for indexing any medical text outside the training set, including but not limited to Medline citations. This automatic indexing relies on a calculation based on a composite of the JD profiles for the words occurring in the document to be indexed. The ranked list of JD's resulting from this calculation becomes the set of indexing terms for the document under consideration.

The association captured in the JD profile is based either on the number of occurrences of the word in the training set or on the number of citations in the training set containing the word. In the former case, the number of occurrences of a word in association with a particular JD is divided by the total number of occurrences of this word, while in the latter, the number of citations associated with a particular JD and containing this word is divided by the total number of citations containing this word. Both methods of computing the JD profile are under consideration. For brevity, only one method, based on citation count, will be illustrated in this written report.
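The two ways of computing a JD profile can be sketched in Python as follows. This is a minimal illustration, not the system's actual implementation: the function name build_jd_profiles, the whitespace tokenization, and the three-citation toy training set are all assumptions made for the example. It also assumes that a citation from a journal with several JD's counts once toward each of them.

```python
from collections import Counter, defaultdict


def build_jd_profiles(citations, by="citation"):
    """Return {word: {jd: fraction}} from (text, jds) pairs.

    by="citation":   each citation containing the word counts once.
    by="occurrence": every occurrence of the word counts.
    """
    per_jd = defaultdict(Counter)  # word -> JD -> count
    totals = Counter()             # word -> total count
    for text, jds in citations:
        counts = Counter(text.lower().split())  # naive tokenization (assumption)
        for word, n in counts.items():
            weight = 1 if by == "citation" else n
            totals[word] += weight
            for jd in jds:
                per_jd[word][jd] += weight
    return {
        word: {jd: c / totals[word] for jd, c in jd_counts.items()}
        for word, jd_counts in per_jd.items()
    }


# Hypothetical toy training set: (citation text, JD's of its journal).
training = [
    ("chemotherapy response after chemotherapy", ["Medical Oncology"]),
    ("chemotherapy in leukemia", ["Hematology"]),
    ("chemotherapy toxicity", ["Medical Oncology"]),
]

by_citation = build_jd_profiles(training, by="citation")
by_occurrence = build_jd_profiles(training, by="occurrence")
print(by_citation["chemotherapy"])    # Medical Oncology 2/3, Hematology 1/3
print(by_occurrence["chemotherapy"])  # Medical Oncology 3/4, Hematology 1/4
```

In the toy data the two methods differ only because the first citation contains the word twice, which raises the weight of Medical Oncology under the occurrence-based method.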

For example, the JD profile for the word chemotherapy is as follows, using our current training set of 21,760 citations. Chemotherapy occurs a total of 657 times in 304 citations, and forty-eight JD's have been associated with these citations by NLM. The distribution of chemotherapy in association with the JD's occurring in the training set is listed in Figure 1 as a fraction of the total number of citations in which this word appears (only the most frequent associations are given). For instance, the frequency of occurrence of chemotherapy in association with the top-ranked JD Medical Oncology (0.453947) is determined by dividing the number of citations associated with Medical Oncology and containing chemotherapy (138) by the total number of citations in which this word appears (304).

CITATION COUNT FOR WORD PER JD/TOTAL CITATION COUNT, BY COUNT:
   |Medical Oncology|138/304=0.453947
   |Hematology|39/304=0.128289
   |Medicine|31/304=0.101974
   |Pediatrics|24/304=0.078947
   |Surgery|16/304=0.052632
   etc.
Figure 1. The JD profile for chemotherapy.

The principle underlying the feasibility of JD's as document descriptors is the fact that the distribution of words in citations associated with a particular JD is not uniform. Frequency of occurrence is purported to correlate with the semantic content of the text of the citation. In this example, chemotherapy occurs most frequently associated with the JD Medical Oncology, considerably less so with Hematology, and so on. The association of a word with a particular array of JD's forms the basis for automatic indexing. For example, once the JD profile for chemotherapy has been computed from the training set, this profile functions as an indicator of the semantic content of any text in which chemotherapy occurs. JD's derived from the JD profiles for all the words in a document can then be used as indexing terms for that document.
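As a rough sketch of how JD's might be derived from the profiles of all the words in a document, the Python below averages each JD's fraction over the document's words and sorts the result. The function name rank_jds and the profile values are hypothetical, and skipping words that have no profile is an assumption of this sketch rather than a documented behavior of the system.

```python
from collections import defaultdict


def rank_jds(words, profiles):
    """Average each JD's fraction over the document's known words, highest first."""
    known = [w for w in words if w in profiles]  # unseen words skipped (assumption)
    if not known:
        return []
    sums = defaultdict(float)
    for word in known:
        for jd, fraction in profiles[word].items():
            sums[jd] += fraction
    # A JD absent from a word's profile contributes 0 for that word.
    scores = {jd: total / len(known) for jd, total in sums.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical JD profiles for two words (not taken from the training set).
profiles = {
    "chemotherapy": {"Medical Oncology": 0.5, "Hematology": 0.25, "Medicine": 0.25},
    "nausea": {"Medical Oncology": 0.5, "Medicine": 0.5},
}

ranking = rank_jds(["chemotherapy", "nausea", "unseenword"], profiles)
print(ranking)
# [('Medical Oncology', 0.5), ('Medicine', 0.375), ('Hematology', 0.125)]
```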

To illustrate indexing a document outside the training set, we can use the following title from the New England Journal of Medicine: "Dexamethasone, Granisetron, or Both for the Prevention of Nausea and Vomiting during Chemotherapy for Cancer." Considering this title as a document, the top-ranked JD's assigned as indexing terms are shown in Figure 2. (Note that although this journal has the JD Medicine assigned by NLM, this assignment is not used in the JD indexing of this text.)

JD'S AND RANK BASED ON CITATION COUNT FOR WORD, BY RANK:
   |Medical Oncology|0.18495
   |Medicine|0.105122
   |Pharmacology|0.00679
   etc.
Figure 2. The top-ranked JD's for "Dexamethasone, Granisetron, or Both for the Prevention of Nausea and Vomiting during Chemotherapy for Cancer."

In order to arrive at these indexing terms, tables associating words with JD's are computed. Table 1 shows how many times words from the document co-occur with the JD Medical Oncology in the training set based on citation count. For example, granisetron occurs in six citations in the training set and three of these have the JD Medical Oncology (i.e., are from Medical Oncology journals). As noted earlier in the JD profile for chemotherapy, this word occurs in 304 citations and 138 of these are associated with Medical Oncology. The ranking (0.18495) of this JD as an indexing term for the text under consideration is computed by averaging the fractions given in the third column of Table 1. The fact that Medical Oncology was the top-ranked JD in the JD profiles for five words in the text (chemotherapy, as illustrated earlier, as well as granisetron, nausea, vomiting, and cancer) contributed to this being the top-ranked JD for the text.

WORD IN DOCUMENT    Medical Oncology               Medical Oncology
TO BE INDEXED       CITE COUNT/TOTAL CITE COUNT    RANK (CITE COUNT)
DEXAMETHASONE       1/68                           0.014706
GRANISETRON         3/6                            0.5
BOTH                147/4038                       0.036404
FOR                 570/11837                      0.048154
THE                 740/17625                      0.041986
PREVENTION          9/317                          0.028391
NAUSEA              19/51                          0.372549
AND                 736/17301                      0.042541
VOMITING            13/45                          0.288889
DURING              119/3050                       0.039016
CHEMOTHERAPY        138/304                        0.453947
CANCER              320/907                        0.352811
Table 1. Words associated with the JD Medical Oncology.
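The averaging step can be checked directly from the counts in Table 1. The short Python sketch below is illustrative only; the names table1 and jd_rank are introduced here and are not part of the system. It averages over the twelve words listed in the table, which does not include "or" and "of" and lists "for" once.

```python
# (Medical Oncology citation count, total citation count) per word, from Table 1.
table1 = {
    "dexamethasone": (1, 68),
    "granisetron": (3, 6),
    "both": (147, 4038),
    "for": (570, 11837),
    "the": (740, 17625),
    "prevention": (9, 317),
    "nausea": (19, 51),
    "and": (736, 17301),
    "vomiting": (13, 45),
    "during": (119, 3050),
    "chemotherapy": (138, 304),
    "cancer": (320, 907),
}


def jd_rank(word_counts):
    """Average the per-word fractions (JD count / total count) for one JD."""
    fractions = [jd_count / total for jd_count, total in word_counts.values()]
    return sum(fractions) / len(fractions)


rank = jd_rank(table1)
print(f"Medical Oncology: {rank:.5f}")  # Medical Oncology: 0.18495
```

Common words such as "the" and "and" have fractions near 0.04, which is close to the share of Medical Oncology citations in the training set, so they pull the average toward that baseline, while the five topical words raise it.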

Training Set:

The initial training set consisted of 21,760 citations from the July 1995 Medline file having a 1995 publication date. The distribution of JD's assigned to citations in this set is given in Table 2 for the ten most frequently assigned JD's. (Citations from journals not assigned a JD, 1% of the total, were eliminated from the training set.)

NO. OF CITATIONS    JD                        % OF CITATIONS
2,467               Medicine                  11%
1,585               Biochemistry              7%
972                 Nursing                   4%
851                 Medical Oncology          4%
828                 Surgery                   4%
768                 Allergy and Immunology    4%
681                 Pharmacology              3%
632                 Science                   3%
584                 Neurology                 3%
576                 Biotechnology             3%
Table 2. Most frequently assigned JD's in the training set.

Biases presumably not representative of Medline are a consequence of the relatively small size of this training set. For example, the citation count for Nursing is abnormally high simply because an unusually large number of issues of journals in that discipline were indexed during July, 1995. Such biases degrade the results of indexing with this method. The construction of a considerably larger training set is being pursued in order to address this problem.

Any training set representative of Medline, even a large one, will reflect the inherent biases of Medline with regard to discipline. Citation counts for a discipline reflect not only the number of journals indexed in a discipline, but also frequency of publication and number of articles per issue of such journals. Word counts for a discipline reflect in addition the length of titles and number and length of abstracts keyed in per indexed journal. Again, once a larger training set is constructed, a truer picture of these inherent biases should emerge so that techniques to deal with them can be explored.

Current research:

The thrust of current research underlying this project is to improve system performance. Ways of doing this include enlarging the training set to better represent domain distributions in Medline, trying standard statistical Information Retrieval methods such as term weighting, developing statistical and other methods to compensate for under- and over-representation of certain domains, investigating problems in JD assignments, and exploring methods that associate JD's with elements of Medline citations other than individual words.

A particular problem arises when the JD's assigned to a journal are not representative of every paper in that journal. A solution may be to combine certain JD's, for example, merging Medical Oncology and "Neoplasms, Experimental" into Neoplasms. Associating JD's with elements other than individual words presumes the availability of techniques for identifying these elements. An example of such an element would be automatically generated noun phrases.

Since the system does not use the JD assignment for a test article, a ready-made measure of system performance can be whether the system recommends this JD, for example, whether the system assigns the JD Cardiology to test documents from the American Journal of Cardiology. An evaluation may also be based on any of the test collections available generally in the Indexing Initiative, including the recently created CBM (Current Bibliographies in Medicine) collection.
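One way to express this ready-made measure is the fraction of test documents for which a JD of the source journal appears among the top k JD's recommended by the system. The Python sketch below is only an illustration under that assumption: the function jd_hit_rate, the cutoff k, and the stand-in toy_indexer are hypothetical and do not describe the evaluation actually used.

```python
def jd_hit_rate(test_docs, index_document, k=1):
    """Fraction of documents whose journal JD is among the top-k recommended JD's.

    test_docs:      list of (text, journal_jds) pairs
    index_document: callable returning a ranked list of JD names for a text
    """
    if not test_docs:
        return 0.0
    hits = 0
    for text, journal_jds in test_docs:
        top_k = index_document(text)[:k]
        if any(jd in top_k for jd in journal_jds):
            hits += 1
    return hits / len(test_docs)


def toy_indexer(text):
    """Hypothetical stand-in for the JD indexer, for demonstration only."""
    if "heart" in text.lower():
        return ["Cardiology", "Medicine"]
    return ["Medicine", "Cardiology"]


# Hypothetical test documents from a journal whose JD is Cardiology.
test_docs = [
    ("Heart failure outcomes", ["Cardiology"]),
    ("Beta blockers after infarction", ["Cardiology"]),
]

score_top1 = jd_hit_rate(test_docs, toy_indexer, k=1)
score_top2 = jd_hit_rate(test_docs, toy_indexer, k=2)
print(score_top1, score_top2)  # 0.5 1.0
```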

Research using JD indexing as the basis for semantic type (ST) indexing is under way. ST indexing is being investigated as an approach to resolving word-sense ambiguity for the MetaMap project. JD and ST indexing are described in the following papers:

Word sense disambiguation by selecting the best semantic type based on Journal Descriptor Indexing: preliminary experiment.
Humphrey, SM; Rogers, WJ; Kilicoglu, H; Demner-Fushman, D; Rindflesch, TC. J Am Soc Inf Sci Technol. 2006 Jan;57(1):96-113.
Erratum in: J Am Soc Inf Sci. 2006 Mar;57(4):726.

Automatic indexing by discipline and high-level categories: methodology and potential applications.
Humphrey, SM; Rindflesch, TC; Aronson, AR. In: Soergel D, Srinivasan P, Kwasnik B, editors. Proceedings of the 11th ASIST SIG/CR Classification Research Workshop; 2000 Nov 12; Chicago. Silver Spring (MD): American Society for Information Science and Technology; 2000. p. 103-16.

Automatic indexing of documents from journal descriptors: a preliminary investigation.
Humphrey, SM. J Am Soc Inf Sci. 1999 Jun;50(8):661-74.