Friday, April 18, 2008

MeSH (Medical Subject Headings) more complex than simple "Trees"

MeSH (Medical Subject Headings) is a wonderful nomenclature of medical terms available from the U.S. National Library of Medicine.

The download site is:

MeSH is one of the greatest gifts provided by the U.S. National Library of Medicine and can be used freely for a variety of projects involving indexing, tagging, searching, retrieving, coding, analyzing, merging, and sharing biomedical text. In my opinion, there are many projects that rely on commercial and legally encumbered nomenclatures that would be better served by MeSH.

My only quibble with MeSH is that it is incorrectly described as a Tree structure.

Here is the official word (from the NLM website) on MeSH Trees from:

"Because of the branching structure of the hierarchies, these lists are sometimes referred to as "trees". Each MeSH descriptor appears in at least one place in the trees, and may appear in as many additional places as may be appropriate. Those who index articles or catalog books are instructed to find and use the most specific MeSH descriptor that is available to represent each indexable concept."

When you look at individual entries in MeSH, you find that a single entry may be assigned multiple MeSH numbers.

For example, the MeSH term "Family" is assigned two MeSH numbers:

MN = F01.829.263
MN = I01.880.225

The parent "number" for any MeSH entry is found by removing the last period-delimited group of digits.

For example:
F01.829.263 MeSH name, Family
F01.829 MeSH name, Psychology, Social
F01 MeSH name, Behavior and Behavior Mechanisms
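The rule above is easy to try out. Here is a minimal Perl sketch that walks one MeSH number up to its top-level category by repeatedly dropping the last group of digits. The subroutine name, ancestor_numbers, is made up for this illustration and is not part of MeSH or of any NLM tool.

```perl
use strict;
use warnings;

# Return the ancestor numbers of one MeSH number, nearest parent first,
# by repeatedly removing the trailing ".digits" group.
# (ancestor_numbers is a hypothetical helper name.)
sub ancestor_numbers {
    my ($number) = @_;
    my @chain;
    while ($number =~ s/\.[0-9]+$//) {
        push @chain, $number;
    }
    return @chain;
}

# The two MeSH numbers assigned to "Family"
print join(" ", ancestor_numbers("F01.829.263")), "\n";   # F01.829 F01
print join(" ", ancestor_numbers("I01.880.225")), "\n";   # I01.880 I01
```

Note that this only produces ancestor numbers. Looking up the names of those ancestors still requires the MeSH file itself.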

For each MeSH number, there is a separate hierarchy.

It is tempting to think of each hierarchy for each number as a tree (then MeSH could be envisioned as a dense forest), but each parent term could be assigned multiple MeSH numbers, each producing a multi-branching hierarchy.

Because each MeSH term (including the ancestral terms for a MeSH term) may be assigned multiple MeSH numbers, each with its own hierarchy, the MeSH data structure is more accurately thought of as a complex ontology, with terms existing in multiple classes, and with specified relationships between any class and its parent classes.

The tree metaphor breaks down because branches and nodes within a branch can be connected to other branches and to other nodes. Trees do not do this kind of thing.
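One way to see this concretely is to count the distinct parent terms of a single term. In a true tree, every node other than the root has exactly one parent. The sketch below uses a small hand-built table whose rows are copied from the sample output later in this post. The names %name_of and parent_terms are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Toy table: MeSH number => MeSH name (a few rows from the sample output)
my %name_of = (
    "A11"             => "Cells",
    "A11.329"         => "Connective Tissue Cells",
    "A11.329.372"     => "Macrophages",
    "A11.733"         => "Phagocytes",
    "A11.733.397"     => "Macrophages",
    "A15.382.680"     => "Phagocytes",
    "A15.382.680.397" => "Macrophages",
);

# Collect the distinct parent terms of a named term.
# (parent_terms is a hypothetical helper name.)
sub parent_terms {
    my ($term) = @_;
    my %parents;
    foreach my $number (keys %name_of) {
        next unless $name_of{$number} eq $term;
        my $parent = $number;
        next unless $parent =~ s/\.[0-9]+$//;   # top-level numbers have no parent
        $parents{ $name_of{$parent} } = 1 if exists $name_of{$parent};
    }
    return sort keys %parents;
}

my @parents = parent_terms("Macrophages");
print scalar(@parents), " parent terms: ", join("; ", @parents), "\n";
# 2 parent terms: Connective Tissue Cells; Phagocytes
```

Even in this tiny subset, "Macrophages" has two different parent terms, which a node in a tree cannot have.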

It is possible to write a script that parses through every MeSH entry, finds all of the MeSH numbers for the entry, determines the parent terms for the MeSH numbers, determines all of the alternate MeSH numbers for the parent terms, then finds all of the grandparent terms for all of the parent terms, etc., until all of the hierarchical terms for the term are found.
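Before running anything against the full MeSH file, it may help to follow those steps on a toy table held in memory. This sketch is not the script from this post. It uses a work queue and a "seen" hash in place of repeated passes over a list, and the names %name_of, %numbers_of and lineage_numbers are assumptions made for the illustration. The rows are copied from the sample output later in this post.

```perl
use strict;
use warnings;

# Toy table: MeSH number => MeSH name
my %name_of = (
    "A11"             => "Cells",
    "A11.329"         => "Connective Tissue Cells",
    "A11.329.372"     => "Macrophages",
    "A11.733"         => "Phagocytes",
    "A11.733.397"     => "Macrophages",
    "A15"             => "Hemic and Immune Systems",
    "A15.382"         => "Immune System",
    "A15.382.680"     => "Phagocytes",
    "A15.382.680.397" => "Macrophages",
);

# Inverted table: MeSH name => list of its MeSH numbers
my %numbers_of;
push @{ $numbers_of{ $name_of{$_} } }, $_ for keys %name_of;

# Return every MeSH number in the lineage of a term: its own numbers,
# their parent numbers, the alternate numbers of each parent term, and so on.
sub lineage_numbers {
    my ($term) = @_;
    my %seen;
    my @queue = @{ $numbers_of{$term} || [] };
    while (defined(my $number = shift @queue)) {
        next if $seen{$number}++;              # already handled
        my $name = $name_of{$number};
        push @queue, @{ $numbers_of{$name} } if defined $name;   # alternate numbers
        my $parent = $number;
        push @queue, $parent if $parent =~ s/\.[0-9]+$//;        # parent number
    }
    return sort keys %seen;
}

foreach my $number (lineage_numbers("Phagocytes")) {
    print "$number $name_of{$number}\n";
}
```

For "Phagocytes", the toy table yields five numbers spread over two top-level categories (A11 and A15). The lineage of "Macrophages" pulls in all nine rows.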

Here is the Perl script. This Perl script is provided "as is", by its creator, Jules J. Berman, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

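#!/usr/bin/perl
use strict;
use warnings;

# D2007.BIN is the file name for the raw ASCII MeSH version
open(my $mesh, "<", "D2007.BIN") or die "Cannot open D2007.BIN: $!";
open(my $out, ">", "mesh.out") or die "Cannot open mesh.out: $!";

my %numberhash;   # MeSH number => MeSH name
my %namehash;     # MeSH name => space-separated list of its MeSH numbers
my @cumlist;      # growing list of MeSH numbers in the current lineage

$/ = "\n\n*NEWRECORD\n";   # read one whole MeSH record at a time
while (my $line = <$mesh>) {
    next unless $line =~ /\nMH = ([^\n]+)\n/;
    my $name = $1;
    my $numbers = "";
    while ($line =~ m/\nMN ?= ?([^\n]+)(?=\n)/g) {
        my $number = $1;
        $number =~ s/^ +//;
        $number =~ s/ +$//;
        $numberhash{$number} = $name;
        $numbers .= " " . $number;
    }
    $numbers =~ s/^ +//;
    $namehash{$name} = $numbers;
}
close($mesh);

while (my ($key, $value) = each %namehash) {
    print $out "\nTERM LINEAGE FOR " . uc($key) . "\n";
    @cumlist = split(/ /, $value);
    my $previous = -1;
    # Repeat until a full pass adds no new MeSH numbers
    while (@cumlist != $previous) {
        $previous = @cumlist;
        splitlist(@cumlist);      # add the parent number of every number
        allmeshnums(@cumlist);    # add the alternate numbers of every term
        my %marked;
        @cumlist = grep { !$marked{$_}++ } @cumlist;   # drop duplicates
    }
    foreach my $thing (@cumlist) {
        my $label = defined $numberhash{$thing} ? $numberhash{$thing} : "";
        print $out "$thing $label\n";
    }
}
close($out);

# Append to @cumlist the parent of each MeSH number passed in
sub splitlist {
    my @valuelist = @_;
    foreach my $meshno (@valuelist) {
        if ($meshno =~ /^(.+)\.[0-9]+$/) {
            push(@cumlist, $1);
        }
    }
}

# Prepend to @cumlist every MeSH number assigned to the term behind each number
sub allmeshnums {
    my @meshnumber = @_;
    foreach my $thing (@meshnumber) {
        my $name = $numberhash{$thing};
        next unless defined $name;
        my @valuelist = split(/ /, $namehash{$name});
        @cumlist = (@valuelist, @cumlist);
    }
}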

The output file, mesh.out, is over 9 megabytes in length.

Here is an example of one entry in the output file, mesh.out:

A11.118.637 Leukocytes
A15.145.229.637 Leukocytes
A15.382.490 Leukocytes
A11.118.637.555 Leukocytes, Mononuclear
A15.145.229.637.555 Leukocytes, Mononuclear
A15.382.490.555 Leukocytes, Mononuclear
A15.378 Hematopoietic System
A11.148 Bone Marrow Cells
A15.378.316 Bone Marrow Cells
A12.207.152 Blood
A15.145 Blood
A11.118 Blood Cells
A15.145.229 Blood Cells
A11.329.372.376 Giant Cells, Foreign-Body
A11.502.376 Giant Cells, Foreign-Body
A11.627.624.480.376 Giant Cells, Foreign-Body
A11.733.397.376 Giant Cells, Foreign-Body
A15.382.680.397.376 Giant Cells, Foreign-Body
A15.382.812.522.376 Giant Cells, Foreign-Body
A11.329 Connective Tissue Cells
A11 Cells
A11.502 Giant Cells
A11.118.637.555.652 Monocytes
A11.148.580 Monocytes
A11.627.624 Monocytes
A11.733.547 Monocytes
A15.145.229.637.555.652 Monocytes
A15.378.316.580 Monocytes
A15.382.490.555.652 Monocytes
A15.382.680.547 Monocytes
A15.382.812.547 Monocytes
A11.627 Myeloid Cells
A11.733 Phagocytes
A15.382.680 Phagocytes
A15.382 Immune System
A15 Hemic and Immune Systems
A11.329.372 Macrophages
A11.627.624.480 Macrophages
A11.733.397 Macrophages
A15.382.680.397 Macrophages
A15.382.812.522 Macrophages
A15.382.812 Reticuloendothelial System
A12.207 Body Fluids
A12 Fluids and Secretions

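When we examine the multi-lineage ancestry of "Foreign body giant cells" (MeSH name: Giant Cells, Foreign-Body), we see that MeSH is not a tree hierarchy: the term carries six MeSH numbers of its own, and its ancestors reach into three top-level categories (A11, A12 and A15). This means that the MeSH data structure is highly complex and requires some computational know-how to fully explore all the term relationships.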

Jules Berman

tags: Perl programming for medicine and biology, nomenclature, thesaurus, nlm, medical subject headings, open source, medical indexing, medical data retrieval, medical informatics, biomedical informatics, national library of medicine