Saturday, January 30, 2010

One-way hash: Perl, Python, Ruby

I have prepared short scripts, in Perl, Python, and Ruby, for implementing one-way hash operations. One-way hashes are extremely important in medical informatics. The following text is extracted from a public domain document that I wrote in 2002 (1).

A one-way hash is an algorithm that transforms a string into another string in such a way that the original string cannot be calculated by operations on the hash value (hence the term "one-way" hash). Examples of public domain one-way hash algorithms are MD5 and SHA (Secure Hash Algorithm). These differ from encryption protocols, which produce an output that can be decrypted by a second computation on the encrypted string.

The resultant one-way hash values for text strings consist of near-random strings of characters, and the length of the strings (i.e., the strength of the one-way hash) can be made arbitrarily long. Therefore, the name space for one-way hashes can be so large that the chance of hash collisions (two different names or identifiers hashing to the same value) is negligible. For the fussy among us, protocols can be implemented that guarantee a data set free of hash collisions, but such protocols may place restrictions upon the design of the data set (e.g., precluding the accrual of records to the data set after a certain moment).

In theory, one-way hashes can be used to anonymize patient records while still permitting researchers to accrue data over time to a specific patient's record. If a patient returns to the hospital and has an additional procedure performed, the record identifier, when hashed, will produce the same hash value held by the original data set record. The investigator simply adds the data to the "anonymous" data set record containing the same one-way hash value. Since no identifier in the experimental data set record can be used to link back to the patient, the requirements for anonymization, as stipulated in the E4 exemption, are satisfied (vide supra).
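As a minimal illustration of this record-linkage idea (not one of the scripts from the page linked below), the following Python sketch hashes a hypothetical patient identifier with SHA-256; the identifier string shown here is an assumption made up for the example.

import hashlib

def record_key(patient_identifier):
    # One-way hash of a patient identifier, usable as an anonymous record key.
    # SHA-256 stands in for the MD5/SHA algorithms mentioned in the post.
    return hashlib.sha256(patient_identifier.encode("utf-8")).hexdigest()

# The same identifier always yields the same key, so newly accrued data can be
# attached to the existing "anonymous" record without storing the identifier.
print(record_key("HOSPITAL-ID-0012345"))
print(record_key("HOSPITAL-ID-0012345"))      # identical to the line above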

There is no practical algorithm that can take an SHA hash and determine the name (or the social security number, or the hospital identifier, or any combination of the above) that was used to produce the hash string. In France, name-hashed files from many different hospitals are merged and used in epidemiologic research; the hash codes are used to link patient data across hospitals. These methods have been registered with SCSSI (Service Central de la Securite des Systemes d'information).

Implementation of one-way hashes creates some practical problems. Attacks on one-way hashed data may take the form of hashing a list of names and looking for matching hash values in the data set. This can be countered by encrypting the hash, by hashing a secret combination of identifier elements, or by keeping the hash values private (hidden). Issues also arise from the multiple ways that a person may be identified within a hospital system (Tom Peterson on Monday, Thomas Peterson on Tuesday), all resulting in inconsistent hashes for a single person. Resolving these problems is an interesting area for further research.
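One way to blunt the list-hashing attack described above is to key the hash with a secret value known only to the data manager. The sketch below uses Python's standard hmac module; the HMAC construction, the secret value, and the identifier format are my assumptions for illustration, not necessarily what the linked scripts do.

import hashlib
import hmac

# Hypothetical secret held by the data manager; never distributed with the data set.
SECRET_KEY = b"institutional-secret-value"

def protected_key(identifier):
    # Keyed one-way hash (HMAC-SHA-256): without the secret key, an attacker
    # cannot reproduce these values by hashing a list of candidate names.
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

print(protected_key("Thomas Peterson|1950-07-04"))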

The scripts are available at:

http://www.julesberman.info/factoids/1wayname.htm


1. Berman JJ. Confidentiality for Medical Data Miners. Artificial Intelligence in Medicine 26:25-36, 2002.

- Jules Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to read more about my book. Google books has prepared a generous preview of the book contents. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, encryption, security, informatics, perl programming, ruby programming, python programming, jules j berman, md5, sha, 1-way hash, 1way hash, oneway hash, confidentiality, de-identification

Friday, January 29, 2010

Chronology of Earth Website Updated

I've just updated the Chronology of Earth website. It now has about 500 entries, covering significant events of terran interest.

The site is available at:

http://www.julesberman.info/chronos.htm


If you spot any inaccuracies on the page, please let me know.

- © 2010 Jules Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to read a little about my book. Google books has prepared a generous preview of the book contents. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, world timeline, world time line, terran chronology, terrestrial chronology, chronology of science, timeline of science, science events, history of science, medical history, history of earth, science through history, science through the ages, science past and present, date of occurrence

Thursday, January 28, 2010

Scripts for fetching and testing web pages

Web pages are files (usually in HTML format) that reside on servers that accept HTTP requests from clients connected to the Internet. Browsers are software applications that send HTTP requests and display the received web pages. Using Perl, Python, or Ruby, you can automate HTTP requests. For each language, the easiest way to make an HTTP request is to use a module that comes bundled as a standard component of the language.

I've written very simple scripts, in Perl, Python, and Ruby, for fetching web files. The scripts, and an explanation of how they work, are available at:

http://www.julesberman.info/factoids/url_get.htm


Perl, Python, and Ruby each provide their own module for HTTP transactions, and each language's module has its own peculiar syntax. Still, the basic operation is the same: your script initiates an HTTP request for a web file at a specific network address (the URL, or Uniform Resource Locator); a response is received; the web page is retrieved, if possible, and printed to the monitor. Otherwise, the response will contain some information indicating why the page could not be retrieved.
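To give a concrete sense of that request-and-report cycle, here is a minimal Python sketch that uses only the standard library. It is an illustration written for this post, not one of the scripts on the page linked above, and the target URL is simply an example.

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

url = "http://www.julesberman.info/factoids/url_get.htm"

try:
    # Send the HTTP request and read the response body.
    with urlopen(url, timeout=10) as response:
        page = response.read().decode("utf-8", errors="replace")
        print(page)                                   # the retrieved web page
except HTTPError as error:
    print("Server returned an error:", error.code)    # e.g., 404 Not Found
except URLError as error:
    print("Could not reach the server:", error.reason)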

With a little effort, you can use these basic scripts to collect and examine a large number of web pages. With a little more effort, you can write your own spider software that searches for web addresses within web pages, and iteratively collects information from web pages within web pages.

© 2010 Jules J. Berman

key words: testing link, ruby programming, perl programming, python programming, bioinformatics, valid web page, web page is available, good http request, valid http request, testing if web page exists, testing web links, jules berman, jules j berman, Ph.D., M.D.
Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Wednesday, January 27, 2010

Familial and Heritable Neoplasm Syndromes

Yesterday, I revised my web page on familial and inherited neoplasm syndromes.

The former version of the page was a computer-generated compilation of every OMIM (Online Mendelian Inheritance in Man) entry that contained the name of a cancer somewhere in the record.

The current version is a pared-down list of about 230 conditions that fall into one of three categories:

1. Familial cancer and neoplastic syndromes (i.e., affected individuals have an inherited predisposition to a particular set of neoplasms).

2. Mutations in the germ line (i.e., in every cell of the body) that predispose to neoplasia.

3. Inherited diseases or diseases with a heritable component that predispose to neoplasia.

The list is not limited to cancers, and includes benign tumors and hamartomas. The list is described in my book,

Neoplasms: Principles of Development and Diversity.


- © 2010 Jules J. Berman, Ph.D., M.D.
In June 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases, was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site.

tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, pathology, neoplasms, cancer

Tuesday, January 26, 2010

THYROID PRECANCER

Most, if not all, cancers are preceded by a precancerous lesion, which has a number of biologic and morphologic features that differ from those of the fully developed cancer. Precancers are much easier to treat than cancers. By treating precancers, we can prevent cancers from developing.

In prior blog posts, I have discussed the biological properties of the precancers. One of these properties is an observed co-occurrence with cancers. Basically, if a particular type of cancer arises from a precancer, you would expect to see some instances wherein the cancer co-occurs with the precancer (i.e., where the cancer can be seen adjacent to its precancer). Co-occurrence of precancer and cancer is rare because cancers overgrow and replace their precancers.

Papillary thyroid carcinoma accounts for about 80% of thyroid cancers diagnosed in the U.S. Until recently, little attention has been directed to the putative precursor lesion of this cancer. A recent paper by Cameselle-Teijeiro and associates described a case of papillary thyroid carcinoma adjacent to a focus of solid cell nest hyperplasia (1). The authors microdissected both lesions and found the same BRAF mutation in the solid cell nests and in the adjacent cancer.

Their findings suggest that solid cell nest hyperplasia is the precancerous lesion for the adjacent cancer.

In their case report, the particular type of papillary carcinoma was the follicular variant of papillary microcarcinoma. More research is necessary to answer the following questions:

1. Is their observation generalizable (i.e., can it be shown that solid cell nest hyperplasia is found in additional cases of thyroid carcinoma)?

2. Does solid cell nest hyperplasia have all of the defining properties of a precancer?

3. If so, for which thyroid cancers is solid cell nest hyperplasia the precancer (i.e., is it the exclusive precancer for the follicular variant of papillary microcarcinoma, or is it the precancer of other types of cancers that arise from thyroid follicle cells)?

[1] Cameselle-Teijeiro J, Abdulkader I, Pérez-Becerra R, Vázquez-Boquete A, Alberte-Lista L, Ruiz-Ponte C, Forteza J, Sobrinho-Simoes M. BRAF mutation in solid cell nest hyperplasia associated with papillary thyroid carcinoma. A precursor lesion? Hum Pathol 40:1029-1035, 2009.

In June 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases, was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site.

© 2010 Jules Berman

tags: papillary carcinoma of the thyroid, papillary carcinoma of thyroid, thyroid cancer, precancer, thyroid precancer, precancerous, premalignant, early cancer, early lesions, pathology, pathogenesis, carcinogenesis, informatics, jules j berman, orphan drugs, rare diseases, genetic diseases, orphan diseases

Monday, January 25, 2010

COMPLEXITY 8

This is the eighth and last post in a series on complexity in scientific research. The theme of this collection is that scientific progress, particularly in the realm of healthcare, has declined as a consequence of the high complexity in software and other technologies.

-- POST BEGINS HERE --

The U.S. military enjoys working on huge, complex projects, and the scientists involved in these projects will go to extremes to keep failed efforts alive. A fine example of a long-running military research effort is the V-22 Osprey, affectionately renamed "The Grand Ole Osprey." In the history of engineering, there have been many attempts at dual-purpose devices: automobiles that can sprout wings and fly, boats that come ashore and convert to automobiles, washing machines that also dry clothes, houses on wheels that can be towed behind a car. All of these devices exist, but they have not replaced their single-purpose components. It's very difficult to engineer a reliable and inexpensive composite device when each component is complex.

Circa 1980, the Pentagon decided it needed a hybrid aircraft that could take off and land like a helicopter, but fly like a plane. Thus began the long, expensive, and disappointing saga of the V-22 Osprey. After more than a quarter century in the making, and $16 billion spent, the U.S. government has not created a safe and dependable vertical take-off airplane. Through the years, multiple crashes of the Osprey have resulted in 30 deaths.

In January 2001, The Washington Post reported that a Marine Lieutenant Colonel had been fired for falsifying Osprey records and for ordering the members of his squadron to do the same (1). "We need to lie or manipulate the data, or however you wanna call it," he said (1). The lies were intended to win new funding, but a squadron member caught the orders on tape, and the plan backfired.

Despite these problems, funding continued (2). The Osprey became operational in 2006 and is currently used in a wide variety of military operations.

If you try hard enough and long enough, it is possible to create functional complex products (hospital information systems, manned expeditions to Mars, supersonic transports). The purpose of this series of posts is to show that complexity has very high costs. Aside from the money, the highest cost of a complex item comes from our inability to fully understand what we have created. We cannot always predict how complex objects will operate, how they will fail, or how they can be fixed. In many cases, it is better to acknowledge our limitations by building very simple systems and by developing ways of simplifying complex systems.

[1] Ricks TE. Data Faking Could Lower Osprey's Prospects Further. Washington Post Jan 21, 2001.

[2] Berler R. Saving the Pentagon's Killer Chopper-Plane. 22 years. $16 billion. 30 deaths. The V-22 Osprey has been an R&D nightmare. But now the dream of a tilt-rotor troop transport could finally come true. Wired 13.07, July 2005.

© 2010 Jules Berman

key words: informatics, complexity, jules j berman, medical history
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

Sunday, January 24, 2010

Grabbing, Testing Web Links

Once more, I'm interrupting a series of blogs on the topic of complexity to provide a pointer to a recently constructed web page showing how to grab web pages and test web links, using Perl, Python, or Ruby.

Perl, Python, and Ruby each provide their own module for HTTP transactions, and each language's module has its own peculiar syntax. Still, the basic operation is the same: your script initiates an HTTP request for a web file at a specific network address (the URL, or Uniform Resource Locator), and a response is received that indicates whether the page is available (equivalent to testing a link). With a little effort, you can modify the provided scripts to collect and examine a large number of web pages. With a little more effort, you can write your own spider software that searches for web addresses, iteratively collecting information from links within web pages.
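As a hedged sketch of the link-testing idea (not the script from the page linked below), the following Python function sends a HEAD request, so that only the status line and headers come back, and reports whether the link resolves; the example URL is arbitrary.

from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def link_is_good(url):
    # A HEAD request asks the server for headers only, which is enough to
    # learn whether the page is available without downloading its content.
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as response:
            return 200 <= response.status < 400
    except (HTTPError, URLError):
        return False

print(link_is_good("http://www.julesberman.info/factoids/url_get.htm"))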

The article is at:

http://www.julesberman.info/factoids/url_get.htm


© 2010 Jules J. Berman

key words: html, http, hypertext transfer protocol, testing link, ruby programming, perl programming, python programming, bioinformatics, valid web page, web page is available, good http request, valid http request, testing if web page exists, testing web links, jules berman, Ph.D., M.D.

Saturday, January 23, 2010

Batch image conversions

I'm interrupting a series of blogs on the topic of complexity to provide a pointer to my recently constructed web page on batch image conversions.

When you write your own image software, you can automate activities that would otherwise require repeated operations, on multiple image files, with off-the-shelf image processing software. For example, you might want to delete, add, or modify annotations for a group of images, or you might want to resize an image collection to conform to specified dimensions. When you have more than a few images, you will not want to repeat the process by hand for each image. When you have thousands of images, stored in a variety of image formats, it will be impossible to implement global conversions if you do not know how to batch your operations.

The Batch conversions: Perl, Python, Ruby page provides three equivalent scripts, in Perl, Python, and Ruby, that convert a batch of images from color to greyscale. The scripts are preceded by a step-by-step explanation of the code.
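For readers who want to see the shape of such a batch job before visiting the page, here is a minimal Python sketch that converts every JPEG in a directory to greyscale. It uses the Pillow imaging library rather than an ImageMagick-based approach, and the directory names are hypothetical.

from pathlib import Path
from PIL import Image

source_dir = Path("color_images")      # directory holding the original images
output_dir = Path("greyscale_images")  # where the converted copies are written
output_dir.mkdir(exist_ok=True)

for image_path in source_dir.glob("*.jpg"):
    with Image.open(image_path) as img:
        grey = img.convert("L")        # "L" mode = 8-bit greyscale
        grey.save(output_dir / image_path.name)
        print("converted", image_path.name)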

- © 2010 Jules Berman

tags: perl programming, python programming, ruby programming, image magick, imagemagick, grayscale, greyscale, big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, conversion, image format, image processing, perl, python, ruby, Jules J. Berman, Ph.D., M.D.
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

Friday, January 22, 2010

COMPLEXITY 7

This is the seventh in a series of new posts on the subject of complexity in scientific research. The theme of this collection is that scientific progress, particularly in the realm of healthcare, has declined as a consequence of the high complexity in software and other technologies.

-- POST BEGINS HERE --

The National Reconnaissance Office is the U.S. agency that handles spy satellites. In 1998, the agency offered a contract to build a new generation of satellites. The contract went to Boeing, which had never built the kind of satellite specified in the contract. According to an investigative article written for the New York Times, the Boeing engineers designed subsystems of such complexity that they could not be built (1). Because the workforce was inexperienced in assembling a satellite, they used construction materials that were inappropriate for spacecraft. Most noteworthy was their planned use of tin parts, which deform in space, sometimes leading to short circuits. Seven years later, the project was killed, after running up costs estimated as high as $18 billion. Experts reviewing the failed project indicated that it was doomed from the start. Basically, the level of complexity of the project exceeded Boeing's ability to fulfill the contract, and exceeded the government's ability to initiate and supervise the contract (1).

There are projects that tantalize, hovering just outside human reach: sending men to Mars, commercializing supersonic transport jets, long-term stock market predictions, introduction of species to a foreign ecological environment, tamper-proof computerized voting machines, planned tactical warfare, etc. It is not as though the world does not contain complex and functional objects. Jet planes, supercomputers, skyscrapers, telecommunication satellites, butterflies, and humans are just a few examples. These highly complex objects all arose from less complex objects. Butterflies and humans slowly evolved, over billions of years, from an early life form. Jets and other complex machines were built by teams of humans, working from a collective experience, adding improvements incrementally, over decades.

[1] Taubman P. Failure to Launch: In death of spy satellite program, lofty plans and unrealistic bids. The New York Times. November 11, 2007.

-- TO BE CONTINUED --

© 2010 Jules Berman

key words: informatics, complexity, jules j berman, institutional memory, medical history, blog list

Thursday, January 21, 2010

COMPLEXITY 6

This is the sixth in a series of new posts on the subject of complexity in scientific research. The theme of this collection is that scientific progress, particularly in the realm of healthcare, has declined as a consequence of the high complexity in software and other technologies.

-- POST BEGINS HERE --

"Any informatics problem can be solved by adding an extra layer of abstraction."
- Anonymous source, sometimes referred to as the golden rule of computer science

Much can be learned about the consequences of complexity by reviewing technology disasters. A 2003 article in the British Medical Journal described a project to install a computerized, integrated hospital information system in the Limpopo (Northern) Province of South Africa. This poor province has 42 hospitals and invested heavily to acquire the system. This fascinating article describes what went wrong and provides a list of factors that led to the failure of the system. These included a failure to take into account the social and cultural milieu in which the system would be used, an underestimation of the complexity of the undertaking, and insufficient appreciation of the length of training required by the hospital staff.

One of the most challenging features of many Hospital Information Systems (HIS) is computerized physician order entry (CPOE). The intent of CPOE is to eliminate the wasteful hand-written (often illegible) doctor's orders that may need to be transcribed by nurses, pharmacists, and laboratory personnel before finally being entered into the HIS. Having physicians directly enter their orders into the HIS has been a long-awaited dream for many hospital administrators. In a fascinating report, patient mortality was shown to increase after implementation of CPOE. In this study, having CPOE was a strong, independent predictor of patient death. Somehow, a computerized service intended to enhance patient care had put patients at increased risk (1).

High-tech medical solutions seldom achieve the desired effect when implemented by low-tech medical staff. Introducing complex informatics services, such as CPOE, requires staff training. There needs to be effective communication between the clinical staff and the hospital IT staff and between the hospital IT staff and the HIS vendor staff. Everyone involved must cooperate until the implemented system is working smoothly. This is virtually impossible. Hospital personnel know that a wide range of standard practices (such as complex tests, tests using specialized imaging equipment, procedures that require patient preparation or transportation, timed-interval dosage administration, expert consultations, interventions that require close attending staff supervision) become very iffy on weekends, holidays, and after about 4:00 PM on weekdays. It is difficult to get shift workers to interface seamlessly with a computer system that never sleeps.

When it comes to hiding in the safe shadow of complexity, nobody does it better than software designers. They will take a problem, such as computer-aided diagnosis or computer-aided medical decision-making, and produce a software application that purports to provide an answer. Your input is an x-ray or a series of lab tests and clinical findings, and out comes a diagnosis. We fool ourselves into thinking that the designers of complex software systems must understand how the system works. Not so. Computers allow us to design complex, interdependent systems that are unpredictable and inherently chaotic.

Software failure is probably the most sensitive indicator of the limits of complexity. It is very easy to create software that works at a level of complexity beyond anything found in physical systems. The weakest programmers tend to fix bugs with layers of subroutines. Stronger programmers will simplify the problem and re-write their code, eliminating unnecessary subroutines. A 1995 report by the Standish Group showed that most software projects sponsored by large companies are failures. Only 9% of such projects are finished on time and within budget, and many of the finished projects do not meet the required performance specifications (2). Complexity is a plague on almost every area of science.

Probably the most famous medical software disaster involved the Therac-25 (3). Between 1985 and 1987, at least 6 patients received massive overdoses of radiation due to a software error in a radiation therapy device. A review of the incidents uncovered numerous errors in the engineering and in the procedures for detecting and correcting software problems.

Medical software errors are not rare. An FDA analysis of 3,140 medical device recalls conducted between 1992 and 1998 revealed that 242 of them (7.7%) were attributable to software failures. Of those, 192 (79%) were caused by software changes made after the software's initial production and distribution (4).

[1] Han YY, Carcillo JA, Venkataraman ST, Clark RS, Watson RS, Nguyen TC, Bayir H, Orr RA. Unexpected increased mortality after implementation of a commercially sold computerized physician order entry system. Pediatrics 116:1506-1512, 2005.

[2] The Standish Group Report: Chaos. http://www.projectsmart.co.uk/docs/chaos-report.pdf, 1995.

[3] Leveson N. Medical Devices: The Therac-25. Appendix A in: Leveson N. Safeware: system safety and computers, Addison-Wesley, Reading, 1995.

[4] General Principles of Software Validation; Final Guidance for Industry and FDA Staff. January 11, 2002.

-- TO BE CONTINUED --

© 2010 Jules Berman

key words: informatics, complexity, jules j berman, medical history
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

Wednesday, January 20, 2010

COMPLEXITY 5

This is the fifth in a series of new posts on the subject of complexity in scientific research. The theme of this collection is that scientific progress, particularly in the realm of healthcare, has declined as a consequence of the high complexity in software and other technologies.

-- POST BEGINS HERE --

Gone are the days when a scientist could describe a simple, elegant experiment (on a mouse, a frog, or some easily obtained chemical reagents) and another scientist would, in a matter of a few hours, repeat the process in his own laboratory. When several laboratories perform the same experiment, using equivalent resources, and producing similar results, it is a safe bet that the research is valid (1).

Today, much of research is conducted in a complex, data-intensive realm. Individual studies can cost millions of dollars, involve hundreds of researchers, and produce terabytes of data. When experiments reach a high level of cost and complexity, repetition of the same experiment, in a different laboratory, becomes impractical.

In the late 1990s, a variety of data-intensive methods were developed for molecular biology, all of which generated vast amounts of data, requiring complex and sophisticated algorithms to convert the raw data into measured quantities and to analyze the huge assortment of measurements. One such method is the gene expression microarray. In these studies, RNA molecules in tissue samples are converted to DNA and incubated against an array of pre-selected DNA samples. DNA sequences in the sample and the microarray that match will, under precise conditions, anneal to form double-stranded molecules. The number of matches can be semi-quantitated, and a profile of the relative abundance of every RNA species in the original sample can be produced and compared with the profiles of other specimens. Using these profiles, medical researchers have tried to identify profiles (of diseased tissues) that predict responsiveness to particular types of treatment. In particular, researchers have tried to use cancer tissue profiles to predict the likelihood that a specific tumor will respond to a specific type of treatment. Since the late 1990s, an enormous number of studies have been funded to produce the tissue microarray profiles for many different diseases, in many different clinical stages, and to correlate these profiles with treatment response.

Because there are so many different variables in the selection of patients, the selection of tissues, the preparation of tissues for annealment, the selection of microarray reagents, the collection of data, the conversion of data to a quantifiable measure, and the methods of analyzing the data, it is impossible for different laboratories to faithfully repeat a microarray experiment. Michiels and co-workers have shown that most microarray studies could not classify patients better than chance (2). Still, the field of microarray profiling continues, as it should, because successful fields must overcome their limitations. Continued efforts may resolve the seemingly intractable problems discussed here, or may open up alternate areas of more fruitful research. Much money has been invested in microarray profiling, and many laboratories depend on the continued funding of this technology. Experience suggests that it takes at least a few decades to thoroughly discredit a well-funded but ill-conceived idea.

Here is another case in point. The U.S. Veterans Administration Medical System operates about 175 hospitals. This is an immense undertaking, but the work is accomplished fairly well, using a rather simple algorithm. The VA hires a bunch of doctors, nurses, and healthcare workers, gives them a set salary, and houses them in hospital buildings. When registered patients appear in their clinics, the VA pays for the supplies necessary to treat the patients. Each year, Congress appropriates the money to keep the VA going the next year. One of the greatest benefits of the VA system is the lack of billing. Patient visits, medical procedures, diagnostics, pharmaceuticals, and other medical arcana are absorbed into the budget. If you were to compare the level of complexity of the VA healthcare system with the level of complexity of 175 private hospitals, you would find the VA system to be a model of simplicity.

Then one day, somebody asked, "Should the VA pay for medical services rendered to veterans who have their own private insurers?" Having no affirmative answer, the VA undertook an effort to pry reimbursements from the private insurers of veterans treated at VA hospitals. Suddenly, billing and expense records became important to the VA, an institution with no experience in fee-for-service care.

The VA planned a $427 million software system to track billing and other financial transactions. The pilot site was the Bay Pines VA, in Florida. After preliminary testing at Bay Pines, the system, known as the Core Financial and Logistics System, or CoreFLS, would be rolled out to all of the VA hospitals nationwide. Unfortunately, the system could not be implemented at Bay Pines. Neither the software nor the humans were up to the job. In 2004, the VA decided to pull the plug on the $472-million system because it did not work (3).

Four years later, in 2008, the Government Accountability Office reviewed the billing performance of just 18 of the 175 or so VA hospitals. It found that these 18 hospitals, in fiscal year 2007, failed to collect about $1.4 billion that could have been paid by private insurers. The report from the Government Accountability Office concluded, "Since 2001 we have reported that continuing weaknesses in VA billing processes and controls have impaired VA’s ability to maximize the collections received from third-party insurers. (4)"

Why, after years of effort, has the VA not succeeded in billing private insurers for VA care received by privately insured veterans? The reason can be distilled in a single word: complexity. Private insurance reimbursement has reached a level of complexity that exceeds the ability of bureaucratic organizations to cope. There are many insurers, each with their own policies and their own obstructionist bureaucracies. When the VA tries to collect from third party payers, they must deal with insurers across fifty states. The VA paid dearly to acquire a financial database that could handle the problem, but the software wasn't up to the job.

Hospital information systems are among the most complex and most expensive software systems. The cost of a hospital information system for a large medical center can easily exceed $200 million. It is widely assumed that hospital information systems have been of enormous benefit to patients, but reports suggest that 75% of installed systems are failures (5). If Hospital Information Systems worked well, why does the cost of healthcare continue to rise? Has information technology eliminated the fragmentation of medical care or reduced the complexities of health payment plans? Evidence for the value of implementing complex health information technology in community hospitals is scant. Most of the credible reports on the benefits of Hospital Information Systems come from large institutions that have developed their own systems incrementally, over many years (6).


[1] Golden F. Science: Fudging Data for Fun and Profit. Time December 7, 1981. http://www.time.com/time/printout/0,8816,953258,00.html

[2] Michiels S, Koscielny S, Hill C. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet 365:488-492, 2005.

[3] De La Garza P, Nohlgren S. VA yanks troubled computer system: the $472-million computer system being tested at Bay Pines just doesn't work, veterans officials say. St. Petersburg Times July 27, 2004.

[4] GAO United States Government Accountability Office Testimony Before the Subcommittee on Health, Committee on Veterans' Affairs, House of Representatives. VA HEALTH CARE: Ineffective Medical Center Controls Resulted in Inappropriate Billing and Collection Practices. Statement of Kay L. Daly Director Financial Management and Assurance. GAO-10-152T October 15, 2009.

[5] Littlejohns P, Wyatt JC, Garvican L. Evaluating computerised health information systems: hard lessons still to be learnt. British Medical Journal 326:860-863, April 19, 2003. http://bmj.com/cgi/content/full/326/7394/860. Comment: The authors report that about three quarters of installed hospital information systems are considered failures.

[6] Chaudhry B, Wang J, Wu S, Maglione M, Mojica W, Roth E, Morton SC, Shekelle PG. Impact of health information technology on quality, efficiency, and costs of medical care. Ann Intern Med 144:742-752, 2006.

-- TO BE CONTINUED --

© 2010 Jules Berman

key words: informatics, complexity, jules j berman, medical history

Tuesday, January 19, 2010

COMPLEXITY 4

This is the fourth in a series of new posts on the subject of complexity in scientific research. The theme of this collection is that scientific progress, particularly in the realm of healthcare, has declined as a consequence of the high complexity in software and other technologies.

-- POST BEGINS HERE --

In the prior blog, I discussed the lack of tangible progress in medical research in the past few decades. The general perception that basic research advances are not yielding clinically useful medical breakthroughs has inspired the "translational research" rhetoric currently spewing from funding agencies. Is it possible that the current generation of medical researchers has made no progress whatsoever? Well, maybe there were a few bright spots. Here are some of the major breakthroughs in medicine occurring since 1960.

1. Zinc drastically reduces childhood deaths from diarrhea, a disease that kills 1.6 million children under the age of five, every year (1).

2. Helicobacter pylori causes gastritis, gastric ulcers, and some stomach cancers (2). A simple antibiotic treatment cures gastritis and reduces the incidence of stomach cancers (3). This work earned the two discoverers, Barry Marshall and Robin Warren, the 2005 Nobel Prize.

3. When babies sleep on their backs, instead of their stomachs, the incidence of SIDS (sudden infant death syndrome, or crib death) plummets (4).

4. Daily aspirin ingestion seems to reduce deaths from cardiovascular disease and colon cancer (5).

The most significant medical advances in the past few decades (and there haven't been many) have been simple measures. All of the great debacles in medicine have been complex. This is because scientific methods have reached a level of complexity that nobody can understand.

Gone are the days when a scientist could describe a simple, elegant experiment (on a mouse, a frog, or some easily obtained chemical reagents) and another scientist would, in a matter of a few hours, repeat the process in his own laboratory. When several laboratories perform the same experiment, using equivalent resources, and producing similar results, it is a safe bet that the research is valid, but we seldom see that kind of validation.

Today, much of research is conducted in a complex, data-intensive realm. Individual studies can cost millions of dollars, involve hundreds of researchers, and produce terabytes of data. When experiments reach a high level of cost and complexity, repetition of the same experiment, in a different laboratory, becomes impractical.

[1] Walt V. Diarrhea: the great zinc breakthrough. Time August 17, 2009.

[2] Warren JR, Marshall BJ. Unidentified curved bacilli on gastric epithelium in active chronic gastritis. Lancet 1:1273-1275, 1983.

[3] Kidd M, Modlin IM. A century of Helicobacter pylori: paradigms lost-paradigms regained. Digestion 59:1-15, 1998.

[4] Vennemann MM, Fischer D, Jorch G, Bajanowski T. Prevention of sudden infant death syndrome (SIDS) due to an active health monitoring system 20 years prior to the public "back-to-sleep-campaigns." Arch Dis Child. Jan 6, 2006.

[5] Writing Group; Hennekens CH, Dyken ML, Fuster V. Aspirin as a therapeutic agent in cardiovascular disease: a statement for healthcare professionals from the American Heart Association. Circulation 96:2751-2753, 1997.

-- TO BE CONTINUED --

© 2010 Jules Berman


My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

Jules J. Berman, Ph.D., M.D.
tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, informatics, complexity, jules j berman, medical history

Monday, January 18, 2010

COMPLEXITY 3

This is the third in a series of new posts on the subject of complexity in scientific research. The theme of this collection is that scientific progress, particularly in the realm of healthcare, has declined as a consequence of the high complexity in software and other technologies.

-- POST BEGINS HERE --

Scientists tell us that they are making great advances in the treatment of cancer. This is not so. The total U.S. cancer death rate has barely budged in the past 60 years. Though deaths from some types of cancer have dropped, these drops have been offset by the rise in deaths caused by other cancers. For the cancers that have dropped the most, stomach cancer and cancer of the uterine cervix, the improved mortality is due to a drop in cancer incidence, not to any progress in treating advanced cancers. The reduced incidence of stomach cancer is generally credited to refrigeration and improved methods of food preservation. With better-preserved food, the incidence of stomach cancer dropped. The drop in cervical cancer has been due to effective Pap smear screening for precancerous lesions (small lesions that precede the development of invasive cancer). When uterine cervical precancers are excised, the cancer never develops. Further reduction in deaths from uterine cervical cancer will probably result from population-wide inoculations with the new HPV vaccine, an effective measure that bypasses the need to treat advanced cancers.

Beginning about 1991, the U.S. has seen a small but continuous drop in the cancer death rate. This drop is due almost entirely to the reduced incidence of lung cancer among men (due to smoking cessation). Even with this small drop, the cancer death rate is still about the same as it was in the middle of the twentieth century (i.e., there is still a net increase in the long-term U.S. cancer death rate, even with the recent drop). With the exception of curing a few types of rare tumors, cancer research has yielded none of the dramatic medical advances we saw in the 1950s against such diseases as polio and tuberculosis.

Cancer death rates that have increased since 1950 include those for esophageal cancer, liver cancer, pancreatic cancer, lung cancer, melanoma, kidney cancer, brain cancer, non-Hodgkin lymphoma, and multiple myeloma. The list includes some of the most common types of cancer. If cancer research were effective, we would have prevented the rise in incidence of these common cancers.

If you speak to cancer researchers, they will tell you that we have made great advances in understanding cancer genetics: the mutations in DNA that contribute to the development of cancers. Yes, we've gotten a lot of news about cancer, but it's mostly bad news. We now know, thanks to billions of dollars of funding, that cancer cells are remarkably complex, often containing thousands of genetic alterations. No two genetically complex cancers can be characterized by the same set of mutations, and no two tissue samples of any one cancer will be genetically identical. The complexity of cancer far outstrips our ability to characterize the alterations in a cancer cell. Consequently, it is highly unlikely that any single drug will correct all of the genetic changes in the cells of advanced (i.e., invasive and metastatic) common cancers. Newly acquired knowledge of cancer genetics has taught us that we cannot cure advanced common cancers with currently available techniques.

There are a few exceptions. Not all cancers are complex. Some cancers (particularly rare inherited cancers and rare cancers occurring in children, and a few types of rare sporadic cancers) are characterized by simple genetic alterations. These genetically simple tumors turn out to be the tumors we can cure or the tumors for which we can most likely develop a cure in the near future. A simple genetic error can be targeted by a new generation of cancer drugs. Unfortunately, the commonly occurring cancers of adults are all genetically complex. It is unlikely that we will be able to cure these tumors anytime soon. The best chance of curing common cancers may come from studying how rare (genetically simple) tumors respond to new types of treatment.

In cancer research, as in so many other modern areas of scientific research, it seems that complexity is a major barrier to scientific progress.

-- TO BE CONTINUED --

© 2010 Jules Berman

key words: informatics, complexity, jules j berman, medical history

Sunday, January 17, 2010

COMPLEXITY 2

This is the second in a series of new posts on the subject of complexity in scientific research. The theme of this collection is that scientific progress, particularly in the realm of healthcare, has declined as a consequence of the high complexity in software and other technologies.

-- POST BEGINS HERE --

If the rate of scientific accomplishment is dependent upon the number of scientists on the job, you would expect that the rate of scientific accomplishment would be accelerating, not decelerating. According to the National Science Foundation, 18,052 science and engineering doctoral degrees were awarded in the U.S. in 1970. By 1997, that number had risen to 26,847, nearly a 50% increase in the annual production of the highest-level scientists (1). In 1953, according to the National Science Foundation, total U.S. expenditures on research and development were $5.16 billion, expressed in current dollar values. By 1998, that number had risen to $227.173 billion, greater than a 40-fold increase in research and development spending (1). The growing work force of scientists failed to advance science very much, but it was not for lack of funds.

The U.S. Department of Health and Human Services has published an interesting document, entitled "Innovation or Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products" (2). The authors note that fewer and fewer new medicines and medical devices are reaching the Food and Drug Administration. Significant advances in genomics, proteomics, and nanotechnology have not led to significant advances in the treatment of diseases. Extrapolating from the level of scientific progress in the past half century, there's not much reason to expect great improvements in the next 50 years. The last quarter of the 20th century has been described as the "era of Brownian motion in health care" (3).

1. National Science Board, Science & Engineering Indicators - 2000. Arlington, VA: National Science Foundation, 2000 (NSB-00-1).

2. Innovation or Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products. U.S. Department of Health and Human Services, Food and Drug Administration, 2004.

3. Crossing the Quality Chasm: A New Health System for the 21st Century. Quality of Health Care in America Committee, editors. Institute of Medicine, Washington, DC., 2001.

-- TO BE CONTINUED --

-© 2010 Jules Berman

key words: complexity, scientific progress, jules j berman, medical history, informatics, software

Saturday, January 16, 2010

COMPLEXITY: PART 1

I'm beginning a series of new blogs written on the subject of complexity in scientific research. The point of this collection of essays is to show that scientific progress has declined as a consequence of the high complexity in software and other technologies.

PROGRESS? WHAT PROGRESS?

When you watch a movie circa 1960, and you look at their streets and houses, and furniture, and clothing, is there any difference between then and now? Not much. Basically, the scientific advances that shape the world today were discovered prior to 1960. The only visible difference between people then and people now is personal appearance. The twenty-first century citizen has abandoned keeping neat and trim, preferring an alluring fat slob look.

What did we have in 1960? We had home television (1947), transistors (1948), commercial jets (1949), computers (Univac, 1951), nuclear bombs (fission in 1945, fusion in 1952), solar cells (1954), fission reactors (1954), satellites orbiting the earth (Sputnik I, 1957), integrated circuits (1958), photocopying (1958), probes on the moon (Lunik II, 1959), practical business computers (1959), and lasers (1960).

These engineering and scientific advancements pale in comparison to the advances in medicine that occurred by 1960. Prior to 1950, we had the basic principles of metabolism, including the chemistry and functions of vitamins; the activity of the hormone system (including the use of insulin to treat diabetes and dietary methods to prevent goiter); and the methodology to develop antibiotics and to use them effectively to treat syphilis, gonorrhea, and the most common bacterial diseases. Sterile surgical technique was practiced, bringing a precipitous drop in maternal post-partum deaths. We could provide safe blood transfusions, using A, B, O compatibility testing (1900). X-ray imaging had improved medical diagnosis.

Disease prevention was a practical field of medical science, bringing methods to prevent a wide range of common diseases using a clean water supply and improved waste management, and safe methods to preserve food, including canning, refrigeration, and freezing. In 1941, Papanicolaou introduced the smear technique to screen for precancerous cervical lesions, resulting in a 70% drop in the death rate from uterine cervical cancer in populations that implemented screening. In 1947, we had strong epidemiologic evidence that cigarettes caused lung cancer.

When we entered 1950, Linus Pauling had essentially invented the field of molecular genetics by demonstrating that a single amino acid mutation accounted for the defective gene responsible for sickle cell anemia. In 1950, Chargaff discovered base complementarity in DNA. Also in 1950, Arthur Vineberg routed an internal mammary artery, in place, to vascularize the heart. In 1951, fluoridation was introduced, greatly reducing dental disease. Then came isoniazid, the drug that virtually erased tuberculosis (1952). Also in 1952, Harold Hopkins designed the fibroscope, heralding fiberoptic endoscopy. In 1953, Watson and Crick showed that DNA was composed of a double helix chain of complementary nucleotides encoding human genes. John Gibbon performed the first open heart surgery using a cardiopulmonary bypass machine (1953), and D.W. Gordon Murray used arterial grafts to replace the left anterior descending coronary artery (the coronary artery bypass graft). Oral contraceptives (birth control pills) were invented in 1954. That same year, Salk developed an effective killed vaccine for polio, followed just three years later by Sabin's live polio vaccine. Thus, in the 1950s, the two most dreadful scourges of developed countries, tuberculosis and polio, were virtually eradicated.

Don't believe those reports announcing longer life expectancies for Americans. The people who are living longer today are the people who were born in the twentieth century and benefited directly from the advances in medicine occurring prior to 1960. Nobody has any way of knowing whether children born in the twenty-first century will live longer lives than their twentieth century predecessors. But their chances for long lives do not look very good. Here are some of the medical reversals that have occurred since 1960.

1. The worldwide spread of AIDS, a virus-spread disease that could have been eradicated with a few simple precautions, but was not.

2. The emergence of multiple drug-resistant tuberculosis. The root cause of the rise of resistant TB is the incomplete treatment of identified patients.

3. The emergence of multiple antibiotic resistant strains of Staphylococcus aureus.

4. Global warming, loss of the ozone layer, and other consequences of atmospheric pollution.

5. Mass starvation.

6. Reduced access to potable water affecting the vast majority of humans.

7. Planetary scale deforestation and desertification.

8. Monoculture of a few favored crops replacing biodiversity.

9. Large scale emergence of invasive and destructive species of plants and animals.

10. Increases in the total number of U.S. deaths from cancer.

11. The re-emergence of resistant insect and other vectors carrying viral and parasitic diseases.

12. Astronomical costs of new medications for chronic diseases, unaffordable to all but a small percentage of the world population.

13. The rising worldwide incidence of obesity and sequelae disorders.

14. The rapid geographic spread of outbreaks of new strains of influenza and other evolving viruses, including HIV and hemorrhagic fever viruses.

-- TO BE CONTINUED --

-© 2010 Jules Berman

key words: complexity, scientific progress, jules j berman, medical history, informatics, software