Monday, June 30, 2008

Biomedicine in the Post-Information Age: 6

This is part six of a multi-part blog on biomedicine in the post-information age.

In the post-information age, solo experts will use three tools (universal access to information, computational power, and the world-wide communications infrastructure) to be innovative and productive, without being employed by bricks-and-mortar institutions.

What is an example of a post-information age innovation? Here is one from my personal experience.

Governments advocate the development of a standard biomedical vocabulary that will put an end to the profusion of non-standard vocabularies that are used to annotate biomedical texts. Text annotation (sometimes called text coding) is a necessary step for data retrieval, indexing, classification, integration, etc.

So, instead of having lots of separate nomenclatures, governments prefer a single, standard nomenclature that everyone uses. In the U.S., England, and much of Europe, there is a push to use SNOMED-CT as the standard medical vocabulary.

There is one problem with this. It has proven impossible to build a single nomenclature that includes all of the terminology used in specialty domains. A specialist in the domain of dermatologic diseases (in which there are many thousands of obscure diseases and multiple synonyms for individual diseases, and very little biological research to relate these diseases with other skin diseases or with systemic diseases) is unlikely to be satisfied with a general disease nomenclature.

In my area of specialty (tumor biology), this is also true. I found the standard nomenclatures (ICD-O, SNOMED-CT, NCI Thesaurus, UMLS metathesaurus) to have only a small number of the neoplasms that can be found in the biomedical literature. In addition, the relationships among the different neoplasms were, in my opinion, not adequately expressed in these standard nomenclatures.

So, I built my own specialty nomenclature, the Developmental Classification and Taxonomy of Neoplasms (usually called the Neoplasm Classification), which includes its own biological hierarchy of neoplasms and which has about ten times as many neoplasm terms as the standard nomenclatures.

Anyone who wants to use a comprehensive, biologically classified list of neoplasms is welcome to use the nomenclature that I, as a post-information age solo expert, developed. This is an open source document available in gzipped XML format at:

http://www.julesberman.info/neoclxml.gz

Or in zipped XML format at:

http://www.julesberman.info/neoclxml.zip
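
For readers who want to examine the file without any special tools, here is a minimal sketch, in Perl, that downloads the gzipped nomenclature, decompresses it, and prints every line containing a query term. The sketch treats the file as plain, line-oriented XML; no particular tag structure is assumed, and the local file names are arbitrary.

#!/usr/bin/perl
# Minimal sketch: fetch the gzipped Neoplasm Classification and print
# every line that contains a query term. No XML tag structure is assumed.
# Requires LWP::Simple (CPAN); IO::Uncompress::Gunzip ships with modern Perl.
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $url   = 'http://www.julesberman.info/neoclxml.gz';
my $query = shift @ARGV or die "Usage: $0 search_term\n";

my $status = getstore($url, 'neoclxml.gz');
die "Download failed (HTTP status $status)\n" unless is_success($status);
gunzip('neoclxml.gz' => 'neocl.xml')
    or die "Decompression failed: $GunzipError\n";

open my $fh, '<', 'neocl.xml' or die "Cannot open neocl.xml: $!\n";
while (my $line = <$fh>) {
    print $line if $line =~ /\Q$query\E/i;   # case-insensitive substring match
}
close $fh;

Running the script with, say, "rhabdomyosarcoma" as the argument prints every record line that mentions the term; anything more ambitious should use a proper XML parser.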

Do you need to abandon the standard nomenclature? No. Use both. I have written extensively on autocoding, double autocoding (autocoding with two or more nomenclatures), re-coding (autocoding again and again to satisfy the requirements of a particular project), and on-the-fly coding. Links to the full-text articles are available on my publications page.
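
To make the idea of lexical autocoding concrete, here is a minimal sketch in Perl; it is a simplification for illustration, not the fast autocoder described in my papers. It assumes a hypothetical nomenclature file with one term and its code per line, separated by a tab, and it matches phrases of one to five words from the input text against that list.

#!/usr/bin/perl
# Minimal sketch of lexical autocoding: load nomenclature terms into a
# hash, then slide a window of 1 to 5 words across the input text and
# report any phrase that exactly matches a nomenclature term.
# Assumed (hypothetical) nomenclature format: one "term<TAB>code" per line.
use strict;
use warnings;

my ($nomen_file, $text_file) = @ARGV;
die "Usage: $0 nomenclature.tab input.txt\n" unless defined $text_file;

my %code_of;
open my $nfh, '<', $nomen_file or die "Cannot open $nomen_file: $!\n";
while (<$nfh>) {
    chomp;
    my ($term, $code) = split /\t/;
    next unless defined $term and defined $code;
    $code_of{lc $term} = $code;
}
close $nfh;

open my $tfh, '<', $text_file or die "Cannot open $text_file: $!\n";
my $text = lc do { local $/; <$tfh> };   # slurp and lowercase the whole file
close $tfh;
$text =~ s/[^a-z0-9]+/ /g;               # crude normalization to words
my @words = split ' ', $text;

for my $i (0 .. $#words) {
    for my $len (reverse 1 .. 5) {       # prefer the longest match at each position
        next if $i + $len - 1 > $#words;
        my $phrase = join ' ', @words[$i .. $i + $len - 1];
        if (exists $code_of{$phrase}) {
            print "$phrase => $code_of{$phrase}\n";
            last;
        }
    }
}

In this sketch, double autocoding would amount to loading a second nomenclature into a second hash and reporting matches from both, and re-coding would amount to re-running the script with a different nomenclature file.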

This is just one example showing how a post-information age individual can contribute in areas that large groups and institutions have ignored.

- Copyright (C) 2008 Jules J. Berman

key words: biomedical informatics, medical informatics, health coverage, health insurance, medical insurance

Sunday, June 29, 2008

Biomedicine in the Post-Information Age: 5

This is part five of a multi-part blog on biomedicine in the post-information age.

As the prior blogs in this series emphasized, the distinctive feature of the post-information age is that everyone has personal access to computational power, communications, and information. In the post-information age, individuals will use these empowering tools to be innovative and productive, without being employed by bricks-and-mortar institutions.

Who gets to be a player in the post-information age?

In the U.S., the lingering impediment to working as a solo information expert is medical insurance. Here, health insurance coverage is usually obtained through employment. Individuals who are not part of an employer's group can be denied health insurance by insurers, for almost any reason. Because medical care is extravagantly expensive in the U.S., it is very important to have a health insurance provider. Many people hold onto unrewarding jobs just for the health insurance available to themselves and their families.

It's really an enormous waste of potential talent, because many of the opportunities for innovation are best accomplished by small groups of experts (maybe 1, 2, or 3 people) who might be geographically dispersed. Countries that guarantee health care to their citizens (e.g., the countries of the EU) will have a significant advantage in the new post-information age.

- Copyright (C) 2008 Jules J. Berman

key words: biomedical informatics, medical informatics, health coverage, health insurance, medical insurance

Saturday, June 28, 2008

Biomedicine in the Post-Information Age: 4

This is part four of a multi-part blog on biomedicine in the post-information age.

To continue from the prior posts, in the post-information age (when everyone has instant access to enormous amounts of information), many services will be rendered by solo experts, who are not employees of bricks-and-mortar institutions, but who are contracted, as needed, to work on specific projects.

These solo contractors will write on-demand software utilities (very short turn-around time), annotate large databases (so that they can be integrated with other databases), check work done by the staff of bricks-and-mortar institutions (for mistakes and weaknesses, for conformity to some specialized standard), munge non-standard data into any of many standards, do literature/data research, assist in writing grants and proposals, etc.

I predict that there will be big changes in the book publishing industry as solo experts begin writing/illustrating/publishing/marketing/distributing their own books. Basically, we're just waiting for someone to market an inexpensive, convenient, high-quality ebook reader. Once we get a good ebook reader, individuals will find that they have the expertise to manage every facet of the book industry. Traditional publishers will have a major role in this post-information age enterprise only if they are willing to change with the times and use their established marketing and production skills to create innovative books that utilize all of the available facilities of the ebook medium (including seamless links between books and other media). If all goes well, the public will benefit from wonderful books that offer a remarkably exciting (and educational) reading experience.

In the post-information age, turn-around for all sorts of services will be shortened. Users will not tolerate a procrastinating work-force. Solo contractors will be valued for their rapid turn-around and high accuracy.

The skill set of the solo contractor will change. Solo contractors will be information integrators, and they will be expected to master ontologies (and RDF, the syntax for expressing them), several high-level programming languages (e.g., Perl, Python, Ruby), collective intelligence tools, web services, at least one highly specialized data domain (such as molecular biology, biomedical imaging, or genetics), and the legal/ethical aspects of their services.

- Copyright (C) 2008 Jules J. Berman

key words: biomedical informatics, medical informatics, big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining

Friday, June 27, 2008

Biomedicine in the Post-Information Age: 3

This is part three of a multi-part blog on biomedicine in the post-information age.

I confess that my core concept of the post-information age comes from Blade Runner. In that movie, there were large bricks-and-mortar corporations that controlled much of the world's industry (constructing dangerous androids, conducting off-world mining operations, etc.), and there were solo contractors who worked in the back rooms of antique stores and noodle shops and who made the android skin or nano-bots or psionic circuits and what-not for the replicants and the off-world mining operations, etc. The idea is that the large companies dealt with the macro-economics but that the little guys had the specialized expertise that was used by the large corporations.

This is how I think of the post-information age. Big corporations like Microsoft and Google will dominate the information world, but highly trained free-lancers will do some of their most specialized work. In the biomedical world, large academic universities and federal and private funding agencies will spearhead huge initiatives (hundreds of millions of dollars), but the most fastidious work will be done by free-lancers.

Why is this? Why won't specialized work be done in-house? When large corporations hire, they are looking for people with a generalized skill-set that is appropriate for the activities of a department. So a Department of Surgery hires lots of surgeons. They may even hire an information officer or two. But they will never be in a position to hire (and keep) someone with all of the computational skills needed for a complex project that collects clinical data and integrates it with biomedical data from heterogeneous sources. It just makes sense to identify one of the few people in the world with the needed skills and have that person help out, when the need arises, for a negotiated fee.

In the post-information age, everyone has access to computers and software and lots of people have access to information. In the case of biomedicine, this information would be public biological databases, and de-identified medical databases, and associated ontologies, nomenclatures and classifications that help integrate all the data. The free-lancers would be hired to add value to or make sense of the data or write software to handle some specific purpose, that sort of thing. In the next few blogs I'll provide some examples.

- Copyright (C) 2008 Jules J. Berman

key words: biomedical informatics, medical informatics, common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, pathology

Thursday, June 26, 2008

Biomedicine in the Post-Information Age: 2

In yesterday's blog, I began a series on the post-information age of biomedicine.

In the post-information age, everyone is empowered with lots of information, as well as the hardware and software tools to use the information.

This means that there will be less dependence on bricks-and-mortar institutions to carry on research, development, and entrepreneurial ventures. People can do an awful lot from their homes, or from nearly any location on the planet.

My guess is that we will see a growing workforce of talented, free-lance technologists who make enormous contributions to biomedical research in the post-information age. These individuals will come from two groups:

1. Recent college graduates, who are technology-enabled and who developed a network of collaborators through social networking sites while they were in college.

2. Retirees, who bring their technical expertise with them into retirement and who are fully capable of leading technologically productive lives from their homes.

Just about everyone else (i.e., age 30 to 60) is fully invested in the bricks-and-mortar paradigm. They're dependent on regular paychecks, and on the family health coverage provided by their employers. They are not sufficiently secure, financially or medically, to leave their jobs to begin a new life as free-lancers.

Tomorrow, I'll discuss the kinds of projects that can be done "from home" in the post-information age.

- Copyright (C) 2008 Jules J. Berman

key words: biomedical informatics, medical informatics

Wednesday, June 25, 2008

Biomedicine in the Post-Information Age: 1

Apparently, we have entered the post-information age. I've been doing a little research on "post-information", and I'm not sure that a good definition exists.

My impression is that you enter a "post-fill in the blank" age when the basic tools and principles of an age are completed and available. In the "post" age, the world develops new and useful advances from pre-existing tools.

So, for example, the industrial revolution involved developing machinery and engines that could perform some of the difficult chores that humans were struggling with: ginning cotton, moving goods from the Eastern states to the Western territories. The post-industrial age involved using the principles of machine design to do clever things, like visiting the moon.

Another example is the genomic and post-genomic ages. The age of the genome was devoted to sequencing the bases in human DNA. Once the human genome was sequenced, the post-genomic age began. Now, we're expected to use our knowledge of genes to cure cancer and halt the aging process.

The information age was focused on building powerful, fast, and affordable computers and on developing computational strategies for collecting, storing, accessing, exchanging, annotating, and analyzing huge amounts of information. We've done that, and we've used computers and software to do many of the tedious tasks that were once done "by hand." So now we're in the post-information age. Now, we're expected to use all that information to do completely new things; things that were not envisioned during the information age.

Over the next few days or weeks, I hope to write a few blogs about what we might expect from the post-information age. As usual, I will tie everything to my favorite subjects (data annotation, classification, new methods of data analysis, and biomedical progress).

- Copyright (C) 2008 Jules J. Berman

key words: biomedical informatics, post-genomic

Thursday, June 19, 2008

Biomedical Informatics Book

I just visited (6/19/08) my book's Amazon page to see how sales were doing, and I found that Amazon has reduced its price. They're currently selling it for $50.91 (a 32% savings) with free shipping. This is pretty good, because they usually sell it with no reduction, or with a negligible reduction.

If you're curious about Biomedical Informatics, here is a list of contents.

0. Preface.

1. What is biomedical data, and what do we do with it?

1.1. Background.

1.2. The challenge of translational research.

1.3. Disasters in translational research.

1.4. The role of biomedical data in translational research.

1.5. Expertise in biomedical informatics.

1.6. The good news: no-cost tools.

1.7. The bad news: the high cost of human cooperation.

1.8. Realistic opportunities for biomedical informaticians.

2. The data of biomedical informatics.

2.1. Background.

2.2. Data files and databases.

2.3. Medical databases and hospital information systems.

2.4. Every patient must be uniquely identified within the system.

2.5. All data entered should be retrievable.

2.6. Entered data should only be modified with great caution.

2.7. The government as a source of biomedical data.

2.8. Your right to obtain government data - freedom of information act.

2.9. Access to research data discovered under u.s. grants.

2.10. Grantees strike back: the u.s. bayh-dole act.

2.11. Intellectual property.

2.12. Fair use and other academic privileges.

2.13. Madey v duke and the erosion of academic privilege.

2.14. Further cautions on the use of proprietary software and data.

2.15. The often misunderstood concept of patient data "ownership".

2.16. Sharing data.

2.17. Legacy data.

2.18. Free, open source and proprietary software and data.

2.19. Undifferentiated software.

2.20. What are some of the open access biomedical databases?

2.21. Open access medical terminologies.

2.22. Mesh (the national library of medicine's medical subject headings).

2.23. Taxonomy.

2.24. Disease data and epidemiologic data.

2.25. The impact of free and open source data and software on biomedical informatics.

3. Confidential biomedical data.

3.1. Background.

3.2. Human subject risks.

3.3. The risk to life and health as a direct result of a medical intervention.

3.4. The risk of loss of database functionality.

3.5. The differences between confidentiality and privacy.

3.6. Example: loss of privacy resulting from participation in a medical study.

3.7. Loss of confidentiality.

3.8. The responsibilities of biomedical informaticians to human subjects.

3.9. Patient record anonymization.

3.10. Patient record de-identification.

3.11. An example of the law of unintended consequences.

3.12. Violations against the common rule (in u.s.).

3.13. Violations against hipaa (in u.s.).

3.14. Tort and violations against individuals.

3.15. What consents does the patient have on record?

3.16. Consented versus unconsented human subject research.

4. Standards for biomedical data.

4.1. Background.

4.2. The criticality of common standards.

4.3. The non-role of government (in the u.s.) in standards-making.

4.4. The hazards of creating a new standard.

4.5. Overview of standards development.

4.6. How are standards developed, approved and adopted?

4.7. The utility of non-standards.

4.8. The non-standard present - specifications and unique objects.

4.9. The non-standard future - data semantics.

4.10. Unique object identifiers.

4.11. Life science unique identifiers.

4.12. HL7 unique identifiers.

4.13. Unique problems associated with uniqueness.

4.14. Specifying information: do you have the time?

4.15. Introduction to meaning.

5. Just enough programming.

5.1. Background.

5.2. Why you should learn some fundamental programming.

5.3. Just enough perl.

5.4. Downloading perl.

5.5. File operations.

5.6. Perl script basics.

5.7. The directory path to perl.

5.8. Accessing files.

5.9. The open1.pl script, line by line.

5.10. An 8-line perl word processor.

5.11. Don't panic! perl will forgive you.

5.12. Pseudocode for a general biomedical informatics program.

5.13. Interactively reading lines from a file.

5.14. Scanning enormous files quickly.

5.15. Getting just what you want with perl regular expressions.

5.16. Pseudocode for common uses of regex (regular expression pattern matching).

5.17. Regular expression syntax.

5.18. Removing periods that do not delineate sentences.

5.19. Counting all the words in a text file.

5.20. Finding the frequency of occurrence of each word in a text file (zipf distribution).

5.21. Creating a persistent database object.

5.22. Retrieving information from a persistent database object.

5.23. Validating xml tags using regular expressions.

5.24. What have we learned?

6. Programming common biomedical informatics tasks.

6.1. Background.

6.2. Computing a one-way hash for a word, phrase or file.

6.3. Simple statistics.

6.4. Invoking statistical tests through perl modules.

6.5. Avoiding type 4 errors with resampling.

6.6. Using random numbers.

6.7. Resampling and monte carlo statistics.

6.8. How often can i have a bad day?

6.9. Rough test of the built-in random number generator.

6.10. The monty hall problem: solving what we cannot grasp.

6.11. Internal and external math modules for perl.

6.12. Using external modules - fast fourier transform.

6.13. Indexing text.

6.14. Searching large text files.

6.15. Finding needles fast using a binary-tree search of the haystack.

6.16. Clustering: algorithms that group similar objects.

6.17. Retrieving information from the internet.

6.18. Gene sequence parsing: finding palindromes in a gene database.

6.19. Why counting is non-trivial and important.

6.20. Why you should write your own counting programs.

6.21. Software utilities versus software applications.

6.22. Software evaluation.

7. Biomedical nomenclatures.

7.1. Background.

7.2. Big nomenclatures and small nomenclatures.

7.3. Curating nomenclatures.

7.4. Automatic expansion of a medical nomenclature.

8. Misbehaving text: dealing with poorly written medical text.

8.1. Background.

8.2. Spelling errors.

8.3. Homonymous terms.

8.4. Abbreviations that are sometimes both acronyms and shortened forms.

8.5. Prepositions and articles retained in an acronym.

8.6. Single expansions with multiple abbreviations.

8.7. Nonsense abbreviations.

8.8. Common usage that confounds meaning.

8.10. Pejorative abbreviations.

8.11. Locale-dependent abbreviations.

8.12. Classifying abbreviations by their expansion algorithms.

8.13. Ephemeral abbreviations.

8.14. Hyponymous abbreviations.

8.15. Polysemous abbreviations.

8.16. Abbreviations masquerading as words.

8.17. Fatal abbreviations: innocent victims of abbreviation drift.

8.18. Forbidden abbreviations.

9. Autocoding unstructured data (narrative text).

9.1. Background.

9.2. Machine translation.

9.3. Autocoding.

9.4. Human fallibility and the limitations of human-collected data.

9.5. A fast lexical autocoder.

9.6. Evaluating autocoders: dealing with precision and recall.

9.7. Other performance issues.

9.8. On-the-fly coded data retrieval without pre-coding.

9.9. Different philosophical approaches to term-based data retrieval.

9.10. Why it is important to have fast autocoding software.

10. Computational methods for de-identification and data scrubbing.

10.1. Background.

10.2. Anonymization, de-identification, data scrubbing.

10.3. Identifiers.

10.4. Stripping identifiers.

10.5. How good is good enough?

10.6. Scrubbing data.

10.7. De-identification algorithms.

10.8. Feasibility of de-identification.

10.9. Non-uniqueness and de-identification.

10.10. Leveraging some confidential information to learn more confidential information.

10.11. Performance considerations for de-identification software.

10.12. De-identification and data sharing patents.

11. Cryptography in biomedical informatics.

11.1. Background.

11.2. One-way hashing algorithms.

11.3. One-way hash weaknesses: dictionary attacks and collisions.

11.4. Zero-knowledge patient reconciliation.

11.5. Threshold protocol.

11.6. Electronic signatures.

12. Describing data with metadata.

12.1. Background.

12.2. Metadata, xml (extensible markup language) and rdf (resource description framework).

12.3. Enforced and defined structure (xml rules and schemas).

12.4. Formal metadata (through the iso11179 specification).

12.5. Namespaces (sharing metadata).

12.6. Linking data via the internet.

12.7. Logic and meaning.

12.8. Self-awareness (embedded protocols and commands).

12.9. Integrating heterogeneous data with rdf.

12.10. Meaning requires a fully-specified subject.

12.11. Meaningful biomedical description with notation 3.

12.12. The daml extension of rdf.

12.13. Owl extension of daml.

13. Simplifying complex data with classifications and ontologies.

13.1. Background.

13.2. The value of hospital information technology.

13.3. Understanding complexity.

13.4. The importance of data simplification.

13.5. Example case: a molecular classification of cancer.

13.6. Cancer nomenclatures, taxonomies, classifications and ontologies.

13.7. Practical limitations of classifications.

13.8. Ontologies: multi-class inheritance and logical inferences.

13.9. Go, the gene ontology that is not an ontology.

14. Clinical trials: the informatician lives in a statistical world.

14.1. Background.

14.2. Do we need clinical trials?

14.3. The length and expense of clinical trials.

14.4. An imaginary clinical trial.

14.5. Modeling a clinical trial.

14.6. What do models tell us?

14.7. The informatics of clinical trials.
14.8. Clinical trials need to be validated by post-trial experience.

15. Distributed computing.

15.1. Background.

15.2. Remote procedure calls, soap, web services and grid computing.

15.3. Data utopia.

15.4. Data dystopia.

16. Grantsmanship for biomedical informaticians.

16.1. Background.

16.2. Institutional risks from biomedical informatics research.

16.3. Funders' risks from biomedical informatics research.

16.4. Suggestions for biomedical informaticians who write grant applications.

17. A practical approach to ethics for biomedical informaticians.

17.1. Background.

17.2. Is it ever ok to lie?

17.3. When can you use unconsented identified medical records?

17.4. When can you use proprietary software and standards?

17.5. When is it ok to have conflicts of interest?

17.6. When is it ok to refuse consent?

17.7. Is it ethical to patent biomedical discoveries?

17.8. The etiquette of free software usage.

17.9. Hoarding research data.

17.10. Are there ethical alternates to hipaa's safe harbor de-identification method?

17.11. Can you use consented data for unconsented research?

17.12. When is it ethical to enforce copyright on medical research publications?

17.13. Is it ok to profit from tissue banking services?

17.14. How likely is a hipaa lawsuit?

17.15. Being fair to the outraged patient.

17.16. When can i be wrong?

17.17. Closing platitudes.

18. References (commented).

19. Appendix.

19.1. The c programming language.

19.2. The java programming language.

19.3. Perl, open source programming language.

19.4. Python, open source programming language.

19.5. Ruby, open source object oriented programming language.

19.6. Swig, open source glue tool.

19.7. Open microscopy environment (ome).

19.8. R open source statistical programming language and bioconductor.

19.9. Open source bioperl, biopython, bioruby.

19.10. Open source electronic laboratory notebook, neurosys.

19.11. Open source gimp image software.

19.12. Open source nih image.

19.13. Pov-ray image rendering open source software.

19.14. Open source compression and archiving utilities (gzip, gunzip, tar, 7-zip, bunzip).

19.15. Cygwin, open source unix/linux emulator.

19.16. Gnupg, open source encryption tool.

19.17. Wget web site mirroring software.

19.18. Open source indexing software (swish-e and lucene).

19.19. Open source wordprocessing software (abiword and openoffice writer).

19.20. Open source emacs text editor.

19.21. Open source spreadsheet software.

19.22. Open source presentation software.

19.23. Mumps, an ansi standard programming language for medical informatics.

19.24. MySQL, open source database software.

19.25. Protege, open source ontology editor.

19.26. Vista, a free hospital information system courtesy of the u.s. government.

19.27. CWM, a closed world machine for rdf (in python).

19.28. Pubmed and pubmed central.

19.29. Resources from the national center for biotechnology information.

19.30. Database issue of nucleic acids research.

19.31. Locuslink and its successor, entrez gene.

19.32. Time stamping.

19.33. Google, as if you didn't already know.

19.34. Sourceforge.

19.35. CVS, concurrent versions system.

19.36. Cpan, the comprehensive perl archive network.

19.37. Requests for comment.

19.38. Omim - online mendelian inheritance in man.

19.39. Loinc, logical observations identifiers, names, and codes.

19.40. HL7 - health level 7.

19.41. Seer.

19.42. UMLS metathesaurus.

19.43. Medical subject headings - mesh.

19.44. Gene ontology - GO.

19.45. OBO (open biology ontologies).

19.46. Ushik metadata registry.

19.47. Neoplasm classification.

19.48. US census.

20. Glossary.

21. List of lists.

22. Index.

23. Author biography.

More book information is available from the publisher's web site.

-Jules Berman

key words: medical informatics, bioinformatics, Perl programming, biomedical data, medical confidentiality, medical privacy, hipaa, big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining

Wednesday, June 18, 2008

Interoperability efforts: please re-think

I'm sure that every reader of this blog has noticed that the word "interoperability" has become hackneyed, along with "standards" and "data integration".

What is wrong with having interoperability for hospitals, and interoperability for biomedical scientists, and interoperability for cancer researchers, and so on?

The problem is that interoperability should extend across individual data domains. It is self-defeating to spend a lot of money (usually taxpayers') on efforts that try to achieve information interoperability for one or another interest group.

Even assuming that you can achieve that kind of within-group interoperability, you'll still be faced with making the hospital data integrate with the medical research data, and the cancer research data, and so on.

Wouldn't it make a lot more sense to develop generalized methods for describing data of any kind? That's what this blog is chiefly about (though I often digress). Anyone with data should use a general syntax for describing the data and for relating the data to other data.

The method for describing and relating data is RDF. The best way to advance data interoperability is to start with an RDF-literate workforce.

Last year I addressed a group of about 100 scientists who had convened to discuss image standards in pathology (my field). I asked for a show of hands for the number of people who had "heard of" RDF. Only two or three people raised their hands.

This is a really big problem. How can you achieve interoperability if nobody speaks the language of data specification, RDF?

I've posted, with Bill Moore, a primer in medical image specification, using RDF. It's not a bad place to start, if you're interested in the subject.
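
To give a flavor of what an RDF description of a medical image looks like, here is a minimal sketch, in Perl, that prints a few triples in Notation 3 (Turtle-style) syntax. The namespace, the property names, and the specimen values are hypothetical placeholders invented for this illustration; they are not drawn from the primer or from any established schema.

#!/usr/bin/perl
# Minimal sketch: emit a few RDF statements, in Notation 3 syntax,
# describing one (hypothetical) pathology image.
use strict;
use warnings;

my $image_uri = 'http://www.example.org/images/specimen_0001.jpg';  # hypothetical
my %triples = (
    'ex:stain'         => '"hematoxylin and eosin"',
    'ex:magnification' => '"40x"',
    'ex:diagnosis'     => '"tubular adenoma of colon"',
);

print "\@prefix ex: <http://www.example.org/schema#> .\n\n";
print "<$image_uri>\n";
print join(" ;\n", map { "    $_ $triples{$_}" } sort keys %triples), " .\n";

The point is not the particular predicates, which any working group would have to agree on, but that every statement is a simple triple (subject, property, value) that can be merged with triples from any other data source.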

-Jules Berman


tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, biomedical informatics, standards organizations, resource description framework, xml, data integration

Thursday, June 12, 2008

Defending Precancer Research: 6 of 6 blogs

As regular readers of this blog know, I am an advocate for studying the precancers. I believe that successful treatment of the precancers is feasible, and that it will lead to the near-eradication of cancer.

In a prior blog, I listed arguments that I have encountered over the years against the importance of precancer research. This is the last of six blogs where I respond to the arguments.

Argument. Treating precancers is not feasible for the majority of precancerous lesions that occur in humans. Reducing the incidence of cervical cancer by treating cervical precancer was possible only because the cervix can be inspected and sampled. There is no equivalent method to find and excise the precancerous lesions of pancreas, lung, prostate and breast. Therefore, procedures to detect and treat most precancers are not practical.

Response. As discussed in a prior blog, it is wrong to think that precancers must be detected and diagnosed prior to treatment.

The paradigm for treating cancer has been:

1. Detect the cancer (usually involves recognizing a sign or symptom or picking up the cancer on a screening test)

2. Diagnose the cancer (usually involves getting a tissue sample through a surgical procedure and sending the sample to a pathologist, who renders a diagnosis indicating the type of tumor and its grade, or level of malignancy; diagnosis is sometimes supplemented with special studies, such as cytogenetics)

3. Stage the cancer (determining how far the tumor may have spread at the time of diagnosis)

4. Treat the cancer (one or more of surgery, chemotherapy, radiation therapy).

5. Follow-up

With precancers, we may be able to skip most of these steps, going straight to treatment. This is because the treatment for precancers can be simple and effective.

If a precancer can be eradicated with a relatively non-toxic systemic drug, or if the transition from precancer to cancer can be delayed with hormonal manipulation, or if the initiation step of carcinogenesis (leading to precancer development) can be blocked with a dietary supplement or a vaccine (e.g., Gardasil for cervical precancer), why not just forgo the detection/diagnosis/staging steps?

The idea of receiving medical treatment for undiagnosed diseases is not new. How many people in the U.S. take statins, even though they have no reason to think that any of their arteries are significantly blocked by atheroma (never had stroke, never had angina, never had claudication, etc.)? How many people in the U.S. are treated for hypertension even if they've never had any of the associated diseases (never had renal failure, never had stroke, etc.)? Virtually everyone in the U.S. has been vaccinated for diseases they do not have (polio, smallpox, tetanus, etc.).

Intelligent people accept treatment for diseases they do not have, because they know how bad such diseases (myocardial infarction, stroke, kidney failure, polio, etc.) can be.

Treating precancers in high-risk people, avoiding the steps of precancer screening, detection, and diagnosis, is an option that we should be studying.

Prior blog in series

Copyright (C) Jules J. Berman 2008

key words: preneoplasia, premalignant, preneoplastic, incipient neoplasia, pre-cancer, dysplasia, metaplasia, intraepithelial neoplasia, premalignancy, premalignancies, precancers, precancerous, pretumor, carcinogenesis, pathology, cancer research, cancer funding, cancer research funding, funding for cancer research

Wednesday, June 11, 2008

Defending Precancer Research: 5

As regular readers of this blog know, I am an advocate for studying the precancers. I believe that successful treatment of the precancers is feasible, and that it will lead to the near-eradication of cancer.

In a prior blog, I listed arguments that I have encountered over the years against the importance of precancer research. This is the fifth of several blogs where I respond to the arguments.

Argument. Precancer research falls under cancer prevention. This is true because when you treat a precancer, you prevent a cancer. Cancer prevention is an adequately funded area of cancer research, so we really do not need to assign any special funding to precancer research.

Response. Prevention is a funded research area at the U.S. National Cancer Institute, within the Division of Cancer Prevention. Cancer prevention involves identifying and eliminating carcinogens as well as adopting lifestyles that are thought to minimize exposure to carcinogens or to increase the ingestion of anti-carcinogens (found in fruits and vegetables) in the diet. These priorities are not really aimed at precancer detection, diagnosis and treatment.

Likewise, the precancers have not been included in the initiatives of NCI's Division of Cancer Treatment and Diagnosis, most of which are aimed at finding or testing new chemotherapeutic agents for cancers.

Precancer treatment is an area that has not fallen neatly into any of the National Cancer Institute Divisions. The precancers have never gotten all the attention they deserve.

Argument. People are dying from malignant tumors every day. Even if we could stop the occurrence of new cancers, by treating the precancers, we cannot deviate from our commitment to help people whose cancers have progressed beyond the precancer stage.

Response. Individuals with advanced, metastatic cancers may or may not benefit from an expansion of research activity in the precancers. At this point, we simply do not know whether agents that treat precancers will have activity against metastatic or invasive tumors. The commonly occurring advanced, metastatic cancers have been an intractable problem, resisting all past efforts.

All those who are attacked by metastatic cancer wish they lived in a world where cancers are stopped before they invade and metastasize. Precancer research will help create a better world for their loved ones.

Next blog entry in precancer series

Copyright (C) Jules J. Berman 2008

key words: preneoplasia, premalignant, preneoplastic, incipient neoplasia, pre-cancer, dysplasia, metaplasia, intraepithelial neoplasia, premalignancy, premalignancies, precancers, precancerous, carcinogenesis, pathology, cancer research, cancer funding, cancer research funding, funding for cancer research

Tuesday, June 10, 2008

Defending Precancer Research: 4

As regular readers of this blog know, I am an advocate for studying the precancers. I believe that successful treatment of the precancers is feasible, and that it will lead to the near-eradication of cancer.

In a prior blog, I listed arguments that I have encountered over the years against the importance of precancer research. This is the fourth of several blogs where I respond to the arguments.

Argument. The mission of the National Cancer Institute, the primary funding agency for cancer research in the U.S., is to develop cures for cancer, not precancer. If precancers were as important as you say they are, there would be a National Precancer Institute. But there isn't.

Response. This argument turns the tables on people, such as myself, who insist that precancer is a lesion that is distinct and separable from cancer. If precancers really are a different lesion from cancer, then why should precancers receive research funds earmarked for cancer? Well, the answer is obvious. The job of the National Cancer Institute is to eliminate cancer through research. Pursuing precancer research is the best strategy to eliminate cancer.

Argument. Precancers regress spontaneously. Why should we try to develop treatments for a disease that usually resolves without treatment?

Response. Yes, many precancers regress spontaneously, and if we were to treat all of the precancers, we would be treating many lesions that would have regressed without treatment. At this time, we cannot distinguish the precancers that will regress from the precancers that will progress to invasive cancer. Until we can distinguish regressing precancers from progressing precancers, we need to treat them all.

At this point, we know almost nothing about the causes of precancer regression. We could potentially cause all precancers to regress, if we knew how to control the conditions that favor regression. If we could arrest the transition of precancer to cancer, we could halt the occurrence of invasive cancers. If we could simply delay the transition of precancers to cancer, even if it were for just a few years, we could greatly reduce the burden of cancer in the population.

Next blog entry in precancer series

Jules Berman

key words: preneoplasia, premalignant, preneoplastic, incipient neoplasia, pre-cancer, dysplasia, metaplasia, intraepithelial neoplasia, premalignancy, premalignancies, precancers, precancerous, carcinogenesis, pathology, cancer research, cancer funding, cancer research funding, funding for cancer research

Monday, June 9, 2008

Defending Precancer Research: 3

As regular readers of this blog know, I am an advocate for studying the precancers. I believe that successful treatment of the precancers is feasible, and that it will lead to the near-eradication of cancer.

In a prior blog, I listed arguments that I have encountered over the years against the importance of precancer research. This is the third of several blogs where I respond to the arguments.

Argument. There are many genetic and morphological disparities among the different recognized precancers. Since these lesions seem to have no properties in common, other than the defining property of "cancer precedence", it hardly seems as though they should be assigned any biological class.

Response. As discussed in Chapter 4, there are many diverse types of precancers. Squamous dysplasia of the uterine cervix, tubular adenoma of colon, Barrett esophagus, myelodysplasia, and nephrogenic rests are all types of precancers, but they seem to be biologically unrelated.

The observation is correct, and the conclusion is valid. The precancers do not form a biological class of neoplasms, any more than flying animals form a single class of related animals.



Figure caption. Public domain image from "The Outline of Science: A Plain Story Simply Told," by J. Arthur Thomson, originally copyrighted 1922, with copyright now expired.

The fact that there are types of precancers that differ greatly from one another does not imply that precancers do not exist or that we cannot study the biology of precancers. Nobody asserts that flight does not exist or that flight is an invalid area of investigation, simply because different classes of animals can fly.

Because precancers come in different biological forms, it is necessary to create a biological categorization of the precancers, so that the types of precancers with similar phenotypes can be grouped together. This was the subject of an open access paper published by Don Henson and myself.

Jules J Berman and Donald E Henson. Classifying the precancers: A metadata approach. BMC Medical Informatics and Decision Making 2003, 3:8.

Next blog in precancer series

Jules Berman

key words: preneoplasia, premalignant, preneoplastic, incipient neoplasia, pre-cancer, dysplasia, metaplasia, intraepithelial neoplasia, premalignancy, premalignancies, precancers, precancerous, carcinogenesis, pathology, cancer research, cancer funding, cancer research funding, funding for cancer research

Sunday, June 8, 2008

Defending Precancer Research: 2

As anyone who reads this blog knows, I am an advocate for studying the precancers. I believe that successful treatment of the precancers is feasible, and that it will lead to the near-eradication of cancer.

In a prior blog, I listed arguments that I have encountered over the years against the importance of precancer research. This is the second of several blogs where I respond to the arguments.

Argument. The transition from precancer to cancer is characterized by the acquisition of invasiveness. However, there is no practical way to determine the precise moment that invasiveness is acquired by a lesion. Therefore, there is no practical method to reliably distinguish a precancer from a cancer in every instance (i.e., there is no way to be confident that a lesion has not acquired the ability to invade). Therefore, precancers have no validity as biological entities.

Response. Biologists are not very adept at determining the precise moment of naturally occurring events. For example, the moment of death has been a subject of debate for centuries. Whenever we think we have a good measurement (e.g., heart-beat cessation, flat electroencephalogram), an exception occurs. Even the phrase, "the patient has expired," is an anachronism that dates to the time when the cessation of breathing (i.e. a final expiration, with no subsequent inspiration) was considered ample demonstration of death. Nonetheless, nobody would argue that death does not exist simply because we fail to accurately measure the moment when life stops.

If we cannot accurately measure the moment when a precancer becomes a cancer, it no more invalidates the existence of the precancer than it invalidates the existence of the cancer.

Next precancer blog entry

Jules Berman

key words: preneoplasia, premalignant, preneoplastic, incipient neoplasia, pre-cancer, dysplasia, metaplasia, intraepithelial neoplasia, premalignancy, premalignancies, precancers, precancerous, carcinogenesis, pathology, cancer research, cancer funding, cancer research funding, funding for cancer research

Saturday, June 7, 2008

Defending Precancer Research: 1

As anyone who reads this blog knows, I am an advocate for studying the precancers. I believe that successful treatment of the precancers is feasible, and that it will lead to the near-eradication of cancer.

In a prior blog, I listed arguments that I have encountered over the years against the importance of precancer research. This is the first of several blogs where I respond to the arguments.

Argument. There is no such thing as a precancer. The lesions that are called precancers are simply early (or small) cancers.

Response. Some people question why we need to specify some lesions as precancers when we know that carcinogenesis is a multistep process and that every cancer traverses many un-named biological states as it develops into a fully malignant lesion. Why can't we recognize that precancers are just an early form of cancer and refer to each precancer by the name of the cancer that it develops into? Wouldn't that make life a lot easier than naming and characterizing a new disease entity for the pre-invasive stage of every cancer?

Much as we all like data simplification, it just can't be done in the case of the precancers.

Precancers have specific, characteristic properties that separate them from cancers. They are not simply small, or early, versions of cancers. These properties, particularly spontaneous regression and the transition from non-invasion to invasion, deserve to be studied. If we are to study the biology of the preinvasive stage of cancers, we will need standard identifiers for this stage, so that all investigators will use the same morphologic and biologic features to identify the same named lesions. Otherwise, the research results, from laboratory to laboratory, will not be comparable, and the field of precancer research will not advance.

Next blog in precancer series

First blog in the series

Jules Berman

key words: preneoplasia, premalignant, preneoplastic, incipient neoplasia, pre-cancer, dysplasia, metaplasia, intraepithelial neoplasia, premalignancy, premalignancies, precancers, precancerous, carcinogenesis, pathology, cancer research, cancer funding, cancer research funding, funding for cancer research

