Monday, February 1, 2016

When to terminate (or at least reconsider) a data repurposing project

"Not everything that counts can be counted, and not everything that can be counted counts." - William Bruce Cameron

The most valuable features of data worth repurposing are (see the sketch following this list for a simple illustration):

1. Data that establishes uniqueness or identity
2. Data that accrues over time, documenting the moments when data objects are obtained (i.e., time-stamped data)
3. Data that establishes membership in a defined group or class
4. Data that is classified, with every object in the knowledge domain assigned to a class
5. Introspective data - data that explains itself
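
As a rough sketch of how these five features might come together in a single record, consider the following Python fragment. The field names and values are invented purely for illustration; they are not drawn from any particular data standard.

    import uuid
    from datetime import datetime, timezone

    # A hypothetical data object exhibiting the five features listed above.
    observation = {
        "object_id": str(uuid.uuid4()),                       # 1. uniqueness/identity
        "timestamp": datetime.now(timezone.utc).isoformat(),  # 2. time-stamped accrual
        "class": "Serum_glucose_measurement",                 # 3. and 4. membership in a classified domain
        "value": 95,
        "metadata": {                                         # 5. introspection: the data explains itself
            "units": "mg/dL",
            "method": "hexokinase assay",
            "source": "laboratory information system",
        },
    }

    print(observation["object_id"], observation["class"], observation["value"])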

A different set of properties characterizes data sets that are virtually useless for data repurposing projects.

1. Data sets that are incomplete or unrepresentative of the subject domain. You cannot draw valid conclusions if the data you are analyzing is unrepresentative of the data domain under study.

Having a large set of data does not guarantee that your data is complete and representative. Danah Boyd, a social media researcher, gives the example of a scientist who is analyzing the complete set of tweets made available by Twitter (1). If Twitter removes tweets containing expletives, or tweets composed of non-word character strings, or tweets containing highly charged words, or tweets containing certain types of private information, then the resulting data set, no matter how large it may be, is not representative of the population of senders (See Glossary item, Privacy versus confidentiality). If the tweets are available as a set of messages, without any identifier for senders, then the compulsive tweeters (those who send hundreds or thousands of tweets) will be over-represented, and the one-time tweeters will be under-represented. If each tweet were associated with an account, and all the tweets from a single account were collected as a unique record, then there would still be the problem created by tweeters who maintain multiple accounts (See Glossary item, Representation bias).
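
To make the sampling issue concrete, here is a toy Python sketch, with invented data, contrasting a per-message view with a per-account view of the same set of tweets. It is meant only to show how the choice of sampling unit shifts the apparent composition of the sender population.

    from collections import Counter

    # Invented data: three accounts, one of which tweets far more than the others.
    tweets = ["@a: spam"] * 98 + ["@b: hello"] + ["@c: hi"]
    senders = [t.split(":")[0] for t in tweets]

    # Per-message view: the compulsive tweeter dominates the sample.
    message_share = {s: n / len(tweets) for s, n in Counter(senders).items()}
    print(message_share)                        # {'@a': 0.98, '@b': 0.01, '@c': 0.01}

    # Per-account view: every sender counts once, regardless of volume.
    accounts = set(senders)
    print({s: 1 / len(accounts) for s in accounts})   # each account contributes about 0.33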

Contrariwise, having a small amount of data is not necessarily fatal for data repurposing projects. If the data at hand cannot support your intended analysis, it may be sufficient to answer an alternate set of questions, particularly if the data indicate large effects and achieve statistical significance. In addition, small data sets can be merged with other small or large data sets to produce representative and complete aggregate data collections.

2. Data that lacks metadata. It may come as a surprise to some, but most of the data collected in the world today is poorly annotated. There is no way to determine how the data elements were obtained, or what they mean, and there is no way of verifying the quality of the data.
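
The difference between a bare data element and an annotated one can be sketched in a few lines. The metadata fields below are invented, but they show the kind of descriptive information that lets a repurposer determine what a value means and how it was obtained.

    # A bare measurement: nothing indicates what it means or how it was obtained.
    bare_value = 98.6

    # The same measurement paired with metadata (field names are illustrative).
    annotated_value = {
        "value": 98.6,
        "property": "body temperature",
        "units": "degrees Fahrenheit",
        "method": "oral thermometer",
        "collected_on": "2016-02-01",
        "protocol": "clinic standard operating procedure, version 2",
    }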

3. Data without unique identifiers. If there is no way to distinguish data objects, then it is impossible to tell whether 10 data values apply to one object or to 10 different objects.
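
A small sketch of the problem, with invented values: without identifiers, ten measurements are just a bag of numbers; once each value is bound to an object identifier, the question of how many objects are represented has a definite answer.

    import uuid

    # Ten values with no identifiers: one object, or ten objects? No way to tell.
    anonymous_values = [7.1, 7.3, 7.2, 7.4, 7.1, 7.0, 7.2, 7.3, 7.5, 7.2]

    # The same values bound to a data object identifier (here, a single object).
    object_id = str(uuid.uuid4())
    identified_values = [{"object_id": object_id, "value": v} for v in anonymous_values]

    distinct_objects = {rec["object_id"] for rec in identified_values}
    print(len(distinct_objects), "distinct object(s)")   # 1 distinct object(s)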

The term "identified data," a concept that is central to data science, must be distinguished from "data that is linked to an identified individual," a concept that has legal and ethical importance. In the privacy realm, the term, "data that is linked to an identified individual," is shortened to "identified data," and this indulgence has caused no end of confusion. All good data must be identified. Private data can be deidentified, in the regulatory sense, by removing any links between the data and the person to whom the data applies (See Glossary items, Deidentification, Deidentification versus anonymization, Reidentification). The data itself should never be deidentified (i.e., a unique alphanumeric identifier for every data object must exist). Removing links that connect the data object to an individual is all that is necessary for so-called privacy deidentification.

4. Undocumented data (e.g., data with no known creator, or no known owner, or with no "rights" statement indicating who may use the data and for what purposes). Data scientists cannot assume that they can legally use every data set that they acquire.

5. Illegal, legally encumbered, or unethical data. Data scientists cannot assume that they have no legal liability when they use data that was appropriated unlawfully.

Data quality is serious business. The U.S. government passed the Data Quality Act in 2001, as part of the FY 2001 Consolidated Appropriations Act (Pub. L. No. 106-554). The Act requires Federal Agencies to base their policy decisions on high quality data and to permit the public to challenge and correct inaccurate data (2), (3). The drawback to this legislation is that science is a messy process, and data may not always attain a high quality. Data that fails to meet standards of quality may be rejected by government committees or may be used to abrogate policies that were based on the data (4), (5).

References:

[1] Boyd D. 2010. "Privacy and publicity in the context of big data." Open Government and the World Wide Web (WWW2010). Raleigh, North Carolina, April 29, 2010. Available from: http://www.danah.org/papers/talks/2010/WWW2010.html, viewed August 26, 2012.

[2] Data Quality Act. 67 Fed. Reg. 8,452, February 22, 2002, addition to FY 2001 Consolidated Appropriations Act (Pub. L. No. 106-554, codified at 44 U.S.C. 3516).

[3] Guidelines for ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated by federal agencies. Federal Register Vol. 67, No. 36, February 22, 2002.

[4] Sass JB, Devine JP Jr. The Center for Regulatory Effectiveness invokes the Data Quality Act to reject published studies on atrazine toxicity. Environ Health Perspect 112:A18, 2004.

[5] Tozzi JJ, Kelly WG Jr, Slaughter S. Correspondence: data quality act: response from the Center for Regulatory Effectiveness. Environ Health Perspect 112:A18-19, 2004.

- Jules Berman (copyrighted material)

key words: data science, data repurposing, data reanalysis, data analysis, primary data, secondary data, data quality act, jules j berman