To the Editor: In February's Taking Issue, Drake and McHugo voice their concern that "large data sets can be dangerous" (1). We believe that the large administrative and operational databases that our society generates on an ongoing basis can contribute to an unprecedented growth of knowledge in medical and behavioral sciences. Instead of focusing on the negative, we propose that researchers focus on the strengths of these data sets and develop new methodologies that will allow us to learn from our massive stores of data regarding public health, vital records, social and criminal justice programs, public and private insurance, and positive participation in society, for example, in school and in gainful employment. We live in an information-rich society. We have a responsibility to use this resource to promote the advancement of knowledge.
Administrative and operational databases have many advantages over narrowly focused, special-purpose data collection. One of the greatest strengths of these databases is comprehensiveness. Minority populations are included in numbers adequate to provide confidence in findings, identical outcome measures for relevant comparison groups exist within the databases, the problem of subjects lost to contact is minimized, and studies can be replicated at minimal cost because the data and analytical tools are already in place. Unlike experimental research, use of administrative and operational databases allows examination of treatments as they are routinely administered in community settings where best practices may not be universal. Administrative and operational databases also avoid many reactive effects of testing (2).
Criticisms of the quality of administrative and operational data frequently overlook the fact that these systems typically include strong data quality controls, such as audits and utilization review. Submission of false-positive records, such as insurance claims and death certificates, is often punishable by fine or imprisonment. Failure to report is limited by economic forces, as in the case of insurance claims, or by legal mandates, as in reports of births and deaths. Critics judge administrative and operational databases by their weakest data elements rather than by the data elements used in an analysis. In mortality databases, for instance, the objective fact of death is rarely refuted. We believe all research, including research using existing databases, should acknowledge the degree of objectivity and subjectivity involved in the creation of the data being analyzed.
The cost of research using large data sets is minimal compared with the cost of research that involves special-purpose data collection. We believe that the research community should discuss the relevance of an ethic of efficiency to medical and behavioral research. Is it ethical to conduct a very expensive study when a very inexpensive study has equal—or superior—promise of generating useful knowledge?
Our administrative and operational databases have the potential to move science forward at a rapid rate. In the past, science tended to operate deductively, seeking repeated verification of hypothesized relationships. As anomalies converged, new theories emerged to challenge the old (3). Our wealth of administrative and operational data could accelerate this process by supporting a world of research in which inductive and deductive models coexist and interact (4). Medical and behavioral researchers should explore new models for creating and sharing knowledge, models that embrace the wealth of information and analytical power that we now have at our fingertips. Instead of looking backward, medical and behavioral research should focus on developing methodologies that maximize the knowledge we extract from our administrative and operational databases.