According to Page 1 of the National Academies of Sciences, Engineering, and Medicine’s Data Science for Undergraduates: Opportunities and Options, “data science spans a broad(er) array of activities that involve applying principles for data collection, storage, integration, analysis, inference, communication, and ethics.”
On Page 3, members of the Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective “… underscore the centrality of studying the many ethical considerations that arise as workers engage in data science. These considerations include deciding what data to collect, obtaining permissions to use data, crediting the sources of data properly, validating the data’s accuracy, taking steps to minimize bias, safeguarding the privacy of individuals referenced in the data, and using the data correctly and without alteration. It is important that students learn to recognize ethical issues and to apply a high ethical standard.”
The following two National Academies of Sciences action items from 2018, also on Page 3, follow directly from the committee’s focus on the centrality of ethical professional practice in data science:
- Recommendation 2.4: Ethics is a topic that, given the nature of data science, students should learn and practice throughout their education. Academic institutions should ensure that ethics is woven into the data science curriculum from the beginning and throughout.
- Recommendation 2.5: The data science community should adopt a code of ethics; such a code should be affirmed by members of professional societies, included in professional development programs and curricula, and conveyed through educational programs. The code should be reevaluated often in light of new developments.
Role for Ethical Data Science Practice Standards
Two books were written in October 2022 to help instructors in all settings respond to Recommendations 2.4 and 2.5. They offer instructors in any higher education department an accessible way to immediately integrate teaching and assessable learning about ethics specific to statistics, to data science, or to both.
The books were designed to get readers to and beyond Bloom's Level 4 (application), so learners can effectively use ethical practice standards, such as those from the American Statistical Association and the Association for Computing Machinery.
The scientific reproducibility crisis is recognized to be at least partly due to poor training relating to data, its analysis, and the communication of those analyses.
As Philip Stark and Andrea Saltelli assert in Cargo-Cult Statistics and Scientific Crisis, “(t)he problem is one of cargo-cult statistics—the ritualistic miming of statistics rather than conscientious practice” (emphasis added). It is difficult to justify failing to prepare data collectors, wranglers, analysts, and interpreters to do each of these tasks in an ethical manner.
As biomedical researchers integrate data science into their studies, it is essential to avoid the cargo-cult approach by ensuring that “conscientious practice” is an integral part of their training. Responsibilities accrue to the user of statistics and data science in each of the seven key tasks on the Statistics and Data Science Pipeline (SDSP):
- Planning/designing
- Data collection/munging/wrangling
- Analysis
- Interpretation
- Documenting your work
- Reporting your results/communication
- Engaging in team science/team work
Ethical challenges can arise in new and wholly unexpected situations throughout a career. It is not plausible to assume that practitioners and users of data science will somehow prepare themselves. In Researcher Requests for Inappropriate Analysis and Reporting: A US Survey of Consulting Biostatisticians, Min Wang, Alice Yan, and Ralph Katz reported that biomedical researchers made ubiquitous requests for unethical data analysis, interpretation, and other data-specific behaviors to consulting biostatisticians, most of whom were federally funded.
The American Statistical Association and Association for Computing Machinery guidelines for ethical practice specifically describe the components of scientific rigor, reproducibility, and responsible conduct of research. They also describe how ethical practitioners can demonstrate that they accept responsibility for the implications of ethical statistics and data science practice for the credibility of science. These guidelines are explicit that ethical practice is the responsibility of all who use these methods and techniques. They specifically charge those users to do rigorous and reproducible work and to take, and require others to take, responsibility in their conduct of research.
Further Reading
National Academies of Sciences, Engineering, and Medicine. 2017. Fostering integrity in research. Washington, DC: The National Academies Press.