The Dark Side of Data Research: An Interview with Laura Norén

Laura Norén writes about the social impact of technology. She has taught Steinhardt’s undergraduate and graduate students Data Science for Social Impact (with Steinhardt Professor Jennifer Hill), Ethics for Data Science, and New Media Research Studio. Norén is director of research at Obsidian Security in Newport Beach, California, and a contributing editor of the Data Science Community Newsletter. She earned her doctorate at NYU’s Graduate School of Arts and Science and her bachelor’s degree at MIT.

We asked her to speak about the ethics of empirical research.


Your class, Ethics for Data Science, focuses on understanding the ethical implications of empirical research. Many people feel that this is a great time to be a researcher because so much data is available, but there is also a dark side to data research. Can you speak to that idea?

First, it is important to note that access to great troves of relevant data is inconsistent across fields. Astronomers have massive amounts of relevant stellar data available to them. Geneticists and neuroscientists have a fair amount of data available to them, too. But social scientists have in some ways been left to rely on imperfect data, say, using Twitter data and the portion of Facebook available to the public as a proxy for human behavior. This sets up a division between academics and employees of companies like Google, Facebook, and Amazon who simply have more relevant data available to them.

With respect to data, we are at the 2.0 stage of development. The 1.0 stage of development started in Europe in the 1800s with the collection of individual and household characteristics that continues with the Census. The word statistics comes from data 1.0 and shares the same root as the word state because the drive to know more about citizens came from the state. Now, in data 2.0, the drive to know more about individuals comes from corporations, driven initially by advertising aims, and scientists, driven largely by the desire to accelerate scientific advances. Both of these present opportunities to improve society, but they also have a dark side. Scientists have a history of placing the advancement of science above the ethical treatment of their subjects. The US Public Health Service and, later, the Centers for Disease Control ran the Tuskegee syphilis study from the 1930s to the 1970s, in which African American men who had syphilis were deliberately left untreated (!) in order to better understand the disease’s progression. That seems crazy now, but at the time, the project’s lead scientists thought they would be able to save thousands of soldiers’ lives, justifying the sacrifice of the 400-600 men in the study. We tend not to agree in hindsight, but our current discomfort with the idea of sentencing innocent people to a life of misery followed by an early death runs against utilitarian ethical principles that posit sacrificing a few for the benefit of the whole. These ethical dilemmas have deep intellectual roots that data science classes typically do not cover.

When we consider data science, we hear echoes of those science-über-alles blinders. One Stanford project used machine learning to predict whether men are homo- or heterosexual. Others have helped government agencies improve war-making technology. That’s why one of the key questions I want students to instinctively ask is: just because I can, does it mean I should? Just because I can try to build a gaydar technology, does it mean I should? What are the consequences going to be for people correctly identified as gay? For people incorrectly identified as gay? For people incorrectly identified as straight? For the social construct of sexuality as a fixed binary characteristic of a person?

How might a researcher who uses data get a result that contains biases rooted in race, gender, class, and other characteristics?

Data science predictions are driven by some combination of the underlying data, the techniques and decisions applied to ‘clean’ those data, and the mathematical models used to draw insights. Because nearly all data are derived from humans and humans are biased, almost all data are also biased. Given the long historical record of differential treatment of people by race, gender, class, sexual orientation, and other characteristics, almost all data science is likely to replicate the biases of the past. If the predictive goal is met more efficiently by strongly considering race, class, gender, sexual orientation, or party affiliation, then the bias may actually be amplified. When bias is predictive, it will become more important to the model.
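To make that amplification concrete, here is a minimal, hypothetical Python sketch (not from the interview; the synthetic data, feature names, and model choice are all illustrative assumptions). A classifier trained on labels that encode a historical preference for one group learns to lean heavily on the protected attribute, even when a legitimate signal is present.

```python
# Hypothetical illustration of bias amplification: labels that encode a
# historical preference for one group teach the model to rely on group
# membership, even when a legitimate signal (skill) is available.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

protected = rng.integers(0, 2, size=n)   # synthetic protected attribute (0/1)
skill = rng.normal(size=n)               # synthetic legitimate predictor

# Past decisions favored the protected==1 group independently of skill,
# so the training labels carry the historical bias.
past_outcome = (0.5 * skill + 1.5 * protected
                + rng.normal(scale=0.5, size=n)) > 1.0

X = np.column_stack([skill, protected])
model = LogisticRegression().fit(X, past_outcome)

# The coefficient on the protected attribute dominates: optimizing accuracy
# on biased labels makes the bias "predictive" and entrenches it.
print(dict(zip(["skill", "protected"], model.coef_[0].round(2))))
```

In an audit along these lines, one would inspect such coefficients (or feature importances) and outcome rates by group before trusting the model’s predictions.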

The bias is always already there. The goal of my class is to teach students how to develop an ethical imagination that allows them to identify biases and think through how they may impact sub-samples and society.

In your classes, you offer practical guidance about how to uncover ethical weaknesses in existing protocols and how to undertake fair data science research. Could you share some ideas on this?

One of the best pieces of practical advice I can offer is this: know your history. Because all data are biased, it is critical to know where the data come from, who collected them, for which purposes, and what happened when they were originally “applied” as tools of insight during data 1.0. If my students are not historical experts with respect to the type of data they have, I encourage them to find subject matter experts to help them tease apart the underlying assumptions and potential biases.

I also teach students how to conduct what I would consider a first-pass ethical audit.

The course is a success if students finish with a healthy skepticism about what they should do with data science and a structured approach to developing an ethical imagination.