This is an extreme story. But it points to a potential problem with people using data to make decisions. Numbers can be seductive. We can grow fixated on them, and in so doing we can lose sight of more important considerations. Zoë Chance lost sight, more or less, of the rest of her life.
Even less obsessive infatuations with numbers can have drawbacks. Consider the twenty-first-century emphasis on testing in American schools—and judging teachers based on how their students score. While the desire for more objective measures of what happens in classrooms is legitimate, there are many things that go on there that can’t readily be captured in numbers. Moreover, all of that testing pressured many teachers to teach to the tests—and worse. A small number, as Brian Jacob and Steven Levitt demonstrated in a paper, cheated outright in administering those tests.
The problem is this: the things we can measure are often not exactly what we care about. We can measure how students do on multiple-choice questions. We can’t easily measure critical thinking, curiosity, or personal development. Just trying to increase a single, easy-to-measure number—test scores or the number of steps taken in a day—doesn’t always help achieve what we are really trying to accomplish.
In its efforts to improve its site, Facebook runs into this danger as well. The company has tons of data on how people use the site. It’s easy to see whether a particular News Feed story was liked, clicked on, commented on, or shared. But, according to Alex Peysakhovich, a Facebook data scientist with whom I have written about these matters, not one of these is a perfect proxy for more important questions: What was the experience of using the site like? Did the story connect the user with her friends? Did it inform her about the world? Did it make her laugh?
Or consider baseball’s data revolution in the 1990s. Many teams began using increasingly intricate statistics—rather than relying on old-fashioned human scouts—to make decisions. It was easy to measure offense and pitching but not fielding, so some organizations ended up underestimating the importance of defense. In fact, in his book The Signal and the Noise, Nate Silver estimates that the Oakland A’s, a data-driven organization profiled in Moneyball, were giving up eight to ten wins per year in the mid-nineties because of their lousy defense.
The solution is not always more Big Data. A special sauce is often necessary to help Big Data work best: the judgment of humans and small surveys, what we might call small data. In an interview with Silver, Billy Beane, the A’s general manager at the time and the main character in Moneyball, said that he had actually begun increasing his scouting budget.
To fill in the gaps in its giant data pool, Facebook too has to take an old-fashioned approach: asking people what they think. Every day as they load their News Feed, hundreds of Facebook users are presented with questions about the stories they see there. Facebook’s automatically collected datasets (likes, clicks, comments) are supplemented, in other words, by smaller data (“Do you want to see this post in your News Feed?” “Why?”). Yes, even a spectacularly successful Big Data organization like Facebook sometimes makes use of the source of information much disparaged in this book: a small survey.
Indeed, because of this need for small data as a supplement to its mainstay—massive collections of clicks, likes, and posts—Facebook’s data teams look different than you might guess. Facebook employs social psychologists, anthropologists, and sociologists precisely to find what the numbers miss.
Some educators, too, are becoming more alert to blind spots in Big Data. There is a growing national effort to supplement mass testing with small data. Student surveys have proliferated. So have parent surveys and teacher observations, where other experienced educators watch a teacher during a lesson.
“School districts realize they shouldn’t be focusing solely on test scores,” says Thomas Kane, a professor of education at Harvard. A three-year study by the Bill & Melinda Gates Foundation bears out the value in education of both big and small data. The authors analyzed whether test-score-based models, student surveys, or teacher observations were best at measuring which teachers most improved student learning. When they put the three measures together into a composite score, they got the best results. “Each measure adds something of value,” the report concluded.
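To make the idea of a composite concrete, here is a minimal sketch in Python of how three standardized measures of teaching quality might be blended into a single score. The weights and inputs below are illustrative placeholders, not the Gates study's actual estimates or data.

```python
# Illustrative sketch only: combining three noisy measures of teaching quality
# into one weighted composite. The weights are hypothetical, not the study's.

def composite_score(value_added, student_survey, observation,
                    weights=(0.5, 0.25, 0.25)):
    """Combine three standardized measures into one weighted composite.

    Each input is assumed to already be a z-score (mean 0, std 1),
    so the measures sit on a comparable scale before weighting.
    """
    w_va, w_sv, w_ob = weights
    return w_va * value_added + w_sv * student_survey + w_ob * observation

# Example: strong test-score gains, average survey results,
# slightly above-average observation ratings.
print(composite_score(value_added=1.2, student_survey=0.0, observation=0.3))
```

The point of the composite is the same as the report's: each measure captures something the others miss, so the blend predicts better than any single number alone.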
In fact, it was just as I was learning that many Big Data operations use small data to fill in the holes that I showed up in Ocala, Florida, to meet Jeff Seder. Remember, he was the Harvard-educated horse guru who used lessons learned from a huge dataset to predict the success of American Pharoah.
After sharing all the computer files and math with me, Seder admitted that he had another weapon: Patty Murray.
Murray, like Seder, has high intelligence and elite credentials—a degree from Bryn Mawr. She also left New York City for rural life. “I like horses more than humans,” Murray admits. But Murray is a bit more traditional in her approaches to evaluating horses. She, like many horse agents, personally examines horses, seeing how they walk, checking for scars and bruises, and interrogating their owners.
Murray then collaborates with Seder as they pick the final horses they want to recommend. Murray sniffs out problems with the horses, problems that Seder’s data, despite being the most innovative and important dataset ever collected on horses, still misses.
I am predicting a revolution based on the revelations of Big Data. But this does not mean we can just throw data at any question. And Big Data does not eliminate the need for all the other ways humans have developed over the millennia to understand the world. They complement each other.
8
MO DATA, MO PROBLEMS?
WHAT WE SHOULDN’T DO
Sometimes, the power of Big Data is so impressive it’s scary. It raises ethical questions.
THE DANGER OF EMPOWERED CORPORATIONS
Recently, three economists—Oded Netzer and Alain Lemaire, both of Columbia, and Michal Herzenstein of the University of Delaware—looked for ways to predict the likelihood of whether a borrower would pay back a loan. The scholars utilized data from Prosper, a peer-to-peer lending site. Potential borrowers write a brief description of why they need a loan and why they are likely to make good on it, and potential lenders decide whether to provide them the money. Overall, about 13 percent of borrowers defaulted on their loan.
It turns out that the language potential borrowers use is a strong predictor of whether they will pay back the loan. And it remains an important indicator even after controlling for other relevant information lenders were able to obtain about those potential borrowers, including credit ratings and income.
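To see what an analysis like this looks like, here is a minimal sketch in Python: indicator features for a handful of phrases, alongside control variables such as credit score and income, fed to a logistic regression predicting default. The applications, phrase list, and scaling below are invented for illustration; they are not the researchers' data or their actual specification.

```python
# Toy sketch of predicting default from loan-description language plus controls.
# All data and feature choices here are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (description text, credit score, income, defaulted?)
applications = [
    ("I promise to pay back, God bless", 640, 32000, 1),
    ("I am debt-free and pay more than the minimum payment", 720, 55000, 0),
    ("need help with minimum payment this month", 600, 28000, 1),
    ("debt-free after graduation, stable job", 700, 60000, 0),
]

phrases = ["god", "promise", "debt-free", "minimum payment"]

def features(text, credit, income):
    # One indicator per phrase, plus the "control" variables, roughly rescaled.
    phrase_flags = [1.0 if p in text.lower() else 0.0 for p in phrases]
    return phrase_flags + [credit / 100.0, income / 10000.0]

X = np.array([features(t, c, i) for t, c, i, _ in applications])
y = np.array([d for _, _, _, d in applications])

model = LogisticRegression().fit(X, y)
# Positive coefficients push toward default; negative ones toward repayment.
for name, coef in zip(phrases + ["credit", "income"], model.coef_[0]):
    print(f"{name:15s} {coef:+.2f}")
```

With a real dataset the size of Prosper's, the phrase coefficients would tell you which words remain predictive of default once credit rating and income are held constant, which is the claim the researchers tested.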
Listed below are ten phrases the researchers found to be commonly used when applying for a loan. Five of them positively correlate with paying back the loan. Five of them negatively correlate with paying back the loan. In other words, five tend to be used by people you can trust, five by people you cannot. See if you can guess which are which.
God
promise
debt-free
minimum payment