But what is really interesting about doppelganger searches, considering their power, is not how they’re commonly being used now. It is how frequently they are not used. There are major areas of life that could be vastly improved by the kind of personalization these searches allow. Take our health, for instance.
Isaac Kohane, a computer scientist and medical researcher at Harvard, is trying to bring this principle to medicine. He wants to collect and organize all of our health information so that instead of using a one-size-fits-all approach, doctors can find patients just like you. Then they can employ more personalized, more focused diagnoses and treatments.
Kohane considers this a natural extension for the medical field and not even a particularly radical one. “What is a diagnosis?” Kohane asks. “A diagnosis really is a statement that you share properties with previously studied populations. When I diagnose you with a heart attack, God forbid, I say you have a pathophysiology that I learned from other people means you have had a heart attack.”
A diagnosis is, in essence, a primitive kind of doppelganger search. The problem is that the datasets doctors use to make their diagnoses are small. These days a diagnosis is based on a doctor’s experience with the patients he or she has treated, perhaps supplemented by academic papers on the small populations that other researchers have encountered. As we’ve seen, though, for a doppelganger search to really get good, it would have to include many more cases.
Here is a field where some Big Data could really help. So what’s taking so long? Why isn’t it already widely used? The problem lies with data collection. Most medical reports still exist on paper, buried in files, and those that are computerized are often locked up in incompatible formats. We often have better data, Kohane notes, on baseball than on health.
But simple measures would go a long way. Kohane talks repeatedly of “low-hanging fruit.” He believes, for instance, that merely creating a complete dataset of children’s height and weight charts and any diseases they might have would be revolutionary for pediatrics. Each child’s growth path could then be compared to every other child’s growth path. A computer could find children who were on a similar trajectory and automatically flag any troubling patterns. It might detect a child’s height leveling off prematurely, which in certain scenarios would likely point to one of two possible causes: hypothyroidism or a brain tumor. Early diagnosis in either case would be a huge boon. “These are rare birds,” according to Kohane, “one-in-ten-thousand kind of events. Children, by and large, are healthy. I think we could diagnose them earlier, at least a year earlier. One hundred percent, we could.”
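To make the idea of comparing growth paths concrete, here is a minimal sketch of what such a search might look like. It is not Kohane's system. The child IDs, the heights, the distance measure, and the 2 cm threshold are all invented for illustration, and the threshold is not a clinical rule.

```python
from math import sqrt

def trajectory_distance(a, b):
    """Root-mean-square gap between two equal-length height series (cm)."""
    if len(a) != len(b):
        raise ValueError("trajectories must cover the same ages")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def find_doppelgangers(target, cohort, k=3):
    """Return the IDs of the k children whose growth paths are closest to target."""
    ranked = sorted(cohort, key=lambda cid: trajectory_distance(target, cohort[cid]))
    return ranked[:k]

def growth_stalled(heights, min_gain_cm=2.0):
    """Flag a path whose latest yearly gain falls below an illustrative threshold."""
    return len(heights) >= 2 and heights[-1] - heights[-2] < min_gain_cm

# Hypothetical yearly heights in cm at ages 5 through 9 (made-up numbers).
cohort = {
    "A": [110, 116, 122, 128, 134],
    "B": [108, 114, 120, 126, 132],
    "C": [111, 117, 122, 123, 123.5],
    "D": [100, 105, 110, 115, 120],
}
target = [110, 116, 121, 122, 122.5]

matches = find_doppelgangers(target, cohort, k=2)
print("closest growth paths:", matches)
print("target flagged for review:", growth_stalled(target))
```

With these made-up numbers, the closest match is child C, whose height also levels off after age 7. A real system would need far more children, more measurements per child, and thresholds set by pediatricians rather than by a programmer.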
James Heywood is an entrepreneur who has taken a different approach to the difficulty of linking medical data. He created a website, PatientsLikeMe.com, where individuals can report their own information: their conditions, treatments, and side effects. He’s already had a lot of success charting the varying courses diseases can take and how they compare to our common understanding of them.
His goal is to recruit enough people, covering enough conditions, so that people can find their health doppelganger. Heywood hopes that you can find people of your age and gender, with your history, reporting symptoms similar to yours—and see what has worked for them. That would be a very different kind of medicine, indeed.
DATA STORIES
In many ways the act of zooming in is more valuable to me than the particular findings of a particular study, because it offers a new way of seeing and talking about life.
When people learn that I am a data scientist and a writer, they sometimes will share some fact or survey with me. I often find this data boring—static and lifeless. It has no story to tell.
Likewise, friends have tried to get me to join them in reading novels and biographies. But these hold little interest for me either. I always find myself asking, “Would that happen in other situations? What’s the more general principle?” Their stories feel small and unrepresentative.
What I have tried to present in this book is something that, for me, is like nothing else. It is based on data and numbers; it is illustrative and far-reaching. And yet the data is so rich that you can visualize the people underneath it. When we zoom in on every minute of Edmonton’s water consumption, I see the people getting up from their couch at the end of the period. When we zoom in on people moving from Philadelphia to Miami and starting to cheat on their taxes, I see these people talking to their neighbors in their apartment complex and learning about the tax trick. When we zoom in on baseball fans of every age, I see my own childhood and my brother’s childhood and millions of adult men still crying over a team that won them over when they were eight years old.
At the risk of once again sounding grandiose, I think the economists and data scientists featured in this book are creating not only a new tool but a new genre. What I have tried to present in this chapter, and much of this book, is data so big and so rich that we can zoom in close and, without limiting ourselves to any particular, unrepresentative human being, still tell complex and evocative stories.
6
ALL THE WORLD’S A LAB
February 27, 2000, started as an ordinary day on Google’s Mountain View campus. The sun was shining, the bikers were pedaling, the masseuses were massaging, the employees were hydrating with cucumber water. And then, on this ordinary day, a few Google engineers had an idea that unlocked the secret that today drives much of the internet. The engineers found the best way to get you clicking, coming back, and staying on their sites.
Before describing what they did, we need to talk about correlation versus causality, a huge issue in data analysis—and one that we have not yet adequately addressed.
The media bombard us with correlation-based studies seemingly every day. For example, we have been told that those of us who drink a moderate amount of alcohol tend to be in better health. That is a correlation.
Does this mean drinking a moderate amount will improve one’s health—a causation? Perhaps not. It could be that good health causes people to drink a moderate amount. Social scientists call this reverse causation. Or it could be that there is an independent factor that causes both moderate drinking and good health. Perhaps spending a lot of time with friends leads to both moderate alcohol consumption and good health. Social scientists call this omitted-variable bias.
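A small simulation can show how an omitted variable produces a correlation with no causation behind it. The sketch below is an illustration under invented assumptions, not an analysis of real data. Time spent with friends (`social`) drives both drinking and health, and drinking has no effect on health at all. Every variable name and every number here is made up.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Assumed world: social time causes both outcomes; drinking never touches health.
social = [rng.gauss(0, 1) for _ in range(n)]
drinking = [s + rng.gauss(0, 1) for s in social]
health = [s + rng.gauss(0, 1) for s in social]

raw = pearson(drinking, health)
adjusted = pearson(residuals(drinking, social), residuals(health, social))

print(f"correlation ignoring social time:    {raw:.2f}")
print(f"correlation after accounting for it: {adjusted:.2f}")
```

Under these assumptions the raw correlation should come out near 0.5, even though drinking does nothing for health in this invented world. Once social time is accounted for, the correlation should fall to roughly zero. The catch, of course, is that in real data the omitted variable is often one nobody thought to measure.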