Many people are anxious over this development. They are intimidated by data, easily lost and confused in a world of numbers. They think that a quantitative understanding of the world is for a select few left-brained prodigies, not for them. As soon as they encounter numbers, they are ready to turn the page, end the meeting, or change the conversation.
But I have spent ten years in the data analysis business and have been fortunate to work with many of the top people in the field. And one of the most important lessons I have learned is this: Good data science is less complicated than people think. The best data science, in fact, is surprisingly intuitive.
What makes data science intuitive? At its core, data science is about spotting patterns and predicting how one variable will affect another. People do this all the time.
Just think about how my grandmother gave me relationship advice. She utilized the large database of relationships that her brain has uploaded over nearly a century of life, in the stories she has heard from her family, her friends, her acquaintances. She limited her analysis to a sample of relationships in which the man had many qualities that I have: a sensitive temperament, a tendency to isolate himself, a sense of humor. She zeroed in on key qualities of the woman: how kind she was, how smart she was, how pretty she was. She correlated these key qualities of the woman with a key quality of the relationship: whether it was a good one. Finally, she reported her results. In other words, she spotted patterns and predicted how one variable would affect another. Grandma is a data scientist.
You are a data scientist, too. When you were a kid, you noticed that when you cried, your mom gave you attention. That is data science. When you reached adulthood, you noticed that if you complain too much, people want to hang out with you less. That is data science, too. When people hang out with you less, you noticed, you are less happy. When you are less happy, you are less friendly. When you are less friendly, people want to hang out with you even less. Data science. Data science. Data science.
Because data science is so natural, the best Big Data studies, I have found, can be understood by just about any smart person. If you can’t understand a study, the problem is probably with the study, not with you.
Want proof that great data science tends to be intuitive? I recently came across a study that may be one of the most important conducted in the past few years. It is also one of the most intuitive studies I’ve ever seen. I want you to think not just about the importance of the study—but how natural and grandma-like it is.
The study was by a team of researchers from Columbia University and Microsoft. The team wanted to find what symptoms predict pancreatic cancer. This disease has a low five-year survival rate—only about 3 percent—but early detection can double a patient’s chances.
The researchers’ method? They utilized data from tens of thousands of anonymous users of Bing, Microsoft’s search engine. They coded a user as having recently been given a diagnosis of pancreatic cancer based on unmistakable searches, such as “just diagnosed with pancreatic cancer” or “I was told I have pancreatic cancer, what to expect.”
Next, the researchers looked at searches for health symptoms. They compared the small number of users who later reported a pancreatic cancer diagnosis with those who didn’t. Which symptoms, in other words, predicted that a user would report a diagnosis a few weeks or months later?
The results were striking. Searching for back pain and then yellowing skin turned out to be a sign of pancreatic cancer; searching for back pain alone made it unlikely someone had pancreatic cancer. Similarly, searching for indigestion and then abdominal pain was evidence of pancreatic cancer, while searching for indigestion without abdominal pain meant a person was unlikely to have it. The researchers could identify 5 to 15 percent of cases with almost no false positives. Now, this may not sound like a great rate, but if you have pancreatic cancer, even a 10 percent chance of doubling your chances of survival would feel like a windfall.
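For readers who like to see the logic spelled out, here is a rough sketch of the comparison in Python. It is not the researchers' code. The search logs are invented toy data, and every name in it (`SEARCH_LOGS`, `has_sequence`, `evaluate`) is hypothetical. It only illustrates the idea: flag users who searched for one symptom and then another, and measure how many real cases that flag catches and how many false alarms it raises.

```python
from typing import Dict, List, Tuple

# Invented toy data, for illustration only: user id -> searches in time order.
SEARCH_LOGS: Dict[str, List[str]] = {
    "u1": ["back pain", "yellowing skin", "just diagnosed with pancreatic cancer"],
    "u2": ["back pain", "mattress reviews"],
    "u3": ["indigestion", "abdominal pain", "just diagnosed with pancreatic cancer"],
    "u4": ["indigestion", "antacid"],
    "u5": ["back pain", "standing desk"],
    "u6": ["weight loss", "just diagnosed with pancreatic cancer"],
}

# Assumption: a single unmistakable query marks a user as a diagnosed case.
DIAGNOSIS_QUERY = "just diagnosed with pancreatic cancer"


def is_case(queries: List[str]) -> bool:
    """A user counts as a case if they ever issued the diagnosis query."""
    return DIAGNOSIS_QUERY in queries


def has_sequence(queries: List[str], first: str, then: str) -> bool:
    """True if `first` was searched and `then` was searched afterwards,
    looking only at searches made before any diagnosis query."""
    if is_case(queries):
        history = queries[: queries.index(DIAGNOSIS_QUERY)]
    else:
        history = queries
    if first not in history:
        return False
    return then in history[history.index(first) + 1:]


def evaluate(logs: Dict[str, List[str]], first: str, then: str) -> Tuple[float, float]:
    """Return (share of cases caught, share of non-cases falsely flagged).
    Assumes the logs contain at least one case and one non-case."""
    cases = [q for q in logs.values() if is_case(q)]
    non_cases = [q for q in logs.values() if not is_case(q)]
    caught = sum(has_sequence(q, first, then) for q in cases)
    false_alarms = sum(has_sequence(q, first, then) for q in non_cases)
    return caught / len(cases), false_alarms / len(non_cases)


recall, false_positive_rate = evaluate(SEARCH_LOGS, "back pain", "yellowing skin")
print(f"cases caught: {recall:.0%}, false alarms: {false_positive_rate:.0%}")
```

In this toy example the flag catches one of the three cases and raises no false alarms, which is the same kind of trade-off the study reports: a modest share of cases identified, with very few healthy people wrongly flagged.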
The paper detailing this study would be difficult for non-experts to fully make sense of. It includes a lot of technical jargon, such as the Kolmogorov-Smirnov test, the meaning of which, I have to admit, I had forgotten. (It’s a way to check whether a sample of data is consistent with a given distribution, such as the one a model predicts, or whether two samples plausibly come from the same distribution.)
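If you are curious what the Kolmogorov-Smirnov test actually measures, here is a minimal sketch of its two-sample statistic in Python. How the paper applies the test is not described here, so this is a generic illustration rather than the researchers' procedure, and the function name `ks_statistic` is my own. The statistic is simply the largest gap between the two samples' cumulative distributions: 0 means the samples look identical, 1 means they do not overlap at all.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Largest vertical gap between the two samples' empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the largest gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Turning this gap into a p-value takes one more step that the sketch leaves out; statistical packages handle it for you.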
However, note how natural and intuitive this remarkable study is at its most fundamental level. The researchers looked at a wide array of medical cases and tried to connect symptoms to a particular illness. You know who else uses this methodology in trying to figure out whether someone has a disease? Husbands and wives, mothers and fathers, and nurses and doctors. Based on experience and knowledge, they try to connect fevers, headaches, runny noses, and stomach pains to various diseases. In other words, the Columbia and Microsoft researchers wrote a groundbreaking study by utilizing the natural, obvious methodology that everybody uses to make health diagnoses.
But wait. Let’s slow down here. If the methodology of the best data science is frequently natural and intuitive, as I claim, this raises a fundamental question about the value of Big Data. If humans are naturally data scientists, if data science is intuitive, why do we need computers and statistical software? Why do we need the Kolmogorov-Smirnov test? Can’t we just use our gut? Can’t we do it like Grandma does, like nurses and doctors do?
This gets to an argument intensified after the release of Malcolm Gladwell’s bestselling book Blink, which extols the magic of people’s gut instincts. Gladwell tells the stories of people who, relying solely on their guts, can tell whether a statue is fake; whether a tennis player will fault before he hits the ball; how much a customer is willing to pay. The heroes in Blink do not run regressions; they do not calculate confidence intervals; they do not run Kolmogorov-Smirnov tests. But they generally make remarkable predictions. Many people have intuitively supported Gladwell’s defense of intuition: they trust their guts and feelings. Fans of Blink might celebrate the wisdom of my grandmother giving relationship advice without the aid of computers. Fans of Blink may be less apt to celebrate my studies or the other studies profiled in this book, which use computers. If Big Data—of the computer type, rather than the grandma type—is a revolution, it has to prove that it’s more powerful than our unaided intuition, which, as Gladwell has pointed out, can often be remarkable.