Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are

Aha. Here at last was the reason Summers had summoned me to his office.

Summers is hardly the first person to ask me this particular question. My father has generally been supportive of my unconventional research interests. But one time he did broach the subject. “Racism, child abuse, abortion,” he said. “Can’t you make any money off this expertise of yours?” Friends and other family members have raised the subject, as well. So have coworkers and strangers on the internet. Everyone seems to want to know whether I can use Google searches—or other Big Data—to pick stocks. Now it was the former Treasury secretary of the United States. This was more serious.

So can new Big Data sources successfully predict which way stocks are headed? The short answer is no.

In the previous chapters we discussed the four powers of Big Data. This chapter is all about Big Data’s limitations—both what we cannot do with it and, on occasion, what we ought not do with it. And one place to start is with the story of the failed attempt by Summers and me to beat the markets.

In Chapter 3, we noted that new data is most likely to yield big returns when the existing research in a given field is weak. It is an unfortunate truth about the world that you will have a much easier time getting new insights about racism, child abuse, or abortion than you will getting a new, profitable insight into how a business is performing. That’s because massive resources are already devoted to looking for even the slightest edge in measuring business performance. The competition in finance is fierce. That was already a strike against us.

Summers, who is not someone known for effusing about other people’s intelligence, was certain the hedge funds were already way ahead of us. I was quite taken during our conversation by how much respect he had for them and how many of my suggestions he was convinced they’d beaten us to. I proudly shared with him an algorithm I had devised that allowed me to obtain more complete Google Trends data. He said it was clever. When I asked him if Renaissance, a quantitative hedge fund, would have figured out that algorithm, he chuckled and said, “Yeah, of course they would have figured that out.”

The difficulty of keeping up with the hedge funds wasn’t the only fundamental problem that Summers and I ran up against in using new, big datasets to beat the markets.





THE CURSE OF DIMENSIONALITY

Suppose your strategy for predicting the stock market is to find a lucky coin—but one that will be found through careful testing. Here’s your methodology: You label one thousand coins—1 to 1,000. Every morning, for two years, you flip each coin, record whether it came up heads or tails, and then note whether the Standard & Poor’s Index went up or down that day. You pore through all your data. And voilà! You’ve found something. It turns out that 70.3 percent of the time Coin 391 came up heads the S&P Index rose. The relationship is statistically significant, highly so. You have found your lucky coin!

Just flip Coin 391 every morning and buy stocks whenever it comes up heads. Your days of Target T-shirts and ramen noodle dinners are over. Coin 391 is your ticket to the good life!

Or not.

You have become another victim of one of the most diabolical aspects of “the curse of dimensionality.” It can strike whenever you have lots of variables (or “dimensions”)—in this case, one thousand coins—chasing not that many observations—in this case, 504 trading days over those two years. One of those dimensions—Coin 391, in this case—is likely to get lucky. Decrease the number of variables—flip only one hundred coins—and it will become much less likely that one of them will get lucky. Increase the number of observations—try to predict the behavior of the S&P Index for twenty years—and coins will struggle to keep up.
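To see the curse in action, here is a minimal simulation in Python of the coin story above (a sketch of my own, not from the book; the coin and day counts come from the example, everything else is assumed). With a thousand fair coins and a market that moves at random, the best coin typically agrees with the market in the high fifties percent, enough to look "statistically significant" if you forget it was hand-picked from a thousand candidates, and it falls back to roughly 50 percent on fresh data.

```python
# A minimal simulation of the coin-flip story: 1,000 fair coins, 504 trading
# days, and a market direction that is itself random, so no coin carries any
# real signal. One coin will still end up looking like a winner.
import numpy as np

rng = np.random.default_rng(seed=391)        # seed chosen for reproducibility
n_coins, n_days = 1_000, 504

coins = rng.integers(0, 2, size=(n_coins, n_days))   # 1 = heads, 0 = tails
market_up = rng.integers(0, 2, size=n_days)          # 1 = S&P up, 0 = down

# For each coin, how often did its flip match the market's direction?
hit_rate = (coins == market_up).mean(axis=1)
best = int(hit_rate.argmax())
print(f"'Lucky' coin {best}: matched the market {hit_rate[best]:.1%} of the time")

# The honest test: the same strategy on the next two years of (equally random) data.
fresh_flips = rng.integers(0, 2, size=n_days)
fresh_market = rng.integers(0, 2, size=n_days)
print(f"Same coin out of sample: {(fresh_flips == fresh_market).mean():.1%}")
```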

The curse of dimensionality is a major issue with Big Data, since newer datasets frequently give us exponentially more variables than traditional data sources—every search term, every category of tweet, etc. Many people who claim to predict the market using some Big Data source have merely been entrapped by the curse. All they’ve really done is find the equivalent of Coin 391.

Take, for example, a team of computer scientists from Indiana University and the University of Manchester who claimed they could predict which way the markets would go based on what people were tweeting. They built an algorithm to code the world’s day-to-day moods based on tweets, using techniques similar to the sentiment analysis discussed in Chapter 3. However, they coded not just one mood but many moods—happiness, anger, kindness, and more. They found that a preponderance of tweets suggesting calmness, such as “I feel calm,” predicts that the Dow Jones Industrial Average is likely to rise six days later. A hedge fund was founded to exploit their findings.

What’s the problem here?

The fundamental problem is that they tested too many things. And if you test enough things, just by random chance, one of them will be statistically significant. They tested many emotions. And they tested each emotion one day before, two days before, three days before, and up to seven days before the stock market behavior that they were trying to predict. And all these variables were used to try to explain just a few months of Dow Jones ups and downs.
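To put rough numbers on it, here is a small sketch (my own illustration; the mood labels, the day count, and the p < 0.05 threshold are assumptions, not details from the study) showing how testing many moods at many lags manufactures "significant" predictors out of pure noise.

```python
# An illustrative sketch: run many mood-versus-market tests on pure noise and
# count how many come out "statistically significant" at the usual p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_days = 120                                # roughly a few months of trading days
moods = ["calm", "happy", "angry", "kind", "anxious", "sad"]  # hypothetical labels
returns = rng.normal(size=n_days)           # random "Dow" moves: no signal at all

false_positives = []
for mood in moods:
    mood_series = rng.normal(size=n_days)   # a random daily mood index
    for lag in range(1, 8):                 # mood 1 to 7 days before the move
        r, p = stats.pearsonr(mood_series[:-lag], returns[lag:])
        if p < 0.05:
            false_positives.append((mood, lag, round(p, 3)))

# 6 moods x 7 lags = 42 tests; at p < 0.05 we expect about 42 * 0.05, or
# roughly 2, "significant" relationships even though the data is pure noise.
print(f"{len(false_positives)} of {len(moods) * 7} noise tests look significant:")
print(false_positives)
```

With roughly forty tests at the conventional 5 percent threshold, a couple of spurious "discoveries" are expected by chance alone.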

Calmness six days earlier was not a legitimate predictor of the stock market. Calmness six days earlier was the Big Data equivalent of our hypothetical Coin 391. The tweet-based hedge fund was shut down one month after starting due to lackluster returns.

Hedge funds trying to time the markets with tweets are not the only ones battling the curse of dimensionality. So are the numerous scientists who have tried to find the genetic keys to who we are.

Thanks to the Human Genome Project, it is now possible to collect and analyze the complete DNA of people. The potential of this project seemed enormous.

Maybe we could find the gene that causes schizophrenia. Maybe we could discover the gene that causes Alzheimer’s and Parkinson’s and ALS. Maybe we could find the gene that causes—gulp—intelligence. Is there one gene that can add a whole bunch of IQ points? Is there one gene that makes a genius?

In 1998, Robert Plomin, a prominent behavioral geneticist, claimed to have found the answer. He received a dataset that included the DNA and IQs of hundreds of students. He compared the DNA of “geniuses”—those with IQs of 160 or higher—to the DNA of those with average IQs.

He found a striking difference in the DNA of these two groups. It was located in one small corner of chromosome 6, in an obscure but powerful gene involved in the metabolism of the brain. One version of this gene, IGF2r, was twice as common in geniuses.
