Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are

“First Gene to Be Linked with High Intelligence Is Reported Found,” headlined the New York Times.

You may think of the many ethical questions Plomin’s finding raised. Should parents be allowed to screen their kids for IGF2r? Should they be allowed to abort a baby with the low-IQ variant? Should we genetically modify people to give them a high IQ? Does IGF2r correlate with race? Do we want to know the answer to that question? Should research on the genetics of IQ continue?

Before bioethicists had to tackle any of these thorny questions, there was a more basic question for geneticists, including Plomin himself. Was the result accurate? Was it really true that IGF2r could predict IQ? Was it really true that geniuses were twice as likely to carry a certain variant of this gene?

Nope. A few years after his original study, Plomin got access to another sample of people that also included their DNA and IQ scores. This time, IGF2r did not correlate with IQ. Plomin—and this is a sign of a good scientist—retracted his claim.

This, in fact, has been a general pattern in the research into genetics and IQ. First, scientists report that they have found a genetic variant that predicts IQ. Then scientists get new data and discover their original assertion was wrong.

For example, in a recent paper, a team of scientists, led by Christopher Chabris, examined twelve prominent claims about genetic variants associated with IQ. They examined data from ten thousand people. They could not reproduce the correlation for any of the twelve.

What’s the issue with all of these claims? The curse of dimensionality. The human genome, scientists now know, differs in millions of ways. There are, quite simply, too many genes to test.

If you test enough tweets to see if they correlate with the stock market, you will find one that correlates just by chance. If you test enough genetic variants to see if they correlate with IQ, you will find one that correlates just by chance.
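To see how easily chance alone produces a "finding," here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the target series and every candidate variable are pure random noise, and the function names, sample size of 50, and seed are hypothetical choices, not anything from the studies described above.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def best_chance_correlation(n_variables, n_observations, seed=0):
    """Strongest absolute correlation found among pure-noise variables.

    The target (think "IQ" or "the stock market") and every candidate
    variable are independent random draws, so any correlation is luck.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_observations)]
    best = 0.0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_observations)]
        best = max(best, abs(pearson(candidate, target)))
    return best


if __name__ == "__main__":
    # The more variables tested, the stronger the best "discovery" looks.
    for n in (1, 10, 100, 1000):
        print(n, round(best_chance_correlation(n, n_observations=50), 3))
```

Because nothing here is truly related to anything else, the strongest correlation can only grow as more variables are tried, which is the pattern the paragraph above describes.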

How can you overcome the curse of dimensionality? You have to have some humility about your work and not fall in love with your results. You have to put these results through additional tests. For example, before you bet your life savings on Coin 391, you would want to see how it does over the next couple of years. Social scientists call this an “out-of-sample” test. And the more variables you try, the more humble you have to be. The more variables you try, the tougher the out-of-sample test has to be. It is also crucial to keep track of every test you attempt. Then you can know exactly how likely it is you are falling victim to the curse and how skeptical you should be of your results.

Which brings us back to Larry Summers and me. Here’s how we tried to beat the markets.

Summers’s first idea was to use searches to predict future sales of key products, such as iPhones, that might shed light on the future performance of the stock of a company, such as Apple. There was indeed a correlation between searches for “iPhones” and iPhone sales. When people are Googling a lot for “iPhones,” you can bet a lot of phones are being sold. However, this information was already incorporated into the Apple stock price. Clearly, when there were lots of Google searches for “iPhones,” hedge funds had also figured out that it would be a big seller, regardless of whether they used the search data or some other source.

Summers’s next idea was to predict future investment in developing countries. If a large number of investors were going to be pouring money into countries such as Brazil or Mexico in the near future, then stocks for companies in these countries would surely rise. Perhaps we could predict a rise in investment with key Google searches—such as “invest in Mexico” or “investment opportunities in Brazil.” This proved a dead end. The problem? The searches were too rare. Instead of revealing meaningful patterns, this search data jumped all over the place.

We tried searches for individual stocks. Perhaps if people were searching for “GOOG,” this meant they were about to buy Google. These searches seemed to predict that the stocks would be traded a lot. But they did not predict whether the stocks would rise or fall. One major limitation is that these searches did not tell us whether someone was interested in buying or selling the stock.

One day, I excitedly showed Summers a new idea I had: past searches for “buy gold” seemed to correlate with future increases in the price of gold. Summers told me I should test it going forward to see if it remained accurate. It stopped working, perhaps because some hedge fund had found the same relationship.

In the end, over a few months, we didn’t find anything useful in our tests. Undoubtedly, if we had looked for a correlation with market performance in each of the billions of Google search terms, we would have found one that worked, however weakly. But it likely would have just been our own Coin 391.
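The out-of-sample test described earlier can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not the procedure Summers and I actually ran: the "market" and all the coins are random fifty-fifty draws, the 1,000 coins and 250-day windows are made-up numbers, and the function names are hypothetical. The last function applies the Bonferroni correction, a standard statistical adjustment swapped in here as one simple way to use a count of every test attempted.

```python
import random


def hit_rate(predictions, outcomes):
    """Fraction of days on which a prediction matched the outcome."""
    matches = sum(p == o for p, o in zip(predictions, outcomes))
    return matches / len(outcomes)


def pick_and_retest(n_coins=1000, n_in=250, n_out=250, seed=0):
    """Pick the luckiest coin in-sample, then score it out-of-sample.

    Assumption: market moves and coin flips are independent 50/50 draws,
    so no coin has any real predictive power.
    """
    rng = random.Random(seed)
    n_days = n_in + n_out
    market = [rng.random() < 0.5 for _ in range(n_days)]
    coins = [[rng.random() < 0.5 for _ in range(n_days)]
             for _ in range(n_coins)]

    # In-sample: how often did each coin "call" the market correctly?
    in_rates = [hit_rate(coin[:n_in], market[:n_in]) for coin in coins]
    best = max(range(n_coins), key=in_rates.__getitem__)

    # Out-of-sample: score the winner on days it was not selected on.
    out_rate = hit_rate(coins[best][n_in:], market[n_in:])
    return best, in_rates[best], out_rate


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    coin, in_sample, out_of_sample = pick_and_retest()
    print("best coin:", coin)
    print("in-sample hit rate:", round(in_sample, 3))
    print("out-of-sample hit rate:", round(out_of_sample, 3))
    print("p-value cutoff for 1,000 tests:", bonferroni_threshold(0.05, 1000))
```

The winning coin looks impressive on the days used to choose it and should drift back toward a coin flip on fresh days. The corrected cutoff shows why the bar has to rise with the number of tests: with a thousand attempts, a result needs to be far more surprising before it deserves belief.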

THE OVEREMPHASIS ON WHAT IS MEASURABLE

In March 2012, Zoë Chance, a marketing professor at Yale, received a small white pedometer in her office mailbox in downtown New Haven, Connecticut. She aimed to study how this device, which measures the steps you take during the day and gives you points as a result, can inspire you to exercise more.

What happened next, as she recounted in a TEDx talk, was a Big Data nightmare. Chance became so obsessed with and addicted to increasing her numbers that she began walking everywhere: from the kitchen to the living room, to the dining room, to the basement, and around her office. She walked early in the morning, late at night, at nearly all hours of the day, logging twenty thousand steps in a given twenty-four-hour period. She checked her pedometer hundreds of times per day, and much of what remained of her human communication was with other pedometer users online, discussing strategies to improve scores. She remembers putting the pedometer on her three-year-old daughter when her daughter was walking, because she was so obsessed with getting the number higher.

Chance became so obsessed with maximizing this number that she lost all perspective. She forgot the reason someone would want to get the number higher—exercising, not having her daughter walk a few steps. Nor did she complete any academic research about the pedometer. She finally got rid of the device after falling late one night, exhausted, while trying to get in more steps. Though she is a data-driven researcher by profession, the experience affected her profoundly. “It makes me skeptical of whether having access to additional data is always a good thing,” Chance says.
