Interestingly, they didn’t set out to explore just how poorly human experts performed when forced to compete with an algorithm. Rather, they set out to create a model of what experts were doing when they formed their judgments. Or, as Lew Goldberg, who had arrived in 1960 at the Oregon Research Institute by way of Stanford University, put it, “To be able to spot when and where human judgment is more likely to go wrong: that was the idea.” If they could figure out where the expert judgments were going wrong, they might close the gap between the expert and the algorithms. “I thought that if you understood how people made judgments and decisions, you could improve judgment and decision making,” said Slovic. “You could make people better predictors and better deciders. We had that sense—though it was kind of fuzzy at the time.”
To that end, in 1960, Hoffman had published a paper in which he set out to analyze how experts drew their conclusions. Of course you might simply ask the experts how they did it—but that was a highly subjective approach. People often said they were doing one thing when they were actually doing another. A better way to get at expert thinking, Hoffman argued, was to take the various inputs the experts used to make their decisions (“cues,” he called these inputs) and infer from those decisions the weights they had placed on each of those inputs. So, for example, if you wanted to know how the Yale admissions committee decided who got into Yale, you asked for the list of information about Yale applicants that was taken into account—grade point average, board scores, athletic ability, alumni connections, type of high school attended, and so on. Then you watched the committee decide, over and over, whom to admit. From the committee’s many decisions you could distill the process its members had used to weigh the traits deemed relevant to the assessment of any applicant. You might even build a model of the interplay of those traits in the minds of the members of the committee, if your math skills were up to it. (The committee might place greater weight on the board scores of athletes from public schools, say, than on those of the legacy children from private schools.)
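In modern terms, what Hoffman proposed amounts to fitting a simple statistical model to an expert’s many judgments and reading the fitted coefficients as the weights the expert implicitly uses. The sketch below is purely illustrative: the cue names, the numbers, and the least-squares fit are assumptions for the sake of the example, not anything taken from Hoffman’s paper.

```python
# Illustrative sketch: infer the weights a committee implicitly places on
# applicant cues by fitting a simple linear model to many of its ratings.
# All cue names and data here are hypothetical.
import numpy as np

# Each row is one applicant: GPA, board score, athletic rating (1-5), alumni tie (0/1).
cues = np.array([
    [3.9, 780, 2, 1],
    [3.4, 700, 5, 0],
    [3.7, 650, 4, 1],
    [3.1, 720, 3, 0],
    [4.0, 760, 1, 0],
    [3.6, 690, 2, 1],
])
# The committee's overall rating of each applicant, say on a 1-10 scale.
ratings = np.array([9.0, 7.5, 8.0, 5.5, 8.5, 7.0])

# Least-squares regression of ratings on cues: the fitted coefficients are the
# inferred weights, a "paramorphic representation" of the committee's judgment.
X = np.column_stack([cues, np.ones(len(cues))])  # add an intercept term
weights, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print(dict(zip(["gpa", "boards", "athletics", "alumni", "intercept"], weights.round(3))))
```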
Hoffman’s math skills were up to it. “The Paramorphic Representation of Clinical Judgment,” he had titled his paper for the Psychological Bulletin. If the title was incomprehensible, it was at least in part because Hoffman expected anyone who read it to know what he was talking about. He didn’t have any great hope that his paper would be read outside of his small world: What happened in this new little corner of psychology tended to stay there. “People who were making judgments in the real world wouldn’t have come across it,” said Lew Goldberg. “The people who are not psychologists do not read psychology journals.”
The real-world experts whose thinking the Oregon researchers sought to understand were, in the beginning, clinical psychologists, but they clearly believed that whatever they learned would apply more generally to any professional decision maker—doctors, judges, meteorologists, baseball scouts, and so on. “Maybe fifteen people in the world are noodling around on this,” said Paul Slovic. “But we recognize we’re doing something that could be important: capturing what seemed to be complex, mysterious intuitive judgments with numbers.” By the late 1960s Hoffman and his acolytes had reached some unsettling conclusions—nicely captured in a pair of papers written by Lew Goldberg. Goldberg published his first paper in 1968, in an academic journal called American Psychologist. He began by pointing out the small mountain of research that suggested that expert judgment was less reliable than algorithms. “I can summarize this ever-growing body of literature,” wrote Goldberg, “by pointing out that over a rather large array of clinical judgment tasks (including by now some which were specifically selected to show the clinician at his best and the actuary at his worst), rather simple actuarial formulae typically can be constructed to perform at a level of validity no lower than that of the clinical expert.”
So . . . what was the clinical expert doing? Like others who had approached the problem, Goldberg assumed that when, for instance, a doctor diagnosed a patient, his thinking must be complex. He further assumed that any model seeking to capture that thinking must also be complex. For example, a psychologist at the University of Colorado studying how his fellow psychologists predicted which young people would have trouble adjusting to college had actually taped psychologists talking to themselves as they studied data about their patients—and then tried to write a complicated computer program to mimic the thinking. Goldberg said he preferred to start simple and build from there. As his first case study, he used the way doctors diagnosed cancer.
He explained that the Oregon Research Institute had completed a study of doctors. They had found a gaggle of radiologists at the University of Oregon and asked them: How do you decide from a stomach X-ray if a person has cancer? The doctors said that there were seven major signs that they looked for: the size of the ulcer, the shape of its borders, the width of the crater it made, and so on. The “cues,” Goldberg called them, as Hoffman had before him. There were obviously many different plausible combinations of these seven cues, and the doctors had to grapple with how to make sense of each of them. The size of an ulcer might mean one thing if its contours were smooth, for instance, and another if its contours were rough. Goldberg pointed out that, indeed, experts tended to describe their thought processes as subtle and complicated and difficult to model.
The Oregon researchers began by creating, as a starting point, a very simple algorithm, in which the likelihood that an ulcer was malignant depended on the seven factors the doctors had mentioned, equally weighted. The researchers then asked the doctors to judge the probability of cancer in ninety-six different individual stomach ulcers, on a seven-point scale from “definitely malignant” to “definitely benign.” Without telling the doctors what they were up to, they showed them each ulcer twice, mixing up the duplicates randomly in the pile so the doctors wouldn’t notice they were being asked to diagnose the exact same ulcer they had already diagnosed. The researchers didn’t have a computer. They transferred all of their data onto punch cards, which they mailed to UCLA, where the data was analyzed by the university’s big computer. The researchers’ goal was to see if they could create an algorithm that would mimic the decision making of doctors.
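For readers curious what such a deliberately simple starting point looks like, here is a minimal sketch: an equally weighted combination of the seven cues. The cue names, the coding, and the scale are hypothetical stand-ins, not the actual variables from the Oregon study.

```python
# Illustrative sketch of the equally weighted starting model: each of the seven
# cues counts the same, and the prediction is simply their average. Cue names
# and the 1-7 coding are hypothetical stand-ins for the radiologists' seven signs.
CUES = ["ulcer_size", "border_shape", "crater_width", "cue_4", "cue_5", "cue_6", "cue_7"]

def equal_weight_score(cue_values):
    """Average of the seven cue ratings, each coded 1 (benign-looking)
    to 7 (malignant-looking)."""
    assert len(cue_values) == len(CUES)
    return sum(cue_values) / len(cue_values)

# One hypothetical ulcer, rated on each of the seven cues:
print(equal_weight_score([6, 5, 6, 4, 5, 6, 5]))  # higher score = more malignant-looking
```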
This simple first attempt, Goldberg assumed, was just a starting point. The algorithm would need to become more complex; it would require more advanced mathematics. It would need to account for the subtleties of the doctors’ thinking about the cues. For instance, if an ulcer was particularly big, it might lead them to reconsider the meaning of the other six cues.
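One way to picture the kind of added complexity Goldberg had in mind is a configural rule, in which one cue changes how the others are read. The sketch below is a hypothetical illustration of that idea under assumed thresholds and weights, not a model the Oregon group actually built at this stage.

```python
# Hypothetical illustration of a "configural" refinement: if the ulcer is
# unusually large, let size dominate and discount the remaining cues, an
# interaction that a purely additive, equally weighted model cannot capture.
def configural_score(cue_values, size_threshold=6):
    size, others = cue_values[0], cue_values[1:]
    if size >= size_threshold:
        return 0.7 * size + 0.3 * (sum(others) / len(others))
    return sum(cue_values) / len(cue_values)

print(configural_score([7, 5, 6, 4, 5, 6, 5]))
```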