When Will The Machines Start Predicting Bestsellers?

by James S. Murphy

In November, Knopf bought a 900-page debut novel by Garth Risk Hallberg for almost $2 million. It’s a tremendous gamble, regardless of the book’s quality, if one that many publishers were happy to make: more than 10 houses bid more than $1 million, according to the Times. Predicting a novel’s fate in the commercial or critical marketplace is a fool’s gambit, as indicated both by works like the first Harry Potter novel, which was repeatedly rejected before becoming, well, Harry Potter, and by expensive flops like Charles Frazier’s Thirteen Moons. The novelist Curtis Sittenfeld said, “People think publishing is a business, but it’s a casino.”

But what if publishers could recognize a bestseller or a prizewinner when they saw it? A recent paper by three computer scientists at Stony Brook University suggested that one day they might. After publication, their work provoked over-the-top headlines like “Key to hit books discovered,” “Want to write a best-seller? Scientists claim this algorithm will tell you how” and “Computer Algorithm Can Tell If Your Lousy Book Will Sell or Not.”

Are any of those true? The quick answer, at least with respect to this flawed paper, is no. But that does not mean that one day they won’t come true.

In “Success with Style: Using Writing Style to Predict the Success of Novels” (PDF), Yejin Choi, Vikas Ashok, and Song Feng employ computer models based on stylometric analysis to identify the features of novels that have been commercially or critically successful, with an eye, their title suggests, to predicting future successes. Stylometry — the statistical analysis of linguistic variations in literary works in order to identify the characteristic features of a text’s style — occasionally makes the headlines for revealing the authors of anonymous or pseudonymous works, as when it helped identify J. K. Rowling as the author of a detective novel she had published under a pen name. The power of the method lies in its analysis of the most ordinary aspects of language, such as the length of words and sentences, the distribution of various punctuation marks, and the relative frequency of common words like “the,” “of,” and “and.” A novel about crime will likely have all kinds of specialized words, but the way to tell a J. K. Rowling from a P. D. James lies more in the degree to which one might use “for” where the other would use “because.” By quantifying these elements, stylometry can identify a writer’s DNA.
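To make the method concrete, here is a minimal sketch of the kind of surface features stylometry counts. The function-word list and the features are illustrative assumptions, not the actual feature set used in the Rowling attribution or the Stony Brook paper.

```python
from collections import Counter
import re

# Hypothetical list of common function words; real stylometric studies
# typically use hundreds of them.
FUNCTION_WORDS = ["the", "of", "and", "for", "because", "to", "a", "in"]

def stylometric_features(text):
    """Return a few simple style features for a text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(words) or 1
    counts = Counter(words)
    # Relative frequency of each function word ("for" vs. "because", etc.)
    features = {f"freq_{w}": counts[w] / total for w in FUNCTION_WORDS}
    features["avg_word_len"] = sum(map(len, words)) / total
    features["avg_sent_len"] = total / (len(sentences) or 1)
    return features

print(stylometric_features("She did it because she could. For the money, of course."))
```

Feature vectors like these, computed over whole novels, are what allow a model to distinguish one writer's "DNA" from another's.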

What is bold about the Stony Brook study is its use of stylometry to identify not so much a novel’s DNA as its fitness. The authors examined 100 texts in each of eight genres, including, inexplicably, poetry. These 800 texts were taken from Project Gutenberg, a site that makes available over 42,000 free e-books, all of which have passed out of copyright, meaning they were published before 1923. The success of a book was determined, Choi reported in an email, by the number of times it had been downloaded over 30 days in October and November 2012. In each group, they identified 50 books that had been downloaded enough to be considered successful and 50 unsuccessful ones that had typically been downloaded fewer than ten times. The metadata that Song Feng makes available reveal that the number of downloads considered large enough to identify a work as a success ranged from 10,605 down to a fairly meager 94.

Choi and her collaborators found that “there exist distinct linguistic patterns shared among successful literature, at least within the same genre, making it possible to build a model with surprisingly high accuracy in predicting the success of a novel.” Successful novels, the paper asserts, use connectives (e.g., “and,” “since”), prepositions, and thinking verbs (e.g., “recognized,” “remembered”) more frequently, while less successful books rely more heavily on action verbs, words for body parts, and negative words (e.g., “worse,” “bruised”).
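The prediction step itself can be pictured with a toy example. The feature values below are invented, and this nearest-centroid approach is a stand-in for the paper's actual classifiers; the point is only to show how feature rates for two groups of books can be turned into a label for a new one.

```python
# Toy illustration: each book is a dict of feature rates (e.g., what share
# of its words are connectives vs. action verbs). A new book is labeled by
# whichever class centroid it sits closer to. All numbers are invented.

def centroid(vectors):
    """Average a list of feature dicts sharing the same keys."""
    keys = vectors[0].keys()
    return {k: sum(v[k] for v in vectors) / len(vectors) for k in keys}

def distance(a, b):
    """Euclidean distance between two feature dicts."""
    return sum((a[k] - b[k]) ** 2 for k in a) ** 0.5

successful = [{"connectives": 0.09, "action_verbs": 0.02},
              {"connectives": 0.08, "action_verbs": 0.03}]
unsuccessful = [{"connectives": 0.04, "action_verbs": 0.07},
                {"connectives": 0.05, "action_verbs": 0.06}]

def predict(book):
    d_s = distance(book, centroid(successful))
    d_u = distance(book, centroid(unsuccessful))
    return "successful" if d_s < d_u else "unsuccessful"

print(predict({"connectives": 0.085, "action_verbs": 0.025}))  # → successful
```

Whether such patterns generalize beyond the training corpus is, of course, exactly the question the rest of this piece takes up.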

Does this model predict publishing or critical success? Once again, no, but before discussing the significant problems with using the Stony Brook findings to forecast a book’s future, I want to acknowledge the methodology and ambition of this study, as well as its challenge to the contemporary orthodoxy that taste is utterly relative and that both commercial and critical success are the pure products of external forces, including advertising, critical consecration, and fashion. It is a good thing that publishers, writers, critics, and readers might have their taste challenged, but as for predicting bestsellers, there is nothing to worry about yet. There is little reason to doubt that the stylistic patterns they found in their texts exist, but there are good reasons to doubt the predictive potential of these patterns.

Despite the claims from people covering the study, Choi and her colleagues did not study hit books, best-sellers, or book sales. As Matthew Jockers, an assistant professor of English at the University of Nebraska, points out, what the Stony Brook group “really studied was [the] classic,” not the best-seller. Jockers, whose own work uses computational text analysis to discover large-scale patterns among tens of thousands of books rather than hundreds, admires the Stony Brook paper but questions its model of success and the claim that the model can predict success among contemporary novels, because it is based on a corpus too dissimilar to today’s literary marketplace.

Choi and her colleagues almost certainly used Project Gutenberg because it is a large and free corpus of texts. I suspect that their interest lay more in computer science problems and what is called Natural Language Processing than in solving the problem of predicting literary success. As a result of choosing Project Gutenberg, however, they selected a corpus that has already been largely sorted in ways that shaped the popularity of its books. Jockers suggests that the distinction the Stony Brook study found is not between bestsellers and flops but between “the classic” and “pulp.” “What it might be,” he said, “is that they’ve detected the style of books that tend to get assigned on English literature syllabi.”

Other kinds of external influences can significantly affect download numbers. In the case of the most downloaded book in the Stony Brook corpus — Les Misérables — the number reflects the popularity of the film that was being heavily advertised in 2012. Even the download counts themselves likely influence future downloads.

Counting downloads actually muddies the question of what counts as success rather than clarifying it. Consider Moby Dick, which was the sixteenth most popular eBook of the past month at Project Gutenberg. Moby Dick was a massive flop at the time of its publication and struggled for decades to find an audience. Conversely, Marie Corelli’s The Sorrows of Satan, perhaps the best-selling British novel of the entire nineteenth century, was downloaded only 135 times in the past month. Do we trust 1895 or 2013 when gauging a novel’s success?

Jodie Archer, a graduate student at Stanford University, whose research uses computer models to identify the intrinsic qualities that determine which contemporary novels become best-sellers, argues that classics are probably exactly the wrong model for predicting today’s commercial successes. In her dissertation, “Sex Does Not Sell: A Digital Analysis of the Contemporary Bestseller,” Archer used data derived from the 38,000 in-print novels provided by BookLamp, a technology-based book analysis platform that is for books what Pandora is for music, to train a computer to predict whether a novel was likely to be a New York Times bestseller today. When she compared classics to contemporary bestselling fiction, the computer model achieved 96% accuracy in determining which novels were which. “The novel the computer model thought would be least likely (99.6% ‘sure’) to be a bestseller today,” Archer said, “was Jane Austen’s Emma, followed closely by most of Henry James.” If Archer’s findings are correct, the patterns the Stony Brook group identified among their successful novels would almost certainly fail to identify a bestseller today.

In addition to incorporating many more books and employing a more rigorous definition of success (such as appearance on the Times bestseller list), Archer’s study of the intrinsic qualities that might determine what makes a novel a hit is more promising than the Stony Brook study because it does not limit itself to matters of style. A former editor at Penguin, Archer is well aware that a host of external factors significantly affect a book’s sales figures, including marketing, social media — even being on the bestseller list itself. She undertook her doctoral research because she wanted to know whether there were intrinsic qualities shared by Times bestsellers in fiction. After training a computer model to identify 685 features of novelistic theme and style, and using this model to quantify the difference between roughly 3,000 bestsellers and random samplings from 10,000 novels that sold less impressively, she has concluded that there are.

Archer is confident that her model can not only identify the tiny percentage of books that are likely to make the Times list but also pick out the even smaller group of books that hit number one on it. In an email, she wrote that the model achieved “a class average prediction of 87% at picking number ones,” and “was most sure about Danielle Steel. It picked her novels out repeatedly, with the odds that her texts would be number one on the list at 99.9% surety.”

Her model also proved capable of predicting whether a novel would sell a million copies, becoming what she refers to as a phenomenon. This kind of book, Archer wrote, is “not favorable toward sex, lust and passion, bodies described, marital relationships or remote natural settings. It also doesn’t like emotional expression. What it does like are middlebrow thematics. Education, law, travel, money, cities, technology, childhood relationships, history and dining out” are all wise subjects to cover “if you’re penning a future bestseller.”

What does this mean for the future of writing and publishing? Will aspiring novelists test their work with bestseller software? Could the agents and editors who identify work with potential soon become obsolete, as publishers simply run their slush piles through computer models? Or, if we really let our dystopian vision go, what about a future in which authors disappear altogether, replaced by novel-writing robots producing nothing but Scandinavian crime stories and Jennifer Weiner novels? Perhaps, but we are a long way from even the first possibility. Archer does hope that her research will prove of value to writers, agents, and editors, who can use it as a tool to challenge or confirm their intuitions. Those intuitions — the author’s, the agent’s, and the editor’s — will continue to matter because it is out of them that innovation happens, even among bestsellers. “The maverick editor that wants to do something new, to start a new trend,” Archer said, “that’s not going to happen using technology. If you want to keep creating what [are already] bestsellers,” however, the computer models will “actually be very good” at selecting it.

What makes this work worth considering at this stage is not yet another round of debates over man versus machine, as some media coverage has suggested, but what it tells us about the increasing prominence of technology in art and aesthetics and the way it frames questions of taste. There are two basic kinds of techno-aestheticists at work in this field right now: on one side, the sociologists; on the other, the formalists. The sociologists include social networks like Facebook, Twitter, and Goodreads, which allow individuals to advertise their taste and expand it by tracking the endorsements of others, as well as more sophisticated sites like Netflix and Amazon, which employ algorithms that leverage social data in order to guide their users’ viewing and purchasing. The formalists include companies like Pandora and BookLamp, which analyze the intrinsic aspects of thousands or even millions of works in order to make recommendations based on a user’s taste and behavior, and companies like Music Xray and Epagogix, which advertise algorithms that supposedly allow them to predict whether songs and screenplays, respectively, will be hits.

What all these companies share is a belief that each of them knows what consumers want better than consumers know themselves, thanks to the power of hard quantitative analysis. They split on the question of whether to find that data outside or inside the work of art. In doing so, they reenact an old debate, if not the original debate, in the field of aesthetics. Is beauty in the eye of the beholder and subject to — even wholly composed of — external pressures exerted by critics, schools, fashion, and the desire to fit in with or distinguish oneself from one’s peers? Or is beauty a thing in the world, a version of truth, waiting to be discovered once the blinders have been removed, every bit as undeniably there as a stone in a bowl of beans? Relativism, especially when paired with a sociological perspective, has ruled the day in academic and critical circles for decades now, which is why it is revitalizing to see research like Archer’s and sites like Pandora. They might not be looking so much for the lovely as for the loved, but in their very commitment to the notion that our desire is stoked by the object itself rather than by social forces, this technology is not a threat to humanism but one of its last bastions.

James S. Murphy is a freelance writer working on a book entitled The Way We Like Now: Aesthetics in the Age of the Internet. Follow him on Twitter @magmods.