Tuesday, January 15th, 2013

Goodbye, Anecdotes! The Age Of Big Data Demands Real Criticism

Think of all the information encoded in the universe, from your genome to the furthest star, from the information that's already there, codified or un-codified, to the information pregnant in every interaction: "big" has become the measure of data. And our capacity to produce and collect Big Data in the digital age is very big indeed. Every day, we produce 2.5 exabytes of information, the analysis of which will, supposedly, make us healthier, wiser, and above all, wealthier—although it's all a bit fuzzy as to what, exactly, we're supposed to do with 2.5 exabytes of data—or how we're supposed to do whatever it is that we're supposed to do with it, given that Big Data requires a lot more than a shiny MacBook Pro to run any kind of analysis. "Start small" is the paradoxical advice from Bill Franks, author of Taming the Big Data Tidal Wave.

If you are a company, that makes sense; but if Big Data is the new big thing, can it answer any big questions about the way we live? Can it produce big insights? Will it beget Big Liberty, Big Equality or Big Fraternity—Big Happiness? Are we on the cusp of aggregating utilitarianism into new tyrannies of scale? Is there a threshold where Big Pushpin is incontrovertibly better than small poetry, because the numbers are so big they leave interpretation behind and acquire their own agency, as the digital age's answer to Friedrich Nietzsche—Chris Anderson—suggested in his "Twilight of the Idols" Big Data manifesto from four years ago? These are critical—one might say, "Big Critical"—questions. Is Lena Dunham the voice of her generation, as every news story about her HBO show "Girls" seems to stipulate, or is this just a statistical artifact within an aggregated narrative about women that's even harder to swallow?

But because we are all disciples of enumeration, the first question is: how big is Big? Well, an exabyte is a very big number indeed: one quintillion bytes—which is not as big as the exotically large quantities at the highest reaches of number crunching, numbers which begin to exhaust meaningful language, like yottabytes;* but it's getting there. If you were to count from one to a quintillion, and took a second to visualize each number (audibly counting would take a lot longer), your journey would last 31.7 billion years, and you'd still be less than halfway through a day in the digital life of the world. Or, imagine if each byte occupied a millimeter of visual space: every four days our modern Bayeux tapestry would cover a light year. By way of contrast, Claude E. Shannon—the "father of information theory"—estimated the size of the Library of Congress in 1949 at 12,500 megabytes, which is, by today's standards, a mere Post-it note of information in the virtual annals of human data, albeit a rather useful one.
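
For readers who like to check the arithmetic, those figures follow directly from the 2.5-exabyte-a-day estimate. A minimal back-of-envelope sketch (Python, with rounded constants):

```python
# Back-of-envelope checks for the figures above (rounded constants).
DAILY_BYTES = 2.5e18                     # ~2.5 exabytes produced per day
QUINTILLION = 1e18                       # one exabyte, in bytes
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Counting to a quintillion at one number per second:
years_to_count = QUINTILLION / SECONDS_PER_YEAR
print(f"{years_to_count / 1e9:.1f} billion years")            # ~31.7

# One millimeter per byte: how far does four days of data stretch?
LIGHT_YEAR_MM = 9.46e15 * 1000                                # meters -> millimeters
print(f"{4 * DAILY_BYTES / LIGHT_YEAR_MM:.2f} light years")   # ~1.06
```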

This latter historical tidbit (tidbyte?) comes from a thrilling, vertiginous essay by Martin Hilbert—"How much information is there in the 'information society?'"—that appeared in the August 2012 edition of Significance, the journal of the Royal Statistical Society (one of the indispensable publications of the digital age). Hilbert has lots of interesting data and, a fortiori, lots of interesting things to say; but one of his most interesting observations is that the big bang of digital data resulted in a massive expansion of text within the universe, and not, as you might have intuited, light years' worth of YouTube videos, BitTorrent copyright violations, cat pictures and porn:

In the early 1990s, video represented more than 80% of the world's information stock (mainly stored in analogue VHS cassettes) and audio almost 15% (on audio cassettes and vinyl records). By 2007, the share of video in the world's storage devices had decreased to 60% and the share of audio to merely 5%, while text increased from less than 1% to a staggering 20% (boosted by the vast amounts of alphanumerical content on internet servers, hard disks and databases.) The multimedia age actually turns out to be an alphanumeric text age, which is good news if you want to make life easy for search engines.

The bathos is forgivable—after all, what else do we spend a vast amount of our time doing other than searching on the Internet for information? But, happily, such "good news" amounts to more than better functionality, a "gosh, how convenient!" instrumental break in the pursuit of better instrumentalism; it also means we can ask bigger questions of Big Data. We can ask what the big picture actually means, and—no less important—we can criticize those who claim to know. We can, in other words, be "Big Critics"; we can do "Big Crit."

***

Big Crit got a big break in 2012, after a group of researchers at Bristol University's Intelligent Systems Laboratory teamed up with a couple of professors from Cardiff University's School of Journalism, Media and Cultural Studies. They began with a straightforward, but non-trivial question: how much gender bias is there in the news media—as suggested by the ratio of men to women used as sources in stories? The key to understanding why the path to answering this question is as important as the answer itself lies in asking: how much information, and of what kind, would you need to conclude that yes, the media is biased, or not?

Pre-Big Crit, you might have had pundits setting the air on fire with a mixture of anecdote and data; or a thoughtful article in The Atlantic or The Economist or Slate, reflecting a mixture of anecdote, academic observation and maybe a survey or two; or, if you were lucky, a content analysis of the media which looked for gender bias in several hundred or even several thousand news stories, and took a lot of time, effort, and money to undertake, and which—providing its methodology is good and its sample representative—might be able to give us a best possible answer within the bounds of human effort and timeliness.

The Bristol-Cardiff team, on the other hand, looked at 2,490,429 stories from 498 English-language publications over 10 months in 2010. Not literally looked at—that would have taken them, cumulatively, 9.47 years, assuming they could have read and calculated the gender ratios for each story in just two minutes; instead, after ten months assembling the database, answering this question took about two hours. And yes, the media is testosterone-fueled, with men dominating as subjects and sources in practically every topic analyzed, from sports to science, politics to even reports about the weather. The closest women get to an equal narrative footing with men is—surprise—fashion. Closest. The malestream media couldn't even grant women tokenistic majority status in fashion reporting. If HBO were to do a sitcom about the voices of this generation that reflected just who had the power to speak, it would, after aggregation, be called "Boys."
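
To get a feel for why the per-story cost drops from minutes of human reading to fractions of a second, here is a deliberately crude sketch: it tallies attributed quotes by the gendered pronoun nearest a reporting verb. This is a toy proxy only, not the Bristol-Cardiff pipeline, and the sample story is invented.

```python
# A deliberately crude gender-of-sources counter: tally "he said" vs "she said"
# style attributions. A toy proxy only; not the study's actual method.
import re
from collections import Counter

ATTRIBUTION = re.compile(r"\b(he|she)\s+(?:said|told|added|argued)\b", re.IGNORECASE)

def source_genders(story_text):
    tally = Counter(m.group(1).lower() for m in ATTRIBUTION.finditer(story_text))
    return tally["he"], tally["she"]

story = ('"The data is clear," she said. "I disagree," he said, '
         'and he added that more funding was coming.')
print(source_genders(story))   # (2, 1): two male attributions, one female
```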

But leaving that rather dismal quantitation aside, look at the study's parameters. This kind of analysis doesn't just end arguments; it buries them and salts the earth—unless you are prepared to raise the stakes with your own Big Data-mining operation. Either way, we've dispensed with what you or I think of the media—and the fact that everyone who consumes media gets to be a "media critic"—and empowered a kind of evidence-based discussion. Think about all the pointlessness that can be taken out of arguments about political bias in the media if you can, in real time, dissect and aggregate all the media coverage at any given moment on any question. In the possibility of providing big answers, Big Crit frees us to move the argument forward. If the data is so decisive on gender bias, we now have a rational obligation to ask why that is the case and what might be done about it.

This is why—as interesting as their finding on gender bias is—the key point of the recently published Bristol-Cardiff study, "Research Methods in the Age of Digital Journalism" (pdf), is the methodology. "They programmed a computer to ask simple questions of a vast amount of material and it came up with the same results as human coders would have," says Robert Lichter, Professor of Communications at George Mason University in Virginia, and a pioneer in media content analysis (and for whom I worked on several content analyses). "This is a huge breakthrough, and one we've all been waiting for because nobody could capture digital media with hand coding." In terms of seeing the big picture in the exponentially expanding digital mediaverse, it's like being able to go from Galileo's telescope to the Hubble. We didn't have a way of seeing the contents of the whole media system, says the lead author of the study, Nello Cristianini, Professor of Artificial Intelligence at Bristol University, because its vastness meant it could only be seen by automated methods. Now, we have the technology.
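
How do you show that an algorithm "came up with the same results as human coders would have"? A standard way to quantify that kind of agreement is Cohen's kappa, which measures how often two coders agree beyond what chance alone would produce. The sketch below uses made-up labels and is illustrative only; it is not the validation procedure reported in the study.

```python
# Cohen's kappa: agreement between two coders, corrected for chance.
# Illustrative only; the labels below are hypothetical, not the study's data.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Ten stories coded by a human and by the algorithm ("M" = male-dominated sourcing).
human   = ["M", "M", "F", "M", "M", "F", "M", "M", "M", "F"]
machine = ["M", "M", "F", "M", "F", "F", "M", "M", "M", "F"]
print(round(cohens_kappa(human, machine), 2))   # 0.78: strong, though not perfect, agreement
```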

***

Being able to see the media mediate in real time is a pretty big development in the annals of Big Data, yet, surprisingly or unsurprisingly, the media didn't quite grasp why this was a big story. Such coverage as there was—Wired UK—focused on the surface of the study: what it found but not how it found it. Only Gigaom's Derrick Harris seemed to understand the implications of the analytical breakthrough: "The experiment's techniques actually point to a future where researchers are spared the grunt work of poring through thousands of pages of news or watching hundreds of hours of programming, and can actually focus their energy on explaining."

In the not-so-distant past, when nations were kept informed by a handful of news organizations, figuring out how "the media" framed issues for public consumption was a laborious task for political scientists. Teams of trained coders would spend months—in some cases, years—going line by line through hundreds of stories and transcripts to draw out the patterns and calculate, using statistical software, what it all might have added up to. As a one-time content analysis serf in this hugely tedious process, I found that explaining how I assigned hundreds of different codes to the different kinds of information embedded, sentence by sentence, in a news story seemed to invite both mockery and pity. It seemed counterintuitive that turning words into numbers could tell us anything really meaningful about the role of the media, how it shaped society, or even what a story really said.

Yet the results of simple, pre-Big Data aggregation were often eye opening. Who knew, for example, that military force was advocated ten times as often as diplomacy in television coverage of foreign news by the US media after 9/11—and even after the ostensible defeat of the Taliban? Who knew that the US was urged to act unilaterally three times as often as with its allies? And were these trivial findings when set against public opinion surveys which showed that people who got their news from TV were more likely to support the War on Terror, or believe that Saddam Hussein possessed weapons of mass destruction? Unfortunately, we only saw these patterns after they had happened.

Perhaps one of the most startling and effective deployments of content analysis in recent years came through the research of Elvin Lim, published in his 2008 book, The Anti-Intellectual Presidency. Lim, now a political scientist at Wesleyan University, analyzed the content and complexity of annual presidential addresses and inaugural speeches to show how political rhetoric had devolved since the 18th century, beyond the kind of simplification and clarity one might expect as speech—and society—evolved.

By marrying computational to historical and political analysis (Lim, for instance, interviewed every living presidential speechwriter to elucidate and give context to his data), he showed that at a certain point simplification began to have an effect on the substantive content of political speech, so that (to somewhat oversimplify his findings) FDR was not simply a better rhetorician than, say, Bill Clinton or both Bushes; he was more substantive, in part, because he used language in a more stylistically complex way, and believed that presidential speech should be pedagogical. What happened over the last 50 years, said Lim, was that presidents and speechwriters deliberately abandoned oratory and embraced an anti-intellectual style. The result is demagoguery. It's an utterly fascinating, compelling book that absolves neither party from its sins against language. And it's jammed with data.

Though Cristianini and his team didn't deploy the level of analysis Lim did to presidential speech, they did run a general test of news media readability using Flesch scores, which assess the complexity of writing based on the length of words and sentences. Shorter, on this account, means simpler, although brevity doesn't always denote simplicity; for general purposes, Flesch scores tend to hit the mark: writing specifically aimed at children tends to score higher than, say, scholarship in the humanities, and these differences in readability tend to reflect differences in substantive content.
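
For the curious, the standard Flesch Reading Ease formula is easy to reproduce. The sketch below uses a crude vowel-group heuristic for counting syllables, so it is an approximation; the study's exact text preprocessing may differ.

```python
# Flesch Reading Ease: higher scores mean easier reading.
# The syllable counter is a rough vowel-group heuristic, good enough to illustrate.
import re

def count_syllables(word):
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 1))  # very easy
print(round(flesch_reading_ease(
    "Notwithstanding methodological heterogeneity, the phenomenon "
    "remains insufficiently characterised."), 1))    # very hard; scores can even go negative
```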

According to Cristianini et al.'s analysis of a subset of the media—eight leading newspapers from the US and seven from the UK, a total of 218,302 stories—The Guardian is considerably more complex a read than any of the other major publications, including The New York Times. Surprisingly, so is the Daily Mail, whose formula of "celebrity 'X' is" happy/sad/disheveled/flirty/fat/pregnant/glowing, and so on, apparently belies its complexity (when I mentioned this to a friend who is an intellectual historian and Guardian reader, she said, "actually, sometimes I do read the Mail"). Together, according to Comscore, these three publications are the most read news sources in the world.


Comparison of a selection of US and UK outlets based on their writing style.
From the study Research Methods in the Age of Digital Journalism (used with permission).

Cristianini et al. also measured the percentage of adjectives expressing judgments, such as "terrible" and "wonderful," in order to assess each publication's degree of linguistic subjectivity. Not surprisingly, tabloid newspapers tended to be more subjective, while the Wall Street Journal, perhaps owing to its focus on business and finance, was the most linguistically objective. Despite The Guardian and the Daily Mail's seemingly complex prose, the researchers found that, in general, readability and subjectivity tended to go hand in hand when they combined the most popular stories with writing styles. "While we cannot be sure about the causal factors at work here," they write, "our findings suggest the possibility, at least, that the language of hard news and dry factual reporting is as much a deterrent to readers and viewers as the content." When political reporting was 'Flesched' out, so to speak, it was the most complex genre of news to read, and one of the least subjective.
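
The subjectivity measure itself is conceptually simple: what share of a story's words are judgment adjectives? The toy sketch below uses a tiny hand-picked word list and invented sentences; the study's lexicon and tagging were more thorough, so treat this purely as an illustration of the idea.

```python
# Toy subjectivity measure: share of words drawn from a small list of judgment
# adjectives. Hand-picked and illustrative; not the study's lexicon or pipeline.
import re

JUDGMENT_ADJECTIVES = {"terrible", "wonderful", "awful", "brilliant",
                       "shocking", "stunning", "disastrous", "superb"}

def subjectivity_score(text):
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for w in words if w in JUDGMENT_ADJECTIVES)
    return hits / len(words) if words else 0.0

print(subjectivity_score("A terrible night for the wonderful champions."))   # ~0.29
print(subjectivity_score("The committee published its quarterly report."))   # 0.0
```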

When I asked Lim what he thought of the study via email, this, he said, was the pattern that stood out. "This means that at least in terms of the items included in the dataset, the media is opinionated and subjective at the same time that it is rendering these judgments in simplistic, unsubtle terms. This is not an encouraging pattern in journalistic conventions, especially given that the public appears to endorse it (given the correlation between the popularity of a story, its readability, and subjectivity)."

Even more provocative, when Cristianini et al. looked at the market demographics for the UK publications, they found "no significant correlation between writing style and topics, or between topics and demographics in respect to outlets. Thus, it appears, audiences relate more to writing style than to choice of topic – an interesting finding since prevailing assumptions tend to assume readers respond to both."

"This means that at least in terms of the items included in the dataset, the media is opinionated and subjective at the same time that it is rendering these judgments in simplistic, unsubtle terms."

That all this data mining points to the importance of style is just one of the delightful ways that Big Crit can challenge our assumptions about the way markets and consumers and the world work. Of course, we are still, in analytical terms, learning to scrawl. As Colleen Cotter—perhaps the only person to have switched from journalism to linguistics and then produced a deep linguistic study of the language of news—cautions, we need to be careful about reading too much into "readability."

"If 'readability,' is just a quantitative measure, like length of words or structure of sentences (ones without clauses)," she says via email, "then it's a somewhat artificial way of 'counting.' It doesn't take into account familiarity, or native or intuitive or colloquial understandings of words, phrases, and narrative structures (like news stories or recipes or shopping lists or country-western lyrics)." Nor do readability formulas take into account "the specialist or local audience," says Cotter, who is a Reader in Media Linguistics at Queen Mary University in London. "I remember wondering why we had to have bridge scores published in the Redding, CA, paper, or why we had to call grieving family members, and the managing editor's claims that people expect that."

There are other limits to algorithmic content analysis too, as Lichter notes. "A content analysis of Animal Farm can tell you what Animal Farm says about animals," he says. "But it can't tell you what it says about Stalinism."

And that's why we are still going to need to be smart about how we use Big Crit to interrogate Big Data: a computer cannot discern by itself whether the phrase "Kim Jong-un is the sexiest man alive" is satire or a revealing statement of cultural preferences. Problems like this scale up through bigger and bigger datasets: we are on the path to solving problems that can only be seen when a million data points bloom, while, simultaneously, confronting statistical anomalies that rise, seductively, like Himalayas of beans, from asking too many data points too many questions.
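
Those seductive anomalies are the familiar multiple-comparisons problem: ask enough questions of enough data and some answers will look striking purely by chance. A small simulation with made-up, pure-noise variables makes the point:

```python
# Multiple comparisons in miniature: correlate 1,000 pure-noise "variables"
# against a pure-noise "outcome" and see how many look impressively related.
import random

random.seed(1)

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

outcome = [random.gauss(0, 1) for _ in range(50)]
predictors = [[random.gauss(0, 1) for _ in range(50)] for _ in range(1000)]

spurious = sum(abs(pearson(p, outcome)) > 0.3 for p in predictors)
print(f"{spurious} of 1000 noise variables correlate with |r| > 0.3")  # a few dozen, by chance alone
```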

But the possibility of Big Crit restores some balance to the universe; we are not simply objects of surveillance, or particles moving in a digital Big Bang: we have the tools to look back, and see where we are at any given moment amid exabytes of data. Changing starts with seeing—and it starts with being able to see who is shaping the world as it is being shaped in a deluge of information. But it's hard not to think that there is more to all this than just media criticism about the news. Take an imminent development: the end of the common page view as a worthwhile measure of human engagement with online media. The data produced by you reading this page will soon evolve into a genome of your online self, which will be followed and analyzed as you flit from screen to screen, and used to predict your behavior and, of course, sell products.

The hidden meta-narratives starring the virtual you don't just demand criticism because they involve new forms of hidden power (and because all power is encoded with assumptions about what philosophers call "the good"); they renew criticism by revitalizing the notion of critical authority. We can, in fact, say very definite things about the textual and numerical nature of the digiverse, because we've stepped back into modernity through a new kind of technological flourishing; and, as with our previous adventures in modernism, it's a place where the critic is indispensable.


* Dear mathematics, please get around to coining "alottabytes;" you know you want to and it would make everybody chuckle when they wrote it. It would also make up for missing the opportunity to enumerate a "gazillion."


Trevor Butterworth is a contributor to Newsweek and editor-at-large for STATS.org. Image by winui, via Shutterstock.

Comments

scrooge (#2,697)

It's Science!
Right?
PS I think those Bristol guys are the same ones who came up with an algorithm to predict which songs are most likely to be hits.

Ralph Haygood (#13,154)

"This kind of analysis doesn't just end arguments it buries them and salts the earth – unless you are prepared to raise the stakes with your own Big Data-mining operation." Or unless you just believe whatever the hell you want, regardless of whether it's empirically defensible or even rationally coherent…you know, like millions of Americans!

Seriously, this kind of thing is interesting to people who are curious about how the world works (or doesn't work), but I wouldn't hold my breath waiting for it to make much difference to how the world works, unfortunately. The Dick Morrises of the world have little to fear from mere facts and logic.

A good point, Ralph, but I believe people are going to have a hard time defending prejudice when the possibility of answering certain kinds of questions with much greater precision increases. Of course, big data or big crit kinda works against the interests of punditry, so I can see a certain economically motivated unwillingness by, say, cable news, to endorse it. More worrying is that the public will react against technocratic determination. Data panics over vaccination or GMOs are rarely scientific in a meaningful sense, but rather expressions of political and moral rebellion. That can only get worse.

scrooge (#2,697)

I think you guys place too much faith in what looks to me like very rocky analysis. For example, the fact that the media quote males more than females (if it is so, which I suppose it probably is) is not necessarily an indication of gender bias in the media — it may well be, rather, that when, say, a company or organization is asked by the media to provide a spokesperson to comment on some damn thing, it provides a man rather than a woman. The bias is not, then, in the media.

Similarly, to purport to quantify "readability" or "subjectivity" by these means is very dubious. Gibbon, for example, would most likely score very low on "readability", because he may tend to write longer and more complex sentences; but those sentences are so transparent and clear that they are extremely readable. They're also often full of ironic subjectivity without his ever resorting to "emotional" adjectives.

Just because you come up with a number doesn't mean you've actually proved anything, you've just given your theory a semblance of logic (which it may well not have).

I suspect we are going to see a lot of bosh backed up by "big data analysis", and many people will be taken in because "It's Science!" I would be on the side of those "gullible" skeptics you both seem to disparage so.

As I noted, the key aspect of the Cristianini study is that it was able to validate a methodology – that an algorithm could achieve what a person trained in content analysis would achieve given the same textual data and the same query. I also quoted Dr. Cotter's caveats about readability and looked at Professor Lim's careful use of content analysis as data to be backed up with careful historical and textual analysis. So I don't think the piece simply leaps like a happy bunny over the obstacles in all of this. What it does not shy from is pointing out that such big data enables new kinds of critical questions to be asked of our alphanumeric world; the meta-narratives – contra Lyotard – need to be taken seriously!

scrooge (#2,697)

@Trevor Butterworth@twitter Fair comment, to be sure. Trouble is, putting in disclaimers usually doesn't prevent these analysts from drawing dubious conclusions, of which I'm sure we'll see plenty as "Big Data" gathers momentum. Grist to the Awl's "It's Science!" mill.

In the early days of Amazon when they were still mostly books, they used to offer suggestions supposedly based on what you looked at/bought. Every suggestion I ever got was to read Playboy. Data/marketing analysis at its best.

Carlos Good (#245,770)

Now, a lot of people can agree with you (the PRISM affair has woken up some minds).

RogerPaul (#247,902)

Part of the fear is in the data mining itself. You touched on the post-9/11 issues of data, and none of us realized just where we would be by 2013. When you read some of the disclosures, it really is scary, even if you are an innocent citizen. "Unacceptable" searches could land a person in hot water even though the true intent behind the search was reasonable. IMO it is the need for "algorithmic content analysis" which gets us in trouble, as it can't tell good from bad.
