Verification is the Key to Successful Crowdsourcing

Despite the immense increase in the accessibility of digital content creation, publishing in the digital sphere still has its limitations, and content creators cannot always meet the material and financial demands of their endeavours on their own. These challenges present themselves in manifold ways, most often stemming from time constraints, gaps in individual skillsets, and the money needed to fund one’s ventures.

Crowdsourcing was thus born as a way of tapping into the external talents, creative insights, and knowledge base of broader online communities to further one’s projects and ensure audiences receive what they want most. This essay will investigate how crowdsourcing has been adopted by journalists and by those working toward the digitization of print books. By comparing how crowdsourcing is used in these two endeavours, I hope to inspire discourse on how crowdsourced journalism may be more effectively implemented in the future.

 

Crowdsourcing as Commons-based Peer Production

Crowdsourcing in the present decade has witnessed the rise of volunteer captioning, translation, and citizen journalism, proving to be a consistently employed strategy for content creation. As Yochai Benkler and Helen Nissenbaum discuss at length in their article “Commons-based Peer Production and Virtue”, crowdsourcing is akin to what they define as commons-based peer production:

 

“Facilitated by the technical infrastructure of the Internet, the hallmark of this socio-technical system is collaboration among large groups of individuals, sometimes in the order of tens or even hundreds of thousands, who cooperate effectively to provide information, knowledge or cultural goods without relying on either market pricing or managerial hierarchies to coordinate their common enterprise.”

 

Indeed, the above examples (captioning, translation, and citizen journalism) fit this definition: the work is undertaken collaboratively, with the aim of enhancing the spread of knowledge and culture, and freely, without the expectation of financial compensation. Drawing on Wikipedia as an early example, Benkler and Nissenbaum illustrate how peer production begins with “a statement of community intent” and achieves its ends via “a technical architecture that allows anyone to contribute, edit and review the history of any document easily.”

 

Digitizing Books, One Word at a Time

First introduced as Google Print at the 2004 Frankfurt Book Fair, Google Books has now scanned over 25 million book titles using Optical Character Recognition (OCR) technology. OCR converts images of typed, handwritten, or printed text into electronic text, which Google Books then stores in its digital database. Since its inception, Google Books has slowed its output for two primary reasons: copyright violations, and errors in scanning relating to the OCR process. Such errors include pages that are unreadable, upside down, crumpled, or blurry, as well as fingers obscuring text.

To begin remedying these scanning errors, Google acquired reCAPTCHA in 2009. reCAPTCHA is an evolution of CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) in which the errors of the OCR scanning process are corrected by users of websites. We have all filled out web-based forms containing CAPTCHAs at some point, and as their name implies, their purpose is to ensure that the user filling out the form is a human and not a computer program.

Image: reCAPTCHA example, from Wufoo Team’s Flickr

The co-founder of reCAPTCHA, Luis von Ahn of Carnegie Mellon, explains reCAPTCHA’s inception in his TED Talk “Massive-scale online collaboration”:

 

“Approximately 200 million CAPTCHAs are typed every day by people around the world…and each time you type a CAPTCHA, essentially you waste 10 seconds of your time. And if you multiply that by 200 million, you get that humanity as a whole is wasting about 500,000 hours each day…”

 

Von Ahn wished to put these accumulated hours spent typing CAPTCHAs to a secondary use, ultimately benefiting the global collective knowledge base. His innovation means that for every CAPTCHA an individual enters on a web form, books are being digitized one word at a time. He goes on to illustrate how reCAPTCHA draws on commons-based peer production to correct text from books that OCR has difficulty deciphering accurately:

“OCR is not perfect, especially for older books where the ink has faded and the pages have turned yellow… Things that were written more than 50 years ago, the computer cannot recognize about 30 percent of the words. So what we’re doing now is taking all of the words that the computer cannot recognize and getting people to read them for us while they’re typing a CAPTCHA on the internet.”

If you’ve noticed that CAPTCHAs now contain two words instead of one, it isn’t to speed up digitizing efforts. Rather, a two-step process is being followed. The first word is one that has already been verified, and typing it correctly confirms that the individual is human. The second word is one still requiring verification, and it is shown in CAPTCHAs to ten other people as a means of ensuring the correct digitization of a text. As of 2011, 750,000,000 people (roughly 10% of the world’s population) had helped to digitize at least one word of a book because of reCAPTCHA.
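The two-step process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not reCAPTCHA’s actual implementation: the class name UnknownWord, the normalize helper, and the 60% agreement threshold are all hypothetical, and the only figure taken from the essay is that each unverified word is shown to ten people.

```python
from collections import Counter

# From the essay: each unverified word is shown to ten people.
REQUIRED_RESPONSES = 10


def normalize(text):
    """Compare answers case-insensitively, ignoring surrounding whitespace."""
    return text.strip().lower()


class UnknownWord:
    """Hypothetical container for human readings of one word OCR could not read."""

    def __init__(self, word_id):
        self.word_id = word_id
        self.votes = Counter()

    def submit(self, control_expected, control_typed, unknown_typed):
        """Record one CAPTCHA response; return True if the user passed as human."""
        # Step 1: the already-verified control word decides whether this is a human.
        if normalize(control_typed) != normalize(control_expected):
            return False  # failed the human test, so the second answer is discarded
        # Step 2: only answers from users who passed count toward the unknown word.
        self.votes[normalize(unknown_typed)] += 1
        return True

    def resolved(self, min_agreement=0.6):
        """Return the agreed reading, or None if there is no consensus yet.

        The 0.6 threshold is an assumption made for this sketch.
        """
        total = sum(self.votes.values())
        if total < REQUIRED_RESPONSES:
            return None
        answer, count = self.votes.most_common(1)[0]
        return answer if count / total >= min_agreement else None


# Usage: eight readers agree, two misread the word, one fails the control word.
word = UnknownWord("page-12-word-7")
for typed in ["morning"] * 8 + ["moming"] * 2:
    word.submit("overlooks", "overlooks", typed)
word.submit("overlooks", "wrong", "mourning")  # ignored: control word was wrong
print(word.resolved())  # morning
```

The point of the design is that no single contributor is trusted: an answer counts only after the contributor passes the control word, and a reading is accepted only once enough independent answers overlap.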

 

Journalism and Crowdsourcing: The Truth is out There

Traditionally, journalists have had to rely either on sources on the ground or on contacts acquired via personal connections or external professional networks. It is undoubtedly beneficial to have a pool of experts at one’s disposal to provide statistical context, ideological interpretation, legal knowledge, etc. Likewise, it is enviable to have dedicated individuals who interview scholars, document raw footage, and correspond with locals as events unfold in real time. However, not all events are created equal – “home-grown” stories vs. international reporting, for example – and not all events are known beforehand (e.g., natural disasters), so on-the-ground coverage is not always possible. Additionally, budgets do not always permit live reporting, especially for smaller publications already stretched thin paying in-house staff, freelance writers, and photographers.

The consequences of these monetary and time constraints most often manifest in the type of stories chosen for coverage, and the level of depth awarded to their investigation. As Madelaine Drohan explains in her report Does serious journalism have a future in Canada?:

 

“When time is at a premium, other parts of the job inevitably fall by the wayside, like the research required for accuracy, context and balance. Journalists and their editors are tempted to avoid harder, longer projects that require both money and time in favour of quick and easy hits…”

 

Drohan also states in her report that time-crunched journalists are “prone to circulating misinformation” and are “more inclined to put opinion over fact.” Thus, new solutions such as crowdsourcing journalistic efforts serve to combat these stresses to ensure timely coverage and the enhanced accuracy of details.

Crowdsourcing can be observed in multiple dimensions, from interviewing to corroborating details, and from video footage to audio recordings. Certainly, Benkler and Nissenbaum’s discourse on commons-based peer production applies to these activities, which align with the two core characteristics of peer production itself: decentralization and the use of social cues and motivations (rather than finances and commands) to drive and navigate the actions of participating agents. In the most direct sense, these efforts are inspired by a “call to action” from the news publications we read regularly, inviting us to share our photos and videos and our eyewitness accounts, and to correct any errors or typos noted in the articles posted.

Image: as seen on the BBC at the end of an article regarding wildfires in Israel

Image: the end of an article on the Vancouver Sun’s website, inviting readers to submit comments regarding typos and missing information

Returning to Drohan’s report on journalism, it becomes apparent why news outlets are so dependent on peer production to source details and footage, and to amend the content of articles – the financial limitations and time constraints plaguing the 24-hour news cycle prove challenging, even for large-scale outlets like the BBC. Recognizing these demands, news outlets are honest about their inability to rapidly turn around factually correct and investigative pieces, inviting readers to wear the badge of citizen journalist in order to fill in the missing pieces and to provide refutation whenever necessary.

Another side of commons-based peer production in journalism concerns news outlets and government intervention. In his TED Talk titled “Citizen Journalism”, journalist Paul Lewis powerfully illustrates how journalism benefits from crowdsourcing to expose truths being covered up by government bodies. His talk focuses on two stories he wished to investigate further, involving the controversial deaths of Ian Tomlinson and Jimmy Mubenga. In both instances, authorities released details of the deaths in a dubious, misleading fashion. As he explains, his decision to put out a call to action on Twitter stemmed from the following:

 

“For journalists, it means accepting that you can’t know everything, and allowing other people, through technology, to be your eyes and your ears… And for other members of the public, it means not just being the passive consumers of news, but also co-producing news… This can be a very empowering process. It can enable ordinary people to hold powerful organizations to account.”

 

Upon receiving tweets, emails, and raw footage from members of the public surrounding both of the stories above, Lewis was able to determine the truth behind Tomlinson’s death – he was knocked to the ground by police with a baton to the back of his leg – as well as Mubenga’s death – he was held down by three airplane security personnel until he lost consciousness.

While the truth behind these two cases undeniably came to light thanks to commons-based peer production, it is crucial to note that discretion is necessary when relying on crowdsourced information: material gleaned via social media messaging and email needs to be combed for bias and falsehoods, and assessed for credibility, to the same extent as in traditional journalism. As Lewis asserts: “Verification is absolutely essential.” Similarly, Anahi Ayala Iacucci of the Standby Task Force, a non-profit dedicated to providing a humanitarian link between the digital world and disaster response, explains the necessary processes of judgment and filtering when making sense of the deluge of information shared on social media: “Crowdsourced information is a lot of noise… not always comprehensible, not always relevant, not always precise or accurate, and that’s still something journalists need to do [curate and verify].”

Because some individuals aim to spread false information and divert attention elsewhere – as well as to outright confuse and deceive – I believe it is necessary to reconsider the means through which discretion is exercised and information is corroborated. As Benkler and Nissenbaum explain, commons-based peer production must seek to achieve a system of checks and balances in order for a project’s or task’s goals to be met:

“It enforces the behavior it requires primarily through appeal to the common enterprise in which the participants are engaged, coupled with a thoroughly transparent platform that faithfully records and renders all individual interventions in the common project and facilitates discourse among participants about how their contributions do, or do not, contribute to this common enterprise.”

When crowdsourcing in journalism fails, it is because of the very means through which information is sourced. Social media may be transparent in that it is a public platform, but it lacks transparency in terms of traceability and faithful recording; individuals do, after all, delete posts or accounts and amend details shared, but once a post has been shared and then read and re-shared, the damage is already done. Moreover, not all participants share the same motivations where journalistic efforts are concerned. As I said above, some people are out to confuse, mislead, or outright lie about events because of wide-ranging personal interests.

 

Reading Crowdsourced Journalism and reCAPTCHA Together

The success of Luis von Ahn’s reCAPTCHA efforts is contingent on the meticulous method of verification he imposes; showing CAPTCHAs to ten different individuals to ensure their correct digitization demonstrates the level of checks and balances necessary to render commons-based peer production effective. Returning again to Benkler and Nissenbaum, one can observe this systematic order in their example of Wikipedia: “The ability of participants to identify each other’s actions and counteract them – that is, edit out ‘bad’ or ‘faithless’ definitions – seems to have succeeded in keeping this community from devolving into inefficacy or worse.” In the case of reCAPTCHA, this identification of actions can be understood as the text each person types into a web form, and the editing as the check performed when verifying which CAPTCHAs yield overlapping interpretations.

Unfortunately, peer-produced journalism in its present state does not receive the same level of scrupulous verification. With news stories being churned out in incomplete variations to keep pace with the demands of the 24-hour news cycle, and news being heavily aggregated by sites like BuzzFeed and Huffington Post, proper checks of facts and footage are not consistently being conducted prior to publication. Moreover, people are more likely to share a story than to read it, and online reading completion rates aren’t always reassuring, which underscores how serious the mass circulation of unverified news has become.

Thus, there is a great need for peer-produced journalism to implement more thorough systems of verification, and to shift its focus from speed of delivery to accuracy of reporting. Just as the Standby Task Force works to help “filter the noise” of crowdsourced coverage to produce accurate mapping during crisis response, online news outlets, too, should consider partnering with similar external organizations to better corroborate details and “filter out” incorrect and misleading information.


Works Cited

Benkler, Yochai and Nissenbaum, Helen, “Commons-based Peer Production and Virtue”, https://www.nyu.edu/projects/nissenbaum/papers/jopp_235.pdf

Heyman, Stephen, “Google Books: A Complex and Controversial Experiment”, http://www.nytimes.com/2015/10/29/arts/international/google-books-a-complex-and-controversial-experiment.html?_r=1

Weir, David, “Google Acquisition Will Help Correct Errors in Scanned Works”, http://www.cbsnews.com/news/google-acquisition-will-help-correct-errors-in-scanned-works/

Wufoo Team, https://c2.staticflickr.com/4/3598/3683064794_95824f2135.jpg

von Ahn, Luis, “Massive-scale online collaboration”, https://www.youtube.com/watch?v=-Ht4qiDRZE8&t=609s

Drohan, Madelaine, “Does serious journalism have a future in Canada?”, http://www.ppforum.ca/sites/default/files/PM%20Fellow_March_11_EN_1.pdf

Lewis, Paul, “Citizen journalism”, https://www.youtube.com/watch?v=9APO9_yNbcg

Death of Ian Tomlinson, https://en.wikipedia.org/wiki/Death_of_Ian_Tomlinson

Unlawful killing of Jimmy Mubenga, https://en.wikipedia.org/wiki/Controversies_surrounding_G4S#Unlawful_killing_of_Jimmy_Mubenga

The importance of crowdsourced mapping in journalism, https://www.youtube.com/watch?v=uSrpZ8UXyzw

Standby Task Force, http://www.standbytaskforce.org/about-us/

Dewey, Caitlin, “6 in 10 of you will share this link without reading it, a new, depressing study says”, https://www.washingtonpost.com/news/the-intersect/wp/2016/06/16/six-in-10-of-you-will-share-this-link-without-reading-it-according-to-a-new-and-depressing-study/

Manjoo, Farhad, “You Won’t Finish This Article”, http://www.slate.com/articles/technology/technology/2013/06/how_people_read_online_why_you_won_t_finish_this_article.single.html

2 Replies to “Verification is the Key to Successful Crowdsourcing”

  1. I’d like to thank the author for writing this very timely piece about the need for greater verification in crowdsourced activities such as translation, journalism, and captioning.

    The author begins by defining crowdsourcing as a “Commons-based peer production” activity. The essay then discusses the usefulness of reCAPTCHA, especially in its application to the digitizing of books. The author provides a very straightforward explanation of reCAPTCHA as a two-step verification process that digitizes the text of books that OCR scanners cannot read. From here, the author moves on to citizen journalism as a means of crowdsourcing. The author highlights the ever-increasing demands on professional journalists, which leave them with little option but to turn to the public for assistance with their stories. While the author draws attention to the growing number of Call-to-Action (CTA) buttons on newspaper (and other) websites, they also call for caution when it comes to the credibility of crowdsourced information. The author concludes by comparing the ease of verification with a tool like reCAPTCHA with the challenges of verification in journalism.

    This essay certainly highlights what CAPTCHA creator Luis von Ahn calls “the power of the crowd,” especially in its initial section on reCAPTCHA (from this video in the essay). I was pleased to learn about the mechanisms behind reCAPTCHA, and how a seemingly simple process designed to confirm that the user is a human, and not a computer/robot, has aided in the successful digitization of “the equivalent of at least 17,600 books.”

    While the author has certainly provided some useful videos and links on the topic of bias and credibility in citizen journalism, including a link to, or snippet of, John Oliver’s video on journalism could have helped readers understand the move towards crowdsourcing in journalism as a response to higher expectations of professional journalists. I was pleased that the author stressed the need for increased transparency on ‘public platforms,’ which include websites like Wikipedia and Facebook.

    I thought that The Washington Post article that the author linked to summarized the situation of unverified sharing of articles very well: “Among the many phenomena we’d tentatively attribute, in large part, to the trend [of sharing articles without reading them]: the rise of sharebait (nee clickbait) and the general BuzzFeedification of traditional media; the Internet hoax-industrial complex, which only seems to be growing stronger; and the utter lack of intelligent online discourse around any remotely complicated, controversial topic.” While the latter point may be called into question, I thought the term ‘BuzzFeedification’ of media was both interesting and directly relevant to the author’s concluding paragraph about the need to verify digital news content, be it a short listicle or a lengthier piece.

    I was hoping that the author would provide some more in-depth suggestions on how to make verification in citizen journalism an easier process. The author mentions the Standby Task Force, which helps “filter noise” online, and hints that online news outlets need to work with similar organizations – but I wonder if that is the only option available? Given that there are individuals who write satirical news pieces or, as the author has brought up, write purposely-false articles, an internal verification by the author seems unlikely. Facebook reportedly has a CTA that allows its users to report a false media story, and is said to be working on more ways to combat false information circulating on its website, so there is a move towards confirming whether online news is accurate and reliable.

    Lastly, while the structure of the essay was clearly laid out in the introductory paragraphs, the author may have better guided readers by providing transition sentences from one section to the next. While the section headings may have worked in a similar function, it may have been more effective to include an additional sentence leading the reader onwards.

  2. This piece offers the beginnings of a solution to some of the problems plaguing journalism today. It suggests the benefits that would be achieved by bringing some of the elements of commons-based peer production (CBPP) to journalism (a similar argument could be made about non-fiction publishing more broadly). However, while it explains CBPP and offers a single example (CAPTCHA), it does not go into much detail about what this might look like, or what challenges must be overcome to achieve it. It also conflates crowd-driven dissemination (virality) with commons-based production. Perhaps these two are complementary in the author’s mind, but this complementarity is not expressed in the essay. A well-researched piece, with some more thinking still needed to flesh out the idea.
