Topic modeling for BISAC code selection

From the first book we read, or, more often, have read to us, we begin to form preferences. We find authors we like, writing styles and reading levels we enjoy, plotlines that compel us to keep reading, and characters we connect with. But underlying all of these nuanced preferences is one very specific penchant: genre.

The genre of a book is almost always the first criterion a reader considers. When we answer the question, “what kind of books do you like?” we, more often than not, respond with a list of our preferred genres. It is the defining feature that creates our personal dichotomy of books we like and those we do not. Occasionally, this can be overturned by a preference for an author whose writing spans genres, but this is rare. While J.K. Rowling may have the power to draw readers to any kind of book she might produce, most authors could not draw equal readership for, say, their writing in the two genres of romance and true crime.

Think even of the setup of a typical bookstore. If you live in Canada, you are likely imagining a Chapters-Indigo, and you would confirm that the store is laid out in sections based on the genre of the books found within. The same is true online, as we browse websites that offer catalogues of books sorted by genre, categorized into the neat and tidy packages of biography, fiction and literature, science fiction, and more. Whether wandering into a brick-and-mortar store or browsing an online retailer’s website, we start with genre and go from there.

Understanding, then, how important genre is to a reader’s selection and purchase of a book, it would seem logical to assume that the system by which books are assigned a genre is consistent and standardized, with strict guidelines and processes in place to ensure that the genre being selected is accurate—reflective of the book’s content. Unfortunately, this assumption is untrue.

The process by which books are categorized into genres begins with a publisher’s selection of what is known as a Book Industry Standards and Communications (BISAC) code, which is entered into the publisher’s metadata system—likely ONIX—to be shared with booksellers and retailers along with the rest of the book’s data (e.g. title, author, price, etc.).

These codes are created and managed by the Book Industry Study Group’s (BISG) Subject Codes Committee, are updated on an annual basis, and run the gamut from the vague (e.g. BIO000000 BIOGRAPHY & AUTOBIOGRAPHY / General) to the specific (e.g. CRA044000 CRAFTS & HOBBIES / Needlework / Cross-Stitch) [1]. As of 2014, the BISG had created 52 subject headings under which BISAC codes are listed, many of which have been carefully cross-listed to ensure that no two codes are redundant.

Despite this extensive and detailed list being developed with apparent care by the BISG, no definitions for the various subject headings or codes are provided, very little guidance is given on selecting BISAC codes based on a book’s content, and no categorization tools or aids are supplied.

The BISG website certainly does little to assist publishers in their selection of BISAC codes. In the FAQ section of its BISAC Tutorial and FAQ page, the BISG answers the question “How do I choose the BISAC Subject Heading for a specific book?” with the following vague and ineffectual response:

The first step in determining the proper heading for a book would be to identify which of the 52 major areas within the list is most appropriate for the title. Once that section is identified, look for the term that most closely fits the content of the book. If the title has numerous facets, it is recommended that the process be repeated for other relevant major sections. If database systems are sophisticated enough, a recommendation is to do a Keyword or Find search on the entire list in order to identify all the terms that may be appropriate for the book. This is especially effective if it is difficult to determine the proper major section for the term one imagines would be used. This will also help alert the user to cases where similar subjects appear in different sections to reflect different ways of approaching the topic (e.g., “HEALTH & FITNESS / Sexuality”, “PSYCHOLOGY / Human Sexuality”, “RELIGION / Sexuality & Gender Studies”, “SELF-HELP / Sexual Instruction”, not to mention related subjects under JUVENILE FICTION, JUVENILE NONFICTION, and SOCIAL SCIENCE). [2]

Beyond this, the only concrete documentation provided to assist publishers in their selection of BISAC codes is an optional download of a document called Best Practices for Product Metadata. The document reminds publishers that “BISAC subject should be assigned based on book’s content—not on the merchandising plans of the publisher” [3] and offers limited (and commonsense) advice, including:

  • There should be consistency across formats. In other words, hardcover, paperback, mass market, large print, audio books, and e-books should all have the same BISAC subjects.
  • Works of juvenile nonfiction should be assigned subjects in the JUVENILE NONFICTION section only. Collections containing both juvenile nonfiction and juvenile fiction may also be assigned subjects in the JUVENILE FICTION section.
  • Use subjects in the FOREIGN LANGUAGE STUDY section for works about the languages specified, whether these works are of an instructional, historical, or linguistic nature. Do not use subjects in this section to indicate the language of a work: works should be classified based on their subject content without regard to the language in which they are written (of course, if a work is about a language and written in that language, a subject in this section should be assigned) [4].

Only two small pieces of advice offered even remotely pertain to a book’s content and its relation to selecting a BISAC code. They are:

  • Use subjects in the HEALTH & FITNESS section for works aimed at nonprofessionals. For scholarly works and/or works aimed at medical or health care professionals, use subjects in the MEDICAL section.
  • Certain other subject combinations also apply to titles intended for a lay person vs. those intended for a professional. These combinations include Nature vs. Science, Self-Help vs. Psychology [5].

The rest of the information provided is focused on the entry process for the BISAC codes into metadata systems such as ONIX, and other administrative or clerical tasks associated with BISAC code selection (e.g. how many codes you can select, the fact that a general code is not required if a more specific code from the same subject heading is selected, etc.).

In the absence of industry standards, what’s left is publisher intuition and interpretation, with each publisher (or their proxy) applying their own definitions to the subject headings and selecting BISAC codes as they see fit. The result is an unorganized system with no consistency across the millions of books published and released into the North American market each year.

The question, then, is how we fix this broken system and implement a consistent process for selecting BISAC codes. The first solution that springs to mind is to define the BISAC subject headings and formulate guidelines outlining which elements within a book’s content correspond to specific subject headings. Although this would likely have some positive impact on the consistency of the BISAC codes being assigned by publishers, it would not be enough, because it does not fully resolve the key issue with the current system: the potential for human interpretation and bias. Even with guidelines in place, a system that relies on individuals to understand, interpret, and apply standards is bound to experience variance in its output. The answer, then, must include moving the process of selecting BISAC codes out of the hands of individuals and into the hands of technology, where subjective interpretation is replaced by the objective and programmable application of rules and standards.

Enter natural language processing. A field of study that combines computer science, artificial intelligence, and linguistics, natural language processing (NLP) is concerned with the interaction between computers and human (natural) languages. It encompasses numerous computer-performed “tasks,” such as speech recognition (converting speech to its textual equivalent), translation (converting text from one language to another), and—most relevant to the issue of selecting a BISAC code—topic modeling (determining a document’s topic based on elements within the text) [6].

According to Princeton researcher David M. Blei, topic modeling is a statistical method of analyzing the words of original texts “to discover the themes that run through them, [and] how those themes are connected to each other.” [7] The method works by having a computer process a text and identify specific patterns within it.

These patterns can include the recurrent use of certain words or phrases, or the repeated appearance of relationships (in terms of grammar, order, and position) between words or phrases. They are measured and quantified so as to provide the statistical likelihood that a text pertains to a specific theme or subject matter [8].

In order for topic modeling to work, the computer processing the text relies on a set of algorithms or rules defining which patterns it should be looking for and which topics correspond with these patterns. These rules are generally created using lexical databases and other linguistic (syntactic and semantic) information, which, for the scope of this essay, will not be discussed in detail.
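To make this concrete, below is a minimal, illustrative sketch of one widely used unsupervised topic-modeling algorithm, latent Dirichlet allocation (LDA), using the Python library scikit-learn. The sample texts, the number of topics, and the parameter values are assumptions chosen purely for demonstration, not drawn from any publisher’s actual workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus; a real run would use full book texts.
docs = [
    "stitch thread fabric needle pattern embroidery hoop",
    "wizard spell dragon quest kingdom magic sword",
    "needle cross-stitch pattern thread colour chart fabric",
    "dragon prophecy magic battle sword kingdom",
]

# Count word occurrences, then fit an LDA model with two topics.
vectorizer = CountVectorizer(stop_words="english")
word_counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic_mix = lda.fit_transform(word_counts)

# Show the words that most strongly characterize each discovered topic.
terms = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {topic_id}: {', '.join(top_terms)}")

# doc_topic_mix[i] gives the proportion of each topic in document i,
# i.e. the statistical likelihood that the text pertains to each theme.
```

Notably, LDA learns its word patterns statistically from the corpus itself rather than from hand-built rules, which is part of what makes the approach attractive at scale.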

In an introductory paper discussing topic modeling, Blei goes on to describe the benefits of topic modeling by asking readers to “[i]magine searching and exploring documents based on the themes that run through them” [9].

We might “zoom in” and “zoom out” to find specific or broader themes; we might look at how those themes changed through time or how they are connected to each other. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme [10].

Blei’s description of how topic modeling relates to search and discoverability sounds strikingly similar to the way readers already search for and select books. Replace “theme” with “subject heading,” and interpret zooming in as selecting a more specific BISAC code within a subject heading, and topic modeling and the selection of a BISAC code are mirrored processes—the only difference is that, at present, one is completed by humans and the other by computers.

Applying the process of topic modeling to the selection of BISAC codes, then, we would begin by developing a standard for the linguistic, semantic, and syntactic patterns associated with specific BISAC codes. This could be done in multiple ways, but the most obvious would be through an examination of the patterns present in a massive corpus of books already identified as having a specific BISAC code.

With these patterns identified and thus the topic modeling rules or algorithms set, a publisher could run the text of a book through topic modeling software. The computer would process the text of a book, measuring and recording the patterns it observes, and then, using the frequency and proportions of these patterns, identify the book’s subject matter, and in turn, the most appropriate BISAC code.
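As a sketch of how that two-step workflow might be prototyped, the example below treats BISAC assignment as supervised text classification: a model is fitted to books whose BISAC codes are already known, then asked to score a new manuscript. The training snippets and the model choice (TF-IDF features with logistic regression) are illustrative assumptions, not a description of any existing industry tool.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training corpus: texts of books whose BISAC codes were
# assigned previously (reduced here to short stand-in strings).
train_texts = [
    "counted cross-stitch charts, embroidery floss, aida fabric, needlework tips",
    "a life recounted: childhood, career, family, and reflections on public service",
]
train_codes = ["CRA044000", "BIO000000"]  # codes taken from the BISAC list cited above

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_codes)

# A new, unclassified manuscript (again, a stand-in string).
new_manuscript = "beginner needlework projects with step-by-step stitch charts"
predicted_code = model.predict([new_manuscript])[0]
confidence = model.predict_proba([new_manuscript]).max()
print(predicted_code, round(confidence, 2))
```

In practice the training corpus would need to be large and the predicted probabilities reviewed, but the shape of the process matches the one described above: learn the patterns from already-coded books, then apply them to new ones.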

By placing this task in the hands of computers, not only would the process become extremely expedient, but the consistency and impartiality with which BISAC codes are selected would also be drastically increased. Without human interference, BISAC codes would be applied solely based on the content of a book, and the biases, interpretations, and marketing ploys of publishers would be removed from the process entirely.

As an added bonus, with the right tools, the topic modeling software could be linked directly to the ONIX metadata for each book, feeding its selection straight into the database. Each year, when the list of BISAC codes is updated, the software could automatically re-process the text and update the BISAC codes where necessary or appropriate. Currently, because of the manual process used to select BISAC codes, the codes assigned to books are never updated, even as the list itself changes. Making BISAC code selection an automatic computer task would keep ONIX genre metadata up to date and consistent, and would prevent books assigned a now outdated or discontinued BISAC code from falling off the radar or being excluded from search results and retailer sorting algorithms (e.g. Amazon’s recommendations or subject/genre categorizations) that depend on or factor in BISAC codes. Topic modeling and its application to BISAC code selection is an obvious fix to a system that so clearly is not functioning.
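If such software did feed its output straight into ONIX, the final write-back step might look roughly like the sketch below, which appends a BISAC Subject composite to each product record in an ONIX 3.0 file (in ONIX, SubjectSchemeIdentifier 10 denotes the BISAC Subject Headings scheme). The file names are hypothetical, and real ONIX feeds use XML namespaces, which are omitted here for brevity.

```python
import xml.etree.ElementTree as ET

def add_bisac_subject(product: ET.Element, bisac_code: str) -> None:
    """Append a BISAC <Subject> composite to an ONIX 3.0 <Product> record."""
    detail = product.find("DescriptiveDetail")
    subject = ET.SubElement(detail, "Subject")
    # SubjectSchemeIdentifier 10 = BISAC Subject Heading (ONIX codelist 27).
    ET.SubElement(subject, "SubjectSchemeIdentifier").text = "10"
    ET.SubElement(subject, "SubjectCode").text = bisac_code

# Hypothetical usage: onix.xml is the publisher's existing feed, and the
# topic-modeling step has produced a code for each product.
tree = ET.parse("onix.xml")
for product in tree.getroot().iter("Product"):
    add_bisac_subject(product, "CRA044000")  # code supplied by the model
tree.write("onix_updated.xml", encoding="utf-8", xml_declaration=True)
```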

With publishers everywhere clamoring about the volume of books flooding the market and the accompanying issue of discoverability, metadata—which includes BISAC codes—and the importance of its accuracy have risen to the forefront of the conversation [11]. Laura Dawson, book industry veteran and Product Manager, Identifiers at Bowker, states it explicitly: “The publisher (and retailer) with the best, most complete metadata offers the greatest chance for consumers to buy books. The publisher with poor metadata risks poor sales—because no one can find those books” [12].

And yet, even with this rise in concern and understanding of the importance of metadata, broken systems such as the human-based selection of BISAC codes persist within the industry. Given the importance and omnipresence of genre in buyers’ purchasing decisions discussed above, and the known existence of topic modeling software—which offers publishers a clear path toward accurate metadata—the question is raised: when will publishers stop talking about their problems and actually start solving them?

References
[1] https://www.bisg.org/tutorial-and-faq#General
[2] https://www.bisg.org/publications/best-practices-product-metadata
[3] Ibid.
[4] Ibid.
[5] Ibid.
[6] http://en.wikipedia.org/wiki/Natural_language_processing
[7] https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
[8] Ibid.
[9] Ibid.
[10] Ibid.
[11] http://toc.oreilly.com/2010/06/sifting-through-all-these-book.html
[12] http://book.pressbooks.com/chapter/metadata-laura-dawson

Accessing big data: The key to publishers taking back the power

Publishing and the priceless tool of big data

Over the last decade, the rise in digital reading has brought with it an unparalleled opportunity for publishers to collect and use reader data: information that extends from the time, frequency, and duration at which consumers are reading, down to detailed records of whether or not a reader completes a book, and if not, on which page they gave up. All of this data, if harnessed, has the potential to impact traditional publishing’s business models, and at the very least to equip publishers with solid facts on which to base their decisions. At present, however, traditional publishers (specifically the Big Five) find themselves with extremely limited access to this priceless tool. Instead of taking swift action to remedy this situation in a way that would empower publishers and allow them to self-sufficiently explore alternative business models, traditional publishers have sat by idly, opening up opportunities for tech startups and e-book retailers to drive innovation within the industry. What little action has been observed is taking extensive amounts of time to come to fruition and is being executed through partnerships with retailers and tech companies. This is worrisome, not only because of the issues publishers have faced after relying heavily on retailers such as Indigo and Amazon in the past, but also because it does nothing to increase the self-sufficiency of publishers or return them to a position of power within the industry. If publishers hope to regain control over their industry and see traditional publishing move forward as a profitable endeavor, they will need to take swift and innovative action to gain access to big data and apply it to their business models and decisions through in-house innovation.

Facing “a locked-up data pipeline”

One of the largest barriers to traditional publishers’ success in utilizing big data–one that should be considered before making recommendations on how the data should be used–is the lack of tools publishers have in place to collect or access reader data. Although publishers create the content, their interaction with and connection to the platforms through which it is consumed is nonexistent. Apart from a minuscule portion of direct sales, the work of traditional publishers reaches consumers via the platforms and products of external retailers and companies. This division between the content and its consumption means that publishers have no access to data beyond “the blunt instrument of units sold.” [1]

According to Kristen McLean, Miami-based founder and CEO of Bookigee, a publishing-focused data, analytics, and consumer research company, publishers are facing “a locked-up data pipeline in which [they] don’t have access to complete data”–a fact that has caused the publishing industry to “lag behind most major consumer industries, including the music, TV, and film.” [2]

Currently, all five major e-book retailers/platforms–Amazon (Kindle), Apple (iBooks), Barnes & Noble (Nook), Google, and Kobo–admit to collecting reader data, though most are tight-lipped about exactly how they are analyzing and using it. [3] Further, most seem to be fairly explicit about keeping this information to themselves. At one point, it appeared as though U.S.-based book retailer Barnes & Noble, which holds roughly a quarter of the American e-book market, would help open that “locked-up data pipeline”. In a statement made at the January 2012 Digital Book World Conference, Jim Hilt, the company’s then vice president of e-books, mentioned plans to share this information and the insights gleaned from it with publishers, and stated that the company was already doing so informally. [4] By March 2012, however, the company’s tune had changed, with Hilt stating that B&N “has no imminent plans to share more information with publishers about readers’ habits in a systemic way.” [5]

The white knight in all of this has been Toronto-based Kobo, which openly shares its reader data with publishers, going so far as to release overviews of its aggregated findings in fact sheets and whitepapers publicly available on its company website. In a January 2015 presentation delivered to the Simon Fraser University Masters of Publishing program, Kobo President and Chief Content Officer Michael Tamblyn confirmed that the lines of communication between Kobo and publishers are open, and that the priceless reader data being collected is already being relayed back to publishers. [6] In addition, Tamblyn offered insight into Kobo’s decision to partner with publishers, noting that the company believes reader data will help publishers put out better content, and better content means more sales–a win-win for publishers and the e-book retailer. [7]

One can only hope that Kobo will inspire other e-book retailers to form similar data-sharing relationships with publishers. In the meantime, as a platform with more than eight million users worldwide that stocks more than 2.5 million books from hundreds of publishers and imprints, including the Big Five, [8] Kobo and its generosity in sharing reader data offer a solid starting point for publishers looking to apply big data insights to their business models and decisions.

Making the most out of what is available

Offering some hope, and hinting that publishers may be utilizing the data they do have access to, limited though it may be, are the 2014/2015 partnerships of HarperCollins, Simon & Schuster, and Macmillan with the e-book subscription services Oyster and Scribd. [9]

While the three publishers have not explicitly said their decisions were rooted in big data insights, the move aligns perfectly with findings reported in Kobo’s fall 2013 whitepaper, The Evolution of the eReading Customer, which identified 19% of the e-reading population as “Book-Loving Borrowers”–individuals who read approximately 31 books each year but prefer to borrow rather than buy. [10]

This is an exciting move on the publishers’ part–a first step toward a data-driven business model, and promising news for an industry that has seemingly run on intuition and anecdotal evidence for hundreds of years. [11] But it is still one of the only examples of traditional publishers applying big data, and it presents two large issues: publishers’ lack of expedience and their ever-present dependence on external partners.

Slow and steady doesn’t win the race

However out-of-character and noteworthy this move may be on the part of traditional publishers, the path by which they came to make this innovation is cause for concern. First, and perhaps most glaring, is how long it took for this to happen. Kobo has been sharing reader data since at least 2013, and subscription services such as Scribd and Oyster have existed since 2007 [13] and 2012 [12] respectively. This means that it took three of the world’s largest, oldest, and most powerful publishing houses years to recognize and act on the substantial segment of readers who prefer to borrow books rather than buy them. And still, the remaining two of the Big Five are showing no signs of taking action to capture this audience’s attention or business. Nearly two years ago, at the 2013 Frankfurt Book Fair, Penguin Random House CEO Markus Dohle responded to questions about the company’s plans to use big data, saying, “We’re doing our homework, our research and development with different business models and we are doing it cautiously. We will take our time, we want quality over speed and there is no rush. We are building a company for the next 100 years and not for 100 days.” [14] Although the sentiment of quality over speed may have its merits, one would hope that Penguin Random House would also recognize that speed matters when you are in a competitive and struggling market. While the publishing magnates have been doing their research, Amazon has had time to build and launch its own subscription service, Kindle Unlimited, which first became available in the U.S. in July 2014. [15]

Dependence as publishing’s downfall

The second concern surrounding traditional publishing’s early forays into new business models is the fact that they are not experimenting with industry innovation for themselves, and instead are relying on other retailers and companies to actually implement or execute the innovations–a trend that has been historically pervasive within the industry and has placed traditional publishers at a disadvantage.

Proof positive of publishers’ ubiquitous dependence on external partners can be found in the aforementioned partnerships of HarperCollins, Simon & Schuster, and Macmillan with the e-book subscription services Oyster and Scribd, particularly when contrasted with Amazon’s development of Kindle Unlimited. Instead of doing something for themselves, publishers turned to others to capitalize on the limited big data insights they had.

And it looks as though the pattern of dependence will continue. While traditional publishers have been contemplating big data, a plethora of tech startups have appeared, offering e-book analytics to self-published authors and small or independent publishers–and doing so successfully. One example is the San Francisco company App Annie, which expanded its services to include e-book analytics in 2013. [16]

With these companies cornering the market on big data for publishers and building successful tools and infrastructure while traditional publishers stand idly by, the concern is that publishers in this struggling industry, seeking the most “cost efficient way” [17] to access reader data, will be left to rely on experts and services outside their own companies–a situation similar to what happened when traditional publishers left book e-commerce in the hands of Amazon, and one that would leave publishers yet again in the submissive position.

Changing the conversation around partnerships

Although this concern does not appear to have spurred traditional publishers to take swift action within their own walls, conversations around the subject of future partnerships indicate that traditional publishers have at least learned from their past decisions to rely on others. In a 2014 interview with Fast Company, HarperCollins’ Chief Digital Officer, Chantal Restivo-Alessi, responded to a question about data-driven projects the company will be taking on by saying, “Where we are making the first inroads is really allowing ourselves to acquire more consumer data.” [18]

At the 2013 Frankfurt Book Fair’s CONTEC conference, Sebastian Posth, CEO of Berlin-based Publishing Data Networks, a company offering analytics to the German publishing industry, summarized nicely the changing mindset and necessary caution of publishers looking at partnerships that would allow them to experiment with data-driven models:

“Data analysis is a business requirement and a necessary means to deal with the digital change…The publishing industry needs to learn this lesson if it wants to survive. Publishers need to make sure that they work with partners (retailers, intermediaries, distributors), that in general support the idea of exchanging, at best, real-time information between people and organizations in a distributed supply chain…Data is not a giveaway or supplement to a business deal, it is a prerequisite.” [19]

Taking the bull by the horns

While the awareness and caution being exercised by traditional publishers is heartening, the still-pervasive reliance on external partners to create innovation is worrisome, and it seems reasonable to question how publishers will gain the upper hand, or at least equity, in these partnerships. The safer, though perhaps less economical, route would be for publishers to take matters into their own hands: develop their own tools for collecting and analyzing big data, then apply the insights they gain. Until publishers do that, they will be, in the words of marketing guru Seth Godin, “playing a different game than people who have been winning on the internet for a very long time.” [20]

One way for traditional publishers to do so is by generating all e-books using EPUB 3, which, being built on HTML5, would allow them to build JavaScript into the books that could then be used to track reader behaviour. [21] While this move would require traditional publishers to expand their teams to include data analysts, the unmitigated access to reader data would place publishers in a position of power and control, and most importantly would allow them to create data-driven innovation self-sufficiently.
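To give a sense of what that in-house analysis could look like, here is a minimal sketch that aggregates hypothetical reading events of the kind such embedded JavaScript might report back to a publisher-run endpoint (title, reader, last page reached, whether the book was finished). The event format, field names, and file name are invented for illustration; they are not part of the EPUB 3 specification.

```python
import json
from collections import defaultdict
from statistics import median

# Hypothetical event log: one JSON object per line, written by the
# publisher's collection endpoint as in-book JavaScript reports progress.
# Example line:
# {"title": "Example Novel", "reader": "a1b2", "last_page": 187, "finished": false}
def summarize(log_path: str) -> dict:
    finished = defaultdict(int)
    total = defaultdict(int)
    abandoned_at = defaultdict(list)
    with open(log_path) as fh:
        for line in fh:
            event = json.loads(line)
            title = event["title"]
            total[title] += 1
            if event["finished"]:
                finished[title] += 1
            else:
                abandoned_at[title].append(event["last_page"])
    # Completion rate and the typical page at which readers gave up, per title.
    return {
        title: {
            "completion_rate": finished[title] / total[title],
            "median_abandon_page": median(abandoned_at[title]) if abandoned_at[title] else None,
        }
        for title in total
    }

print(summarize("reader_events.jsonl"))
```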

At the very least, if traditional publishers did not find it economically feasible to analyze the data themselves, having access to it would allow them to build partnerships more akin to outsourcing, whereby publishers could hire or contract an external company to perform these services for them. In this scenario, publishers would be in the position of power, as they would only be paying for the analysis, and not for access to the data.

Similarly, with access to the data no longer a bargaining chip, the idea of partnerships with retailers could be revisited. If the data being held hostage by retailers such as Amazon, Apple, and B&N were suddenly available to publishers through alternative means, the value of that data, monetarily speaking, would depreciate, and publishers and retailers would move to a more equal footing on which to strike deals.

Shifting the industry norm

The key in all of this is access to big data. Without it, publishers will remain powerless, unable to effect change and innovation within their own industry, and at the mercy of retailers such as Amazon and Apple. Though it may be a difficult and likely expensive path, traditional publishers, and the Big Five specifically, need to take swift action to gain access to reader data. Through the example and generosity of Kobo, publishers can see the possible applications of this data, and understand that partnerships based in equity are possible between publishers and retailers; but if they want that to become the industry norm, they need to step up and do something. And fast.


Sources

[1] “Publishing in the Era of Big Data Whitepaper – Fall 2014.” Kobo Café. 2014. Accessed February 20, 2015. https://cafe.kobo.com/press/facts.

[2] Anderson, Porter. “Publishing Is Now a “Data Game” – Publishing Perspectives.” Publishing Perspectives. September 17, 2013. Accessed February 20, 2015. http://publishingperspectives.com/2013/09/publishing-is-now-a-data-game/.

[3] Kaste, Martin. “Is Your E-Book Reading Up On You?” NPR. December 10, 2010. Accessed February 20, 2015. http://www.npr.org/2010/12/15/132058735/is-your-e-book-reading-up-on-you).

[4] Greenfield, Jeremy. “Barnes & Noble to Share More Reader Data with Publishers.” Digital Book World. January 24, 2012. Accessed February 20, 2015. http://www.digitalbookworld.com/2012/barnes-noble-to-share-more-reader-data-with-publishers/.

[5] Greenfield, Jeremy. “Barnes & Noble Has No Imminent Plans to Share More Data With Publishers.” Digital Book World. March 16, 2012. Accessed February 20, 2015. http://www.digitalbookworld.com/2012/barnes-noble-has-no-imminent-plans-to-share-more-data-with-publishers/.

[6] Tamblyn, Michael. “Kobo.” Lecture, Simon Fraser University Masters of Publishing Program guest speaker series, Vancouver, January 2015.

[7] Ibid.

[8] Alter, Alexandra. “Your E-Book Is Reading You.” WSJ. July 19, 2012. Accessed February 20, 2015. http://www.wsj.com/articles/SB10001424052702304870304577490950051438

[9] Plaugic, Lizzie. “Ebook Subscription Services Get a Boost with Help from Macmillan.” The Verge. January 13, 2015. Accessed February 20, 2015. http://www.theverge.com/2015/1/13/7539379/e-book-subscription-oyster-scribd-macmillan.

[10] “The Evolution of the eReading Customer – Fall 2013.” Kobo Café. 2014. Accessed February 20, 2015. https://cafe.kobo.com/press/facts.

[11] “Publishing in the Era of Big Data Whitepaper – Fall 2014.” Kobo Café. 2014. Accessed February 20, 2015. https://cafe.kobo.com/press/facts.

[12] “Oyster (company).” Wikipedia. February 20, 2015. Accessed February 20, 2015. http://en.wikipedia.org/wiki/Oyster_(company).

[13] “Scribd.” Wikipedia. February 20, 2015. Accessed February 20, 2015. http://en.wikipedia.org/wiki/Scribd.

[14] Knolle, Kirsti. “Publishers Need to Know Their Readers to Survive in Digital Era.” Reuters. October 21, 2013. Accessed February 20, 2015. http://www.reuters.com/article/2013/10/21/net-us-publishing-data-idUSBRE99G0LD20131021.

[15] “Amazon Officially Launches Ebook Subscription Service, Kindle Unlimited.” Digital Book World. July 18, 2014. Accessed February 20, 2015. http://www.digitalbookworld.com/2014/amazon-officially-launches-ebook-subscription-service-kindle-unlimited/.

[16] Owen, Laura. “App Data Company App Annie Expands into Ebook Analytics for Publishers and Authors.” Gigaom. October 8, 2013. Accessed February 20, 2015. https://gigaom.com/2013/10/08/app-data-company-app-annie-expands-into-ebook-analytics/.

[17] Greenfield, Rebecca. “How HarperCollins’s Chief Digital Officer Uses Big Data To Make Publishing More Profitable.” Fast Company. January 23, 2014. Accessed February 20, 2015. http://www.fastcompany.com/3025254/most-creative-people/how-harpercollinss-chief-digital-officer-uses-big-data-to-make-publishi.

[18] Ibid.

[19] Anderson, Porter. “Publishing Is Now a “Data Game” – Publishing Perspectives.” Publishing Perspectives. September 17, 2013. Accessed February 20, 2015. http://publishingperspectives.com/2013/09/publishing-is-now-a-data-game/.

[20] Friedman, Jane. “How E-Books Have Changed the Print Marketplace: Digital Book World, Day 3.” Jane Friedman. January 16, 2015. Accessed February 20, 2015. http://janefriedman.com/2015/01/16/ebooks-print-market/.

[21] Greenfield, Jeremy. “How Publishers Should Prepare for EPUB 3.” Digital Book World. January 18, 2012. Accessed February 20, 2015. http://www.digitalbookworld.com/2012/how-publishers-should-prepare-for-epub-3/.

It’s a pirate’s life—in Canada

A broad overview of the legality of peer-to-peer file sharing and related copyright infringement from a Canadian perspective

What is peer-to-peer file sharing?

In the simplest terms possible, peer-to-peer (P2P) file sharing is a method of distributing and downloading files. More precisely, P2P file sharing relies on a group of internet users, known as peers, voluntarily connecting their computers to form a network that allows them to share and download files.

Peers use websites known as trackers to locate files that other members of the network are sharing, and they download and assemble those files using software known as clients. Unlike traditional file sharing models that rely on direct file transfers from one user to another, P2P file sharing allows users to connect with all peers in the network currently sharing the file they want and to download different pieces of the file from multiple peers simultaneously. Once a peer has downloaded a file, they become what is known as a seeder and are able to share that file with other peers within the network.

The most popular P2P file sharing protocol is known as BitTorrent, and is estimated to have anywhere from 150 to 300 million users [1].
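For readers curious about the mechanics, BitTorrent splits each shared file into fixed-size pieces and publishes a cryptographic hash for every piece, so a client can fetch pieces from many different peers and verify each one independently before reassembling the file. The sketch below illustrates that verify-and-assemble step in Python; the peer-fetching itself is stubbed out, and the function names are illustrative rather than part of any real client.

```python
import hashlib

def verify_piece(data: bytes, expected_sha1: str) -> bool:
    """Check a downloaded piece against the hash published in the torrent metadata."""
    return hashlib.sha1(data).hexdigest() == expected_sha1

def assemble_file(piece_hashes: list[str], fetch_piece) -> bytes:
    """Fetch each piece (from any peer currently seeding it) and reassemble the file.

    `fetch_piece(index)` stands in for the real peer-wire exchange, which would
    request the piece from whichever connected peer has it available."""
    pieces = []
    for index, expected in enumerate(piece_hashes):
        data = fetch_piece(index)
        if not verify_piece(data, expected):
            raise ValueError(f"piece {index} failed verification; re-request from another peer")
        pieces.append(data)
    return b"".join(pieces)
```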

Is it legal?

While the short and sweet answer to this is yes—peer-to-peer file sharing and the technology behind it are, in and of themselves, legal in Canada—it is the type of files being shared that determines the legality of P2P file sharing on a case-by-case basis.

For users (or peers)

Users sharing and downloading content that is no longer under copyright, or content to which they personally hold the rights or have been granted them (under copyright or Creative Commons licensing), have little or nothing to fear. Of course, an estimated 99.97% of the content available through BitTorrent file sharing is copyrighted material [2], and it is the sharing and downloading of these files that could land you in hot water.

Section 27 subsection (1) of the Canadian Copyright Act clearly states:

It is an infringement of copyright for any person to do, without the consent of the owner of the copyright, anything that by this Act only the owner of the copyright has the right to do [3].

Given that copyright (without any special Creative Commons licensing) gives the creator of an original work exclusive rights to its use and distribution, sharing and/or downloading (which requires creating a copy of) content via P2P file sharing is a clear case of copyright infringement and therefore illegal.

For P2P site owners

It is not only peers or members of P2P networks that could be breaking the law. Many peer-to-peer file-sharing networks and affiliated services could also find themselves in a legal bind based on the types of files being shared and/or the copyright infringement made possible by the network.

Within Canada there are in fact specific laws governing the use of “digital networks” and their connection to possible acts of piracy. According to section 27 subsection (2.3) of the Canadian Copyright Act:

It is an infringement of copyright for a person, by means of the Internet or another digital network, to provide a service primarily for the purpose of enabling acts of copyright infringement if an actual infringement of copyright occurs by means of the Internet or another digital network as a result of the use of that service [4].

There are numerous conditions considered by the court when determining whether a person has in fact infringed copyright under subsection (2.3), but in broad terms, what this means is that popular peer-to-peer networks and related services that facilitate the free sharing of content still under copyright (and the people who run them) could be breaking the law. This would, of course, include torrent trackers such as isoHunt, Torrentz, and KickassTorrents.

The debate over P2P site owner responsibility

Although the Canadian government feels that site owners are infringing copyright by facilitating the copyright infringement of others, whether or not P2P site owners should be held responsible is still hotly debated among members of the tech and legal communities.

The most common argument presented in defence of P2P file sharing sites, regardless of the type of content they are hosting, rests on the idea that these sites (and the technology behind them) were not created with the intention of facilitating illegal activity; it is the choices made by individual users that lead to copyright infringement, and those choices are not under the control of site owners.

A common metaphor used to illustrate this argument compares the use of a P2P file-sharing network to the purchase of a vehicle. While the vehicle was not designed with the intention of enabling law-breaking, once the driver is in possession of it, they are capable of exceeding the speed limit or driving while impaired or distracted through their own personal choices—choices for which the vehicle manufacturer is not held responsible or liable, despite the vehicle facilitating them [5].

Countering this defence of P2P site owners, policymakers argue that it is not the mere fact that illegal file sharing happens that places the legal onus on site owners, but rather numerous factors including, but not limited to, the site owners’ knowledge of these activities and failure to take measures to prevent them.

This is clearly spelled out in the Canadian Copyright Act section 27 subsection (2.4) which lists the conditions used to determine whether copyright infringement has occurred under subsection (2.3). These include:

(a) whether the person expressly or implicitly marketed or promoted the service as one that could be used to enable acts of copyright infringement;
(b) whether the person had knowledge that the service was used to enable a significant number of acts of copyright infringement;
(c) whether the service has significant uses other than to enable acts of copyright infringement;
(d) the person’s ability, as part of providing the service, to limit acts of copyright infringement, and any action taken by the person to do so;
(e) any benefits the person received as a result of enabling the acts of copyright infringement; and
(f) the economic viability of the provision of the service if it were not used to enable acts of copyright infringement [6].

Of course, there are still those who dispute the validity of the above conditions, and it is unlikely that the government, legal, and tech communities will come to a consensus on the matter any time soon, but for now it is the Canadian government that has the final word on the matter.

Copyright law in action

While all this legalese might make it seem as though Canada is taking a hard stance on the illegal sharing of files, despite producing legislation identifying the potential use of P2P file sharing for copyright infringement, things above the 49th parallel have been pretty quiet—something that can’t be said for other countries around the world.

With many governments taking a similar stance to Canada, numerous P2P site owners and users have found themselves in trouble with the law.

The most famous instance of a copyright infringement lawsuit against a P2P file sharing site is, of course, the 2009 case against the Sweden-based tracker The Pirate Bay—a case that made international headlines when all four site founders were found guilty of promoting the copyright infringement of others, fined $3.5 million USD, and sentenced to one year each in prison [7].

In 2006, a similar situation unfolded, this time on American soil, when the Motion Picture Association of America (MPAA) launched a lawsuit against multiple BitTorrent tracker sites, including isoHunt (interestingly enough, a Canadian-owned and Canadian-hosted tracker), on the basis that the sites had facilitated copyright infringement. The suit named isoHunt founder and Vancouver resident Gary Fung, and in 2013 Fung voluntarily shut down the tracker, agreeing to pay a whopping $110 million USD [8].

Also coming out of the U.S. is news of what’s being called “copyright trolling,” with an estimated 18,000 Americans sued by the Recording Industry Association of America (RIAA) [9], and horror stories about a woman fined $1.9 million USD for downloading 24 songs [10].

This may all sound pretty scary, but here in Canada, not a single case has been heard against P2P site owners; the closest thing was an alleged seizure of servers from a BitTorrent site by the RCMP in May 2014. The site, however, was a Swedish tracker known as Sparvar, and the RCMP was working on behalf of Swedish authorities [11]. On the user side of things, finding a case against an individual Canadian P2P user is like finding a needle in a haystack, or harder.

So what is Canada doing about illegal downloading?

While other countries around the world have been making moves to cut piracy off at the knees by taking legal action against sites that facilitate illegal downloading or launching court cases en masse (seemingly as much a scare tactic as a tenable strategy to recoup lost profits), Canada has focused its energy on developing measures to deter users from downloading copyrighted content. Oh, and it’s being done using polite “we see what you’re doing and would like you to stop” notices and court-approved letters from copyright holders.

In 2012, Bill C-11, also known as the Copyright Modernization Act, was passed, broadening the scope of what is covered under fair dealing, with a specific focus on educational uses of content under copyright [12]. Also included in the new law are the “Notice and Notice” provisions, which came into effect on January 2, 2015 [13]. Under these provisions, internet service providers (ISPs) are required to pass along copyright holders’ notices of suspected infringement to users—they are also required to hold on to the IP addresses of any users they contact for up to a year, in case a copyright holder decides to pursue further legal action [14].

That’s right, Canadian ISPs are sending courtesy emails, and it does, in fact, seem to be working. Rogers reported that after receiving just one notice, 67% of recipients stopped infringing—after two notices, that number jumps to 89% of recipients abandoning their illegal file sharing ways [15].

Of course, if a copyright holder is dedicated enough, they can still take an individual to court, but in order to do so they’ll need to formally request the user’s information from the ISP, a process that requires federal court approval. If granted, all communication between the copyright holder and alleged pirate must be court-approved and cannot include any intimidation or scare tactics, with communication clearly indicating that a court has not yet decided the individual’s liability [16].

If a copyright holder makes it through all those hoops and takes an individual to court, the returns are still fairly limited. For non-commercial copyright infringement, penalties are capped at $5,000 [17], a limit some legal experts see as the Canadian government’s way of safeguarding individuals from being exploited by media companies, and a fine that barely makes the process of going to court worthwhile.

Avast ye, mateys!

With all that said, it would seem that being a pirate (of the online variety) is pretty good if you live in Canada. While peer-to-peer file sharing site owners might have a bit more to worry about, it would seem that the government isn’t coming down as hard on individual copyright infringers—at least not as hard as one would expect given the lengthy legislation put in place. While multi-million dollar copyright infringement lawsuits are taking place around the world, all’s quiet in the great white north. Will it stay that way? It’s hard to tell, but for now, (yo ho, yo ho) it’s a pirate’s life—in Canada.

Works Cited

[1] “‘P2P Not Dead’: 300 Mn BitTorrent Users Swap TV Shows and Movies Every Month.” RT News. May 31, 2014. Accessed January 26, 2015. http://rt.com/news/162744-p2p-file-sharing-increase/.

[2] Flaherty, Anne. “99.97pc of BitTorrent Files Illegal – Study.” 3 News. September 30, 2013. Accessed January 26, 2015. http://www.3news.co.nz/technology/9997pc-of-bittorrent-files-illegal–study-2013092012#axzz3QAbyrAs6.

[3] “Copyright Act (R.S.C., 1985, C. C-42).” Justice Laws Website. December 9, 2014. Accessed January 27, 2015. http://laws-lois.justice.gc.ca/eng/acts/C-42/page-16.html#h-21.

[4] Ibid.

[5] Palm, Erik. “Pirate Bay Attorney Outlines Arguments for Appeal – CNET.” CNET. May 9, 2009. Accessed January 27, 2015. http://www.cnet.com/news/pirate-bay-attorney-outlines-arguments-for-appeal/.

[6] “Copyright Act (R.S.C., 1985, C. C-42).” Justice Laws Website. December 9, 2014. Accessed January 27, 2015. http://laws-lois.justice.gc.ca/eng/acts/C-42/page-16.html#h-21.

[7] Enigmax. “The Pirate Bay Trial: The Official Verdict – Guilty | TorrentFreak.” TorrentFreak RSS. April 17, 2009. Accessed January 28, 2015. https://torrentfreak.com/the-pirate-bay-trial-the-verdict-090417/

[8] The Canadian Press. “IsoHunt Shut Down, Canadian Torrent Firm Fined $110M US – Technology & Science – CBC News.” CBCnews. October 18, 2013. Accessed January 30, 2015. http://www.cbc.ca/news/technology/isohunt-shut-down-canadian-torrent-firm-fined-110m-us-1.2126064.

[9] Holpuch, Amanda. “Minnesota Woman to Pay $220,000 Fine for 24 Illegally Downloaded Songs.” The Guardian. September 11, 2012. Accessed January 27, 2015. http://www.theguardian.com/technology/2012/sep/11/minnesota-woman-songs-illegally-downloaded.

[10] Friend, Elianne. “Woman Fined to Tune of $1.9 Million for Illegal Downloads.” CNN. June 18, 2009. Accessed January 27, 2015. http://www.cnn.com/2009/CRIME/06/18/minnesota.music.download.fine/index.html?eref=ib_us.

[11] Makuch, Ben. “In a Rare Move, Canadian Mounties Seized Data from a Torrent Site.” Motherboard. May 15, 2014. Accessed January 27, 2015. http://motherboard.vice.com/read/in-a-rare-move-canadian-mounties-seized-data-from-a-torrent-site.

[12] “Bill C-11: The Copyright Modernization Act.” Copyright at UBC. Accessed January 28, 2015. http://copyright.ubc.ca/guidelines-and-resources/support-guides/bill-c-11-the-copyright-modernization-act/.

[13] “Notice and Notice Regime.” Government of Canada. June 16, 2014. Accessed January 28, 2015. http://news.gc.ca/web/article-en.do?nid=858069.

[14] Ibid.

[15] Geist, Michael. “Rogers Provides New Evidence on Effectiveness of Notice-and-Notice System – Michael Geist.” Michael Geist. March 23, 2011. Accessed January 28, 2015. http://www.michaelgeist.ca/2011/03/effectiveness-of-notice-and-notice/.

[16] El Akkad, Omar, and Jeff Gray. “Court Orders Canadian ISP to Reveal Customers Who Downloaded Movies.” The Globe and Mail. February 21, 2014. Accessed January 29, 2015. http://www.theglobeandmail.com/technology/tech-news/court-tells-teksavvy-to-reveal-customers-who-illegally-download-movies/article17025513/.

[17] Armstrong, James. “New Regulations about Illegal Downloading Go into Effect.” Global News. January 2, 2015. Accessed January 30, 2015. http://globalnews.ca/news/1752246/new-regulations-about-illegal-downloading-go-into-effect/.