Topic modeling for BISAC code selection

From the first book we read, or, more often, have read to us, we begin to form preferences. We find authors we like, writing styles or reading levels that we enjoy, plotlines that compel us to keep reading, and characters we connect with. But underlying all of these nuanced preferences is one very specific penchant: genre.

The genre of a book is almost always the first criterion considered by a reader. When we answer the question, “what kind of books do you like?” we, more often than not, respond with a list of our preferred genres. It is the defining feature that creates our personal dichotomy of books we like and those we do not. Occasionally, this can be overturned by the preference for an author whose writing spans genres, but seldom does this happen. While J.K. Rowling may have the power to draw readers to any kind of book she might produce, most authors could not draw equal readership for, say, their writing in the two genres of romance and true crime.

Think even of the setup of a typical bookstore. If you live in Canada, you are likely imagining a Chapters-Indigo, and would confirm that the store is laid out in sections based on the genre of the books found within. The same is true online, where we browse websites that offer us catalogues of books sorted by genre, categorized into the neat and tidy packages of biography, fiction and literature, science fiction, and more. Whether wandering into a brick-and-mortar store or browsing through an online retailer’s website, we start with genre and go from there.

Understanding, then, how important book genre is to a reader’s selection and purchase of a book, it would seem logical to assume that the system by which books are assigned a genre is one of consistency and standardization, with strict guidelines and processes in place to ensure that the genre selected accurately reflects the book’s content. Unfortunately, this assumption is untrue.

The process by which books are categorized into genres begins with a publisher’s selection of what is known as a Book Industry Standards and Communications (BISAC) code, which is entered into the publisher’s metadata system (likely ONIX) to be shared with booksellers and retailers along with the rest of the book’s data (e.g. title, author, price).

These codes are created and managed by the Book Industry Study Group’s Subject Codes Committee, are updated on an annual basis, and run the gamut between being vague (e.g. BIO000000 BIOGRAPHY & AUTOBIOGRAPHY / General) and specific (e.g. CRA044000 CRAFTS & HOBBIES / Needlework / Cross-Stitch) [1]. As of 2014, the BISG has created 52 subject headings under which BISAC codes are listed, many of which have been carefully cross-listed to ensure that no two codes are redundant.

Despite this extensive and detailed list being developed with apparent care by the BISG, no definitions for the various subject headings or codes are provided. Very little guidance on selecting BISAC codes based on book content is given, and no categorization tools or aids are supplied.

The BISG website certainly does little to assist publishers in their selection of BISAC codes. In the FAQ section of their BISAC Tutorial and FAQ page they answer the question “How do I choose the BISAC Subject Heading for a specific book?” with the following vague and ineffectual response:

The first step in determining the proper heading for a book would be to identify which of the 52 major areas within the list is most appropriate for the title. Once that section is identified, look for the term that most closely fits the content of the book. If the title has numerous facets, it is recommended that the process be repeated for other relevant major sections. If database systems are sophisticated enough, a recommendation is to do a Keyword or Find search on the entire list in order to identify all the terms that may be appropriate for the book. This is especially effective if it is difficult to determine the proper major section for the term one imagines would be used. This will also help alert the user to cases where similar subjects appear in different sections to reflect different ways of approaching the topic (e.g., “HEALTH & FITNESS / Sexuality”, “PSYCHOLOGY / Human Sexuality”, “RELIGION / Sexuality & Gender Studies”, “SELF-HELP / Sexual Instruction”, not to mention related subjects under JUVENILE FICTION, JUVENILE NONFICTION, and SOCIAL SCIENCE). [2]

Beyond this, the only concrete documentation provided to assist publishers in their selection of BISAC codes is an optional download of a document called Best Practices of Product Metadata. The document reminds publishers that “BISAC subject should be assigned based on book’s content—not on the merchandising plans of the publisher” [3] and offers limited (and commonsense) advice including:

  • There should be consistency across formats. In other words, hardcover, paperback, mass market, large print, audio books, and e-books should all have the same BISAC subjects.
  • Works of juvenile nonfiction should be assigned subjects in the JUVENILE NONFICTION section only. Collections containing both juvenile nonfiction and juvenile fiction may also be assigned subjects in the JUVENILE FICTION section.
  • Use subjects in the FOREIGN LANGUAGE STUDY section for works about the languages specified, whether these works are of an instructional, historical, or linguistic nature. Do not use subjects in this section to indicate the language of a work: works should be classified based on their subject content without regard to the language in which they are written (of course, if a work is about a language and written in that language, a subject in this section should be assigned) [4].

Only two small pieces of advice offered even remotely pertain to a book’s content and its relation to selecting a BISAC code. They are:

  • Use subjects in the HEALTH & FITNESS section for works aimed at nonprofessionals. For scholarly works and/or works aimed at medical or health care professionals, use subjects in the MEDICAL section.
  • Certain other subject combinations also apply to titles intended for a lay person vs. those intended for a professional. These combinations include Nature vs. Science, Self-Help vs. Psychology [5].

The rest of the information provided is focused on the entry process for the BISAC codes into metadata systems such as ONIX, and other administrative or clerical tasks associated with BISAC code selection (e.g. how many codes you can select, the fact that a general code is not required if a more specific code from the same subject heading is selected, etc.).

In the absence of industry standards, what’s left is publisher intuition and interpretation, with each publisher (or their proxy) applying their own definitions to the subject headings and selecting BISAC codes as they see fit. The result is an unorganized system with no consistency across the millions of books published and released into the North American market each year.

The question, then, is how to fix this broken system and implement a consistent process for selecting BISAC codes. The first solution that springs to mind is to define the BISAC subject headings and formulate guidelines outlining which elements within a book’s content correspond to specific subject headings. Although this would likely improve the consistency of the BISAC codes assigned by publishers, it would not be enough, because it does not fully resolve the key issue with the current system: the potential for human interpretation and bias. Even with guidelines in place, a system that relies on individuals to understand, interpret, and apply standards is bound to produce variable output. The answer, then, must include moving the process of selecting BISAC codes out of the hands of individuals and into the hands of technology, where subjective interpretation is replaced by the objective and programmable application of rules and standards.

Enter natural language processing. A field of study that combines computer science, artificial intelligence, and linguistics, natural language processing (NLP) is concerned with the interaction of computers with human (natural) languages. It includes numerous computer-completed “tasks” such as speech recognition (converting speech to its textual equivalent), translation (converting text from one language to another), and, most relevant to the issue of selecting a BISAC code, topic modeling (determining a document’s topic based on elements within the text) [6].

According to Princeton researcher David M. Blei, topic modeling is a statistical method of analyzing the words of original texts “to discover the themes that run through them, [and] how those themes are connected to each other.” [7] This method functions when a computer processes a text and identifies specific patterns within it.

These patterns can include the recurrent use of certain words or phrases, or the repeated appearance of relationships (in terms of grammar, order, and position) between words or phrases, and are measured or quantified so as to provide the statistical likelihood that a text pertains to a specific theme or subject matter [8].
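As a rough illustration of what "measuring" the recurrent use of words can mean, the following minimal Python sketch turns a passage into relative word frequencies. The stop-word list, the function name, and the sample sentence are all assumptions made for illustration; real topic modeling software uses far richer features than raw word counts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "with"}

def term_frequencies(text):
    """Map each non-stop word in text to its share of all non-stop words."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Made-up sample passage, loosely in the spirit of a cross-stitch guide.
sample = "Thread the needle and stitch the pattern. Each stitch follows the pattern."
freqs = term_frequencies(sample)

# Print the words from most to least frequent.
for word, share in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{word}: {share:.2f}")
```

A profile like this is only the raw material: the statistical step is comparing it against the profiles that are typical of each theme.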

In order for topic modeling to work, the computer processing the text relies on a set of algorithms or rules defining which patterns it should be looking for and which topics correspond with these patterns. These rules are generally created using lexical databases and other linguistic (syntactic and semantic) information, which, for the scope of this essay, will not be discussed in detail.

In an introductory paper discussing topic modeling, Blei goes on to describe the benefits of topic modeling by asking readers to “[i]magine searching and exploring documents based on the themes that run through them” [9].

We might “zoom in” and “zoom out” to find specific or broader themes; we might look at how those themes changed through time or how they are connected to each other. Rather than finding documents through keyword search alone, we might first find the theme that we are interested in, and then examine the documents related to that theme [10].

Blei’s description of how topic modeling relates to search and discoverability is strikingly similar to the way readers already search for and select books. Replace “theme” with “subject heading” and interpret zooming in as selecting a more specific BISAC code within a subject heading, and topic modeling and the selection of a BISAC code become mirrored processes. The only difference is that, at present, one is completed by humans and the other by computers.

Applying the process of topic modeling to the selection of BISAC codes then, we would begin by developing a standard for the linguistic, semantic and syntactic patterns associated with specific BISAC codes. This can be done in multiple ways, but the most obvious way would be through an examination of the patterns present in a massive corpus of books already identified as having a specific BISAC code.

With these patterns identified and thus the topic modeling rules or algorithms set, a publisher could run the text of a book through topic modeling software. The computer would process the text of a book, measuring and recording the patterns it observes, and then, using the frequency and proportions of these patterns, identify the book’s subject matter, and in turn, the most appropriate BISAC code.
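The two steps described above (learn patterns from books that already carry a BISAC code, then score a new text against them) can be sketched in a few lines of Python. This is only a sketch under stated assumptions: it uses a multinomial naive Bayes classifier with add-one smoothing, which is a supervised classification technique rather than the unsupervised topic models Blei describes, and the four-sentence "corpus" is invented. Only the two BISAC codes come from the examples earlier in this essay.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def train(labeled_books):
    """Count words per BISAC code from (text, code) pairs."""
    word_counts = defaultdict(Counter)  # code -> word -> occurrences
    doc_counts = Counter()              # code -> number of training texts
    vocab = set()
    for text, code in labeled_books:
        tokens = tokenize(text)
        word_counts[code].update(tokens)
        doc_counts[code] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the best-scoring code and the log-probability score of every code."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    tokens = [t for t in tokenize(text) if t in vocab]  # ignore unseen words
    scores = {}
    for code in doc_counts:
        score = math.log(doc_counts[code] / total_docs)  # prior for this code
        denom = sum(word_counts[code].values()) + len(vocab)
        for token in tokens:
            # Add-one smoothing so an unseen word never zeroes out a code.
            score += math.log((word_counts[code][token] + 1) / denom)
        scores[code] = score
    return max(scores, key=scores.get), scores

# Invented toy corpus; a real one would hold full texts of many labeled books.
TOY_CORPUS = [
    ("stitch the floss through the fabric following the pattern", "CRA044000"),
    ("count each stitch on the fabric grid", "CRA044000"),
    ("she was born in a small town and wrote her memoir", "BIO000000"),
    ("his life story begins with his childhood", "BIO000000"),
]

model = train(TOY_CORPUS)
best_code, scores = classify("a new pattern for fabric and floss", model)
print(best_code)
```

Because the classifier returns a score for every code, a publisher could also inspect the runners-up rather than accept a single answer, which matters for books that legitimately belong under more than one subject heading.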

By placing this task in the hands of computers, not only would the process become extremely expedient, but the consistency and impartiality with which BISAC codes are selected would also be drastically increased. Without human interference, BISAC codes would be applied solely based on the content of a book, and the biases, interpretations, and marketing ploys of publishers would be removed from the process entirely.

As an added bonus, with the right tools, the topic modeling software could be linked directly to the ONIX metadata for each book, feeding its selection directly into the database. Each year when the list of BISAC codes is updated, the software could automatically re-process the text and update the BISAC codes when necessary or appropriate. Currently, because of the manual process used to select BISAC codes, even as the list of codes is updated, the BISAC codes assigned to books are never updated. Making BISAC code selection an automatic computer task would keep ONIX genre metadata up-to-date and consistent, and would prevent books assigned a now outdated or discontinued BISAC code from falling off the radar or being excluded from search results or retailer sorting algorithms (e.g. Amazon’s recommendations or subject/genre categorizations) that depend on or factor in BISAC codes. Topic modeling and its application to BISAC code selection is an obvious fix to a system that so clearly is not functioning.
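The annual refresh could start with something as simple as finding the books whose assigned code no longer appears on the updated list. The Python sketch below assumes a hypothetical in-memory catalogue; the book identifiers and the retired code "XXX999000" are invented for illustration, and reading from or writing to real ONIX records is left out entirely.

```python
def books_to_reprocess(assigned_codes, active_codes):
    """Return identifiers of books whose BISAC code is missing from the updated list."""
    return sorted(
        book_id
        for book_id, code in assigned_codes.items()
        if code not in active_codes
    )

# Hypothetical catalogue: the identifiers and the retired code "XXX999000" are made up.
assigned = {
    "book-001": "BIO000000",
    "book-002": "XXX999000",
    "book-003": "CRA044000",
}

# Hypothetical excerpt of the updated code list.
active = {"BIO000000", "CRA044000"}

stale = books_to_reprocess(assigned, active)
print(stale)
```

Each flagged book would then be run through the classifier again, so that no title keeps a discontinued code by default.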

With publishers everywhere clamoring about the volume of books flooding the market and the accompanying issue of discoverability, metadata (which includes BISAC codes) and the importance of its accuracy have risen to the forefront of the conversation [11]. Laura Dawson, book industry veteran and Product Manager, Identifiers at Bowker, explicitly states, “The publisher (and retailer) with the best, most complete metadata offers the greatest chance for consumers to buy books. The publisher with poor metadata risks poor sales—because no one can find those books” [12].

And yet, even with this rise in concern about, and understanding of, the importance of metadata, broken systems such as the human-based selection of BISAC codes still persist within the industry. Given the above discussion of the importance and omnipresence of genre in the purchasing decisions of buyers, and the known existence of topic modeling software, which offers publishers a clear path toward accurate metadata, the question is raised: when will publishers stop talking about their problems and actually start solving them?

References
[1] https://www.bisg.org/tutorial-and-faq#General
[2] https://www.bisg.org/publications/best-practices-product-metadata
[3] Ibid.
[4] Ibid.
[5] Ibid.
[6] http://en.wikipedia.org/wiki/Natural_language_processing
[7] https://www.cs.princeton.edu/~blei/papers/Blei2012.pdf
[8] Ibid.
[9] Ibid.
[10] Ibid.
[11] http://toc.oreilly.com/2010/06/sifting-through-all-these-book.html
[12] http://book.pressbooks.com/chapter/metadata-laura-dawson

One Reply to “Topic modeling for BISAC code selection”

  1. As you say, this seems like an easy answer to an obviously flawed process. I’d be surprised if there hadn’t been some research to test the viability of this approach. I’d love to see if some research had been done towards classifying books using BISAC. Google and Amazon must obviously be doing this internally, not necessarily for BISAC, but for subject categorization more broadly.

    Your essay does a great job of motivating the problem, and offers a good description of what the potential of NLP approaches would be.
