Wither the PDF

In an age of unprecedented document sharing, the Portable Document Format, better known as the PDF, is ubiquitous. Though there are no official statistics, a Google search for PDF-formatted documents in 2007 returned 236 million documents compared to 37 million Microsoft Word documents (1), and the number can only have grown since then. The reasons for its popularity seem clear. It is described by its creator, Adobe, as being “used to present and exchange documents reliably, independent of software, hardware, or operating system”. It offers document control, visual consistency across screen and print, and ease of sharing. However, it is also limiting and fails to make the most of the potential of digital documents, particularly in the realm of scholarly publishing. Given the nature of digital reading and the evolving nature of scholarly work, the PDF falls short of enabling users to get the most out of the digital realm. Instead, researchers should look to other formats like HTML to create dynamic, interactive documents that allow both authors and readers to interact with a text in any way they see fit.

The PDF was originally conceived by Adobe co-founder John Warnock in 1991 when he launched the Camelot Project. His vision was to solve the issue of universal communication and formatting of printed information. At the time, fax machines were the most advanced method of sharing documents quickly, but were limited by poor quality, high bandwidth demands and device specificity. More generally, digital formatting and layout were considered complicated (2). The solution Warnock devised was built on PostScript, a device-independent page description language that had been widely adopted as a standard for application outputs. PostScript was limited, though, by requiring a powerful computer to process it and a PostScript-capable printer. To circumvent these limitations, Warnock proposed a new language, a subset of PostScript, that did not require a complete PostScript parser to process. Then, with a new version of the PostScript interpreter, any PostScript file could be converted to the new format, which included a structured storage system, creating a self-contained file that could be sent anywhere and viewed or printed exactly as intended (2) (3). Warnock predicted that this new format would work even on small machines, enabling widespread adoption, as well as allowing distribution of documents via email, text searching capabilities and improved document archiving. This concept was revolutionary at a time when the web was in its infancy, exchanging ideas was limited to emails, bulletin boards and chatrooms, and documents were restricted by incompatible platforms and software versions (4).

From this vision, the PDF was created. PDF 1.0 was released in 1993, and early feature additions included password security, internal and external linking, interactive elements like checkboxes and digital signatures, and improved colour and web-capture features (5). In the decades since, these features have been developed and strengthened, with other options like redaction and some minor editing tools added. It is undeniably one of the most commonly used formats in document management today, if not the most commonly used, fulfilling Warnock’s dream of a device- and OS-neutral format that preserves a document as originally intended. His predictions around emailing documents and archiving have also proved accurate. Its importance was perhaps most clearly recognised in 2008 when the format became an ISO standard, ensuring its survival as an easily available format for stakeholders that include governments and multinational corporations (6). Standardisation also affirmed that future PDF viewing software will be backwards compatible, meaning it can open earlier PDF versions, again ensuring the long-term usefulness of the format for archiving (7).

However, the very nature of the format that made it so revolutionary when it was first conceived is what makes it unsuited to the digital age we now find ourselves in. Where the PDF was designed to improve viewing and printing, technology has evolved to a point where a document’s uses and its readers’ needs go far beyond that. Historically, readers have not been able to contribute to published information, but production is no longer done in isolation, and publishing is no longer a one-way process (8). Editing, reviewing, commenting, annotating, sharing and collaborating are all possible in ways that could not have been accounted for in the original development of the PDF, and it has not effectively adapted to new demands. What staticity and near-universal compatibility do offer is a powerful tool for print production and, with the related PDF/A format, a reliable option for archiving born-digital documents (9), but for nearly any mode of communication beyond that the PDF is restrictive and stifling.

Before considering new demands, though, it is important to note that one half of the PDF’s most basic raisons d’être, improved viewing, has suffered over time. Screen technology has evolved as quickly as other digital technologies, particularly with the rise of mobile, tablets and touchscreens. With their fixed format and proportions, PDFs are not mobile-responsive, making them extremely difficult to read on handheld devices. As early as 2001, Jakob Nielsen, a user experience expert, was lamenting that PDFs only served to replicate the look of a printed page, which didn’t work for display in a browser window, and that navigation was difficult, resulting in a poor user experience (10). Since then, digital reading has moved even further from the printed page and is fast becoming the predominant mode of reading. This requires specifically designed tools and reading environments that enable new forms of publishing information that go beyond replicating a traditional paper page on-screen (11). In addition, an often-overlooked form of reading is also affected by the limited visual focus of PDFs: they have widely been found to be inaccessible for readers with disabilities who rely on screen readers, requiring extra software that even then offers inconsistent results (1).

Along with the visual presentation of text, digital reading involves a level of interaction and co-creation of documents that has not previously been possible with print. The PDF does offer some of the tools users might require, in limited forms. Commenting and highlighting are possible, but they exist in isolation: they are not seen by anyone unless the document is sent directly to them, and the process can’t be carried out by multiple authors concurrently. Moving from reading to production, editing and workflow functions are limited to paid versions of the Adobe Acrobat software suite, and still fail to match the collaborative nature of digital work in the way that digital-first platforms like Google Docs do. This is mitigated somewhat by the fact that PDF is an open format, and several tools have been built to allow for more varied and productive uses. Simple tools like A.nnotate allow in-browser PDF annotation for individuals and groups, while others like Utopia Docs make a direct connection between static PDFs and the online world. Utopia Docs acts as an alternative PDF viewer, reading the document for what is available within it, then seeking out additional information from publishers or the community to build a more complete, connected picture (12). This kind of tool is significant given how much research is currently stored in PDF format, and given that any challenge to that dominance is unlikely to see much existing research converted. In this sense, Utopia Docs fits with Willinsky, Garnett and Wong’s vision of researchers and scholarly publishers learning to use PDFs more effectively, rather than seeking a new standard format (13). However, while their discussion of how PDFs could be better executed is comprehensive and informative, even they concede that all of their recommendations could equally apply to a PDF successor, and they do not solve all the issues PDFs present.

To continue considering forms of digital reading, there remains one that the PDF again fails to account for: machine reading. Digital documents are increasingly being read by programs, not people, meaning they need to be structured appropriately for that purpose (14). While PDFs do contain metadata to enable some machine reading, their design deliberately retains as little information as possible to reduce the file size, prioritising the information necessary for accurate display. This makes information difficult to extract and can lead to inconsistencies (13). Further, while viewing a PDF is simple, uses beyond that can be easily limited through digital rights management (DRM). As research methods have developed alongside new technologies, scholars are looking to use text and data mining to form new insights, but many publishers are employing DRM to limit these activities. While researchers expect to be able to interact with texts and data beyond simple (human) reading, and they have the capacity to do so (15), many copyright holders claim that this breaches their rights. It took a supreme court ruling in the UK to confirm that these activities do not violate copyright (16), but this applies only to that jurisdiction and researchers internationally are still in uncertain territory. While the challenges to text and data mining go beyond the format that the information is stored in, the PDF’s simple control of a document’s uses can affect even the smallest forays into text analysis. Further, if PDF is being used for archiving, any DRM measures can have an impact on future accessibility of research, although PDF/A forbids encryption (17).

Another limitation is that, once downloaded, a PDF disappears from view in the network, and so there is no way to track or measure how it is used. In scholarly publishing specifically, this drastically reduces the possible measures of how research circulates, in particular excluding it from any altmetrics that measure social engagement. In some cases this anonymity has been used to scholars’ advantage in times of need, notably with the recent popularity of Sci-Hub, which uses legitimate credentials to download a PDF of a paywalled journal article, delivers it to whoever requested it free of charge, and stores a copy for future searches. In response to this, publishing consultant Joe Esposito referred to PDFs as “a weapons-grade tool for piracy”, because of the ease of sharing and the few identifying features of any given PDF (18). Beyond this, Esposito has also pointed out that a PDF journal article dropping out of the networked environment when it is downloaded reduces the measure of its reach to download numbers and citation counts, both of which fail to properly capture the complexity of an article’s impact (19). Without the links and traces that most communications leave online, there is less available data about the use, impact, and visibility of research within the academic community, and beyond it (20). While these data are largely used to complement or predict traditional measures of impact by updating them for the digital age (21), altmetrics also have the potential to foster a system that values public use of research and offers a more level playing field for researchers in developing countries (22). By distributing and circulating research in a closed-off format, the research community is losing valuable data.

Having considered the shortcomings of the PDF format, it is important to consider what alternatives are open to scholars and publishers to continue their work and make the most of the opportunities offered to them by digital technologies. Some suggest reverting to PostScript, the original language that PDF is based on, whose limitations around computing power and printer compatibility are now all but obsolete. Others recommend plain text formats in some contexts, for storage and data management. Pettifer, McDermott, Marsh, Thorne, Villeger and Attwood (12) offer a detailed comparison of various formats that may serve different needs:

[Figure: PDF Comparison, a comparison of document formats from Pettifer et al. (12)]

However, leaving aside archiving and printing, it seems that the most universal and versatile format is HTML. HTML was invented around the same time as the PDF and its strength lies in its simplicity and flexibility (2). The language was actually first invented for scholarly communication, and offers comprehensive metadata, unlimited linking and simple reference management (23). It also allows for better searching, simpler and cheaper tools and plugins, more creativity in formatting, mobile capability and, most importantly, it allows authors and readers to maximise what it means to create and experience a document online. Where a PDF is a terminal format, HTML creates living documents that can evolve and change over time. It also enables all the functions that a PDF limits: searching, text and data mining, and reader interaction in the form of annotation, commenting and review. It is also held as the highest standard of accessibility for disabled readers (1). In essence, it enables digital reading in a way that, despite improvements over time, the PDF format is unlikely to ever achieve. It also leaves open the possibility for new and more creative engagement in the future.
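
As a rough illustration of the machine readability described above, the sketch below uses Python’s standard-library html.parser module to pull metadata and links out of a small HTML article. The sample markup, the citation_title and citation_author field names, and the ArticleScanner and scan_article helpers are all assumptions invented for illustration, not taken from any particular publisher or tool.

```python
from html.parser import HTMLParser

# Hypothetical article markup, invented for illustration only.
SAMPLE_ARTICLE = """
<html>
  <head>
    <title>An Example Article</title>
    <meta name="citation_title" content="An Example Article">
    <meta name="citation_author" content="A. Author">
  </head>
  <body>
    <p>See <a href="https://example.org/dataset">the dataset</a> and
    <a href="#ref-1">reference 1</a>.</p>
  </body>
</html>
"""


class ArticleScanner(HTMLParser):
    """Collects <meta name=... content=...> pairs and <a href=...> targets."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; tag names are lowercased.
        attributes = dict(attrs)
        if tag == "meta" and "name" in attributes:
            self.metadata[attributes["name"]] = attributes.get("content", "")
        elif tag == "a" and "href" in attributes:
            self.links.append(attributes["href"])


def scan_article(html_text):
    """Return (metadata dict, list of link targets) for an HTML string."""
    scanner = ArticleScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.metadata, scanner.links


metadata, links = scan_article(SAMPLE_ARTICLE)
print(metadata)
print(links)
```

Because the structure is stated in the markup itself, a few lines of general-purpose code are enough to recover the title, author and outgoing links; doing the same with a PDF would typically mean reconstructing that structure from layout instructions.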

Ultimately, the dominance of the PDF looks unlikely to be shaken overnight. It remains a default setting for document management, and does serve as an excellent tool for printing. However, that strength alone speaks to how inadequate the format is in an age where print is less and less relevant as a method of consuming information. It is a relic: a terminal format that, once created, assumes that what is written will not change, and will be read in isolation. For scholarly publishing, this not only affects the creation of and contributions to research, but limits how that research can be used, and how its influence can be measured. Looking to the future, researchers and publishers should consider all the possibilities of how scholarly work could be used, and how their readers might want to engage with it. While change may happen slowly, this consideration of the appropriateness of a format, rather than defaulting to the norm, should be the first step in broadening the horizons.

 

References:

1 – Turró, M. R. (2008). Are PDF Documents Accessible?. Information Technology & Libraries, 27(3), 25-43.

2 – King, J. (2004). A Format Design Case Study: PDF. Hypertext ’04: Proceedings of the fifteenth ACM conference on Hypertext and hypermedia, 95-97.

3 – Fanning, B. (2007). PDF Standards.

4 – Quora.com, How was the PDF format created?

5 – Thomas, K. (1999). Portable Document Format: An Introduction for Programmers.

6 – ISO.org, PDF Format Becomes ISO Standard.

7 – PDFA.org, PDF/A FAQ.

8 – Jones, T. (2012). Why Digital Books Will Become Writable. Book: A Futurist’s Manifesto.

9 – Han, Y. (2015). Beyond TIFF and JPEG2000: PDF/A as an OAIS submission information package container. Library Hi Tech, 33(3), 409–423.

10 – Nielsen, J. (2001). Avoid PDF For On-Screen Reading.

11 – Pearson, J., Buchanan, G. & Thimbleby, H. (2013). Designing for Digital Reading.

12 – Pettifer, S., McDermott, P., Marsh, J., Thorne, D., Villeger, A. & Attwood, T. K. (2011). Ceci n’est pas un Hamburger: Modelling and Representing the Scholarly Article. Learned Publishing, 24, 207–220.

13 – Willinsky, J., Garnett, A. & Wong, A. P. (2012). Refurbishing the Camelot of Scholarship: How to Improve the Digital Contribution of the PDF Research Article. Journal of Electronic Publishing, 15(1).

14 – McCoy, B. (2014). The Inhuman Future of Digital Reading. Journal of Electronic Publishing, 17(1).

15 – Carpenter, T. (2016). Text and Data Mining Are Growing and Publishers Need to Support Their Use – An AAP-PSP Panel Report.

16 – CopyrightUser.org, Text & Data Mining.

17 – DigitalPreservation.gov, PDF/A-1, PDF for Long-term Preservation, Use of PDF 1.4.

18 – Esposito, J. (2016). Sci-Hub and the Four Horsemen of the Internet.

19 – Esposito, J. (2008). Downloads as Failure.

20 – Holmberg, J. H. (2015). Altmetrics for Information Professionals.

21 – Sud, P. & Thelwall, M. (2014). Evaluating Altmetrics. Scientometrics, 98(2), 1131-1143.

22 – Alperin, J. P. (2013). Ask not what altmetrics can do for you, but what altmetrics can do for developing countries. Bul. Am. Soc. Info. Sci. Tech., 39, 18–21.

23 – Fenner, M. (2011). A very brief history of Scholarly HTML.

One Reply to “Wither the PDF”

  1. This is a good overview of the different ways in which PDFs are limited, and of which of the format’s characteristics have outlived their time. However, I thought one important aspect was missing: portability. The pre-Web culture of keeping copies of things locally on a machine, prior to the cloud, has been important in contributing to the PDF’s success. This could have brought into focus the cultural practices layer that was somewhat neglected here. Similarly, it might have highlighted some of the challenges of having HTML replace it over time. That said, the essay captures the limitations of the format well, and discusses what its replacement might lead to.
