I have been working and relying on Information Technologies since the 90’s. You could say that the moment I discovered the web, I left the paper world behind and never looked back.
It was a normal transition, very much like learning to use the microwave. At first we thought of it more as a shiny toy than as a change in lifestyle. Before we knew it, we couldn’t get along without it.
As someone who created for and read from the web, I came to treat HTML as the norm and flowable text as a must. I never realized this transition had occurred until I decided to attend university for a year. There I discovered a totally different aspect of publishing, one that relies entirely on fixed-layout.
The culture of print was based on fixed-layout; since publishing was still very much linked to print, it was also being associated with fixed-layout. I must say this came as a bit of a shock. It was akin to registering for cooking classes and later discovering that, for the sake of an established tradition, we would be using wood stoves. I have nothing against wood stoves, and one could argue that this method of cooking produces a unique and beloved product. I only question the practicality of using such a technology nowadays.
So what do we expect from publishing technologies? What do you do if all your equipment and training are based on reproducing a manual technology? It took centuries to develop the familiar layout we now use for books. House rules for typography are clear on layout. A title precedes the section it belongs to. We have been accustomed to a very linear way of thinking when it comes to publications.
Web pages in eBooks are being designed to reproduce the familiar print model. But what is happening behind the scenes? Web pages are modules in an assembly; they are not really pages. Web pages are not read by computer software; they are parsed. There is a difference between reading and parsing: the computer does not try to make sense of the information, so everything it needs must be explicitly available. According to Wikipedia, parsing is a traditional grammatical exercise, sometimes known as clause analysis, which involves breaking down a text into its component parts of speech with an explanation of the form, function, and syntactic relationship of each part.
Computers do not understand that a title relates to the text that follows it unless the structure specifies this relationship. In languages such as XML and HTML, the structure is specified using tags. When we read a text, the activity of parsing and of recognizing relationships is done by the human brain; it is such a natural activity that we don’t even think about the complexity of the process.
In large publishing environments, storing content as XML structures is the norm, much like storing data in tables is the norm for accounting. To explain how an XML structure functions, think of XML as a series of containers of different sizes, used for different purposes; you can picture them as cardboard boxes containing smaller cardboard boxes and text. Text is always found in a box or it can’t be moved. To continue along this line of thought, a section is a container, a paragraph is a container, and so on. Any text organized as a block on a page is a container, a block being defined as text separated from other text using borders or spaces, the way paragraphs are separated from each other.
The basic relationship is simple: any text or containers inside a container belong to that container. So if you imagine a chapter as being a cardboard box, you could fill it with a small title box, paragraph boxes and possibly section boxes; section boxes would also contain a title box and many paragraph boxes. Any text displayed as a block is contained. Computers do not parse a publication according to its visual layout; the content is parsed in terms of containers delimited by tags.
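To make the cardboard-box picture concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The element names (chapter, section, title, para) and the sample text are made up for illustration; they are not taken from any particular publishing schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical chapter markup; the element names are chosen for illustration only.
CHAPTER_XML = """
<chapter>
  <title>Chapter One</title>
  <para>Opening paragraph.</para>
  <section>
    <title>First Section</title>
    <para>Section paragraph one.</para>
    <para>Section paragraph two.</para>
  </section>
</chapter>
"""

def describe(element, depth=0):
    """Return one line per container, indented by its nesting depth."""
    text = (element.text or "").strip()
    label = f"{'  ' * depth}{element.tag}"
    lines = [f"{label}: {text}" if text else label]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(CHAPTER_XML)
outline = describe(root)
print("\n".join(outline))

# The section's title is found by asking the section box what it contains,
# not by looking at where the title sits on a page.
section = root.find("section")
section_title = section.find("title").text
print(section_title)
```

The parser never looks at layout. It only follows the tags, so the title "First Section" belongs to the section because it sits inside that box.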
What happens when you transform an HTML file to a PDF format? It is as if you had emptied your boxes, lined up the content and taken a picture of all the content spread out on the floor. There is no longer any relationship between the different sections of content; there is only placement.
Now, if you were asked to transform this PDF back into HTML, you could only organize your content according to the picture taken. Your boxes would now be labeled as section J-12 or B-07 according to the map made from the picture, the same way pieces of an advertising image used to be glued to the 10ft by 20ft panels of a poster site. The relationship between the boxes would be one of placement only, not of structure. I know that the box containing the title comes before the box containing the first paragraph; there is nothing in the labeling that would tell me that the title belongs with that paragraph rather than with the paragraph before it.
The most striking example is the transformation of an HTML table to a PDF table. In HTML, a table is composed of cells arranged in rows. The table contains rows and the rows contain cells. When you transform an HTML table to a PDF format, the visual placement is the same so you may think that this relationship is still present. In fact, it is your brain that is organizing the information as cells within rows within a table. The moment you try to convert this PDF table back to HTML, you see that the coded content is organized and styled to look like a table, but you are back to the J-12 or B-07 blocks relating to the placement map.
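The table case can be sketched in a few lines of Python. This is a toy model of the idea, not of the PDF format itself: the sample table, the coordinates and the guess_rows helper are all assumptions made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table markup used only for this illustration.
TABLE_XHTML = """
<table>
  <tr><th>Format</th><th>Layout</th></tr>
  <tr><td>ePUB</td><td>flowable</td></tr>
  <tr><td>PDF</td><td>fixed</td></tr>
</table>
"""

table = ET.fromstring(TABLE_XHTML)

# Structured view: the table contains rows, and each row contains its cells.
rows = [[cell.text for cell in row] for row in table.findall("tr")]
header_tags = [cell.tag for cell in table.find("tr")]

# Placement view: each piece of text keeps only made-up page coordinates.
COLUMN_WIDTH, ROW_HEIGHT = 100, 20  # arbitrary illustrative units
fragments = [
    (col * COLUMN_WIDTH, r * ROW_HEIGHT, text)
    for r, row in enumerate(rows)
    for col, text in enumerate(row)
]

def guess_rows(fragments):
    """Regroup fragments into rows using nothing but their coordinates."""
    by_y = {}
    for x, y, text in fragments:
        by_y.setdefault(y, []).append((x, text))
    return [[text for _, text in sorted(cells)] for _, cells in sorted(by_y.items())]

print(rows)
print(header_tags)
print(fragments)
print(guess_rows(fragments))
```

In this tidy grid the guess happens to recover the same rows, but only because every fragment lines up perfectly. The fragments no longer say which cells were header cells, and a cell that wraps onto a second line or sits a few units off would be grouped by where it landed rather than by the row it belongs to.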
HTML code is not based on placement; markup language code is meant to be parsed. Parsing extracts form, function and syntactic relationship between the tagged parts. When XML and HTML’s natural structures are hijacked to represent placement (tagging display rather than structure), you lose the relationships.
Some fixed-layout authoring tools such as Microsoft Word are used to create HTML and do so with limited success. The HTML produced with fixed-layout tools usually ends up as a hybrid between placement and structure. PDF files are output files meant to be the representation of the print file. One does not author directly in a PDF format any more than one authors in print format, unless one is using a mechanical typewriter. You could scan your paper-printed pages and create a PDF, but you don’t scan a paper-printed page and expect decent HTML.
This discussion takes us back to the question of defining a portable document format. Are we trying to package a portable print format or a web format? Do we want to display the packaged content on a mobile device or project it on a presentation screen? The criteria for a readable, user-friendly document on a mobile device are not the same as those expected for print. On a mobile device, text may be made bigger or smaller and will still flow. Zooming in on text is not the same as enlarging the font size. Zooming requires an effort from the reader; this “pinching” is only required to zoom in on fixed-layout text formats such as PDFs.
ePUBs are packaged HTML. The ePUB standard, like the HTML standard, is non-proprietary. The ePUB packaging standard is designed for mobile devices. Not only can ePUBs be printed, they can also be styled both for screen display and for print display. A PDF can only be styled for print display.
I leave it to the reader to decide whether they will favor the print display of PDF for mobile devices or the responsive screen display of ePUBs.