23 May 2014

A Practical Guide to Convert a PDF File to an ePub Version 3 Fixed Layout File

I just published a very short ebook, which can be bought on the Google Play Books store and previewed on Google Books.

 Google Play Books

This is the beginning of the book (the rest is mainly technical material on the conversion from PDF to HTML, then from HTML to ePub):

Fixed Layout

Several file formats exist for fixed layout ebooks. Below is a list of the main ones:

- PDF (Portable Document Format) [.pdf]
- DjVu (pronounced "déjà vu") [.djvu]
- ePub (electronic Publication) [.epub]
- Apple iBooks (similar to ePub) [.ibooks]
- Amazon Kindle Format 8 (KF8, similar to ePub) [.azw3]

In this book, we will focus mainly on the conversion of a PDF file to a fixed layout ePub file. This has been possible since version 3 of the ePub format, which now includes a fixed layout mode in addition to the traditional flowing text mode.

This type of conversion can be very useful, as page layout programs (e.g. Scribus) typically export the final result as a PDF (optimized for paper or online publication).

The "ePub 3.0 Fixed Layout (FXL) Format Specifications" published by the International Digital Publishing Forum (IDPF) can be found here:

http://www.idpf.org/epub/fxl
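
As a rough illustration of what this specification asks for: a fixed layout ePub declares its layout mode in the package document (the .opf file) with a `rendition:layout` property set to `pre-paginated`. The short Python sketch below only builds that metadata fragment as a string. The function name, the title, the identifier, and the modification date are placeholders of my own, not values taken from the specification.

```python
from xml.sax.saxutils import escape


def fxl_metadata(title, identifier, modified, language="en"):
    """Return a minimal <metadata> block for a fixed layout ePub 3 package.

    Assumption: for ePub 3.0 the enclosing <package> element also declares
    prefix="rendition: http://www.idpf.org/vocab/rendition/#".
    """
    return (
        '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
        f'  <dc:identifier id="pub-id">{escape(identifier)}</dc:identifier>\n'
        f'  <dc:title>{escape(title)}</dc:title>\n'
        f'  <dc:language>{escape(language)}</dc:language>\n'
        f'  <meta property="dcterms:modified">{escape(modified)}</meta>\n'
        # This single property is what switches the book to fixed layout.
        '  <meta property="rendition:layout">pre-paginated</meta>\n'
        '</metadata>'
    )


# Placeholder values, for illustration only.
print(fxl_metadata(
    "My Book",
    "urn:uuid:00000000-0000-0000-0000-000000000000",
    "2014-05-23T00:00:00Z",
))
```

Without the `rendition:layout` property, a reading system would treat the book as a normal flowing text ePub.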

A "Field Guide to Fixed Layout for E-Books" published by the Book Industry Study Group (BISG) is available for free here:

http://www.bisg.org/publications/field-guide-fixed-layout-e-books

The ePub version 3 format builds on modern Web technologies such as HTML5, CSS3, JavaScript, SVG, XML, XHTML, and WOFF.

Important remarks:

1) This book is only about fixed layout ePub. Fixed layout is appropriate when the book has a sophisticated layout with lots of images. Such books are made with desktop publishing (DTP) programs like Scribus, Adobe InDesign, QuarkXPress, or Microsoft Publisher. For books with only text or with few images, a flowing text ePub is more suitable and easier to produce.

2) Most PDF to ePub converters do not work for sophisticated layouts, because they convert a fixed layout PDF into a flowing text ePub. They simply extract the text and the images from the PDF and put them sequentially into a flowing text ePub, with all the layout gone. Most of the time the result is ugly and unusable unless the file is heavily adapted.

3) Most ePub viewers do not (yet) support fixed layout. If you try to display a fixed layout ePub with such a viewer, the result will be ugly and unusable. Two good ePub viewers that support fixed layout are Google Play Books (for tablets running Google Android or Apple iOS (iPad)) and Readium (a Google Chrome browser extension, for laptops or desktops running Microsoft Windows, Apple OS X (Mac), or GNU/Linux). Most of the time, small screens are not suitable for fixed layout books: such books should be read on tablets, not on smartphones.

Conversion Methods

There are three main methods to convert a PDF file to an ePub fixed layout file:

1) Method 1: Bitmap image only + Hidden text

Each ePub page is a bitmap image (PNG8, possibly PNG24 or JPEG) that is an exact replica of the PDF page. This bitmap image is the result of rendering the text (using vector fonts), the bitmap images, and the vector images. To maintain accessibility (select text, copy/paste text, search text, text to speech, etc.), an invisible text layer is added on top of the image. This is also how a PDF file is converted to a DjVu file. Some PDF files are made the same way, mainly when they result from scanning paper books (the text layer is then produced by OCR).

2) Method 2: Image + Text

Probably the best method, though more sophisticated than the first one, is to place on each ePub page either a bitmap image (JPEG, possibly PNG) that combines all the bitmap and vector images of the PDF page, or an SVG image that holds both the bitmap and the vector content. The text is not converted to a bitmap image or inserted in the SVG file, but added to the ePub page using XHTML5 and CSS3. The CSS uses: a) absolute positioning to put the text at exactly the same place as on the PDF page; b) styles and fonts so that the text looks exactly the same as on the PDF page. These last two steps are challenging, because HTML5 cannot always do what the PDF format can; many free and commercial tools exist, but most of them do not handle this correctly when it comes to fixed layout.

3) Method 3: SVG only

The bitmap images, the vector images, and the text are embedded in SVG files (one SVG per page). The text should be rendered as true text (with fonts), not just as outlines of the glyphs (vector images). This method is also called "SVG in the spine" (no XHTML).
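
To make the third method more concrete, here is a minimal Python sketch of what one such SVG page could look like: a full-page image plus real `<text>` elements. The function name `svg_page`, the tuple layout of `texts`, and the file name in the usage line are assumptions of mine for illustration; a real conversion would also have to deal with fonts and vector artwork.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def svg_page(width, height, image_href, texts):
    """Build one SVG page: a background image plus real <text> elements.

    texts is a list of (x, y, font_size, string) tuples in SVG user units,
    with y measured from the top of the page down to the text baseline.
    """
    ET.register_namespace("", SVG_NS)
    ET.register_namespace("xlink", XLINK_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {
        "viewBox": f"0 0 {width} {height}",
        "width": str(width),
        "height": str(height),
    })
    # The page artwork, referenced as an external image file.
    ET.SubElement(svg, f"{{{SVG_NS}}}image", {
        f"{{{XLINK_NS}}}href": image_href,
        "x": "0", "y": "0",
        "width": str(width), "height": str(height),
    })
    # The text stays selectable and searchable because it is not outlined.
    for x, y, size, string in texts:
        node = ET.SubElement(svg, f"{{{SVG_NS}}}text", {
            "x": str(x), "y": str(y), "font-size": str(size),
        })
        node.text = string
    return ET.tostring(svg, encoding="unicode")


# Hypothetical page: 816 x 1056 units with one line of text.
print(svg_page(816, 1056, "page1.jpg", [(96, 120, 16, "Fixed Layout")]))
```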

In the rest of this book, I will focus only on the second method (image + text).
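
The absolute positioning used by the second method can be sketched in a few lines of Python. This is not the output of any particular tool, just an illustration under stated assumptions: PDF coordinates are in points with the origin at the bottom left of the page, CSS pixels are 1/96 inch, and the height of a text box is approximated by its font size. The function names and the sample values are hypothetical.

```python
from html import escape

PT_TO_PX = 96 / 72  # CSS pixels per PDF point (1 pt = 1/72 in, 1 px = 1/96 in)


def css_position(x_pt, y_pt, box_height_pt, page_height_pt, scale=PT_TO_PX):
    """Map a PDF box (origin bottom-left, points) to CSS (left, top) in px."""
    left = x_pt * scale
    # PDF y grows upwards from the bottom edge; CSS top grows downwards.
    top = (page_height_pt - y_pt - box_height_pt) * scale
    return round(left, 2), round(top, 2)


def xhtml_page(page_w_pt, page_h_pt, image_href, runs, scale=PT_TO_PX):
    """Build one fixed layout page: a background image plus positioned text.

    runs is a list of (x_pt, y_pt, font_size_pt, text); y_pt is the bottom
    of the text box, whose height is approximated by the font size.
    """
    w, h = round(page_w_pt * scale), round(page_h_pt * scale)
    blocks = []
    for x, y, size, text in runs:
        left, top = css_position(x, y, size, page_h_pt, scale)
        blocks.append(
            f'<div style="position:absolute;left:{left}px;top:{top}px;'
            f'font-size:{round(size * scale, 2)}px;line-height:1;'
            f'white-space:pre">{escape(text)}</div>'
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE html>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        '<head>\n'
        '<title>Page</title>\n'
        # Fixed layout pages declare their size in a viewport meta element.
        f'<meta name="viewport" content="width={w}, height={h}"/>\n'
        '</head>\n'
        f'<body style="margin:0;position:relative;width:{w}px;height:{h}px">\n'
        f'<img src="{escape(image_href)}" alt="" '
        f'style="position:absolute;left:0;top:0;width:{w}px;height:{h}px"/>\n'
        + "\n".join(blocks) +
        '\n</body>\n</html>\n'
    )


# Hypothetical US Letter page (612 x 792 pt) with one 12 pt text run.
print(xhtml_page(612, 792, "page1.jpg", [(72, 720, 12, "Fixed Layout")]))
```

Matching the position is only half of the work: the fonts and styles still have to be reproduced for the text to look the same as in the PDF.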

Conversion Tools

There are free open source and commercial tools to convert PDF to ePub3-fxl, but some have drawbacks.

The tool and the method I will describe below are free, and give very good results for both the visual aspect and the text accessibility. The tool I will use is pdf2htmlEX, developed by Lu Wang (pseudonym: coolwanglu), a Chinese PhD student at the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology. You can find it here:

http://coolwanglu.github.io/pdf2htmlEX

As its name suggests, this tool converts PDF pages to HTML pages; it does not produce an ePub file. To get an ePub3-fxl file, I will show how to use the output of pdf2htmlEX to create one. This mainly means: a) removing the HTML viewer that pdf2htmlEX builds into its output; b) creating all the files required by the ePub format and wrapping everything into a single file.
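
The wrapping step can be sketched with Python's standard `zipfile` module. An ePub file is a ZIP archive whose first entry is an uncompressed file named `mimetype`, followed by `META-INF/container.xml`, which points to the package document. The function name `build_epub` and the `OEBPS/content.opf` path are assumptions of mine (a common convention, not a requirement), and the sketch does not generate the package document or the pages themselves.

```python
import io
import zipfile

# Assumption: the package document lives at OEBPS/content.opf.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(target, files):
    """Write an ePub container to target (a path or a file-like object).

    files maps archive paths (e.g. "OEBPS/content.opf") to str or bytes.
    """
    with zipfile.ZipFile(target, "w") as zf:
        # The mimetype entry must come first and must not be compressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


# Usage with placeholder content, written to memory instead of to disk.
buffer = io.BytesIO()
build_epub(buffer, {"OEBPS/content.opf": "<package/>"})
print(zipfile.ZipFile(buffer).namelist())
```

Writing the archive by hand like this, rather than zipping the folder with a generic tool, is what keeps `mimetype` first and uncompressed.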