May 31, 2010

Mass-market paperback style epub files

I've got this shiny iPad, and I read a lot, and that means I've been digging through a lot of ebooks. Geek that I am, I've been fiddling with creating them too, since the ePub format's not all that bad to work with once you get past the intrinsic horror that XML brings to anything it touches. (Perl modules help rather a lot here.) I've found a few things out in the process.

Firstly, the formatting of an awful lot of ebooks is really sloppy. Books change font sizes from paragraph to paragraph. Paragraph formatting often shifts in the middle of a book (from indented paragraphs to extra leading between paragraphs and back). Paragraphs are often not even there, as the formatter uses breaks instead... the list goes on. We won't even talk about chaptering. Or the fact that a lot of ebooks look like someone took their Word doc, did a "Save as HTML", and threw the result at Calibre. And those are the better books.

Secondly, it turns out the ePub standard leaves out all sorts of important stuff. Like, for example, how to note cover art in an ePub book. Or how to handle tables of contents or indexes. I expect there are a number of practical reasons for that (probably dueling proposals, installed bases, intransigent committee members, and all the other political nonsense that goes with groups of more than one person) and that's fine. I'd rather have half a standard than none at all.

Here's a list of some things I've found while stripping crap out and reformatting things so they don't suck when read in iBooks. They seem to apply for Stanza, too, and hopefully to other ePub readers.

The Rules:

  1. You're not controlling the formatting. Don't even try.
  2. There are exactly four HTML tags you use in your text: <P>, <I>, <B>, and <BLOCKQUOTE>. Maybe (maybe!) you use <H1> for chapter heads.
  3. There's exactly one tag attribute you use: align="center". You only use it on sub-chapter dividers. If you prefer horizontal rules instead, then you don't get to use align="center", and you can add <HR> to your tag list.
  4. Use HTML entities for typographic characters (opening and closing quotes, em and en dashes, and ellipses).
  5. Each chapter goes into a separate XHTML file.
  6. No empty paragraphs! Or breaks, but since those aren't on your tag list you're not using them anyway.
  7. The first xhtml file in your ebook holds the table of contents, the cover, and the general front-of-book material. This is the one xhtml file you get to have more tags in, and you can use... <A> to link to the start of each chapter.

#1 drives the rest of the rules. You're building a book such that I can read it easily and straightforwardly, and I don't notice the formatting. There are plenty of books where I should notice the format, where the layout and art are breathtaking, where the book is interactive and linked to the world, where someone's spent days or weeks of their life putting together something that makes me damned impressed.

These rules are not for those books.

They're for mass-market paperbacks, things I pick up at Borders for $7 or $8. Books whose sole reason for being is the text the author wrote. Anything that distracts from that is bad, because you're not going to spend the time to do the things to the text and layout that make it better. Think "overworked intern with a two hour deadline" here. Simple! The less formatting you apply, the less there is to be screwed up.

#2 probably comes as a surprise. I and B are the tags we've been told to never use any more, since they specify formatting, and you're supposed to use CSS for that these days. Well, you can't do that. Refer to the list -- there's no class, no div, no span, no nothing that'd let you apply formatting to anything. Rule 2 is more for the mechanical parser than the ebook reader software. I can write code to rip out anything that's not an <I>. Mechanically figuring out how something was tagged as italic and preserving that is a pain.
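Here's a rough sketch of the "rip out anything that's not on the list" idea, in Python rather than Perl. The names (ALLOWED, TagWhitelist, strip_to_whitelist) are made up for illustration. It assumes the input is loose HTML that Python's standard html.parser can cope with, it throws away every attribute (including the align="center" from rule 3, which you'd have to special-case), and it keeps the text inside any tag it drops.

```python
from html.parser import HTMLParser

# Tags allowed by rule 2 (plus the optional chapter heading).
ALLOWED = {"p", "i", "b", "blockquote", "h1"}


class TagWhitelist(HTMLParser):
    """Keep text and whitelisted tags; drop other tags and all attributes."""

    def __init__(self):
        # convert_charrefs=False so entities reach the handlers below
        # instead of being decoded into raw characters.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names, so <I> and <i> both arrive as "i".
        if tag in ALLOWED:
            self.out.append(f"<{tag}>")  # attributes discarded on purpose

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")  # pass named entities through untouched

    def handle_charref(self, name):
        self.out.append(f"&#{name};")  # same for numeric references


def strip_to_whitelist(markup):
    parser = TagWhitelist()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Note what this can't do: if the italics were applied with a span and a CSS class, the span is simply dropped and the italics are gone. That's the whole argument for using <I> in the first place.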

#3 is in grudgingly, because occasionally you want "* * *" or "~ * ~" or something like that to divide sections in a chapter.

While ebook reader software is generally clever enough to know what the quote characters are in Unicode and the major 8-bit character sets, don't make it try. If you're going to use typographer's quotes, ellipses (that's three or four dots in a row, depending on the font), or proper em or en dashes, and you should, then encode them as HTML entities. Don't rely on the software to know you've used the Windows-1252 left quote character or the Unicode right double quote. &lsquo; or &rdquo; is better. Hence rule #4.
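A minimal sketch of that conversion, again in Python. The table and the function name entity_encode are just for illustration, and it assumes the text has already been decoded to a Python string (Windows-1252 bytes decode to the same Unicode characters, so one table covers both cases).

```python
# Typographic characters mapped to their named HTML entities.
ENTITIES = {
    "\u2018": "&lsquo;",   # left single quote
    "\u2019": "&rsquo;",   # right single quote / apostrophe
    "\u201c": "&ldquo;",   # left double quote
    "\u201d": "&rdquo;",   # right double quote
    "\u2013": "&ndash;",   # en dash
    "\u2014": "&mdash;",   # em dash
    "\u2026": "&hellip;",  # horizontal ellipsis
}

# str.translate wants a table keyed by code point; maketrans builds it.
_TABLE = str.maketrans(ENTITIES)


def entity_encode(text):
    """Replace typographic characters with named entities."""
    return text.translate(_TABLE)
```

For a file saved in Windows-1252 you'd decode first, something like entity_encode(raw_bytes.decode("cp1252")), and then write the result out as plain ASCII-safe text.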

#5 is for performance. Separate files for each chapter make most ebook software perform better. (iBooks certainly much prefers it.) Better performance means better battery life, and that's good. This is especially true at startup if you're in the middle of a book. For iBooks, at least, the time from startup to text display depends on how far your current position is from the start of the file the text is in. If you're on page 600 of 800, with all 800 pages in one file, it takes an annoying amount of time. If you're on page 600 of 800 but you're actually four pages into a chapter file, it takes much, much less time.
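If you're starting from one giant file, the split can be mechanical. This is only a sketch: it assumes every chapter starts with an <H1> (the "maybe" tag from rule 2), and the function names and the chapterNNN.xhtml naming scheme are hypothetical. Each chunk would still need wrapping in a proper XHTML skeleton before it goes in the book.

```python
import re


def split_chapters(body):
    """Split body markup into chunks, each starting at an <h1> heading.

    Anything before the first <h1> (front matter) comes back as its own chunk.
    """
    # Zero-width lookahead: split *before* each <h1 without consuming it.
    parts = re.split(r"(?=<h1\b)", body, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def chapter_filename(index):
    """Hypothetical naming scheme: chapter001.xhtml, chapter002.xhtml, ..."""
    return f"chapter{index:03d}.xhtml"
```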

Some people really like a lot of space between their paragraphs. Some people don't. Either way is fine, but if you're building a book and you think there should be extra space after a paragraph, the place to do it is in the .css file, not with an extra <P></P> or <P/> after each paragraph. Empty paragraphs are meaningless. Don't do that. Hence rule #6.
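Cleaning these out of an existing book is a one-regex job. A sketch, with made-up names; treating paragraphs that hold only whitespace, a non-breaking space, or a break as "empty" is my assumption about what the spacer paragraphs usually look like.

```python
import re

# Matches <p ...>(only whitespace, &nbsp;, &#160; or <br>)</p>, or a bare <p/>.
EMPTY_P = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|&#160;|<br\s*/?>)*</p>"
    r"|<p\b[^>]*/>",
    re.IGNORECASE,
)


def drop_empty_paragraphs(markup):
    """Remove paragraphs that exist only to add vertical space."""
    return EMPTY_P.sub("", markup)
```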

#7 is the one place where the previous six rules are a little looser. If you're going to have a title page and cover art and a table of contents, you stick them in this first file. This is where you get to use <A> links to point to each chapter start, where you embed the cover art link, and maybe get a little fiddlier with the fonts. Still very, very simple, but perhaps slightly less simple than the rest of the book.
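The table of contents part of that file is easy to generate once the chapters are in separate files. A sketch only: build_toc and the (filename, title) pairs are hypothetical, the output is just the body fragment, and the cover image and XHTML wrapper are left out.

```python
from html import escape


def build_toc(chapters):
    """Build a plain list of chapter links.

    chapters: list of (filename, title) pairs, in reading order.
    """
    lines = ["<h1>Contents</h1>"]
    for filename, title in chapters:
        href = escape(filename, quote=True)   # safe inside the attribute
        label = escape(title, quote=False)    # escape &, <, > in the title
        lines.append(f'<p><a href="{href}">{label}</a></p>')
    return "\n".join(lines)
```

Called with something like build_toc([("chapter001.xhtml", "Chapter One")]), it gives one <p> per chapter, each holding a single link, which stays within the spirit of the other rules.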

That's it. If you want to get fancy with fonts or leading or first-paragraph stuff or whatever then you do it in the .css file. Though, as a reader, I have to say please don't. Remember, mass market paperback. The point is for the formatting to be unobtrusive so it can get out of the way and I can enjoy the text.

One nice thing about these rules is that, if you want to get fancy, they give you base files that are in good shape for fanciness. You'll have nice clean input files with almost no crud in them, and with that you can do all sorts of Clever Things with CSS. It also means that if you feed these files into another program for WYSIWYG fancying up (and please don't, but...) you're not going to have to worry about that software getting confused by whatever fancy things the previous piece of software did.

Posted by Dan at 01:31 PM