Friday, August 12, 2011

Mobi OPF from Epub

Yesterday, I began my efforts to re-use epub files when generating mobi in git-scribe. The hope is that this will produce better results and DRY up the toolchain a bit.

The do_mobi method currently generates epub first (if the epub was previously generated, do_epub will return immediately) and then decorates the epub for mobi use:
def do_mobi
  return true if @done['mobi']

  do_epub

  info "GENERATING MOBI"

  decorate_epub_for_mobi

  cmd = "kindlegen -verbose book_for_mobi.epub -o book.mobi"
  return false unless ex(cmd)

  @done['mobi'] = true
end
Last night, I was able to produce toc.html, the file that the Kindle uses for its Table of Contents (and to determine the start page). This is accomplished via the add_epub_toc method in decorate_epub_for_mobi:
def decorate_epub_for_mobi
  add_epub_etype
  add_epub_toc
  zip_epub_for_mobi
end
Currently, add_epub_toc only generates the toc.html file, but that is not sufficient for mobi. I also need to modify the Open Packaging Format (OPF) file to include the table of contents. So I add a call to the new add_html_toc_to_opf:
def add_epub_toc
  build_html_toc
  add_html_toc_to_opf
end
The OPF is an XML file with three different sections that need to be updated to include toc.html. The three sections are:
  • <manifest>—the actual contents of the epub/mobi file. It associates an ID with the file / url
  • <spine>—describes the order in which the documents are read (and which are optional)
  • <guide>—points to "meta" documents (the table of contents, the cover, etc)
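For orientation, here is a rough sketch of what an OPF looks like once toc.html is referenced in all three sections. The sample document is hypothetical and heavily trimmed: the ids, file names, and titles are assumptions for illustration, not copied from a real git-scribe build. The small helper, also made up for this example, just reports which sections mention the HTML table of contents.

```ruby
# Hypothetical, trimmed-down content.opf. The ids and file names are
# illustrative assumptions, not output from git-scribe.
SAMPLE_OPF = <<~XML
  <package xmlns="http://www.idpf.org/2007/opf" version="2.0">
    <manifest>
      <item id="ncxtoc" media-type="application/x-dtbncx+xml" href="toc.ncx"/>
      <item id="htmltoc" media-type="application/xhtml+xml" href="toc.html"/>
      <item id="ch01" media-type="application/xhtml+xml" href="ch01.html"/>
    </manifest>
    <spine toc="ncxtoc">
      <itemref idref="htmltoc"/>
      <itemref idref="ch01"/>
    </spine>
    <guide>
      <reference type="toc" title="Table of Contents" href="toc.html"/>
    </guide>
  </package>
XML

# Hypothetical helper: list the sections that reference the HTML TOC,
# either by id (htmltoc) or by file name (toc.html).
def sections_mentioning_toc(opf)
  %w[manifest spine guide].select do |section|
    body = opf[%r{<#{section}\b.*?</#{section}>}m]
    body && body.match?(/htmltoc|toc\.html/)
  end
end

p sections_mentioning_toc(SAMPLE_OPF)
```

The manifest declares the file under an id, the spine refers to that id to place it in the reading order, and the guide points at the file directly.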
So I need to read the epub OPF and add the toc.html to each of the three sections:
def add_html_toc_to_opf
  Dir.chdir('book.epub.d/OEBPS') do
    opf = File.read('content.opf')
    opf = add_html_toc_to_opf_manifest(opf)
    opf = add_html_toc_to_opf_spine(opf)
    opf = add_html_toc_to_opf_guide(opf)
    File.open('content.opf', 'w') do |f|
      f.puts opf
    end
  end
end
Each of the three helper methods follows a similar pattern: replace a known element with the same element plus an entry for toc.html. Adding toc.html to the <manifest> section of the OPF looks like this:
def add_html_toc_to_opf_manifest(opf)
  opf.sub(/<item id="ncxtoc".+?>/) { |s|
    s + "\n" +
    %q|    <item id="htmltoc" | +
    %q|media-type="application/xhtml+xml" | +
    %q|href="toc.html"/>| }
end
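The spine and guide helpers are named above but not shown. As a rough sketch of how they might follow the same pattern, the versions below are hypothetical and rest on a few assumptions: that toc.html should come first in the reading order, that the OPF already contains a <guide> element with a closing tag, and that "Table of Contents" is an acceptable title. None of this is confirmed by the actual git-scribe code.

```ruby
# Hypothetical sketch of the spine helper (not the original implementation).
# Assumption: toc.html goes first in the reading order, so the itemref is
# inserted right after the opening <spine ...> tag.
def add_html_toc_to_opf_spine(opf)
  opf.sub(/<spine\b[^>]*>/) { |s|
    s + "\n" + %q|    <itemref idref="htmltoc"/>|
  }
end

# Hypothetical sketch of the guide helper (not the original implementation).
# Assumption: a <guide>...</guide> element already exists. The reference is
# placed just before the closing tag, reusing that tag's indentation.
# If there is no </guide>, the OPF is returned unchanged.
def add_html_toc_to_opf_guide(opf)
  opf.sub(%r{([ \t]*)</guide>}) {
    indent = $1
    %Q|#{indent}  <reference type="toc" title="Table of Contents" href="toc.html"/>\n| +
    %Q|#{indent}</guide>|
  }
end
```

Both take the OPF as a string and return a new string, so they slot into the chain of calls in add_html_toc_to_opf without touching the file system themselves.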
And, happily, with toc.html added to all three sections, that works! After regenerating the mobi version of The SPDY Book and replacing the previous version on my Kindle, I have a Table of Contents, a correct start page, and pretty decent formatting. There are still a few details in need of cleaning up, but this approach seems promising.

At some point, I will certainly need to clean up this code. The fact that so many methods share the same name prefix screams for a module or class to encapsulate the functionality. I also completely lack testing of any kind. The problem, of course, is that I do not know my target state. I have to keep guessing, trying it on the Kindle and then correcting my mistakes. That is all well and good for spiking towards understanding, but it will not hold up as a robust implementation going forward.

But all that is for another day.


Day #111
