Converting a manual from LaTeX to AsciiDoc

Lately I’ve been working on improving the build process for PuffinPlot. The goal is to have it fully buildable from scratch with a simple mvn package, with no dependencies other than those which can be fetched automatically by Maven. One of the obstacles to this was the manual, written in LaTeX, which means that anyone who wants to build the full package must manually install LaTeX and various related tools and extensions. There’s no automated way to produce decent HTML and PDF output from LaTeX source without stepping outside the JVM (things might have been different had NTS taken off). I therefore needed to convert the source to a form which could be processed by a Maven plugin. I decided on AsciiDoc, probably my favourite lightweight markup language. It’s well-defined, full-featured, and widely supported. The most popular AsciiDoc processor at present is Asciidoctor, written in Ruby. Conveniently for Java users, there is an official, JRuby-based JVM port of Asciidoctor – AsciidoctorJ, complete with associated Maven plugins. (JRuby turned out to be a useful addition to my Maven build environment in any case: I wrote a couple of JRuby scripts which use JGit to process commit information during the build, for example in order to set the version number automatically in the ‘About’ dialog.)

The LaTeX source of the PuffinPlot manual is relatively clean, as LaTeX source goes – which is to say, there are only a few dozen lines of package includes, configuration, and macro definitions in the preamble. No automated process will produce a 100% satisfactory conversion to AsciiDoc, but a ~90% accurate conversion, followed by some manual clean-up, is feasible. I used the ever-capable pandoc for the initial conversion; as far as I could tell, it’s the only widely available tool that can convert directly from LaTeX to AsciiDoc. (Incidentally – and curiously, given its very wide file format support – pandoc doesn’t support AsciiDoc as an input format.) The only pandoc-incompatible command in the file was \input changes.tex, which I fixed by changing it to \input{changes.tex}. The raw pandoc output was more usable than I had expected: tabular environments, sectioning, and lists were all converted successfully. The most significant omissions were section cross-references and the bibliography.
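
As a rough illustration of the pre-processing and conversion step described above, here is a minimal Python sketch. It is not the procedure I actually followed (I edited the single \input line by hand); the function names and the intermediate file name are hypothetical, and it assumes pandoc is installed and on the PATH.

```python
import re
import subprocess
from pathlib import Path

# Matches the brace-less form "\input changes.tex",
# but leaves "\input{changes.tex}" and "\inputencoding" alone.
BARE_INPUT = re.compile(r"\\input\s+([^\s{}]+)")


def brace_inputs(latex: str) -> str:
    """Rewrite brace-less \\input commands into the braced form."""
    return BARE_INPUT.sub(lambda m: r"\input{" + m.group(1) + "}", latex)


def convert(source: Path, target: Path) -> None:
    """Normalize the LaTeX source, then hand it to pandoc."""
    # Hypothetical intermediate file, written next to the original
    # so that relative \input paths are unchanged.
    fixed = source.with_suffix(".fixed.tex")
    fixed.write_text(
        brace_inputs(source.read_text(encoding="utf-8")), encoding="utf-8"
    )
    subprocess.run(
        ["pandoc", "--from=latex", "--to=asciidoc",
         f"--output={target}", str(fixed)],
        check=True,  # raise if pandoc reports an error
    )
```

A call such as convert(Path("manual.tex"), Path("manual.adoc")) would then produce the raw AsciiDoc for manual clean-up; both file names here are only examples.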

Tidying up the automatically converted source consisted mainly of:

Replacing LaTeX quotation marks ` and ' with '` and `' respectively. (AsciiDoc can of course handle the Unicode characters ‘ and ’ natively, but the equivalent ASCII digraphs seem to be preferred.)
Correcting anchor and cross-reference syntax, which pandoc mangled a little in conversion. (Emacs macros were invaluable for this.)
Restoring missing description-list terms. In several description lists which used custom macros in the \item[…] argument, the term disappeared in the AsciiDoc output. If I’d noticed this in time, I think I could have corrected it by modifying the LaTeX source. As it was, I only noticed once I had significantly modified the AsciiDoc, so I used Emacs macros to semi-automate copying the missing content across from the LaTeX file.
Replacing LaTeX math. AsciiDoc can handle LaTeX math, but it wasn’t necessary in this case: the manual didn’t contain any formulae, and only used math mode for the occasional special character. I replaced these instances with suitable Unicode characters.
Tweaking table layouts.
Citations and the bibliography.
Minor aesthetic reformatting of the source.
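
Two of these clean-up steps, the quotation marks and the math-mode characters, are mechanical enough to sketch in code. The Python below is an illustration only, not what I actually ran (I used search-and-replace and Emacs macros). The symbol table is hypothetical, and the sketch assumes that inline math arrives from pandoc in the form latexmath:[…] and that quotation marks are still in the LaTeX `…' form. Quoted text containing an apostrophe will be cut short, so the results would need checking by hand.

```python
import re

# Hypothetical symbol table: extend it to whatever the document really uses.
MATH_SYMBOLS = {
    r"\times": "×",
    r"\pm": "±",
    r"\leq": "≤",
    r"\geq": "≥",
}

# Assumption: inline math looks like latexmath:[$\times$] (dollars optional).
INLINE_MATH = re.compile(r"latexmath:\[\$?\s*(\\[A-Za-z]+)\s*\$?\]")

# A LaTeX-style `quoted' span: the opening backtick must not follow a word
# character or another backtick, so closing monospace backticks are skipped.
LATEX_QUOTE = re.compile(r"(?<![\w`])`(\S[^`'\n]*)'")


def replace_simple_math(text: str) -> str:
    """Replace single-symbol math spans with a Unicode character."""
    def substitute(match: re.Match) -> str:
        # Unknown symbols are left untouched for manual review.
        return MATH_SYMBOLS.get(match.group(1), match.group(0))
    return INLINE_MATH.sub(substitute, text)


def curl_quotes(text: str) -> str:
    """Turn `quoted' into the AsciiDoc digraph form '`quoted`'."""
    return LATEX_QUOTE.sub(r"'`\1`'", text)
```

Leaving unrecognized math spans in place, rather than guessing, makes it easy to search for any remaining latexmath: occurrences afterwards and deal with them individually.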

Converting the bibliography turned out to be a significant subproject, documented here. Asciidoctor has limited built-in bibliography support, but for proper handling of author-date citations and automated formatting of bibliography entries a plugin is required; I used asciidoctor-bibtex, which appears to be the most popular and up-to-date, but there are also asciidoctor-bibliography and asciidoc-bib.

There was one feature I had to forgo entirely in the transition: floating tables and figures. These are available as standard in LaTeX, but I could find no equivalent in Asciidoctor, either built in or with a plugin. (The search was complicated by the fact that AsciiDoc uses the term "float" to refer to the placement of images at the left or right edge of a text block, with the text flowing around them.) I don’t think that the PuffinPlot PDF manual suffered much from the removal of floats (and the HTML manual doesn’t suffer at all), but for many documents this limitation might be a deal-breaker when considering the source format.

The manual includes two SVG images, and here AsciiDoc allows for a cleaner build process than LaTeX. LaTeX can’t include SVG files directly, so the old build script called Inkscape to convert them to PDF (for PDF output) and PNG (for HTML output), introducing another heavyweight external dependency. Asciidoctor, on the other hand, can handle both output formats more elegantly: for HTML output, the SVG images are embedded directly; for PDF output the SVG can be rendered into PDF using prawn-svg. Handling font embedding for SVGs rendered to PDF is not entirely straightforward, but preferable to my LaTeX set-up (which required the font to be installed and available to Inkscape, thus adding yet another hard-to-automate dependency to the build). For now, the default fonts are good enough.

The PDF output quality is entirely adequate, but – unsurprisingly – can’t match the quality of LaTeX typography. The HTML output quality is definitely an improvement: TeX4ht, for all its ingenuity, still generates output with a distinctly 1990s feel to it. Asciidoctor, even without custom styling, produces something more sleek and modern; it also has the useful ability to inline images, resulting in a single self-contained HTML file.

The full configuration for the Asciidoctor plugin is (as ever with Maven XML) rather verbose, but not particularly complex:

<plugin>
  <groupId>org.asciidoctor</groupId>
  <artifactId>asciidoctor-maven-plugin</artifactId>
  <version>${asciidoctor.maven.plugin.version}</version>
  <dependencies>
    <dependency>
      <groupId>org.asciidoctor</groupId>
      <artifactId>asciidoctorj-pdf</artifactId>
      <version>${asciidoctorj.pdf.version}</version>
    </dependency>
    <!-- Comment this section to use the default jruby artifact provided by the plugin -->
    <dependency>
      <groupId>org.jruby</groupId>
      <artifactId>jruby-complete</artifactId>
      <version>${jruby.version}</version>
    </dependency>
    <!-- Comment this section to use the default AsciidoctorJ artifact provided by the plugin -->
    <dependency>
      <groupId>org.asciidoctor</groupId>
      <artifactId>asciidoctorj</artifactId>
      <version>${asciidoctorj.version}</version>
    </dependency>
  </dependencies>
  <configuration>
    <sourceDirectory>src/main/asciidoc</sourceDirectory>
    <!-- Attributes common to all output formats -->
    <attributes>
      <sourcedir>${project.build.sourceDirectory}</sourcedir>
    </attributes>
    <gemPath>${project.build.directory}/rubygems</gemPath>
    <requires>asciidoctor-bibtex</requires>
    <sourceDocumentName>manual.adoc</sourceDocumentName>
    <resources>
      <!-- We don't want to copy any resources. Omitting the resources
           section entirely copies all the resources in the asciidoc
           source directory. Including a resources section with an
           empty resources subsection omits the asciidoc resources but
           copies the contents of src/main/java (!). So we need this
           resources section which explicitly excludes src/main/java.
      -->
      <resource>
        <directory>${project.basedir}/src/main/java</directory>
        <excludes>
          <exclude>**/*</exclude>
        </excludes>
      </resource>
    </resources>
  </configuration>
  <executions>
    <execution>
      <id>generate-asciidoc-manual-pdf</id>
      <phase>prepare-package</phase>
      <goals>
        <goal>process-asciidoc</goal>
      </goals>
      <configuration>
        <backend>pdf</backend>
        <outputDirectory>${project.build.directory}/manual-pdf</outputDirectory>
        <sourceHighlighter>coderay</sourceHighlighter>
        <!-- Use `book` docType to enable title page generation -->
        <doctype>book</doctype>
        <attributes>
          <pdf-stylesdir>${project.basedir}/src/main/asciidoc/theme</pdf-stylesdir>
          <pdf-style>custom</pdf-style>
          <icons>font</icons>
          <pagenums/>
          <toc/>
          <idprefix/>
          <idseparator>-</idseparator>
        </attributes>
      </configuration>
    </execution>
    <execution>
      <id>generate-asciidoc-manual-html</id>
      <phase>prepare-package</phase>
      <goals>
        <goal>process-asciidoc</goal>
      </goals>
      <configuration>
        <backend>html5</backend>
        <outputDirectory>${project.build.directory}/manual-html</outputDirectory>
        <sourceHighlighter>coderay</sourceHighlighter>
        <embedAssets>true</embedAssets>
        <attributes>
          <imagesdir>images</imagesdir>
          <toc>left</toc>
          <icons>font</icons>
          <sectanchors>true</sectanchors>
          <!-- set the idprefix to blank -->
          <idprefix/>
          <idseparator>-</idseparator>
          <docinfo1>true</docinfo1>
        </attributes>
      </configuration>
    </execution>
  </executions>
</plugin>

The full POM is here, and the AsciiDoc manual source is here.

Overall, as with so many things, the conversion took far more time and effort than expected, but it was certainly a worthwhile investment: in the long run, the improvements in build automation and reproducibility will pay ample dividends.
