Converting a 500-page PDF user manual with TransPDF
This is Part 4 (the last one) of the series Localizing an iOS drawing app with memoQ, one cliffhanger at a time.
You didn’t quite anticipate you’d need to deal with a pair of huge PDF files in order to localize a tiny iOS app, right? It might be the years I’ve spent in this industry, but I wasn’t particularly surprised. When you decide to trespass in a less familiar subject matter area, as a diligent translator you go online and scavenge the interwebs for useful material. Usually you’ll bump into something you can consider as an authoritative source or de-facto standard.
With the tiny drawing app Inkpad, that source turns out to be Adobe Illustrator, whose user manual is freely available in multiple languages. Here is the English file and the Hungarian translation that I downloaded.
You could of course just keep these two files open, search for expressions in the English text, navigate to the same chapter in the Hungarian text, figure out what the translation might be… But we’re not using a CAT tool to do it that way! If I can manage to somehow bring this content into memoQ, I will have the extremely powerful concordance function at my fingertips to literally mine it for information.
The problem is the PDF format. The acronym has a whole bunch of pretty accurate readings, most of which I cannot repeat in good company, but “pretty darn funny” is not one of them. PDFs are a PITA: a one-way format meant to make sure a nicely typeset document shows up with the exact same nice typesetting no matter where you’re viewing it. It is not meant to make accessing the text easy. (Not to mention recreating a translated document.) Luckily, for my purposes, all I need is the plain text of both files.
To convert the PDF into a civilized format I used Iceni’s wonderful TransPDF service. You start by registering for a free trial, and as the registration screen shows, the good folks at Iceni really know their customers well. You can choose the CAT tool you are using from a drop-down list, and you immediately get instructions tailored to your particular tool.
Once you’ve registered and logged in, it’s dead simple to upload a PDF and have it processed. When you upload the file you are also prompted for a language pair. The source language is relevant because it will show up in the XLIFF, and this is an important piece of information for your CAT tool. The target language, in turn, will make a difference if you want to translate the text and get it converted back into PDF: the service will look at the target language to make sure all characters show up with the correct font and the like. When the conversion is done, you can retrieve the result as an XLIFF file.
Because I had both an English and a Hungarian PDF I needed to do this twice. I will treat both of these as “source” files, so in my case, I won’t be coming back to the TransPDF website again. Normally, you would only have a source file to convert, and you would then be translating that file in memoQ. The really cool thing about TransPDF is that it’s able to recreate a fully formatted PDF from the translated XLIFF that you upload, and for that part you will be charged a small fee if you go beyond 25 pages. And when I say this is a cool thing, I really mean cool, as in way out there, extraterrestrially cool. Your only other option to deal with non-trivial PDFs is an OCR (optical character recognition) tool, and let’s just say you’re lucky that I’m not taking you there in this post.
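If you’re curious where that source language ends up, here is a minimal Python sketch that peeks inside an XLIFF file and pulls out the language code plus the plain text of each source segment. It is an illustration only: I’m assuming the file follows the XLIFF 1.2 layout (a file element with a source-language attribute, and trans-unit elements with a source child). The sample content, the file name manual.pdf and the function name read_xliff are all made up, and what TransPDF actually exports may look different.

```python
import xml.etree.ElementTree as ET

# Assumption: XLIFF 1.2 namespace; a real export may use another version.
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def read_xliff(xml_text):
    """Return (source_language, list_of_source_segment_texts)."""
    root = ET.fromstring(xml_text)

    # The language chosen at upload time is expected on the <file> element.
    file_el = root.find("x:file", XLIFF_NS)
    source_lang = file_el.get("source-language") if file_el is not None else None

    segments = []
    for src in root.iterfind(".//x:trans-unit/x:source", XLIFF_NS):
        # itertext() flattens inline formatting tags into plain text.
        text = "".join(src.itertext()).strip()
        if text:
            segments.append(text)
    return source_lang, segments


# Hypothetical, hand-written sample standing in for a converted manual.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="manual.pdf" source-language="en" target-language="hu" datatype="plaintext">
    <body>
      <trans-unit id="1"><source>Draw with the <g id="1">Pen</g> tool.</source></trans-unit>
      <trans-unit id="2"><source>Select an anchor point.</source></trans-unit>
    </body>
  </file>
</xliff>"""

lang, segs = read_xliff(SAMPLE)
print(lang)
for seg in segs:
    print(seg)
```

For a real file you would read its contents from disk and pass them to read_xliff in place of SAMPLE. The point is simply that the text is now sitting in ordinary XML that a CAT tool (or a dozen lines of script) can get at.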
Alrightie, I managed to turn two PDF files into two XLIFF files. What is XLIFF even? What good does it do? Check back for the next installment to find out.
Part 5: Mine 200 thousand words of translated content through memoQ’s hands-off aligner and translating concordance