July 2, 2003


Blizg is another one of these blog-affinity finders. It looks for “ICBM” location data (as popularized by GeoURL) and Keywords in your headers, and finds proximate and topically-related blogs for you. Nice. It would be more powerful, though, if it could extract keywords from the stuff you actually post about, rather than the stuff you mention in your metadata.

Adventures in biomechanical translation

Machine translation (MT) is the bugbear of the professional translator. Machine-assisted translation (MAT) is a more devious, and perhaps more pernicious bugbear. Machine translation takes the translator out of the process entirely; machine-assisted translation makes use of the translator’s expertise to create patterns of source/target sentence pairs, and attempts to extrapolate these patterns through the source text. Translation agencies then use the “match rate” as a way to chisel the translator on payments.

Most of the work that I do is not very amenable to MAT (if I used it at all)–my guesstimate is that most of my jobs would have less than a 10% match rate overall. But the job I’m doing right now would be highly amenable to MAT: it’s a programming document where a given sentence may be repeated 50 times, with minor variations in predictable spots.

The job was sent to me as a series of MS Word files, which I manually concatenated into one. Word’s search/replace tools are relatively limited, but BBEdit has a powerful implementation of GREP. So, after much gnashing of teeth, I managed to export a usable HTML file from Word, and cleaned it up. This in itself could be the subject of an even-more-tiresomely long post, which I will spare everyone from reading, and myself from reliving.

Once I got the file whipped into a shape I could stand looking at, I started working out GREP patterns. Some of these were highly productive–one pass would translate 40 or so sentences. Some would only do the one I was looking at. So I’ve been manually reproducing the MAT process, and getting pretty good at GREP syntax to boot. But as I work on it, there’s always a nagging feeling that if I understood that syntax better, I could produce more generalized patterns that would capture more sentences. The ultimate, of course, would be the hideously convoluted pattern that would be required to translate the entire document in one pass–which starts getting into Chomsky territory.
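For the curious, the pattern-per-sentence approach looks roughly like the sketch below. It is written in Python rather than BBEdit's GREP dialect, and everything in it is an assumption made up for illustration: the German source sentences, the rule list, and the `translate` helper are hypothetical, not the patterns or language pair from the actual job. Each rule captures the part of the sentence that varies and drops it into a fixed English template.

```python
import re

# Hypothetical rules: a source-language pattern paired with an English template.
# The capture group (\w+) grabs the one spot that varies between sentences.
RULES = [
    (re.compile(r"Gibt den Wert der Eigenschaft (\w+) zurück\."),
     r"Returns the value of the \1 property."),
    (re.compile(r"Legt den Wert der Eigenschaft (\w+) fest\."),
     r"Sets the value of the \1 property."),
]


def translate(text):
    """Apply every rule to the text; return the result and the number of sentences replaced."""
    total = 0
    for pattern, template in RULES:
        # subn returns the rewritten text plus how many matches were replaced
        text, count = pattern.subn(template, text)
        total += count
    return text, total


if __name__ == "__main__":
    sample = (
        "Gibt den Wert der Eigenschaft Width zurück.\n"
        "Legt den Wert der Eigenschaft Height fest.\n"
    )
    result, replaced = translate(sample)
    print(result)
    print(replaced, "sentences replaced")
```

The count that comes back from each pass is what tells you whether a pattern was one of the productive ones or only caught the sentence you were staring at. A more general pattern would fold the two rules into one by capturing the verb as well, which is where the syntax starts to get hairy.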

Postscript: I finished that job. What started out as 28 Word files weighing in at a total of 1.2 MB wound up–when I finished concatenating, exporting to HTML, cleaning up, translating, and compressing with Gzip–as a 17.1 KB file. Amazing.