My gripes about translation memory

I recently tweeted that I was experimenting with OmegaT, a translation-memory tool. When one of its proponents asked how I liked it, I responded:

@brandelune do not like omegaT. really only works with plain text. ugly. burdened w/ typical java on mac shortcomings. not customizable.

That barely begins to cover what I don’t like about OmegaT. I’ve been thinking about what I would like in a translation tool for a while now. My desires break down into two categories: the translation-memory engine, and the environment presented to the translator.

The TM engine

Translation memory is based on the concept of the segment. Typically, a segment is one sentence, and the TM tool can pre-segment the document at sentence breaks, but there’s no firm rule that all segments must be sentences.
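To make the idea of pre-segmentation concrete, here is a minimal sketch of splitting a document at sentence breaks. It is not how OmegaT or any other tool does it; the function name `segment_sentences` and the choice of punctuation marks are my own assumptions, and a naive rule like this will stumble on abbreviations and decimal numbers.

```python
import re

# Hypothetical helper, not taken from any real TM tool.
# A "sentence" here is a run of text followed by Western or
# Japanese sentence-final punctuation (an assumption for illustration).
SENTENCE = re.compile(r"[^.!?。!?]+[.!?。!?]*")

def segment_sentences(text):
    """Split text into sentence-like segments, dropping empty ones."""
    pieces = (m.group().strip() for m in SENTENCE.finditer(text))
    return [p for p in pieces if p]

if __name__ == "__main__":
    for seg in segment_sentences("One sentence. Another one! 第2章。次の文。"):
        print(seg)
```

Each returned string would be one segment for the TM to store or look up.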

A sentence is a logical unit for segmentation, but it’s a Procrustean rule that doesn’t always apply comfortably. I suspect that many texts that would benefit from TM would get more benefit from a different level of segmentation.

A friend of mine is working on a TM tool that allows for nested segments, and I believe that’s an important piece of the puzzle, but it’s only one piece.

The problem is that with the current state of the art, these different segments (nested or not) would need to be created manually, which defeats the purpose. The tool is supposed to make you work more efficiently, not work more. I believe that smarter segmentation logic could automatically identify phrases or clauses. Another piece of the puzzle—and one I confess I haven’t quite figured out—is comfortably specifying which segment one is translating at a given moment when working with nested segments.

I believe automatic segmentation of this kind could be accomplished, at least in part, by searching for frequency, which would have certain knock-on benefits. If a phrase or clause is repeated in a document (or across multiple documents), it should be marked as a segment. If not, it’s less important to treat it as one. Just knowing that a segment is going to be repeated later in a document could be useful to the translator, and a CAT tool that could show the other contexts in which that segment is used could be very helpful. I don’t believe any CAT tool gives a prospective view of segments like this. This kind of thing is tougher with, say, a Japanese source text, because there are no spaces as word delimiters, so there needs to be some lexical analysis (which could be tripped up by imperfect grammar or spelling). Still, it could be done.
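As a rough sketch of the frequency idea, the following counts every run of two to four words and keeps the runs that occur more than once, along with where they occur. Everything here is an assumption for illustration: the function name `repeated_phrases`, the length limits, and the use of whitespace-delimited words, which means it would not work on Japanese without a real tokenizer in place of the regular expression.

```python
import re
from collections import defaultdict

def repeated_phrases(text, min_words=2, max_words=4):
    """Map each repeated word sequence to the word offsets where it starts.

    Hypothetical helper: words are found with a simple regex, so this
    only suits languages that put spaces between words.
    """
    words = re.findall(r"\w+", text.lower())
    positions = defaultdict(list)
    for n in range(min_words, max_words + 1):
        for i in range(len(words) - n + 1):
            positions[" ".join(words[i:i + n])].append(i)
    # Keep only phrases seen more than once: these are segment candidates.
    return {phrase: idx for phrase, idx in positions.items() if len(idx) > 1}

if __name__ == "__main__":
    sample = ("The party of the first part shall pay. "
              "The party of the second part shall not pay.")
    for phrase, where in sorted(repeated_phrases(sample).items()):
        print(phrase, where)
```

The stored offsets are what would let a tool show the translator the other contexts in which a candidate segment appears later in the document.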

The environment

Translation memory would be only one aspect of the translator’s work environment, even if it is central. Even considering just that aspect, I don’t much care for the TM interfaces I’ve seen.

There seem to be two approaches to presenting the TM interface: self-contained apps and floating windows. OmegaT is a self-contained app that reads in the source document and lets you iterate through it, one segment at a time, so that you wind up with a document interleaving source segments with target segments.

Some other computer-assisted translation tools let you work inside Word (or whatever), viewing the source document in situ and providing a floating window for entering translated text that hovers over the main document like a remora, and overwrites the source as you go.

I don’t particularly care for either view. I like to be able to look at the source document in its entirety, likewise my target document, as it gives me a sense of the flow between sentences. Interleaving source and target or simply replacing source with target makes this difficult. Ideally, I’d like a large pane showing the source document unmolested and as it was meant to be read. Perhaps this is my old-fashioned paper-oriented mentality showing through.

I like the idea of a self-contained environment, but I recognize that a good one has several hurdles to overcome that the parasitic environment would not, most notably file-format support.

One of OmegaT’s biggest shortcomings, in my opinion, is that it is effectively limited to plain text. It can open HTML, docx, and some other file formats, but any format with, well, formatting is presented with tags surrounding the formatted text. In the case of a (seemingly) lightly formatted document that I tried opening in OmegaT, a chapter heading reading “第2章 高度成長下の事業拡大(1960~1966)” with no apparent formatting at all was rendered as

<w0> <w1> <w2/> <w3/> <w4/> <w5/> </w1> <w6> 第</w6> </w0> <w7> <w8> <w9/> <w10/> <w11/> <w12/> </w8> <w13> 2</w13> </w7> <w14> <w15> <w16/> <w17/> <w18/> <w19/> </w15> <w20> 章 高度成長下の事業拡大(</w20> </w14> <w21> <w22> <w23/> <w24/> <w25/> <w26/> </w22> <w27> 1960</w27> </w21> <w28> <w29> <w30/> <w31/> <w32/> <w33/> </w29> <w34> ~</w34> </w28> <w35> <w36> <w37/> <w38/> <w39/> <w40/> </w36> <w41> 1966</w41> </w35> <w42> <w43> <w44/> <w45/> <w46/> <w47/> </w43> <w48> )</w48> </w42>

I went no further with OmegaT.

Other big hurdles are what would make a translation environment more than just a TM tool: the inclusion of other tools, and the decision of what other tools to include.

One obvious tool is a dictionary. OmegaT does have a facility for job-related glossaries, which is nice as far as it goes, and I believe that other CAT tools can support job, client, and general glossaries, but none of these tie in with general-purpose dictionaries that use the EDICT or EPWING formats, or Chinese-character dictionaries (which are their own very special ball of yarn).

Another obvious tool would be a web browser, or a specialized Wikipedia browser.

I have done a series of jobs where, in addition to a text transcript, I also had video files. It might seem a bit much to integrate video playback into a translation tool, but it would have been fantastically useful for that work.

Something that may be peculiar to my style of translation is the need for some scratch space. Working in between the opening and closing tags for a segment imposes a subtle psychological confinement. I need room to spread out. When I encounter a long, knotty sentence, I’ll work out the individual clauses on separate lines, and then compose the whole thing into a sentence that hangs together. A self-contained CAT tool would need to offer that.

All in all, I think there’s a huge amount of untapped potential in CAT tools. Translation is a very narrow market, but the commercial tools out there sell for $350 and up. It’s also unfortunate that the only ones out there are either Windows-based or Java-based. There’s an unsupported, primitive, and inscrutable TM tool from Apple, and that’s it on the Mac side. There’s no reason to think that a TM tool could only succeed on the majority platform: it’s a specialized enough market that, given a breakthrough product, translators will buy the platform that runs the software, not the other way around. My layman’s understanding is that the development environment on the Mac would provide a programmer with enough tools to get a decent head-start over a Windows (or multi-platform) alternative.

4 thoughts on “My gripes about translation memory”

  1. Funny that the format you try OmegaT with is the ugliest format on earth. Of course, that does not show when you look at the file. After all it is only a nicely formatted Word 2007 file…

    Going from there to concluding that OmegaT works only for plain text is a little easy. Especially since after that trial you decided to go no further with the tool. I mean, you don’t even seem to have checked what text files look like in OmegaT… And you have certainly not checked the publicly available support list archive where such issues were specifically discussed no longer than 2 weeks ago.

    Also, the “unsupported, primitive, and inscrutable TM tool from Apple” is 1) not unsupported, 2) not primitive at all (if you take the time to read the user manual) and 3) not inscrutable (if you take the time to read the first pages of the manual).

    But I do feel your pain. There are numerous apps that run on Macs and that are supposed to help translators in their various daily tasks. I don’t like them. I think they are either ugly, or overly complex, or both. Plus they are all expensive.

    Me too, I’m waiting for the holy grail of Mac CAT.

    Meanwhile, since I still need to feed the kids, I stick to OmegaT and Appletrans because even if Microsoft can only produce crap file formats, there are plenty of ways to ignore them and still produce a very acceptable translated document, for free, with both OmegaT and Appletrans: two not butt-ugly and not difficult to understand well supported and documented applications that run flawlessly on the Mac.

    ps: about the Word 2007 file. If you want to translate it in OmegaT, open it in NeoOffice (a recent version), save it as ODT and proceed. You’ll still have tags but if you work with the document open you’ll see if you can ignore them or if they have any special meaning that requires you to keep them in your translation. Once translated, use the same NeoOffice to save it to Word 2007. To do the same thing in Appletrans, save the file to RTF first and proceed.

    ps2: I work from Japanese too, and I understand fully your need for “scratch space”. TextEdit provides me with that. I can split the string any way I want, annotate its parts, put all that together in a different area, and paste the result into my OmegaT segment, if necessary.

    ps3: the best translation environment is a nice dictionary application and a nice text editor. Kotonoko and TextEdit provide that on the Mac, and when I don’t use OmegaT or Appletrans, they are the only 2 applications I need. Along with Safari for Wikipedia etc. browsing.

    Anyway, a happy new translation year, and I sincerely hope that you’ll find the shoe that fits.

  2. Although I didn’t write about it in the post, I did try OmegaT with plain text. I also tried doing an HTML export from the Word file and GREPping out the extraneous tags, although that was very time-consuming and ultimately unsuccessful.

    The fact is that most of the work I receive comes in the form of Word files. I know that any file format invented by Microsoft is likely to be a rat’s nest, but that’s what I’ve got to work with, and the translation tool should make work easier, not make work. And for the job in question, a lot of the formatting really needed to be visible and comprehensible, because it indicated struck-out text, commentary, and in-line glossary expansions.

    As to AppleTrans, Apple’s own localization web page reads “AppleTrans is not an officially supported Apple tool. Use this software at your own risk.” So I stand by my statement that it is unsupported. Inscrutable is a matter of opinion, I suppose. In my opinion, it is inscrutable. In terms of translation-assistance features, it is lacking, hence “primitive.”

  3. Great post. I make a (Windows-only) TM tool, and I’ve run into a lot of the weaknesses of CAT tools in general that you’ve mentioned. I think that among existing CAT tools, Deja Vu probably does the best job at auto-detecting sub-phrases from your segments, but I don’t know how well they do for the CJKV languages.

    I’d also love to be able to produce a version of my software for the Mac, because I like the Mac and think it’s a highly under-served market (no offense to Wordfast and OmegaT). But as an essentially one-person shop, I simply don’t have the resources to target the Mac as well. I do have future plans for a Web-based, cross-platform version, but no fixed date.

  4. Interesting read. The multifarious requirements and layers you describe make it sound like some sort of highly modularized approach might work best. Provided some kind of standard interface, different programs could handle each layer, and ideally be switchable: multiple possible DBMSs for storage, TM engines for segment analysis, UIs for pretty, and dictionaries somewhere in the middle. Make it extensible, and folks who want can add on CMSs for file storage, converters to handle other file formats, and heck, even the video playback capability Adam mentions. Run the whole thing as a local app, or put various portions (or even the whole thing for real masochists) through a webapp and access from wherever.

    OTOH, maybe I’ve just been spending too much time dealing with Unixy systems. :)
