I wrote this glossary mostly for my own delight (yes, I know, my hobbies are known to be on the
bizarre side), but also to help folks who are trying to cut their way through the translation
industry jargon. For a crowd whose mission is to help people communicate, we sure have come
up with a pretty dense web of insider vocab, huh?
I encourage you to take everything here with a generous pinch of salt. It is just one guy’s
take on things, and the guy in question has a dog in this fight through his affiliation with
memoQ. Which is, objectively, the world’s most awesomest CAT tool. Right. But seriously, I did my best to
stay tool-agnostic here. I included the least possible amount of memoQ jargon, but I consciously
did not include jargon from any other tools. I simply feel I have neither the knowledge nor the
authority to do that, and anyways, thataway be dragons.
If you like what you see here, do share it with your friends, colleagues and enemies: there are
handy buttons for this purpose on the right. And if you find any minor or major errors or
inaccuracies, give me a shout through your preferred social media channel!
A short communication between an installed program and the manufacturer’s website. The program sends your serial number and
a few anonymous details about your computer. The website checks that you own a license or that you are just starting a
free trial, and returns a code to authorize the program to run on your computer.
Often, when you get a document to translate, you receive a set of previously translated documents along with it, or you can
find matching pages on websites. Those may contain a whole lot of translations you could use either as
TM matches or through concordancing. Problem is, the TM needs
segments, and you have whole documents. Alignment means splitting source and target documents into segments, and
algorithmically finding out which target segment corresponds to which source segment. Not a straightforward thing
to do! Advanced CAT tools have a function to automate the bulk of the work and help you correct the rest.
Before you accept a job, you need to know how much text there is to translate. But you already have TMs
with past translations, so you also want to know how much new text there is, and how many fuzzy or exact
matches you can expect. That’s what analysis does: it compares your text against your TMs and
corpora, and gives you a neat breakdown expressed in segment, word and character counts.
Analysis is sometimes used interchangeably with statistics, which has absolutely no fancy scientific meaning in this context.
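To make the breakdown a bit more tangible, here is a minimal sketch of an analysis pass. The similarity measure (word-level `difflib` ratio), the 75% fuzzy threshold and the three bands are my own assumptions for illustration; real tools use their own scoring and finer-grained bands.

```python
from difflib import SequenceMatcher

def rate(segment: str, tm_source: str) -> int:
    """Word-level similarity between two segments, as a percentage (an assumed measure)."""
    return round(100 * SequenceMatcher(None, segment.split(), tm_source.split()).ratio())

def analyze(segments: list[str], tm_sources: list[str], fuzzy_threshold: int = 75) -> dict[str, int]:
    """Return word counts per band: exact, fuzzy, or new."""
    report = {"exact": 0, "fuzzy": 0, "new": 0}
    for segment in segments:
        # Best similarity against anything already in the TM; 0 if the TM is empty.
        best = max((rate(segment, source) for source in tm_sources), default=0)
        if best == 100:
            band = "exact"
        elif best >= fuzzy_threshold:
            band = "fuzzy"
        else:
            band = "new"
        report[band] += len(segment.split())
    return report

tm = ["Press the green button."]
job = ["Press the red button.", "Press the green button.", "Close the lid."]
print(analyze(job, tm))
```

The report counts words only; a fuller version would track segments and characters in the same loop.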
A nerdy term to say that a program allows other programs to use its functions, just as if a human was clicking its
buttons. If a program has no API, then it’s impossible to integrate it with other systems, and humans end up with
tendonitis from lots of completely unnecessary clicking. It is particularly important to make sure a cloud-based
tool you’re considering has an API. If it does not, you may get locked in, with no easy way to retrieve your data
if you want to switch.
When you’re working in a memoQ online project, your translations are initially saved only on your computer.
You can choose to synchronize a few times a day, but if you enable auto save, your translations are
sent to the server immediately, without holding you up. This way others can see your work *almost* as if you were
editing a Google doc in real time. What better way to stay consistent?
If you want to see how an expression has been translated, concordance gives you exactly that.
But a good CAT tool can do more: by looking at the source segment it can find the parts that occur in a
lot of other segments, and point you to them. That’s effectively saying, “Hey! These phrases seem to be all
over the place, it’s probably a good idea to concordance them right now!” And if you’re very lucky, those
phrases also occur as entire source segments, and the TM or corpus will give you their translation.
I don’t know about you, but I hate to type numbers as I’m translating, and I also hate to lose the flow to
select, copy and paste something over from the source. In addition to numbers, source segments also contain
other things that can go straight into your translation: tags, non-translatables, terms.
If you just press and release Ctrl (in memoQ), AutoPick highlights all the special entities in your source,
lets you cycle through them with the arrow keys, and insert the next one with a single keystroke. It also
re-formats numbers to match your target language’s conventions.
Almost every text you translate has repetitions: segments that occur multiple times. With some technical
texts these may even make up the majority. One responsibility of TMs is to pick these up, but CAT tools
can do even better. If you enable auto-propagation, then as soon as you confirm a segment, the tool immediately
populates all the other occurrences in the document and marks them as confirmed.
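In code, the idea boils down to something like the sketch below. The segment structure (a dict with `source`, `target` and `status`) is hypothetical and has nothing to do with how any particular tool stores its documents.

```python
def confirm_and_propagate(segments: list[dict], index: int, translation: str) -> int:
    """Confirm one segment and copy its translation to every segment with the same source.

    Returns the number of *other* segments that were filled in.
    """
    source = segments[index]["source"]
    propagated = 0
    for position, segment in enumerate(segments):
        if segment["source"] != source:
            continue
        segment["target"] = translation
        segment["status"] = "confirmed"
        if position != index:
            propagated += 1
    return propagated

document = [
    {"source": "Click OK.", "target": "", "status": "new"},
    {"source": "Close the lid.", "target": "", "status": "new"},
    {"source": "Click OK.", "target": "", "status": "new"},
]
print(confirm_and_propagate(document, 0, "Klicken Sie auf OK."))
```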
Most texts (particularly technical, legal and financial) have recurring entities that follow some pattern.
Think of a date: 05/27/1978, to be “translated” as 27.5.1978. Auto-translation rules allow you to create
regular expressions that recognize and transform such patterns in an incredibly flexible way.
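For the date example above, a rule could look roughly like this. It is a sketch in plain Python rather than any tool’s actual rule syntax, and the pattern only covers the MM/DD/YYYY shape.

```python
import re

# Assumed rule: US-style MM/DD/YYYY dates become D.M.YYYY.
US_DATE = re.compile(r"\b(0?[1-9]|1[0-2])/(0?[1-9]|[12]\d|3[01])/(\d{4})\b")

def translate_dates(text: str) -> str:
    """Rewrite every US-style date in the text as day.month.year, dropping leading zeros."""
    def swap(match: re.Match) -> str:
        month, day, year = match.groups()
        return f"{int(day)}.{int(month)}.{year}"
    return US_DATE.sub(swap, text)

print(translate_dates("Born on 05/27/1978 in Budapest."))
# Born on 27.5.1978 in Budapest.
```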
Short for bidirectional text, i.e., the right-to-left scripts used to write languages like Arabic, Hebrew or Farsi.
It’s bidirectional because numbers and some proper names in Latin letters are written from the left within the overall
right-to-left text flow.
A specially formatted Word document that contains a translation’s source and target segments, and often
also comments and other information. This way, a translator can share work in progress with a client or a
domain expert who has no CAT tool, only a word processor. CAT tools, in turn, can read an edited
bilingual RTF with changes and comments, and bring the updates back into the translation environment. Some old
formats relied on hidden text and were very easy to ruin with a single misplaced edit. These days it’s more common to
see a table with three or more columns.
CAL is short for “concurrent access license.” While individual licenses allow a single person to use a program,
an organization can purchase CAL licenses instead, which can be handed out to any end user on an on-demand basis.
The limitation is how many people can use the tool at the same time; it doesn’t matter who they are or
where they work from.
Software that helps translators and reviewers work more efficiently and deliver better quality, even if the work involves
many people working simultaneously on the same large text. Sometimes the name is reduced to “TM tool” because
TMs were the first function that CAT tools focused on. Jost Zetzsche prefers translation environment tools, or TEnT;
I tend to agree. Translation Management System (TMS) is often used synonymously because the border, really,
is quite blurry.
Stands for the three East Asian languages, Chinese, Japanese and Korean. There are two Cs because Chinese can be
written either with simplified characters (PRC and Singapore) or traditional ones (Hong Kong and Taiwan).
When a translator or reviewer checks out an online project in memoQ, the tool downloads the assigned documents
and sets up a correctly configured working environment. This eliminates saving email attachments and going
through an error-prone series of steps, saving time and ensuring all project participants work with the right
resources and settings.
Software used to edit, organize and publish large amounts of content. A CMS typically tracks who is responsible
for what content, whether it is approved or obsolete, what the content applies to, and much more. Often, CMSes
break down content into smaller chunks, which are reused in several related documents. CMSes are important
because an incredible amount of text that gets translated comes from them, typically as small chunks of XML
and often in the DITA format.
In a CAT tool you can mark entire documents, source or target segments, or just a short part
within a segment. You can add a remark, or use the function to simply highlight something. This way you can
communicate with other translators, reviewers or even clients, or just bookmark something for yourself to
return to later. You can keep comments private, or choose to export them as part of the finished translation.
A function of translation memories and LiveDocs corpora that allows you to search for a word or
expression, retrieving all translated segments where it occurs. This is nothing short of a small wonder,
allowing you to “Google” existing translations. memoQ also highlights the expression’s most probable translation
within the target segments, just like Linguee, but from your own private data.
Usually a short machine-readable text that identifies a string that belongs to a specific place in an
app or a game. It’s crucial to distinguish between, say, “Open” on a label (translated into German as “Offen”)
or on a button (translated as “Öffnen”). The TM stores the ID and returns a context match if the
same text occurs with the same ID later.
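A toy version of such a lookup might look like this. The class, the ID strings and the match labels are all made up for illustration; the point is only that the ID is part of the key.

```python
class ContextTM:
    """A tiny in-memory TM that keys translations by (context ID, source text)."""

    def __init__(self) -> None:
        self._units: dict[tuple[str, str], str] = {}

    def store(self, context_id: str, source: str, target: str) -> None:
        self._units[(context_id, source)] = target

    def lookup(self, context_id: str, source: str):
        """Return (target, match kind), preferring a hit with the same context ID."""
        if (context_id, source) in self._units:
            return self._units[(context_id, source)], "context match"
        for (_, stored_source), target in self._units.items():
            if stored_source == source:
                return target, "exact match"
        return None

tm = ContextTM()
tm.store("status.label", "Open", "Offen")
tm.store("file.button", "Open", "Öffnen")
print(tm.lookup("file.button", "Open"))
```

Without the ID, both entries would collapse into a single “Open” and the tool could only guess which translation applies.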
A seemingly simple text-based format that stores several values in each line, separated by commas. It’s
still widely used to exchange glossaries, and sometimes even for translatable content. In spite of
its apparent simplicity it’s very easy to mess up; the most common problem is using the wrong code page
instead of Unicode.
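Here is a small sketch of what goes wrong. It reads a two-column glossary from raw bytes with an explicit encoding; the function name and the two-column layout are assumptions for the example.

```python
import csv
import io

def read_glossary(raw: bytes, encoding: str = "utf-8-sig") -> list[tuple[str, str]]:
    """Decode raw CSV bytes with an explicit encoding and return (source, target) pairs."""
    text = raw.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""))
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

data = 'open,öffnen\n"save, as",speichern unter\n'.encode("utf-8")
print(read_glossary(data))                     # decoded as Unicode (UTF-8)
print(read_glossary(data, encoding="cp1252"))  # same bytes, wrong code page
```

The second call does not fail; it quietly returns a garbled umlaut, which is exactly why this problem tends to go unnoticed until the glossary is already in use. The quoted `"save, as"` entry shows the other classic trap: commas inside a value.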
The translator or reviewer’s action to signal that they are finished with their task, such as the translation
of a given document. Delivery is not a symbolic step: in a system like memoQ, it usually triggers a series
of actions like automated QA checks or emailing the finished translation back to the end client.
Technology that allows you to dictate text, instead of typing it on a keyboard. Commercial tools have become
incredibly good for major languages. Dictation is preferred by a minority of translators; they, however,
report a productivity boost of 50% or more over typists.
A neologism derived from typo. It means an error by the dictation software. Unlike typos, which are nonsensical
on an elementary level, dictos are insidious because they are valid phrases that sound like what was intended,
but mean something completely different, like an immature middle school joke. Think Uranus.
Short for Document Information Typing Architecture, DITA is exactly as unsexy as it sounds, but tremendously
useful. It is an open standard that defines how to structure and reuse content in CMS systems. The
format is based on XML, and if your CAT tool supports it, you can deal with a huge share of the
content coming from several different CMSes.
DTP tools include the likes of FrameMaker and InDesign, used to produce professionally typeset printed
documents. In the industry DTP typically means an activity after translation and review. Translated text
looks really bad in the original format unless you adjust the typesetting to accommodate longer paragraphs,
different special characters, or even a complete left/right directional swap.
A number that expresses how different one text is from another, usually derived from the number of insertions,
deletions and swaps needed to get from here to there. While similar TM matches are qualified by fuzzy
match rates, edit distance is sometimes used to measure the extent of a reviewer’s changes.
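One common way to compute such a number is the Levenshtein distance, sketched below at the character level. Whether a given tool counts characters or words, and whether it uses exactly this algorithm, is something I am not claiming here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))  # 3
```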
One key benefit of CAT tools is that you always translate in the same familiar editor, regardless of the
file format your text came in. That means CAT tools must somehow extract the text from all the different file
formats. The component that does this is called a file format filter: it “filters” text from all the other
stuff in the file. Bringing the text into the CAT tool is called importing a file; retrieving the translation
in the original format is called exporting it.
Every filter comes with its own options that affect how it works (“Do you want to extract the hidden text
from this Word file?”), and for some formats, notably XML, these settings make an enormous difference.
In memoQ, the Find function has an option that puts all occurrences on a separate list, instead of walking
through them one by one from the pop-up window. The outcome is the find/replace listing, where you can
review each segment comfortably and decide where to replace and where to leave as is.
Many file formats, particularly from DTP tools, tend to use fonts that look really good, but cannot draw
a lot of special characters. If your target language happens to have a lot of these, the translated file will
look ugly, or skip letters outright. Font substitution is a function of file format filters that tweaks
the file, replacing the original font with one that has the right glyphs for your target language.
If there is no exact or fuzzy match for a segment in your TMs or corpora, a lot of the segment’s
parts may still have a match from a term base, non-translatables or auto-translatables.
Fragment assembly takes all of these and just replaces them with their target equivalents, giving you a
patchwork segment that might still take a lot less work to brush up than translating it from scratch.
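Stripped to its bare bones, the patchwork can be sketched like this. The term base is a plain dictionary here, and matching is literal and case-sensitive, which is a deliberate simplification.

```python
import re

def assemble(source: str, term_base: dict[str, str]) -> str:
    """Replace every known term in the source with its target equivalent."""
    if not term_base:
        return source
    # Longest terms first, so a long term wins over a shorter term it contains.
    terms = sorted(term_base, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda match: term_base[match.group(0)], source)

terms = {"translation memory": "Übersetzungsspeicher", "memory": "Speicher"}
print(assemble("Open the translation memory", terms))
# Open the Übersetzungsspeicher
```

The result is half English, half German, and that is the point: what is left for the human is the glue between the fragments.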
First impressions are correct here: this is one of the fuzziest words in the entire industry jargon.
Initially a fuzzy match was used in contrast to an exact match from a TM: you get a translation
that is fully legit, except it’s the translation of something more or less different from your current
source segment. Just how different is expressed by the fuzzy match rate. Eventually fuzzy matching
was also extended to terminology, where it can be pretty useful if your language is in the habit of
changing letters in the middle of words.
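To show what a match rate might look like in practice, here is a sketch that scores similarity on whole words with Python’s `difflib`. Every CAT tool has its own formula, so treat the numbers as illustrative and not as what your tool would report.

```python
from difflib import SequenceMatcher

def match_rate(new_source: str, tm_source: str) -> int:
    """Similarity of two source segments as a percentage, compared word by word."""
    new_words = new_source.lower().split()
    tm_words = tm_source.lower().split()
    return round(100 * SequenceMatcher(None, new_words, tm_words).ratio())

def best_match(new_source: str, tm: dict[str, str]):
    """Return (rate, stored source, stored target) for the closest TM entry, or None."""
    if not tm:
        return None
    tm_source = max(tm, key=lambda stored: match_rate(new_source, stored))
    return match_rate(new_source, tm_source), tm_source, tm[tm_source]

tm = {"Press the green button.": "Drücken Sie die grüne Taste."}
print(best_match("Press the red button.", tm))
```

Three of the four words agree, so this measure gives 75%: the stored translation is perfectly good, it just belongs to a slightly different sentence.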
In the olden days, the find function only worked if you first opened a document. Global find & replace
searches through all documents in your project: a massive difference if, for instance, your job entails
hundreds of tiny XML files from a CMS.
A garden-variety analysis tells you how much of your text has fuzzy or exact matches from your
existing TMs and corpora. But even if you start with an empty TM, as you progress in a document,
you will start getting matches from your own new translations! The homogeneity function quantifies these “internal”
matches as part of the analysis, going beyond the mere detection of repetitions.
A two-column grid layout where you see source on the left and target on the right has engulfed CAT tools
like a flash flood washing away a hapless creekside camper’s stock of ABC soup. But many a translator still
prefers to see their target text below the source. The horizontal layout option reshuffles the active segment’s
dominos, so source and target show up one below the other.
Localizing a product entails more than just translation: it includes things like showing dates in the
right format, displaying temperatures in Celsius vs. Fahrenheit, writing first name last or vice versa,
and the like. It requires extra effort to enable a product to do all this; that effort is called internationalization.
The ability of CAT tools to understand each other’s formats and APIs, and to support standard formats well,
so that people using software from different manufacturers can work together without drama, tears and major tragedies.
To “leverage” past translations is fancy talk for: the tool gives me what I already translated, I don’t need
to do it again. “Leverage” as a noun is fancy talk for the extent to which that happens: if a tool promises to enhance
leverage, you should expect to type fewer new characters while translating the same text.
This is memoQ lingo for things like non-translatables, segmentation rules, and a lot more. As opposed to heavy resources, which mean TMs, LiveDocs corpora and Muses, light resources have much less data. But while in many other tools they are “settings,” in memoQ they are resources: they have a name; they can be exported and imported; you can reuse them in different projects; and they can be shared online through memoQ server.
This term is probably the single biggest crime of the translation industry against proper English usage. For every educated person, a linguist means someone like Noam Chomsky, William Labov, Daniel Everett, or Arrival’s Amy Adams: a scientist studying language in the mind, or language in society. In the translation industry, “linguist” is sloppy shorthand for translator or reviewer.
memoQ’s approach to alignment, where you simply throw a bag of source and target documents at the tool, and start translating. The tool first aligns the documents, then their segments, and indexes them in the background so they immediately give you lookup results in the editor. There will inevitably be errors, but you only spend time fixing those that actually give you matches.
memoQ’s alternative to TMs. While a TM holds a homogenous mass of translated segments in no particular order, a LiveDocs corpus preserves entire translated documents, but gives you the same kinds of matches. If you want to check the context of a past translation, you can jump directly to the full document from the translation editor. TMs have one big advantage: they only store every translation once. If your content has a lot of repetitions, LiveDocs can become cumbersome.
Sometimes used as a synonym for translation, localization entails a bit more: it includes showing dates in the right format, money in the right currency, and the like. In order to localize a product, it must first enable doing all this, which is called internationalization.
A person who knows the ins and outs of CAT tools,
nasty file formats, regular expressions and other arcana.
Many are not shy to code either. They make sure that before a complex project is launched, all the content
is imported correctly, the segmentation is right, untouchable segments are
locked, and a lot more. Without localization engineers, complex projects
would never be finished on time and budget, and translators would
tear out their hair and move to a farm to raise pigs.
One thing that no CAT tool copes with well is mixed languages in a source document.
In memoQ there is a well-hidden feature in the mundane function to lock segments. This inconspicuous option will algorithmically detect each segment’s language, and if it’s different from your document’s source language, lock it.
In practically every CAT tool segments have a status like new, pre-translated or confirmed.
Apart from this, segments can also be locked, which makes them read-only. If a localization engineer has populated
some segments with translations approved (and mandated) by the client, then locking makes sure these do not get
changed accidentally. Locked segments are also easy to exclude from the word count during analysis.
In addition to merely reviewing and correcting translations, human reviewers can also mark every error they find, indicating the error’s type from a pre-defined list; the error’s severity; and possibly other details. This information can later be evaluated to assess quality objectively. LQA is the function that facilitates this in CAT tools.
A computer system that transforms a sequence of characters into a different sequence of characters that is
recognizable, to a human, as text in another language, with relevant clues about the information in the
original text. MT systems come in three main flavors. In rules-based (RBMT) systems, humans hand-craft
grammatical rules. In statistical (SMT), a statistical system is trained on large amounts of
human-translated text. Neural (NMT) systems are also trained on human-translated data, but they
need a lot more computation, and have been reported to produce superior results.
Tools like Google MT are generic. In the translation industry it is more common to train specialized
systems that do really well on a single type of content. This needs MT experts and a body of relevant,
high-quality human translations for the training data.
One way in which MT is used is human post-editing: the MT system produces a rough translation that
is often incorrect and ungrammatical, but is cheaper/faster to fix by humans than to translate from
scratch. Another way is interactive MT, where the human translator receives suggestions from the
MT system while she is translating text in a CAT tool.
One way to utilize machine translation, where the MT system produces a rough “translation” that is often incorrect and ungrammatical, and is then post-edited by a human to remove major errors and improve fluency.
Additional details about a piece of stored information, like “who translated this, when, and for what client” in the case of a translation unit, or “what source did this come from and did the client approve it” for a term base entry. CAT tools usually support a set of standard fields like the ones above, but also allow users to define their own custom fields and categories for more detail.
A function where you export your translation into its original format, make changes outside the CAT tool, and can then bring those changes back into the translation environment from the edited target-language file. It is particularly useful when you need to send your work for client review but even a Word-based bilingual file is “too complicated.”
Why do you want to bring such changes back into the CAT tool? To make sure your TM contains only final, approved translations. Otherwise you may end up with trash in, trash out.
An Excel file with source text, translations, comments and other information. Sometimes it’s a small, innocuous file with two columns for source and target text, but we have reliable eyewitness reports of files out there with 50,000 rows and 25 columns for different languages. Such monstrous files often come from computer games.
One of the resources powering predictive typing in memoQ. A Muse is built by analyzing existing TMs and corpora, with the aim of extracting words and phrases that correspond to each other in the two languages. When you translate a new source segment, the Muse looks at the phrases in it and gives you a list of suggestions that might be the translation of a phrase in the source text.
A special character that looks like a normal space but acts differently because it doesn’t allow a line break to intervene between the word on its left and right. A non-breaking space is a must before a colon in French (you don’t want “:” to start a line), and between a number and a unit of measurement (you don’t want “cm” to start a line either). In most word processors you can type it by pressing Ctrl+Shift+Space.
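The two cases above can be patched up automatically with a couple of regular expressions. This is only a sketch: the list of units is an arbitrary sample I picked, and it uses the plain non-breaking space (U+00A0) throughout.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE

def protect_spaces(text: str) -> str:
    """Swap ordinary spaces for non-breaking ones before a colon
    and between a number and a unit of measurement."""
    text = re.sub(r" (?=:)", NBSP, text)
    text = re.sub(r"(?<=\d) (?=(?:mm|cm|km|kg)\b)", NBSP, text)
    return text

print(repr(protect_spaces("Longueur : 5 cm")))
# 'Longueur\xa0: 5\xa0cm'
```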
Spaces, non-breaking spaces, tabs, and newlines. Also, a few other invisible characters used in bidirectional text. The point is, they are all blanks and you normally don’t see them. Just like Word, CAT tools have an option to show them, so that you don’t accidentally type two spaces, or a normal space where a non-breaking one is warranted.
Software whose original function is to turn an image (e.g., a scanned page) into editable text, usually a Word document. In translation OCR is used to turn documents in the one-way PDF format into a Word document that you can edit or import into a CAT tool.
A memoQ project that stores documents in a server, allowing multiple people to simultaneously translate and review them, working together in real time. Online projects also make it really simple to assign work because they eliminate sending files around in email, and they prevent trivial errors because they make sure everyone in the project uses the right settings and resources.
A translation memory shared through a server or in the cloud. They allow organizations to store their translations centrally (and always find them when they are needed). They also make sure that translators working together in real time on different parts of a project get to see each other’s translations instantly, ensuring their work will be consistent.
A function present in all advanced CAT tools (though usually called differently) that allows you to filter the segments of the document you’re working in. It is “find” on steroids: you can quickly skim segments that contain a particular word or expression, and make changes if you changed your mind about a translation. It’s also useful to eliminate, say, segments that are already confirmed so you can just focus on what needs work.
Portable document format by its maiden name, it is meant to make sure a document looks exactly the same no matter where you view or print it. The price of that consistency is that it’s extremely hard (nigh-impossible) to change the text inside it. In other words, it’s a one-way format, which makes it one of the biggest nuisances for the translation industry. Apart from a few innovative solutions like TransPDF, your best bet is to convert a PDF into a Word file with an OCR tool, then translate that. Or, if you have the chance, to get the source (InDesign, FrameMaker or similar DTP file) from your client and work on that.
Some translations are to be trusted less than others. They may be too old, coming from the wrong translator, or applicable to a different client or domain. A penalty means reducing the translation’s natural match rate so it gets ranked lower than others.
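The arithmetic is as simple as it sounds. In the sketch below, the `Hit` structure and the 10-point penalty are invented for the example; how a real tool assigns penalties is up to its settings.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    target: str
    rate: int         # natural match rate, 0-100
    penalty: int = 0  # points deducted for a less trusted origin

    @property
    def effective_rate(self) -> int:
        return max(0, self.rate - self.penalty)

def rank(hits: list[Hit]) -> list[Hit]:
    """Order lookup results by match rate after penalties, best first."""
    return sorted(hits, key=lambda hit: hit.effective_rate, reverse=True)

hits = [
    Hit("old translation, wrong client", rate=100, penalty=10),
    Hit("recent translation, right client", rate=95),
]
print([(hit.target, hit.effective_rate) for hit in rank(hits)])
```

The exact match from the wrong client drops to 90% and lands below the trusted 95% fuzzy match.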
A license that allows you to use a piece of software you have purchased forever. Perpetual licenses typically belong to a specific version of the software: to make sure the developer stays in business, it needs to finance its work by
charging an upgrade fee for new versions.
A small module in a larger piece of software that performs a specific task. The point, usually, is that anyone can develop a plugin without the need to involve the main software’s developer. A typical example in CAT tools is
machine translation plugins: the providers of the machine translation make their service available through plugins for the CAT tools used by translators, LSPs or companies. Whatever the purpose, you can create plugins if the CAT tool’s developer opened up a part of their application by creating and publishing an SDK.
Quite often a source document has lots of segments that all contain only numbers – think financial reports. Numbers are funny. They don’t need translation in the traditional sense, but they also cannot go straight into the translated text: the target language’s conventions for decimal separators and the like are different. This function processes all the segments with numbers in one go, adjusting their format along the way.
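As a rough sketch, converting English-style numbers to the German convention means swapping the two separators. The pattern below is an assumption that fits well-formed input like “1,234.56”; real-world number formats are messier than this.

```python
import re

# Digits with optional thousands groups and an optional decimal part.
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})*(?:\.\d+)?|\d+(?:\.\d+)?")
SWAP_SEPARATORS = str.maketrans({",": ".", ".": ","})

def localize_numbers(segment: str) -> str:
    """Rewrite English-format numbers (1,234.56) in German format (1.234,56)."""
    return NUMBER.sub(lambda match: match.group(0).translate(SWAP_SEPARATORS), segment)

print(localize_numbers("Revenue: 1,234.56 (up 0.5)"))
# Revenue: 1.234,56 (up 0,5)
```

Only separators inside a number are touched, so the comma or period that ends a sentence stays where it is.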
PTA for short, it is similar to analysis, but is performed once the translation is finished, not up front. Every large text will yield a lot of “internal” TM matches, both fuzzy and exact. When two or more translators work together, there’s no way to say in advance who will translate a segment from scratch, and who will get a match because someone else was there first. memoQ keeps track of this
in its online projects, and gives a precise and fair breakdown when all the work is done.
Incidentally, the numbers in the post-translation analysis match very closely those from homogeneity. The
difference is that PTA’s breakdown shows who got how many of the internal matches that homogeneity predicted at the start.
One of CAT tools’ most sexy functions with the least sexy name. Predictive typing makes you both faster (by eliminating keystrokes) and more consistent (by offering the right things to type). It looks at the characters that you have typed so far and offers a list of continuations from the current segment’s term base matches, auto-translatables, non-translatables, Muse hints and other such sources.
A screen area in your CAT tool that shows the document in its original format, or a close approximation. A premise of CAT tools is to rip text to segments and let you translate these in the same efficient environment, regardless of the original file format. But one thing is lost through this: visual context. The solution to this self-inflicted pain is the preview, which magicks visual context right back into your environment.
The screen in memoQ where you can add or remove documents to translate, pick TMs, term bases, Muses and other resources, and fiddle with your working environment in countless other ways, whenever you have an urge to procrastinate.
If you get recurring or at least similar translation jobs (and you do), you are forced to do the same things over and over again: pick the right TMs, term bases, light resources, settings, people etc., and also perform the same actions like analysis, pre-translation and the like. Project templates define rules for all of these and a lot more, so you don’t make embarrassing mistakes, don’t get tendonitis from incessant clicking, and have a fighting chance to stay sane in the midst of it all. Also, project templates allow you to reuse the work of experts like localization engineers and make it a lot simpler for new hires to get up to speed.
Translation is the fun part, but if you’re dealing with complex file formats from esoteric systems, you need to make sure your work will also make it back to the original system at the end and not crash your client’s multimillion-dollar flagship app right before the deadline. Pseudo-translation allows you to test the whole process without actually translating anything. It replaces source text with funny characters, words spelled backwards, and made-up stuff to inflate strings.
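A bare-bones pseudo-translator could look like the sketch below. The accented vowels, the 30% padding and the square brackets are my own choices for illustration; a production version would also have to leave tags and placeholders alone.

```python
ACCENTED = str.maketrans("aeiouAEIOU", "áéíóúÁÉÍÓÚ")

def pseudo_translate(source: str, expansion: float = 0.3) -> str:
    """Return a fake 'translation': accented vowels to test character handling,
    padding to simulate longer target text, brackets to reveal truncation."""
    padding = "~" * round(len(source) * expansion)
    return f"[{source.translate(ACCENTED)}{padding}]"

print(pseudo_translate("Save file"))
# [Sávé fílé~~~]
```

If the closing bracket is missing on screen, the string got cut off; if the accents show up as garbage, something in the pipeline is mangling the encoding.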
Machines cannot even come close to humans in crafting a message that resonates, but humans are really bad at getting numbers and other boring and repetitive things right. Quality assurance checks come to the rescue: they verify that you got your numbers right, that you didn’t type two spaces, that you used MegaCorp Ltd’s official terminology, that you translated the same thing the same way throughout the text, that you didn’t forget a mission-critical tag, and a lot more. And if the dumb machine didn’t get it right, you always have the option to ignore (suppress) individual QA warnings.
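Three of these checks fit in a dozen lines, as the sketch below shows. The warning texts and the angle-bracket tag format are assumptions; actual tools represent tags and warnings in their own ways.

```python
import re

def qa_check(source: str, target: str) -> list[str]:
    """Run a few mechanical checks on one segment and return the warnings."""
    warnings = []
    if "  " in target:
        warnings.append("double space in target")
    # Compare digit groups only, so 1,234.56 and 1.234,56 count as the same number.
    if sorted(re.findall(r"\d+", source)) != sorted(re.findall(r"\d+", target)):
        warnings.append("numbers differ between source and target")
    if sorted(re.findall(r"<[^>]+>", source)) != sorted(re.findall(r"<[^>]+>", target)):
        warnings.append("tags differ between source and target")
    return warnings

print(qa_check("Pay <b>1,234.56</b> by May 5.", "Zahlen Sie  1.234,56 bis zum 6. Mai."))
```

This example trips all three checks: a double space, a 5 that became a 6, and a pair of tags that went missing.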
Even texts written by humans are full of patterns that have well-defined moving parts. Think dates: Number(1-12)/Number(1-31)/Number(four digits) is a date in the US. For German, you have the same numbers in there, you just need to rearrange them and use dots instead. Regular expressions are a super counter-intuitive but super-useful way to describe exactly these kinds of patterns. No wonder CAT tools support them across the board, from defining file format filters through auto-translatables to find/replace. You can get half a localization engineer’s career just out of knowing your regex.
Any segment that occurs at least twice in your source text is a repetition. They are a delight because usually you need to translate the same thing only once. That’s how repetitions gave rise to auto-propagation and exact matches. And for the cases where the same thing must be translated differently, you have context IDs to differentiate.
A set of tools and documentation that allows developers to build their own module to work together with a different piece of software. SDKs are what allows third parties to develop plugins for CAT tools, for instance.
When we translate text, we almost always proceed sentence by sentence. If you try to get to the bottom of it,
however, nobody really knows what a sentence precisely is. Also, when you translate a single word in a
bullet-point list, is that a sentence? CAT tools decided to sidestep this can of worms altogether,
so we speak about segments instead.
Generally (though not always), a segment is the essential unit of translation: you proceed segment by segment in the editor, and you store the translation of segments in the TM. Your TM and corpus matches also refer to the segment you are translating at the moment.
Segments are born with the active cooperation of regular expressions, in a special incarnation called
segmentation rules. As all regex, they look gibberish to the uninitiated eye, but they basically elaborate
a single theme: “If you find sentence-final punctuation like a period followed by one or more spaces followed
by a capital letter, start a new segment right there. Except if the last word before the period is a known
abbreviation.” Segmentation normally happens quietly, behind the scenes, when your CAT tool’s
file format filter imports a source document.
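The theme quoted above can be sketched in a few lines of Python. This is a toy, not how any particular CAT tool implements segmentation: the abbreviation list is a tiny made-up sample, and the names `BREAK` and `segment` are invented for this example.

```python
import re

# A tiny, made-up abbreviation list; real rule sets are language-specific and much longer.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs.", "No."}

# Sentence-final punctuation, then one or more spaces, then a capital letter.
BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def segment(text):
    """Split text into segments, skipping breaks right after a known abbreviation."""
    segments = []
    start = 0
    for match in BREAK.finditer(text):
        candidate = text[start:match.start()]
        if candidate.split()[-1] in ABBREVIATIONS:
            continue  # The exception: do not break after an abbreviation.
        segments.append(candidate)
        start = match.end()
    if text[start:]:
        segments.append(text[start:])
    return segments

print(segment("Dr. Smith arrived. He was late. Ask Mr. Jones!"))
```

Even this toy shows why the rules get it wrong from time to time: a sentence that genuinely ends in "etc." would be glued to the next one.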
No matter how elaborate, segmentation rules will inevitably get it wrong from time to time. To help get around this, CAT tools have a function to join neighboring segments, and to split a single segment into two.
From the moment you import a document into your CAT tool all through the steps of translation, review, client review, proofreading after a good night’s sleep, and additional review by your pet, up to the point of exporting it for delivery to your client, the text lives in the form of segments. In this form, there’s a whole lot to know about segments besides the text itself: does the target come from a TM, or from you? Have you confirmed it already? Was it rejected by a reviewer? Is it halfway edited but not quite finished yet? That is the kind of information that you can see in the form of colors and icons within the translation environment.
For several years after they arrived, my friends from Mars were convinced translators were in the business of turning empty (grey) segments into confirmed (green) ones, and they thought this was a terribly appealing job. By now they know they were wrong, but they still think the job is awesome.
A function of online collaborative CAT tools that allows several people to edit the same document together in real time. You can think of this as Google Docs on steroids, customized for the two-column, source-and-target world of translation.
While a license agreement entitles you to use a piece of software, the SMA that usually goes along with it grants you access to support from a human and to new versions of the software. Normally, perpetual licenses have a one-off fee; SMA, in contrast, is charged on an annual basis.
This is a strong contender for the industry’s most fuzzy word, right there after fuzzy itself. When a CAT tool vendor uses it, they basically want to say, “We’re doing something extremely advanced and useful here.” In prosaic terms it means lookup results and suggestions (aka leverage) that refer to a shorter bit of the source segment. In all earnestness, often the machinery that generates such matches really is pretty advanced, extrapolating knowledge from past translations in ways that are far from obvious.
In developer-talk, a string is a sequence of characters. When you translate the user interface of a software application or a game, all the chunks of text that appear in different places are called “strings.” Typically, a string shows up as a single segment, and it has an associated context ID to disambiguate it.
When you work in a memoQ online project, you have the option not to save every translated segment on
the server immediately, but instead gather a lot of changes locally, and exchange news with the server in
one go. That action is called synchronizing the project.
Some inline tags are optional: maybe that bold formatting in the source text is not needed in your translation at all. Others, however, are mission-critical: if they represent N in the sentence “You have N enemies left”, then if you omit the tag, the translated game will crash and the outrage of gamers will put your client out of business. To avoid such an outcome, the QA module of CAT tools gives you a tag error right in the editor, and won’t let you deliver your translation until you fix it.
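A bare-bones version of such a check might look like the sketch below. It assumes tags are written as numbered placeholders like `{1}` in plain strings, which is a notation chosen for this example only. Real tools store tags in their own internal form, and `missing_tags` is a hypothetical name.

```python
import re
from collections import Counter

# Assumption for this sketch: inline tags appear as {1}, {2}, ... in the text.
TAG = re.compile(r"\{\d+\}")

def missing_tags(source, target):
    """Return the tags that occur in the source but are absent from the target."""
    source_tags = Counter(TAG.findall(source))
    target_tags = Counter(TAG.findall(target))
    # Counter subtraction keeps only tags the target has too few of.
    return sorted((source_tags - target_tags).elements())

print(missing_tags("You have {1} enemies left", "Du hast noch Feinde übrig"))
```

An empty list means the target carries every tag from the source; anything else is what a QA module would flag as a tag error.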
Tags can be a real nuisance as you translate: you need to think about where they must go, you need special shortcuts to insert them, and generally, they throw you out of the flow. So in memoQ you can just focus on translating a segment’s text first, then activate tag insertion mode and sprinkle your target segment with tags in the right places.
An unfortunate but all too frequent situation when a document that you have just imported is chock full of tags that are unexpected, pointless, or both. This most often happens with Word documents that an OCR tool produced from a PDF because it wanted to make sure everything is shown exactly in the right place, down to a hundredth of a millimeter. You can make things better by tweaking the OCR tool’s settings, running a cleanup macro like Dave Turner’s CodeZapper, or pestering your CAT tool’s developers to do something about it. Only the first two have been conclusively shown to work.
The content we need to translate consists mostly of text – but not exclusively. One oddity is formatting changes: how do you represent a change of text color in the middle of a sentence? The other oddity is a consequence of structured content, where text is intertwined with markup like hyperlinks, cross-references, or placeholders that will be substituted with, say, a number when a piece of software runs.
CAT tools cope with all this by introducing tags: symbols inside your segments that act like a
character in the editor, but look completely different. These creatures are called
inline or internal tags.
Then you have formats like XML or HTML that have tags woven into their own DNA. Some of these tags
define structure (“this is a headword”, “this is a caption”), always enclosing text from the outside.
These are called structural or external tags, and should never show up in your segments. They only do
if the XML filter was not configured properly before the import. You can fix that by hiring a good localization engineer.
The analysis output of well-behaved CAT tools has a separate section that shows how many tags the text contains in addition to good old-fashioned characters. This is important, because tags can be a lot of work and really slow you down as you translate.
A bit of a schizophrenic creature that cannot completely make up its mind whether it’s a match rate or a segment status. It rears its head in the complicated scenario when you need to translate a source segment that contains tracked changes, which you need to reproduce in the translation too. A TC match is basically an exact match for the original form of the source segment, pretending those tracked changes were never put in there.
A widely used workflow that involves a translator and two different people subsequently reviewing her work, with the aim of ensuring a high-quality translation, and giving feedback for the translator to improve.
A “database” or a component of CAT tools that allows users to store important
words/expressions and their equivalents. It saves the hassle of researching the same term twice.
It also helps translators adhere to terminology mandated by their clients, or at least stay
consistent with themselves. In fact, it’s indispensable for consistency if different people are
translating the same large text simultaneously, collaborating online from different locations.
Often used interchangeably with glossary, but they’re not quite the same. A glossary is usually just a word
list in two languages, while term bases have structure and metadata too.
A function of advanced CAT tools that looks at new source text or a body of existing translations and extracts important words and expressions. The output typically contains a lot of “false positives,” but it allows a translator to research important terms before starting to translate, include them in a term base, and make sure they are then translated both correctly and consistently.
The idea that initially gave rise to commercial translation technology. Why translate the same sentence twice? The TM is a “database” of segments and their translations. TMs quickly evolved to give a hint also for segments that are only similar (see TM match types), to allow concordance searches, and to support subsegment leverage.
Translation memories are big bags full of source segments and their translations. When a new segment
comes along, CAT tools ransack these bags for segments that were translated before, and return these
as matches. If the same sentence is there in the bag, that’s an exact match, whose match rate is 100%.
It can get even better: if there’s a translation of not only the same segment, but the same segment from
the exact same context, that’s a context match (aka “ICE” for in-context exact). If the best there
is is the translation of something similar but not identical, that’s a fuzzy match with a match
rate below 100%. Often, high fuzzies are distinguished: these matches only differ in punctuation,
capitalization or numbers, and are therefore easier to fix.
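The three match types can be told apart with a short sketch. Everything here is an assumption made for illustration: the TM is a plain list of dictionaries, "context" is reduced to the preceding source segment, the similarity score comes from Python's `difflib` rather than any vendor's match-rate formula, and the value 101 is just a ranking number this sketch uses to put context matches above exact ones.

```python
from difflib import SequenceMatcher

def best_match(source, previous, tm):
    """Return (kind, rate, target) for the best TM hit, or ("none", 0, None)."""
    best = ("none", 0, None)
    for unit in tm:
        if unit["source"] == source:
            if unit["previous"] == previous:
                kind, rate = "context", 101  # same segment, same context
            else:
                kind, rate = "exact", 100    # same segment, different context
        else:
            similarity = SequenceMatcher(None, source, unit["source"]).ratio()
            kind, rate = "fuzzy", min(99, int(similarity * 100))
        if rate > best[1]:
            best = (kind, rate, unit["target"])
    return best

tm = [{"source": "Click OK.", "previous": "Open the dialog.", "target": "Klicken Sie auf OK."}]
print(best_match("Click OK.", "Open the dialog.", tm))
print(best_match("Click OK.", "Close the window.", tm))
print(best_match("Click OK!", None, tm))
```

A real tool would also discard fuzzy hits below some threshold instead of returning whatever scores highest.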
An advanced function of memoQ that dynamically splits or joins segments during pre-translation to get better TM matches. It’s a simple idea. What if a translator joined two segments before storing the translation in the TM, and now the same two segments show up again? By recognizing this on the spot, the two segments can be joined in the current document too for a perfect match, without human intervention.
Software that helps you manage translations and organize resources. Re: manage, think “these 1500 files must be translated into 25 languages, with 6 translators and 2 reviewers working in parallel for each, making sure that nobody overrides approved translations from the past, and ready by next Monday, with real-time visibility into the project’s progress until then.” Re: organize, think “I must find the right translation memories and term bases from among the 2000 resources I have around for various clients and language pairs.”
An XML-based format to, well, exchange translation memories. The adoption of this standard was a crucial step in the industry towards interoperability, and at this point virtually all tools support it.
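For the curious, here is roughly what reading such a file involves, using only Python's standard library. The sample is a hand-made, minimal file for illustration, and `read_tmx` is a hypothetical helper. It ignores metadata and inline markup inside segments, both of which real exports usually contain.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# A hand-made, minimal sample; real files carry far more metadata.
SAMPLE = """<?xml version="1.0"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Click OK.</seg></tuv>
      <tuv xml:lang="de"><seg>Klicken Sie auf OK.</seg></tuv>
    </tu>
  </body>
</tmx>"""

def read_tmx(text):
    """Return one {language: segment text} dictionary per translation unit."""
    root = ET.fromstring(text)
    return [
        {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        for tu in root.iter("tu")
    ]

print(read_tmx(SAMPLE))
```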
Many regulated industries (like pharma) are required by law to track every change they make to crucial documents, such as the usage instructions and side effects of a medicament. Not only that, but when they sell to multiple markets, they must reproduce all these changes in translated materials too. As a translator or LSP, the only way to achieve this without losing your sanity and/or getting sued out of your profits is if your CAT tool has special functions to both cope with change-tracked documents and preserve the benefits of TMs, term bases, QA and everything else.
In a CAT tool, you translate documents segment by segment. Once you store the translation of a source segment in your TM, the two together, plus some metadata like “who translated this and when,” are bundled up and transmogrified into a translation unit.
Imagine you store a nonsense translation in your TM. When you receive an updated document, pre-translation picks it up as a perfect match, and you don’t even get to see it. If you train an MT engine with this data, it will produce nonsense translations. Once trash gets into the system, it perpetuates itself. How do you avoid that? Through QA, through TEP, through separating working and master TMs, and other similar efforts. All of which are only possible if you use a CAT tool that allows you to centralize your resources, define the right processes, and eliminate error-prone manual steps.
According to Wikipedia, Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.
According to me, Unicode is the best thing that has happened since sliced bread. It means if you write a text with your own language’s special characters, that text can be read by people anywhere in the world, using any gadget with a CPU and a display. Even (usually) in Excel.
Nonetheless, to prove that our world is not the best of all possible worlds, you must keep in mind that while Unicode doesn’t support Klingon, it does have a character for the handgun emoji.
Since CAT tools are apparently great fans of deconstructivism and start their day by tearing text into chunks called segments, you might as well max this out by slicing and dicing the living daylight out of those poor segments. As in: “I have just turned this User Guide into 1300 segments and pre-translated them. Now give me those segments that have no TM match, occur at least twice, and have the words ‘squinting squirrels’ in them. Also, show me each segment only once, and order them alphabetically.” That is the kind of thing that views allow you to do.
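The squirrel request above boils down to a filter, a de-duplication and a sort. Here is a sketch under loud assumptions: segments are plain dictionaries with a `source` text and a `match_rate` where 0 means no TM match, and `build_view` is an invented name, not any tool's API.

```python
from collections import Counter

def build_view(segments, phrase):
    """Segments with no TM match that repeat and contain phrase: once each, sorted."""
    counts = Counter(seg["source"] for seg in segments)
    selected = {
        seg["source"]
        for seg in segments
        if seg["match_rate"] == 0
        and counts[seg["source"]] >= 2
        and phrase in seg["source"].lower()
    }
    # Plain code-point order; a real tool would sort by the language's own rules.
    return sorted(selected)

segments = [
    {"source": "Squinting squirrels are shy.", "match_rate": 0},
    {"source": "Beware of squinting squirrels.", "match_rate": 0},
    {"source": "Squinting squirrels are shy.", "match_rate": 0},
    {"source": "Beware of squinting squirrels.", "match_rate": 0},
    {"source": "Squinting squirrels sleep.", "match_rate": 0},
    {"source": "Click OK.", "match_rate": 100},
    {"source": "Click OK.", "match_rate": 100},
]
print(build_view(segments, "squinting squirrels"))
```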
A component of CAT tools that allows translators and reviewers in an online project to work from a browser, without installing software on their own computer. A web editor is to traditional desktop tools as Google Docs is to Word, except advanced CAT tools offer both options (even within the same project) and don’t force you to choose between two incompatible companies.
Keeping stuff organized is an age-old challenge. If you don’t get it right, you end up with trash in, trash out. One way to stay on top of data within a translation project is to designate one TM as the master (translations coming from there get precedence over others); another one as the working TM (new, as-yet unrevised translations get stored there, keeping the master pristine); and the rest as reference (to fill in the gaps that the master does not cover).
Short for XML Localization Interchange File Format, a standard maintained by OASIS. In practice, it is a file format that allows translatable text to be passed between tools in a source/target, bilingual form.
An extremely versatile format for storing structured information in files that are readable by machines
and not completely unreadable by humans. A tremendous amount of content that gets translated comes from
XML files, particularly if the content’s source is a CMS. You generally don’t need to understand
the dirty details unless you are a localization engineer, in which case you are a wizard who knows
all about file format filters and don’t need this glossary anyway.
So you’re half done translating a file, with tons of comments in there and segments in all imaginable statuses. At this point your client calls you and says, “Hey, our editors have been busy, we have a revised source file, we actually need you to translate that instead of the one from last week.” If your CAT tool has X-translate, this is not a problem. The function compares the original source document with the updated source document, going segment by segment, and recreates your work, including comments, ignored QA warnings, segment statuses and all the rest. Whatever changed in the source text is left empty, so you can catch up with the changes and continue where you left off.
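A heavily simplified sketch of the idea follows. It assumes segments are dictionaries with `source`, `target`, `status` and `comment` fields, and it carries work over only when the source text is identical. The actual function compares documents segment by segment and is surely smarter about position and context; `x_translate` here is just an illustrative name.

```python
def x_translate(old_segments, new_sources):
    """Carry earlier work over to a revised source; changed segments stay empty."""
    # Index the earlier work by source text (first occurrence wins in this sketch).
    by_source = {}
    for seg in old_segments:
        by_source.setdefault(seg["source"], seg)

    result = []
    for source in new_sources:
        old = by_source.get(source)
        if old is not None:
            result.append(dict(old))  # unchanged source: keep target, status, comment
        else:
            result.append({"source": source, "target": "", "status": "empty", "comment": ""})
    return result

old = [
    {"source": "Click OK.", "target": "Klicken Sie auf OK.", "status": "confirmed", "comment": "UI term"},
    {"source": "Press Enter.", "target": "Eingabetaste", "status": "edited", "comment": ""},
]
new = ["Click OK.", "Press the Enter key."]
for seg in x_translate(old, new):
    print(seg)
```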