Language connects (or separates?) the top 50 LSPs. An interactive data visualization
Is translation destined to become a high-tech commodity, or will it evolve as a strong value generator? This question is one of the industry’s defining narratives today, so I was curious to see how the world’s top LSPs position themselves on the market. I analyzed the language from these companies’ home pages and created this interactive visualization.
How does this work?
LSPs that rank higher on the top 50 list have a larger bubble.
If two bubbles are close, that means the sites use similar language.
The outcome varies if you create a different number of clusters. You can play with this by changing the number at the top.
With this button you can toggle between surface forms and stems.
I won’t draw any conclusions from the data myself; the whole point is to let you explore and find your own insights. What patterns do you see? Do they match up with your perception of these companies? If you are one of the LSPs: does the language rhyme with your own market positioning? I am excited to hear.
Also, I don’t do document clustering every day, so I may well have made errors. At the end I include links to all the data and scripts used in the analysis. Give me a shout if you notice anything odd.
To obtain a selection of LSPs I consulted CSA’s top 100 list for 2016. I only used the first half so that it’s still humanly possible to get an overview in the visualization. Mergers and acquisitions have already altered this landscape since 2016. #23 Merrill Brink International figures in my analysis as United Language Group. #42 Global Language Solutions has been acquired by #7 Welocalize and does not have a website of its own, so I skipped it altogether. Consequently, my list’s #50 is Mother Tongue Writers, which ranks #51 on the original CSA list.
I was curious to see how different LSPs present themselves publicly, so I decided to analyze the language they use on their websites. To keep this truly simple, I only looked at the home page that welcomes visitors. In a very few instances the default page is not English. The only navigation I undertook was to switch to the English version in these cases.
I manually saved the home pages as HTML [May 17, 2017], and extracted the plain text in preparation for the analysis. Throughout the rest, documents (i.e., LSPs) are identified by their bare website URL.
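To give an idea of what the plain text extraction involves, here is a minimal sketch that uses only Python's standard library. It is not the script from the Gist linked at the end; the class name `TextExtractor` and the function `html_to_text` are made up for illustration, and a real home page may need more care (navigation menus, hidden elements, and so on).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical miniature home page, just to exercise the function.
sample = (
    "<html><head><title>Acme</title><style>p{color:red}</style></head>"
    "<body><p>We translate <b>documents</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(sample)
```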
I followed Brandon Rose’s document clustering guide very closely. The subject of his experiment was Hollywood movies, which he clustered based on their Wikipedia and IMDB synopses. I did the same with a few tweaks, replacing synopses with home page content and movies with LSPs. The rest of this section describes the process in detail.
(1) Text is normalized to lowercase and tokenized. To reduce noise I used NLTK’s English stopword list, and I also eliminated tokens that contained digits or were shorter than 3 characters. I ran the entire analysis twice, once with stemming and once without. The stemmer is NLTK’s Porter Stemmer.
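The filtering in step (1) can be sketched as follows. This is an illustration under assumptions, not the actual script: `STOPWORDS` is a tiny stand-in for NLTK's English stopword list, the regular expression is one possible tokenizer, and the `stem` argument is a hypothetical hook for plugging in a stemmer.

```python
import re

# Tiny stand-in list (assumption); the analysis used NLTK's English stopwords.
STOPWORDS = {"the", "and", "for", "are", "our", "with", "you", "your"}

# Runs of letters and digits; punctuation and underscores act as separators.
TOKEN_RE = re.compile(r"[^\W_]+")


def tokenize(text, stem=None):
    """Lowercase, tokenize, and drop noisy tokens; optionally stem the rest."""
    tokens = []
    for tok in TOKEN_RE.findall(text.lower()):
        if len(tok) < 3:
            continue  # shorter than 3 characters
        if any(ch.isdigit() for ch in tok):
            continue  # contains a digit
        if tok in STOPWORDS:
            continue
        tokens.append(stem(tok) if stem else tok)
    return tokens


surface = tokenize("We are the #1 provider of ISO9001 translation services in 2017!")
```

To run the stemmed variant, you could pass the `stem` method of an NLTK `PorterStemmer` instance as the `stem` argument.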
(2) Each document is mapped into a vector space using TF-IDF, a staple in information retrieval. In this representation the similarity between any two documents can be expressed as the cosine of the angle between their vectors. The TF-IDF vectorizer (scikit-learn’s, as in Brandon Rose’s guide) has a band-pass filter that greatly affects the result. Words that occur in too few documents are unhelpful for finding similarities; words that occur in too many documents are not distinctive enough.
I set the low cut-off to a pretty low 0.05, i.e., it drops words that occur in only one or two documents. Basically this gets rid of brand names. The high cut-off point is also relatively low at 0.60. I found that with higher values the results are less informative; the variations on “translation” are too frequent. In other words, these are empirically found values. In practice a lot of NLP is “tweak it until you achieve TLAR.”
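Here is a small sketch of the band-pass filter and the cosine similarity, using scikit-learn's `TfidfVectorizer` (the vectorizer used in the guide I followed). The four documents are invented, and the cut-offs 0.5 and 0.9 are chosen only so that the filter has a visible effect on such a tiny corpus; they are not the 0.05 and 0.60 from the analysis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy corpus: every document mentions "translation".
docs = [
    "translation localization services global brands",
    "translation interpreting services healthcare",
    "translation software localization technology platform",
    "translation legal documents technology experts",
]

# min_df / max_df as fractions of the corpus:
#   min_df=0.5 keeps words found in at least 2 of the 4 documents,
#   max_df=0.9 drops words found in more than 3.6 documents, i.e. in all 4.
vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.9)
tfidf = vectorizer.fit_transform(docs)

kept_terms = sorted(vectorizer.vocabulary_)

# Pairwise cosine similarity between the document vectors.
sim = cosine_similarity(tfidf)
```

In this toy run the ubiquitous “translation” and all the one-off words fall outside the band, so only the words shared by exactly two documents survive.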
(3) The next step uses k-means clustering to build N groups of similar documents. I ran the analysis for N from 2 through 6. This is the main control you find in the interactive visualization.
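A sketch of the clustering step with scikit-learn's `KMeans`, run once per cluster count. The six toy documents are invented and fall into two obvious groups; `n_init` and `random_state` are my choices here to make the run repeatable, not settings taken from the analysis.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: three "life sciences" pages, three "software" pages.
docs = [
    "medical translation healthcare clinical",
    "clinical healthcare translation pharma",
    "pharma medical clinical trials",
    "software localization games apps",
    "games localization apps testing",
    "software testing apps localization",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# One clustering per cluster count, mirroring the 2..6 range in the text.
labels_by_n = {}
for n_clusters in range(2, 7):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels_by_n[n_clusters] = km.fit_predict(tfidf).tolist()
```

Each entry of `labels_by_n` assigns one cluster index per document; the cluster numbering itself is arbitrary and can change between runs or settings.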
(4) For each cluster the script identifies the 6 words that are closest to the cluster’s center. The visualization shows these in bold below the bubble chart. I didn’t make those texts up; they are what “best describes” each cluster’s choice of words.
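One common way to find the words “closest” to a cluster's center, and the one used in the guide I followed, is to sort the coordinates of each centroid and take the terms with the largest weights. The sketch below does this on an invented toy corpus with `top_n = 3` instead of 6; all variable names are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus with two obvious groups.
docs = [
    "medical translation healthcare clinical",
    "clinical healthcare translation pharma",
    "pharma medical clinical trials",
    "software localization games apps",
    "games localization apps testing",
    "software testing apps localization",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Terms ordered by their column index in the TF-IDF matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

# For each centroid, term indices sorted from highest to lowest weight.
top_n = 3
order = km.cluster_centers_.argsort(axis=1)[:, ::-1]
top_terms = {
    cluster: [terms[i] for i in order[cluster, :top_n]]
    for cluster in range(km.n_clusters)
}
```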
(5) The visualization uses colors to indicate clusters. But you also want to get an idea of how close the different documents are, i.e., how similar they are. To project points from a high-dimensional space onto a 2D chart, the script uses multidimensional scaling.
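A sketch of the projection, assuming cosine distance (1 minus cosine similarity) as the dissimilarity measure, which is what the guide I followed uses. The corpus is invented, and the `dissimilarity="precomputed"` keyword is the spelling from that guide; recent scikit-learn releases may prefer a different keyword for the same option.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_similarity

# Invented toy corpus with two obvious groups.
docs = [
    "medical translation healthcare clinical",
    "clinical healthcare translation pharma",
    "pharma medical clinical trials",
    "software localization games apps",
    "games localization apps testing",
    "software testing apps localization",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# Cosine distance; clip tiny negative values caused by rounding.
dist = np.clip(1.0 - cosine_similarity(tfidf), 0.0, None)
np.fill_diagonal(dist, 0.0)

# Find 2D coordinates whose pairwise distances approximate `dist`.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
coords = mds.fit_transform(dist)  # one (x, y) row per document
```

The absolute positions and the orientation of the resulting chart carry no meaning; only the relative distances between bubbles do.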
I used the script to produce all the combinations of the two parameters: cluster count, and stemming vs. surface form. This page uses Chart.js for the bubble chart. I used JS/HTML/LESS to build the interaction surrounding the chart itself.
Five of the six cluster colors come from a 2005 David Foster Wallace article in The Atlantic. I don’t think repurposing them here amounts to IP theft, and I really like those colors.
Data and code
[Download] The analyzed LSP home pages, saved manually on May 17, 2017.
[Gist] Python script to extract plain text from the HTML pages.
[Download] Result of the plain text conversion.
[Gist] Python script to produce the data for the interactive visualization.
 If you’re not from the translation industry: LSP stands for language service provider, aka a company that provides translations and other language services.
 Top 100 Language Service Providers: 2016. CSA Research
 Document Clustering with Python. Brandon Rose
 Natural Language Toolkit:
 The most influential paper Gerard Salton Never Wrote. David Dubin
 TF-IDF on Wikipedia:
 That Looks About Right. A key technique in aviation and natural language processing.
 K-means clustering on Wikipedia:
 Multidimensional scaling on Wikipedia:
 Host. The Atlantic, April 2005. David Foster Wallace