Language connects (or separates?) the top 50 LSPs. An interactive data visualization

May 22, 2017 Stuff NLP

Is translation destined to become a high-tech commodity, or will it evolve as a strong value generator? This question is one of the industry’s defining narratives today, so I was curious to see how the world’s top LSPs[1] position themselves on the market. I analyzed the language from these companies’ home pages and created this interactive visualization.

How does this work?

Every bubble represents an LSP. Hover over a bubble to see its website URL.
LSPs that rank higher on the top 50 list have a larger bubble.
If two bubbles are close, that means the sites use similar language.
The analysis finds the most characteristic words in each website, and clusters LSPs that use similar language on their home page.
The outcome varies if you create a different number of clusters. You can play with this by changing the number at the top.
This section has details about the clusters. Bubbles in the same cluster share the same color.
The analysis automatically extracts each cluster's most characteristic words: this is what you see in bold.
The cluster's members are listed in italics. The number means the LSP's rank on the top 50 list. The smaller the number, the larger the LSP.
If you analyze surface forms, "agency" and "agencies" are different words. This also affects the outcome.
With this button you can toggle between surface forms and stems.

The analysis

I won’t draw any conclusions from the data myself; the whole point is to let you explore and find your own insights. What patterns do you see? Do they match up with your perception of these companies? If you are one of the LSPs: does the language rhyme with your own market positioning? I am excited to hear.

Also, I don’t do document clustering every day, so I may well have made errors. At the end I include links to all the data and scripts used in the analysis. Give me a shout if you notice anything odd.

Population

To obtain a selection of LSPs I consulted CSA’s top 100 list for 2016[2]. I only used the first half so that it’s still humanly possible to get an overview in the visualization. Mergers and acquisitions have already altered this landscape since 2016. #23 Merrill Brink International appears in my analysis as United Language Group. #42 Global Language Solutions has been acquired by #7 Welocalize and does not have a website, so I skipped it altogether. Consequently, my list’s #50 is Mother Tongue Writers, which ranks #51 on the original CSA list.

Data sourcing

I was curious to see how different LSPs present themselves publicly, so I decided to analyze the language they use on their websites. To keep this truly simple, I only looked at the home page that welcomes visitors. In a few instances the default page is not in English; in those cases the only navigation I undertook was to switch to the English version.

I manually saved the home pages as HTML [May 17, 2017], and extracted the plain text in preparation for the analysis. Throughout the rest of the analysis, documents (i.e., LSPs) are identified by their pure website URL.

Analysis

I followed Brandon Rose’s document clustering guide[3] very closely. The subject of his experiment was Hollywood movies, which he clustered based on their Wikipedia and IMDB synopses. I did the same with a few tweaks, replacing synopses with home page content and movies with LSPs. The rest of this section describes the process in detail.

(1) Text is normalized to lowercase and tokenized. To reduce noise I used NLTK’s[4] English stopword list, and I also eliminated tokens that contained digits or were shorter than 3 characters. I ran the entire analysis twice, once with stemming and once without. The stemmer is NLTK’s Porter Stemmer.
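
A minimal sketch of this normalization step, using only the standard library. It is not the script linked at the end: the tiny `STOPWORDS` set is a stand-in for NLTK’s full English list, the regex tokenizer is an assumption, and the optional `stem` argument is a hypothetical hook where something like NLTK’s Porter stemmer could be plugged in.

```python
import re

# Stand-in stopword list (assumption); the analysis used NLTK's English list.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "are", "our", "you", "your"}


def preprocess(text, stopwords=STOPWORDS, stem=None):
    """Lowercase and tokenize text, then drop noisy tokens.

    `stem` is an optional callable (hypothetical hook) applied to each kept token.
    """
    tokens = re.findall(r"\w+", text.lower())
    kept = []
    for token in tokens:
        if len(token) < 3:                       # shorter than 3 characters
            continue
        if any(ch.isdigit() for ch in token):    # contains a digit
            continue
        if token in stopwords:                   # too common to be informative
            continue
        kept.append(stem(token) if stem else token)
    return kept
```

Running the function twice, once with `stem=None` and once with a stemmer, would correspond to the surface-form and stemmed variants of the analysis.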

(2) Each document is mapped into a vector space[5] using TF-IDF[6], a staple in information retrieval. In this representation the similarity between any two documents can be expressed as the cosine of the angle between their vectors. The TF-IDF vectorizer, which in Rose’s guide comes from scikit-learn, has a band-pass filter on document frequency that greatly affects the result. Words that occur in too few documents are unhelpful for finding similarities; words that occur in too many documents are not distinctive enough.
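
To make the vector space idea concrete, here is a small sketch in plain Python. It uses the textbook weighting (term count times log of N over document frequency), which is an assumption: library vectorizers typically add smoothing and normalization, so the exact numbers will differ from what the real script produces.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary."""
    vocabulary = sorted({token for doc in docs for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either one is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Two home pages with the same word profile would score close to 1, while two pages with no shared vocabulary would score 0.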

I set the low cut-off to a pretty low 0.05, i.e., it drops words that occur in only one or two documents. Basically this gets rid of brand names. The high cut-off point is also relatively low at 0.60. I found that with higher values the results are less informative; the variations on “translation” are too frequent. In other words, these are empirically found values. In practice a lot of NLP is “tweak it until you achieve TLAR[7].”
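
The arithmetic behind the two cut-offs can be sketched as follows. The function name is hypothetical, and treating both values as proportions of the document count is my reading of how such vectorizer settings usually work. With 50 documents, 0.05 corresponds to 2.5 documents, so a word has to occur on at least 3 home pages to survive, and a word on more than 60% of the pages is dropped.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.05, max_df=0.60):
    """Return the terms whose share of documents lies between min_df and max_df."""
    n_docs = len(docs)
    # Count each term once per document, no matter how often it repeats there.
    df = Counter(token for doc in docs for token in set(doc))
    return {term for term, count in df.items()
            if min_df <= count / n_docs <= max_df}
```

A brand name that shows up on only one or two sites falls below the low cut-off, while a near-universal word such as “translation” exceeds the high one.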

(3) The next step uses k-means clustering[8] to build N groups of similar documents. I ran the analysis for N from 2 through 6. This is the main control you find in the interactive visualization.
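
For readers who have not met k-means before, this is a bare-bones sketch of the standard iteration in plain Python. It is an illustration under assumptions (random initial centers from a fixed seed, squared Euclidean distance), not the library routine a real script would normally call.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=100, seed=0):
    """Group vectors into k clusters; returns (labels, centers)."""
    rng = random.Random(seed)
    # Start from k distinct documents picked at random.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(v, centers[c]))
                      for v in vectors]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, new_labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:   # nothing changed, so we have converged
            break
        labels = new_labels
    return labels, centers
```

Because the starting centers are random, different seeds can give different groupings on real data, which is one more reason to treat the clusters as something to explore rather than as a verdict.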

(4) For each cluster the script identifies the 6 words that are closest to the cluster’s center. The visualization shows these in bold below the bubble chart. I didn’t make those texts up; they are what “best describes” each cluster’s choice of words.
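
One common way to read “closest to the cluster’s center”, and the one Rose’s guide takes as far as I can tell, is to pick the terms that carry the highest weight in the cluster’s mean vector. The helper names below are hypothetical.

```python
def centroid(vectors):
    """Mean vector of a cluster's member documents."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def top_terms(center, vocabulary, n=6):
    """Return the n vocabulary terms with the largest weight in the center."""
    ranked = sorted(range(len(center)), key=lambda i: center[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n]]
```

Here `vocabulary[i]` is assumed to be the word behind dimension `i` of the TF-IDF vectors.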

(5) The visualization uses colors to indicate clusters. But you also want to get an idea of how close the different documents are, i.e., how similar they are. To project points from a high-dimensional space onto a 2D chart, the script uses multidimensional scaling[9].
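
The sketch below shows the idea with classical (Torgerson) multidimensional scaling, written in plain Python with power iteration. This is a swapped-in variant chosen because it fits in a few lines; library implementations usually use an iterative stress-minimizing method instead, so the layout will not match the chart exactly. The input is assumed to be a symmetric matrix of pairwise distances, for example 1 minus the cosine similarity.

```python
import math


def classical_mds(dist, dims=2, iterations=500):
    """Place points in `dims` dimensions so their distances approximate `dist`."""
    n = len(dist)
    # Double-center the squared distances to obtain a Gram-like matrix b.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration converges to the dominant eigenvector of b.
        vec = [float(i + 1) for i in range(n)]   # fixed start for repeatability
        for _ in range(iterations):
            nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        bv = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(x * y for x, y in zip(vec, bv))
        # Clamp at zero: non-Euclidean input can yield negative eigenvalues.
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = vec[i] * scale
        # Deflate b so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * vec[i] * vec[j]
    return coords
```

The resulting coordinates are only defined up to rotation and mirroring, so the axes of such a chart carry no meaning of their own; only the distances between bubbles do.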

Visualization

I used the script to produce all the combinations of the two parameters: cluster count, and stemming vs. surface form. This page uses Chart.js[10] for the bubble chart. I used JS/HTML/LESS to build the interaction surrounding the chart itself.

Five of the six cluster colors come from a 2005 David Foster Wallace article[11] in The Atlantic. I don’t think repurposing them here amounts to IP theft, and I really like those colors.

Data and code

[Download] The analyzed LSP home pages, saved manually on May 17, 2017.

[Gist] Python script to extract plain text from the HTML pages.

[Download] Result of the plain text conversion.

[Gist] Python script to produce the data for the interactive visualization.

References

[1] If you’re not from the translation industry: LSP stands for language service provider, aka a company that provides translations and other language services.

[2] Top 100 Language Service Providers: 2016. CSA Research
http://www.commonsenseadvisory.com/Marketing/2016-largest-LSPs.aspx

[3] Document Clustering with Python. Brandon Rose
http://brandonrose.org/clustering

[4] Natural Language Toolkit (NLTK)
http://www.nltk.org/

[5] The Most Influential Paper Gerard Salton Never Wrote. David Dubin
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.184.910&rep=rep1&type=pdf

[6] TF-IDF. Wikipedia
https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[7] That Looks About Right. A key technique in aviation and natural language processing.
http://johnandmartha.kingschools.com/2016/11/21/when-tlar-beats-perfection/

[8] K-means clustering. Wikipedia
https://en.wikipedia.org/wiki/K-means_clustering

[9] Multidimensional scaling. Wikipedia
https://en.wikipedia.org/wiki/Multidimensional_scaling

[10] Chart.js
http://www.chartjs.org/

[11] Host. The Atlantic, April 2005. David Foster Wallace
https://www.theatlantic.com/magazine/archive/2005/04/host/303812/