A curious cliff: coincidence, anomaly, or proof of ancestry?
As I’m preparing for the release of CHDICT, an open-source Chinese-Hungarian dictionary that I have seeded by
translating 10 thousand headwords from the venerable
I became curious to see what proportion of the Chinese vocabulary these sources cover.
The results are pretty exciting, at least if you are a data-obsessed language nerd like me. The analysis of these dictionaries, plus their French counterpart CFDICT, reveals a shared idiosyncratic pattern. Is it a trace showing that the German and French dictionaries were derived from the English original? Is it an anomaly in the word frequency list? Is it a ghost in the machine? A hitherto-unknown law of the Chinese lexicon? Read on, and decide for yourself.
Introducing the actors
In case you are not into Chinese open-source dictionaries, a short intro. The story began with CC-CEDICT, a collaborative dictionary that was started by Paul Denisowski in 1997. The undertaking was inspired by the work of Jim Breen, who had started building EDICT, a Japanese-English dictionary, back in 1991. CEDICT has changed maintainers twice since its inception, and changed its name to CC-CEDICT along the way. It is still being actively developed, and its current website allows you to search over 115,000 Chinese headwords, or to download the entire dictionary as a text file.
HanDeDict is a Chinese-German dictionary that started in 2006. When its active development stopped in 2011 it contained over 150,000 headwords. CFDICT is a Chinese-French dictionary that started in 2009, and by 2013 exceeded 130 thousand entries. Both projects name CEDICT as their inspiration.
What constitutes a dictionary?
The words of any given language are not anyone’s property, and yet the dictionaries that contain definitions or translations for these same words are considered intellectual property. What, exactly, “is” a dictionary, then?
One thing that’s not immediately obvious is the dictionary’s scope: what words, exactly, are included in it? How are those words selected? What is the motivation, and what is the method?
If the number of words in a language is effectively infinite (it is), and your dictionary is inevitably limited, then it’s reasonable to first include more common words: that’s how you ensure people get the maximum benefit out of your work. This is why I chose to analyze each dictionary’s headwords against a list of words ranked by frequency: the selection of headwords is part of what defines a dictionary as a unique intellectual product.
Word frequencies from SUBTLEX-CH
No matter what language you’re dealing with, you inevitably face one key problem: nobody really knows (or agrees) what a “word” precisely is. With languages that are written in an alphabetic script, you can get around the problem by saying (chàbuduō, for the sake of simplicity) that a word is whatever stands between two spaces, minus punctuation plus stemming. With Chinese, you are out of luck because the script doesn’t use spaces to indicate word boundaries.
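To make the contrast concrete, here is a minimal sketch of the "whatever stands between two spaces" approach. The function name `naive_words` is made up for this illustration, stemming is left out for brevity, and only ASCII punctuation is stripped.

```python
import string


def naive_words(text):
    """Split on whitespace and strip surrounding ASCII punctuation (no stemming)."""
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


# English: the spaces do most of the work.
print(naive_words("The cat sat on the mat."))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Chinese: no spaces, so the whole sentence comes back as a single "word".
print(naive_words("我喜欢看电影"))
# ['我喜欢看电影']
```

The second call is the whole problem in one line: without a segmentation step, there is nothing to count.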
The internet is full of freely available corpus-based character frequency lists. Finding word frequencies, alas, is more difficult. But not completely impossible: there is SUBTLEX-CH, based on a 33-million-word corpus of movie subtitles, and tokenized into words through NLP tools. That is the resource that I used.
What does Zipf say?
Let’s first take a quick look at SUBTLEX-CH’s word frequencies, out of due diligence. The published ranked list contains (almost) 100,000 words. Let’s plot each word’s frequency against its rank on a graph with two logarithmic scales. Zipf’s law says that we should be seeing a shape made up of one or more straight lines, approximately.
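For readers who prefer a number to an eyeballed graph, the sketch below fits a straight line to log(frequency) against log(rank) with ordinary least squares. It is only an illustration: the function name `loglog_slope` is invented here, and the input is a synthetic, perfectly Zipfian list rather than the SUBTLEX-CH data.

```python
import math


def loglog_slope(ranks, frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Both sequences must be the same length and contain positive numbers.
    """
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic check: frequency proportional to 1 / rank is a textbook Zipf curve,
# so the fitted slope should come out very close to -1.
ranks = list(range(1, 1001))
ideal = [1_000_000 / r for r in ranks]
print(round(loglog_slope(ranks, ideal), 3))
```

Because the function takes the ranks explicitly, the same fit can be run on a sub-range of the list (say, ranks 1,000 to 10,000) to compare the slope of one stretch of the curve with another.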
With some generosity, our graph of SUBTLEX-CH’s word frequencies and ranks matches the Zipfian predictions. In my superficial judgement the slopes of the three sections vary more than expected, but the shape is by no means extremely off the mark: this seems like a legit word frequency list all right.
(Admittedly, questions do linger… To the extent the shape is not what one would expect, is that a peculiarity of written Chinese? Is it special to this corpus based on movie subtitles? Does the authors’ word segmentation method play a role? All exciting questions, but I must move on.)
Now for the interesting part. I used this quickly hacked-together Python script to plot CC-CEDICT’s coverage against word frequencies. This, precisely, is what’s going on:
- I split the 100,000 words on the ranked frequency list into 100 buckets. Bucket 1 has the 1000 most frequent words; bucket 2 has the next 1000 most frequent words; etc.
- I took CC-CEDICT’s simplified headwords, and looked up their frequency buckets. For each headword, I incremented the corresponding bucket’s count by one.
- A point (x, y) on the graph means: For bucket x, the dictionary contains y of the bucket’s 1000 words. The lower the dot, the lower the dictionary’s coverage in that frequency range.
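The three steps above can be sketched roughly as follows. This is not the script linked above, just a minimal reconstruction of the idea. The function names are invented for the illustration, the toy data is made up, and the parser assumes the usual CC-CEDICT line layout (`Traditional Simplified [pin1 yin1] /gloss/`, with `#` marking comment lines).

```python
def simplified_headwords(cedict_lines):
    """Collect the simplified headwords from CC-CEDICT-style lines."""
    headwords = set()
    for line in cedict_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(" ", 2)  # traditional, simplified, rest of the entry
        if len(parts) >= 2:
            headwords.add(parts[1])
    return headwords


def bucket_coverage(ranked_words, headwords, bucket_size=1000):
    """Count, for each frequency bucket, how many of its words are headwords.

    ranked_words is ordered from most to least frequent; index 0 is rank 1.
    """
    n_buckets = (len(ranked_words) + bucket_size - 1) // bucket_size
    counts = [0] * n_buckets
    bucket_of = {word: i // bucket_size for i, word in enumerate(ranked_words)}
    for word in headwords:
        bucket = bucket_of.get(word)
        if bucket is not None:  # headwords missing from the list are ignored
            counts[bucket] += 1
    return counts


# Toy data: six ranked words, buckets of three instead of a thousand.
ranked = ["的", "我", "你", "是", "了", "电影"]
lines = [
    "# a comment line",
    "的 的 [de5] /of/",
    "你 你 [ni3] /you/",
    "電影 电影 [dian4 ying3] /movie/",
    "貓 猫 [mao1] /cat/",
]
print(bucket_coverage(ranked, simplified_headwords(lines), bucket_size=3))
# [2, 1]
```

Note that headwords absent from the frequency list (like 猫 in the toy data) simply drop out, so the graph says nothing about the part of a dictionary that lies beyond the 100,000 ranked words.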
The graph shows that for the most frequent 2 or 3 thousand words, CC-CEDICT has nearly 100% coverage: they’re all in the dictionary. As you move down the list, there’s a nearly straight line all the way down to the middle, sinking to about 300 out of 1000, i.e., 30% coverage.
Between 50k and 70k it gets interesting. The line kind of flattens out with some fluctuation: that’s Plateau 1. Just under 70k there’s a sharp cutoff, the Cliff. From then on, it remains a stable 12% or so almost all the way to the end: Plateau 2.
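If you would rather not trust the eye to find the Cliff, one crude option is to look for the largest fall between two neighbouring buckets. The sketch below does just that. The function name `steepest_drop` is invented, and the counts are made-up numbers that merely imitate the shape described above, not values read off the real graph.

```python
def steepest_drop(counts):
    """Return (bucket_number, drop) for the largest fall between neighbours.

    Buckets are numbered from 1; the returned bucket is the one where the
    lower value begins. Needs at least two buckets.
    """
    i = max(range(len(counts) - 1), key=lambda k: counts[k] - counts[k + 1])
    return i + 2, counts[i] - counts[i + 1]


# Invented counts: a wobbly plateau, a sharp cutoff, then a lower plateau.
toy_counts = [320, 310, 300, 305, 298, 120, 118, 121]
print(steepest_drop(toy_counts))
# (6, 178)
```

A single largest difference is sensitive to noise, so on real data it is only a first hint of where to look.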
Without further ado, let’s look at the same curve for HanDeDict:
And for CFDICT:
What could possibly explain this recurring pattern? I have three explanations. I present them in what I perceive to be their order of plausibility.
1: Ancestry. For whatever reason, CC-CEDICT acquired this idiosyncrasy as it evolved. One hypothesis: it probably includes systematically imported word lists for proper names and the like, which contribute to the middle or to the lower frequency ranges. In any case, by the time HanDeDict and CFDICT started, CC-CEDICT was already very near its current size, with this shape established. The German and the French projects largely relied on CC-CEDICT’s headword list for their own scope, and whatever else happened on top of that, the “noise” from their own subsequent development has still not masked the inherited plateau-cliff-plateau shape.
2: Ghost in the machine. SUBTLEX-CH’s word frequencies seem to obey the Zipfian distribution. But maybe, just maybe, there is a subtle effect hiding in there; one that only shows when you project the headwords of truly balanced dictionaries onto the frequency list. The effect might come from the corpus’s unbalanced nature (remember, it’s movie subtitles), or from the particular word segmentation tool the authors used.
3: A natural law of Chinese lexicography. It may happen that there is some underlying regularity within the Chinese lexicon itself, and you would get this shape no matter which dictionary and corpus you combine in your analysis.
Which one do you find most convincing? I would love to hear!
And stay tuned until I check back with an analysis of CHDICT’s own coverage. There will be news of a different sort in there.