Chinese computing resources #2

April 2, 2015 Software Chinese

This is a follow-up to a previous collection of online resources that you need if you are to create Chinese language processing software – e.g., a new dictionary :) The previous post was about fonts, dictionaries and example sentences; read on for more esoteric stuff such as character recognition, frequency lists and stroke order animations.

Character recognition

An electronic Chinese dictionary with no ability to recognize handwritten characters makes about zero sense. Sure, generations have managed to learn Mandarin in the pre-digital age, but the sheer amount of time wasted navigating radical tables, counting strokes and turning pages makes me cringe. So the very first thing I looked at when I was considering whether Zydeo was feasible or not was the availability of open-source handwriting recognition for Hanzi. These resources were hard to find, but there are indeed several alternatives.

Frequency lists

Knowing what words are frequent and which ones are rare is not absolutely necessary to create a good dictionary application, but it can be useful. CC-CEDICT is very much a Chinese to English dictionary (it is structured around Chinese headwords), but you can also search for English words that occur in the definitions. Now if you’re searching for, say, “tree,” it would be ideal if Zydeo listed the more common Chinese words for “tree” first, instead of following some random order. This is not happening in Zydeo’s first version yet, but it’s one of the more obvious improvements down the road.
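To make the idea concrete, here is a minimal sketch of frequency-ranked reverse lookup. It is not how Zydeo does or will do it. The `Entry` type, the `search_english` function, the four sample entries and, above all, the frequency counts are invented for illustration; real counts would come from one of the frequency lists discussed in this post.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    headword: str      # Chinese headword
    definitions: list  # English definitions, one string per sense


# Hypothetical mini-dictionary, loosely modeled on CC-CEDICT entries.
ENTRIES = [
    Entry("乔木", ["tree, esp. with recognizable trunk"]),
    Entry("树", ["tree", "to cultivate"]),
    Entry("树木", ["tree"]),
    Entry("森林", ["forest"]),
]

# Made-up frequency counts, for illustration only (assumption).
FREQUENCY = {"树": 5000, "树木": 800, "森林": 1200}


def search_english(term, entries, frequency):
    """Return entries whose definitions contain `term` as a whole word,
    most frequent headword first. Headwords missing from the frequency
    list count as 0 and keep their original relative order."""
    term = term.lower()
    hits = [
        e for e in entries
        if any(term in re.findall(r"[a-z]+", d.lower()) for d in e.definitions)
    ]
    return sorted(hits, key=lambda e: frequency.get(e.headword, 0), reverse=True)


if __name__ == "__main__":
    for entry in search_english("tree", ENTRIES, FREQUENCY):
        print(entry.headword, "/".join(entry.definitions))
```

Because Python's sort is stable, headwords with equal (or missing) counts stay in dictionary order, so a sparse frequency list only promotes the words it knows about and leaves the rest where they were.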

Here’s what I found in terms of Chinese word frequency lists. Note that coming up with word frequencies is by no means a trivial task since the script does not indicate word boundaries with spaces, and in fact there is often little agreement about what exactly constitutes a word in Chinese. I completely ignore this can of worms for now, and just share what I have found.

Stroke order animations

If you’re at a somewhat advanced stage with Chinese, then you are supposed to be able to tell the correct stroke order just by looking at an unknown character. That is the theory at least – with 1500 Hanzi under my belt, I still get surprised regularly by one character or another. So while stroke order animations are not as much of an absolute must-have in a dictionary tool as character recognition, I see them as a big value add.

Are there freely available character animations out there? Here the answer is rather negative. A number of sites provide them for free, either as GIFs or as Flash animations, but apparently the animations are mostly syndicated from a few IP owners in exchange for ad revenue. Not a workable solution for an offline tool.

I would personally find it very exciting to launch an open-source stroke order animation project, but after giving even a little thought to how this could be achieved, it turns out to be a bit more complicated than creating a brand new Chinese font. Read: it’s a LOT of work. This is worth a series of posts on its own; for now, here’s what I’ve found out there.

Offshoot: fonts

While I was trying to track down that perfect source for character animations, I came across two really interesting papers; in fact, both are theses by computer science graduates.

Philip Jägenstedt (2008): Vector Graphics Stylized Stroke Fonts (referenced in a blog post by him) looks at efficient ways of representing CJK ideographs in fonts. Elena Jakubiak Hutchinson (2009): An Improved Representation for Stroke-Based Typefaces and a Method for their Creation looks at how stroke-based characters can be represented in a modular typeface. Both texts contain an exciting introduction to the inner workings of modern fonts as well as a wealth of information about the CJK script itself.

Finally, it was inevitable that I came across the great free font editor FontForge, a piece of software that allows you to open any TTF file and view or even edit the outline of characters in there – in other words, to play typographer in your spare time and create your own fonts.