jealous markup
Software and language, mostly

Authorial analysis #2: Cross-interrogation methods

August 20, 2013 NLP Language Software

In the first post of the series I looked at the goals of authorial analysis, the linguistic variables it works with, and the way it measures the similarity of texts. It’s now time to test those methods for real – and my aim is to see how the same techniques work for a different language, Hungarian. First I must explain what I mean by different, and then come up with a way to decide what works and what doesn’t.

Define “different”

Just how do I mean Hungarian is different from English? Why would that have any kind of bearing on authorial analysis?

Here are a few properties of Hungarian, contrasted with English, that justify the question of whether the same linguistic variables would work:

Measuring success

Here’s the experiment I set up to learn what works and what does not.

First, I needed data to feed into the authorial analyzer. I gathered texts from 6 different columnists, writing for three online portals or magazines. From each author the corpus contains 8 to 20 articles, with a total of 9 to 13 thousand words. For statistical methods, this counts as a very small amount of data. The table below gives you a precise breakdown; authors are identified by two letters.

Author # of articles Total words Total characters w/o spaces
fa 17 10,747 66,152
me 14 12,196 71,672
se 12 11,312 72,338
tg 8 9,172 60,742
to 20 12,939 77,398
va 14 12,792 82,547

My program processed the texts from each author and counted the raw and relative frequencies of:

I’ll come back to the last four measures later. The key challenge at this point is to find out how good these metrics are at identifying authorship. For that the program reads texts paragraph by paragraph, and creates two samples for every author, so that approximately 20% of the text ends up in a “test” corpus, and the remaining 80% ends up in a “training” corpus. There are two sets of counts per author, one from the test corpus and one from the training corpus. Then, the program takes one author’s test corpus vectors at a time, and calculates their similarity with the corresponding vectors from every author’s training corpus.
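To make the split-and-compare step concrete, here is a minimal sketch in Python. It is not the program I actually used. The function names are made up for illustration, word length stands in for any of the variables, and cosine similarity is an assumption about the similarity measure, which this post does not spell out.

```python
import math
import random
from collections import Counter

def split_paragraphs(paragraphs, test_share=0.2, seed=0):
    """Randomly assign paragraphs to the test set until it holds
    roughly test_share of all words; the rest is training data."""
    rng = random.Random(seed)
    shuffled = paragraphs[:]
    rng.shuffle(shuffled)
    total_words = sum(len(p.split()) for p in shuffled)
    train, test, test_words = [], [], 0
    for p in shuffled:
        if test_words < test_share * total_words:
            test.append(p)
            test_words += len(p.split())
        else:
            train.append(p)
    return train, test

def word_length_vector(paragraphs, max_len=20):
    """Relative frequency of word lengths 1..max_len.
    Words are whitespace-separated chunks (punctuation included);
    anything longer than max_len is counted in the last slot."""
    counts = Counter(min(len(w), max_len) for p in paragraphs for w in p.split())
    total = sum(counts.values()) or 1
    return [counts[n] / total for n in range(1, max_len + 1)]

def cosine_similarity(a, b):
    """Assumed similarity measure: cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With one such pair of vectors per author, comparing a test vector against every author's training vector is a loop over `cosine_similarity` calls.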

A variable is good at predicting authorship if an author’s vector from the test corpus is consistently closest to the same author’s vector from the training corpus.

A typical method in corpus linguistics is to divide the content into 10 equal parts, use 9 parts for training and 1 part for verification, and do this ten times around, using a different 1/10th for verification each time. I used 1/5th because the corpus is already very small to begin with, and I didn’t do the full cycle, only selected a single random set of paragraphs for verification. After all, this is just a playful exercise, not a publication-grade study.
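For comparison, the full ten-part cycle that I skipped could look roughly like the sketch below. The function name is hypothetical, and the parts are equal by number of items (paragraphs, say) rather than by word count, which is a simplification.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) pairs; every item lands in the test part exactly once."""
    # Deal the items out round-robin into k parts of near-equal size.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Each round would then produce its own set of similarity tables, and the hit rates would be averaged over the ten rounds.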

The outcome is a set of tables like the one below, one for every linguistic variable. These figures are for word length.

Test author Rank Winner fa me se tg to
fa 1 fa 0.9988419 0.9883497 0.9922947 0.9897225 0.9922224
me 1 me 0.9873105 0.9990566 0.9929206 0.9742669 0.9962779
se 1 se 0.9953771 0.9924851 0.9975414 0.9891039 0.9940981
tg 2 va 0.9838168 0.9763228 0.9861519 0.9935337 0.9774470
to 1 to 0.9837986 0.9953516 0.9905211 0.9750831 0.9973213
va 2 tg 0.9898528 0.9722818 0.9865904 0.9981027 0.9753838
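The Rank and Winner columns follow mechanically from each row of similarities. Here is a small sketch, with a made-up function name, applied to the five values shown in the fa row above:

```python
def rank_and_winner(test_author, similarities):
    """similarities maps each training author to the similarity with the
    test author's vector. Returns (rank of the true author, best match)."""
    ordered = sorted(similarities, key=similarities.get, reverse=True)
    return ordered.index(test_author) + 1, ordered[0]

# The five similarity values listed in the fa row of the word length table.
fa_row = {
    "fa": 0.9988419,
    "me": 0.9883497,
    "se": 0.9922947,
    "tg": 0.9897225,
    "to": 0.9922224,
}
rank, winner = rank_and_winner("fa", fa_row)
```

A rank of 1 means the test sample came out closest to its own author's training sample.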

Expert committee

Now let’s stop filling up hundreds of cells with long numbers and try to answer the simple question: I have a new article here, who is the author? For each variable, the program computes the vectors of frequencies, and compares them to every author’s vectors from our full training corpus. And we end up with something like this:

Metric Authors ordered by similarity
Word length fa, to, me, se, va, tg
Most frequent words se, va, fa, tg, to, me
4-grams va, se, tg, to, me, fa
Word-final trigrams fa, me, se, va, to, tg

We asked four experts, and they came up with different answers. The final touch to the authorship analyzer is to synthesize these diverging opinions into a single answer. Depending on the text we’re looking at, some variables may predict authorship perfectly; others may get it almost right; and others still may be thrown off the mark by the current text’s idiosyncrasies. Which ones get it right and which ones fail will vary from text to text. Remember, our training corpus is already quite small, and a single newspaper article may only contain a few hundred words. This leaves significant room for error with statistical methods.

So I decided to score every author based on their position in each variable’s ordered list of similarity. Whenever fa occurs first, the program increments fa’s score by 1. For every second place, it adds 0.5. For every third place, 0.25; and so on. The authors’ combined scores from the table above, then:

fa me se tg to va
2.28125 0.84375 1.875 0.4375 0.75 1.6875

The winner is fa with 2.28125. Incidentally, for the recent article that I analyzed here, that was exactly the right answer.
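The scoring rule is easy to reproduce. The sketch below (names are illustrative, not taken from my program) halves the reward with every step down an ordered list and sums the rewards per author, using the four lists from the table above:

```python
rankings = {
    "Word length": ["fa", "to", "me", "se", "va", "tg"],
    "Most frequent words": ["se", "va", "fa", "tg", "to", "me"],
    "4-grams": ["va", "se", "tg", "to", "me", "fa"],
    "Word-final trigrams": ["fa", "me", "se", "va", "to", "tg"],
}

def committee_scores(rankings):
    """First place earns 1, second 0.5, third 0.25, and so on."""
    scores = {}
    for ordered in rankings.values():
        for position, author in enumerate(ordered):
            scores[author] = scores.get(author, 0.0) + 0.5 ** position
    return scores

scores = committee_scores(rankings)
winner = max(scores, key=scores.get)
```

Because the reward halves at every step, a single first place outweighs any number of third or lower places from the other variables.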

So what’s the deal with Hungarian?

If you paid very close attention, you’ll have noticed that I skipped one metric from Juola’s list altogether: word pairs (bigrams). I had the intuition that with my small corpus size and Hungarian’s large morphological variability, that variable would yield data that is too sparse to be useful.

However, I added a couple of my own; some just for fun, but some out of genuine curiosity.

Character bigrams and trigrams. These are not referenced anywhere in the literature about authorial analysis, and there’s no obvious reason why they should be particularly useful for Hungarian. And indeed, the program’s hit rate dropped sharply whenever these two guys without a pedigree sat on the expert committee. The world is not a richer place with this knowledge, but it was reassuring to contrast the really useful stuff with bad ideas.

Vowels. I was genuinely curious about this metric, but it turned out to be a lemon. The frequency of different vowels had about the same predictive power for authorship as character bigrams or trigrams – none. I’d still like to go back and see how useful more specific features (long vs. short vowels, or front/back vowels) would be, but that’s for next time.

Word-final trigrams. This variable turned out to be a hit, in line with my hopes. Hungarian relies much less on short function words like prepositions in English; instead, it has a rich case system. The last three letters of nouns encode most of that information, and writers have a similar degree of freedom in their choice of case as they have in their choice of preposition in English. This explanation is definitely not the full story as it relies on a very narrow view of language, but my experiments have shown that expert committees had a higher success rate when this metric was sitting on them.
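Counting word-final trigrams takes only a few lines. In this sketch the function name is hypothetical, and skipping words shorter than three letters is my assumption here, one of several reasonable choices:

```python
import re
from collections import Counter

def word_final_trigrams(text):
    """Relative frequency of the last three letters of every word."""
    words = re.findall(r"\w+", text.lower())
    # Assumption: words with fewer than three letters are ignored.
    counts = Counter(w[-3:] for w in words if len(w) >= 3)
    total = sum(counts.values()) or 1
    return {gram: n / total for gram, n in counts.items()}

# "in the house and in the garden": the case endings -ban and -ben surface.
example = word_final_trigrams("a házban és a kertben")
```

Note that nothing in the code knows about case endings. It only looks at letters, and the grammar shows up in the counts.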

Do my earlier assertions about the number and frequency of words and character 4-grams hold true? The table below compares one of the Hungarian authors (to) in my experiment to an English-language academic blogger (lf).

Author Total chars Different words Avg wd freq Different 4-grams Avg 4-gram freq
to 77,398 2,034 4.71 21,847 4.04
lf 68,203 1,998 6.03 16,260 5.04

I would have expected the difference to be a lot more drastic, but the tendency is still clear: the Hungarian corpus contains a greater number of different word forms that each occur less frequently, and you find a larger number of different character 4-grams, which all occur fewer times.
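For reference, figures like the ones in the table can be computed along these lines. This is a sketch with made-up names. It assumes that the average frequency is simply the total number of items divided by the number of different items, and it leaves open whether spaces count as part of a 4-gram.

```python
from collections import Counter

def type_stats(items):
    """Return (number of different items, average occurrences per item)."""
    counts = Counter(items)
    distinct = len(counts)
    return distinct, (len(items) / distinct if distinct else 0.0)

def char_ngrams(text, n=4):
    """All overlapping character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Feeding `type_stats` a list of words gives the word columns; feeding it the output of `char_ngrams` gives the 4-gram columns.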

From insight to black box and back

The way I came up with ideas for new metrics, tested them, and then interpreted the results, is a typical full circle in statistics-based computer linguistics. You start out with some formal or intuitive insight into language:

You then construct metrics that have nothing to do with grammatical categories like case or function words, and start counting big time. In effect, you drop your magnifying glass and create a statistical black box. The trick is to choose metrics that are really dumb (do not require automated linguistic analysis or other language-specific knowledge in your code) but still capture the information you are looking for.

In the end, there are exciting possibilities to curve back and learn very language-y things from the experiment. Consider this. Instead of attributing texts to specific authors, you could try to train a program to determine whether a text was written by a male or a female author. This is not new; there are people out there getting paid to do this kind of thing. You may start out from one of the metrics I used in my experiment: the distribution of the 100 most frequent words. But then you can start narrowing down the vector from 100 dimensions to only a few – in practical terms, looking at a much shorter list of specific words in a specific language. It may turn out that some words have strong gender-predicting capacity, while others are more gender-neutral. Do male writers prefer definite articles? Do female writers prefer personal pronouns? The answer is yours to find out.
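Narrowing the vector down is mostly a matter of swapping the vocabulary. A rough sketch follows; the function names and the short English word list are purely illustrative and not the outcome of any experiment.

```python
from collections import Counter

def most_frequent_words(tokens, n=100):
    """The n most frequent words, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(n)]

def frequency_vector(tokens, vocabulary):
    """Relative frequency of each vocabulary word among all tokens."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total for word in vocabulary]

# Hypothetical hand-picked shortlist replacing the 100-word vocabulary.
shortlist = ["the", "i", "she"]
```

The same `frequency_vector` call works with the full 100-word list or with the shortlist; only the number of dimensions changes.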

I am also quite intrigued to see what metrics could be used to identify authors in languages that use an ideographic writing system, like Chinese. Because there is no short alphabet and words are not delimited by spaces, many of the metrics I’ve used successfully for Hungarian are simply not applicable.

But that’s not what I’ll do next. Instead, in the next post I’ll show you how I used my shiny new authorial analyzer to solve a decades-old mystery of Hungarian literature, and to see who really influences the style of a translated novel: the original author or the translator.