Authorial analysis #2: Cross-interrogation methods
In the first post of the series I looked at the goals of authorial analysis, the linguistic variables it works with, and the way it measures the similarity of texts. It’s now time to test those methods for real – and my aim is to see how the same techniques work for a different language, Hungarian. First I must explain what I mean by different, and then come up with a way to decide what works and what doesn’t.
In what sense is Hungarian different from English, and why would that have any bearing on authorial analysis?
Here are a few properties of Hungarian, contrasted with English, that justify the question of whether the same linguistic variables would work:
Hungarian is a morphologically rich, agglutinative language. It keeps tacking suffixes onto the end of words: for nouns, these suffixes can indicate possession, number and case. With three persons, singular and plural, and at least 17 cases (depending on whom you ask), the suffixes can combine to produce several hundred word forms for a single dictionary noun. Compare that with the two or four possibilities in English (singular and plural, with and without the possessive ’s).
Remember how we defined “words” in the last post: lower-case sequences of letters, separated by spaces, with punctuation trimmed. That means we’re really looking at word forms, not dictionary words. The same amount of text will contain a much larger number of word forms in Hungarian, and each word form will occur a lot less frequently. Is the list of the most frequent 100 words still useful? Will we find a whole lot of word bigrams occurring often enough to be useful?
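That word-form definition is simple enough to sketch in a few lines of Python. This is my own minimal reconstruction of the rule, not the actual program from the series:

```python
import re

def word_forms(text):
    """Split text into word forms: lower-cased sequences of letters,
    with punctuation, digits and symbols acting as separators."""
    # [^\W\d_] matches Unicode letters only (no digits or underscore).
    return re.findall(r"[^\W\d_]+", text.lower())

print(word_forms("A kutyák, és a macskák."))
# → ['a', 'kutyák', 'és', 'a', 'macskák']
```

Note that the regex treats accented letters like á or ű as ordinary letters, which matters for Hungarian.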
Add to this that Hungarian uses a larger alphabet of 35 characters.
Combined with the rich morphology, and contrasted with the 26 characters in the English alphabet, this will produce a greater number of different character 4-grams, all of which will occur less frequently. Will the character 4-gram variable still work as expected?
Finally, let’s consider one more linguistic feature: vowel harmony. Each vowel can be categorized as back, front unrounded or front rounded. Many suffixes have two or three variants, and when a suffix is needed, the corresponding form is tacked onto the word at hand to preserve vowel harmony. Some words have alternatives (e.g., fel and föl) that are fully equivalent except for the choice of vowel, and they attract a different set of suffix forms.
Could a person’s preference of vowels be characteristic of their writing style? With a bit of insight into the language we may be able to come up with a useful linguistic variable – but it may equally well turn out to be a lemon.
Here’s the experiment I set up to learn what works and what does not.
First, I needed data to feed into the authorial analyzer. I gathered texts from 6 different columnists, writing for three online portals or magazines. From each author the corpus contains 8 to 14 articles, with a total of 9 to 13 thousand words. For statistical methods, this counts as a very small amount of data. The table below gives you a precise breakdown; authors are identified by two letters.
| Author | # of articles | Total words | Total characters w/o spaces |
| --- | --- | --- | --- |
My program processed the texts from each author and counted the raw and relative frequencies of:
- Character 4-grams
- Words of different lengths (from 1 to 32 letters)
- Vowels (14 of them)
- Character trigrams from the end of each word
- Character trigrams
- Character bigrams
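A few of these counts are easy to sketch. The helpers below are my own minimal reconstructions, not the program itself:

```python
from collections import Counter

VOWELS = set("aáeéiíoóöőuúüű")  # the 14 Hungarian vowels

def word_lengths(words):
    """Frequency of word lengths, capped at 32 letters."""
    return Counter(len(w) for w in words if len(w) <= 32)

def vowel_counts(text):
    """Frequency of each vowel in a lower-cased text."""
    return Counter(c for c in text.lower() if c in VOWELS)

def final_trigrams(words):
    """Character trigrams from the end of each word (3+ letters only)."""
    return Counter(w[-3:] for w in words if len(w) >= 3)

words = ["a", "házakban", "kertekben"]
print(word_lengths(words))    # → Counter({1: 1, 8: 1, 9: 1})
print(final_trigrams(words))  # → Counter({'ban': 1, 'ben': 1})
```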
I’ll come back to the last four measures later. The key challenge at this point is to find out how good these metrics are at identifying authorship. For that the program reads texts paragraph by paragraph, and creates two samples for every author, so that approximately 20% of the text ends up in a “test” corpus, and the remaining 80% ends up in a “training” corpus. There are two sets of counts per author, one from the test corpus and one from the training corpus. Then, the program takes one author’s test corpus vectors at a time, and calculates their similarity with the corresponding vectors from every author’s training corpus.
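The 80/20 paragraph-level split can be sketched like this (a minimal reconstruction; the seed and the rounding are my own choices, not necessarily the program’s):

```python
import random

def split_paragraphs(paragraphs, test_ratio=0.2, seed=42):
    """Randomly route ~20% of the paragraphs to a test corpus
    and the remaining ~80% to a training corpus."""
    rng = random.Random(seed)
    shuffled = paragraphs[:]
    rng.shuffle(shuffled)
    cut = max(1, round(len(shuffled) * test_ratio))
    return shuffled[cut:], shuffled[:cut]  # (training, test)

train, test = split_paragraphs([f"paragraph {i}" for i in range(10)])
print(len(train), len(test))  # → 8 2
```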
A variable is good at predicting authorship if an author’s vector from the test corpus is consistently closest to the same author’s vector from the training corpus.
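The similarity measure is the cosine similarity introduced in the first post. Over frequency dictionaries it can be written as follows, treating keys missing from either vector as zeros (the numbers below are made up for illustration):

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two frequency vectors stored as dicts.
    A key absent from one vector contributes zero to the dot product."""
    dot = sum(count * b.get(key, 0) for key, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

test_vec = Counter({"a": 25, "és": 10, "hogy": 7})
train_vec = Counter({"a": 90, "és": 40, "nem": 12})
print(round(cosine_similarity(test_vec, train_vec), 2))  # → 0.96
```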
A typical method in corpus linguistics is to divide the content into 10 equal parts, use 9 parts for training and 1 part for verification, and do this ten times around, using a different 1/10th for verification each time. I used 1/5th because the corpus is already very small to begin with, and I didn’t do the full cycle, only selected a single random set of paragraphs for verification. After all, this is just a playful exercise, not a publication-grade study.
The outcome is a set of tables like the one below, one for every linguistic variable. These figures are for word length.
- Each row represents an author’s vector from the test corpus; the last six columns represent authors’ vectors from the training corpora. The number in fa’s row and me’s column is 0.9883497, meaning that the word length vector from fa’s test corpus was 0.9883497 “similar” to the vector from me’s training corpus, in terms of cosine similarity.
- Rank is 1 if author X’s test vector is most similar to X’s own training vector; it is 2 if X’s training vector comes only second, behind a different author Y; and so on.
- Clearly, word length gets it right, or almost right, for every author in this evaluation.
Now let’s stop filling up hundreds of cells with long numbers and try to answer the simple question: I have a new article here, who is the author? For each variable, the program computes the vectors of frequencies, and compares them to every author’s vectors from our full training corpus. And we end up with something like this:
| Metric | Authors ordered by similarity |
| --- | --- |
| Word length | fa, to, me, se, va, tg |
| Most frequent words | se, va, fa, tg, to, me |
| 4-grams | va, se, tg, to, me, fa |
| Word-final trigrams | fa, me, se, va, to, tg |
We asked four experts, and they came up with different answers. The final touch to the authorship analyzer is to synthesize these diverging opinions into a single answer. Depending on the text we’re looking at, some variables may predict authorship perfectly; others may get it almost right; and others still may be thrown off the mark by the current text’s idiosyncrasies. Which ones get it right and which ones fail will vary from text to text. Remember, our training corpus is already quite small, and a single newspaper article may only contain a few hundred words. This leaves significant room for error with statistical methods.
So I decided to score every author based on their position in each variable’s ordered list of similarity. Whenever fa occurs first, the program increments fa’s score by 1. For every second place, it adds 0.5. For every third place, 0.25; and so on. The authors’ combined scores from the table above, then:
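Using the ordered lists from the table above, the scoring scheme fits in a few lines. The halving weights are exactly as described; the rest is my own sketch:

```python
rankings = {
    "Word length":         ["fa", "to", "me", "se", "va", "tg"],
    "Most frequent words": ["se", "va", "fa", "tg", "to", "me"],
    "4-grams":             ["va", "se", "tg", "to", "me", "fa"],
    "Word-final trigrams": ["fa", "me", "se", "va", "to", "tg"],
}

def combined_scores(rankings):
    """Each variable votes: 1 point for first place, 0.5 for second,
    0.25 for third, halving with every further position."""
    scores = {}
    for order in rankings.values():
        for position, author in enumerate(order):
            scores[author] = scores.get(author, 0.0) + 0.5 ** position
    return scores

scores = combined_scores(rankings)
print(max(scores, key=scores.get), scores["fa"])  # → fa 2.28125
```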
The winner is fa with 2.28125. Incidentally, for the recent article that I analyzed here, that was exactly the right answer.
So what’s the deal with Hungarian?
If you paid very close attention, you’ll have noticed that I skipped one metric from Juola’s list altogether: word pairs (bigrams). I had the intuition that with my small corpus size and Hungarian’s large morphological variability, that variable would yield data that is too sparse to be useful.
However, I added a couple of my own; some just for fun, but some out of genuine curiosity.
Character bigrams and trigrams. These are not referenced anywhere in the literature about authorial analysis, and there’s no obvious reason why they should be particularly useful for Hungarian. And indeed, the program’s hit rate dropped sharply whenever these two guys without a pedigree sat on the expert committee. The world is not a richer place with this knowledge, but it was reassuring to contrast the really useful stuff with bad ideas.
Vowels. I was genuinely curious about this metric, but it turned out to be a lemon. The frequency of different vowels had about the same predictive power for authorship as character bigrams or trigrams – none. I’d still like to go back and see how useful more specific features (long vs. short vowels, or front/back vowels) would be, but that’s for next time.
Word-final trigrams. This variable turned out to be a hit, in line with my hopes. Hungarian relies far less on short function words than English does with its prepositions; instead, it has a rich case system. The last three letters of nouns encode most of that information, and writers have roughly the same degree of freedom in their choice of case as English writers have in their choice of preposition. This explanation is definitely not the full story, as it relies on a very narrow view of language, but my experiments showed that expert committees had a higher success rate when this metric sat on them.
Do my earlier assertions about the number and frequency of words and character 4-grams hold true? The table below compares one of the Hungarian authors (to) in my experiment to an English-language academic blogger (lf).
| Author | Total chars | Different words | Avg word freq | Different 4-grams | Avg 4-gram freq |
| --- | --- | --- | --- | --- | --- |
I would have expected the difference to be a lot more drastic, but the tendency is still clear: the Hungarian corpus contains a greater number of different word forms that each occur less frequently, and you find a larger number of different character 4-grams, which all occur fewer times.
From insight to black box and back
The way I came up with ideas for new metrics, tested them, and then interpreted the results is a typical full circle in statistics-based computational linguistics. You start out with some formal or intuitive insight into language:
- In Hungarian, the last few characters of words encode grammatical information that may be characteristic of an author’s style.
- Authors may prefer different function words (e.g., definite vs. indefinite articles), and you can capture this by looking at the distribution of the 100 most frequent words.
You then construct metrics that have nothing to do with grammatical categories like case or function words, and start counting big time. In effect, you drop your magnifying glass and create a statistical black box. The trick is to choose metrics that are really dumb (do not require automated linguistic analysis or other language-specific knowledge in your code) but still capture the information you are looking for.
In the end, there are exciting possibilities to curve back and learn very language-y things from the experiment. Consider this: instead of attributing texts to specific authors, you could try to train a program to determine whether a text was written by a male or female author. This is not new; there are people out there getting paid to do this kind of thing. You may start out from one of the metrics I used in my experiment, the distribution of the 100 most frequent words. But then you can start narrowing down the vector from 100 dimensions to only a few – in practical terms, looking at a much shorter list of specific words in a specific language. It may turn out that some words have a strong gender-predicting capacity, while others are more gender-neutral. Do male writers prefer definite articles? Do female writers prefer personal pronouns? The answer is yours to find out.
I am also quite intrigued to see what metrics could be used to identify authors in languages that use an ideographic writing system, like Chinese. Because there is no short alphabet and words are not delimited by spaces, many of the metrics I’ve used successfully for Hungarian are simply not applicable.
But that’s not what I’ll do next. Instead, in the next post I’ll show you how I used my shiny new authorial analyzer to solve a decades-old mystery of Hungarian literature, and to see who really influences the style of a translated novel: the original author or the translator.