Release of greek-normalisation 0.1 and more...

Release of greek-normalisation 0.1

For years I’ve had Python code for normalising Greek forms, checking for stray characters, etc. I finally got around to consolidating them in a library.

It has a few little utilities like:

>>> strip_last_accent_if_two('γυναῖκά')
>>> grave_to_acute('τὴν')
>>> breathing_check('ἀι')

but the core of it is the normalisation of tokens with knowledge of clitics and elision.

>>> normalise('τὴν')
('τήν', ['grave'])
>>> normalise('γυναῖκά')
('γυναῖκα', ['extra'])
>>> normalise('σου')
('σου', ['enclitic'])
>>> normalise('Τὴν')
('τήν', ['grave', 'capitalisation'])
>>> normalise('ὁ')
('ὁ', ['proclitic'])
>>> normalise('μετ’')
('μετά', ['elision'])
>>> normalise('οὐκ')
('οὐ', ['movable', 'proclitic'])
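Under the hood, a utility like grave_to_acute can be built on plain unicodedata. Here’s a minimal sketch of the approach (not necessarily how the library actually implements it):

```python
import unicodedata

def grave_to_acute(token):
    """Replace any combining grave accent (U+0300) with an acute (U+0301),
    decomposing first so precomposed characters are handled too."""
    decomposed = unicodedata.normalize("NFD", token)
    return unicodedata.normalize("NFC", decomposed.replace("\u0300", "\u0301"))

print(grave_to_acute("τὴν"))  # τήν
```

Working in NFD means a single replacement handles every vowel-plus-grave combination, and the final NFC pass recomposes the result.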

See my previous post The Normalisation Column in MorphGNT for the original work this code came from.

There are also some regular expressions that I’ve used to check mistakes in things like the Open Apostolic Fathers.

It’s just an initial 0.1 release but parts of the code have already been in use for years.

The repository is and it’s pip-installable as greek-normalisation.


Summer Conferences

Here are the conferences I’m attending (and in some cases, presenting at) in June through August. I probably should have posted this at the start of my conference travel, but here it is.

First LiLa Workshop: Linguistic Resources & NLP Tools for Latin

I’m excited about the LiLa project, which is about a Linguistic Linked Open Data (LLOD) approach to Latin resources. Because I’m interested in LLOD for Ancient Greek, I was keen to attend the first workshop to get ideas, but then I got asked to speak about Scaife anyway.

Quantitative Approaches to Versification

This was a conference about computational analysis of poetry (especially meter). I had done some work with Sophia Sklaviadis on the relationship between repeating n-grams and metrical position in Homer and presented a paper on it at this conference. Not normally my area, but I have some more ideas to pursue that I might write about here at some point.


When I went to the American Association for Applied Linguistics annual meeting last year, I mostly attended the track on vocabulary research. Regular readers of this blog know that, along with morphology, it’s my main research area. Well, the Vocab@ conferences are 100% vocabulary research. I did actually submit a paper to this conference that got rejected but I’ll be presenting it as a poster at EuroCALL (see below).

Digital Humanities Conference 2019

The big DH conference of the year. Will be my first time attending and I’m sure it will be overwhelming. I’m presenting as part of a panel on Confronting the Complexity of Babel in a Global and Digital Age and I’ll specifically be talking about online reading environments to scaffold understanding of texts in historical languages.

After this I’m briefly heading back to Boston for a couple of weeks. Then two Tolkien-related conferences:


This is the International Conference on J.R.R. Tolkien’s Invented Languages. Not speaking (Elvish or otherwise) just attending.

Tolkien 2019

Giving a talk on Tolkien and Digital Philology, basically how we might treat Tolkien’s works as the objects of philological study and use the same digital methods one might for, say, an Ancient Greek text. The talk will culminate in me outlining my vision for the Digital Tolkien project.

And finally:


EuroCALL 2019

This is the major European conference for Computer-Aided Language Learning. I’m presenting a poster on what is possibly the longest-running topic of this blog: the sequencing of vocabulary learning from texts. There’ll be lots more blog posts here on that in the future!


A Tour of Greek Morphology: Part 28

Part twenty-eight of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In this post, we look systematically at the imperfect active distinguishers in much the same way as we did the present active distinguishers in Part 13.

Before we summarise all the distinguisher paradigms we’ve seen so far, there are actually three forms in the SBLGNT not covered yet: εἰσῄει, παρῆσαν, and συνῆσαν (all in Luke/Acts). εἰσῄει is from εἰς+εἶμι (making it a compound of IA-11) and παρῆσαν is παρά+εἰμί (making it a compound of IA-10). In our text, συνῆσαν is from σύν+εἰμί but could be from σύν+εἶμι. Either way, for completeness we need to add IA-10-COMP and IA-11-COMP.

So with those, here are all the imperfect active distinguisher paradigms we’ve discussed:

IA-1 IA-2 IA-3 IA-4 IA-5
1SG Xον Xουν Xουν Xων Xων
2SG Xες Xεις Xους Xᾱς Xης
3SG Xε(ν) Xει Xου Xᾱ Xη
1PL Xομεν Xοῦμεν Xοῦμεν Xῶμεν Xῶμεν
2PL Xετε Xεῖτε Xοῦτε Xᾶτε Xῆτε
3PL Xον Xουν Xουν Xων Xων
IA-6 IA-7 IA-8 IA-9 IA-9b
1SG Xῡν Xην/Xειν Xουν Xην Xην
2SG Xῡς Xεις Xους Xης Xης/Xησθα
3SG Xῡ Xει Xου Xη Xη
1PL Xυμεν Xεμεν Xομεν Xαμεν Xαμεν
2PL Xυτε Xετε Xοτε Xατε Xατε
3PL Xυσαν Xεσαν Xοσαν Xασαν Xασαν
IA-10 IA-11 IA-10-COMP IA-11-COMP
1SG ἦ/ἦν ᾖα/ᾔειν Xῆ/Xῆν Xῇα/Xῄειν
2SG ἦς/ἦσθα ᾔεις/ᾔεισθα Xῆς/Xῆσθα Xῄεις/Xῄεισθα
3SG ἦν ᾔει(ν) Xῆν Xῄει(ν)
1PL ἦμεν ᾖμεν Xῆμεν Xῇμεν
2PL ἦτε ᾖτε Xῆτε Xῇτε
3PL ἦσαν ᾖσαν/ᾔεσαν Xῆσαν Xῇσαν/Xῄεσαν

It will be worth taking some future posts to talk about the -σθα ending that crops up in the 2SG as well as some of the more extraordinary forms in IA-10 and IA-11 (along with compounds).

But for now, just capturing the common element in each row (like we did in Part 13):

IA-1 IA-2 IA-3 IA-4 IA-5 IA-6 IA-7 IA-8 IA-9 IA-10 IA-11
2SG -ς/-σθα
3SG -/-(ν)
1PL -μεν
2PL -τε
3PL -σαν

As with the present active paradigms, some cells across inflectional classes have identical distinguishers and so those cells alone can’t identify the inflectional class (and hence all the other forms in that class). In particular:

  • The 1SG can’t distinguish within the set {IA-2, IA-3, IA-8} or within the set {IA-4, IA-5} or within the set {IA-7 (if η), IA-9}
  • The 2SG and 3SG can’t distinguish within the set {IA-2, IA-7} or within the set {IA-3, IA-8} or within the set {IA-5, IA-9}
  • The 1PL can’t distinguish within the set {IA-2, IA-3} or within the set {IA-4, IA-5} or within the set {IA-1, IA-8}
  • The 2PL can’t distinguish within the set {IA-1, IA-7}
  • The 3PL can’t distinguish within the set {IA-2, IA-3} or within the set {IA-4, IA-5}

The distinctions from IA-7 on up are less important because they are tiny, non-productive classes. Looking at just IA-1 through IA-6:

  • {IA-2, IA-3} can’t be distinguished by 1SG, 1PL, or 3PL but can by 2SG, 3SG, or 2PL.
  • {IA-4, IA-5} also can’t be distinguished by 1SG, 1PL, or 3PL but can by 2SG, 3SG, or 2PL.

So at least for the first six classes, any of 2SG, 3SG, or 2PL uniquely identifies the class (at least within the imperfect active system).
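That observation can be checked mechanically. Here’s a quick sketch, with the distinguishers for the six open classes transcribed as data (X standing in for the stem):

```python
from collections import defaultdict

# Imperfect active distinguishers for IA-1 through IA-6 (X = the stem).
paradigms = {
    "IA-1": {"1SG": "Xον",  "2SG": "Xες",  "3SG": "Xε(ν)", "1PL": "Xομεν",  "2PL": "Xετε",  "3PL": "Xον"},
    "IA-2": {"1SG": "Xουν", "2SG": "Xεις", "3SG": "Xει",   "1PL": "Xοῦμεν", "2PL": "Xεῖτε", "3PL": "Xουν"},
    "IA-3": {"1SG": "Xουν", "2SG": "Xους", "3SG": "Xου",   "1PL": "Xοῦμεν", "2PL": "Xοῦτε", "3PL": "Xουν"},
    "IA-4": {"1SG": "Xων",  "2SG": "Xᾱς",  "3SG": "Xᾱ",    "1PL": "Xῶμεν",  "2PL": "Xᾶτε",  "3PL": "Xων"},
    "IA-5": {"1SG": "Xων",  "2SG": "Xης",  "3SG": "Xη",    "1PL": "Xῶμεν",  "2PL": "Xῆτε",  "3PL": "Xων"},
    "IA-6": {"1SG": "Xῡν",  "2SG": "Xῡς",  "3SG": "Xῡ",    "1PL": "Xυμεν",  "2PL": "Xυτε",  "3PL": "Xυσαν"},
}

def ambiguous_sets(cell):
    """Return the groups of classes whose distinguisher is identical in this cell."""
    groups = defaultdict(list)
    for cls, cells in paradigms.items():
        groups[cells[cell]].append(cls)
    return [group for group in groups.values() if len(group) > 1]

for cell in ("1SG", "2SG", "3SG", "1PL", "2PL", "3PL"):
    print(cell, ambiguous_sets(cell))
```

The 2SG, 3SG, and 2PL cells come back with no ambiguous groups, while 1SG, 1PL, and 3PL each collapse {IA-2, IA-3} and {IA-4, IA-5}.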

It is interesting then that the 2SG and 3SG are the very cells most likely to cause confusion within the sets {IA-2, IA-7}, {IA-3, IA-8}, and {IA-5, IA-9} and in those cases, it is the 1PL or 3PL that can come to the rescue in identifying the class (although the value of X itself can do that given the tiny size of the IA-7, IA-8 and IA-9 classes).

If we try to group our classes along the lines we did in Part 13, we get a hierarchy very similar to that in the present:

IA-{1, 2, 3, 4, 5} 3PL in -ν; 1SG and 3PL identical
 IA-{2, 3, 4, 5}  long vowels before the endings; circumflexes in the 1PL and 2PL
  IA-{2, 3}   ου in 1SG, 1PL, and 3PL
  IA-{4, 5}   ω in 1SG, 1PL, and 3PL
IA-{6, 7, 8, 9, 9b, 10, 11, 10-COMP, 11-COMP} 3PL in -σαν
 IA-{6, 7, 8, 9}  2SG only in -ς
 IA-{9b, 10, 11, 10-COMP, 11-COMP}  2SG in -ς/-σθα

along with cross-cutting categories such as:

IA-{2, 3, 8} ουν in 1SG
IA-{2, 7} ει in 2SG and 3SG
IA-{3, 8} ου in 1SG, 2SG, and 3SG
IA-{1, 7} ετε in 2PL

and, ignoring accents:

IA-{4, 9} ατε in 2PL

But given the closed nature of IA-7 and up, many of these will be easy to disambiguate. We’ll go through the details in the next post.


Consolidating Vocabulary Coverage and Ordering Tools

One of my goals for 2019 is to bring more structure to various disparate Greek projects and, as part of that, I’ve started consolidating multiple one-off projects I’ve done around vocabulary coverage statistics and ordering experiments.

Going back at least 15 years (when I first started blogging about Programmed Vocabulary Learning) I’ve had little Python scripts all over the place to calculate various stats, or try out various approaches to ordering.

I’m bringing all of that together in a single repository and updating the code so:

  • it’s all in one place
  • it’s usable as a library in other projects or in things like Jupyter notebooks
  • it can be extended to arbitrary chunking beyond verses (e.g. books, chapters, sentences, paragraphs, pericopes)
  • it can be extended to other texts such as the Apostolic Fathers, Homer, etc (other languages too!)

I’m partly spurred on by a desire to explore more of the stuff Seumas Macdonald and I have been talking about and to be more responsive to the occasional inquiries I get from Greek teachers. Also, I have a poster Vocabulary Ordering in Text-Driven Historical Language Instruction: Sequencing the Ancient Greek Vocabulary of Homer and the New Testament that was accepted for EuroCALL 2019 in August, and this code library helps me not only produce the poster but also make it more reproducible.

Ultimately I hope to write a paper or two out of it as well.

I’ve started the repo at:

where I’ve basically rewritten half of my existing code from elsewhere so far. I’ve reproduced the code for generating core vocabulary lists and also the coverage tables I’ve used in multiple talks (including my BibleTech talks in 2010 and 2015).

I’ve taken the opportunity to generalise and decouple the code (especially with regard to the different chunking systems) and also make use of newer Python stuff like Counter and dictionary comprehensions which simplifies much of my earlier code.
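As an illustration of what Counter and a dictionary comprehension buy you, here’s a toy coverage calculation (invented chunk refs and lemma lists purely for illustration; this isn’t the library’s actual API):

```python
from collections import Counter

def coverage(chunks, vocab):
    """Proportion of tokens in each chunk whose lemma is in the given vocabulary."""
    return {
        ref: sum(1 for lemma in lemmas if lemma in vocab) / len(lemmas)
        for ref, lemmas in chunks.items()
    }

# made-up chunk refs and lemmas
chunks = {
    "1.1": ["ὁ", "λόγος", "εἰμί", "ὁ"],
    "1.2": ["λέγω", "ὁ", "ἄνθρωπος"],
}
counts = Counter(lemma for lemmas in chunks.values() for lemma in lemmas)
core = {lemma for lemma, _ in counts.most_common(2)}  # a top-2 "core vocabulary"
print(coverage(chunks, core))
```

Because the chunks are just a mapping from reference to lemma list, the same two functions work unchanged whether the chunks are verses, chapters, sentences, or pericopes.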

There are a lot of little things you can do with just a couple of lines of Python and I’ve tried to avoid turning those into their own library of tiny functions. Instead, I’m compiling a little tutorial / cookbook as I go which you can read the beginnings of here:

There’s still a fair bit more to move over (even going back 11 years to some stuff from 2008) but let me know if you have any feedback, questions, or suggestions. I’m generalising more and more as I go so expect some things to change dramatically.

If you’re interested in playing around with this stuff for corpora in other languages, let me know how I can help you get up and running. The main requirement is a tokenised and lemmatised corpus (assuming you want to work with lemmas, not surface forms, as vocabulary items) and also some form of chunking information. See for the GNT-specific stuff that would (at least partly) need to be replicated for another corpus.


Initial Apostolic Fathers Text Complete

Exactly three months ago to the day, I announced that Seumas Macdonald and I were working on a corrected, open, digital edition of the Apostolic Fathers based on Lake. That initial work is now complete.

Preparing an Open Apostolic Fathers discussed the original motivation and the rather detailed process we went through.

The corrected raw text files are available on GitHub at but I also generated a static site at to browse the texts. The corrections will be contributed back to the OGL First1KGreek project.

The next step for us will be to lemmatise the text and there has already been some interest from others in getting the English translation corrected and aligned as well.

Recall that, while we were essentially correcting the Open Greek and Latin text, we used the CCEL text and that in Logos to identify particular places to look at in the printed text. We did this by lining up the CCEL, OGL and Logos texts and seeing where any of them disagreed. Those became the places we went back to, in multiple scans of the printed Lake, to make our corrections to the base text we started with from OGL.

How often did each of those three “witnesses” disagree? Here are some stats. A = CCEL, B = OGL, C = Logos. And so AB/C is where CCEL and OGL agreed against Logos, AC/B is where CCEL and Logos agreed against OGL, A/BC is where OGL and Logos agreed against CCEL, and A/B/C is where all three disagreed.
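Token by token, that classification could be sketched like this (a simplification, of course; the real alignment work was messier):

```python
from collections import Counter

def classify(a, b, c):
    """Classify one aligned token triple from A=CCEL, B=OGL, C=Logos."""
    if a == b == c:
        return "agree"   # full agreement: not a disagreement at all
    if a == b:
        return "AB/C"    # CCEL and OGL against Logos
    if a == c:
        return "AC/B"    # CCEL and Logos against OGL
    if b == c:
        return "A/BC"    # OGL and Logos against CCEL
    return "A/B/C"       # all three differ

def disagreement_rates(triples):
    """Percentage of aligned token positions in each disagreement category."""
    counts = Counter(classify(*t) for t in triples)
    total = len(triples)
    return {k: 100 * counts[k] / total for k in ("AB/C", "AC/B", "A/BC", "A/B/C")}
```

Every position falling outside "agree" became a place to go back and check against the printed Lake.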

Work AB/C AC/B A/BC A/B/C
001 1.29% 1.15% 7.97% 0.32%
002 0.76% 1.20% 3.39% 0.37%
003 1.58% 2.20% 4.97% 0.28%
004 0.57% 1.33% 7.01% 0.28%
005 1.05% 1.79% 6.21% 0.84%
006 0.88% 1.18% 7.54% 0.69%
007 0.39% 0.88% 3.34% 0.20%
008 0.79% 0.87% 5.41% 0.44%
009 0.25% 1.53% 2.68% 0.38%
010 0.44% 4.05% 4.36% 0.25%
011 0.36% 1.86% 4.23% 0.14%
012 0.92% 1.15% 5.59% 0.43%
013 1.29% 0.90% 6.08% 0.34%
014 1.25% 0.34% 4.91% 0.08%
015 0.96% 0.65% 6.74% 0.50%
TOTAL 1.11% 1.12% 5.98% 0.34%

One can immediately see CCEL diverged the most from the others (it had considerable lacunae for a start). The numbers involving Logos diverging are probably inflated because of a weird systematic error we only noticed after work had started: a middle dot was often erroneously added after eta. This ultimately didn’t affect anything other than perhaps flagging places Seumas and I had to check that we otherwise wouldn’t have needed to.

But at the end of the day, how much did we change? How much of the OGL original remained? How similar was our result to the text on CCEL? And for a bit of fun, how often was my first correction and Seumas’s first correction the same as what we ended up with after consensus was achieved? Here’s the breakdown by work (columns: similarity to CCEL, similarity to OGL, my first pass’s agreement with the consensus, Seumas’s first pass’s agreement with the consensus):

001 91.27% 99.02% 99.85% 99.91%
002 96.02% 98.90% 99.77% 99.90%
003 94.58% 97.63% 99.77% 99.60%
004 92.42% 98.48% 99.91% 100.00%
005 92.32% 98.32% 99.79% 99.89%
006 91.28% 98.82% 98.82% 99.80%
007 96.07% 98.92% 99.90% 99.90%
008 93.89% 99.30% 100.00% 99.91%
009 96.82% 99.75% 98.60% 99.87%
010 94.94% 96.27% 99.87% 99.68%
011 95.04% 98.54% 99.77% 99.91%
012 93.86% 98.78% 99.87% 99.90%
013 93.15% 99.20% 99.87% 99.83%
014 94.90% 99.62% 99.92% 99.74%
015 92.69% 99.16% 99.96% 99.62%
TOTAL 93.32% 98.97% 99.83% 99.84%

You just beat me Seumas :-)
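Incidentally, pairwise figures like these need nothing more than difflib over token lists. A sketch of one way to compute them (not necessarily the exact method we used):

```python
import difflib

def token_similarity(tokens_a, tokens_b):
    """Similarity ratio (0.0 to 1.0) between two tokenised texts, based on
    the longest matching blocks of tokens found by SequenceMatcher."""
    return difflib.SequenceMatcher(None, tokens_a, tokens_b).ratio()

print(token_similarity("ὁ λόγος ἦν".split(), "ὁ λόγος ἐστίν".split()))
```

Comparing token lists rather than raw strings means a single corrected word counts as one change, regardless of how many characters it touched.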