Conference Time

I’m off for another string of conferences, this time in Copenhagen, Chicago, and New Orleans.

First is a workshop on Original Language Resources for Bible Translation and Education organised by Nicolai Winther-Nielsen of the Global Learning Initiative and Reinier de Blois of the United Bible Societies. David Instone-Brewer put it best when he responded to the workshop invitation with “All the key people in one place with lots of time to talk and plan. How could I miss this?” Perhaps most exciting for me is that I will finally meet Ulrik Sandborg-Petersen for the first time after more than twelve years of working together!

I fly from Copenhagen to Chicago at the end of the week for the annual conference of the American Association for Applied Linguistics. It will be my first time attending the conference and I’m looking forward to learning a lot (although in contrast to the Copenhagen workshop, I’ll know virtually no one).

I have to leave AAAL slightly early, though, to go down to New Orleans for the first US VueConf. Vue.js is an important technology in the Scaife Viewer and DeepReader reading environments. I went to the first European VueConf last year and gave a lightning talk on DeepReader. I had hoped to give a talk on the Scaife Viewer at VueConf US but my talk wasn’t accepted, so I’m hoping at least for another lightning talk.


A Tour of Greek Morphology: Part 21

Part twenty-one of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

I started this series with the following:

I ultimately hope to cover everything that a beginner-intermediate grammar might but in a much more exploratory fashion. I’ll occasionally touch on morphological theory but I mostly want to point out phenomena in the language that students have already seen but perhaps have not thought about in any depth.

(emphasis added)

In short, the primary goal has been (and will continue to be) to take data the reader already is assumed to know and to make observations and construct relationships that the reader perhaps didn’t already realise or know. The secondary goal is to talk a little bit about linguistic theory and historical linguistics in relation to the specific phenomena being discussed.

Now that we’ve finished our first pass over (particularly the endings of) the present indicatives and infinitives, I wanted to summarise a few key points we’ve touched on that are of a more conceptual nature.

  • A paradigm is a way of showing related forms next to one another for comparison. We often keep some morphosyntactic properties constant while varying others. We often, but not always, keep the lexeme constant.
  • We can look at paradigms along (at least) three dimensions: (1) we can take one lexeme’s inflection and look at what stays the same and what changes in different cells; (2) we can take a morphosyntactic property set and look at what stays the same and what changes across different lexemes; (3) we can take a subset of morphosyntactic properties and vary them while keeping the rest of the set (and the lexeme) fixed.
  • Greek rarely has a one-to-one mapping between an individual morphosyntactic property and some surface property of the inflected form.
  • There are some cells in a paradigm that are highly predictable and others that are highly predictive.
  • There are relationships between cells which are often more helpful than relationships between a cell and its underlying or historical stem.
  • The primary role of morphology is to discriminate between alternatives, not build up compositional meaning.
  • Ambiguity in morphology can be tolerated if other things (syntax, context) help disambiguate.
  • There is a big difference between looking at patterns in the surface forms and exploring the historical reasons those patterns developed. While the latter is vital for answering “why”, it is not a crucial part of language acquisition. (Native English speakers don’t acquire strong verbs by understanding how Proto-Indo-European ablaut patterns led to Germanic inflectional classes!)

As well as these conceptual points, we’ve talked about the actual endings, inflectional classes, vowel contractions, frequency effects, and which cells might be the best to use as a lemma.

We also spent time actually testing our models against the corpus data with some Python scripts and showed how that uncovered some patterns we hadn’t previously considered.

We haven’t looked at everything to do with the presents, but it’s time to move on, at least for a while, to a different part of the verbal system.

That said, if you have any questions about the previous twenty parts, or any questions you’re hoping will be answered in subsequent posts, just leave a comment (or email me if you want to ask anonymously).


A Tour of Greek Morphology: Part 20

Part twenty of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In part 17, we went through counts for our present active (infinitive and indicative) classes. Now we’ll wrap things up by doing the same for the middle.

Recall this is based on the analysis of 820 tokens available here, which was described in the last two parts.

Let us first of all look at the number of distinct lemmas in each of our 14 classes.

class       lemmas  description
PM-1        105     barytone thematics with INF -εσθαι / 3SG -εται
PM-2        21      circumflex thematics with INF -εῖσθαι / 3SG -εῖται
PM-3        4       circumflex thematics with INF -οῦσθαι / 3SG -οῦται (ζηλόω, ἐλαττόω, λυτρόομαι, διαβεβαιόομαι)
PM-4        11      circumflex thematics with INF -ᾶσθαι / 3SG -ᾶται
PM-5        2       circumflex thematics with INF -ῆσθαι / 3SG -ῆται (χράομαι and compound)
PM-6a       3       INF -υσθαι / 3SG -υται (ἀπόλλυμι, ἐνδείκνυμι, συναναμίγνυμι)
PM-7        3       INF -εσθαι / 3SG -εται (compounds of τίθημι)
PM-8        -       INF -οσθαι / 3SG -οται
PM-9        8       INF -ασθαι / 3SG -αται (δύναμαι, compounds of ἵστημι)
PM-10       -       ἧμαι
PM-10-COMP  1       compounds of ἧμαι (κάθημαι)
PM-11       1       κεῖμαι
PM-11-COMP  7       compounds of κεῖμαι
PM-12       1       οἶμαι

Again, even the small counts are elevated due to compound verbs. Folding compounds of the same base verb, only PM-1, PM-2, PM-3, PM-4, and PM-6a have more than one or two members (and PM-6a only has three).

This is just looking at the number of unique lemmas in each class but there are two other sets of numbers that are worth looking at: (1) the total number of tokens in the SBLGNT; (2) the distribution of classes amongst the hapax legomena.

class       lemmas  tokens  hapax  hapax details
PM-1        105     523     45
PM-2        21      57      7
PM-3        4       5       3      ζηλόω ἐλαττόω λυτρόομαι
PM-4        11      33      4      μυκάομαι κοιμάομαι καταράομαι ἐγκαυχάομαι
PM-5        2       2       2      χράομαι and συγχράομαι
PM-6a       3       9       -
PM-7        3       5       2      διατίθεμαι and μετατίθημι
PM-8        -       -       -
PM-9        8       156     4      ἐξίστημι ἐφίστημι ἀνθίστημι ἀφίσταμαι
PM-10       -       -       -
PM-10-COMP  1       5       -
PM-11       1       9       -
PM-11-COMP  7       15      -
PM-12       1       1       1      οἶμαι

Recall the hapax legomena matter because they give an indication of what classes were still productive.

If we fold compounds under their base verb, only PM-1, PM-2, PM-3, and PM-4 have more than one hapax legomenon.
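A minimal sketch of how per-class counts like these could be tallied, assuming the analysis is available as (lemma, class) pairs, one per token. The function name `class_summary`, the `corpus_freq` dictionary (lemma to whole-corpus frequency) and the `base_of` mapping (compound lemma to base verb) are hypothetical inputs for illustration, not the scripts actually used for this series, and the sample data is made up.

```python
from collections import Counter, defaultdict

def class_summary(tokens, corpus_freq, base_of=None):
    """Tally lemmas, tokens and hapax legomena per inflectional class.

    tokens: iterable of (lemma, class_label) pairs, one per token.
    corpus_freq: lemma -> frequency in the whole corpus (assumed input).
    base_of: optional compound lemma -> base verb mapping used for folding.
    """
    base_of = base_of or {}
    token_counts = Counter()
    lemmas = defaultdict(set)
    for lemma, label in tokens:
        token_counts[label] += 1
        lemmas[label].add(lemma)

    summary = {}
    for label, members in lemmas.items():
        hapaxes = {lemma for lemma in members if corpus_freq.get(lemma) == 1}
        # Folding: compounds of the same base verb count only once.
        folded = {base_of.get(lemma, lemma) for lemma in hapaxes}
        summary[label] = {
            "lemmas": len(members),
            "tokens": token_counts[label],
            "hapax": len(hapaxes),
            "hapax_folded": len(folded),
        }
    return summary

# Made-up placeholder data (not real lemmas or counts).
tokens = [
    ("base-a", "X-1"), ("base-a", "X-1"),
    ("pre1-base-b", "X-2"), ("pre2-base-b", "X-2"),
]
corpus_freq = {"base-a": 2, "pre1-base-b": 1, "pre2-base-b": 1}
base_of = {"pre1-base-b": "base-b", "pre2-base-b": "base-b"}
summary = class_summary(tokens, corpus_freq, base_of)
```

In the placeholder data, class X-2 has two hapax legomena that fold into a single base verb, which is the same effect that reduces the hapax count for the compound-heavy classes above.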

Let’s now look at counts for each paradigm cell for each class:

cell   PM-1  PM-2  PM-3  PM-4  PM-5  PM-6a  PM-7  PM-8  PM-9  PM-10-C  PM-11  PM-11-C  PM-12
INF      89    15     4     8     -      4     -     -    12        2      -        3      -
1SG      85    17     -     3     -      1     4     -     9        1      1        -      1
2SG      19     1     -     5     -      -     -     -     7        -      -        -      -
3SG     228     7     -     8     -      -     -     -    74        2      7       11      -
1PL      20     4     -     3     1      3     -     -     9        -      1        -      -
2PL      24     9     -     3     -      -     1     -    32        -      -        -      -
3PL      58     4     1     3     1      1     -     -    13        -      -        1      -
total   523    57     5    33     2      9     5     -   156        5      9       15      1

As in the active, the 3SG and INF dominate with only a few interesting exceptions. The third person (especially 3SG but also 3PL) is unusually low in PM-2. In PM-9, the 2PL is unusually high. This is almost certainly just because of particular lexical items that happen to be in those classes rather than an inherent characteristic of the class itself, although because the origins of some classes are derivational, there may occasionally be tendencies on semantic grounds.

If the goal is just to identify the person/number, not the class (which is true in reception but not learning), then most of these numbers collapse because of shared endings. Here are the counts just focused on the common endings (without accents):

INF  -σθαι  137
1SG  -μαι   122
2SG  -{ι}    25
     -σαι     7
3SG  -ται   337
1PL  -μεθα   41
2PL  -σθε    69
3PL  -νται   82
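The collapsed counts are just row sums of the cell-by-class table. As a quick check, here is a short sketch with the table typed in by hand (a dash is recorded as 0). Treating the 7 tokens of 2SG -σαι as the PM-9 column is an assumption based on the numbers matching, not something stated outright.

```python
# Column order follows the cell-by-class table above.
classes = ["PM-1", "PM-2", "PM-3", "PM-4", "PM-5", "PM-6a", "PM-7",
           "PM-8", "PM-9", "PM-10-C", "PM-11", "PM-11-C", "PM-12"]

# Token counts per paradigm cell, one value per class ("-" entered as 0).
cells = {
    "INF": [89, 15, 4, 8, 0, 4, 0, 0, 12, 2, 0, 3, 0],
    "1SG": [85, 17, 0, 3, 0, 1, 4, 0, 9, 1, 1, 0, 1],
    "2SG": [19, 1, 0, 5, 0, 0, 0, 0, 7, 0, 0, 0, 0],
    "3SG": [228, 7, 0, 8, 0, 0, 0, 0, 74, 2, 7, 11, 0],
    "1PL": [20, 4, 0, 3, 1, 3, 0, 0, 9, 0, 1, 0, 0],
    "2PL": [24, 9, 0, 3, 0, 0, 1, 0, 32, 0, 0, 0, 0],
    "3PL": [58, 4, 1, 3, 1, 1, 0, 0, 13, 0, 0, 1, 0],
}

# Summing across classes gives the count for each person/number (or INF).
cell_totals = {cell: sum(counts) for cell, counts in cells.items()}

# Summing down a column gives the token count for each class.
class_totals = {
    name: sum(counts[i] for counts in cells.values())
    for i, name in enumerate(classes)
}

# Assumption: the 2SG -σαι tokens are the PM-9 column; the rest are -{ι}.
sai_2sg = cells["2SG"][classes.index("PM-9")]
iota_2sg = cell_totals["2SG"] - sai_2sg

grand_total = sum(cell_totals.values())
```

The row sums reproduce the ending counts (for example 337 for 3SG -ται), the 2SG row splits into 25 and 7, and the grand total comes to the 820 tokens the analysis is based on.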

And that’s it for the present middles. I’ll do a brief summary post next and then we’ll start exploring beyond the presents.


New Draft Morphological Tags for MorphGNT

I’ve finally done the work in translating the MorphGNT tagging system to a new proposal for initial feedback.

At least going back to my initial collaboration with Ulrik Sandborg-Petersen in 2005, I’ve been thinking about how I would do morphological tags in MorphGNT if I were starting from scratch.

Much later, in 2014, I had some discussions with Mike Aubrey at my first SBL conference and put together a straw proposal. It rethought some parts of speech and the handling of tense/aspect, voice, syncretism, and underspecification.

Even though some of the ideas were more drastic than others, a few things have remained consistent in my thinking:

  • there is value in a purely morphological analysis that doesn’t disambiguate on syntactic or semantic grounds
  • this analysis does not need the notion of parts-of-speech beyond purely Morphological Parts of Speech
  • this analysis should not attempt to distinguish middles and passives in the present or perfect system

As part of the handling of syncretism and underspecification, I had originally suggested a need for a value for the case property that didn’t distinguish nominative and accusative and a need for a value for the gender property like “non-neuter”.

In the absence of feedback beyond a vague feeling that something like this should be done, I didn’t immediately make further progress but, a year later, started gathering more notes on handling ambiguity. That then led to a more concrete proposal just around gender and case (although not without open questions).

I’ve now implemented those smaller-scale proposals as a first draft for the MorphGNT SBLGNT and plan to apply them to other GNT texts soon. The new-tags branch for MorphGNT SBLGNT is available at: https://github.com/morphgnt/sblgnt/tree/new-tags.

This adds a new column (the intention is not to replace existing analyses yet, just augment them) that:

  • makes voice formal not functional (while still using P in the aorist and future for what Carl Conrad would call MP2)
  • does not give morphosyntactic properties for uninflected words
  • implements basic nominative/accusative case syncretism in the neuter with a single value
  • implements basic non-neuter, non-feminine, and (in most genitive plurals) complete gender syncretism with a value for each

One immediate effect of this is that a list I have from Randall Tan of disagreements between the MorphGNT SBLGNT analysis and that of the Nestle 1904 largely goes away because many of them were merely different judgements of gender or case on non-morphological grounds. This new tag retains the uncertainty. Another benefit of the tagging scheme is that it provides a reasonable output for an automated morphological analysis system which can then, in a separate step, be disambiguated syntactically (or semantically), potentially with human input.

There are some important things to note, however, as just saying “this is a purely morphological analysis that doesn’t disambiguate” oversimplifies things greatly.

Firstly, while punting distributional and semantic part-of-speech questions like “is this an adverb or a conjunction” or “what type of pronoun is this” is extremely helpful, there are still some questions that impact a purely morphological tagging such as whether to represent a fossilised verb acting as a particle as having morphological inflection.

Secondly, I have not modelled what I have called extended syncretisms, where there can be uncertainty between properties taken as a pair. For example, 1st person singular vs 3rd person plural in -ον, or 1st declension genitive singular vs accusative plural in -ας. It may still be worth conveying this ambiguity but just through disjunction, saying for example that a word is GSF^APF. These are almost always phonological coincidences rather than structural syncretism and so should be modelled differently.

Related to this is the “double” syncretism between accusative singular masculine and neuter on the one hand and nominative and accusative singular neuter on the other hand. If we model the latter as CSN then we’ve lost the former (which, if by itself could be modelled as ASY). So, in a sense CSN and ASY are syncretic (but also share an overlapping cell). CSN^ASY doesn’t quite seem right because of that overlap and the fact that this isn’t just a phonological coincidence as best I know.
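To make the overlap concrete, here is a small sketch that expands an underspecified or disjunctive tag into the set of fully specified cells it covers. It assumes a simplified three-letter case-number-gender tag and only knows the two underspecified values discussed here (C for nominative or accusative, Y for masculine or neuter); the function name `expand` is hypothetical and the real draft scheme has more values than this.

```python
from itertools import product

# Only the underspecified values named in this post (an illustrative subset).
CASE = {"C": "NA"}    # C = nominative or accusative
GENDER = {"Y": "MN"}  # Y = masculine or neuter, i.e. non-feminine

def expand(tag):
    """Expand a tag such as 'CSN' or 'GSF^APF' into fully specified cells."""
    cells = set()
    for part in tag.split("^"):          # "^" separates disjuncts
        case, number, gender = part      # assumes exactly three letters
        for c, g in product(CASE.get(case, case), GENDER.get(gender, gender)):
            cells.add(c + number + g)
    return cells

csn = expand("CSN")        # nominative/accusative singular neuter
asy = expand("ASY")        # accusative singular masculine/neuter
shared = csn & asy         # the cell both tags cover
disjunction = expand("GSF^APF")
```

Under these assumptions CSN covers NSN and ASN, ASY covers ASM and ASN, and the two share the single cell ASN, which is the overlap that makes a plain CSN^ASY disjunction feel wrong.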

Thirdly, I have only modelled basic syncretism, not endings in wildly different parts of the paradigms (so would definitely not be called syncretism) that also happen to have converged by phonological change. For example both -ου and -ον can be nominal endings or unrelated verbal endings (with quite a few interpretations, mind you, especially for -ου). No attempt has been made to capture this in a single tag (although a disjunctive representation might be possible).

And finally (although related to the previous point), a certain amount of lexical disambiguation is applied. There are many cases where not being familiar with the lexeme makes a form highly ambiguous but that ambiguity goes away if the lemma is known. A simple example is imperfects versus second aorists where the principal parts resolve the ambiguity. The draft new tags for MorphGNT SBLGNT effectively assume the lemmatisation has been done and is correct.

In light of this, some people might be surprised that υἱοῦ is tagged GSY and not GSM given it’s lexically masculine. My current argument (at least in my own head) is that, regardless of a specific lexeme like υἱοῦ, GSM, as a morphological tag, doesn’t really make sense in the Greek paradigmatic system because, by nature, genitive singulars have the same form in the non-feminines. I think there’s definitely a difference, if subtle, between true ambiguity and underspecification. It’s not that υἱοῦ is ambiguous as to gender, it’s just that the cell doesn’t distinguish masculine from neuter. Lexical knowledge is still being used, otherwise it could be feminine (or even a middle imperative!).

So, in short, syncretism inherent to the paradigmatic system is captured well but other forms of ambiguity will need to be handled in other ways (potentially via a disjunctive list of possibilities). This seems a reasonable, practical compromise.

Let me know your thoughts. There’s definitely still more to do and I do plan on expressing more ambiguity with some form of disjunction. I’ll probably do a post soon with some more thoughts (and stats) on that.


Lexical Dispersion in the Greek New Testament Via Gries's DP

Measures of dispersion are interesting to apply to a corpus because they tell you whether a word is distributed across parts of the corpus as expected or concentrated more in just some parts. I thought I’d play around with Gries’s DP as a measure of dispersion on the SBLGNT lemmas.

There are lots of measures of dispersion but Stefan Th. Gries’s is perhaps the simplest (see [1] for a detailed survey of lots of different measures as well as the original definition of his own).

Here it is in Python for lemmas:

dp = sum(abs((p[part] / t) - (lp[lemma][part] / l[lemma])) for part in p) / 2


  • p is a dictionary mapping each corpus part to the count of words in that part
  • t is the total count of words in the corpus (the sum of p[part] over all parts)
  • l is a dictionary mapping each lemma to the count of that lemma in the corpus
  • lp is a dictionary of dictionaries mapping lemmas and parts to the count of the lemma in that part

but see [1] for some simple worked examples.
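For a runnable version, here is a minimal sketch that builds the three dictionaries from a flat list of (part, lemma) pairs and then applies the same formula, with t taken to be the total number of words in the corpus. The function name `gries_dp` and the tiny two-part corpus are made up for illustration. Using `Counter` means a lemma that never occurs in a part simply counts as 0 there instead of raising a `KeyError`.

```python
from collections import Counter, defaultdict

def gries_dp(tokens):
    """Return {lemma: DP} for an iterable of (part, lemma) pairs."""
    p = Counter()               # part -> words in that part
    l = Counter()               # lemma -> occurrences in the corpus
    lp = defaultdict(Counter)   # lemma -> part -> occurrences in that part
    for part, lemma in tokens:
        p[part] += 1
        l[lemma] += 1
        lp[lemma][part] += 1
    t = sum(p.values())         # total words in the corpus

    return {
        lemma: sum(
            abs(p[part] / t - lp[lemma][part] / l[lemma]) for part in p
        ) / 2
        for lemma in l
    }

# Made-up corpus: two parts of two words each.
tokens = [("A", "x"), ("A", "y"), ("B", "x"), ("B", "z")]
dp = gries_dp(tokens)
```

Here x occurs once in each equally sized part, so its observed proportions match the expected ones and its DP is 0. Both y and z occur only in one part, so half of their expected spread is missing and their DP is 0.5.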

One thing Gries doesn’t talk about (email me if you know of any discussion of this) is how to handle very low frequency words as they’ll dominate the high DP values.

Using books as the parts, here are the top 10 most evenly dispersed lemmas in the GNT:

0.0466 ὁ
0.1085 εἰς
0.1154 καί
0.1178 ὅς
0.1250 εἰμί
0.1358 ποιέω
0.1382 γίνομαι
0.1385 πολύς
0.1395 μετά
0.1420 μή

Here are the top 10 least evenly dispersed lemmas (including all frequencies, even hapax legomena):

0.9984 φιλοπρωτεύω
0.9984 ἐπιδέχομαι
0.9984 μειζότερος
0.9984 Διοτρέφης
0.9984 φλυαρέω
0.9982 χάρτης
0.9982 κυρία
0.9976 προσοφείλω
0.9976 ἑκούσιος
0.9976 ἄχρηστος

but this list looks very different if we, say, restrict ourselves to lemmas that occur 5 times or more:

0.9827 ἀντίχριστος
0.9752 καταλαλέω
0.9687 ἐπιφάνεια
0.9681 νήφω
0.9680 ἀρετή
0.9667 μῦθος
0.9641 Μελχισέδεκ
0.9568 πλεονεκτέω
0.9557 νόημα
0.9532 ἐνέργεια

or 30 times or more:

0.8952 ἀρνίον
0.8085 καυχάομαι
0.8024 θηρίον
0.7987 μέλος
0.7969 εἴτε
0.7266 συνείδησις
0.7202 περιτομή
0.7199 θρόνος
0.7139 ὑποτάσσω
0.7116 Παῦλος
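The reason the unfiltered list is all hapax legomena follows from the formula: a lemma that occurs exactly once has all of its occurrences in one part, so its DP works out to 1 minus that part's share of the corpus, and the hapax legomena in the smallest parts end up closest to 1. A frequency floor is one simple way around that. The sketch below assumes DP values and frequencies are already in two dictionaries; the function name `least_dispersed` and the sample values are placeholders, not real data.

```python
def least_dispersed(dp, freq, min_count=1, n=10):
    """Return the n lemmas with the highest DP among those with
    frequency >= min_count, least evenly dispersed first."""
    candidates = [lemma for lemma in dp if freq[lemma] >= min_count]
    return sorted(candidates, key=lambda lemma: dp[lemma], reverse=True)[:n]

# Placeholder values for illustration only.
dp = {"lemma_a": 0.99, "lemma_b": 0.90, "lemma_c": 0.70}
freq = {"lemma_a": 1, "lemma_b": 6, "lemma_c": 40}

unfiltered = least_dispersed(dp, freq, n=2)
at_least_5 = least_dispersed(dp, freq, min_count=5, n=2)
```

With no floor the hapax `lemma_a` heads the list; with a floor of 5 it drops out and the ranking starts with `lemma_b`. Where to set the floor is a judgement call that the formula itself does not settle.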

If we use chapters as the corpus division, we get a slightly different top ten most evenly distributed by Gries’s DP:

0.0677 ὁ
0.1440 καί
0.1913 εἰμί
0.2084 εἰς
0.2117 αὐτός
0.2259 ἐν
0.2366 οὗτος
0.2378 ὅς
0.2437 δέ
0.2561 οὐ

and obviously this is even more problematic for lower frequency words at the other end.

It’s interesting to look, though, at chapters within a single book. For example, here are the most evenly distributed lemmas in John’s gospel using chapters for parts:

0.0574 ὁ
0.0867 καί
0.0977 αὐτός
0.1331 οὐ
0.1391 οὗτος
0.1440 ὅτι
0.1480 λέγω
0.1569 δέ
0.1576 εἰμί
0.1658 εἰς

and here are the least evenly distributed lemmas that occur at least 10 times:

0.9470 σταυρόω
0.9414 Ἀβραάμ
0.9126 νίπτω
0.8958 Πιλᾶτος
0.8914 πρόβατον
0.8812 Λάζαρος
0.8493 καρπός
0.8426 ἄρτος
0.8371 προσκυνέω
0.8221 ψυχή

Obviously Gries’s DP is extremely easy to calculate, and I plan to experimentally include it in the Greek Vocabulary Tool for the Perseus Project but there are still some things to work out with low frequency words.

It’s very interesting, though, as a way of contrasting words that otherwise have the same frequency in a corpus. For example, here are all the lemmas that occur exactly 30 times in the SBLGNT, with their book-based Gries’s DP:

0.3276 διδαχή
0.3558 ἐγγύς
0.3708 σκότος
0.4143 ἀγοράζω
0.5360 σκανδαλίζω
0.5833 συνέρχομαι
0.6230 ἴδε
0.6485 ἐπικαλέω
0.7266 συνείδησις
0.8952 ἀρνίον

There is a massive range in the DP which I think is quite illustrative.

Here is the list with their chapter-based DP (notice how high the lowest DP now is):

0.8769 ἀγοράζω
0.8821 σκότος
0.8869 συνέρχομαι
0.8958 σκανδαλίζω
0.9016 ἐγγύς
0.9016 διδαχή
0.9034 ἴδε
0.9083 ἐπικαλέω
0.9441 συνείδησις
0.9609 ἀρνίον

One of my reasons for exploring Gries’s DP (and potentially other measures of lexical dispersion) is the application to language learning. My sense is that dispersion might be a useful input to deciding what vocabulary to learn. For example διδαχή or σκότος might be better to learn before ἀρνίον because, even though they all have the same frequency, you are more likely to encounter διδαχή or σκότος in a random book or chapter.
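One way this could feed into a vocabulary list, purely as a sketch: sort by frequency first and use DP to break ties, so that among equally frequent lemmas the more evenly dispersed ones come first. The three entries below use the book-based DP values from the exactly-30-times list above; the ordering rule itself is an assumption for illustration, not a settled proposal.

```python
# (lemma, frequency in the SBLGNT, book-based DP) from the list above.
vocab = [
    ("ἀρνίον", 30, 0.8952),
    ("σκότος", 30, 0.3708),
    ("διδαχή", 30, 0.3276),
]

# Most frequent first; among equal frequencies, lowest DP (most even) first.
ordered = sorted(vocab, key=lambda item: (-item[1], item[2]))
learning_order = [lemma for lemma, _, _ in ordered]
```

With these values διδαχή and σκότος come before ἀρνίον. A weighted combination of frequency and DP would be another option, but that would need some experimentation.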

[1] Gries, Stefan Th. (2008) Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13:4. John Benjamins.