
Comparing Analyses from Herodotus and more...

Comparing Analyses from Herodotus

An analysis I did of a couple of chapters of Herodotus looks like it might be an interesting example to use for various treebanking approaches—both in terms of how things are structured as well as how they are visualised.

As the last assignment for my Postgraduate Diploma in Ancient Greek, I had to write a brief commentary on Herodotus 2.35–36, which catalogs (with hasty generalisations galore) differences between Egypt and the rest of the world. The catalog consists of a series of statements of the form “Egyptians do THIS whereas everyone else does THAT” or “[In Egypt] the men do THIS and the women do THAT [as opposed to the other way around like everywhere else]”.

In his commentary, Lloyd notes that this sort of catalog could be quite monotonous but that Herodotus avoids this through “skilful stylistic variation”. My commentary spent a decent proportion of its short word count digging deeper into this variation.

Quite coincidentally, Greg Crane recently sent me some examples of student treebanking in the context of how to compare analyses, and they happened to be of Herodotus 2.35. They differ from each other and from my own way of thinking about the sentences. Note that these aren’t difficult or ambiguous sentences, though! The syntax is easy; I just don’t think most analysis conventions and visualisation tools do a great job of capturing what’s going on.

In my assignment, I started off presenting a canonical example of the construction and it’s that example that I want to show here. The original sentence is

τὰ ἄχθεα οἱ μὲν ἄνδρες ἐπὶ τῶν κεφαλέων φορέουσι, αἱ δὲ γυναῖκες ἐπὶ τῶν ὤμων.

But I started off considering these sentences:

οἱ ἄνδρες τὰ ἄχθεα ἐπὶ τῶν κεφαλέων φορέουσι

αἱ γυναῖκες τὰ ἄχθεα ἐπὶ τῶν ὤμων φορέουσι

The verb (in the present, as always in these comparisons), direct object, and prepositional phrase construction are identical. What is being contrasted is how the particular location (the complement in the prepositional phrase) varies with the subject.

Herodotus sets up the contrast with μέν and δέ postpositives.

οἱ μὲν ἄνδρες τὰ ἄχθεα ἐπὶ τῶν κεφαλέων φορέουσι

αἱ δὲ γυναῖκες τὰ ἄχθεα ἐπὶ τῶν ὤμων φορέουσι

He then alters the “constants” in the comparison, topicalising the direct object and eliding repetition of the verb. This results in:

τὰ ἄχθεα
        οἱ … ἄνδρες
            ἐπὶ τῶν κεφαλέων φορέουσι
        αἱ … γυναῖκες
            ἐπὶ τῶν ὤμων [φορέουσι]

The above was an indented structure I manually constructed for my commentary. It’s not machine actionable and is missing a lot but I think it does a decent job of capturing some of what’s going on. It makes clear:

  • the topicalisation of τὰ ἄχθεα
  • the μέν and δέ construction as a whole
  • the elision of the verb

It is these three properties that I think make this a particularly interesting example.
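To show what “machine actionable” might look like, here is a minimal Python sketch of the indented structure above. The node shapes (“topic”, “coordination”, “conjuncts”) and the helper function are my own hypothetical conventions for illustration, not an existing treebank format:

```python
# A minimal, machine-actionable sketch of the indented structure above.
# The node shapes used here are hypothetical conventions, not a standard.

sentence = {
    "topic": "τὰ ἄχθεα",  # topicalised direct object, shared by both conjuncts
    "coordination": {
        "particles": ["μέν", "δέ"],  # treated as equal partners tagging the contrast
        "conjuncts": [
            {"subject": "οἱ ἄνδρες",
             "pp": "ἐπὶ τῶν κεφαλέων",
             "verb": "φορέουσι"},
            {"subject": "αἱ γυναῖκες",
             "pp": "ἐπὶ τῶν ὤμων",
             "verb": None},  # ellipsis: verb supplied from the first conjunct
        ],
    },
}

def resolve_verb(conjunct, coordination):
    """Fill in an elided verb from the first conjunct that has one."""
    if conjunct["verb"] is not None:
        return conjunct["verb"]
    return next(c["verb"] for c in coordination["conjuncts"] if c["verb"])

second = sentence["coordination"]["conjuncts"][1]
print(resolve_verb(second, sentence["coordination"]))  # φορέουσι
```

Even a toy representation like this makes the three properties queryable: the topic is explicit, the μέν/δέ coordination is a single node rather than one particle governing the other, and the ellipsis is recoverable rather than silently supplied.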

Here’s the first student treebank analysis:

The student supplies the elided verb (although it’s not co-referenced in any way) but not the elided direct object. There’s no indication of the topicalisation.

It doesn’t quite seem right to me to say the two clauses are conjoined by δέ with the μέν hanging off the verb. I think of the μέν and δέ as equal partners in this construction and as tagging the two things being compared.

Here’s the second student treebank analysis:

This analysis seems a lot more confused. The coordination is shown as being done with the μέν this time, with the δέ dangling. The prepositional phrases are shown as governed by the subjects rather than the verb.

To be clear, I’m not trying to critique the students so much as raise questions for analysis conventions and visualisation, especially for reading environments and querying.

Again, this sentence (like the others in Herodotus 2.35–36) isn’t difficult. I doubt either student had any trouble understanding it. I just think it wasn’t clear how to adequately model their understanding of the structure.

I think elision and conjunction are the biggest issues in most analyses like this and good structures and visualisation for handling those will go a long way to making treebanks more consistent and more useful.

Using this sentence from Herodotus as an example, what are better ways of making sure analyses both enable useful queries and can be visualised in more perspicuous ways?

UPDATE: perhaps “coordination” would be better than conjunction as one of the “biggest issues” and I think “theticals” (HT: Jonathan Robie) could be added to that list to make the triad: elision, coordination, and theticals.

UPDATE 2: I also need to stop saying elision when I mean ellipsis! I’m spending too much time with morphophonology and not enough time with syntax :-)


Headed to Germany Next Week

Next week I’m headed to Germany for a whirlwind trip to Göttingen, Heidelberg, and Leipzig to share and discuss ideas with other scholars.

I’ll be speaking at a Global Philology workshop in Göttingen, attending a Digital Classics conference in Heidelberg (where I’ll also have to sit the final exam for my Postgraduate Diploma in Greek if I can find someone to invigilate), and then spending a few days in Leipzig meeting with the team at the Humboldt Chair of Digital Humanities at Universität Leipzig.

I’m very excited to now be working more closely with the digital classics community and meeting many people whose names I’ve known for a while.

I’m also thrilled to visit Leipzig again after more than ten years and get my fill of musical history there. I’m also hoping for a bit of a physics history fill too given the importance of both Göttingen and Leipzig in the history of quantum mechanics.


Handling Morphological Ambiguity

On my now page, I currently list “finalising an improved set of morphology tags to use” under Medium Term. As I find myself sometimes having to clarify the motivation for and state of this, I thought I’d share what I just wrote in the Biblical Humanities Slack.

Firstly, some background on previous notes…

Back in 2014, I wrote down some notes, Proposal for a New Tagging Scheme, after discussions with Mike Aubrey. In 2015, after some discussions with Emma Ehrhardt, I wrote down Handling Ambiguity. Then in February 2017, after discussion on the Biblical Humanities Slack, I put forward a concrete Proposal for Gender Tagging.

Here’s a slightly cleaned up version of what I wrote in Slack…

All I’ve done is propose a way of representing certain single-feature ambiguities (especially gender but also nom/acc in neuter). I have not proposed anything for multi-feature ambiguities nor have I actually DONE any work that uses these proposals.

Multi-feature ambiguities at the morphology level (1S vs 3P, GS vs AP, etc) are rarely ambiguous at the syntactic or semantic level for very good reason: the syntactic/semantic-level disambiguation is what allows one to tolerate the ambiguity at the morphology level (one reason that, as a cognitive scientist, I quite like discriminative models of morphology).

But if I continue with my goal to produce a purely morphological analysis, without “downward” disambiguation, then I want to be able to provide a way of representing form over function AND representing ambiguity.

I want to stress again that I think nom vs acc in the neuter, or gender in genitive plurals, is a DIFFERENT kind of ambiguity than 1S vs 3P or GS vs AP. For these multi-feature ambiguities (or what my wiki page calls extended syncretism, although I’m not sure I really like that term) it may come down to just providing a disjunction of codes, e.g. GSF∨APF.
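To make the disjunction idea concrete, here is a minimal sketch (my own illustration, not part of any of the proposals mentioned above) of carrying a code like GSF∨APF through to later disambiguation:

```python
# A sketch of representing a multi-feature ambiguity as a disjunction of
# parse codes, resolved later by intersecting with codes licensed by
# syntax or semantics. The set-based representation is illustrative only.

def parse_disjunction(code):
    """Split a code like 'GSF∨APF' into the set of its alternatives."""
    return frozenset(code.split("∨"))

def disambiguate(morph_code, licensed):
    """Intersect morphology-level alternatives with downstream-licensed
    codes; a singleton result means the ambiguity is resolved."""
    return parse_disjunction(morph_code) & licensed

print(disambiguate("GSF∨APF", {"APF", "NPF"}))  # frozenset({'APF'})
```

The nice property of keeping the disjunction in the data is that the case where the intersection is still not a singleton remains representable, rather than being forced to a premature choice.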

Also just in terms of motivation: clearly a morphological analysis that ignores downward disambiguation from syntax or semantics is unhelpful (and potentially even misleading) for exegesis and so a lot of use cases wouldn’t want to do it. HOWEVER, my goal is threefold:

(1) I want to have a way to model the output of automated morphological analysis systems prior to either automated or human downward disambiguation;
(2) as someone studying how morphology works from a cognitive point of view, I care about modelling how ambiguity is resolved at different levels and so want a model that can handle that;
(3) because a student is quite likely to be confronted with this ambiguity, it needs to be in my learning models. I want to be able to search for cases where 1S vs 3P ambiguity or GSF vs APF ambiguity or NSN vs ASN ambiguity is resolved by syntax or semantics so they can be illustrated to the student. I want to know, for a given passage, whether such ambiguity exists so learning can be appropriately scaffolded. And note that, for me, this extends to ambiguity resolved by just accentuation as well (which is another potentially useful thing to model for various applications).

In conclusion, I want to again state I’m not at all against a functional, fully-disambiguated parse code existing. I have NEVER proposed REPLACING the existing tagging schemes. I just want to add a new column useful for the reasons I’ve listed above in (1)–(3) and produce new resources that perhaps ONLY use that purely morphological parse code.

Finally I want to note there’s an important difference between what we put in our data and how we present it to users. People should not assume that when I’m describing codes to use in data that I’m suggesting that’s what end-users should see.

UPDATE: one topic I didn’t discuss here is ambiguity in endings that is resolved by knowledge of the stems or principal parts. For example, without a lexicon, there are ambiguities between imperfect and aorist that are easily resolved with additional lexical-level information.
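As a sketch of what that lexical-level resolution might look like, consider a secondary ending that is ambiguous between imperfect and aorist until the stem is looked up. The stems, codes, and mini-lexicon below are purely illustrative:

```python
# A sketch of ending-level ambiguity (imperfect vs aorist) resolved with
# lexical information. The toy lexicon and stem segmentation are
# hypothetical, for illustration only.

# Without a lexicon, a secondary ending like -ον on an augmented stem
# could be imperfect or second aorist, and 1S or 3P.
ENDING_ALTERNATIVES = {"ον": {"IAI.1S", "IAI.3P", "AAI.1S", "AAI.3P"}}

# A toy lexicon distinguishing the present and aorist stems of βάλλω.
LEXICON = {"βαλλ": {"tense": "I"}, "βαλ": {"tense": "A"}}

def resolve(stem, ending):
    """Keep only the alternatives consistent with the stem's tense."""
    tense = LEXICON[stem]["tense"]
    return {code for code in ENDING_ALTERNATIVES[ending] if code.startswith(tense)}

print(resolve("βαλ", "ον"))  # the two aorist alternatives remain
```

Note that even after the lexical lookup, the 1S vs 3P ambiguity remains, which is exactly the kind of residue that then needs syntactic or semantic resolution.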


An Initial Reboot of Oxlos

In a recent post, Update on LXX Progress, I talked about the possibility of putting together a crowd-sourcing tool to help share the load of clarifying some parse code errors in the CATSS LXX morphological analysis. Last Friday, Patrick Altman and I spent an evening of hacking and built the tool.

Back at BibleTech 2010, I gave a talk about Django, Pinax, and some early ideas for a platform built on them to do collaborative corpus linguistics. Patrick Altman was my main co-developer on some early prototypes and I ended up hiring him to work with me at Eldarion.

The original project was called oxlos after the betacode transcription of the Greek word for “crowd”, a nod to “crowd-sourcing”. Work didn’t continue much past those original prototypes in 2010 and Pinax has come a long way since so, when we decided to work on oxlos again, it made sense to start from scratch. From the initial commit to launching the site took about six hours.

At the moment there is one collective task available—clarifying which of a set of parse codes is valid for a given verb form in the LXX—but as the need for others arises, it will be straightforward to add them (and please contact me if you have similar tasks you’d like added to the site).

If you’re a Django developer, you are welcome to contribute. The code is open source under an MIT license, and we have lots we can potentially add beyond merely different kinds of tasks.

If your Greek morphology is reasonably strong, I invite you to sign up and help out with the LXX verb parsing task.

It’s probably not that relevant anymore, but you can watch the original 2010 talk below. I’d skip past the Django / Pinax intro and go straight to about 37:00 where I start to discuss the collective intelligence platform.


Analysing the Verbs in Nestle 1904

The last couple of weeks, I’ve been working on getting my greek-inflexion code working on Ulrik Sandborg-Petersen’s analysis of the Nestle 1904. The first pass of this is now done.

The motivation for doing this work was (a) to expand the verb stem database and stemming rules; (b) to be able to annotate the Nestle 1904 with additional morphological information for my adaptive reader and some similar work Jonathan Robie is doing.

My usual first step when dealing with a new text is to automatically generate as many new entries in the lexicon / stem-database as I can (see the first step in Update on LXX Progress).

In some cases, this is just a new stem for an existing verb because of a new form of an already known verb. But sometimes it’s an entirely new verb.

I thought the Nestle 1904 would be considerably easier than the LXX because the text is so similar but there were numerous challenges that arose.

It became clear very quickly that there were considerable differences in lemma choice between the Nestle 1904 and the MorphGNT SBLGNT. This didn’t completely surprise me: I’ve spent quite a bit of time cataloging lemma choice differences between lexical resources and there are considerable differences even between BDAG and Danker’s Concise Lexicon.

But even these aside, there were 7,743 out of 28,352 verbs mismatching after my code had already done its best to automatically fill in missing lexical entries and stems.

A. The normalisation column in Nestle 1904 doesn’t normalise capitalisation, clitic accentuation, or moveable nu, all of which greek-inflexion assumes have been done.

Capitalisation alone accounted for 1,042 mismatches. Clitic accentuation alone accounted for 1,008 mismatches. Moveable nu alone accounted for 4,153 mismatches.
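For illustration, here is a deliberately naive sketch of the kind of normalisation involved. It only folds case and strips the simplest moveable-nu cases; real handling (and clitic de-accentuation in particular, which is omitted entirely here) needs far more care, and this is not the actual greek-inflexion code:

```python
# A naive normalisation sketch: fold capitalisation and strip moveable nu
# in the simplest cases. Clitic accentuation is deliberately not handled.
import unicodedata

def normalise(form):
    # Fold capitalisation (e.g. sentence-initial capitals).
    form = form.lower()
    # Strip moveable nu after -σι or -ε (a crude approximation that
    # ignores many real cases and edge conditions).
    if form.endswith("σιν") or form.endswith("εν"):
        form = form[:-1]
    return unicodedata.normalize("NFC", form)

print(normalise("Φορέουσιν"))  # φορέουσι
```

Comparing forms modulo a function like this is what collapses the thousands of spurious mismatches into the genuinely interesting differences.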

B. Nestle 1904 systematically avoids assimilation of συν and ἐν preverbs.

Taken alone, these accounted for 91 mismatches. Mapping prior to analysis by greek-inflexion is somewhat of a hack that I’ll address in later passes.
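The mapping hack might look something like the following sketch. The rule list is illustrative (and incomplete), not the actual greek-inflexion code:

```python
# A sketch of pre-analysis mapping: rewriting unassimilated συν/ἐν
# spellings to their assimilated forms. Illustrative rules only.

ASSIMILATION = [
    ("συνλ", "συλλ"),  # e.g. συνλαμβάνω → συλλαμβάνω
    ("συνμ", "συμμ"),
    ("συνπ", "συμπ"),
    ("συνβ", "συμβ"),
    ("συνφ", "συμφ"),
    ("συνκ", "συγκ"),
    ("συνγ", "συγγ"),
    ("συνχ", "συγχ"),
    ("ἐνκ", "ἐγκ"),
    ("ἐνγ", "ἐγγ"),
    ("ἐνχ", "ἐγχ"),
]

def assimilate(form):
    """Rewrite an unassimilated preverb at the start of a form."""
    for old, new in ASSIMILATION:
        if form.startswith(old):
            return new + form[len(old):]
    return form

print(assimilate("συνλαμβάνω"))  # συλλαμβάνω
```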

C. There were 8 spelling differences in the endings which required an update to stemming.yaml:

  • κατασκηνοῖν (PAN) in Matt 13:32
  • κατασκηνοῖν (PAN) in Mark 4:32
  • ἀποδεκατοῖν (PAN) in Heb 7:5
  • φυσιοῦσθε (PMS-2P) in 1Cor 4:6
  • εἴχαμεν (IAI.1P) in 2John 1:5
  • εἶχαν (IAI.3P) in Mark 8:7
  • εἶχαν (IAI.3P) in Rev 9:8
  • παρεῖχαν (IAI.3P) in Acts 28:2

D. The different parse code scheme (Robinson’s vs CCAT) had to be mapped over.

This should have been straightforward but voice in the formal morphology field sometimes seemed to be messed up (which I corrected as part of G. below).
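A sketch of the kind of scheme mapping involved, for verbs only: the Robinson-style input shape assumed here is simplified, the dotted output style just mirrors the codes used in this post (e.g. IAI.1P), and this is not the actual conversion code:

```python
# A simplified sketch of mapping a Robinson-style verb tag like
# "V-PAI-3S" to the dotted style used in this post ("PAI.3S").
# Real Robinson tags have more shapes (participles, etc.) than this.

def robinson_to_dotted(tag):
    parts = tag.split("-")
    assert parts[0] == "V", "only simple verb tags handled in this sketch"
    tense_voice_mood, person_number = parts[1], parts[2]
    return f"{tense_voice_mood}.{person_number}"

print(robinson_to_dotted("V-PAI-3S"))  # PAI.3S
```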

E. There were 182 differences (type not token) in lemma choice, mostly active vs middle forms.

See for the full list.

F. There were a small handful of per-form lemma corrections I made:

  • ἐπεστείλαμεν AAI.1P: ἀποστέλλω → ἐπιστέλλω
  • ἀγαθουργῶν PAP.NSM: ἀγαθοεργέω → ἀγαθουργέω
  • συνειδυίης XAP.GSF: συνοράω → σύνοιδα
  • γαμίσκονται PMI.3P: γαμίζω → γαμίσκω

G. Finally, I made 69 (type not token) parse code changes.

See for the list.

With all this, the greek-inflexion code (on a branch not yet pushed at the time of writing) can correctly generate all the verbs in the Nestle 1904 morphology.

There are definitely improvements I need to make in a second pass and at least a small number of corrections that I think need to be made to the Nestle 1904 analysis.

But it’s now possible for me to produce an initial verb stem annotation for the Nestle 1904 and I’m a step closer to a morphological lexicon with broader coverage.

UPDATE: I’ve added some more parse corrections but not yet updated the gist.
