Handling Morphological Ambiguity
On my now page, I currently list “finalising an improved set of morphology tags to use” under Medium Term. As I find myself sometimes having to clarify the motivation for and state of this, I thought I’d share what I just wrote in the Biblical Humanities Slack.
Firstly, some background on previous notes…
Back in 2014, I wrote down some notes, Proposal for a New Tagging Scheme, after discussions with Mike Aubrey. In 2015, after some discussions with Emma Ehrhardt, I wrote down Handling Ambiguity. Then in February 2017, after discussion on the Biblical Humanities Slack, I put forward a concrete Proposal for Gender Tagging.
Here’s a slightly cleaned up version of what I wrote in Slack…
All I’ve done is propose a way of representing certain single-feature ambiguities (especially gender but also nom/acc in neuter). I have not proposed anything for multi-feature ambiguities nor have I actually DONE any work that uses these proposals.
Multi-feature ambiguities at the morphology level (1S vs 3P, GS vs AP, etc) are rarely ambiguous at the syntactic or semantic level for very good reason: the syntactic/semantic-level disambiguation is what allows one to tolerate the ambiguity at the morphology level (one reason that, as a cognitive scientist, I quite like discriminative models of morphology).
But if I continue with my goal to produce a purely morphological analysis, without “downward” disambiguation, then I want to be able to provide a way of representing form over function AND representing ambiguity.
I want to stress again that I think nom vs acc in neuter, or gender in genitive plurals, is a DIFFERENT kind of ambiguity than 1S vs 3P or GS vs AP. For these multi-feature ambiguities (or what my wiki page calls extended syncretism, although I’m not sure I really like that term) it may come down to just providing a disjunction of codes, e.g. GSF∨APF.
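As a minimal sketch of what a "disjunction of codes" could look like in data, here is one possible representation. The class name `AmbiguousParse` and its methods are hypothetical illustrations, not part of any existing tagging scheme or library; the only thing taken from the discussion above is the idea of listing every code a form is compatible with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AmbiguousParse:
    """Hypothetical purely morphological parse: every code the form could be."""

    codes: frozenset

    @classmethod
    def of(cls, *codes):
        return cls(frozenset(codes))

    @property
    def is_ambiguous(self):
        # More than one compatible code means the form alone does not decide.
        return len(self.codes) > 1

    def resolve(self, code):
        """Pick one reading, as syntax or semantics would ("downward" disambiguation)."""
        if code not in self.codes:
            raise ValueError(f"{code} is not a possible reading of {self}")
        return AmbiguousParse.of(code)

    def __str__(self):
        # Codes are sorted so the same disjunction always prints the same way.
        return "∨".join(sorted(self.codes))


parse = AmbiguousParse.of("GSF", "APF")
resolved = parse.resolve("GSF")
print(parse, parse.is_ambiguous)
print(resolved, resolved.is_ambiguous)
```

Keeping the unresolved set and the resolved reading as the same kind of object is one way to make "where was this ambiguity resolved by context?" a simple query, but other designs would work equally well.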
Also just in terms of motivation: clearly a morphological analysis that ignores downward disambiguation from syntax or semantics is unhelpful (and potentially even misleading) for exegesis and so a lot of use cases wouldn’t want to do it. HOWEVER, my goal is threefold:
(1) I want to have a way to model the output of automated morphological analysis systems prior to either automated or human downward disambiguation;
(2) as someone studying how morphology works from a cognitive point of view, I care about modelling how ambiguity is resolved at different levels and so want a model that can handle that;
(3) because a student is quite likely to be confronted with this ambiguity, it needs to be in my learning models. I want to be able to search for cases where 1S vs 3P ambiguity or GSF vs APF ambiguity or NSN vs ASN ambiguity is resolved by syntax or semantics so they can be illustrated to the student. I want to know, for a given passage, whether such ambiguity exists so learning can be appropriately scaffolded. And note that, for me, this extends to ambiguity resolved by just accentuation as well (which is another potentially useful thing to model for various applications).
In conclusion, I want to again state I’m not at all against a functional, fully disambiguated parse code existing. I have NEVER proposed REPLACING the existing tagging schemes. I just want to add a new column useful for the reasons I’ve listed above in (1) – (3) and produce new resources that perhaps ONLY use that purely morphological parse code.
Finally I want to note there’s an important difference between what we put in our data and how we present it to users. People should not assume that when I’m describing codes to use in data that I’m suggesting that’s what end-users should see.
UPDATE: one topic I didn’t discuss here is ambiguity in endings that is resolved by knowledge of the stems or principal parts. For example, without a lexicon, there are ambiguities between imperfect and aorist that are easily resolved with additional lexical-level information.
An Initial Reboot of Oxlos
In a recent post, Update on LXX Progress, I talked about the possibility of putting together a crowd-sourcing tool to help share the load of clarifying some parse code errors in the CATSS LXX morphological analysis. Last Friday, Patrick Altman and I spent an evening of hacking and built the tool.
Back at BibleTech 2010, I gave a talk about Django, Pinax, and some early ideas for a platform built on them to do collaborative corpus linguistics. Patrick Altman was my main co-developer on some early prototypes and I ended up hiring him to work with me at Eldarion.
The original project was called oxlos after the betacode transcription of the Greek word for “crowd”, a nod to “crowd-sourcing”. Work didn’t continue much past those original prototypes in 2010 and Pinax has come a long way since so, when we decided to work on oxlos again, it made sense to start from scratch. From the initial commit to launching the site took about six hours.
At the moment there is one collective task available—clarifying which of a set of parse codes is valid for a given verb form in the LXX—but as the need for others arises, it will be straightforward to add them (and please contact me if you have similar tasks you’d like added to the site).
If you’re a Django developer, you are welcome to contribute. The code is open source under an MIT license and available at https://github.com/jtauber/oxlos2. We have lots we can potentially add beyond merely different kinds of tasks.
If your Greek morphology is reasonably strong, I invite you to sign up at
and help out with the LXX verb parsing task.
It’s probably not that relevant anymore, but you can watch the original 2010 talk below. I’d skip past the Django / Pinax intro and go straight to about 37:00 where I start to discuss the collective intelligence platform.
Analysing the Verbs in Nestle 1904
The last couple of weeks, I’ve been working on getting my greek-inflexion code working on Ulrik Sandborg-Petersen’s analysis of the Nestle 1904. The first pass of this is now done.
The motivation for doing this work was (a) to expand the verb stem database and stemming rules; (b) to be able to annotate the Nestle 1904 with additional morphological information for my adaptive reader and some similar work Jonathan Robie is doing.
My usual first step when dealing with a new text is to automatically generate as many new entries in the lexicon / stem-database as I can (see the first step in Update on LXX Progress).
In some cases, this is just a new stem for an existing verb because of a new form of an already known verb. But sometimes it’s an entirely new verb.
I thought the Nestle 1904 would be considerably easier than the LXX because the text is so similar but there were numerous challenges that arose.
It became clear very quickly that there were considerable differences in lemma choice between the Nestle 1904 and the MorphGNT SBLGNT. This didn’t completely surprise me: I’ve spent quite a bit of time cataloging lemma choice differences between lexical resources and there are considerable differences even between BDAG and Danker’s Concise Lexicon.
But even these aside, there were 7,743 out of 28,352 verbs mismatching after my code had already done its best to automatically fill in missing lexical entries and stems.
A. The normalisation column in Nestle 1904 doesn’t normalise capitalisation, clitic accentuation, or moveable nu, all of which greek-inflexion assumes has been done.
Capitalisation alone accounted for 1042 mismatches. Clitic accentuation alone accounted for 1008 mismatches. Moveable nu alone accounted for 4153 mismatches.
B. Nestle 1904 systematically avoids assimilation of συν and ἐν preverbs.
Taken alone, these accounted for 91 mismatches. Mapping prior to analysis by greek-inflexion is somewhat of a hack that I’ll address in later passes.
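To illustrate the kind of pre-analysis mapping meant here, the following is a rough sketch only. It is not the actual mapping used with greek-inflexion; the table `ASSIMILATION`, the function `assimilate_preverb`, and the example forms are all assumptions made for illustration. It only handles an unaccented συν- or ἐν- at the start of a form, and leaves out cases such as συν- before σ, ζ, or ρ.

```python
import re
import unicodedata

# Assumed table: what the final ν of συν-/ἐν- becomes before a given consonant.
ASSIMILATION = {
    "β": "μ", "π": "μ", "φ": "μ", "ψ": "μ", "μ": "μ",  # labials and μ
    "γ": "γ", "κ": "γ", "χ": "γ", "ξ": "γ",            # velars
    "λ": "λ",
}

# Normalise the pattern to NFC so it matches NFC-normalised input.
PATTERN = re.compile(unicodedata.normalize("NFC", "^(συ|ἐ)ν([βπφψμγκχξλ])"))


def assimilate_preverb(form):
    """Map an unassimilated preverb spelling to the assimilated one."""
    form = unicodedata.normalize("NFC", form)
    return PATTERN.sub(
        lambda m: m.group(1) + ASSIMILATION[m.group(2)] + m.group(2),
        form,
    )


# Illustrative spellings, not quoted from the Nestle 1904 data.
for unassimilated in ["συνκαλέω", "συνλαμβάνω", "ἐνκαταλείπω", "συνάγω"]:
    print(unassimilated, "->", assimilate_preverb(unassimilated))
```

A form where ν is followed by a vowel (as in the last example) is left untouched, which is why augmented forms would not be affected by this particular sketch.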
C. There were 8 spelling differences in the endings which required an update to stemming.yaml:
- κατασκηνοῖν (PAN) in Matt 13:32
- κατασκηνοῖν (PAN) in Mark 4:32
- ἀποδεκατοῖν (PAN) in Heb 7:5
- φυσιοῦσθε (PMS-2P) in 1Cor 4:6
- εἴχαμεν (IAI.1P) in 2John 1:5
- εἶχαν (IAI.3P) in Mark 8:7
- εἶχαν (IAI.3P) in Rev 9:8
- παρεῖχαν (IAI.3P) in Acts 28:2
D. The different parse code scheme (Robinson’s vs CCAT) had to be mapped over.
This should have been straightforward but voice in the formal morphology field sometimes seemed to be messed up (which I corrected as part of G. below).
E. There were 182 differences (type not token) in lemma choice, mostly active vs middle forms.
See https://gist.github.com/jtauber/28ddfeee3175903026dade4ab965ac6c#file-lemma-differences-txt for the full list.
F. There were a small handful of per-form lemma corrections I made:
- ἐπεστείλαμεν AAI.1P ἀποστέλλω ἐπιστέλλω
- ἀγαθουργῶν PAP.NSM ἀγαθοεργέω ἀγαθουργέω
- συνειδυίης XAP.GSF συνοράω σύνοιδα
- γαμίσκονται PMI.3P γαμίζω γαμίσκω
G. Finally, I made 69 (type not token) parse code changes.
See https://gist.github.com/jtauber/28ddfeee3175903026dade4ab965ac6c#file-parse-txt for the list.
With all this, the greek-inflexion code (on a branch not yet pushed at the time of writing) can correctly generate all the verbs in the Nestle 1904 morphology.
There are definitely improvements I need to make in a second pass and at least a small number of corrections that I think need to be made to the Nestle 1904 analysis.
But it’s now possible for me to produce an initial verb stem annotation for the Nestle 1904 and I’m a step closer to a morphological lexicon with broader coverage.
UPDATE: I’ve added some more parse corrections but not yet updated the gist.
Update on LXX Progress
As mentioned in previous posts, I’ve been working through the LXX, initially making sure my greek-inflexion library can generate the same analysis of verbs as the CATSS LXX Morphology and adding to the verb stem database accordingly. This is a preliminary to being able to run the code on alternative LXX editions such as Swete and provide a freely available morphologically-tagged LXX.
The general process has been, one book at a time:
- programmatically expand the stem database with missing stems where the analysis given by CATSS fits what greek-inflexion stemming rules expect
- where the analysis from CATSS doesn’t fit what greek-inflexion expects, evaluate if it’s
  - a parse error in the CATSS (at this stage by far the most common problem, but also the most time consuming to identify and fix)
  - a missing stemming rule (very rare at this stage)
  - some temporary limitation of greek-inflexion (it could be smarter about some accentuation, for example)
Working a few hours a week, it took about a month to do 1 Kings (i.e. 1 Samuel), in part because it had close to 100 parsing errors in the CATSS, many of them quite inexplicable (like getting the voice wrong when the ending should make that very easy to determine).
The work up until this point covers about 35% of the LXX, but I decided for the rest to go broad rather than book-by-book.
In other words, I’ve expanded the stem database (per step one above) for the entire LXX in one go and will now work through the problem cases.
What is very encouraging is that expanding the verbs attempted from 35% to 100% only led to 731 analysis mismatches in 1,875 locations. Given the LXX has just over 100,000 verbs, that’s less than a 2% error rate.
Let me be clear, however, what I’m claiming. I’m NOT saying I can morphologically tag verbs with 98% accuracy. I’m merely saying that 98% of the CATSS LXX morphological analysis can be explained by the rules and data in greek-inflexion. The other 2% is likely to MOSTLY be errors in the CATSS analysis with a few errors in my stem database, stemming rules, or accentuation rules.
At the rate I worked through 1 Kings, going through the rest of the mismatches might take the rest of the year, but I think I can speed things up by batching similar kinds of mismatches together. For example, there are 586 forms where greek-inflexion didn’t generate the form in the CATSS analysis with the morphosyntactic properties given but was able to generate the form with different morphosyntactic properties. In almost all cases that corresponds to a mistake in the CATSS analysis. It’s the most time consuming part to deal with but batching them up together (especially dealing with the same mismatch across all remaining books at once) should speed things up.
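The batching idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual workflow: the record layout (form, parse given by CATSS, parse the code could generate), the function name `batch_mismatches`, and the placeholder data are all hypothetical.

```python
from collections import defaultdict

# Placeholder records: (form, parse given by CATSS, parse that could be generated).
# The forms are dummy labels and the parse codes are made up for illustration.
records = [
    ("form-a", "AMI.3S", "API.3S"),
    ("form-b", "AMI.3S", "API.3S"),
    ("form-c", "PAI.3P", "PAS.3P"),
]


def batch_mismatches(records):
    """Group forms by the kind of disagreement, largest batch first."""
    batches = defaultdict(list)
    for form, given, generated in records:
        batches[(given, generated)].append(form)
    return sorted(batches.items(), key=lambda item: len(item[1]), reverse=True)


for (given, generated), forms in batch_mismatches(records):
    print(f"{given} vs {generated}: {len(forms)} form(s): {', '.join(forms)}")
```

Sorting by batch size means the most frequent kind of disagreement is reviewed first, so a single decision can clear many locations at once.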
It may also lend itself to crowd-sourcing. I could probably pretty easily whip up a little website that shows people the form and asks them to choose between the CATSS analysis and the greek-inflexion analysis (not telling them which is which).
It may be worth me spending a few hours setting that up!
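The "not telling them which is which" part of that idea could look something like the sketch below. It is only an assumption about how such a task might be modelled; the function names `make_blind_question` and `record_vote` and the dictionary layout are invented for illustration and are not taken from any existing site.

```python
import random


def make_blind_question(form, catss_parse, generated_parse, rng=None):
    """Return a question with shuffled choices plus a private answer key."""
    rng = rng or random.Random()
    options = [("catss", catss_parse), ("greek-inflexion", generated_parse)]
    rng.shuffle(options)  # hide which source each choice came from
    question = {"form": form, "choices": [parse for _, parse in options]}
    answer_key = [source for source, _ in options]
    return question, answer_key


def record_vote(answer_key, chosen_index):
    """Translate the index the user picked back into the source it came from."""
    return answer_key[chosen_index]


# Placeholder form label and made-up parse codes.
question, key = make_blind_question("form-a", "AMI.3S", "API.3S")
print(question)
print(record_vote(key, 0))
```

Only the question is shown to the contributor; the answer key stays on the server so votes can be tallied per source afterwards.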
New MorphGNT Releases and Accentuation Analysis
Over the last few weeks, I’ve made a number of new releases of the MorphGNT SBLGNT analysis fixing some accentuation issues mostly in the normalization column. This came out of ongoing work on modelling accentuation (and, in particular, rules around clitics).
Back in 2015, I talked about Annotating the Normalization Column in MorphGNT. This post could almost be considered Part 2.
I recently went back to that work and made a fresh start on a new repo gnt-accentuation intended to explain the accentuation of each word in the GNT (and eventually other Greek texts). There are two parts to that: explaining why the normalized form is accented the way it is, and then explaining why the word-in-context might be accented differently (clitics, etc). The repo is eventually going to do both but I started with the latter.
My goal with that repo is to be part of the larger vision of an “executable grammar” I’ve talked about for years where rules about, say, enclitics, are formally written up in a way that can be tested against the data. This means:
- students reading a rule can immediately jump to real examples (or exceptions)
- students confused by something in a text can immediately jump to rules explaining it
- the correctness of the rules can be tested
- errors in the text can be found
It is the fourth point that meant that my recent work uncovered some accentuation issues in the SBLGNT text, normalization, and lemmatization. Some of that has been corrected in a series of new releases of the MorphGNT: 6.08, 6.09, and 6.10. See https://github.com/morphgnt/sblgnt/releases for the specifics. The reason for so many releases was I wanted to get corrections out as soon as I made them but then I found more issues!
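As a toy illustration of a rule written so that it can be tested against data, here is a deliberately simplified check. It is not code from gnt-accentuation: the function names, the tiny enclitic list, and the rule itself ("a word carries two accents only when an enclitic follows") are simplifying assumptions for the sake of the example.

```python
import unicodedata

# Combining grave, acute, and perispomeni (circumflex) after NFD decomposition.
ACCENTS = {"\u0300", "\u0301", "\u0342"}


def count_accents(word):
    """Count accent marks on a word, ignoring breathings and other diacritics."""
    return sum(1 for ch in unicodedata.normalize("NFD", word) if ch in ACCENTS)


def check_accent_counts(tokens, enclitics):
    """Return (index, token) pairs that break the simplified rule."""
    problems = []
    for i, token in enumerate(tokens):
        n = count_accents(token)
        followed_by_enclitic = i + 1 < len(tokens) and tokens[i + 1] in enclitics
        if n > 2 or (n == 2 and not followed_by_enclitic):
            problems.append((i, token))
    return problems


ENCLITICS = {"τις", "μου"}  # a tiny assumed list, far from complete

print(check_accent_counts(["ἄνθρωπός", "τις"], ENCLITICS))
print(check_accent_counts(["ἄνθρωπός", "λέγει"], ENCLITICS))
```

Anything the check flags is either an exception the rule needs to account for or a possible error in the text, which is how a rule of this kind can serve both the third and fourth points in the list above.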
There are some issues in the text itself which need to be resolved. See the Github issue https://github.com/morphgnt/sblgnt/issues/52 for details. I’d very much appreciate people’s input.
In the meantime, stay tuned for more progress on