Initial Apostolic Fathers Text Complete

Exactly three months ago to the day, I announced that Seumas Macdonald and I were working on a corrected, open, digital edition of the Apostolic Fathers based on Lake. That initial work is now complete.

Preparing an Open Apostolic Fathers discussed the original motivation and the rather detailed process we went through.

The corrected raw text files are available on GitHub at https://github.com/jtauber/apostolic-fathers but I also generated a static site at https://jtauber.github.io/apostolic-fathers/ to browse the texts. The corrections will be contributed back to the OGL First1KGreek project.

The next step for us will be to lemmatise the text and there has already been some interest from others in getting the English translation corrected and aligned as well.

Recall that, while we were essentially correcting the Open Greek and Latin text, we used the CCEL text and that in Logos to identify particular places to look at in the printed text. We did this by lining up the CCEL, OGL and Logos texts and seeing where any of them disagreed. Those became the places we went back to, in multiple scans of the printed Lake, to make our corrections to the base text we started with from OGL.

How often did each of those three “witnesses” disagree? Here are some stats. A = CCEL, B = OGL, C = Logos. And so AB/C is where CCEL and OGL agreed against Logos, AC/B is where CCEL and Logos agreed against OGL, A/BC is where OGL and Logos agreed against CCEL, and A/B/C is where all three disagreed.
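The four categories can be made concrete with a small sketch. It assumes the three texts have already been aligned token by token, and the function names are mine, not those in the repository. The denominator (all aligned positions) is also an assumption, since it isn't stated how the percentages below were normalised.

```python
from collections import Counter

PATTERNS = ("AB/C", "AC/B", "A/BC", "A/B/C")


def classify(a, b, c):
    """Label one aligned position across A (CCEL), B (OGL), C (Logos)."""
    if a == b == c:
        return "ABC"  # all three witnesses agree
    if a == b:
        return "AB/C"  # CCEL and OGL agree against Logos
    if a == c:
        return "AC/B"  # CCEL and Logos agree against OGL
    if b == c:
        return "A/BC"  # OGL and Logos agree against CCEL
    return "A/B/C"  # all three differ


def disagreement_rates(rows):
    """rows: iterable of (a, b, c) token triples, already aligned.

    Returns each disagreement pattern as a percentage of all positions.
    A missing token (e.g. a lacuna) can be represented as None.
    """
    counts = Counter(classify(a, b, c) for a, b, c in rows)
    total = sum(counts.values())
    if total == 0:
        return {p: 0.0 for p in PATTERNS}
    return {p: 100 * counts[p] / total for p in PATTERNS}
```

Every position where the result is anything other than "ABC" would be a place to go back to the printed text.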

FILE     AB/C     AC/B     A/BC     A/B/C
001      1.29%    1.15%    7.97%    0.32%
002      0.76%    1.20%    3.39%    0.37%
003      1.58%    2.20%    4.97%    0.28%
004      0.57%    1.33%    7.01%    0.28%
005      1.05%    1.79%    6.21%    0.84%
006      0.88%    1.18%    7.54%    0.69%
007      0.39%    0.88%    3.34%    0.20%
008      0.79%    0.87%    5.41%    0.44%
009      0.25%    1.53%    2.68%    0.38%
010      0.44%    4.05%    4.36%    0.25%
011      0.36%    1.86%    4.23%    0.14%
012      0.92%    1.15%    5.59%    0.43%
013      1.29%    0.90%    6.08%    0.34%
014      1.25%    0.34%    4.91%    0.08%
015      0.96%    0.65%    6.74%    0.50%
TOTAL    1.11%    1.12%    5.98%    0.34%

One can immediately see CCEL diverged the most from the others (it had considerable lacunae for a start). The numbers for Logos diverging are probably inflated by a systematic error we only noticed after work had started: a middle dot was often erroneously added after eta. Ultimately this affected nothing beyond perhaps flagging places Seumas and I had to check that we otherwise wouldn’t have needed to.

But at the end of the day, how much did we change? How much of the OGL original remained? How similar was our result to the text on CCEL? And for a bit of fun, how often was my first correction and Seumas’s first correction the same as what we ended up with after consensus was achieved? Here’s the breakdown by work:

FILE     CCEL      OGL       JT         SM
001      91.27%    99.02%    99.85%     99.91%
002      96.02%    98.90%    99.77%     99.90%
003      94.58%    97.63%    99.77%     99.60%
004      92.42%    98.48%    99.91%     100.00%
005      92.32%    98.32%    99.79%     99.89%
006      91.28%    98.82%    98.82%     99.80%
007      96.07%    98.92%    99.90%     99.90%
008      93.89%    99.30%    100.00%    99.91%
009      96.82%    99.75%    98.60%     99.87%
010      94.94%    96.27%    99.87%     99.68%
011      95.04%    98.54%    99.77%     99.91%
012      93.86%    98.78%    99.87%     99.90%
013      93.15%    99.20%    99.87%     99.83%
014      94.90%    99.62%    99.92%     99.74%
015      92.69%    99.16%    99.96%     99.62%
TOTAL    93.32%    98.97%    99.83%     99.84%

You just beat me Seumas :-)
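For anyone wanting to produce similar figures for their own texts, here is one plausible way to do it. This is only a sketch under assumptions: it measures the share of whitespace-separated tokens in the final text that also appear, in order, in another version, using Python's standard `difflib`. The numbers above may well have been computed differently.

```python
from difflib import SequenceMatcher


def percent_retained(version, final):
    """Percentage of tokens in `final` that are matched, in order, in `version`.

    Both arguments are plain strings; tokens are split on whitespace.
    """
    v_tokens, f_tokens = version.split(), final.split()
    if not f_tokens:
        return 100.0
    # autojunk=False so frequent tokens are not ignored in long texts
    matcher = SequenceMatcher(a=v_tokens, b=f_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100 * matched / len(f_tokens)
```

Calling `percent_retained(ogl_text, final_text)` (both names hypothetical) would then give a single number per work, comparable in spirit to the OGL column.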

       
 

More Thoughts on Different Morphological Analyses

In Five Types of Morphological Analysis I outlined five distinct ways of approaching morphological (or potentially any linguistic) analysis. In support of some of these, I have some additional examples from a pair of papers I’m reading and a conference I just attended.

Baayen et al (2018) (co-written by Jim Blevins, my undergraduate advisor from 25 years ago and still a mentor), in describing their own word-based, discriminative approach to morphology, contrast it with both widespread morpheme-based approaches and increasingly popular exponent-focused realizational approaches. I’ll leave a discussion of these different approaches to another time, but what is relevant to my previous post is this comment:

[morpheme-based and realizational analyses] may be of practical value, especially in the context of adult second language acquisition. It is less clear whether the corresponding theories, whose practical utility derives ultimately from their pedagogical origins, can be accorded any cognitive plausibility.

Note the distinction they are making between analyses of practical (adult SLA, pedagogical) value and cognitive plausibility.

Again, it’s not the point of this post to describe (much less assess) their arguments for why morphemes and exponents might not be cognitively plausible and what the alternative is, merely that they acknowledge certain analyses might be useful for pedagogical purposes independent of their cognitive plausibility (thereby agreeing with my psychological vs pedagogical distinction).

Perhaps cognitive would be another word for my psychological category.

They furthermore suggest:

Constructional schemata, inheritance, and mechanisms spelling out exponents are all products of descriptive traditions that evolved without any influence from research traditions in psychology. As a consequence, it is not self-evident that these notions would provide an adequate characterization of the representations and processes underlying comprehension and production. It seems particularly implausible that children would be motivated to replicate the descriptive scaffolding of [these] theoretical accounts…

Terms like “descriptive traditions” and “descriptive scaffolding of theoretical accounts” refer to what I had in mind with my synchronic category of analysis. Perhaps descriptive and theoretical would be other words for that category.

In a related paper, Baayen et al (2019), they talk about three possible responses to the challenge posed to linguistics (or at least linguistically-informed natural language processing) by the success of machine learning.

Again, it’s outside the scope of this post to get into those details, but in short, their suggested possible responses are: (1) admit defeat, (2) claim the hidden layers reflect traditional linguistic representations, (3) rethink the nature of language processing in the brain. They go on to explore the third option in the context of morphology and the lexicon, stating that

the model that we propose here brings together several strands of research across theoretical morphology, psychology, and machine learning.

Note that this is essentially a claim that it’s possible to reconcile at least three of the different approaches I’ve outlined: the synchronic/description/theoretical, the cognitive/psychological, and the algorithmic/machine-learning.

(Missing here is any reference to diachrony or pedagogy, which I think they would agree are distinct from the approaches they are attempting to unify.)

Now last week, I attended the Society for Computation in Linguistics meeting, coinciding with the big annual meeting of the Linguistic Society of America. One of the goals of SCiL is to build bridges from the NLP community to the linguistics community so it was of particular interest to me.

But again one of the big things that came up in multiple talks was distinct approaches: the approach of the NLP practitioners, often referred to as the engineering approach, and that of the linguists, often referred to as the scientific approach. At their most self-deprecating, the NLP practitioners confessed their over-obsession with metrics on “tasks” and lack of regard for the underlying scientific “questions”. Noah Smith, in fact, joked that NLPers can annoy linguists by asking what their “task” is and linguists can annoy NLPers by asking what their “question” is.

I mention this as yet another example of a difference in approach and perspective.

Diachrony didn’t feature at all in either the Baayen/Blevins papers or at SCiL, but certainly my other distinctions seem more broadly confirmed (albeit with alternative terminology). So I think we have:

  • algorithmic / engineering / task-oriented
  • diachronic
  • synchronic / descriptive / theoretical
  • psychological / cognitive
  • pedagogical

Now this is not to say some of these approaches can’t be combined (as shown in the Baayen/Blevins papers). But even when one is attempting to combine some of them, I think it’s useful to acknowledge (a) the multiple approaches being combined; (b) other approaches with distinct goals and evaluation procedures that aren’t being considered but which may still be valuable in other contexts.

At the end of the day, I’m trying to turn arguments of the form “that isn’t a good theory/description/implementation/explanation of morphology” into a more nuanced “it probably isn’t good for this but it might be good for that”.

References

Baayen, R. H., Chuang, Y. Y., and Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13 (2), 232-270.

Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan E., and Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39.

       
 

Five Types of Morphological Analysis

People talking about morphological analyses can often speak across each other because they have different purposes in mind. Here’s an initial attempt to outline five possibly distinct notions one might be referring to.

I’m tentatively labelling them:

  • algorithmic
  • diachronic
  • synchronic
  • psychological
  • pedagogical

although the labels matter less than being clear about the distinction.

Algorithmic means I can go from an inflected form to a lemma + morphosyntactic properties (or vice versa) efficiently on a computer. The way this is achieved might not be psychologically plausible or historically accurate but it can be implemented in software to get the job done.

Diachronic means I can explain (or at least speculate) how the inflected form came about: what the roots are, what grammaticalisation took place, what sound changes explain seeming irregularities, etc.

Synchronic means I can describe the inflected forms without recourse to historical data or reconstruction. This might focus on perspicuity rather than computational efficiency or psychological plausibility.

Psychological means the analysis is consistent with what I think is (or was) going on in the minds of native speakers. Some people may equate this with synchronic analyses but I think you can have a psychologically implausible yet still descriptively adequate synchronic analysis.

Pedagogical means a useful way of explaining it to students. This may be diachronic, but might be more synchronic (whether psychologically plausible or not).

Analyses can obviously be compatible with more than one of these. But I think it’s helpful to be clear what the goals of any morphological description are. If the goal is to lemmatise and tag a new text, then psychological or historical plausibility, or analytical or pedagogical clarity might not matter. If one’s goal is a diachronically-informed analysis to help students, it should be clear why an otherwise perfectly adequate morphological parser might not be producing useful information.
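To make the algorithmic sense concrete, here is about the simplest thing that qualifies: a lookup table from inflected form to lemma plus properties. The table is a hand-made, deliberately incomplete toy (the tag wording is my own and not every possible parse is listed), and it makes no claim to psychological, historical or pedagogical insight. That is exactly the point.

```python
import unicodedata


def nfc(text):
    """Normalise to NFC so visually identical Greek strings compare equal."""
    return unicodedata.normalize("NFC", text)


# Toy lexicon: inflected form -> list of (lemma, properties).
_RAW = {
    "λαμβάνω": [("λαμβάνω", "present active indicative 1SG")],
    "ἔλαβον": [
        ("λαμβάνω", "aorist active indicative 1SG"),
        ("λαμβάνω", "aorist active indicative 3PL"),
    ],
    "λήμψομαι": [("λαμβάνω", "future middle indicative 1SG")],
}
LEXICON = {nfc(form): analyses for form, analyses in _RAW.items()}


def analyse(form):
    """Return every (lemma, properties) pair recorded for the form."""
    return LEXICON.get(nfc(form), [])
```

`analyse("ἔλαβον")` returns two candidates, and nothing in the table says why the form looks the way it does or how a student should learn it.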

Those who have been following my Tour of Greek Morphology know I’ve tried to be careful distinguishing, for example, historical explanations from how I think native speakers internalise(d) word forms, or how students should learn them.

I still come across a lot of people who think the “modern” way of understanding morphology is learning the “morphemes” and rules, not memorising paradigms. Besides getting the history somewhat wrong, this is also making the mistake of conflating these different types of analyses and not recognising that one type of analysis might be perfectly valid for one purpose but not another.

Here’s a fun game to play: how would you analyse/explain the form λαμβάνω? Or ἔλαβον (especially when 3rd plural) or λήμψομαι? Or μαθητής vs μαθητοῦ? Or ἔδωκεν vs δέδωκα vs δός?

Maybe I haven’t quite nailed the labels yet. Maybe there are further distinctions to draw. I welcome people’s input.

       
 

Preparing an Open Apostolic Fathers

I’m working with Seumas Macdonald on an open, corrected digital edition of the Apostolic Fathers based on Lake.

Seumas Macdonald asked me a few weeks ago what it would take to expand some of our text and vocab ordering experiments to the text of Apostolic Fathers (we’re both desirous of more comprehensible input for Greek learners).

My reply was that we first of all needed to get a good open text and then lemmatise it. I thought the “get a good open text” would be trivial but it turned out not to be.

I asked around without much positive response. I found HTML versions of the Lake texts on the Christian Classics Ethereal Library (CCEL) website but they turned out to be problematic quality-wise (see below).

It then occurred to me to check what was in the Perseus Digital Library. It only had the Epistle of Barnabas but the related First 1000 Years of Greek at the Open Greek and Latin Project had done the rest.

The Perseus/OGL texts were considerably better than the CCEL ones, but were still not without problems. It was clear that the two collections had been produced independently, however, which is important for what follows.

I’m almost certain the CCEL texts were keyed in. There is haplography and dittography galore! The haplography even corresponds almost perfectly to line breaks in the printed Lake editions I looked at.

The Perseus/OGL texts, on the other hand, are the results of OCR with some manual correction.

I wrote some code to extract both the CCEL and Perseus/OGL texts and put them in a comparable format. I then wrote a script to align the two. My thinking was to go through all the places where the two disagreed, check the printed Lake and correct the Perseus/OGL text accordingly.
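The actual scripts are in the repository. As a rough illustration of the idea only, and not the code we used, a two-way alignment that lists just the places where two versions disagree can be sketched with Python's standard `difflib`, assuming both texts have been reduced to whitespace-separated tokens:

```python
from difflib import SequenceMatcher


def disagreements(text_a, text_b):
    """Align two texts token by token and return only the places they differ.

    Each result is (token offset in text_a, tokens in text_a, tokens in text_b).
    An empty list on one side means the other text has extra material there.
    """
    a, b = text_a.split(), text_b.split()
    # autojunk=False so frequent tokens are not ignored in long texts
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    places = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            places.append((i1, a[i1:i2], b[j1:j2]))
    return places
```

A dropped word in one text shows up as an entry with an empty list on that side, which is roughly how haplography would surface in such a comparison.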

I decided to throw the Lake text from Logos into the mix as well, not as an input to the correction itself but merely as another “edition” to flag differences with (to then check with the printed Lake).

Thus began a project Seumas and I have been working on the last few weeks. Once differences in any of the three texts are identified, they are flagged for review and Seumas and I independently look at the printed Lake and correct the Perseus/OGL base text.

If our corrections disagree, we continue to work on them until we come to consensus. This three-way comparison followed by two-way independent correction is proving to work very well (although it’s a lot of work!)
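The consensus step itself is simple enough to sketch. This assumes each of us records a proposed reading per flagged location, and the location ids and function name here are hypothetical, not what the repository uses:

```python
def review_round(jt, sm):
    """Split flagged locations into agreed corrections and ones still to discuss.

    jt, sm: dicts mapping a location id to each person's independently
    proposed reading after checking the printed edition.
    """
    agreed, unresolved = {}, []
    for loc in sorted(jt.keys() | sm.keys()):
        if loc in jt and loc in sm and jt[loc] == sm[loc]:
            agreed[loc] = jt[loc]  # both corrections match: accept
        else:
            unresolved.append(loc)  # differ, or only one person has checked
    return agreed, unresolved
```

Rounds repeat on the unresolved locations until that list is empty.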

All the code, the source texts (except Logos), and work-in-progress are available at

https://github.com/jtauber/apostolic-fathers

and you can follow the status in the README. There are also more detailed notes on the whole process.

Once the candidate versions of all the texts are published, I’ll do another post just with some interesting statistics on the nature of errors in the CCEL, Perseus/OGL, and Logos texts. The “scribal errors” in the CCEL text are particularly fascinating but even some of the Perseus/OGL OCR errors will be worth writing about.

Seumas and I will then contribute back the corrections to CCEL, Perseus/OGL, and Logos. Hopefully our texts will also be featured on the Biblical Humanities Dashboard as the go-to open digital text of the Apostolic Fathers (so no one else has to repeat this effort).

Finally, we’ll start the process of lemmatisation so the Apostolic Fathers can be included in our open learning materials.

       
 

A Tour of Greek Morphology: Part 27

Part twenty-seven of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

Let’s finish our survey of imperfect middle endings in the indicative with the athematic verbs.

       IM-6      IM-7      IM-8      IM-9
1SG    Xύμην     Xέμην     Xόμην     Xάμην
2SG    Xυσο      Xεσο      Xοσο      Xασο/Xω
3SG    Xυτο      Xετο      Xοτο      Xατο
1PL    Xύμεθα    Xέμεθα    Xόμεθα    Xάμεθα
2PL    Xυσθε     Xεσθε     Xοσθε     Xασθε
3PL    Xυντο     Xεντο     Xοντο     Xαντο

The classes are similar to their IA- equivalents except there is no ablaut between the singular and plural.

IM-6    -νυ- verbs like δείκνυμι            stem ends in ῠ
IM-7    τίθημι, ἵημι and their compounds    stem ends in ε
IM-8    δίδωμι and compounds                stem ends in ο
IM-9    ἵστημι and compounds                stem ends in ᾰ

The intervocalic sigma in 2SG generally does not drop out in the athematics although it sometimes can, particularly in IM-9, which seems to be the class that has begun to merge most with the thematics. Note, though, that the lack of circumflex in this case eliminates confusion with an IM-4 2SG.

The lack of circumflex in the 3SG and 2PL also eliminates confusion with IM-4 in those cells.

IM-7 can be confused with IM-1 in the 3SG and 2PL, though.
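For those who like to see this sort of thing as data, the endings table for these four classes can be inverted to go from a form to candidate class and cell. This is just a sketch: it only knows the four athematic classes in this post, it matches on the ending alone, and the function name is my own.

```python
import unicodedata

CELLS = ["1SG", "2SG", "3SG", "1PL", "2PL", "3PL"]

# Endings from the table above (the part after X), alternatives split on "/".
ENDINGS = {
    "IM-6": ["ύμην", "υσο", "υτο", "ύμεθα", "υσθε", "υντο"],
    "IM-7": ["έμην", "εσο", "ετο", "έμεθα", "εσθε", "εντο"],
    "IM-8": ["όμην", "οσο", "οτο", "όμεθα", "οσθε", "οντο"],
    "IM-9": ["άμην", "ασο/ω", "ατο", "άμεθα", "ασθε", "αντο"],
}


def nfc(text):
    """Normalise to NFC so accented vowels compare equal however encoded."""
    return unicodedata.normalize("NFC", text)


def candidates(form):
    """Return (class, cell) pairs whose ending matches the end of the form."""
    form = nfc(form)
    found = []
    for cls, endings in ENDINGS.items():
        for cell, alternatives in zip(CELLS, endings):
            for ending in alternatives.split("/"):
                if form.endswith(nfc(ending)):
                    found.append((cls, cell))
                    break  # one match per cell is enough
    return found
```

`candidates("ἐδίδοτο")` gives IM-8 3SG. A thematic form such as ἐλύετο also ends in -ετο, so a check on the ending alone reports IM-7 3SG for it, which is the kind of confusion noted above.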

In the next few posts we’ll summarise the inference rules and ambiguities for the imperfect and look at some type and token frequencies, just like we did for the present.