Consolidating Vocabulary Coverage and Ordering Tools

One of my goals for 2019 is to bring more structure to various disparate Greek projects and, as part of that, I’ve started consolidating multiple one-off projects I’ve done around vocabulary coverage statistics and ordering experiments.

Going back at least 15 years (when I first started blogging about Programmed Vocabulary Learning) I’ve had little Python scripts all over the place to calculate various stats, or try out various approaches to ordering.

I’m bringing all of that together in a single repository and updating the code so:

  • it’s all in one place
  • it’s usable as a library in other projects or in things like Jupyter notebooks
  • it can be extended to arbitrary chunking beyond verses (e.g. books, chapters, sentences, paragraphs, pericopes)
  • it can be extended to other texts such as the Apostolic Fathers, Homer, etc (other languages too!)

I’m partly spurred on by a desire to explore more of the ideas Seumas Macdonald has been talking about and to be more responsive to the occasional inquiries I get from Greek teachers. I also have a poster, “Vocabulary Ordering in Text-Driven Historical Language Instruction: Sequencing the Ancient Greek Vocabulary of Homer and the New Testament”, that was accepted for EUROCALL 2019 in August, and this code library helps me not only produce the poster but also make it more reproducible.

Ultimately I hope to write a paper or two out of it as well.

I’ve started the repo at:

https://github.com/jtauber/vocabulary-tools/

where I’ve basically rewritten half of my existing code from elsewhere so far. I’ve reproduced the code for generating core vocabulary lists and also the coverage tables I’ve used in multiple talks (including my BibleTech talks in 2010 and 2015).

I’ve taken the opportunity to generalise and decouple the code (especially with regard to the different chunking systems) and also to make use of newer Python features like Counter and dictionary comprehensions, which simplify much of my earlier code.
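As a rough illustration of why Counter and dictionary comprehensions help here, the sketch below counts lemma frequencies and works out how much of a text the most frequent lemmas cover. It is not taken from the vocabulary-tools repository: the `lemmas` list is made-up toy data and `token_coverage` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical toy data: one lemma per token, in text order.
lemmas = ["ὁ", "λόγος", "εἰμί", "ὁ", "θεός", "καί", "ὁ", "λόγος", "εἰμί", "ὁ"]

# Counter does the frequency counting in one step.
counts = Counter(lemmas)
total = sum(counts.values())

# A dictionary comprehension gives each lemma's share of all tokens.
share = {lemma: count / total for lemma, count in counts.items()}


def token_coverage(counts, n):
    """Fraction of tokens covered by the n most frequent lemmas."""
    covered = sum(count for _, count in counts.most_common(n))
    return covered / sum(counts.values())
```

With real data, `lemmas` would come from a lemmatised corpus rather than a literal list, and `token_coverage(counts, n)` could be tabulated for a range of `n`.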

There are a lot of little things you can do with just a couple of lines of Python and I’ve tried to avoid turning those into their own library of tiny functions. Instead, I’m compiling a little tutorial / cookbook as I go which you can read the beginnings of here:

https://github.com/jtauber/vocabulary-tools/blob/master/examples.rst

There’s still a fair bit more to move over (even going back 11 years to some stuff from 2008) but let me know if you have any feedback, questions, or suggestions. I’m generalising more and more as I go so expect some things to change dramatically.

If you’re interested in playing around with this stuff for corpora in other languages, let me know how I can help you get up and running. The main requirement is a tokenised and lemmatised corpus (assuming you want to work with lemmas, not surface forms, as vocabulary items) and also some form of chunking information. See https://github.com/jtauber/vocabulary-tools/tree/master/gnt_data for the GNT-specific stuff that would (at least partly) need to be replicated for another corpus.
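To make those requirements concrete, here is a minimal sketch of one kind of chunk-level coverage calculation, assuming the corpus has already been reduced to a mapping from chunk identifier to the lemmas in that chunk. The data layout, the chunk identifiers, and the names `chunks`, `known`, and `chunk_coverage` are all assumptions for illustration and are not the repository's actual format or API.

```python
from collections import Counter

# Hypothetical toy corpus: chunk id -> lemmas of the tokens in that chunk.
chunks = {
    "c1": ["ἐν", "ἀρχή", "εἰμί", "ὁ", "λόγος"],
    "c2": ["οὗτος", "εἰμί", "ἐν", "ἀρχή"],
    "c3": ["πᾶς", "διά", "αὐτός", "γίνομαι"],
}


def chunk_coverage(chunks, known, threshold):
    """Fraction of chunks in which at least `threshold` of the tokens are known lemmas."""
    readable = 0
    for tokens in chunks.values():
        known_share = sum(1 for lemma in tokens if lemma in known) / len(tokens)
        if known_share >= threshold:
            readable += 1
    return readable / len(chunks)


# Treat the three most frequent lemmas in the corpus as the "known" vocabulary.
frequencies = Counter(lemma for tokens in chunks.values() for lemma in tokens)
known = {lemma for lemma, _ in frequencies.most_common(3)}
```

Because the function only sees chunk identifiers and lemma lists, the same code would work whether the chunks are verses, sentences, or paragraphs, and for a corpus in any language.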

       
 

Initial Apostolic Fathers Text Complete

Exactly three months ago to the day, I announced that Seumas Macdonald and I were working on a corrected, open, digital edition of the Apostolic Fathers based on Lake. That initial work is now complete.

“Preparing an Open Apostolic Fathers” discussed the original motivation and the rather detailed process we went through.

The corrected raw text files are available on GitHub at https://github.com/jtauber/apostolic-fathers but I also generated a static site at https://jtauber.github.io/apostolic-fathers/ to browse the texts. The corrections will be contributed back to the OGL First1KGreek project.

The next step for us will be to lemmatise the text and there has already been some interest from others in getting the English translation corrected and aligned as well.

Recall that, while we were essentially correcting the Open Greek and Latin text, we used the CCEL text and that in Logos to identify particular places to look at in the printed text. We did this by lining up the CCEL, OGL and Logos texts and seeing where any of them disagreed. Those became the places we went back to, in multiple scans of the printed Lake, to make our corrections to the base text we started with from OGL.

How often did each of those three “witnesses” disagree? Here are some stats. A = CCEL, B = OGL, C = Logos. And so AB/C is where CCEL and OGL agreed against Logos, AC/B is where CCEL and Logos agreed against OGL, A/BC is where OGL and Logos agreed against CCEL, and A/B/C is where all three disagreed.

FILE    AB/C    AC/B    A/BC    A/B/C
001     1.29%   1.15%   7.97%   0.32%
002     0.76%   1.20%   3.39%   0.37%
003     1.58%   2.20%   4.97%   0.28%
004     0.57%   1.33%   7.01%   0.28%
005     1.05%   1.79%   6.21%   0.84%
006     0.88%   1.18%   7.54%   0.69%
007     0.39%   0.88%   3.34%   0.20%
008     0.79%   0.87%   5.41%   0.44%
009     0.25%   1.53%   2.68%   0.38%
010     0.44%   4.05%   4.36%   0.25%
011     0.36%   1.86%   4.23%   0.14%
012     0.92%   1.15%   5.59%   0.43%
013     1.29%   0.90%   6.08%   0.34%
014     1.25%   0.34%   4.91%   0.08%
015     0.96%   0.65%   6.74%   0.50%
TOTAL   1.11%   1.12%   5.98%   0.34%
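The four categories above can be sketched in a few lines of Python. This is only an illustration of the classification logic, assuming the three texts have already been aligned into (CCEL, OGL, Logos) triples; the function names are hypothetical, and whether the real percentages were computed per token or per some other unit is not something this sketch settles.

```python
from collections import Counter


def classify(a, b, c):
    """Label one aligned position by which witnesses agree (A=CCEL, B=OGL, C=Logos)."""
    if a == b == c:
        return "ABC"      # all three agree, nothing to check
    if a == b:
        return "AB/C"     # CCEL and OGL agree against Logos
    if a == c:
        return "AC/B"     # CCEL and Logos agree against OGL
    if b == c:
        return "A/BC"     # OGL and Logos agree against CCEL
    return "A/B/C"        # all three disagree


def disagreement_rates(aligned):
    """Percentage of aligned positions falling into each disagreement category."""
    tally = Counter(classify(a, b, c) for a, b, c in aligned)
    total = len(aligned)
    return {
        label: 100 * tally[label] / total
        for label in ("AB/C", "AC/B", "A/BC", "A/B/C")
    }
```

Any position not labelled "ABC" is a place worth checking against the printed text.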

One can immediately see CCEL diverged the most from the others (it had considerable lacunae for a start). The figures for Logos diverging (AB/C) are probably too high because of a systematic error we only noticed after work had started: a middle dot was often erroneously added after eta. This ultimately didn’t affect anything other than perhaps flagging places Seumas and I had to check that we otherwise wouldn’t have needed to.

But at the end of the day, how much did we change? How much of the OGL original remained? How similar was our result to the text on CCEL? And for a bit of fun, how often were my first correction and Seumas’s first correction the same as what we ended up with after consensus was achieved? Here’s the breakdown by work, with each column showing agreement with our final text (JT and SM being my and Seumas’s first corrections):

FILE    CCEL     OGL      JT        SM
001     91.27%   99.02%   99.85%    99.91%
002     96.02%   98.90%   99.77%    99.90%
003     94.58%   97.63%   99.77%    99.60%
004     92.42%   98.48%   99.91%    100.00%
005     92.32%   98.32%   99.79%    99.89%
006     91.28%   98.82%   98.82%    99.80%
007     96.07%   98.92%   99.90%    99.90%
008     93.89%   99.30%   100.00%   99.91%
009     96.82%   99.75%   98.60%    99.87%
010     94.94%   96.27%   99.87%    99.68%
011     95.04%   98.54%   99.77%    99.91%
012     93.86%   98.78%   99.87%    99.90%
013     93.15%   99.20%   99.87%    99.83%
014     94.90%   99.62%   99.92%    99.74%
015     92.69%   99.16%   99.96%    99.62%
TOTAL   93.32%   98.97%   99.83%    99.84%

You just beat me Seumas :-)

       
 

More Thoughts on Different Morphological Analyses

In “Five Types of Morphological Analysis” I outlined five distinct ways of approaching morphological (or potentially any linguistic) analysis. In support of some of these, I have some additional examples from a pair of papers I’m reading and a conference I just attended.

Baayen et al (2018) (co-written by Jim Blevins, my undergraduate advisor from 25 years ago and still a mentor), in describing their own word-based, discriminative approach to morphology, contrast it with both widespread morpheme-based approaches and increasingly popular exponent-focused realizational approaches. I’ll leave a discussion of these different approaches to another time, but what is relevant to my previous post is this comment:

[morpheme-based and realizational analyses] may be of practical value, especially in the context of adult second language acquisition. It is less clear whether the corresponding theories, whose practical utility derives ultimately from their pedagogical origins, can be accorded any cognitive plausibility.

Note the distinction they are making between analyses of practical (adult SLA, pedagogical) value and cognitive plausibility.

Again, it’s not the point of this post to describe (much less assess) their arguments for why morphemes and exponents might not be cognitively plausible and what the alternative is, merely that they acknowledge certain analyses might be useful for pedagogical purposes independent of their cognitive plausibility (thereby agreeing with my psychological vs pedagogical distinction).

Perhaps cognitive would be another word for my psychological category.

They furthermore suggest:

Constructional schemata, inheritance, and mechanisms spelling out exponents are all products of descriptive traditions that evolved without any influence from research traditions in psychology. As a consequence, it is not self-evident that these notions would provide an adequate characterization of the representations and processes underlying comprehension and production. It seems particularly implausible that children would be motivated to replicate the descriptive scaffolding of [these] theoretical accounts…

Terms like “descriptive traditions” and “descriptive scaffolding of theoretical accounts” refer to what I had in mind with my synchronic category of analysis. Perhaps descriptive and theoretical would be other words for that category.

In a related paper, Baayen et al (2019), they talk about three possible responses to the challenge posed to linguistics (or at least linguistically-informed natural language processing) by the success of machine learning.

Again, it’s outside the scope of this post to get into those details, but in short, their suggested possible responses are: (1) admit defeat, (2) claim the hidden layers reflect traditional linguistic representations, (3) rethink the nature of language processing in the brain. They go on to explore the third option in the context of morphology and the lexicon, stating that

the model that we propose here brings together several strands of research across theoretical morphology, psychology, and machine learning.

Note that this is essentially a claim that it’s possible to reconcile at least three of the different approaches I’ve outlined: the synchronic/descriptive/theoretical, the cognitive/psychological, and the algorithmic/machine-learning.

(Missing here is any reference to diachrony or pedagogy, which I think they would agree are approaches distinct from those they are attempting to unify.)

Now last week, I attended the Society for Computation in Linguistics meeting, coinciding with the big annual meeting of the Linguistic Society of America. One of the goals of SCiL is to build bridges from the NLP community to the linguistics community so it was of particular interest to me.

But again one of the big things that came up in multiple talks was distinct approaches: the approach of the NLP practitioners, often referred to as the engineering approach, and that of the linguists, often referred to as the scientific approach. At their most self-deprecating, the NLP practitioners confessed their over-obsession with metrics on “tasks” and lack of regard for the underlying scientific “questions”. Noah Smith, in fact, joked that NLPers can annoy linguists by asking what their “task” is and linguists can annoy NLPers by asking what their “question” is.

I mention this as yet another example of a difference in approach and perspective.

Diachrony didn’t feature at all in either the Baayen/Blevins papers or at SCiL, but certainly my other distinctions seem more broadly confirmed (albeit with alternative terminology). So I think we have:

  • algorithmic / engineering / task-oriented
  • diachronic
  • synchronic / descriptive / theoretical
  • psychological / cognitive
  • pedagogical

Now this is not to say some of these approaches can’t be combined (as shown in the Baayen/Blevins papers). But even when one is attempting to combine some of them, I think it’s useful to acknowledge (a) the multiple approaches being combined; and (b) other approaches with distinct goals and evaluation procedures that aren’t being considered but which may still be valuable in other contexts.

At the end of the day, I’m trying to turn arguments of the form “that isn’t a good theory/description/implementation/explanation of morphology” into a more nuanced “it probably isn’t good for this but it might be good for that”.

References

Baayen, R. H., Chuang, Y. Y., and Blevins, J. P. (2018). Inflectional morphology with linear mappings. The Mental Lexicon, 13 (2), 232-270.

Baayen, R. H., Chuang, Y. Y., Shafaei-Bajestan, E., and Blevins, J. P. (2019). The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in (de)composition but in linear discriminative learning. Complexity, 2019, 1-39.

       
 

Five Types of Morphological Analysis

People talking about morphological analyses can often talk past each other because they have different purposes in mind. Here’s an initial attempt to outline five possibly distinct notions one might be referring to.

I’m tentatively labelling them:

  • algorithmic
  • diachronic
  • synchronic
  • psychological
  • pedagogical

although the labels matter less than being clear about the distinction.

Algorithmic means I can go from an inflected form to a lemma + morphosyntactic properties (or vice versa) efficiently on a computer. The way this is achieved might not be psychologically plausible or historically accurate but it can be implemented in software to get the job done.

Diachronic means I can explain (or at least speculate) how the inflected form came about: what the roots are, what grammaticalisation took place, what sound changes explain seeming irregularities, etc.

Synchronic means I can describe the inflected forms without recourse to historical data or reconstruction. This might focus on perspicuity rather than computational efficiency or psychological plausibility.

Psychological means the analysis is consistent with what I think is (or was) going on in the minds of native speakers. Some people may equate this with synchronic analyses but I think you can have a psychologically implausible yet still descriptively adequate synchronic analysis.

Pedagogical means a useful way of explaining it to students. This may be diachronic, but might be more synchronic (whether psychologically plausible or not).

Analyses can obviously be compatible with more than one of these. But I think it’s helpful to be clear what the goals of any morphological description are. If the goal is to lemmatise and tag a new text, then psychological or historical plausibility, or analytical or pedagogical clarity might not matter. If one’s goal is a diachronically-informed analysis to help students, it should be clear why an otherwise perfectly adequate morphological parser might not be producing useful information.
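As a deliberately crude illustration of the purely algorithmic sense, the sketch below maps an inflected form to its possible lemma and morphosyntactic properties by plain table lookup. The table is a tiny hand-made toy (not drawn from any real lexicon or parser) and `analyse` is a hypothetical name; the point is only that something like this can get the job done while explaining nothing historically, psychologically, or pedagogically.

```python
# Hand-made toy table for illustration only; a real system would need far more data.
ANALYSES = {
    "ἔλαβον": [
        ("λαμβάνω", "aorist active indicative, 1st singular"),
        ("λαμβάνω", "aorist active indicative, 3rd plural"),
    ],
    "λήμψομαι": [
        ("λαμβάνω", "future middle indicative, 1st singular"),
    ],
}


def analyse(form):
    """Return every (lemma, properties) pair recorded for an inflected form."""
    return ANALYSES.get(form, [])
```

A lookup like this lemmatises and tags correctly for the forms it knows, yet says nothing about why ἔλαβον and λήμψομαι look so different from λαμβάνω.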

Those who have been following my Tour of Greek Morphology know I’ve tried to be careful distinguishing, for example, historical explanations from how I think native speakers internalise(d) word forms, or how students should learn them.

I still come across a lot of people who think the “modern” way of understanding morphology is learning the “morphemes” and rules, not memorising paradigms. Besides getting the history somewhat wrong, this is also making the mistake of conflating these different types of analyses and not recognising that one type of analysis might be perfectly valid for one purpose but not another.

Here’s a fun game to play: how would you analyse/explain the form λαμβάνω? Or ἔλαβον (especially when 3rd plural) or λήμψομαι? Or μαθητής vs μαθητοῦ? Or ἔδωκεν vs δέδωκα vs δός?

Maybe I haven’t quite nailed the labels yet. Maybe there are further distinctions to draw. I welcome people’s input.

       
 

Preparing an Open Apostolic Fathers

I’m working with Seumas Macdonald on an open, corrected digital edition of the Apostolic Fathers based on Lake.

Seumas Macdonald asked me a few weeks ago what it would take to expand some of our text and vocab ordering experiments to the text of the Apostolic Fathers (we’re both desirous of more comprehensible input for Greek learners).

My reply was that we first of all needed to get a good open text and then lemmatise it. I thought the “get a good open text” would be trivial but it turned out not to be.

I asked around without much positive response. I found HTML versions of the Lake texts on the Christian Classics Ethereal Library (CCEL) website but they turned out to be problematic quality-wise (see below).

It then occurred to me to check what was in the Perseus Digital Library. It only had the Epistle of Barnabas but the related First 1000 Years of Greek at the Open Greek and Latin Project had done the rest.

The Perseus/OGL texts were considerably better than the CCEL ones, but were still not without problems. It was clear that the two collections had been produced independently, however, which is important for what follows.

I’m almost certain the CCEL texts were keyed in. There is haplography and dittography galore! The haplography even corresponds almost perfectly to line breaks in the printed Lake editions I looked at.

The Perseus/OGL texts, on the other hand, are the results of OCR with some manual correction.

I wrote some code to extract both the CCEL and Perseus/OGL texts and put them in a comparable format. I then wrote a script to align the two. My thinking was to go through all the places where the two disagreed, check the printed Lake and correct the Perseus/OGL text accordingly.
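For a flavour of what such an alignment step can look like, here is a minimal sketch using Python's standard-library difflib on two token lists. It is not the project's actual script: the function name `disagreements` is hypothetical, and it assumes both texts have already been tokenised into comparable word lists.

```python
from difflib import SequenceMatcher


def disagreements(tokens_a, tokens_b):
    """Yield (position in a, tokens from a, tokens from b) wherever the two texts differ."""
    # autojunk=False stops difflib from ignoring very frequent tokens in long texts.
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            yield a_start, tokens_a[a_start:a_end], tokens_b[b_start:b_end]
```

Each yielded item marks a place to check against the printed edition; a skipped line (haplography) would show up as tokens present in one list with an empty list on the other side.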

I decided to throw the Lake text from Logos into the mix as well, not as an input to the correction itself but merely as another “edition” to flag differences with (to then check with the printed Lake).

Thus began a project Seumas and I have been working on the last few weeks. Once differences in any of the three texts are identified, they are flagged for review and Seumas and I independently look at the printed Lake and correct the Perseus/OGL base text.

If our corrections disagree, we continue to work on them until we come to consensus. This three-way comparison followed by two-way independent correction is proving to work very well (although it’s a lot of work!)

All the code, the source texts (except Logos), and work-in-progress are available at

https://github.com/jtauber/apostolic-fathers

and you can follow the status in the README. There are also more detailed notes on the whole process.

Once the candidate versions of all the texts are published, I’ll do another post just with some interesting statistics on the nature of errors in the CCEL, Perseus/OGL, and Logos texts. The “scribal errors” in the CCEL text are particularly fascinating but even some of the Perseus/OGL OCR errors will be worth writing about.

Seumas and I will then contribute back the corrections to CCEL, Perseus/OGL, and Logos. Hopefully our texts will also be featured on the Biblical Humanities Dashboard as the go-to open digital text of the Apostolic Fathers (so no one else has to repeat this effort).

Finally, we’ll start the process of lemmatisation so the Apostolic Fathers can be included in our open learning materials.