
Comparing Analyses from Herodotus and more...

Comparing Analyses from Herodotus

An analysis I did of a couple of chapters of Herodotus looks like it might be an interesting example to use for various treebanking approaches—both in terms of how things are structured as well as how they are visualised.

As the last assignment for my Postgraduate Diploma in Ancient Greek, I had to write a brief commentary on Herodotus 2.35–36, which catalogs (with hasty generalisations galore) differences between Egypt and the rest of the world. The catalog consists of a series of statements of the form “Egyptians do THIS whereas everyone else does THAT” or “[In Egypt] the men do THIS and the women do THAT [as opposed to the other way around like everywhere else]”.

In his commentary, Lloyd notes that this sort of catalog could be quite monotonous but that Herodotus avoids this through “skilful stylistic variation”. My commentary spent a decent proportion of its short word count digging deeper into this variation.

Quite coincidentally, Greg Crane recently sent me some examples of student treebanking in the context of how to compare analyses, and they happened to be of Herodotus 2.35. They differ from each other and from my own way of thinking about the sentences. Note that these aren’t difficult or ambiguous sentences, though! The syntax is easy; I just don’t think most analysis conventions and visualisation tools do a great job of capturing what’s going on.

In my assignment, I started off presenting a canonical example of the construction and it’s that example that I want to show here. The original sentence is

τὰ ἄχθεα οἱ μὲν ἄνδρες ἐπὶ τῶν κεφαλέων φορέουσι, αἱ δὲ γυναῖκες ἐπὶ τῶν ὤμων.

But I started off considering these sentences:

οἱ ἄνδρες τὰ ἄχθεα ἐπὶ τῶν κεφαλέων φορέουσι

αἱ γυναῖκες τὰ ἄχθεα ἐπὶ τῶν ὤμων φορέουσι

The verb (in the present, as always in these comparisons), direct object, and prepositional phrase construction are identical. What is being contrasted is how the particular location (the complement in the prepositional phrase) varies with the subject.

Herodotus sets up the contrast with μέν and δέ postpositives.

οἱ μὲν ἄνδρες τὰ ἄχθεα ἐπὶ τῶν κεφαλέων φορέουσι

αἱ δὲ γυναῖκες τὰ ἄχθεα ἐπὶ τῶν ὤμων φορέουσι

He then alters the “constants” in the comparison, topicalising the direct object and eliding repetition of the verb. This results in:

τὰ ἄχθεα
        οἱ … ἄνδρες
            ἐπὶ τῶν κεφαλέων φορέουσι
        αἱ … γυναῖκες
            ἐπὶ τῶν ὤμων [φορέουσι]

The above was an indented structure I manually constructed for my commentary. It’s not machine actionable and is missing a lot but I think it does a decent job of capturing some of what’s going on. It makes clear:

  • the topicalisation of τὰ ἄχθεα
  • the μέν and δέ construction as a whole
  • the elision of the verb

It is these three properties that I think make this a particularly interesting example.
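To show what “machine actionable” might look like, here is a minimal Python sketch of the indented structure above. The node shapes (“topic”, “coordination”, “conjuncts”) and the helper function are my own hypothetical conventions for illustration, not an existing treebank format:

```python
# A minimal, machine-actionable sketch of the indented structure above.
# The node shapes used here are hypothetical conventions, not a standard.

sentence = {
    "topic": "τὰ ἄχθεα",  # topicalised direct object, shared by both conjuncts
    "coordination": {
        "particles": ["μέν", "δέ"],  # treated as equal partners tagging the contrast
        "conjuncts": [
            {"subject": "οἱ ἄνδρες",
             "pp": "ἐπὶ τῶν κεφαλέων",
             "verb": "φορέουσι"},
            {"subject": "αἱ γυναῖκες",
             "pp": "ἐπὶ τῶν ὤμων",
             "verb": None},  # ellipsis: verb supplied from the first conjunct
        ],
    },
}

def resolve_verb(conjunct, coordination):
    """Fill in an elided verb from the first conjunct that has one."""
    if conjunct["verb"] is not None:
        return conjunct["verb"]
    return next(c["verb"] for c in coordination["conjuncts"] if c["verb"])

second = sentence["coordination"]["conjuncts"][1]
print(resolve_verb(second, sentence["coordination"]))  # φορέουσι
```

Even a toy representation like this makes the three properties queryable: the topic is explicit, the μέν/δέ coordination is a single node rather than one particle governing the other, and the ellipsis is recoverable rather than silently supplied.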

Here’s the first student treebank analysis:

The student supplies the elided verb (although it’s not co-referenced in any way) but not the elided direct object. There’s no indication of the topicalisation.

It doesn’t quite seem right to me to say the two clauses are conjoined by δέ with the μέν hanging off the verb. I think of the μέν and δέ as equal partners in this construction and as tagging the two things being compared.

Here’s the second student treebank analysis:

This analysis seems a lot more confused. The coordination is shown as being done with the μέν this time, with the δέ dangling. The prepositional phrases are shown as governed by the subjects rather than the verb.

To be clear, I’m not trying to critique the students so much as raise questions for analysis conventions and visualisation, especially for reading environments and querying.

Again, this sentence (like the others in Herodotus 2.35–36) isn’t difficult. I doubt either student had any trouble understanding it. I just think it wasn’t clear how to adequately model their understanding of the structure.

I think elision and conjunction are the biggest issues in most analyses like this and good structures and visualisation for handling those will go a long way to making treebanks more consistent and more useful.

Using this sentence from Herodotus as an example, what are better ways of making sure analyses both enable useful queries and can be visualised in more perspicuous ways?

UPDATE: perhaps “coordination” would be better than conjunction as one of the “biggest issues” and I think “theticals” (HT: Jonathan Robie) could be added to that list to make the triad: elision, coordination, and theticals.

UPDATE 2: I also need to stop saying elision when I mean ellipsis! I’m spending too much time with morphophonology and not enough time with syntax :-)


Headed to Germany Next Week

Next week I’m headed to Germany for a whirlwind trip to Göttingen, Heidelberg, and Leipzig to share and discuss ideas with other scholars.

I’ll be speaking at a Global Philology workshop in Göttingen, attending a Digital Classics conference in Heidelberg (where I’ll also have to sit the final exam for my Postgraduate Diploma in Greek if I can find someone to invigilate), and then spending a few days in Leipzig meeting with the team at the Humboldt Chair of Digital Humanities at Universität Leipzig.

I’m very excited to now be working more closely with the digital classics community and meeting many people whose names I’ve known for a while.

I’m also thrilled to visit Leipzig again after more than ten years and get my fill of musical history there. I’m also hoping for a bit of a physics history fill too given the importance of both Göttingen and Leipzig in the history of quantum mechanics.


Handling Morphological Ambiguity

On my now page, I currently list “finalising an improved set of morphology tags to use” under Medium Term. As I find myself sometimes having to clarify the motivation for and state of this, I thought I’d share what I just wrote in the Biblical Humanities Slack.

Firstly, some background on previous notes…

Back in 2014, I wrote down some notes, Proposal for a New Tagging Scheme, after discussions with Mike Aubrey. In 2015, after some discussions with Emma Ehrhardt, I wrote down Handling Ambiguity. Then in February 2017, after discussion on the Biblical Humanities Slack, I put forward a concrete Proposal for Gender Tagging.

Here’s a slightly cleaned up version of what I wrote in Slack…

All I’ve done is propose a way of representing certain single-feature ambiguities (especially gender but also nom/acc in neuter). I have not proposed anything for multi-feature ambiguities nor have I actually DONE any work that uses these proposals.

Multi-feature ambiguities at the morphology level (1S vs 3P, GS vs AP, etc) are rarely ambiguous at the syntactic or semantic level for very good reason: the syntactic/semantic-level disambiguation is what allows one to tolerate the ambiguity at the morphology level (one reason that, as a cognitive scientist, I quite like discriminative models of morphology).

But if I continue with my goal to produce a purely morphological analysis, without “downward” disambiguation, then I want to be able to provide a way of representing form over function AND representing ambiguity.

I want to stress again that I think nom vs acc in the neuter, or gender in genitive plurals, is a DIFFERENT kind of ambiguity than 1S vs 3P or GS vs AP. For these multi-feature ambiguities (or what my wiki page calls extended syncretism, although I’m not sure I really like that term) it may come down to just providing a disjunction of codes, e.g. GSF∨APF.
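To make the disjunction idea concrete, here is a minimal sketch (my own illustration, not part of any of the proposals mentioned above) of carrying a code like GSF∨APF through to later disambiguation:

```python
# A sketch of representing a multi-feature ambiguity as a disjunction of
# parse codes, resolved later by intersecting with codes licensed by
# syntax or semantics. The set-based representation is illustrative only.

def parse_disjunction(code):
    """Split a code like 'GSF∨APF' into the set of its alternatives."""
    return frozenset(code.split("∨"))

def disambiguate(morph_code, licensed):
    """Intersect morphology-level alternatives with downstream-licensed
    codes; a singleton result means the ambiguity is resolved."""
    return parse_disjunction(morph_code) & licensed

print(disambiguate("GSF∨APF", {"APF", "NPF"}))  # frozenset({'APF'})
```

The nice property of keeping the disjunction in the data is that the case where the intersection is still not a singleton remains representable, rather than being forced to a premature choice.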

Also just in terms of motivation: clearly a morphological analysis that ignores downward disambiguation from syntax or semantics is unhelpful (and potentially even misleading) for exegesis and so a lot of use cases wouldn’t want to do it. HOWEVER, my goal is threefold:

(1) I want to have a way to model the output of automated morphological analysis systems prior to either automated or human downward disambiguation;
(2) as someone studying how morphology works from a cognitive point of view, I care about modelling how ambiguity is resolved at different levels and so want a model that can handle that;
(3) because a student is quite likely to be confronted with this ambiguity, it needs to be in my learning models. I want to be able to search for cases where 1S vs 3P ambiguity or GSF vs APF ambiguity or NSN vs ASN ambiguity is resolved by syntax or semantics so they can be illustrated to the student. I want to know, for a given passage, whether such ambiguity exists so learning can be appropriately scaffolded. And note that, for me, this extends to ambiguity resolved by just accentuation as well (which is another potentially useful thing to model for various applications).

In conclusion, I want to again state I’m not at all against a functional, fully-disambiguated parse code existing. I have NEVER proposed REPLACING the existing tagging schemes. I just want to add a new column useful for the reasons I’ve listed above in (1)–(3) and produce new resources that perhaps ONLY use that purely morphological parse code.

Finally I want to note there’s an important difference between what we put in our data and how we present it to users. People should not assume that when I’m describing codes to use in data that I’m suggesting that’s what end-users should see.

UPDATE: one topic I didn’t discuss here is ambiguity in endings that is resolved by knowledge of the stems or principal parts. For example, without a lexicon, there are ambiguities between imperfect and aorist that are easily resolved with additional lexical-level information.
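As a sketch of what that lexical-level resolution might look like, consider a secondary ending that is ambiguous between imperfect and aorist until the stem is looked up. The stems, codes, and mini-lexicon below are purely illustrative:

```python
# A sketch of ending-level ambiguity (imperfect vs aorist) resolved with
# lexical information. The toy lexicon and stem segmentation are
# hypothetical, for illustration only.

# Without a lexicon, a secondary ending like -ον on an augmented stem
# could be imperfect or second aorist, and 1S or 3P.
ENDING_ALTERNATIVES = {"ον": {"IAI.1S", "IAI.3P", "AAI.1S", "AAI.3P"}}

# A toy lexicon distinguishing the present and aorist stems of βάλλω.
LEXICON = {"βαλλ": {"tense": "I"}, "βαλ": {"tense": "A"}}

def resolve(stem, ending):
    """Keep only the alternatives consistent with the stem's tense."""
    tense = LEXICON[stem]["tense"]
    return {code for code in ENDING_ALTERNATIVES[ending] if code.startswith(tense)}

print(resolve("βαλ", "ον"))  # the two aorist alternatives remain
```

Note that even after the lexical lookup, the 1S vs 3P ambiguity remains, which is exactly the kind of residue that then needs syntactic or semantic resolution.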


An Initial Reboot of Oxlos

In a recent post, Update on LXX Progress, I talked about the possibility of putting together a crowd-sourcing tool to help share the load of clarifying some parse code errors in the CATSS LXX morphological analysis. Last Friday, Patrick Altman and I spent an evening of hacking and built the tool.

Back at BibleTech 2010, I gave a talk about Django, Pinax, and some early ideas for a platform built on them to do collaborative corpus linguistics. Patrick Altman was my main co-developer on some early prototypes and I ended up hiring him to work with me at Eldarion.

The original project was called oxlos after the betacode transcription of the Greek word for “crowd”, a nod to “crowd-sourcing”. Work didn’t continue much past those original prototypes in 2010 and Pinax has come a long way since so, when we decided to work on oxlos again, it made sense to start from scratch. From the initial commit to launching the site took about six hours.

At the moment there is one collective task available—clarifying which of a set of parse codes is valid for a given verb form in the LXX—but as the need for others arises, it will be straightforward to add them (and please contact me if you have similar tasks you’d like added to the site).

If you’re a Django developer, you are welcome to contribute. The code is open source under an MIT license, and we have lots we can potentially add beyond merely different kinds of tasks.

If your Greek morphology is reasonably strong, I invite you to sign up and help out with the LXX verb parsing task.

It’s probably not that relevant anymore, but you can watch the original 2010 talk below. I’d skip past the Django / Pinax intro and go straight to about 37:00 where I start to discuss the collective intelligence platform.


Analysing the Verbs in Nestle 1904

The last couple of weeks, I’ve been working on getting my greek-inflexion code working on Ulrik Sandborg-Petersen’s analysis of the Nestle 1904. The first pass of this is now done.

The motivation for doing this work was (a) to expand the verb stem database and stemming rules; (b) to be able to annotate the Nestle 1904 with additional morphological information for my adaptive reader and some similar work Jonathan Robie is doing.

My usual first step when dealing with a new text is to automatically generate as many new entries in the lexicon / stem-database as I can (see the first step in Update on LXX Progress).

In some cases, this is just a new stem for an existing verb because of a new form of an already known verb. But sometimes it’s an entirely new verb.

I thought the Nestle 1904 would be considerably easier than the LXX because the text is so similar but there were numerous challenges that arose.

It became clear very quickly that there were considerable differences in lemma choice between the Nestle 1904 and the MorphGNT SBLGNT. This didn’t completely surprise me: I’ve spent quite a bit of time cataloging lemma choice differences between lexical resources and there are considerable differences even between BDAG and Danker’s Concise Lexicon.

But even these aside, there were 7,743 out of 28,352 verbs mismatching after my code had already done its best to automatically fill in missing lexical entries and stems.

A. The normalisation column in Nestle 1904 doesn’t normalise capitalisation, clitic accentuation, or moveable nu, all of which greek-inflexion assumes have been done.

Capitalisation alone accounted for 1,042 mismatches. Clitic accentuation alone accounted for 1,008 mismatches. Moveable nu alone accounted for 4,153 mismatches.
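For illustration, here is a deliberately naive sketch of the kind of normalisation involved. It only folds case and strips the simplest moveable-nu cases; real handling (and clitic de-accentuation in particular, which is omitted entirely here) needs far more care, and this is not the actual greek-inflexion code:

```python
# A naive normalisation sketch: fold capitalisation and strip moveable nu
# in the simplest cases. Clitic accentuation is deliberately not handled.
import unicodedata

def normalise(form):
    # Fold capitalisation (e.g. sentence-initial capitals).
    form = form.lower()
    # Strip moveable nu after -σι or -ε (a crude approximation that
    # ignores many real cases and edge conditions).
    if form.endswith("σιν") or form.endswith("εν"):
        form = form[:-1]
    return unicodedata.normalize("NFC", form)

print(normalise("Φορέουσιν"))  # φορέουσι
```

Comparing forms modulo a function like this is what collapses the thousands of spurious mismatches into the genuinely interesting differences.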

B. Nestle 1904 systematically avoids assimilation of συν and ἐν preverbs.

Taken alone, these accounted for 91 mismatches. Mapping prior to analysis by greek-inflexion is somewhat of a hack that I’ll address in later passes.
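The mapping hack might look something like the following sketch. The rule list is illustrative (and incomplete), not the actual greek-inflexion code:

```python
# A sketch of pre-analysis mapping: rewriting unassimilated συν/ἐν
# spellings to their assimilated forms. Illustrative rules only.

ASSIMILATION = [
    ("συνλ", "συλλ"),  # e.g. συνλαμβάνω → συλλαμβάνω
    ("συνμ", "συμμ"),
    ("συνπ", "συμπ"),
    ("συνβ", "συμβ"),
    ("συνφ", "συμφ"),
    ("συνκ", "συγκ"),
    ("συνγ", "συγγ"),
    ("συνχ", "συγχ"),
    ("ἐνκ", "ἐγκ"),
    ("ἐνγ", "ἐγγ"),
    ("ἐνχ", "ἐγχ"),
]

def assimilate(form):
    """Rewrite an unassimilated preverb at the start of a form."""
    for old, new in ASSIMILATION:
        if form.startswith(old):
            return new + form[len(old):]
    return form

print(assimilate("συνλαμβάνω"))  # συλλαμβάνω
```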

C. There were 8 spelling differences in the endings which required an update to stemming.yaml:

  • κατασκηνοῖν (PAN) in Matt 13:32
  • κατασκηνοῖν (PAN) in Mark 4:32
  • ἀποδεκατοῖν (PAN) in Heb 7:5
  • φυσιοῦσθε (PMS-2P) in 1Cor 4:6
  • εἴχαμεν (IAI.1P) in 2John 1:5
  • εἶχαν (IAI.3P) in Mark 8:7
  • εἶχαν (IAI.3P) in Rev 9:8
  • παρεῖχαν (IAI.3P) in Acts 28:2

D. The different parse code scheme (Robinson’s vs CCAT) had to be mapped over.

This should have been straightforward but voice in the formal morphology field sometimes seemed to be messed up (which I corrected as part of G. below).
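A sketch of the kind of scheme mapping involved, for verbs only: the Robinson-style input shape assumed here is simplified, the dotted output style just mirrors the codes used in this post (e.g. IAI.1P), and this is not the actual conversion code:

```python
# A simplified sketch of mapping a Robinson-style verb tag like
# "V-PAI-3S" to the dotted style used in this post ("PAI.3S").
# Real Robinson tags have more shapes (participles, etc.) than this.

def robinson_to_dotted(tag):
    parts = tag.split("-")
    assert parts[0] == "V", "only simple verb tags handled in this sketch"
    tense_voice_mood, person_number = parts[1], parts[2]
    return f"{tense_voice_mood}.{person_number}"

print(robinson_to_dotted("V-PAI-3S"))  # PAI.3S
```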

E. There were 182 differences (type not token) in lemma choice, mostly active vs middle forms.

See for the full list.

F. There were a small handful of per-form lemma corrections I made:

  • ἐπεστείλαμεν AAI.1P: ἀποστέλλω → ἐπιστέλλω
  • ἀγαθουργῶν PAP.NSM: ἀγαθοεργέω → ἀγαθουργέω
  • συνειδυίης XAP.GSF: συνοράω → σύνοιδα
  • γαμίσκονται PMI.3P: γαμίζω → γαμίσκω

G. Finally, I made 69 (type not token) parse code changes.

See for the list.

With all this, the greek-inflexion code (on a branch not yet pushed at the time of writing) can correctly generate all the verbs in the Nestle 1904 morphology.

There are definitely improvements I need to make in a second pass and at least a small number of corrections that I think need to be made to the Nestle 1904 analysis.

But it’s now possible for me to produce an initial verb stem annotation for the Nestle 1904 and I’m a step closer to a morphological lexicon with broader coverage.

UPDATE: I’ve added some more parse corrections but not yet updated the gist.
