
 




Lexical Dispersion in the Greek New Testament Via Gries's DP

Measures of dispersion are interesting to apply to a corpus because they tell you whether a word is distributed across parts of the corpus as expected or concentrated more in just some parts. I thought I’d play around with Gries’s DP as a measure of dispersion on the SBLGNT lemmas.

There are lots of measures of dispersion but Stefan Th. Gries’s is perhaps the simplest (see [1] for a detailed survey of lots of different measures as well as the original definition of his own).

Here it is in Python for lemmas:

dp = sum(abs((p[part] / t) - (lp[lemma][part] / l[lemma])) for part in p) / 2

where:

  • t is the total count of words in the corpus (the sum of all the values in p)
  • p is a dictionary mapping each corpus part to the count of words in that part
  • l is a dictionary mapping each lemma to the count of that lemma in the corpus
  • lp is a dictionary of dictionaries mapping lemmas and parts to the count of the lemma in that part (which needs to be 0 for any part where the lemma does not occur)

but see [1] for some simple worked examples.
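As a minimal sketch of the one-liner above in runnable form: the function name `gries_dp` and the toy counts are made up for illustration (they are not SBLGNT data), and `.get(part, 0)` is used so that parts where a lemma never occurs count as zero.

```python
def gries_dp(lemma, p, l, lp):
    """Gries's DP for one lemma: 0 means spread exactly in proportion
    to the part sizes; values near 1 mean heavily concentrated."""
    t = sum(p.values())  # total number of words in the corpus
    return sum(
        abs(p[part] / t - lp[lemma].get(part, 0) / l[lemma])
        for part in p
    ) / 2


# Made-up toy corpus with three parts of 50, 30 and 20 words.
p = {"A": 50, "B": 30, "C": 20}
l = {"even": 10, "lumpy": 10}
lp = {
    "even": {"A": 5, "B": 3, "C": 2},  # mirrors the part sizes exactly
    "lumpy": {"C": 10},                # every occurrence in the smallest part
}

for lemma in l:
    print(f"{gries_dp(lemma, p, l, lp):.4f} {lemma}")
```

The "even" lemma gets a DP of 0 because its share in each part matches that part's share of the corpus; the "lumpy" lemma gets 0.8.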

One thing Gries doesn’t talk about (email me if you know of any discussion of this) is how to handle very low frequency words as they’ll dominate the high DP values.
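To see why low-frequency words end up at the top, it may help to note that for a hapax the formula collapses: every part other than the one containing the word contributes its own share of the corpus, and the part containing the word contributes one minus its share, so the DP comes out as 1 minus that part's share. The sketch below checks this on hypothetical part sizes (the names and numbers are illustrative only).

```python
def gries_dp_from_counts(part_sizes, lemma_counts):
    """Gries's DP given part sizes and one lemma's per-part counts."""
    t = sum(part_sizes.values())
    n = sum(lemma_counts.values())
    return sum(
        abs(part_sizes[part] / t - lemma_counts.get(part, 0) / n)
        for part in part_sizes
    ) / 2


# Hypothetical corpus: one long, one medium and one short part.
part_sizes = {"long": 900, "medium": 90, "short": 10}
total = sum(part_sizes.values())

for part, size in part_sizes.items():
    dp = gries_dp_from_counts(part_sizes, {part: 1})  # a hapax in this part
    print(part, round(dp, 4), round(1 - size / total, 4))
```

On this reading, a hapax in the shortest part always has the highest possible DP, and all hapaxes in the same part share the same value.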

Using books as the parts, here are the top 10 most evenly dispersed lemmas in the GNT:

0.0466 ὁ
0.1085 εἰς
0.1154 καί
0.1178 ὅς
0.1250 εἰμί
0.1358 ποιέω
0.1382 γίνομαι
0.1385 πολύς
0.1395 μετά
0.1420 μή

Here are the top 10 least evenly dispersed lemmas (including all frequencies, even hapax legomena):

0.9984 φιλοπρωτεύω
0.9984 ἐπιδέχομαι
0.9984 μειζότερος
0.9984 Διοτρέφης
0.9984 φλυαρέω
0.9982 χάρτης
0.9982 κυρία
0.9976 προσοφείλω
0.9976 ἑκούσιος
0.9976 ἄχρηστος

but this list looks very different if we, say, restrict ourselves to lemmas that occur 5 times or more:

0.9827 ἀντίχριστος
0.9752 καταλαλέω
0.9687 ἐπιφάνεια
0.9681 νήφω
0.9680 ἀρετή
0.9667 μῦθος
0.9641 Μελχισέδεκ
0.9568 πλεονεκτέω
0.9557 νόημα
0.9532 ἐνέργεια

or 30 times or more:

0.8952 ἀρνίον
0.8085 καυχάομαι
0.8024 θηρίον
0.7987 μέλος
0.7969 εἴτε
0.7266 συνείδησις
0.7202 περιτομή
0.7199 θρόνος
0.7139 ὑποτάσσω
0.7116 Παῦλος
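A frequency cut-off like the ones above could be applied along the following lines. This is only a sketch: `rank_by_dp`, its `min_freq` and `top` parameters, and the toy counts are all hypothetical, and the dictionaries follow the same p, l and lp layout as the formula earlier.

```python
def rank_by_dp(p, l, lp, min_freq=1, top=10):
    """Return (dp, lemma) pairs, least evenly dispersed first,
    skipping lemmas that occur fewer than min_freq times."""
    t = sum(p.values())
    scored = []
    for lemma, freq in l.items():
        if freq < min_freq:
            continue
        dp = sum(
            abs(p[part] / t - lp[lemma].get(part, 0) / freq)
            for part in p
        ) / 2
        scored.append((dp, lemma))
    return sorted(scored, reverse=True)[:top]


# Made-up counts, not SBLGNT data.
p = {"A": 60, "B": 30, "C": 10}
l = {"rare": 1, "clustered": 5, "common": 10}
lp = {
    "rare": {"C": 1},
    "clustered": {"B": 5},
    "common": {"A": 6, "B": 3, "C": 1},
}

print(rank_by_dp(p, l, lp))              # the hapax tops the list
print(rank_by_dp(p, l, lp, min_freq=5))  # the hapax is filtered out
```

With no threshold the single-occurrence lemma leads; raising `min_freq` lets lemmas that are frequent but clustered surface instead.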

If we use chapters as the corpus division, we get a slightly different top ten of most evenly distributed lemmas by Gries’s DP:

0.0677 ὁ
0.1440 καί
0.1913 εἰμί
0.2084 εἰς
0.2117 αὐτός
0.2259 ἐν
0.2366 οὗτος
0.2378 ὅς
0.2437 δέ
0.2561 οὐ

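and obviously this is even more problematic for lower frequency words at the other end: a lemma cannot occur in more chapters than it has occurrences, so with that many more parts its DP is pushed towards 1.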

It’s interesting to look, though, at chapters within a single book. For example, here are the most evenly distributed lemmas in John’s gospel using chapters for parts:

0.0574 ὁ
0.0867 καί
0.0977 αὐτός
0.1331 οὐ
0.1391 οὗτος
0.1440 ὅτι
0.1480 λέγω
0.1569 δέ
0.1576 εἰμί
0.1658 εἰς

and here are the least evenly distributed lemmas that occur at least 10 times:

0.9470 σταυρόω
0.9414 Ἀβραάμ
0.9126 νίπτω
0.8958 Πιλᾶτος
0.8914 πρόβατον
0.8812 Λάζαρος
0.8493 καρπός
0.8426 ἄρτος
0.8371 προσκυνέω
0.8221 ψυχή

Obviously Gries’s DP is extremely easy to calculate, and I plan to experimentally include it in the Greek Vocabulary Tool for the Perseus Project but there are still some things to work out with low frequency words.

It’s very interesting, though, as a way of contrasting words that otherwise have the same frequency in a corpus. For example, here are all the lemmas that occur exactly 30 times in the SBLGNT, with their book-based Gries’s DP:

0.3276 διδαχή
0.3558 ἐγγύς
0.3708 σκότος
0.4143 ἀγοράζω
0.5360 σκανδαλίζω
0.5833 συνέρχομαι
0.6230 ἴδε
0.6485 ἐπικαλέω
0.7266 συνείδησις
0.8952 ἀρνίον

The DP ranges all the way from 0.3276 for διδαχή to 0.8952 for ἀρνίον, which I think is quite illustrative.

Here is the list with their chapter-based DP (notice how high the lowest DP now is):

0.8769 ἀγοράζω
0.8821 σκότος
0.8869 συνέρχομαι
0.8958 σκανδαλίζω
0.9016 ἐγγύς
0.9016 διδαχή
0.9034 ἴδε
0.9083 ἐπικαλέω
0.9441 συνείδησις
0.9609 ἀρνίον

One of my reasons for exploring Gries’s DP (and potentially other measures of lexical dispersion) is the application to language learning. My sense is that dispersion might be a useful input to deciding what vocabulary to learn. For example διδαχή or σκότος might be better to learn before ἀρνίον because, even though they all have the same frequency, you are more likely to encounter διδαχή or σκότος in a random book or chapter.

[1] Gries, Stefan Th. (2008) Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics 13:4. John Benjamins.

       
 

Some Unix Command Line Exercises Using MorphGNT

I thought I’d help a friend learn some Unix command-line tools (basic ones, although pretty comprehensive for this type of work) through some practical graded exercises using MorphGNT. It worked out well, so I thought I’d share them in case they are useful to others.

The point here is not to actually teach how to use bash or commands like grep, awk, cut, sort, uniq, head or wc but rather to motivate their use in a gradual fashion with real use cases and to structure what to actually look up when learning how to use them.

This little set of commands has served me well for over twenty years working with MorphGNT in its various iterations (although I obviously switch to Python for anything more complex).

Task 0

Clone https://github.com/morphgnt/sblgnt in git.

Task 1

Using wc and the concept of wildcards/globbing (and relying on the fact I have one line-per-word in those files) work out how many words are in the main text of SBLGNT.

Task 2

Using grep and wc work out how many times μονογενής appears. (You might be able to do it with just grep and appropriate options, but try using grep without options and wc and understand the concept of “piping” the output of one command to the input of another)

Task 3

How many verbs (tokens) are there in John’s gospel? (still doable just with grep and wc)

Task 4

How many unique verbs (lemmas) are there in John’s gospel?

(learn how to use awk to extract fields, and how to use sort and uniq in tandem)

Task 5

What are the 5 most common verbs (lemmas) in John’s gospel? (you might want to use head)

Task 6

Get counts in John’s Gospel of how many tokens appear in each tense/aspect (hint: use cut) and write the results to a file called john.txt rather than just output it in the terminal.

Task 7

Come up with your own question that you think could be answered using these types of operations and try it out.

       
 

SBL Papers Now Online

I’ve put my two SBL papers this year (from both the recent Annual Meeting and the International Meeting) online and also sync’d my Annual Meeting slides to audio I recorded on my iPhone.

  • SBL 2017 Annual: Linking Lexical Resources for Biblical Greek
    [slides] [video]
  • SBL 2017 International: The Route to Adaptive Learning of Greek
    [slides]

For completeness, here are my other SBL talks:

  • SBL 2016 Annual: An Online Adaptive Reading Environment for the Greek New Testament
    [slides]
  • SBL 2015 Annual: A Morphological Lexicon of New Testament Greek
    [slides]
       
 

Speaking at SBL 2017 on Linking Lexical Resources

I’m again speaking at the SBL Annual Meeting, this time in Boston. My topic is basically the “lemma lattice” work started by Ulrik Sandborg-Petersen and me back in 2006 but which I’ve never presented in this sort of setting before.

Here’s the official abstract:

Linking Lexical Resources for Biblical Greek

As more resources for Biblical Greek, both old and new, become openly available, the opportunities for integrating them become greater. At the level of the word, it might seem a trivial task to match based on lemma. But no two texts are lemmatised the same way and no two lexicons will make the same choices of headwords. Numerical solutions such as Strongs and Goodrick-Kohlenberger solve some problems but introduce new ones. After surveying the various issues and challenges, this talk will provide both a framework for moving forward and a report on practical ways that a variety of texts, lexicons, and other resources such as principal-part lists are being linked in the service of open, biblical digital humanities.

I’ll certainly post my slides after my talk but I’ll also try to record it on my iPhone like I did at BibleTech 2015.

       
 

A Tour of Greek Morphology: Part 17

Part seventeen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

As mentioned in the last post in the series, we now have an inflectional class for all 5,314 present active infinitive or indicative forms in the MorphGNT SBLGNT in a file that looks like the following:

010120 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010123 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010202 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010206 εἶ 2SG PA-10 εἰμί PA-10
010213 μέλλει 3SG PA-1 μέλλω PA-1
010213 ζητεῖν INF PA-2 ζητέω PA-2
010218 εἰσί(ν) 3PL PA-10 εἰμί PA-10
010222 βασιλεύει 3SG PA-1 βασιλεύω PA-1
010303 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010309 λέγειν INF PA-1 λέγω PA-1
010309 ἔχομεν 1PL PA-1/PA-8 ἔχω PA-1

Where the columns are:

  • the book/chapter/verse reference
  • the normalized form
  • the morphosyntactic properties
  • the inflectional classes possible without disambiguation
  • the lemma
  • the disambiguated inflectional class
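A row in that shape could be read into a small record along these lines. This is a sketch under assumptions: the `Row` type and `parse_row` are hypothetical names, the columns are assumed to be whitespace-separated, multiple possible classes are assumed to be joined with "/", and the reference is assumed to be two digits each for book, chapter and verse.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    ref: str                     # book/chapter/verse reference
    form: str                    # normalized form
    properties: str              # morphosyntactic properties, e.g. 3SG or INF
    possible_classes: List[str]  # classes possible without disambiguation
    lemma: str
    inflectional_class: str      # the disambiguated class


def parse_row(line):
    """Split one whitespace-separated line into a Row (assumed layout)."""
    ref, form, properties, possible, lemma, inflectional_class = line.split()
    return Row(ref, form, properties, possible.split("/"),
               lemma, inflectional_class)


row = parse_row("010309 ἔχομεν 1PL PA-1/PA-8 ἔχω PA-1")
# Assumption: the six-digit reference is book, chapter, verse, two digits each.
book, chapter, verse = row.ref[:2], row.ref[2:4], row.ref[4:6]
print(row.lemma, row.possible_classes, row.inflectional_class)
print(book, chapter, verse)
```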

Now it’s time to do some counts.

Let us first of all look at the number of distinct lemmas in each of our 13 classes.

The numbers for classes PA-5 and above are low enough that we should look at them individually:

class       lemmas  members
PA-1           338  barytone omega verbs
PA-2           145  circumflex omega verbs with INF -εῖν / 3SG -εῖ
PA-3            21  circumflex omega verbs with INF -οῦν / 3SG -οῖ
PA-4            31  circumflex omega verbs with INF -ᾶν / 3SG -ᾷ
PA-5             2  ζάω + compound (συζάω)
PA-6a            3  ὀμνύω; δείκνυμι + compound (ἀμφιέννυμι)
PA-7             6  τίθημι + compounds (ἐπιτίθημι παρατίθημι περιτίθημι); compounds of ἵημι (ἀφίημι συνίημι)
PA-8             5  δίδωμι + compounds (διαδίδωμι ἀποδίδωμι μεταδίδωμι παραδίδωμι)
PA-9             5  compounds of ἵστημι (καθίστημι μεθίστημι συνίστημι); compound of φημί (σύμφημι); that one weird case of συνίημι
PA-9-ENC         1  φημί
PA-10            1  εἰμί
PA-10-COMP       3  compounds of εἰμί (ἄπειμι ἔξεστι(ν) πάρειμι)
PA-11-COMP       2  compounds of εἶμι (ἔξειμι εἴσειμι)

Notice that even the small counts are elevated due to compound verbs. Folding compounds of the same base verb, the classes from PA-5 on have only one or two members.

This is just looking at the number of unique lemmas in each class but there are two other sets of numbers that are worth looking at: (1) the total number of tokens in the SBLGNT; (2) the distribution of classes amongst the hapax legomena.

class lemmas tokens hapax hapax details
PA-1 338 2563 151
PA-2 145 856 65
PA-3 21 35 15
PA-4 31 117 16
PA-5 2 41 1 συζάω
PA-6a 3 5 2 ὀμνύω ἀμφιέννυμι
PA-7 6 37 3 εἴσειμι παρίστημι παρατίθημι
PA-8 5 35 2 διαδίδωμι μεταδίδωμι
PA-9 5 9 3 συνίημι σύμφημι μεθίστημι
PA-9-ENC 1 22 0
PA-10 1 1551 0
PA-10-COMP 3 39 1 ἄπειμι
PA-11-COMP 2 4 1 εἴσειμι
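Counts of this kind could be produced from the annotated file roughly as follows. This is a sketch only, run on the eleven sample rows shown earlier rather than the full file, and it makes one loud assumption: a lemma is treated as a hapax if it occurs once within these rows, whereas the table above may well mean once in the whole SBLGNT.

```python
from collections import Counter, defaultdict

# The eleven sample rows from earlier in the post (not the full file).
rows = """
010120 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010123 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010202 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010206 εἶ 2SG PA-10 εἰμί PA-10
010213 μέλλει 3SG PA-1 μέλλω PA-1
010213 ζητεῖν INF PA-2 ζητέω PA-2
010218 εἰσί(ν) 3PL PA-10 εἰμί PA-10
010222 βασιλεύει 3SG PA-1 βασιλεύω PA-1
010303 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010309 λέγειν INF PA-1 λέγω PA-1
010309 ἔχομεν 1PL PA-1/PA-8 ἔχω PA-1
""".strip().splitlines()

# class -> lemma -> number of tokens
tokens_per_lemma = defaultdict(Counter)
for line in rows:
    ref, form, props, possible, lemma, cls = line.split()
    tokens_per_lemma[cls][lemma] += 1

summary = {}
for cls, counts in tokens_per_lemma.items():
    summary[cls] = {
        "lemmas": len(counts),
        "tokens": sum(counts.values()),
        # Assumption: "hapax" = occurs exactly once in these rows.
        "hapax": sorted(lemma for lemma, n in counts.items() if n == 1),
    }

for cls, stats in summary.items():
    print(cls, stats["lemmas"], stats["tokens"], len(stats["hapax"]))
```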

Why do the hapax legomena matter? Well they give an indication of what classes were still productive.

Note, however, that the hapax in PA-5 and above are VERY low in number and, with the exception of ὀμνύω in PA-6a they are all compounds. This strongly suggests that only PA-1, PA-2, PA-3, and PA-4 were productive.

Notice that the token numbers for PA-6a, PA-9 and PA-11-COMP are particularly low too. Potentially relevant in the case of PA-6a and PA-9 is that these are the classes most likely to have developed thematic alternatives. This might be worthy of a future post in this series!

Let’s now look at counts for each paradigm cell for each class:

cell   PA-1  PA-2  PA-3  PA-4  PA-5  PA-6a  PA-7  PA-8  PA-9  PA-9-ENC  PA-10  PA-10-COMP  PA-11-COMP
INF     394   171     5    21    13      1    11    10     1         -    124           3           3
1SG     460   116     3    21     6      1     7    10     2         4    138           1           -
2SG     164    46     -     5     2      -     -     1     -         -     92           1           -
3SG     923   295    16    35    13      3    11    13     5        17    896          31           -
1PL     141    52     2    19     5      -     1     -     -         -     52           1           -
2PL     218    99     4     8     1      -     4     -     -         -     93           1           -
3PL     263    77     5     8     1      -     3     1     1         1    156           1           1
total  2563   856    35   117    41      5    37    35     9        22   1551          39           4

What is obvious from this is just how important, regardless of inflectional class, the 3SG form is. The INF is also very important. We’ve seen in a previous post that both cells are very good predictors of inflectional class (much better than 1SG) but they are also just both very common. The 1SG, despite being a bad predictor, is still important in terms of frequency.

The 3PL is a distant fourth with one apparent deviation: it is very common in PA-10 (i.e. the copula), more so than the INF or 1SG. In fact, the proportion of 3PL in this class is actually average, it’s the INF and 1SG that are unusually low (with much of the frequency drop taken up by the 3SG).

As well as εἰμί, φημί (PA-9-ENC) is also disproportionately 3SG.

Of course, given how common PA-1 is, even the plurals there outnumber the most common cells in the other classes.

If the goal is just to identify the person/number, not the class, (which is true in reception but not learning) then a lot of those numbers collapse because of shared endings. Here are the counts just focused on the common endings (without accents):

INF                  604
     -ναι            153
1SG                  606
     -μι             163
2SG  -{ι}ς           217
     -ς                1
     (-)ει            93
3SG  -{ι}           1282
     -σι(ν)           49
     (-)εστι(ν)      927
1PL  -μεν            273
2PL  -τε             448
3PL  -σι(ν)          511
     -ασι(ν)           7

This just emphasises even more (even though it was in the previous table) that there is only 1 2SG in -ς (without an iota, subscripted or otherwise): the παραδίδως in Luke 22.48.

The 7 3PLs in -ασι(ν) are:

  • τιθέασι(ν) in Matt 5.15
  • ἐπιτιθέασι(ν) in Matt 23.4
  • περιτιθέασι(ν) in Mark 15.17
  • φασί(ν) in Rom 3.8
  • συνιᾶσι(ν) in 2Co 10.12
  • εἰσίασι(ν) in Heb 9.6
  • διδόασι(ν) in Rev 17.13

One could argue that these are subsumed by saying 3PL ends in -σι(ν) but given that, in the very same lexemes, -σι(ν) can also indicate 3SG, it is useful calling out the α, even though the root vowel alternation is enough to distinguish singular and plural.
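The collapse from per-class counts to per-ending counts can be checked with a little arithmetic. The sketch below does this for the INF row only; the grouping of classes into the two endings is inferred here from the totals (it is not stated outright in the post), so treat it as an assumption.

```python
# INF counts per class, copied from the paradigm-cell table above
# (PA-9-ENC has no INF, so it is omitted).
inf_counts = {
    "PA-1": 394, "PA-2": 171, "PA-3": 5, "PA-4": 21, "PA-5": 13,
    "PA-6a": 1, "PA-7": 11, "PA-8": 10, "PA-9": 1,
    "PA-10": 124, "PA-10-COMP": 3, "PA-11-COMP": 3,
}

# Assumption: PA-1 to PA-5 share one infinitive ending and the
# remaining classes take -ναι. This grouping is an inference.
first_group = {"PA-1", "PA-2", "PA-3", "PA-4", "PA-5"}

first_total = sum(n for cls, n in inf_counts.items() if cls in first_group)
nai_total = sum(n for cls, n in inf_counts.items() if cls not in first_group)

print(first_total)  # compare with the first INF row of the endings table
print(nai_total)    # compare with the -ναι row
```

Under that grouping the two sums come to 604 and 153, matching the INF rows of the endings table.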

That’s it (for now) for counts of the present actives. In the next couple of posts, we’ll turn to the middle forms.

       
 
 
   