
The Core Vocabulary of New Testament Greek

In a 2008 paper, Wilfred Major constructs what he calls the 50% and 80% vocab lists for Classical Greek. That is, the lemmata that account for 50% and 80% respectively of tokens in the Classical Greek corpus. In this post I provide the code for the equivalent for the Greek New Testament and talk about some of the results.

Major’s paper is “It’s Not the Size, It’s the Frequency: The Value of Using a Core Vocabulary in Beginning and Intermediate Greek”. As well as listing the 65 words in the “50% List”, he lists the roughly 1,100 words in the “80% List”, complete with glosses in both cases.

Major also discusses other issues near and dear to this blog such as the relevance of form frequency as well as lemma frequency. I’ll respond to him on some of these topics in later blog posts.

Now, for many years I’ve talked about the limitations of a purely frequency-based approach to vocab ordering but that doesn’t mean producing such lists is useless, just that there are things we can do to improve on that approach. But I still thought it would be interesting to produce GNT 50% and 80% lists.

The code is available here.

The 50% list consists of just 27 lemmata. The only verbs are γίνομαι, εἰμί, ἔχω, and λέγω. The only nouns are θεός, κύριος, and Ἰησοῦς.

The 80% list consists of 317 lemmata.

As expected, this is considerably smaller than Major’s Classical Greek lists which are based on a considerably larger corpus.
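The idea behind an "N% list" can be sketched in a few lines of Python. This is only an illustration, not the code linked above: `coverage_list` is a hypothetical helper, it assumes the corpus is available as one lemma per token, and the toy token list is made up rather than taken from real GNT counts.

```python
from collections import Counter


def coverage_list(lemmas, target):
    """Return the most frequent lemmata that together reach `target` token coverage.

    `lemmas` is an iterable with one lemma per token; `target` is a
    fraction such as 0.5 or 0.8.
    """
    counts = Counter(lemmas)
    total = sum(counts.values())
    selected = []
    covered = 0
    # Walk lemmata from most to least frequent until coverage is reached.
    for lemma, count in counts.most_common():
        if covered >= target * total:
            break
        selected.append(lemma)
        covered += count
    return selected


# Toy token stream (one lemma per token), NOT real GNT frequencies.
tokens = ["ὁ"] * 5 + ["καί"] * 3 + ["λέγω"] * 1 + ["θεός"] * 1

print(coverage_list(tokens, 0.5))
print(coverage_list(tokens, 0.8))
```

Passing forms instead of lemmata to the same function would give the corresponding form-based list.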

It’s easy to tweak the code to look at forms rather than lemmata. The 50% forms list for the GNT consists of 97 forms from 52 lemmata.

Interestingly, those 97 forms consist of 16 forms of the article, 15 forms of the (1st/2nd person) personal pronouns, and 6 forms of αὐτός. This suggests that even without arguments on morphological grounds, it’s worth learning the full paradigms for the article, the personal pronouns and αὐτός really early on.

Unsurprisingly, λέγω gets a decent showing with 4 forms: εἶπεν, λέγει, λέγω and λέγων. I’ve long thought it’s worth learning those right away without needing to introduce full paradigms.

There’s a lot more that could be explored even with this frequency-based approach. And lots more to say based on the other things Major talks about in his paper.

Finally, it should be stressed that very few full verses of the GNT would be readable with just the 80% list and probably none with the 50% list. I may do another post later on to confirm that.

UPDATE: Now see Actual Core Vocab Lists for Greek New Testament


A Tour of Greek Morphology: Part 16

Part sixteen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the previous post we went through and made sure we had all our active endings covered ready for counting. As pointed out (and in detail in Part 13), though, we still had some ambiguities. If we want to assign just a single inflectional class to each form in the SBLGNT, we need some way of disambiguating. Fortunately, the lemma does this (even if it resorts to using fake forms like the uncontracted circumflex 1SGs).

This allows us to write code that basically follows these rules:

1SG:Xημι or 3SG:Xησι(ν) is
  PA-7 if lemma ends in τίθημι or ίημι
  PA-9 if lemma ends in ίστημι or φημι
1PL:Xῶμεν or 3PL:Xῶσι(ν) is
  PA-5 if lemma is ζάω
  PA-4 otherwise
1PL:Xοῦμεν or 3PL:Xοῦσι(ν) is
  PA-2 if lemma ends in έω
  PA-3 if lemma ends in όω
2PL:Xετε is
  PA-1 if lemma ends in ω
  PA-7 if lemma ends in ημι
1PL:Xομεν is
  PA-1 if lemma ends in ω
  PA-8 if lemma ends in ωμι
1SG:Xῶ is
  PA-2 if lemma ends in έω
  PA-3 if lemma ends in όω
  PA-5 if lemma is ζάω
  PA-4 if lemma otherwise ends in άω
INF:Xέναι is
  PA-7 if lemma ends with ίημι
  PA-11-COMPOUND if lemma ends with ειμι
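These rules translate fairly directly into a lookup table. The following is a rough sketch rather than the actual script: the names `RULES`, `disambiguate`, `ends`, `equals` and `strip_marks` are all hypothetical, and stripping accents and breathings from the lemma before comparing endings is an assumption of this sketch (so that, say, ἵστημι counts as ending in ίστημι), not something the rules above specify.

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


def strip_marks(text):
    """Drop accents and breathings (sketch assumption) so endings compare loosely."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ends(*suffixes):
    bare = tuple(strip_marks(s) for s in suffixes)
    return lambda lemma: lemma.endswith(bare)


def equals(word):
    bare = strip_marks(word)
    return lambda lemma: lemma == bare


def otherwise(lemma):
    return True


# Ordered (test, class) pairs, tried top to bottom.
MI = [(ends("τίθημι", "ίημι"), "PA-7"), (ends("ίστημι", "φημι"), "PA-9")]
ALPHA = [(equals("ζάω"), "PA-5"), (otherwise, "PA-4")]
EPSILON_OMICRON = [(ends("έω"), "PA-2"), (ends("όω"), "PA-3")]

RULES = {
    ("1SG", "Xημι"): MI,
    ("3SG", "Xησι(ν)"): MI,
    ("1PL", "Xῶμεν"): ALPHA,
    ("3PL", "Xῶσι(ν)"): ALPHA,
    ("1PL", "Xοῦμεν"): EPSILON_OMICRON,
    ("3PL", "Xοῦσι(ν)"): EPSILON_OMICRON,
    ("2PL", "Xετε"): [(ends("ω"), "PA-1"), (ends("ημι"), "PA-7")],
    ("1PL", "Xομεν"): [(ends("ω"), "PA-1"), (ends("ωμι"), "PA-8")],
    ("1SG", "Xῶ"): [
        (ends("έω"), "PA-2"),
        (ends("όω"), "PA-3"),
        (equals("ζάω"), "PA-5"),
        (ends("άω"), "PA-4"),
    ],
    ("INF", "Xέναι"): [(ends("ίημι"), "PA-7"), (ends("ειμι"), "PA-11-COMPOUND")],
}
# Normalize the pattern keys so lookups do not depend on how accents are encoded.
RULES = {(cell, nfc(pattern)): tests for (cell, pattern), tests in RULES.items()}


def disambiguate(cell, pattern, lemma):
    """Return the inflectional class for an ambiguous cell/pattern, or None."""
    bare = strip_marks(lemma)
    for test, inflection_class in RULES.get((cell, nfc(pattern)), []):
        if test(bare):
            return inflection_class
    return None


print(disambiguate("1PL", "Xομεν", "ἔχω"))
print(disambiguate("1SG", "Xῶ", "ζάω"))
print(disambiguate("1PL", "Xομεν", "ἵστημι"))
```

A result of `None` means no rule applied, which is how a form whose lemma belongs to a different class than the form itself would surface.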

Part 13 also mentioned the 2SG:Xης ambiguity between PA-7 and PA-9 but that doesn’t crop up in the SBLGNT: there are in fact no PA-7 OR PA-9 2SGs in the SBLGNT.

There ARE however three 1PL forms which do still cause a problem with the rules above:

  • ἀφίομεν
  • ἱστάνομεν
  • συνιστάνομεν

Each of these matches 1PL:Xομεν BUT the MorphGNT lemmas are ἀφίημι, ἵστημι, and συνίστημι respectively.

What is happening here is that new forms have developed belonging to a different inflectional class than the particular form chosen for the lemma. For example ἱστάνομεν is an ω verb but it’s otherwise the same as the athematic ἵστημι. Arguably the MorphGNT lemmatization could be changed to ἱστάνω if you consider a difference in inflectional class to be a new lexeme. This is a topic I’ll be covering in my talk at SBL 2017 in Boston in November. For now, in our Python code, we’ll just special-case these as PA-1 but we will come back to discussing this more. Note that we only caught this here because it was an ambiguous form so we were checking for particular lemma patterns.

We now have an inflectional class for all 5,314 present active infinitive or indicative forms in the MorphGNT SBLGNT.

The output of my Python script begins:

010120 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010123 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010202 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010206 εἶ 2SG PA-10 εἰμί PA-10
010213 μέλλει 3SG PA-1 μέλλω PA-1
010213 ζητεῖν INF PA-2 ζητέω PA-2
010218 εἰσί(ν) 3PL PA-10 εἰμί PA-10
010222 βασιλεύει 3SG PA-1 βασιλεύω PA-1
010303 ἐστί(ν) 3SG PA-10 εἰμί PA-10
010309 λέγειν INF PA-1 λέγω PA-1
010309 ἔχομεν 1PL PA-1/PA-8 ἔχω PA-1

The columns are:

  • the book/chapter/verse reference
  • the normalized form
  • the morphosyntactic properties
  • the inflectional classes possible without disambiguation
  • the lemma
  • the disambiguated inflectional class

You can download the entire thing here.

We’ll use this to do our counts in the next post.

One question comes to mind: are the disambiguated inflectional classes consistent for all the forms of a lexeme (beyond the three exceptions we already saw above)?

Well, looking at the full output of the script, we find there are a few more in the SBLGNT:

ὀμνύω
  INF ὀμνύναι PA-6a
  INF ὀμνύειν PA-1
  all other forms
δείκνυμι
  INF δεικνύειν PA-1
  1SG δείκνυμι PA-6a
συνίστημι
  1PL συνιστάνομεν PA-1
  1SG συνίστημι PA-9
ἀφίημι
  1PL ἀφίομεν PA-1
  2SG ἀφεῖς PA-2
  all other forms PA-7
συνίημι
  INF συνιέναι PA-7
  3PL συνίουσι(ν) PA-1
  3PL συνιᾶσι(ν) PA-9

In each case we have an originally athematic verb occasionally acting like it’s thematic (and, in the case of ὀμνύω even the lemma is written as if it was thematic). We WILL have more to say about this in a few posts but we’ve now done enough that we can count how many times each inflectional class appears in the SBLGNT and how many different lexemes follow each inflectional class. We’ll do that in the very next post.
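A consistency check of this kind amounts to grouping by lemma and collecting the set of disambiguated classes. A minimal sketch, assuming each row has already been reduced to a `(lemma, cell, form, class)` tuple; `mixed_class_lemmas` is a hypothetical name and the sample rows are just a few taken from the output and table above.

```python
from collections import defaultdict


def mixed_class_lemmas(rows):
    """Return lemmata whose forms were assigned more than one inflectional class."""
    classes = defaultdict(set)
    for lemma, cell, form, inflection_class in rows:
        classes[lemma].add(inflection_class)
    return {
        lemma: sorted(found)
        for lemma, found in classes.items()
        if len(found) > 1
    }


rows = [
    ("δείκνυμι", "INF", "δεικνύειν", "PA-1"),
    ("δείκνυμι", "1SG", "δείκνυμι", "PA-6a"),
    ("ἔχω", "1PL", "ἔχομεν", "PA-1"),
    ("λέγω", "INF", "λέγειν", "PA-1"),
]

print(mixed_class_lemmas(rows))
```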

There is still another thing worth checking: is the value of X in our paradigm patterns consistent across a lexeme too? Yes it is, accent aside, if you only compare within the same inflectional class. The X for the δείκνυμι cells in PA-6a is always δείκν, for example, but the PA-1 cases have X = δεικνύ.

UPDATE: I just discovered a mis-disambiguated παριστάνετε that needs to be special-cased as a PA-1.


A Tour of Greek Morphology: Part 12

Part twelve of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

There is one very important verb we haven’t looked at the paradigm of yet: the copula.

For comparison, we’ll put the present infinitive and indicative forms alongside the common endings of the μι verbs we saw in part 10.

INF εἶναι -ναι
1SG εἰμί -μι
2SG εἶ
3SG ἐστί(ν) -σι(ν)
1PL ἐσμέν -μεν
2PL ἐστέ -τε
3PL εἰσί(ν) -ασι(ν)


  • all but the INF and 2SG are enclitic
  • in the INF, 1SG, 1PL and 2PL we find the expected ending
  • the 3SG and 3PL are slightly different
  • the 2SG is lacking the ending altogether
  • with all the endings removed, we sometimes have ἐσ and sometimes εἰ

Recall in part 9 we said that “it was not uncommon for Attic-Ionic to have σι for τι in other dialects” (a type of lenition). Perhaps the 3SG ending was originally τι(ν) and it just became σι(ν) in all the μι verbs except the copula.

And in part 11 we questioned “why the active 2SG and 3SG forms don’t end in σι and τι to mirror σαι and ται.” Well, what if they originally did and some change masked this?

The 3SG τι(ν) would be explained as an original τι with the occasional movable nu. The 3SG σι(ν) would just come from τι(ν) via the tendency for τι to become σι in Attic-Ionic.

The 2SG εἶ is perfectly explainable as coming from ἐσι with the intervocalic sigma dropping. In fact, we find ἐσσί in Homer, Pindar and other writings in older or more conservative dialects. If εἶ came from an older ἐσσί, that would not only suggest a -σι ending but a ἐσ stem. [EDIT: it’s also possible, or even likely given the evidence of other Indo-European languages, that the first sigma was dropped much earlier in Proto-Indo-European and the instances of ἐσσί are actually a reintroduction of a double sigma by analogy with the 3SG!]

Is it plausible that εἶναι came from ἐσ+ναι and εἰμί from ἐσ+μι? Absolutely! A sigma dropping and the preceding vowel lengthening would explain those forms. But why would we still find ἐσμέν rather than, say, εἰμέν? Well it turns out Homer and Herodotus do have εἰμέν. There is clearly tension between keeping the ἐσ and going to εἰ and different dialects went a different way even at the level of different cells in the paradigm.

In the 3PL, we do find that Homer (as well as εἰσί) has ἔᾱσι, following the 3PL ending of the other μι verbs, but much as the ω verb ending -ουσι comes from -οντι, we can explain εἰσί from ἐσ+ντι.

Further justification of earlier forms comes from comparison with other Indo-European languages but doing that would take us too far afield for this survey. For now, we’ll just summarize what we have for this new paradigm.

We’ll call this PA-10 but because of the ἐσ/εἰ alternation, we can’t really isolate distinguishers across the entire paradigm other than the full words themselves.

INF εἶναι ἐσ+ναι sigma-drop and compensatory lengthening
1SG εἰμί ἐσ+μι sigma-drop and compensatory lengthening
2SG εἶ ἐσ+σι sigma-drop (twice) and compensatory lengthening
3SG ἐστί(ν) ἐσ+τι
1PL ἐσμέν ἐσ+μεν
2PL ἐστέ ἐσ+τε
3PL εἰσί(ν) ἐσ+ντι lenition of tau, sigma and nu drop with compensatory lengthening

As always, I stress this is a historical explanation, not an explanation of what was going on in the minds of native Greek speakers nor the best way to initially learn the forms of the copula.

The μι/σι/τι/ντι pattern is fascinating, though, with its parallel to the middle μαι/σαι/ται/νται.

There are still, of course, open questions, like the relationship between these endings and those of the ω verbs that differ (not least of which -μι vs -ω itself!) Or the fact that our other μι verbs seemed to use a different vowel in the singular than the plural and there’s no sign of that in the copula. [EDIT: also as noted, ἐσσι as the original form is problematic; it was likely ἐσι in Proto-Greek.]

One earlier observation we can say a little bit more about now, though, is the alpha in the -ασι(ν) ending which previously seemed inexplicable. As we shall see later on, when a ν can’t be pronounced in a particular context, it often became an α rather than just dropping out completely. Given we reconstruct a ν in the 3PL ending, this ν becoming an α rather than dropping out entirely explains -ασι(ν) (with no compensatory lengthening). Because the μι verbs (unlike the ω verbs) have a 3SG ending in σι(ν), keeping the α around was useful to discriminate between the singular and plural. In the case of the copula, though, the 3SG retained the τ so there was less reason to keep the old ν (pronounced as α) around and it could just drop out entirely.

We’ve now covered the major present infinitive and indicative paradigms. In the next few posts in this series we’re going to step back a little and talk about the relationship between paradigms, the notion of lemmas and citation forms, some more about cell filling and class inference, and some statistics about the frequency of these different paradigms we’ve looked at. Then we’ll move beyond the present and look at a whole new set of paradigms!


A Tour of Greek Morphology: Part 15

Part fifteen of a tour through Greek inflectional morphology to help get students thinking more systematically about the word forms they see (and maybe teach a bit of general linguistics along the way).

In the previous two posts in this series (part 13 and part 14) we summarized the paradigms we’ve seen so far for the present infinitive and indicative both in the active and middle.

Do these paradigms cover all the forms in the Greek New Testament? Which paradigms are more common? Which are productive? We’ll explore these questions in the next few posts.

Let’s start with the active forms.

The first test is whether every present active infinitive and indicative verb in the MorphGNT SBLGNT matches with one of the patterns we’ve discussed GIVEN ITS MORPHOSYNTACTIC PROPERTY SET. We want to test, for example, whether every verb tagged as -PAN---- matches one of Xειν, Xεῖν, Xοῦν, Xᾶν, Xῆν, Xύναι, Xέναι, Xόναι, Xάναι, or εἶναι. Or whether every verb tagged as 2PAI-S-- matches one of Xεις, Xεῖς, Xοῖς, Xᾷς, Xῇς, Xυς, Xης, Xως, Xης, or εἶ.

Running a short Python script over the MorphGNT, it turns out there are 14 forms in 69 instances that do NOT match.
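The kind of test involved can be sketched as an accent-sensitive suffix match. This is an illustration rather than the script itself: it assumes X stands for any non-empty string, and `compile_pattern` and `matching_patterns` are hypothetical names.

```python
import re
import unicodedata

# The infinitive patterns listed above.
INF_PATTERNS = [
    "Xειν", "Xεῖν", "Xοῦν", "Xᾶν", "Xῆν",
    "Xύναι", "Xέναι", "Xόναι", "Xάναι", "εἶναι",
]


def nfc(text):
    return unicodedata.normalize("NFC", text)


def compile_pattern(pattern):
    """Turn a paradigm pattern into a regex; a leading X matches any non-empty stem."""
    pattern = nfc(pattern)
    if pattern.startswith("X"):
        return re.compile(".+" + re.escape(pattern[1:]))
    return re.compile(re.escape(pattern))


def matching_patterns(form, patterns):
    """Return every pattern the whole form matches, accents included."""
    form = nfc(form)
    return [p for p in patterns if compile_pattern(p).fullmatch(form)]


print(matching_patterns("ζητεῖν", INF_PATTERNS))
print(matching_patterns("λέγειν", INF_PATTERNS))
print(matching_patterns("φημί", ["Xημι"]))
```

Because the match is accent-sensitive, a form such as φημί fails to match Xημι, and any form returning an empty list would be counted as non-matching.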

Three of these forms are φημί. The issue here is that φημί is enclitic in the indicative and so, even though it otherwise follows a PA-9 paradigm, the accentuation doesn’t match. If we want to capture the enclitic nature of φημί in its inflection class, we’ll need to create a variant of PA-9 that is enclitic.

INF Xάναι Xάναι
1SG Xημι Xημί
2SG Xης Xής
3SG Xησι(ν) Xησί(ν)
1PL Xαμεν Xαμέν
2PL Xατε Xατέ
3PL Xᾶσι(ν) Xασί(ν)

The 2SG appears more frequently as φῄς in Classical Greek but neither form appears in the SBLGNT so we’ll put that issue aside for now.

Another eight of these forms are compounds of the copula and so have different accentuation and breathing (but are otherwise identical to PA-10).

INF εἶναι Xεῖναι
1SG εἰμί Xειμι
2SG εἶ Xει
3SG ἐστί(ν) Xεστι(ν)
1PL ἐσμέν Xεσμεν
2PL ἐστέ Xεστε
3PL εἰσί(ν) Xεισι(ν)

The only additional variation here is εἰσίασιν in Hebrews 9.6 but this is not, in fact, derived from εἰς + εἰμί but rather εἰς + εἶμι. Let’s create a new paradigm for εἶμι even though it doesn’t appear in the SBLGNT just so we can derive a paradigm for the compound case from it.

Here PA-11 and PA-11-COMPOUND are shown alongside PA-10 for comparison (note the italic forms don’t appear in the SBLGNT):

  PA-10 PA-11 PA-11-COMPOUND
INF εἶναι ἰέναι Xιέναι
1SG εἰμί εἶμι Xειμι
2SG εἶ εἶ Xει
3SG ἐστί(ν) εἶσι(ν) Xεισι(ν)
1PL ἐσμέν ἴμεν Xιμεν
2PL ἐστέ ἴτε Xιτε
3PL εἰσί(ν) ἴασι(ν) Xίασι(ν)

PA-11 and PA-11-COMPOUND are very similar to PA-6a through PA-9 except with ει/ι instead of υ/υ, η/ε, ω/ο, η/α. The INF being ιε is a little unexpected but outside the scope of the current discussion as we really are just wanting to capture the 3PL of PA-11-COMPOUND for now.

Note that εἰσιέναι in Acts 3.3 is also from εἰς + εἶμι but this slipped us by because we have a Xέναι pattern already. Similarly, we have ἐξιέναι in Acts 20.7 and 27.43. With the addition of PA-11-COMPOUND we now have a slight ambiguity with PA-7 (in the INF) and PA-10-COMPOUND (in the 1SG and 2SG). This isn’t a problem at the moment but will come up again (as will other ambiguities) in the next post.

Adding these paradigm variants covers 12 of our originally non-matching forms. The remaining two are the impersonal χρή and ἔνι which represent fossilized phrases with the copula elided. For our stats we’ll ignore them.

In the next post, we’ll see if we can categorize the lexemes in the SBLGNT into inflection classes based on these paradigms and therefore be able to study how frequent they are from both a type and token perspective.


More Vocabulary Statistics

With a boost in numbers on the site, this post looks at some slightly more detailed statistics from the first activity.

Just 5 days ago there were 82 sign ups with 52 people having completed the first activity. Now there have been a total of 116 signups and 79 people have done at least the first activity (with 44 having done more than one). Thank you very much everyone!

In my last post we looked at mean item difficulty (what proportion of people get an item correct) by frequency bucket.

We saw that the coarse frequency buckets had an okay correlation with item difficulty but not great. We’ll explore that a little more in the near future but in this post I want to introduce another dimension: the ability of the person being asked the item.

I should note that in psychometrics (and in item response theory in particular, which we’ll be getting to) the term “ability” is used in a specific sense of the measurement we’re trying to take of the person (with no assumption of whether it’s innate or even desirable). It’s just the person-specific construct we’re trying to measure.

As an initial proxy for this “ability” in the context of the first activity on the site, I’ve used the total percentage of items in that activity answered correctly by a given person. This is just the raw percentage of items answered correctly, not quite the same as the estimate of NT vocabulary coverage shown on the site. This raw percentage is then used to group people into buckets (just in the context of the first activity for now).
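As a rough sketch of that proxy (not the site's actual code): take the raw percentage correct and drop it into a fixed-width bucket. The function name `ability_bucket`, the bucket labels, and the choice to put exactly 100% in the top bucket are assumptions made here for illustration.

```python
def ability_bucket(responses, width=20):
    """Bucket a person by raw percentage correct.

    `responses` is a list of booleans, one per item answered in the activity.
    `width` is the bucket width in percentage points (20 gives five buckets,
    10 gives ten).
    """
    percent = 100 * sum(responses) / len(responses)
    # Lower bounds are inclusive; 100% is folded into the top bucket.
    lower = min(int(percent // width) * width, 100 - width)
    return f"{lower}–{lower + width}%"


# Made-up responses: 7 of 10 items correct.
answers = [True] * 7 + [False] * 3
print(ability_bucket(answers))
print(ability_bucket(answers, width=10))
```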

Now we can tabulate item frequency buckets vs person ability buckets with the following result:

First off, you can see we’re still somewhat lacking in numbers of people of beginning-intermediate ability.

But importantly, you can see how mean item difficulty (the number in each cell) varies by ability bucket (the column). We’ve already seen that mean item difficulty isn’t a great predictor of item frequency bucket. Splitting out different abilities like we do above makes discrimination easier in some cases. But the important thing to note in the table above is that the mean item difficulty WITHIN a frequency bucket (row) is a good indicator of a person’s overall ability bucket.

This is less the case in the bucket for the most frequent items (the row labeled 1), which makes ability buckets 20% and above difficult to discriminate. Similarly, the less frequent item buckets aren’t as good at discriminating between the lower ability buckets. This is what we would expect.

But overall, frequency buckets 2 through 5 (and especially 3 and 4) do an excellent job of discriminating each of the ability buckets above 20%. 5 seems particularly well suited for each of the buckets at 40% ability and above and 1 only really between the 0–20% bucket and the rest.
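A tabulation like this can be sketched as a dictionary keyed by (frequency bucket, ability bucket). The sketch below pools all responses in a cell and takes the proportion correct, which is a simplification and may not be exactly how the cells were computed; `crosstab` is a hypothetical name and the responses are made up.

```python
from collections import defaultdict


def crosstab(responses):
    """Proportion correct per (frequency bucket, ability bucket) cell.

    `responses` is an iterable of (frequency_bucket, ability_bucket, correct).
    """
    totals = defaultdict(lambda: [0, 0])  # cell -> [number correct, number asked]
    for frequency_bucket, ability, correct in responses:
        cell = totals[(frequency_bucket, ability)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: right / asked for key, (right, asked) in totals.items()}


# Made-up responses, for illustration only.
responses = [
    (1, "80–100%", True),
    (1, "80–100%", True),
    (5, "80–100%", True),
    (5, "80–100%", False),
    (5, "40–60%", False),
]

table = crosstab(responses)
print(table[(5, "80–100%")])
```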

I suspect it’s going to be interesting to have more fine-grained item frequencies but even MORE interesting to put aside frequency altogether and bucket them by overall difficulty. I’ll do that in a subsequent post once I’ve done the analysis. At some point I’ll also look at individual items and their ability to discriminate ability.

For now, though, I did want to share a finer-grained bucketing of ability, with ten buckets instead of five:

The lack of people below the 50% ability mark makes this a little less useful and there are adjacent ability buckets that cease to be discriminating at this level of granularity.

But the important pattern is still there, assuming for now frequency is a proxy for difficulty: if an item is easy, it can’t discriminate people of higher ability, although it may be great at discriminating those of lower ability; and if an item is hard, it can’t discriminate people of lower ability, although it may be great at discriminating those of higher ability.
