Observe that average word length appears to be a general property of English, since its value recurs from text to text. (The figure is somewhat inflated, because the character count includes space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.
The previous example also showed how we can access the "raw" text of the book.
Although Project Gutenberg contains thousands of books, it represents established literature.
An interesting property of this collection is its time dimension.
Many text corpora contain linguistic annotations, representing POS tags, named entities, syntactic structures, semantic roles, and so forth.
NLTK provides convenient ways to access several of these corpora, and has data packages containing corpora and corpus samples, freely downloadable for use in teaching and research. For information about downloading them, see the NLTK website.
Figure: Cumulative Word Length Distributions. Six translations of the Universal Declaration of Human Rights are processed; this graph shows that words having 5 or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
We'll use NLTK's support for conditional frequency distributions.
These are presented systematically in 2, where we also unpick the following code line by line.
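A conditional frequency distribution can be pictured as one frequency count per condition. The sketch below illustrates that idea using only the standard library, counting word lengths per language label and then computing the share of words at or below a given length. The helper names `conditional_length_counts` and `cumulative_share` and the toy data are hypothetical; NLTK's own `ConditionalFreqDist` class offers considerably more, including tabulation and plotting.

```python
from collections import Counter, defaultdict


def conditional_length_counts(pairs):
    """Count word lengths separately for each condition (here, a language label)."""
    counts = defaultdict(Counter)
    for condition, word in pairs:
        counts[condition][len(word)] += 1
    return counts


def cumulative_share(length_counts, max_len):
    """Return the fraction of words whose length is at most max_len."""
    total = sum(length_counts.values())
    short = sum(n for length, n in length_counts.items() if length <= max_len)
    return short / total if total else 0.0


# Hypothetical toy data, not drawn from any real corpus.
samples = {
    "lang_a": "the cat sat on a mat".split(),
    "lang_b": "extraordinarily long compounds predominate".split(),
}

# Each event is a (condition, word) pair, the usual input shape for this technique.
pairs = [(lang, word) for lang, words in samples.items() for word in words]
cfd = conditional_length_counts(pairs)

for lang in sorted(cfd):
    print(lang, round(cumulative_share(cfd[lang], 5), 2))
```

With real corpus data, the pairs would come from the words of each translation, and the cumulative shares for every length would give the curves described in the figure caption.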
Don't worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and — if you're game — modify it by substituting some part of the code with a different text or word.
Often there is insufficient government or industrial support for developing language resources, and individual efforts are piecemeal and hard to discover or re-use.
The first handful of words in each of these texts are the titles, which by convention are stored as upper case.
In 1, we looked at the Inaugural Address Corpus, but treated it as a single text.
We examined some small text collections in 1, such as the speeches known as the US Presidential Inaugural Addresses.
This particular corpus actually contains dozens of individual texts, one per address, but for convenience we glued them end-to-end and treated them as a single text. We also used various pre-defined texts that we loaded at the interpreter prompt.
This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score).
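The three statistics can be sketched as a small function. This is a minimal illustration under stated assumptions, not the original listing: the function name `text_statistics` and the sample sentences are hypothetical, and the three inputs stand in for what an NLTK corpus reader returns from its `raw()`, `words()`, and `sents()` methods. Because the character count is taken from the raw string, it includes spaces and punctuation.

```python
def text_statistics(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity).

    raw   -- the text as a single string
    words -- the text as a list of word tokens
    sents -- the text as a list of sentences, each a list of tokens
    """
    num_chars = len(raw)      # counts every character, spaces included
    num_words = len(words)
    num_sents = len(sents)
    # Fold case so that "The" and "the" count as one vocabulary item.
    num_vocab = len({w.lower() for w in words})
    return (
        num_chars / num_words,   # average word length
        num_words / num_sents,   # average sentence length
        num_words / num_vocab,   # average uses per vocabulary item
    )


# Hypothetical sample, tokenized by hand for illustration.
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
words = [token for sent in sents for token in sent]

avg_word_len, avg_sent_len, diversity = text_statistics(raw, words, sents)
print(round(avg_word_len, 2), avg_sent_len, diversity)
```

Running the same function over each file of a corpus, one row per text, reproduces the kind of table the program prints.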