15-Jun-2017 16:28

As just mentioned, a text corpus is a large body of text.

We will wait until later before exploring each Python construct systematically.

Don't worry if you see an example that contains something unfamiliar; simply try it out and see what it does, and, if you're game, modify it by replacing some part of the code with a different text or word.

This way you will associate a task with a programming idiom, and learn the hows and whys later.

For the moment, you can ignore the details and just concentrate on the output.

The Reuters Corpus contains 10,788 news documents totaling 1.3 million words.
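
Figures like these come from counting documents and word tokens. The following is a minimal sketch of that counting, assuming a corpus is represented as a dictionary from file IDs to token lists; the name `corpus_size`, the file IDs, and the toy sentences are all made up for illustration and are not the real Reuters data.

```python
from typing import Dict, List, Tuple


def corpus_size(documents: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (number of documents, total number of word tokens)."""
    n_docs = len(documents)
    n_words = sum(len(tokens) for tokens in documents.values())
    return n_docs, n_words


# Toy stand-in for a corpus: hypothetical file IDs mapped to token lists.
toy_corpus = {
    "test/0001": ["Oil", "prices", "rose", "sharply", "."],
    "test/0002": ["Grain", "exports", "fell", "."],
    "training/0001": ["The", "bank", "raised", "interest", "rates", "."],
}

n_docs, n_words = corpus_size(toy_corpus)
print(f"{n_docs} documents, {n_words} words")
```

If NLTK and its Reuters data are installed, the same two counts should be obtainable as `len(reuters.fileids())` and `len(reuters.words())` after `from nltk.corpus import reuters`.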

An interesting property of this collection is its time dimension.

Many text corpora contain linguistic annotations, representing part-of-speech (POS) tags, named entities, syntactic structures, semantic roles, and so forth.
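
To make the idea of annotation concrete, here is a small sketch of one common representation: a POS-tagged sentence stored as a list of (word, tag) pairs. The sentence, the tag names, and the helper functions `tag_counts` and `words_with_tag` are assumptions chosen for illustration; real corpora use their own tagsets and formats.

```python
from collections import Counter

# Hypothetical annotated sentence: each token is paired with a POS tag.
tagged_sentence = [
    ("The", "DET"),
    ("market", "NOUN"),
    ("rallied", "VERB"),
    ("strongly", "ADV"),
    (".", "PUNCT"),
]


def tag_counts(tagged_tokens):
    """Count how often each tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


def words_with_tag(tagged_tokens, tag):
    """Return the words that carry the given tag, in their original order."""
    return [word for word, t in tagged_tokens if t == tag]


counts = tag_counts(tagged_sentence)
print(counts["NOUN"])
print(words_with_tag(tagged_sentence, "VERB"))
```

Keeping the annotation next to each token in this way makes it easy to filter or count by tag without changing the underlying text.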

