Lab 1: Creating a Corpus

Files for Analysis

In this class we will be analyzing various expressions of the human condition through language, art, and music. To assist us and make our work personally valuable, you will be gathering text, images, and sound files for the class to analyze in subsequent labs.

Uploading copyrighted materials is fine for this assignment, as they will only be distributed to students in the course and used for strictly educational purposes. Be sure not to further redistribute any of these materials.

That said, public domain materials are often easier to obtain. Some links to repositories of freely available materials are provided below.


Find a short poem written in English that holds meaning for you. You might find Public Domain Poetry useful for this purpose.

Save your poem as a plain text document (file extension .txt) with a short, meaningful file name. The file should be no larger than 20KB and must be saved using UTF-8 encoding.


To find statistical patterns in text data, we need a large amount of text.

Find a novel that has meaning to you, stored in an electronic format. You should either find the novel available without cost, or purchase the ebook (look for versions without DRM so we can access the raw data). I strongly recommend using Project Gutenberg to find a suitable book. It contains many classic novels in plain text format.

Save your book as a plain text document (file extension .txt) with a short, meaningful file name. The file should be no smaller than 150KB and must be saved using UTF-8 encoding.
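If you would like to check a text file against the poem and book requirements before uploading, the following is a minimal sketch in Python. The function name check_text_file and its parameters are illustrative, not part of the lab, and the sketch assumes that 1KB means 1024 bytes, which the lab does not specify.

```python
from pathlib import Path

# Assumption: 1 KB = 1024 bytes. The lab does not say which convention applies.
KB = 1024


def check_text_file(path, min_kb=None, max_kb=None):
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    p = Path(path)

    # The lab asks for a plain text document with a .txt extension.
    if p.suffix.lower() != ".txt":
        problems.append("file extension is not .txt")

    # Read raw bytes so the size and the encoding can both be checked.
    data = p.read_bytes()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")

    if max_kb is not None and len(data) > max_kb * KB:
        problems.append(f"file is larger than {max_kb}KB")
    if min_kb is not None and len(data) < min_kb * KB:
        problems.append(f"file is smaller than {min_kb}KB")
    return problems
```

For a poem you might call check_text_file("my_poem.txt", max_kb=20), and for a book check_text_file("my_book.txt", min_kb=150). Both file names here are placeholders.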

If your book is in another file format, such as .epub, it must be converted to a plain text document. Calibre can assist with this conversion.
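Calibre's desktop application can perform this conversion through its interface. Calibre also installs a command-line tool, ebook-convert, which infers the input and output formats from the file extensions. The sketch below wraps that tool from Python. The helper names are illustrative, and the code assumes Calibre is installed with ebook-convert available on your PATH.

```python
import shutil
import subprocess


def build_convert_command(source, target):
    """Build the argument list for Calibre's ebook-convert tool."""
    # ebook-convert takes the input file first and the output file second.
    return ["ebook-convert", str(source), str(target)]


def convert_to_text(source, target):
    """Run the conversion; raises if the Calibre command-line tool is missing."""
    if shutil.which("ebook-convert") is None:
        raise FileNotFoundError(
            "ebook-convert not found; is Calibre installed and on PATH?"
        )
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_convert_command(source, target), check=True)
```

For example, convert_to_text("my_book.epub", "my_book.txt") would attempt to write a plain text copy next to the original. After converting, it is worth opening the result to confirm the text came through cleanly.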


Find two images of paintings (as described below) that have meaning to you. Each image should be stored in either PNG or JPEG format. Each file must be at least 640x400 pixels in size.
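Most image viewers report pixel dimensions, but you can also read them programmatically. The sketch below reads the width and height from the header of a PNG file using only the Python standard library. It does not handle JPEG files, which would need an imaging library. The function names are illustrative, and the check assumes "640x400" means a width of at least 640 and a height of at least 400, which the lab does not state explicitly.

```python
import struct

# Every PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def meets_minimum(width, height, min_width=640, min_height=400):
    """Assumption: the lab's 640x400 minimum is width x height."""
    return width >= min_width and height >= min_height
```

To use it on a file, read the first 24 bytes in binary mode, for instance with open("portrait.png", "rb") where the file name is a placeholder, and pass them to png_dimensions.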

The first image should be a painted portrait. The face should be fully visible, including both eyes, the nose, and the mouth.

The second image should be a painted landscape.


Find two recordings of music (as described below) that have meaning to you. Each recording should be stored in the .mp3 file format. Both recordings should be between one and four minutes in length. An excerpt from a longer work is acceptable.

The first recording should be purely instrumental music with no singing. The second recording should include a vocal performance, sung in English. For the second recording, also submit the lyrics.

A good source for public-domain recordings of classical music is


For each of your selections, answer the following questions:

  • What makes this selection interesting to you?
  • What do you estimate is the reading level for this text? (only answer for Poem and Book)
  • Would you say this selection conveys an overall positive or negative sentiment?
  • Formulate a research question you hope to answer by analyzing this selection computationally.

Submit your answers to these reflection questions through the Creating a Corpus quiz in Teams.

What to Hand In

  • Answer the questions above in the Teams Assignment.
  • Upload your selections to the appropriate Kaggle data set.
    • For each item, use a filename describing it in a few words, ending with your last name. Separate the words with underscores. For example, I uploaded The Brothers Karamazov by Fyodor Dostoevsky. I used the filename karamazov_dostoevsky_ferrer.txt.
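
The naming convention above can be sketched as a small Python helper. The function name make_filename is illustrative. The sketch assumes lowercase names, as in the example filename, and keeps only ASCII letters and digits, so accented characters and punctuation are treated as word breaks.

```python
import re


def make_filename(description, last_name, extension):
    """Join the description words and last name with underscores, in lowercase."""
    # Assumption: only ASCII letters and digits are kept; anything else
    # (spaces, punctuation, accented characters) separates words.
    words = re.findall(r"[a-z0-9]+", f"{description} {last_name}".lower())
    return "_".join(words) + "." + extension.lstrip(".")


print(make_filename("Karamazov Dostoevsky", "Ferrer", "txt"))
# prints karamazov_dostoevsky_ferrer.txt
```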