Lab 1: Creating a Corpus
Files for Analysis
In this class we will be analyzing various expressions of the human
condition through language, art, and music. To assist us and make our
work personally valuable, you will be gathering text, images, and sound
files for the class to analyze in subsequent labs.
Uploading copyrighted materials is fine for this assignment, as they will only be
distributed to students in the course and used for strictly educational purposes.
Be sure not to further redistribute any of these materials.
That said, public domain materials are often easier to obtain. Some links to
repositories of freely available materials are provided below.
Poem
Find a short poem written in English that holds meaning for you. You might find
Public Domain Poetry useful for this purpose.
Save your poem as a plain text document with a short, meaningful file
name (with file extension .txt). The file should be no larger than 20KB
and be saved using UTF-8 encoding.
Book
To find statistical patterns in text data, we need a large amount of text.
Find a novel that has meaning to you, stored in an
electronic format. You should either find the novel available without
cost, or purchase the ebook (look for versions without DRM so we can
access the raw data). I strongly recommend using
Project Gutenberg to find a suitable book.
It contains many classic novels in plain text format.
Save your book as a plain text document with a short, meaningful file
name (with file extension .txt). The file should be no smaller than 150KB
and be saved using UTF-8 encoding.
If your book is in another file format, such as .epub, it must be converted
to a plain text document. Calibre can assist with this conversion.
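
Before uploading, you may want to confirm that your poem and book meet the size and encoding requirements. The following is a minimal sketch, not an official checker for this lab: the function name `check_text_file` is hypothetical, and it assumes that 1KB means 1024 bytes.

```python
from pathlib import Path

KB = 1024  # assumption: 1KB is taken to mean 1024 bytes


def check_text_file(path, min_kb=None, max_kb=None):
    """Return a list of problems found; an empty list means the file passes."""
    path = Path(path)
    problems = []

    # The lab asks for plain text files with a .txt extension.
    if path.suffix.lower() != ".txt":
        problems.append("file extension is not .txt")

    data = path.read_bytes()

    # Decoding the raw bytes is a direct test of whether the file is valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"not valid UTF-8 (first bad byte at offset {err.start})")

    size_kb = len(data) / KB
    if min_kb is not None and size_kb < min_kb:
        problems.append(f"too small: {size_kb:.1f}KB is less than {min_kb}KB")
    if max_kb is not None and size_kb > max_kb:
        problems.append(f"too large: {size_kb:.1f}KB is more than {max_kb}KB")

    return problems
```

For the poem you would call `check_text_file("my_poem.txt", max_kb=20)`, and for the book `check_text_file("my_book.txt", min_kb=150)`, where both file names are placeholders for your own.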
Images
Find two images of paintings (as described below) that have meaning to you. Each
image should be stored in either PNG or JPEG format. Each file must be at least 640x400 pixels in size.
The first image should be a painted portrait. The face should be fully visible,
with two eyes, a nose, and a mouth visible in particular.
The second image should be a painted landscape.
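
If you are unsure whether an image is large enough, its dimensions can be read from the file header. The sketch below uses only the Python standard library. The names `image_size` and `meets_minimum` are hypothetical, and it assumes that "at least 640x400" means a width of at least 640 pixels and a height of at least 400 pixels. It handles ordinary PNG and JPEG files and has not been tested against unusual or damaged ones.

```python
import struct


def image_size(path):
    """Return (width, height) for a PNG or JPEG file by reading its header."""
    with open(path, "rb") as f:
        data = f.read()

    # PNG: an 8-byte signature, then the IHDR chunk, which stores width and
    # height as big-endian 32-bit integers starting at byte offset 16.
    if data[:8] == b"\x89PNG\r\n\x1a\n":
        width, height = struct.unpack(">II", data[16:24])
        return width, height

    # JPEG: walk the marker segments until a start-of-frame (SOF) marker.
    if data[:2] == b"\xff\xd8":
        pos = 2
        while pos + 4 <= len(data):
            if data[pos] != 0xFF:
                raise ValueError("malformed JPEG segment")
            marker = data[pos + 1]
            if marker == 0xFF:  # padding byte before a marker
                pos += 1
                continue
            if marker == 0x01 or 0xD0 <= marker <= 0xD7:  # markers with no length
                pos += 2
                continue
            (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
            # SOF markers are 0xC0-0xCF, except 0xC4, 0xC8, and 0xCC.
            if 0xC0 <= marker <= 0xCF and marker not in (0xC4, 0xC8, 0xCC):
                # Layout after the length: 1 byte precision, then height, width.
                height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
                return width, height
            pos += 2 + length
        raise ValueError("no JPEG frame header found")

    raise ValueError("not a PNG or JPEG file")


def meets_minimum(path, min_width=640, min_height=400):
    """Check the assumed reading of the 640x400 minimum size requirement."""
    width, height = image_size(path)
    return width >= min_width and height >= min_height
```

Calling `meets_minimum("my_portrait.png")` (a placeholder file name) returns `True` when the image satisfies the assumed minimum. An image library would be a more robust choice if you already have one installed.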
Music
Find two recordings of music (as described below) that have meaning to you. Each
recording should be stored in the .mp3 file format. Both recordings should be
between one and four minutes in length. An excerpt from a longer work is acceptable.
The first recording should be purely instrumental music with no singing. The
second recording should include a vocal performance, sung in English. For the
second recording, also submit the lyrics.
A good source for public-domain recordings of classical music is imslp.org.
Reflection
For each of your selections, answer the following questions:
- What makes this selection interesting to you?
- What do you estimate is the reading level for this text? (only
answer for Poem and Book)
- Would you say this selection conveys an overall positive or negative
sentiment?
- Formulate a research question you hope to answer by analyzing this
selection computationally.
Submit your answers to these reflection questions through the Creating a Corpus
quiz in Teams.
What to Hand In
- Answer the questions above in the Teams Assignment.
- Upload your selections to the appropriate Kaggle data set.
- For each item, use a filename describing it in a few words, ending with your last name. Separate the words
with underscores. For example, I uploaded The Brothers Karamazov by Fyodor Dostoevsky using
the filename karamazov_dostoevsky_ferrer.txt.
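
The naming convention can be sketched in a few lines of Python. The helper `make_filename` is hypothetical, and lowercasing everything is an assumption based on the example filename rather than a stated requirement.

```python
import re


def make_filename(description, last_name, extension):
    """Join the descriptive words and the last name with underscores."""
    # Keep runs of letters and digits; spaces and punctuation act as separators.
    words = re.findall(r"[^\W_]+", f"{description} {last_name}".lower())
    return "_".join(words) + "." + extension.lstrip(".").lower()


print(make_filename("Karamazov Dostoevsky", "Ferrer", ".txt"))
# prints karamazov_dostoevsky_ferrer.txt
```

Punctuation inside a name is treated as a word break, so a last name such as O'Neil would come out as `o_neil`. Adjust the result by hand if that is not what you want.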