Project 1: Language Analysis with Markov Chains
Overview
When a web browser visits a webpage, there is sometimes an option to
automatically translate the text into English from another language. How
does the browser detect other languages? One way is to pass the text of the
webpage through prebuilt Markov chains for many languages, and determine
which language has the highest probability of generating this text.
To replicate this functionality, you will implement the Markov Chain
machine learning algorithm, and apply it to the language recognition task.
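The idea in the paragraphs above can be sketched in a few lines of Python. This is only an illustration of the technique, not the Java code you will write for the project: the function names, the two tiny training strings, and the add-one smoothing with an assumed 27-symbol alphabet (lowercase letters plus space) are all choices made for this sketch.

```python
import math
from collections import defaultdict

def train_chain(text):
    """Count character-to-character transitions in a reference text."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def log_probability(counts, sequence, alphabet_size=27):
    """Log-probability that the chain generates `sequence`.

    Add-one smoothing is an assumption of this sketch; it keeps an
    unseen transition from forcing the whole probability to zero.
    """
    total = 0.0
    for prev, nxt in zip(sequence, sequence[1:]):
        row = counts.get(prev, {})
        numerator = row.get(nxt, 0) + 1
        denominator = sum(row.values()) + alphabet_size
        total += math.log(numerator / denominator)
    return total

# Toy reference texts standing in for the book-length training files.
chains = {
    "english": train_chain("the quick brown fox jumps over the lazy dog and then the other dog"),
    "spanish": train_chain("el rapido zorro marron salta sobre el perro perezoso y luego el otro perro"),
}

# Score the same text under every chain and keep the most likely label.
scores = {label: log_probability(chain, "the dog") for label, chain in chains.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Working with log-probabilities rather than raw products avoids numeric underflow on long texts, since a product of many small transition probabilities quickly rounds to zero.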
We can take a step further and use Hidden Markov Models to try to understand
the structure of the phonemes in a language's writing system. To that end, you
will use the hmmlearn library to build Hidden Markov Models to explore
language structure.
Files
The csci335 repository contains a learning.markov package that we will be
using in this project. The package contains the following files. Files you
modify are marked with an asterisk (*).
LanguageGuesser class: A GUI that allows the user to:
- Train a Markov Chain using a reference text. For your convenience,
reference texts in English, Spanish, French, and German are provided in the
books directory.
- Type a sentence, and see a classification and probability distribution
for the classification.
MarkovChain*: Implements a collection of Markov chains, one for each designated label:
- count(): Increases the count for the transition from prev to next.
- probability(): Returns the probability of the chain for label generating sequence.
- labelDistribution(): Returns a map with keys that are labels and values representing the probability of a sequence corresponding to that label.
- bestMatchingChain(): Calls labelDistribution() and returns the label with the highest probability.
MarkovLanguage: Extends MarkovChain with some utility methods to assist with language classification.
SimpleMarkovTest: Unit tests featuring some simple examples.
MajorMarkovTest: Unit tests trained using the four provided books.
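As a rough illustration of how labelDistribution() and bestMatchingChain() relate, the Python sketch below turns per-label scores into a distribution and picks the top label. It is not the Java API from the repository: the function names, the use of log-probabilities as input, the choice to normalize so the values sum to 1, and the example numbers are all assumptions made for this sketch.

```python
import math

def label_distribution(log_scores):
    """Turn per-label log-probabilities into a distribution that sums to 1."""
    # Subtract the largest score before exponentiating so that very
    # negative log-probabilities do not underflow to zero.
    peak = max(log_scores.values())
    weights = {label: math.exp(score - peak) for label, score in log_scores.items()}
    total = sum(weights.values())
    return {label: weight / total for label, weight in weights.items()}

def best_matching_label(log_scores):
    """Return the label with the highest probability in the distribution."""
    distribution = label_distribution(log_scores)
    return max(distribution, key=distribution.get)

# Hypothetical log-probabilities of one sentence under four chains.
example_scores = {"English": -42.0, "Spanish": -47.5, "French": -45.1, "German": -50.3}
distribution = label_distribution(example_scores)
winner = best_matching_label(example_scores)
print(winner, distribution)
```

Normalizing makes the values comparable across sentences of different lengths, which is convenient when recording a probability for each language.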
The Hidden Markov Model Language Analysis Kaggle notebook contains Python
code that we will use for exploring Hidden Markov Models. You should clone
this notebook for your own use.
- Obtain four sentences in each of English, Spanish, French, and German, and
test and record how well LanguageGuesser classifies each sentence.
- You may obtain sentences by searching for them on the web, writing them
yourself, using Google Translate, etc.
- Select four other languages, and obtain four sentences in each of them.
- Each language must have a writing system that employs Latin characters.
- Run each sentence through LanguageGuesser. Given how it was trained,
how plausible are its guesses?
Hidden Markov Models
- We can use Hidden Markov Models to try to gain deeper insight into the dynamics
of a sequence.
- Follow the instructions in the Hidden Markov Model Language Analysis
Kaggle notebook to build a Hidden Markov Model of
the dynamics of English text.
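To make the idea of a two-state Hidden Markov Model concrete, the sketch below computes by hand the likelihood of a short character sequence using the forward algorithm. It does not call hmmlearn and is not taken from the notebook; the function name, the toy three-letter alphabet, and every parameter value are placeholders chosen for illustration, not values from a fitted model.

```python
def forward_likelihood(start, trans, emit, observations):
    """Probability of a non-empty `observations` under an HMM (forward algorithm).

    start[i]     : probability of beginning in hidden state i
    trans[i][j]  : probability of moving from state i to state j
    emit[i][sym] : probability that state i emits symbol sym
    """
    n_states = len(start)
    # alpha[i] = probability of the symbols seen so far, ending in state i.
    alpha = [start[i] * emit[i][observations[0]] for i in range(n_states)]
    for sym in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][sym]
            for j in range(n_states)
        ]
    return sum(alpha)

# Placeholder parameters for a two-state model over the toy alphabet {a, b, t}.
start = [0.5, 0.5]
trans = [[0.2, 0.8], [0.7, 0.3]]
emit = [
    {"a": 0.8, "b": 0.1, "t": 0.1},
    {"a": 0.1, "b": 0.4, "t": 0.5},
]

likelihood = forward_likelihood(start, trans, emit, "bat")
print(likelihood)
```

Training a model amounts to searching for the start, transition, and emission values that make the observed text most likely; inspecting the learned emission rows is one way to see which characters each hidden state favors.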
Paper
When you are finished with your experiments, write a paper summarizing your findings. Include the following:
- The URL for the private GitHub repository containing your code.
- A table containing the 32 sentences you gathered. The first column should give the language, the second column the sentence, the third column the best matching language, and the fourth through seventh columns should give the probability of each of English, Spanish, French, and German.
- An analysis of the performance of your implementation based on the data recorded in the table.
- An analysis of the plausibility of the classifications of the sentences from
languages other than those on which it was trained.
- Based on the performance of Markov chains for the language classification task,
for what other kinds of tasks do you believe this approach would be useful?
Carefully explain your answer.
- What do the two states in the Hidden Markov Model represent?
Assessment
- Level 1:
- Create a working implementation of the LanguageGuesser program.
- Share the repository with the instructor.
- Assess its performance with sentences in English, Spanish, French, and German.
- Submit a paper describing the results of the above.
- Level 2:
- Assess the performance of LanguageGuesser with the four alternative languages.
- Complete the Hidden Markov Model notebook implementation.
- Submit a paper including all items mentioned above.
- Level 3:
- For each of the four additional languages, obtain a book-length text file,
and train the Markov Chain on all eight languages. Then reassess its performance
and include this reassessment in your paper.
- I highly recommend selecting languages that have close
relationships, in order to investigate the ability of the
Markov chain to distinguish similar data sets.
- Some examples of closely related languages:
- English and Scots
- Dutch and Afrikaans
- Finnish and Estonian
- Polish and Slovak
- Samoan and Maori
- The languages Portuguese, Galician, Spanish, Catalan, Provençal, and French
developed from different points along a dialect continuum of languages descended
from Latin. Any two adjacent languages in this list would be an interesting
test case.
- Create Hidden Markov Models for Spanish, French, and German, and repeat the
two-state experiment we performed for English. In your paper, analyze the
similarities and differences among their outcomes.