Project 7: Handwriting Recognition/Sentiment Analysis with Decision Trees/Random Forests
Overview
You will implement the decision tree and random forest machine learning algorithms and apply them to two tasks: Handwriting Recognition and Sentiment Analysis.
Files
The csci335 repository contains a learning.decisiontree package that we will be using in this project. Files you modify are marked with an asterisk (*). It contains the following files:
- DecisionTree interface: Represents a decision tree node. Decision trees are defined recursively with these two implementations of the interface:
  - DTLeaf class: Represents a decision tree leaf.
  - DTInterior class*: Represents an interior node:
    - classify(): Recursively classifies based on the feature value. If the targeted feature value is less than or equal to maxFeatureValue, ask the left subtree; otherwise, ask the right subtree. DTTest.testInterior() should pass when this works.
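The recursive dispatch described above can be sketched as follows. This is an illustrative standalone example, not the project's actual API: the Node, Leaf, and Interior names and the int[] feature representation are placeholders for the real DecisionTree, DTLeaf, and DTInterior types.

```java
public class ClassifySketch {
    // A leaf stores a label; an interior node stores a feature index,
    // a split threshold (maxFeatureValue), and two subtrees.
    interface Node { String classify(int[] featureValues); }

    record Leaf(String label) implements Node {
        public String classify(int[] featureValues) { return label; }
    }

    record Interior(int feature, int maxFeatureValue, Node left, Node right) implements Node {
        public String classify(int[] featureValues) {
            // Values at or below the threshold go left; others go right.
            return featureValues[feature] <= maxFeatureValue
                    ? left.classify(featureValues)
                    : right.classify(featureValues);
        }
    }

    public static void main(String[] args) {
        Node tree = new Interior(0, 5, new Leaf("low"), new Leaf("high"));
        System.out.println(tree.classify(new int[]{3}));  // low
        System.out.println(tree.classify(new int[]{9}));  // high
    }
}
```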
- DTTrainer class*: Performs the decision tree learning algorithm:
  - getGini(): Calculates the gini coefficient of a data set. Should pass DTTest.testGini().
  - gain(): Calculates the gain of two children given a parent. Should pass DTTest.testGain().
  - splitOn(): Returns a duple of two new lists of training data. The first returned list should be everything from this set for which feature has a value less than or equal to featureValue. The second returned list should be everything else from this set. Should pass DTTest.testSplit().
  - reducedFeatures(): Call allFeatures.apply() to get the feature list. Then shuffle the list, retaining only targetNumber features. Should pass DTTest.testReduced().
  - train(): Implements the decision tree learning algorithm. Should pass DTTest.testTrain().
  - resample(): Generates a new data set by sampling randomly with replacement. It should return an ArrayList that is the same length as data, where each element is selected randomly from data. Should pass DTTest.testResample().
- RandomForest class*:
  - classify(): Ask each tree root for its classification of the Drawing, and return the plurality winner. I recommend using a Histogram.
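The plurality vote can be sketched with a plain Map standing in for the Histogram class; the String vote type is a placeholder for whatever label type the trees return.

```java
import java.util.*;

public class PluralitySketch {
    // Tally each tree's vote in a histogram and return the label
    // with the most votes.
    public static String pluralityWinner(List<String> votes) {
        Map<String, Integer> histogram = new HashMap<>();
        for (String vote : votes) histogram.merge(vote, 1, Integer::sum);
        return Collections.max(histogram.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Five trees voted; "3" has the plurality.
        System.out.println(pluralityWinner(List.of("3", "7", "3", "3", "7")));  // 3
    }
}
```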
Also examine the following files, which show how decision trees and random forests are adapted to our target domains:
- learning.handwriting.learners: DrawingTree, DrawingForest30
- learning.sentiment.learners: SentimentTree, SentimentForest30
Experiments
- Perform 4-way cross-validation to assess your decision tree on each of your handwriting data sets, as well as the
three sentiment analysis sets. How well does it perform in comparison with kNN and SOM?
- For the random forests, experiment with at least three different numbers of trees in the forest. Create a new class
for each forest size.
- Feel free to perform additional experiments to clarify any issues that may arise.
Visualization
A visualization has been provided for you for the handwriting problem, for both the regular Decision Tree and
the Random Forest. Be sure to employ the visualizations in your analysis.
Paper
When you are finished with your experiments, write a paper summarizing your findings. Include the following:
- An analysis and discussion of your data. (Be sure to include the data as well.)
- An analysis and discussion of the visualizations of the learned functions.
You are strongly encouraged to include images of the visualizations in support of your analysis.
- How well does decision tree learning perform for each task?
- Are random forests worth the trouble? Why or why not?
- How do these algorithms compare with k-nearest-neighbor, Naive Bayes, and self-organizing maps?
Assessment
- Level 1: Decision tree learning implemented and works. Paper includes analysis of decision tree learning for
the sentiment analysis and handwriting domains.
- Level 2: Random forest learning implemented and works. All of the above is included in the paper.
- Level 3: Both random forests and decision tree learning are applied to two real-world data sets you
have obtained beyond those employed for Levels 1 and 2.
- You may reuse the real-world data sets you used for Level 3 of the previous assignment.
- Your paper should include an analysis of the performance of these algorithms on those data sets.
- If you reused data sets from the previous assignment, you should also compare these algorithms' performance on them with that of the algorithms from the previous assignment.