Project 7: Handwriting Recognition/Sentiment Analysis with Decision Trees/Random Forests

Overview

You will implement the decision tree and random forest machine learning algorithms, and apply them to two tasks: Handwriting Recognition and Sentiment Analysis.

Files

The csci335 repository contains a learning.decisiontree package that we will use in this project. Files you modify are marked with an asterisk (*). The package contains the following files:

  • DecisionTree interface: Represents a decision tree node. Decision trees are defined recursively with these two implementations of the interface:
    • DTLeaf class: Represents a decision tree leaf.
    • DTInterior class*: Represents an interior node:
      • classify(): Recursively classifies based on the feature value. If the targeted feature value is less than or equal to the maxFeatureValue, ask the left subtree. Otherwise, ask the right subtree. DTTest.testInterior() should pass when this works.
  • DTTrainer class*: Performs the decision tree learning algorithm:
    • getGini(): Calculates gini coefficient of a data set. Should pass DTTest.testGini().
    • gain(): Calculates gain of two children given a parent. Should pass DTTest.testGain().
    • splitOn(): Returns a duple of two new lists of training data. The first list should contain every element of this set whose feature value is less than or equal to featureValue; the second list should contain everything else. Should pass DTTest.testSplit().
    • reducedFeatures(): Call allFeatures.apply() to get the feature list. Then shuffle the list, retaining only targetNumber features. Should pass DTTest.testReduced().
    • train(): Implements the decision tree learning algorithm. Should pass DTTest.testTrain().
    • resample(): Generates a new data set by sampling randomly with replacement. It should return an ArrayList that is the same length as data, where each element is selected randomly from data. Should pass DTTest.testResample().
  • RandomForest class*:
    • classify(): Ask each tree root for its classification of the Drawing, then return the plurality winner. I recommend using a Histogram.
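The impurity computations behind getGini() and gain() follow the standard formulas: Gini impurity is one minus the sum of squared label frequencies, and gain is the parent's impurity minus the size-weighted impurity of its two children. A rough sketch follows; the class name, generic signatures, and label-list representation here are illustrative assumptions, not DTTrainer's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GiniSketch {
    // Gini impurity: 1 - sum over labels of (fraction with that label)^2.
    // A pure set (one label) scores 0; an even two-way mix scores 0.5.
    static <L> double gini(List<L> labels) {
        Map<L, Integer> counts = new HashMap<>();
        for (L label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        double sumSquares = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / labels.size();
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    // Gain: parent impurity minus the size-weighted impurity of the children.
    static <L> double gain(List<L> parent, List<L> left, List<L> right) {
        double weightedChildGini =
            (gini(left) * left.size() + gini(right) * right.size()) / parent.size();
        return gini(parent) - weightedChildGini;
    }
}
```

A split that sends all the "A" labels left and all the "B" labels right yields the maximum possible gain for that parent, which is what train() is hunting for when it picks a feature and threshold.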
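splitOn() and resample() are both simple list manipulations, as described above. A hedged sketch, using a two-element list in place of the repository's duple type and illustrative method signatures:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.ToIntFunction;

public class SplitResampleSketch {
    // splitOn: partition data into (featureValue <= threshold) and
    // (featureValue > threshold). A two-element list of lists stands in
    // for the duple type that DTTrainer actually returns.
    static <T> List<List<T>> splitOn(List<T> data, ToIntFunction<T> feature, int featureValue) {
        List<T> atOrBelow = new ArrayList<>();
        List<T> above = new ArrayList<>();
        for (T item : data) {
            if (feature.applyAsInt(item) <= featureValue) {
                atOrBelow.add(item);
            } else {
                above.add(item);
            }
        }
        return List.of(atOrBelow, above);
    }

    // resample: draw data.size() elements uniformly at random, with
    // replacement, so some elements repeat and others are left out.
    static <T> ArrayList<T> resample(List<T> data, Random rng) {
        ArrayList<T> sample = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            sample.add(data.get(rng.nextInt(data.size())));
        }
        return sample;
    }
}
```

Sampling with replacement is what makes each tree in the forest see a slightly different view of the training data, which is the source of the forest's variance reduction.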
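The plurality vote in RandomForest.classify() can be sketched as follows. A plain HashMap stands in here for the repository's Histogram class, whose exact API is not assumed; the votes list represents one classification per tree root.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PluralitySketch {
    // Count one vote per tree, then return the label with the most votes.
    static <L> L pluralityWinner(List<L> votes) {
        Map<L, Integer> counts = new HashMap<>();
        for (L vote : votes) {
            counts.merge(vote, 1, Integer::sum);
        }
        L best = null;
        int bestCount = -1;
        for (Map.Entry<L, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```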

Also examine the following files that show adaptation of decision trees and random forests to our target domains:

  • learning.handwriting.learners:
    • DrawingTree
    • DrawingForest30
  • learning.sentiment.learners:
    • SentimentTree
    • SentimentForest30

Assessing performance

  • Perform 4-way cross-validation to assess your decision tree on each of your handwriting data sets, as well as the three sentiment analysis sets. How well does it perform in comparison with kNN and SOM?
  • For the random forests, experiment with at least three different numbers of trees in the forest. Create a new class for each forest size.
  • Feel free to perform additional experiments to clarify any issues that arise.
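The 4-way cross-validation above amounts to partitioning each data set into four folds, then training on three folds and testing on the held-out fold, once per fold. A minimal sketch of the fold construction (the course code base may already provide cross-validation support; this class and method are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class CrossValidationSketch {
    // Split data into k folds; for each fold, the fold itself is the test
    // set and the remaining elements form the training set. Each entry of
    // the result is a two-element list: (train, test).
    static <T> List<List<List<T>>> kFoldSplits(List<T> data, int k) {
        List<List<List<T>>> splits = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            List<T> train = new ArrayList<>();
            List<T> test = new ArrayList<>();
            for (int i = 0; i < data.size(); i++) {
                if (i % k == fold) {
                    test.add(data.get(i));
                } else {
                    train.add(data.get(i));
                }
            }
            splits.add(List.of(train, test));
        }
        return splits;
    }
}
```

Averaging accuracy over the four folds gives a more stable estimate than a single train/test split, which matters when comparing against kNN and SOM.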

Visualization

Visualizations have been provided for the handwriting problem, for both the regular Decision Tree and the Random Forest. Be sure to employ them in your analysis.

Paper

When you are finished with your experiments, write a paper summarizing your findings. Include the following:

  • An analysis and discussion of your data. (Be sure to include the data as well.)
  • An analysis and discussion of the visualizations of the learned functions. You are strongly encouraged to include images of the visualizations in support of your analysis.
  • How well does decision tree learning perform for each task?
  • Are random forests worth the trouble? Why or why not?
  • How do these algorithms compare with k-nearest-neighbor, Naive Bayes, and self-organizing maps?

Assessment

  • Level 1: Decision tree learning implemented and works. Paper includes analysis of decision tree learning for the sentiment analysis and handwriting domains.
  • Level 2: Random forest learning implemented and works. All of the above is included in the paper.
  • Level 3: Both random forests and decision tree learning are applied to two real-world data sets you have obtained beyond those employed for Levels 1 and 2.
    • You may reuse the real-world data sets you used for Level 3 of the previous assignment.
    • Your paper should include an analysis of the performance of these algorithms on those data sets.
    • If you reused data sets from the previous assignment, you should also analyze their performance relative to the algorithms from the previous assignment.