Project 6: Handwriting Recognition/Sentiment Analysis with kNN/Naive Bayes/SOM

Overview

You will implement the k-nearest-neighbor, Naive Bayesian Classifier, and self-organizing map machine learning algorithms, and apply them to two tasks: Handwriting Recognition and Sentiment Analysis.

Files

The csci335 repository contains a learning package that we will be using in this project. It contains many files. The files in which you will write code are:

  • handwriting.core.FloatDrawing: Alternative handwriting representation for SOM.
  • classifiers.Knn: Implementation of the k-nearest-neighbor algorithm. It should pass the unit tests in classifiers.KnnTest when complete.
  • classifiers.NaiveBayes: Implementation of the Naive Bayes classifier algorithm.
  • classifiers.SOMRecognizer: Supervised learning algorithm using a SOM.
  • som.SelfOrgMap: Implementation of the self-organizing map. It should pass the unit tests in som.SelfOrgMapTest when complete.
  • handwriting.learners: Configured handwriting learners go here. A few examples are provided.
  • sentiment.learners: Configured sentiment learners go here. A few examples are provided.

Other files of particular interest include:

  • handwriting.gui.DrawingEditor: Run this to create handwriting samples and test your learners.
  • sentiment.gui.SentimentViewer: Run this to test your learners on sentiment analysis problems.
  • core.Histogram: Implements a Histogram data type. This will be useful for kNN among other things.
  • sentiment.core.BagOfWordsFuncs: Distance functions for bags-of-words.
  • core.Classifier: Interface for our machine learning algorithms.
  • handwriting.core.Drawing: Data type for handwriting samples. Drawing::distance is used by kNN.
  • handwriting.core.SOMDrawingBridge: Classifier front-end that turns Drawing objects into FloatDrawing objects to make life easier for the self-organizing map.

Experiments

Handwriting Recognition

  • Complete the implementations of the above files as specified.
  • Using DrawingEditor, draw 20 samples each of two letters. For each drawing, click the “Record” button when it is complete. For the label, use the letter that you drew. Once this is complete, save the file (using the Save command on the File menu).
  • Test the performance of Knn3 on these two letters. Use the Assess menu option under the Learner menu, and perform 4-way cross-validation.
  • Expand your data set to train it to distinguish three letters. Save the expanded data set under a different filename.
  • Continue iterating this process until you can build a classifier that can distinguish at least eight different letters.
  • Compare the performance of k=3 against two other values of k.
  • Repeat this process with the self-organizing map. Compare performance with three different map sizes.

Sentiment Analysis

  • For the sentiment analysis problem, we will compare kNN and Naive Bayes.
  • Using the same values of k as before, evaluate the classification performance of each of these algorithms on the three sentiment analysis data sets: Amazon, Internet Movie Database, and Yelp. All three are found in the sentiment_labelled_sentences folder.
  • Again, perform 4-way cross-validation.

Paper

When you are finished with your experiments, write a paper summarizing your findings. Include the following:

  • An analysis and discussion of your data. (Be sure to include the data as well.)
  • What effect did variations of the value of k have?
  • How about variations in map size?
  • What insights did you gain from the SOM visualizations?
  • Compare the performance of kNN and the SOM for the handwriting classification tasks. Which worked better? Why?
  • Compare the performance of kNN and Naive Bayes on the sentiment analysis task. Which worked better? Why?
  • Compare the relative difficulty of the two tasks. What aspects of the tasks, in your view, contributed to this relative difficulty?
  • Beyond the actual results, what other issues are noteworthy?

Submission

  • Post your code on Github in your private repository. Make sure the instructor is added as a collaborator.
  • Upload your paper to Microsoft Teams.

Grading Criteria

  • Level 1
    • The kNN algorithm is implemented and functional.
    • The paper includes an analysis of kNN for each of the handwriting and sentiment analysis problems.
  • Level 2
    • The self-organizing map and naive bayes algorithms are implemented and functional.
    • The paper includes all of the above analysis.
  • Level 3
    • Find two real-world data sets that you would like to explore. Do the following:
      • Perform unsupervised learning on each data set with the self-organizing map and produce a visualization.
        • What insight do you get about the data from this process? Analyze in your paper.
      • Perform supervised learning on each data set comparing all three algorithms.
        • Which produced the best performance? Why?