Project 2: Data analysis

Haskell’s focus on functions and function composition, its lazy evaluation, and its support for wholemeal programming make it ideal for building data analysis pipelines.

For project 2, you will create a program which reads and analyzes a data set, displaying some summary information and allowing a user to query the data set.

Requirements

  1. You should choose a data set to use for your analysis. An ideal data set has the following characteristics:

    • Stored in a simple format such as .csv, to make it possible to read and analyze with a Haskell program. Something like .json is OK too. A data set in an Excel file is hopeless; in such a case you should first use Excel to export the data set to .csv.
    • Has enough rows that analyzing it by hand would be tedious or impossible (so that analyzing it with a program is interesting).
    • Contains a mix of different data types such as strings and integers.

    Here is a collection of many nice data sets that you are welcome to choose from, or you may select your own.

  2. Your program should read and parse the raw data into some appropriate data structure, i.e. something that represents the structure and meaning of the data, not just a list of lists of Strings or something similar. I recommend creating an algebraic data type (using record syntax) to represent one row of the data set; then read in a list of rows.

  3. Optionally, you may also wish to pre-process the data set into something more structured than a list of rows: for example, a Map that allows quick lookup of rows by primary key.

  4. For Level 2, your program must then prompt the user and allow them to choose among the analyses or queries on offer (e.g. show them the average score among all rows, or the median of all departure times, or the sum of the salaries from a certain state, …).
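To make requirements 2 through 4 concrete, here is one possible sketch using only base and containers. Everything in it is an assumption made for illustration: the Employee type, its fields, the four-column file layout, and all function names are hypothetical, and the comma splitting is deliberately naive (it does not handle quoted fields). Your own data set will need different types.

```haskell
import           Data.List  (sort)
import qualified Data.Map   as Map
import           Data.Maybe (mapMaybe)
import           Text.Read  (readMaybe)

-- Hypothetical row type: one record per line of an imagined
-- "id,name,state,salary" file.
data Employee = Employee
  { empId     :: Int
  , empName   :: String
  , empState  :: String
  , empSalary :: Int
  } deriving (Show, Eq)

-- Split a line on commas (naive: no support for quoted fields).
splitOnComma :: String -> [String]
splitOnComma s = case break (== ',') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOnComma rest

-- Parse one line; Nothing if the field count or a number is wrong.
parseEmployee :: String -> Maybe Employee
parseEmployee line = case splitOnComma line of
  [i, name, st, sal] ->
    Employee <$> readMaybe i <*> pure name <*> pure st <*> readMaybe sal
  _ -> Nothing

-- Parse file contents: skip the header line, drop malformed rows.
parseEmployees :: String -> [Employee]
parseEmployees = mapMaybe parseEmployee . drop 1 . lines

-- Requirement 3: index rows by primary key for quick lookup.
-- If two rows share an id, Map.fromList keeps the later one.
indexById :: [Employee] -> Map.Map Int Employee
indexById = Map.fromList . map (\e -> (empId e, e))

-- Requirement 4: a few simple analyses.
average :: [Int] -> Maybe Double
average [] = Nothing
average xs = Just (fromIntegral (sum xs) / fromIntegral (length xs))

median :: [Int] -> Maybe Double
median [] = Nothing
median xs
  | odd n     = Just (fromIntegral (sorted !! mid))
  | otherwise = Just (fromIntegral (sorted !! (mid - 1) + sorted !! mid) / 2)
  where
    sorted = sort xs
    n      = length xs
    mid    = n `div` 2

-- Total salary per state, built in one pass with fromListWith.
totalsByState :: [Employee] -> Map.Map String Int
totalsByState = Map.fromListWith (+) . map (\e -> (empState e, empSalary e))
```

In a real program you would obtain the file contents with readFile and pass them to parseEmployees. Returning Maybe from average and median avoids a crash on an empty list; whether to silently drop malformed rows, as this sketch does, or to report them is a design choice you should make for your own data.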

Hints

A recursive IO action is the way to write a recurring menu! A very simple version might look like this:

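import Control.Monad (when)
import System.IO (hFlush, stdout)

menu :: IO ()
menu = do
  putStr "Your choice? "
  hFlush stdout  -- make sure the prompt appears before waiting for input
  choice <- getLine
  case choice of
    "A" -> doThingA  -- doThingA and doThingB stand for your own IO actions
    "B" -> doThingB
    _ -> pure ()
  when (choice /= "quit") menu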
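For a menu choice that depends on user input, one option is to pass the parsed data to the menu as an argument and keep the actual query in a pure function. The following is only a sketch under assumptions: the Student type, the scoresFor query, the prompt helper, and the menu commands are all hypothetical names invented for this example.

```haskell
import Control.Monad (unless)
import System.IO (hFlush, stdout)

-- Hypothetical row type, for illustration only.
data Student = Student
  { studentName  :: String
  , studentScore :: Int
  }

-- Pure query: the scores of every student whose name matches exactly.
scoresFor :: String -> [Student] -> [Int]
scoresFor name = map studentScore . filter ((== name) . studentName)

-- Print a prompt, flush it so it shows up immediately, and read a line.
prompt :: String -> IO String
prompt msg = putStr msg >> hFlush stdout >> getLine

-- The data set is threaded through the recursion as an argument.
menu :: [Student] -> IO ()
menu students = do
  choice <- prompt "Your choice (count, scores, quit)? "
  case choice of
    "count"  -> print (length students)
    "scores" -> do
      name <- prompt "Student name? "
      print (scoresFor name students)
    _        -> pure ()
  unless (choice == "quit") (menu students)
```

Keeping scoresFor pure means you can try it out in GHCi without going through the menu at all.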

Sample project

I have provided a sample project to help get you started and give you some ideas. Note that it is only a bare minimum Level 1 project. If you’d like to take a look at it, you should download it and unzip it somewhere.

The sample project contains:

  • README.md: just a brief description of the project, and a link to the data set source.
  • flights-demo.cabal: describes the package in the standard Cabal format.
  • flights.csv: the actual data file. The first line is a list of column labels. Each subsequent line is a single data point, with values separated by commas.
  • app-simple/Main.hs: an implementation of the project that uses basic tools like lines and splitAt to parse the data.
  • app-cassava/Main.hs: an implementation that uses the cassava package to parse the data.

You are welcome to use the sample project as a starting point for your project, or you can create a new project from scratch.

What to turn in

You should turn in a .zip or .tgz file containing an entire Cabal package, along with the data file(s) needed by your project.

Specification

  • Level 1:
    • Your own name and email appear in the author and maintainer fields in the .cabal file, and in the copyright notice in LICENSE.
    • Your project runs, loads the data file, parses it into an appropriate data structure, and displays some kind of summary or analysis to the user.
  • Level 2:
    • All requirements for Level 1.
    • Your code conforms to the style guidelines.
    • Your data types and functions are decomposed into reasonably small pieces.
    • Your code makes good use of wholemeal programming patterns such as function composition, and higher-order functions such as map, filter, Data.Map.fromListWith, etc.
    • Your program displays a menu giving the user at least 3 choices among different summaries or analyses; at least two of the choices depend on user input, e.g. displaying statistics about records containing a name entered by the user.
  • Level 3:
    • All requirements for Level 2.
    • At least one extra feature, such as (but not necessarily limited to):
      • A more complex data set involving multiple tables/.csv files, with relationships among them.
      • Allowing the user to edit the data set, then writing it back out to a file once complete.
      • Especially if you took Programming Languages last year, it might be fun to allow the user to query the data set via a mini domain specific language. For example, the user might be able to enter things like (sum(X) + sum(Y)) / 2.
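If you try the query language idea, a natural first step is a data type for expressions together with an evaluator. The sketch below is one possible shape, not a required design: the Expr constructors, the Columns type, and eval are hypothetical names, and it assumes the data set has already been reduced to named numeric columns. Parsing the text typed by the user into an Expr is left out.

```haskell
import qualified Data.Map as Map

-- Hypothetical abstract syntax for a tiny query language.
data Expr
  = Lit Double        -- a numeric constant
  | Sum String        -- the sum of a named column, e.g. sum(X)
  | Add Expr Expr
  | Div Expr Expr
  deriving Show

-- Assumed representation: each column name maps to its numeric values.
type Columns = Map.Map String [Double]

-- Evaluate an expression; Nothing for an unknown column or division by zero.
eval :: Columns -> Expr -> Maybe Double
eval _    (Lit x)   = Just x
eval cols (Sum c)   = sum <$> Map.lookup c cols
eval cols (Add a b) = (+) <$> eval cols a <*> eval cols b
eval cols (Div a b) = do
  x <- eval cols a
  y <- eval cols b
  if y == 0 then Nothing else Just (x / y)
```

Under this representation, the query (sum(X) + sum(Y)) / 2 corresponds to the value Div (Add (Sum "X") (Sum "Y")) (Lit 2).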