Module 06: The Arith language: parsing
======================================
* Write your team names here:
* You may again choose whoever you want to start as the driver. Write
your choice here:
In this module, we will explore some rudiments of parsing.
> {-# LANGUAGE GADTSyntax #-}
>
> import Data.Char -- for isSpace, isDigit
Tokenizing
----------
> data Token where
> TLit :: Integer -> Token
> TPlus :: Token
> TMinus :: Token
> TTimes :: Token
> LParen :: Token
> RParen :: Token
> deriving (Show, Eq)
* Write a function `tokenize :: String -> [Token]` which turns a
string such as `" ( 2+ 45)"` into a list of tokens like
`[LParen, TLit 2, TPlus, TLit 45, RParen]`. Your function should
ignore any whitespace (use the `isSpace` function). *Hint*: you may
also find the `span` function useful. Try calling `span isDigit` to
see what it does.
Of course, `tokenize` might fail if the provided `String` is not a
valid Arith expression. Ideally, we would handle the errors in a
principled way (*e.g.* by having `tokenize` return a `Maybe
[Token]`, or something even more informative). For now, your
function should simply crash when given invalid input. We will
consider error handling in more depth later in the semester.
Postfix
-------
* **ROTATE ROLES** and write the name of the new driver here:
Recall our representation of the Arith language from last time:
> data Op where
> Plus :: Op
> Minus :: Op
> Times :: Op
> deriving (Show, Eq)
>
> data Arith where
> Lit :: Integer -> Arith
> Bin :: Op -> Arith -> Arith -> Arith
> deriving (Show)
Consider the following alternative *postfix* syntax for Arith:
```
::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
::= +
::= '+' | '-' | '*'
::=
| ' ' ' '
```
Notice that each operator appears *after* its operands (instead of
*between* them).
* Write down three examples that match the grammar ``.
* Consider the following infix expressions. Write each as an
equivalent postfix expression. Be careful about the order of arguments!
+ `3 + 4 * 5`
+ `3 * 4 + 5`
+ `(2 + 3 - (4 - 5)) * 6`
* Get out a sheet of paper and draw each of the abstract syntax trees
corresponding to the above expressions. What is the relationship
between their trees and their postfix forms?
* Note that `` expressions do not contain any parentheses.
Will this ever cause ambiguity or incorrect parsing? If so, give an
example; if not, explain why not.
Shunting
--------
* **ROTATE ROLES** and write the name of the new driver here:
As discussed in class, the Shunting Yard Algorithm reorders a list
of tokens from infix to postfix notation by repeatedly applying the
following rules. It makes use of a temporary stack of tokens.
$\newcommand{\opin}{\mathit{op}_{\mathit{in}}}$
$\newcommand{\opstk}{\mathit{op}_{\mathit{stk}}}$
1. If the input is empty, move all operators on the temporary stack to
the output.
2. If the next input token is a literal number, move it to the output.
3. If the next input token is a left parenthesis, push it onto the
stack.
4. If the next input token is a right parenthesis:
* If the top of the stack is a left parenthesis, discard both
parens.
* Otherwise, pop the top token from the stack and move it to the
output (keeping the right paren in the input).
This has the effect of moving all operators on the stack to the
output up until the first matching left parenthesis.
5. If the next input token is an operator $\mathit{op}_{\mathit{in}}$:
* If the stack is empty, push $\mathit{op}_{\mathit{in}}$ on the stack.
* If the top of the stack is a token $\mathit{op}_{\mathit{stk}}$:
* If the precedence of $\mathit{op}_{\mathit{in}}$ is $\leq$ the precedence of
$\mathit{op}_{\mathit{stk}}$, pop $\mathit{op}_{\mathit{stk}}$ from the stack to the output and keep $\mathit{op}_{\mathit{in}}$
in the input.
* Otherwise, keep $\mathit{op}_{\mathit{stk}}$ on the stack and push $\mathit{op}_{\mathit{in}}$ on
top of it.
More concisely, when we see $\mathit{op}_{\mathit{in}}$ we pop operators from the
stack as long as they have precedence greater than $\mathit{op}_{\mathit{in}}$, until
either the stack is empty or the top operator on the stack has
precedence less than $\mathit{op}_{\mathit{in}}$; then we push $\mathit{op}_{\mathit{in}}$.
* Write a function `shunt :: [Token] -> [Token]` which converts an
infix expression into postfix, implementing the shunting yard
algorithm discussed in class.
* *Hint 1*: make another "helper" function with type `[Token] ->
[Token] -> [Token]`, which keeps track of both the input list of
tokens as well as the temporary operator stack.
* *Hint 2*: make a function of type `Token -> Int` which returns
the precedence of each token. Be sure to give parentheses the
*lowest* precedence (this means Rule 5 above does not need any
special cases for parentheses). It probably doesn't matter what
precedence you give to number literals, but logically they
should have the *highest* precedence.
* Where do you think the given algorithm would need to be modified to
accommodate both left- and right-associative operators?
Parsing
-------
* Write a function `parsePostfix :: [Token] -> Arith` which takes a
list of tokens in postfix form and builds a corresponding `Arith`
expression. For example, `parsePostfix [TLit 2, TLit 45, TPlus] =
Bin Plus (Lit 2) (Lit 45)`.
`parsePostfix` should work by maintaining an `[Arith]` stack.
Literal tokens are just pushed onto the stack. Each operator in
the postfix expression applies to the two previous subexpressions,
so when encountering an operator, pop the top two `Arith` trees
from the stack, make them into a new `Arith` tree with the
operator at the root, and push the result back on the stack.
Again, `parsePostfix` should work by just calling a recursive
helper function with an extra parameter for the `[Arith]` stack
(which starts off empty).
* Finally, put it all together: write a function `eval :: String ->
Integer` which works as follows:
Concrete syntax (infix) $\rightarrow$ *tokenize* $\rightarrow$
Token list (infix) $\rightarrow$ *shunt* $\rightarrow$ Token list
(postfix) $\rightarrow$ *parse* $\rightarrow$ Arith AST
$\rightarrow$ *interpret* $\rightarrow$ Integer value