In practice, one rarely writes parsers from scratch like we have done on previous modules. Typically, one uses some sort of tool or framework for constructing parsers, which hides a lot of the complexity of dealing with lexing/tokenizing, precendence and associativity, etc., and allows you to focus more directly on the grammar you wish to parse.
In this module, we will explore a Haskell library for constructing parsers called parsec.
parsec
If you are using repl.it you should not need to do anything special to be able to use parsec; just make sure that you upload the provided file Parsing.hs to your project. If you have Haskell installed on your local machine, to install parsec, you should open a command prompt and type
repl.it
Parsing.hs
cabal install --lib parsec
You should skim over the following code but you need not read it too closely at this point. Later parts of the module will direct you to read parts of it more carefully.
{-# LANGUAGE GADTSyntax #-} -- Hide some standard operators so we can use -- variants with more specific types (for now) import Prelude hiding ((<$>), (<$), (<*>), (<*), (*>)) -- Parsing is a module I have provided for you which wraps up some -- functionality of parsec into a somewhat easier/simpler interface. import Parsing -- Our old friend Arith data Arith where Lit :: Integer -> Arith Add :: Arith -> Arith -> Arith Sub :: Arith -> Arith -> Arith Mul :: Arith -> Arith -> Arith deriving (Show) interpArith :: Arith -> Integer interpArith (Lit i) = i interpArith (Add e1 e2) = interpArith e1 + interpArith e2 interpArith (Sub e1 e2) = interpArith e1 - interpArith e2 interpArith (Mul e1 e2) = interpArith e1 * interpArith e2 lexer :: TokenParser u lexer = makeTokenParser emptyDef parens :: Parser a -> Parser a parens = getParens lexer reservedOp :: String -> Parser () reservedOp = getReservedOp lexer integer :: Parser Integer integer = getInteger lexer whiteSpace :: Parser () whiteSpace = getWhiteSpace lexer parseArithAtom :: Parser Arith parseArithAtom = (Lit <$> integer) <|> parens parseArith parseArith :: Parser Arith parseArith = buildExpressionParser table parseArithAtom where table = [ [ Infix (Mul <$ reservedOp "*") AssocLeft ] , [ Infix (Add <$ reservedOp "+") AssocLeft , Infix (Sub <$ reservedOp "-") AssocLeft ] ] arith :: Parser Arith arith = whiteSpace *> parseArith <* eof eval :: String -> Maybe Integer eval s = case parse arith s of Left _ -> Nothing Right e -> Just (interpArith e)
The first thing to consider is lexer, which is a TokenParser. For now, we are using a simple default TokenParser; later we will see how to customize it. Essentially, lexer is an automatically generated collection of special parsers which do the low-level work of tokenizing. The functions getInteger, getParens, etc. extract these individual parsers from lexer. We have extracted four such token parsers and given them names to make them easier to use: one to parse integers, one for whitespace, one for operators, and one to parse parenthesized things. (See the definition of GenTokenParser for a full list of the available token parsers.)
lexer
TokenParser
getInteger
getParens
GenTokenParser
Parsers have a type like Parser a; for example, integer has type Parser Integer. This means it is a parser which consumes some part of a String and either returns a value of type Integer or fails with parse error. You can think of it like this:
Parser a
integer
Parser Integer
String
Integer
Parser a = String -> Maybe (a, String)
although the actual definition is quite a bit more complicated. Note also that a Parser a is NOT actually a function.
You can use the parseSome function (provided in the Parsing module) to try a parser and see what part of the input it consumes and what part is left to be consumed by subsequent parsers. In other words, parseSome turns a Parser into a function like the one described above.
parseSome
Parsing
Parser
Try the integer parser using parseSome. For example, you could evaluate parseSome integer "23" in GHCi. Explain how integer behaves. Be sure to try lots of different inputs: try spaces before and/or after an integer, try a negative sign, try non-digit characters before and after, etc. (You do not have to record the results of all your experiments.) Explain in detail what the integer parser does.
parseSome integer "23"
Try each of the following and explain what they do. Just as with integer, be sure to try many different inputs.
reservedOp
whiteSpace
parens
In general, how do token parsers like integer, reservedOp, and parens handle spaces?
The token parsers provide the primitive building blocks out of which we can construct more complicated parsers. Now we will explore some of the functions for building up more complex parsers. Such functions that allow building more complex things out of simpler parts are known as combinators.
Try integer <|> parens integer. (As before, “try” means "use the parseSome function to try it on various example inputs.)
integer <|> parens integer
What does integer <|> parens integer do?
What is the type of (<|>)?
(<|>)
What does (<|>) do in general?
Now consider Lit <$> integer.
Lit <$> integer
What does Lit <$> integer do?
What is the type of (<$>)? Explain what this type means in your own words.
(<$>)
What is the type of Lit?
Lit
What is the type of Lit <$> integer? What does it do?
Try integer *> integer.
integer *> integer
What does integer *> integer do?
What is the type of (*>)? Explain what this type means in your own words.
(*>)
What does (*>) do?
What do you think (<*) does? Guess before trying it.
(<*)
Now try it. Were you right?
Try "I" <$ integer.
"I" <$ integer
What does "I" <$ integer do?
What is the type of (<$)?
(<$)
What does (<$) do?
How is (<$) similar to (<*)? How is it different?
ROTATE ROLES and write the name of the new driver here:
Write a parser which parses two integers, discards the first one, and returns the second wrapped in a Lit constructor. Be sure to test it!
Write a parser which parses an integer surrounded by ~ on either side (for example, "~23~") and returns the integer.
~
"~23~"
Skim through the parsec documentation for the following modules:
Text.Parsec.Char
Text.Parsec.Expr
Text.Parsec.Combinator
Note, wherever you see something like ParsecT s u m a you should think of it as Parser a (ParsecT s u m a is a more general version of Parser a).
ParsecT s u m a
Pick one of the combinators you find and demonstrate its use. You will have to add an import statement to the top of this file of the form: import Text.Parsec.SomeModule (combinatorYouChose)
import
import Text.Parsec.SomeModule (combinatorYouChose)
Now we will explore how the token parsers and combinators are used to build a parser for the Arith language.
Note that we have defined two parsers of type Parser Arith, namely, parseArith and parseArithAtom (not counting arith, which we will discuss later). This is a common pattern when building parsers for languages with infix operators. The idea is that an “atomic” thing is something which forms an indivisible unit, which we know how to parse just by looking at the first token. A non-atomic thing might be more complicated; in particular, it might involve infix operators.
Parser Arith
parseArith
parseArithAtom
arith
Now look at the definition of parseArith. It uses a function provided by parsec called buildExpressionParser, which deals with parsing infix operators with various precedence levels and associativities (using similar algorithms to those explored in module 6).
buildExpressionParser
(Mul <$ reservedOp "*")
Notice that table consists of a list with two elements, each of which is itself a list. The first list has one element which refers to Mul; the second list has two elements referencing Add and Sub.
table
Mul
Add
Sub
What do you think this list signifies?
Test your theory by switching the order of the two lists, and doing some test parses. Were you right? (Then switch them back.)
Finally, take a look at the arith parser.
What does the eof parser do? (Try it on some examples, and/or read its documentation.)
eof
What does arith do?
We have now explored the entire Arith parser. Let’s modify it a bit.
Extend Arith with an exponentiation operator. Remember that exponentiation has higher precedence than multiplication and is right-associative. (The exponentiation operator in Haskell is (^).) You will have to modify the Arith data type, the interpreter, and the parser. Be sure to test that it works correctly.
(^)
Arith
Now extend Arith with prefix negation. For example, "-(2+3)" should evaluate to -5. Again, you should modify the Arith definition, parser, and interpreter. When modifying the parser, you can use Prefix instead of Infix. See the documentation for Text.Parsec.Expr.
"-(2+3)"
-5
Prefix
Infix
(<*>)
There is one more parser operator we need to learn about, namely, (<*>). It wasn’t needed in the Arith parser but will often be useful.
Unfortunately, the type of (<*>) does not give you a good sense of how to use it unless you have already spent a good deal of time thinking about higher-order functions, currying, and the like in functional programming. Instead, we’ll look at some examples.
add :: Integer -> Integer -> Integer add x y = x + y
(add <$> integer <*> integer)
(In fact, we could also have written ((+) <$> integer <*> integer), which would do exactly the same thing.)
((+) <$> integer <*> integer)
Write a function add3 which takes three Integers and adds them.
add3
Now test the parser (add3 <$> integer <*> integer <*> integer). What does it do?
(add3 <$> integer <*> integer <*> integer)
In general, what do you think (func <$> parser1 <*> parser2 <*> ... <*> parserN) does?
(func <$> parser1 <*> parser2 <*> ... <*> parserN)
Write a parser adder :: Parser Arith which expects to read two integer values, and returns an Arith value consisting of an Add node containing the two integers. For example:
adder :: Parser Arith
parseSome adder "32 9 xyz" Right (Add (Lit 32) (Lit 9), "xyz")
Recall the BIN language from module 4. As a reminder, the concrete syntax of BIN looks like this:
<bin> ::= '#' | '(' <bin> <bin> ')'
Write a parser for BIN using parsec. A few hints:
symbol = getSymbol lexer
#
##
How long would you estimate that you spent working on this module?
Were any parts particularly confusing or difficult?
Were any parts particularly fun or interesting?
Record here any other questions, comments, or suggestions for improvement.