Feval: Parsing a functional language with Parsec

How to use Parsec to parse a functional programming language

In the following we will discuss how to parse the grammar

x ::= ( a | b | ... | z )
      ( a | b | ... | z
      | A | B | ... | Z
      | 0 | 1 | ... | 9
      | _ | ' )*

v ::=   x
      | True | False
      | 0 | 1 | -1 | 2 | -2 | ...
      | Function x -> e
      | [] | [v, v, ..., v]

e ::=   v
      | (e)
      | e && e | e "||" e | !e
      | e + e | e - e | e * e | e / e | e % e
      | e = e | e < e | e <= e | e > e | e >= e
      | e e     -- Function application
      | If e Then e Else e
      | Let x x ... x = e In e
      | e; e
      | e : e | Case e Of [] -> e "|" (x : x) -> e

used for the programming language Feval by utilizing the Parsec library for Haskell. We will find that the daunting task of parsing is made relatively simple with parser combinators.

Basic Background

Parsing is primarily concerned with deriving meaning from strings, i.e. sequences of characters. In the domain of programming languages, this means that we want to read in a string (usually read from a file) that contains a program and convert it to something the interpreter can evaluate. For instance if we have the code 1 - 2 + 3 we would want to be able to convert it to its abstract syntax tree or AST which looks like

The AST tells the program how to execute the code, which it does in a post-order traversal, i.e. it evaluates the right and left children before the parent. The way that we represent the AST in Haskell is by defining a grammar for our language, and then an actual type. For the expression above we could define the grammar for our language to be

v ::= 1 | -1 | 2 | -2 | 3 | -3 | ...

e ::= e + e | e - e | v

In this case our values denoted by the variable v can be any integer, whereas our expressions, denoted by the variable e, can only be addition of any two expressions, subtraction of any two expressions, or any value. We can represent this grammar in type form as follows

data Expr =
       Add Expr Expr
     | Sub Expr Expr
     | Num Integer
     deriving Show

Usually when you have to do something with parsing you have a whole lot of baggage that comes along with it. Things like whitespace, the state of the parser (what line you are on, etc.), or even what to do when there is a syntax error on the part of the programmer, can be a lot to handle and annoying to have to deal with. Fortunately Parsec makes all this easy.

Parse on, Captain

First off, to deal with whitespace all we need to do is create a lexer, which takes our program and returns tokens, things like "45", "-12", "+", or "-" while removing whitespace. First we define the configuration for our lexer

import Text.Parsec hiding (Empty)
import Text.Parsec.String
import qualified Text.Parsec.Token as Token
import Text.Parsec.Language

lexerConfig = emptyDef { Token.reservedOpNames = words "+ -" }

This lexer config says that the symbols "+" and "-" are reserved operators. Then we create our lexer and extract some useful functions from it

lexer = Token.makeTokenParser lexerConfig

reservedOp = Token.reservedOp lexer
integer    = Token.integer lexer

Now we can get down to the gritty details. The Parsec parser is a monad that abstracts everything but the results of successful parsing away from us so that we don't have to deal with it. Because of the monad structure, we are easily able to combine smaller parsers together to form a larger parser. In this way we will find that our parser resembles its original grammar.

Lets first start our with our parser for a value. The only type of value we can have is an integer so therefore, our parser can be written

value :: Parser Expr
value = do
    n <- integer
    return $ Num n

The integer parser returns an integer monadically, which we extract to form a value Expr, namely an instance of Num Integer. In the case that the integer fails, i.e. cannot parse an input, the value parser will fail without consuming any input.

(It is important to note that this would not be the case if the value parser were made to consume something else before the integer, for instance Integer before an expression like Integer -78. In this case after consuming a string containing "Integer", if the integer parser were to fail, the "Integer" would still be consumed by the value parser even if it fails. If you need to fail without consuming input in this type of scenario you should wrap your parser inside of Parsec's try function.)

Now we need to parse expressions, specifically our expressions involving the operators + and -. In most cases this type of parsing is hard and annoying, if not downright impossible, but fortunately Parsec comes to the rescue once again! First we define a function which will construct a configuration that tells Parsec's expression library how to parse our expression

import Text.Parsec.Expr

binary name label assoc = Infix (do{ reservedOp name
                                   ; return (\x y -> label x y)
                                   }) assoc

This tells the expression parser that we are creating an infix operator with an associativity assoc and a parser for the operator, which must return a function to construct the expression parsers return type. In this case the parser parses one of our reserved operators name and returns the label type with the left and right parts of the expression applied to x and y. We can then define an operator table

opTable = [ [ binary "+" Add AssocLeft
            , binary "-" Sub AssocLeft ] ]

which says that both addition and subtraction are binary operations that associate left with names + and -, and types Add and Sub, respectively. Then our expression parser is

expr :: Parser Expr
expr = buildExpressionParser opTable value

which says that the parser for the right and left of an expression is either a value or another expression from the opTable.

Great, but how do I use it? You can easily parse both strings and files via

import Control.Applicative ((<*))

parseString :: Parser Expr -> String -> Either ParseError Expr
parseString e s = parse (e <* eof) "" s

parseFile :: Parser Expr -> FilePath -> IO (Either ParseError Expr)
parseFile e f = parseFromFile (e <* eof) f

Here, the eof is a parser which parses the end of a string or file, and the (<*) operator, which we can use since the Parsers are monads, which evaluates both parsers but returns only the result from the one on the left. ParseError simply contains information about what happened in the case of a parsing failure. Conveniently, it is an instance of Show.

Data Aquisition

In feval we can convert our grammar to a data type like so

data Expr
    = CInt Integer
    | CBool Bool
    | CVar String
    | Add Expr Expr
    | Sub Expr Expr
    | Mul Expr Expr
    | Div Expr Expr
    | Mod Expr Expr
    | And Expr Expr
    | Or Expr Expr
    | Not Expr
    | Equal Expr Expr
    | Less Expr Expr
    | LessEq Expr Expr
    | Great Expr Expr
    | GreatEq Expr Expr
    | Empty -- []
    | Cons Expr Expr
    | If Expr Expr Expr
    | Function String Expr
    | Appl Expr Expr
    | Let String [String] Expr Expr -- Let String String String ... = Expr In Expr
    | Semi Expr Expr -- Expr; Expr
    | Case Expr Expr String String Expr -- Case Expr Of [] -> Expr | (String, String) -> Expr
    deriving Show

To correspond with our grammar, we define the following lexer

names = words "True False Function If Then Else Let In Case Of" -- reserved names
opNames = words "-> && || ! + - * / % = ; < <= > >= :" -- reserved operations

lexerConfig = emptyDef { Token.commentStart = "/*" -- adding comments is easy
                       , Token.commentEnd = "*/"
                       , Token.commentLine = "#"
                       , Token.identStart = letter -- identifiers must start with a letter
                       , Token.identLetter = alphaNum <|> char '_' <|> char '\''
                       , Token.reservedNames = names
                       , Token.reservedOpNames = opNames
                       }

lexer = Token.makeTokenParser lexerConfig

The letter parser is a parser that parses any letter. The alphaNum parses any letter or number character. The char parser parses only it's character argument. The (<|>) operation runs the left-most parser and then the next until one succeeds; if all fail so does the overall parser. We export the following functions from the lexer

identifier = Token.identifier lexer -- parses a valid identifier in our language
symbol     = Token.symbol lexer     -- parses a symbol like "]"
reserved   = Token.reserved lexer   -- parses a reserved word like "If"
reservedOp = Token.reservedOp lexer -- parses a reserved operation like "<="
parens     = Token.parens lexer     -- parses parenthesis surrounding the parser passed to it
brackets   = Token.brackets lexer   -- parses brackets surrounding the parser passed to it
commaSep   = Token.commaSep lexer   -- parses some or no comma separated instances of
                                    -- the argument parser
integer    = Token.integer lexer    -- parses an integer
whiteSpace = Token.whiteSpace lexer -- parses whitespace

Expressions and an `Appl`

We next move on to our easiest parser, the operation expression parser (who would have thought...). All we need to do is to define a prefix function to complement our binary function described earlier

import Control.Applicative ((*>)) -- the opposite of (<*)

prefix name label = Prefix (reservedOp name *> return (\x -> label x))

And we can define a ton of expressions quite easily

opTable = [ [ prefix "!" Not ]
          , [ appl ]
          , [ binary "*" Mul AssocLeft
            , binary "/" Div AssocLeft
            , binary "%" Mod AssocLeft ]
          , [ binary "+" Add AssocLeft
            , binary "-" Sub AssocLeft
            ]
          , [ binary "=" Equal AssocLeft
            , binary "<" Less AssocLeft
            , binary "<=" LessEq AssocLeft
            , binary ">" Great AssocLeft
            , binary ">=" GreatEq AssocLeft
            ]
          , [ binary "&&" And AssocLeft ]
          , [ binary "||" Or AssocLeft ]
          , [ binary ":" Cons AssocRight ]
          , [ binary ";" Semi AssocLeft ]
          ]

Note that the operation lists at the top have the highest precedence (operations in the same list have the same precedence). But what is this appl in our tree (haha)? This is our operation for function application, which uses spaces between arguments instead of an operator, for example an expression like (Function x -> x + 1) 3. We need to be careful with how we define application, because there is usually a space between the operators and arguments in other expressions. Therefore, we define appl as

appl = Infix space AssocLeft
    where space = whiteSpace
            *> notFollowedBy (choice . map reservedOp $ opNames)
            *> return (\x y -> Appl x y)

The appl parser first consumes any whitespace, then it makes sure that it is not followed by an operation. We do this by taking the array of operator names and mapping the reserved operation parser function reservedOp over it. We then wrap this into a choice parser which parses at least one of the parsers in the list passed to it as an argument. Then we apply this to the parser notFollowedBy, which does not fail only if the next string (looking ahead) cannot be parsed by its argument.

Finally, we can define our operator expression parser as

opExpr :: Parser Expr
opExpr = buildExpressionParser opTable term

Terms of Service

First we define parsers for our basic constant types

import Control.Applicative ( (<$>)  -- This takes an argument on its right,
                                    -- in this case the value inside the
                                    -- monadic parser, and applies it to the function to
                                    -- the left (if the parser does not fail)
                           , (<$) ) -- This takes an monadic parser on its right
                                    -- and returns the value on the left (if the
                                    -- parser does not fail)

cint :: Parser Expr
cint = CInt <$> integer

cbool :: Parser Expr
cbool = CBool True <$ reserved "True"
    <|> CBool False <$ reserved "False"

cvar :: Parser Expr
cvar = CVar <$> identifier

Next we define our list parser

list :: Parser Expr
list = toCons <$> brackets (commaSep expr)
    where toCons [] = Empty
          toCons (x:xs) = Cons x (toCons xs)

This parser reads in some or no expr parsers separated by commas and encapsulated with brackets, and applies the resulting list to the toCons function before returning it monadically.

At last we can define the term parser that we used in our opExpr parser

term :: Parser Expr
term =  cint
    <|> cbool
    <|> cvar
    <|> list
    <|> parens expr -- parentheses surrounded expression

`Let`'s, `Case`'s, `Function`'s, and `If`'s

Now that we have our basic building blocks defining more complicated expressions is just a matter of combining them in the right way. We define our Let to be

letExpr :: Parser Expr
letExpr = reserved "Let" *> do -- parse the reserved word Let; return the do block
    s <- sepBy1 identifier whiteSpace
    reservedOp "="
    e <- expr
    reserved "In"
    e' <- expr
    case s of (x:xs) -> return $ Let x xs e e' -- we must have at least one

The sepBy1 parser parses one or more instances of the parser passed as its first argument, separated by one instance each of the parsers passed as its argument. Note that we need to have at least one identifier to be the name of the variable being defined. The rest of the do block just parses the rest of the expression before returning the result.

Our Case parser is

caseExpr :: Parser Expr
caseExpr = reserved "Case" *> do
    p <- expr
    reserved "Of" *> symbol "[]" *> reservedOp "->" -- parse an "Of", a "[]", then a "->"
    x <- expr
    reservedOp "|"
    (s, t) <- parens $ do{ s' <- identifier -- return the do block within parentheses
                         ; reservedOp ":"
                         ; t' <- identifier
                         ; return (s', t')
                         }
    reservedOp "->"
    y <- expr
    return $ Case p x s t y

Define our Function parser to be

import Control.Applicative ((<*>)) -- Adds an extra argument to (<$>)

function :: Parser Expr
function = reserved "Function" *> ((\x y -> Function x y)
    <$> identifier <*> (reservedOp "->" *> expr)

which parses the reserved word Function, then an identifier, a reserved operation ->, and an expression, and returns the Expr monadically.

Our If parser is then simply

ifExpr :: Parser Expr
ifExpr = reserved "If" *> ((\x y z -> If x y z)
    <$> expr <*> (reserved "Then" *> expr) <*> (reserved "Else" *> expr))

Finally, we can define our Expr parser

expr :: Parser Expr
expr =  function
    <|> letExpr
    <|> ifExpr
    <|> caseExpr
    <|> opExpr
    <|> term

Wait... are we? Yes, that is right. This is the entire parser. Just fire up our parseString and parseFile functions and test it out.

Afterword

Hopefully the preceding discussion has helped you learn how you can use Parsec. The parser we created is nearly verbatim what I wrote for Feval, except instead of a simple type Expr, I use the type Expr a since I want the type to be a functor, which I make recursive with the type Fix

data Expr a = Num Integer | ...

newtype Fix f = Fx (f (Fix f))

type NewExpr = Fix Expr

(More information on this in this article.)

Happy parsing!