Building Parsers

The relision::parser module provides very, very, very simple parser primitives for building recursive descent parsers.

The module provides three key structs and one specialized type.

  • ParseError encapsulates a parse error
  • Loc represents a location in the parse (line and column)
  • Parser provides parsing services around some stream
  • Result specializes the more general result type to parsers

Parser Primitives

There are two basic parser primitives.

  • You can peek at upcoming characters
  • You can consume characters

A simple application of this is recognizing an integer. Here's some pseudocode.

-- The best pseudocode is written in Ada for some reason.
while is_digit(peek) loop
    consume;
end loop;

Composites of these primitives are available to simplify things. For example, the implementation of the above pseudocode might look like the code below, which is complicated by the need to have peek return one of three things: Err(err) if there is an error, Ok(None) if we hit the end of file, and Ok(Some(ch)) when there is a next character.

fn parse_unsigned_integer<R: io::Read>(parser: &mut parser::Parser<R>) -> parser::Result<u64> {
    let mut result = String::new();
    while let Some(ch) = parser.peek()? {
        if ch.is_digit(10) {
            result.push(ch);
            parser.consume();
        } else {
            break;
        }
    }
    match result.parse::<u64>() {
        Ok(number) => Ok(number),
        Err(err) => Err(parser.error(err.to_string())),
    }
}

This is terrible. Fortunately Parser provides a take_while method that handles all this for us.

Here the integer parsing is broken out into a separate function so we can use the ? operator.

use std::io;
use relision::parser;
 
fn parse_unsigned_integer<R: io::Read>(parser: &mut parser::Parser<R>) -> parser::Result<u64> {
    let result = parser.take_while(|ch| ch.is_digit(10))?;
    match result.parse::<u64>() {
        Ok(number) => Ok(number),
        Err(err) => Err(parser.error(err.to_string())),
    }
}
 
fn main() {
    // Make a new parser and wrap the standard input.
    let mut parser = parser::Parser::new("console".to_string(), std::io::stdin());
    match parse_unsigned_integer(&mut parser) {
        Err(err) => {
            println!("{}", err);
        }
        Ok(value) => {
            println!("{}", value);
        }
    }
}

Here the take_while peeks at and consumes the next character while it is a digit. It works pretty well.

If nothing is entered we get console:1:1: cannot parse integer from empty string. If we enter a huge number like 99999999999999999999 (that's 20 nines) we get console:1:1: number too large to fit in target type. If we enter something like 65fred then we get 65, with the parser left pointing at the f.

Of course, this probably isn't what we want. Suppose we have a stream of identifiers and numbers, and we want to parse these.

We already know how to parse unsigned integers. Let's see how to parse identifiers.

fn parse_identifier<R: io::Read>(parser: &mut parser::Parser<R>) -> parser::Result<String> {
    let mut result = parser.take_while(|ch| ch.is_alphabetic())?;
    result = result + parser.take_while(|ch| ch.is_alphanumeric())?.as_str();
    Ok(result)
}

Okay, now let's create a type for our tokens.

#[derive(Debug)]
enum Thing {
    Number(u64),
    Id(String),
}

Now we can parse the sequence, and create a vector of Thing instances. Let's do that. We also add some error handling.

fn parse_sequence<R: io::Read>(parser: &mut parser::Parser<R>) -> parser::Result<Vec<Thing>> {
    let mut things = Vec::new();
    let _ = parser.consume_whitespace();
    while !parser.at_eof() {
        match parser.peek() {
            Ok(Some(ch)) => {
                if ch.is_digit(10) {
                    things.push(Thing::Number(parse_unsigned_integer(parser)?));
                } else if ch.is_alphabetic() {
                    things.push(Thing::Id(parse_identifier(parser)?));
                } else {
                    return Err(parser.unexpected_char("number or identifier", ch));
                }
            }
            Ok(None) => break,
            Err(err) => return Err(err),
        }
        let _ = parser.consume_whitespace();
    }
    Ok(things)
}

Now we just need a main function to pull this all together.

fn main() {
    let mut parser = parser::Parser::new("console".to_string(), std::io::stdin());
    match parse_sequence(&mut parser) {
        Err(err) => {
            println!("{}", err);
        }
        Ok(seq) => {
            println!("{:?}", seq);
        }
    }
}

That's pretty much it. If it seems like there's a lot of stuff to build in, that's because there is. Fixing that is the purpose of macros.