Monday, November 23, 2009

Using PLY for Parsing Without Using it for Lexing

Over the past week or so I've been struggling to write my own parser (or parser generator) by hand. A few days ago I finally decided to give up on this notion (after all, the parser isn't my end goal), as it was draining my time from the interesting work to be done. However, I wanted to keep my existing lexer. I wrote the lexer by hand in the manner I described in a previous post; it's fast, easy to read, and I rather like my handiwork, so I wanted to keep it if possible. I've used PLY before (as I described last year), so I set out to see whether it would be possible to use it for parsing without using it for lexing.

As it turns out, PLY expects only a very minimal interface from its lexer. In fact, it needs just one method, token(), which returns a new token (or None at the end of the input). Tokens are expected to have just four attributes: type, value, lineno, and lexpos. With this knowledge in hand, I set out to write a pair of compatibility classes for my existing lexer and token classes. I wanted to do this without altering the lexer/token API, so that if and when I finally write my own parser I don't have to remove legacy compatibility stuff. My compatibility classes are very small, just this:

class PLYCompatLexer(object):
    def __init__(self, text):
        self.text = text
        self.token_stream = Lexer(text).parse()

    def token(self):
        # PLY calls token() repeatedly; returning None signals end of input.
        try:
            return PLYCompatToken(self.token_stream.next())
        except StopIteration:
            return None


class PLYCompatToken(object):
    def __init__(self, token):
        # Map my token's attributes onto the four attributes PLY expects.
        self.type = token.name
        self.value = token.value
        self.lineno = None
        self.lexpos = None

    def __repr__(self):
        return "<Token: %r %r>" % (self.type, self.value)


This is the entirety of the API that PLY needs. Now I can write my parser exactly as I normally would with PLY.
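To make the hookup concrete, here is a rough sketch of wiring the compatibility lexer into ply.yacc. The grammar, the token names (INTEGER and PLUS), and the exact output of my Lexer are stand-ins for illustration; the point is simply that yacc's parse() will accept any object with a token() method via its lexer argument:

import ply.yacc as yacc

# Illustrative only: these token names and rules are assumptions; they just
# need to match whatever names the Lexer actually produces.
tokens = ("INTEGER", "PLUS")

def p_expr_plus(p):
    "expr : expr PLUS INTEGER"
    p[0] = p[1] + int(p[3])

def p_expr_integer(p):
    "expr : INTEGER"
    p[0] = int(p[1])

def p_error(p):
    raise SyntaxError("unexpected token: %r" % (p,))

parser = yacc.yacc()

# Pass the compatibility lexer explicitly. Since it has no input() method,
# no input string is given to parse(); PLY just calls token() until it
# returns None.
result = parser.parse(lexer=PLYCompatLexer("1 + 2"))

Passing lexer= here is what keeps PLY from building, or expecting, its own lex-based lexer.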

2 comments:

  1. Being able to specify alternate lexers has always been an important part of PLY since the beginning. Prior to creating PLY, I was doing a lot of work on SWIG--which uses a yacc parser and a hand-written lexer. Also, writing a custom lexer would be about the only way to use PLY with streaming I/O sources such as pipes or sockets. I don't know how much people have explored that angle of PLY, but LALR(1) parsers can actually work pretty well on huge data sources or streams because they never do any kind of backtracking (e.g., once a token has been passed to yacc, it can simply be discarded).

    My only wish for PLY concerns the lex/yacc interface: I've sometimes thought about changing it to rely on generators (i.e., having lex define a generator). The only reason it doesn't use generators now is that PLY predates generators by a few years and I haven't gotten around to changing it.

  2. @David Beazley--yes, generators, or more generally, iterators. Sounds like a good idea. The lexer interface described (method token() which returns a token or None) sounds very similar to an iterator (method next() which returns a value or raises StopIteration).

    It should be fairly straightforward to write an adapter to convert an iterator's API to the API that the lexer currently expects. Then we could write our lexers as iterators/generators, and use the adapter to plug them into PLY. (A sketch of such an adapter appears below.)

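A bare-bones version of the adapter described in the comment above might look something like the following; the class name is made up for illustration, and it assumes the iterator yields PLY-style tokens (objects with type, value, lineno, and lexpos attributes):

class IteratorLexer(object):
    # Illustrative sketch: wraps any iterator or generator of token
    # objects in the token()-returning interface ply.yacc expects.
    def __init__(self, iterable):
        self._iter = iter(iterable)

    def token(self):
        try:
            return next(self._iter)
        except StopIteration:
            return None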
