1. Home
  2. About

JSON parsing with Python coroutines

Published 29 Dec 2011

A while ago, I read a blog post by Python developer Eli Bendersky on Co-routines as an alternative to state machines, in which he presents evidence to support a very interesting observation quoted below:

Co-routines are to state machines what recursion is to stacks.

which is to say, state machines can be expressed as concisely and elegantly through the use of coroutines as recursive algorithms through recursive functions (as opposed to using an explicit stack).

Now, coroutines you say?

To quote from the Wikipedia entry:

Coroutines are computer program components that generalize subroutines to allow multiple entry points for suspending and resuming execution at certain locations.

Which, in simpler terms, means a coroutine is like a function from which you may yield and resume control at arbitrary instructions — there is no longer a single entry point to the procedure expressed as a coroutine.

The emphasized statement above is probably the main argument in favor of using coroutines to implement state machines, as it makes restoring the coroutine’s execution context (that is, the state in a state machine) when new input arrives much simpler, without requiring boilerplate control structures such as if in state 1: … elif in state 2

While coroutines don’t seem to be a recent addition to Python (support for them has apparently been under discussion since 2005, when PEP 342, which describes the implementation, was written), looks like they haven’t achieved widespread usage, which is my way of kindly stating I had never heard of them until very recently.

I won’t really go into details with regard to the syntax for coroutines, as their use is quite well described in Eli Bendersky’s post and references, and their abuse is cleverly documented by David Beazley’s presentation Generator Tricks for Systems Programmers.

So over the last couple of days I’ve been experimenting with the ideas above in a pet project: a simple JSON parser using state machines implemented as coroutines. I must remark that this parser has little to no actual utility, as it may still be buggy and is much, much less robust and efficient than the json module in Python’s standard library. Even so, I hope it might be useful for illustration and educational purposes.

Basically I converted the railroad diagrams in the JSON website into state machines that operate on one character at a time from the input, and made a familiar loads() function on top of those to simplify parsing from a string.

Railroad diagram for JSON strings
— Railroad diagram for JSON strings

As an appetite whetter, let’s have a look at the string parser corresponding to the diagram above:

 1 @coroutine
 2 def string_fsm():
 3     '''
 4     A coroutine implementing a finite state machine for parsing JSON strings.
 5     Accepts the string one character at a time, and yields NOT_PARSED_YET until
 6     the string has been successfully parsed.
 7     Once the JSON string has been parsed, yields the corresponding Python
 8     string. The coroutine can't be used after that.
 9     May raise JSONParseError on malformed strings.
10     Expects data with no leading or trailing whitespace (i.e., the first and
11     last input characters should be ").
12     '''
14     value = []
16     c = (yield NOT_PARSED_YET)
17     if c != '"':
18         raise JSONParseError("JSON strings must start with a quote")
20     while True:
21         c = (yield NOT_PARSED_YET)
22         if c == '"':
23             # end of string
24             break
25         elif c == '\\':
26             c = (yield NOT_PARSED_YET)
27             if c == 'u':
28                 # unicode 4-digit hexadecimal
29                 hexval = ""
30                 for i in range(4):
31                     hexval += (yield NOT_PARSED_YET)
33                 value.append(unichr(int(hexval, 16)))
34             elif c == 'b': value.append('\b')
35             elif c == 'f': value.append('\f')
36             elif c == 'n': value.append('\n')
37             elif c == 'r': value.append('\r')
38             elif c == 't': value.append('\t')
39             elif c in ('"', '\\', '/'): value.append(c)
40             else: raise JSONParseError("Invalid escape character")
41         else:
42             value.append(c)
44     yield ''.join(value)

As you can see, there is a straightforward correspondence between the diagram and the code, with the yield expressions bringing in the characters that govern state transitions.

Writing this parser was a very pleasant experience that helped get my head around a convoluted construct. While I must concede that the parsing state machines do look pleasingly concise, I also felt that the use of coroutines also contributed to the complexity of the code in some aspects. For one, debugging resumable functions with multiple entry points certainly takes some time to get used to.

In case anyone wants to try it, the code is in my Github page, under the name jsonfsm. The repository contains the full code for the module and a few tests, some of which were borrowed from xmms2 (thanks!). Check it out!