Sunday, March 28, 2010

SAX Parsing with Python

The Simple API for XML (SAX) is a callback-based API for parsing XML documents. A SAX parser walks an XML document and calls into a known API to report XML constructs (elements, text) in the source document as it encounters them. This will (hopefully) become clearer when we get to the examples later in this post.

SAX is a de facto standard, rather than a formal standard, based on an original Java implementation. http://www.saxproject.org is the official website for SAX and includes some of the history of SAX's evolution and information on writing SAX-based programs using the Java API. The book SAX2 also provides a good reference for parsing with SAX. Note that the book was published in 2002 but is still relevant today as SAX version 2 is still the current version of the API.

Python provides its SAX support in the xml.sax module. The official documentation is in subsections 19.9, 19.10, 19.11, and 19.12 of Chapter 19: Structured Markup Processing Tools. As mentioned in my previous post, this documentation can also be accessed directly from within the Python interactive interpreter as follows:
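
A minimal sketch of that lookup. In an interactive session, help(xml.sax) pages through the module documentation. The snippet below instead uses pydoc.render_doc (my choice here, so the text can be captured in a script) to fetch essentially the same text without a pager.

```python
import pydoc
import xml.sax

# Interactive equivalent: help(xml.sax)
# render_doc returns the documentation as a string instead of paging it;
# pydoc.plaintext avoids the overstrike "bold" used for terminal output.
doc_text = pydoc.render_doc(xml.sax, renderer=pydoc.plaintext)

# Show just the first few lines of the documentation.
for line in doc_text.splitlines()[:5]:
    print(line)
```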

The entry point to parsing an XML document using SAX is the xml.sax.parse() function. This function takes two required arguments (a source and a content handler) and one optional argument (an error handler). The source is a filename or a file-like object that provides access to the source XML document. The parser calls methods on the supplied content handler as it encounters constructs in the document. The optional third argument lets you supply a custom error handler.
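
As a rough sketch of that call, the snippet below passes an in-memory file-like object, a content handler and the default xml.sax.ErrorHandler to xml.sax.parse(). The TagRecorder class and the tiny documents are made up for illustration. With the default error handler, a malformed document surfaces as a SAXParseException.

```python
import io
import xml.sax


class TagRecorder(xml.sax.ContentHandler):
    """Hypothetical handler that remembers the element names it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


# source, content handler, and (optionally) an error handler
recorder = TagRecorder()
xml.sax.parse(io.BytesIO(b"<a><b/></a>"), recorder, xml.sax.ErrorHandler())
print(recorder.tags)

# A document that is not well formed: <b> is never closed.
try:
    xml.sax.parse(io.BytesIO(b"<a><b></a>"), TagRecorder())
    parse_failed = False
except xml.sax.SAXParseException as exc:
    parse_failed = True
    print("parse failed:", exc.getMessage())
```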


The key part of the handling is the content handler. This is where your application-specific code is informed of content sourced from the input XML document. Your content handler will be an object that provides the same interface as the class xml.sax.ContentHandler. The methods defined in this class are the callbacks the SAX parser expects to invoke. The complete interface is:
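
One way to see that interface without copying it by hand is to ask Python for it. This is a small sketch using the standard inspect module (my choice for illustration) that lists the public callbacks defined on xml.sax.ContentHandler.

```python
import inspect
import xml.sax

# Collect the public methods defined on ContentHandler; names starting
# with an underscore (such as __init__) are skipped.
callbacks = [
    name
    for name, _ in inspect.getmembers(xml.sax.ContentHandler, inspect.isfunction)
    if not name.startswith("_")
]

for name in callbacks:
    print(name)
```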

The simplest way to provide this interface is to have your custom class extend xml.sax.ContentHandler and override just the methods that you are interested in receiving. The methods you don't implement will make use of the empty implementations provided by xml.sax.ContentHandler.
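
A minimal sketch of that approach, assuming all you care about is how many elements a document contains. ElementCounter is a hypothetical name; it overrides only startElement and leaves every other callback to the empty implementations it inherits.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Counts elements; all other callbacks keep the inherited no-ops."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


counter = ElementCounter()
# parseString is the in-memory counterpart of parse()
xml.sax.parseString(b"<root><child/><child>text</child></root>", counter)
print(counter.count)  # prints 3: root plus two child elements
```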

Arguably the most relevant methods to override are startElement, endElement and characters. These three methods will receive just about all of the content from an XML document. The startElement method is called when the SAX parser encounters the opening tag of an element. The name of the element and all of its attributes are supplied. The endElement method is called when the closing tag for the element is encountered. The characters method receives the text content in between, though there is no requirement that all the text be provided in one call to characters. There may be multiple calls.
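
Because the text may arrive in pieces, a common pattern is to collect the pieces and join them when the element ends. The sketch below is one way to do that; TextCollector and the sample document are illustrative only, and how many pieces the parser actually delivers is up to the parser.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that joins the text chunks of each element."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.chunk_count = 0
        self.texts = {}

    def startElement(self, name, attrs):
        # Start a fresh buffer for the element that just opened.
        self._chunks = []

    def characters(self, content):
        # May be called several times for one run of text.
        self._chunks.append(content)
        self.chunk_count += 1

    def endElement(self, name):
        self.texts[name] = "".join(self._chunks)
        self._chunks = []


collector = TextCollector()
xml.sax.parseString(b"<name>Fred &amp; Co</name>", collector)
print(collector.chunk_count, "call(s) to characters")
print(collector.texts["name"])
```

However the text is split up, the joined result is the same.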

Enough abstract talk. Let's see what happens when parsing the following simple XML document.
The following code will parse this document (assumed stored in addressbook.xml) and echo the content as it is supplied to the custom content handler.
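
A sketch of such a handler is shown here. The document is reconstructed from the output shown in this post, so its exact whitespace is an assumption, and it is held in memory rather than read from addressbook.xml so the example runs on its own. EchoHandler and its lines attribute are names chosen for this sketch.

```python
import io
import xml.sax

# Reconstructed from the echoed output; indentation is assumed.
ADDRESS_BOOK = b"""<address-book>
 <name>Fred Fox</name>
 <phone>1234567</phone>
 <address type="postal">PO Box 987, Anytown, EV</address>
 <address type="street">34 Main St, Anytown, EV</address>
</address-book>"""


class EchoHandler(xml.sax.ContentHandler):
    """Echoes each callback as it is received and keeps a copy in lines."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def _echo(self, line):
        self.lines.append(line)
        print(line)

    def startElement(self, name, attrs):
        self._echo("startElement '" + name + "'")
        # Only address elements carry a type attribute.
        if name == "address":
            self._echo(" attribute type='" + attrs.getValue("type") + "'")

    def endElement(self, name):
        self._echo("endElement '" + name + "'")

    def characters(self, content):
        self._echo("characters '" + content + "'")


handler = EchoHandler()
xml.sax.parse(io.BytesIO(ADDRESS_BOOK), handler)
```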

Worth noting here is the implementation of startElement. It is called for each element encountered in the source document. Only the address element has attributes, so there is an explicit check for this element's tag before trying to access the value of the type attribute.
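
An alternative to checking the tag name is to work with whatever attributes are present. The sketch below loops over attrs.items() and uses attrs.get() with a default, so elements without a type attribute need no special case. AttributeEcho and the shortened document are made up for illustration.

```python
import xml.sax


class AttributeEcho(xml.sax.ContentHandler):
    """Hypothetical handler that reports every attribute it is given."""

    def __init__(self):
        super().__init__()
        self.seen = []
        self.kinds = {}

    def startElement(self, name, attrs):
        # items() yields (attribute name, value) pairs; it is empty for
        # elements that have no attributes.
        for attr_name, value in attrs.items():
            self.seen.append((name, attr_name, value))
            print(" attribute " + attr_name + "='" + value + "'")
        # get() returns the default instead of raising KeyError.
        self.kinds[name] = attrs.get("type", "(none)")


echo = AttributeEcho()
xml.sax.parseString(
    b'<address-book><name>Fred Fox</name>'
    b'<address type="postal">PO Box 987</address></address-book>',
    echo,
)
```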

This code run against the source document generates the following output:

startElement 'address-book'
characters '
'
characters ' '
startElement 'name'
characters 'Fred Fox'
endElement 'name'
characters '
'
characters ' '
startElement 'phone'
characters '1234567'
endElement 'phone'
characters '
'
characters ' '
startElement 'address'
 attribute type='postal'
characters 'PO Box 987, Anytown, EV'
endElement 'address'
characters '
'
characters ' '
startElement 'address'
 attribute type='street'
characters '34 Main St, Anytown, EV'
endElement 'address'
characters '
'
endElement 'address-book'

Note the multiple calls to characters that deliver the whitespace used for indentation and new lines.
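
If that layout whitespace is not wanted, one option is to buffer the text of each element, strip it when the element ends, and ignore anything that turns out to be empty. This is only a sketch under that assumption (it would also trim meaningful leading or trailing spaces); LeafText and the shortened document are illustrative.

```python
import xml.sax


class LeafText(xml.sax.ContentHandler):
    """Hypothetical handler that keeps only non-blank element text."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self.values = []

    def startElement(self, name, attrs):
        # Discard indentation seen before this element opened.
        self._chunks = []

    def characters(self, content):
        self._chunks.append(content)

    def endElement(self, name):
        text = "".join(self._chunks).strip()
        if text:
            self.values.append((name, text))
        # Reset so trailing whitespace is not credited to the parent.
        self._chunks = []


leaf = LeafText()
xml.sax.parseString(
    b"<address-book>\n <name>Fred Fox</name>\n"
    b" <phone>1234567</phone>\n</address-book>",
    leaf,
)
print(leaf.values)
```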

There is a lot more to say about SAX parsing, but I've said enough for this post. A subsequent post will explore the namespace areas of the SAX API (startElementNS etc).

6 comments:

s said...

Good tutorial, got me up and running fast.

I think line 16 of your python code example should probably read:

print("characters '" + content + "'")

Anthony said...

Thanks for the comment and the fix. Sorry for the delay, I hadn't set up email notifications for new comments.

Kerlee said...

It has been very useful, but just a question: how can I handle or skip comments? I can't seem to find a solution.

Anthony said...

Hi Kerlee,

I just had a bit of a browse around and it seems you need to look into the LexicalHandler interface. It would be something like:

p = xml.sax.make_parser()
p.setContentHandler(MyContentHandler())
p.setProperty(xml.sax.handler.property_lexical_handler, MyLexicalHandler())
p.parse(input_file)

Where MyLexicalHandler is

class MyLexicalHandler(LexicalHandler):
    def comment(self, text):
        print(text)

But, LexicalHandler isn't in the standard documentation. I found a reference that says it exists in the PyXML library.

Haven't tested this information.

http://www.saxproject.org/faq.html

Unknown said...

""" Note the multiple calls to character providing whitespace used for indenting and new lines. """ -- how do I avoid this whitespace and these newlines and extract only the value between tags? I can't seem to find a solution for this.

Lavanya said...

Hi, very helpful.