Monday, March 29, 2010

Blogging with reStructuredText and Pygments

I have been syntax highlighting my code in these blog posts using syntaxhighlighter. I am encountering two problems with this:

  1. I am relying on an external service to source the javascript providing the functionality. This hasn't been a problem to date, but it is a point of failure and is occasionally slow.
  2. The code isn't being shown in the RSS feed for this site.

An alternative approach is to convert the code directly to syntax highlighted HTML and paste the resulting content directly into my blog entries. This is a small part of Doug Hellman's workflow for writing technical documentation.

I don't (yet) need the complete workflow described by Doug. I just need the reStructuredText and Pygments capabilities for now. This post is the first output from this process.

Installing reStructuredText and Pygments is quite simple using easy_install. The docutils package provides reStructuredText tools and is installed by typing the following at a terminal prompt:

easy_install -U docutils

You might need an sudo at the front of that line to get some of the tools installed in system directories. With docutils installed you can start writing reST documents and have the tools to turn them into HTML, LaTeX and others formats. See the reStructuredText website for more information. Pygments provides source code syntax highlighting and is installed similarily to docutils:

easy_install -U pygments

A command line program pygmentize will be installed and can be used to manually turn source code into highlighted text in a number of formats including HTML. Consult the Pygments website for more information.

True documenting power comes from being able to include source code within a reST document and have Pygments automatically highlight the code. Getting this to work, however, is a little fiddly. Below is the sequence of steps I took to get this working.

  1. Obtain the from Pygments. I couldn't find this on my system and ended up cloning the Pygments Mercurial repository and grabbing the file from the external directory. The repository can be cloned using the command:

    hg clone pygments
  2. Follow the instructions for registering a directive. The docutils/parsers/rst/directives directory is located within /Library/Python/2.6/site-packages on my Snow Leopard Macbook. I renamed the file to within the directives directory and add the entry 'sourcecode':('sourcecode', 'Pygments') to the _directive_registry dictionary.

Update: I've now updated my docutils install (using easy_install -U docutils) to version 0.7. This appears to break the sourcecode directive installed using the previous instruction. It seems that version 0.7 is installed into a version specific directory within /Library/Python/2.6/site-packages. The directory is named docutils-0.7-py2.6.egg and within which are the directories mentioned above into which the sourcecode directive can be installed. After installing in this new directory my HTML generation was once again successful with regard to the source code blocks.

  1. Pygments wraps various parts of source code with span tags when generating HTML. These span tags are assigned a class and these classes are styled using CSS. The style used by Pygments can be written to file using the command line pygmentize -S default -f html -a .highlight > style.css. The .highlight corresponds to the highlight class assigned to the div surrounding the entire source code block. Specifying this ensures the Pygments specific CSS classes are only applied to elements within this top level div. Add the content of style.css to my Blogger template so that any Pygment generated HTML that I paste in will be correctly coloured.

All that is required now is to add some source code. Here is Hello World (what else) in C++.

#include <iostream>

int main(int argc, const char* argv[])
  std::cout << "Hello World!" << std::endl;

  return 0;

Typeset this using rst2html:

rst2html input.rst > output.html

and then copy the body content of output.html to Blogger as a new post.

Sunday, March 28, 2010

SAX Parsing with Python

The Simple API for XML (SAX) is a callback based API for parsing XML documents. An XML document is walked by a SAX parser which calls into a known API to report the occurrence of XML constructs (elements, text) in the source document as they are encountered. This will (hopefully) become clearer when we get to the examples later in this post.

SAX is a defacto standard, rather than a formal standard, based on an original Java implementation. provides the official website for SAX and includes some of the history of SAX's evolution and information on writing SAX based programs using the Java API. The book Sax2 also provides a good reference for parsing with SAX. Note that the book was published in 2002 but is still relevant today as SAX version 2 is still the current version of the API.

Python provides its SAX support in the xml.sax module. The official documentation is in subsections 19.9, 19.10, 19.11, and 19.12 of Chapter 19: Structured Markup Processing Tools. As mentioned in my previous post, this documentation can also be accessed directly from within the Python interactive interpreter as follows:

The entry point to parsing an XML document using SAX is the xml.sax.parse() function. This function takes two required arguments (source and content handler) and one optional argument (an error handler). The input source is a file like object that provides access to the source XML document. Functions provided by the supplied content handler are called by the parser as it encounters constructs in the source XML document. The optional third argument is there to provide a custom error handler.

The key part of the handling is the content handler. This is where your application specific code is informed of content sourced from the input XML document. Your content handler will be an object that provides the same interface as class xml.sax.ContentHandler. The methods defined in this class are what the SAX parser expects when invoking callbacks. The complete interface is:

The simplest way to provide this interface is to have your custom class extend xml.sax.ContentHandler and override just the methods that you are interested in receiving. The methods you don't implement will make use of the empty implementations provided by xml.sax.ContentHandler.

Arguably the most relevant methods to override are startElement, endElement and characters. These three methods will received just about all of the content from an XML document. The startElement method is called when the SAX parser encounters the opening element in a document. The name of the element and all the attributes are supplied. The endElement method is called when the closing tag for the element is encountered. The characters method receives all the content in between, though there is no requirement that all the text be provided in one call to characters. There may be multiple calls.

Enough abstract talk. Lets see what happens when parsing the following, simple, XML document.
The following code will parse this document (assumed stored in addressbook.xml) and echo the content as it is supplied to the custom content handler.

Worth noting here is the implementation of startElement. This is called upon each element in the source document being encountered. Only the address element has attributes so there is an explicit check for this element tag before trying to access the value of the type attribute.

This code run against the source document generates the following output:

startElement 'address-book'
characters '
characters ' '
startElement 'name'
characters 'Fred Fox'
endElement 'name'
characters '
characters ' '
startElement 'phone'
characters '1234567'
endElement 'phone'
characters '
characters ' '
startElement 'address'
 attribute type='postal'
characters 'PO Box 987, Anytown, EV'
endElement 'address'
characters '
characters ' '
startElement 'address'
 attribute type='street'
characters '34 Main St, Anytown, EV'
endElement 'address'
characters '
endElement 'address-book'

Note the multiple calls to character providing whitespace used for indenting and new lines.

There is a lot more to say about SAX parsing, but I've said enough for this post. A subsequent post will explore the namespace areas of the SAX API (startElementNS etc).

Saturday, March 27, 2010

Learning Python

Python is a relatively friendly programming language (subjective, I know) and a good one for the novice programmer to start with. There are numerous resources on the web providing introductory material through to teaching Computer Science with Python code providing the examplar implementation of the theory. Below are a few links to such resources:
  • The official Python documentation with a tutorial to the language and a complete reference to the libraries provided with Python by default.
  • Learning Python: A book providing a good introduction to the language. Does not cover the standard Python library in great detail. If you are new to the Python language and want to buy a book, this is a good choice.
  • Programming Python: This provides the library reference that "Learning Python" does not. A little old now (covering Python 2.5) but the Python library does not change greatly release to release so it is still a useful reference. Not for someone new to the language or programming.
  • Python Essential Reference (4th Edition): This book is like the previous two books in one, but shallower in both areas. Probably better suited to someone already somewhat familiar with a language and looking for a single, off line resource to reference.
  • Online, free book "Dive into Python". A little old and misses some newer language features but provides a useful and free introduction. Covers the language and selected libraries.
  • Online, free book "How To Think Like a Computer Scientist: Python Version." Teaches computer science using Python for code examples.
  • Practical Programming: An Introduction to Computer Science Using Python (Pragmatic Programmers): Another book taking a Computer Science approach and using Python as the language for examplifying the concepts.
  • MIT's 6.00 Introduction to Computer Science and Programming: Complete course material, including video lectures, from MIT 6.00 "Introduction to Computer Science and Programming". Uses Python as the programming language to illustrate the concepts.
If you have made is as far as installing Python on your computer then you have another great learning resource available to you: the Python interactive interpreter. Invoking python from the command line will start a shell into which you can enter and execute Python commands. Great for trying out Python language features or library functions without writing a full script.

Also available are two builtin functions: dir() and help(). The first, dir(), is provided with anything and will respond with all functions and variables available on the type of the object provided. For example, to get all the functions that can be invoked on a string, type dir(str) into the shell and you'll get back a list of function names. Providing a module name and the list will also contain all classes available within that module. Note that you must first import the module you are querying.

The second function, help(), returns builtin documentation for the object provided. This is a formatted, curated, and more informative set of information than is provided by dir() and corresponds to what is provided in the Python documentation. For those with some Unix knowledge, the output of help is roughly equivalent to man pages. Typing help(str) provides the complete reference for string objects including documentation for all functions. You can also get help on a specific function, for example, help(str.startswith).

The interactive shell and the two functions dir() and help() can get you a long way to writing what you need to write.

Another resource for learning Python will be the site you are reading now. I am somewhat familiar with Python but am by no means an export. Python is one of the tools I use frequently and is one I want to know better. I will be documenting here as I learn more.