Thursday, November 18, 2010

Project Euler Problem 1 in Clojure

Presented in this article is my progressively refined solution to Project Euler Problem 1.

Project Euler's Problem 1 is:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below 1000.

My first cut reveals me structured programming background. I'm sure any functional programming experts will run screaming at the sight of it. The only redeeming feature (in my opinion) of my implementation is the use of iteration through recursion.

(defn div-3? [n] (= 0 (mod n 3)))
(defn div-5? [n] (= 0 (mod n 5)))
(defn div-3-or-5? [n] (or (div-3? n) (div-5? n)))

(defn euler1 [total count max-count]
  (if (>= count max-count)
    total
    (do
      (if (div-3-or-5? count)
        (recur (+ total count) (inc count) max-count)
        (recur total (inc count) max-count)))))

(def result (euler1 0 0 1000))

(println result)

My first refactoring of this code came about when I realised the div-3?, div-5? and div-3-or-5? functions are actually predicates (functions that return true or false) and can be passed to the Clojure supplied higher order function filter. Applying to the sequence of consecutive integers from 0 to 1000 yields the sequence of integers divisible by 3 or 5 which are the integers that need to be summed to yield the final result. The summation can be peformed with Clojure's reduce function.

(defn div-3? [n] (= 0 (mod n 3)))
(defn div-5? [n] (= 0 (mod n 5)))
(defn div-3-or-5? [n] (or (div-3? n) (div-5? n)))

(defn euler1 []
  (reduce +
    (filter div-3-or-5? (range 1000))))

(def result (euler1))

(println result)

My final refactoring is simply a reduction of the code into a single line. The major Clojure feature exhibited here is the use of an inline function provided to the filter function.

(println
  (reduce +
    (filter
      #(or (= 0 (mod % 3)) (= 0 (mod % 5)))
      (range 1000))))

This is where I left my solution to the problem, though I wasn't entirely happy with it. In particular, the (= 0 (mod % 3)) lines felt unclear.

A Google search revealed the clojure-euler Wiki, a set of solutions to the Project Euler problems written in Clojure. The solution page for Problem 1 revealed a few new features. Most notably (for me) it shows that I could rewrite the (= 0 (mod % 3)) clause as (zero? (rem % 3)) which I think is much more expressive.

Friday, April 2, 2010

XML Namespaces

XML Namespaces are identified by a URI reference. Both elements and attributes can be defined within namespaces. Namespaces are brought into a document using the reserved attributes xmlns, which defines the default namespace, or xmlns:<str> which binds <str> to the actual URI reference defining the namespace. <str> is referred to as the namespace name. Elements and attributes are prefixed with <str>: to indicate that they belong to the namespace that <str> has been bound to. The element or attribute name is referred to as a local name. The combination of <str>:<name> is referred to as a qualified name. Note that if <name> is in the default namespace then this is also the qualified name.

An example of using a default namespace:

<?xml version="1.0" ?>

<book xmlns="http://example.org/book">
  <title type="main">Hello World</title>
  <author>J.Q. Citizen</author>
</book>

The same example with a bound, named namespace:

<?xml version="1.0" ?>

<book:book xmlns:book="http://example.org/book">
  <book:title book:type="main">Hello World</book:title>
  <book:author>J.Q. Citizen</book:author>
</book:book>

Multiple namespaces can be introduced:

<?xml version="1.0" ?>

<product:book xmlns:product="http://example.org/product"
              xmlns:bookattr="http://example.org/bookattr">
  <bookattr:title bookattr:type="main">Hello World</bookattr:title>
  <bookattr:author>J.Q. Citizen</bookattr:author>
</product:book>

Namespaces have scope from the start element in which they are declared through to the corresponding end element.

Note that even though namespaces can look like URLs they are just treated as strings by an XML processor. A namespace aware XML processor will not reach out to the URL to get something when it encounters a namespace declaration.

Monday, March 29, 2010

Blogging with reStructuredText and Pygments

I have been syntax highlighting my code in these blog posts using syntaxhighlighter. I am encountering two problems with this:

  1. I am relying on an external service to source the javascript providing the functionality. This hasn't been a problem to date, but it is a point of failure and is occasionally slow.
  2. The code isn't being shown in the RSS feed for this site.

An alternative approach is to convert the code directly to syntax highlighted HTML and paste the resulting content directly into my blog entries. This is a small part of Doug Hellman's workflow for writing technical documentation.

I don't (yet) need the complete workflow described by Doug. I just need the reStructuredText and Pygments capabilities for now. This post is the first output from this process.

Installing reStructuredText and Pygments is quite simple using easy_install. The docutils package provides reStructuredText tools and is installed by typing the following at a terminal prompt:

easy_install -U docutils

You might need an sudo at the front of that line to get some of the tools installed in system directories. With docutils installed you can start writing reST documents and have the tools to turn them into HTML, LaTeX and others formats. See the reStructuredText website for more information. Pygments provides source code syntax highlighting and is installed similarily to docutils:

easy_install -U pygments

A command line program pygmentize will be installed and can be used to manually turn source code into highlighted text in a number of formats including HTML. Consult the Pygments website for more information.

True documenting power comes from being able to include source code within a reST document and have Pygments automatically highlight the code. Getting this to work, however, is a little fiddly. Below is the sequence of steps I took to get this working.

  1. Obtain the rst-directive.py from Pygments. I couldn't find this on my system and ended up cloning the Pygments Mercurial repository and grabbing the file from the external directory. The repository can be cloned using the command:

    hg clone http://dev.pocoo.org/hg/pygments-main pygments
    
  2. Follow the instructions for registering a directive. The docutils/parsers/rst/directives directory is located within /Library/Python/2.6/site-packages on my Snow Leopard Macbook. I renamed the rst-directive.py file to sourcecode.py within the directives directory and add the entry 'sourcecode':('sourcecode', 'Pygments') to the _directive_registry dictionary.

Update: I've now updated my docutils install (using easy_install -U docutils) to version 0.7. This appears to break the sourcecode directive installed using the previous instruction. It seems that version 0.7 is installed into a version specific directory within /Library/Python/2.6/site-packages. The directory is named docutils-0.7-py2.6.egg and within which are the directories mentioned above into which the sourcecode directive can be installed. After installing in this new directory my HTML generation was once again successful with regard to the source code blocks.

  1. Pygments wraps various parts of source code with span tags when generating HTML. These span tags are assigned a class and these classes are styled using CSS. The style used by Pygments can be written to file using the command line pygmentize -S default -f html -a .highlight > style.css. The .highlight corresponds to the highlight class assigned to the div surrounding the entire source code block. Specifying this ensures the Pygments specific CSS classes are only applied to elements within this top level div. Add the content of style.css to my Blogger template so that any Pygment generated HTML that I paste in will be correctly coloured.

All that is required now is to add some source code. Here is Hello World (what else) in C++.

#include <iostream>

int main(int argc, const char* argv[])
{
  std::cout << "Hello World!" << std::endl;

  return 0;
}

Typeset this using rst2html:

rst2html input.rst > output.html

and then copy the body content of output.html to Blogger as a new post.

Sunday, March 28, 2010

SAX Parsing with Python

The Simple API for XML (SAX) is a callback based API for parsing XML documents. An XML document is walked by a SAX parser which calls into a known API to report the occurrence of XML constructs (elements, text) in the source document as they are encountered. This will (hopefully) become clearer when we get to the examples later in this post.

SAX is a defacto standard, rather than a formal standard, based on an original Java implementation. http://www.saxproject.org provides the official website for SAX and includes some of the history of SAX's evolution and information on writing SAX based programs using the Java API. The book Sax2 also provides a good reference for parsing with SAX. Note that the book was published in 2002 but is still relevant today as SAX version 2 is still the current version of the API.

Python provides its SAX support in the xml.sax module. The official documentation is in subsections 19.9, 19.10, 19.11, and 19.12 of Chapter 19: Structured Markup Processing Tools. As mentioned in my previous post, this documentation can also be accessed directly from within the Python interactive interpreter as follows:

The entry point to parsing an XML document using SAX is the xml.sax.parse() function. This function takes two required arguments (source and content handler) and one optional argument (an error handler). The input source is a file like object that provides access to the source XML document. Functions provided by the supplied content handler are called by the parser as it encounters constructs in the source XML document. The optional third argument is there to provide a custom error handler.


The key part of the handling is the content handler. This is where your application specific code is informed of content sourced from the input XML document. Your content handler will be an object that provides the same interface as class xml.sax.ContentHandler. The methods defined in this class are what the SAX parser expects when invoking callbacks. The complete interface is:

The simplest way to provide this interface is to have your custom class extend xml.sax.ContentHandler and override just the methods that you are interested in receiving. The methods you don't implement will make use of the empty implementations provided by xml.sax.ContentHandler.

Arguably the most relevant methods to override are startElement, endElement and characters. These three methods will received just about all of the content from an XML document. The startElement method is called when the SAX parser encounters the opening element in a document. The name of the element and all the attributes are supplied. The endElement method is called when the closing tag for the element is encountered. The characters method receives all the content in between, though there is no requirement that all the text be provided in one call to characters. There may be multiple calls.

Enough abstract talk. Lets see what happens when parsing the following, simple, XML document.
The following code will parse this document (assumed stored in addressbook.xml) and echo the content as it is supplied to the custom content handler.

Worth noting here is the implementation of startElement. This is called upon each element in the source document being encountered. Only the address element has attributes so there is an explicit check for this element tag before trying to access the value of the type attribute.

This code run against the source document generates the following output:

startElement 'address-book'
characters '
'
characters ' '
startElement 'name'
characters 'Fred Fox'
endElement 'name'
characters '
'
characters ' '
startElement 'phone'
characters '1234567'
endElement 'phone'
characters '
'
characters ' '
startElement 'address'
 attribute type='postal'
characters 'PO Box 987, Anytown, EV'
endElement 'address'
characters '
'
characters ' '
startElement 'address'
 attribute type='street'
characters '34 Main St, Anytown, EV'
endElement 'address'
characters '
'
endElement 'address-book'

Note the multiple calls to character providing whitespace used for indenting and new lines.

There is a lot more to say about SAX parsing, but I've said enough for this post. A subsequent post will explore the namespace areas of the SAX API (startElementNS etc).

Saturday, March 27, 2010

Learning Python

Python is a relatively friendly programming language (subjective, I know) and a good one for the novice programmer to start with. There are numerous resources on the web providing introductory material through to teaching Computer Science with Python code providing the examplar implementation of the theory. Below are a few links to such resources:
  • http://docs.python.org/: The official Python documentation with a tutorial to the language and a complete reference to the libraries provided with Python by default.
  • Learning Python: A book providing a good introduction to the language. Does not cover the standard Python library in great detail. If you are new to the Python language and want to buy a book, this is a good choice.
  • Programming Python: This provides the library reference that "Learning Python" does not. A little old now (covering Python 2.5) but the Python library does not change greatly release to release so it is still a useful reference. Not for someone new to the language or programming.
  • Python Essential Reference (4th Edition): This book is like the previous two books in one, but shallower in both areas. Probably better suited to someone already somewhat familiar with a language and looking for a single, off line resource to reference.
  • http://diveintopython.org/: Online, free book "Dive into Python". A little old and misses some newer language features but provides a useful and free introduction. Covers the language and selected libraries.
  • http://openbookproject.net//thinkCSpy/: Online, free book "How To Think Like a Computer Scientist: Python Version." Teaches computer science using Python for code examples.
  • Practical Programming: An Introduction to Computer Science Using Python (Pragmatic Programmers): Another book taking a Computer Science approach and using Python as the language for examplifying the concepts.
  • MIT's 6.00 Introduction to Computer Science and Programming: Complete course material, including video lectures, from MIT 6.00 "Introduction to Computer Science and Programming". Uses Python as the programming language to illustrate the concepts.
If you have made is as far as installing Python on your computer then you have another great learning resource available to you: the Python interactive interpreter. Invoking python from the command line will start a shell into which you can enter and execute Python commands. Great for trying out Python language features or library functions without writing a full script.

Also available are two builtin functions: dir() and help(). The first, dir(), is provided with anything and will respond with all functions and variables available on the type of the object provided. For example, to get all the functions that can be invoked on a string, type dir(str) into the shell and you'll get back a list of function names. Providing a module name and the list will also contain all classes available within that module. Note that you must first import the module you are querying.

The second function, help(), returns builtin documentation for the object provided. This is a formatted, curated, and more informative set of information than is provided by dir() and corresponds to what is provided in the Python documentation. For those with some Unix knowledge, the output of help is roughly equivalent to man pages. Typing help(str) provides the complete reference for string objects including documentation for all functions. You can also get help on a specific function, for example, help(str.startswith).

The interactive shell and the two functions dir() and help() can get you a long way to writing what you need to write.

Another resource for learning Python will be the site you are reading now. I am somewhat familiar with Python but am by no means an export. Python is one of the tools I use frequently and is one I want to know better. I will be documenting here as I learn more.
 

Saturday, November 7, 2009

Testing syntaxhighlighter

Quick test of syntaxhighlighter. Particularly cool is that the site hosts the files, as explained here. One problem though, the code set inside <script> tags doesn't render in the preview. Using <pre class="..."> does work but you need to HTML encode all those special characters like less than, greater than and double quotes. It seems that the code within the <script> also does not appear when the RSS feed is viewed in Google Reader. Some Python: Some C++: Some XML:
<?xml version="1.0" ?>
<root>
</root>

Thursday, May 14, 2009

SCons: cleaning variant_dir

A large software project will typically contain a hierarchy of SConscript files, with a root file invoking the subsidiary files. Consider the following hierarchy: The SConscript in /src contains: Just calling SConscript('src/SConscript') from the root SConstruct file results in the build files residing with the source files in the src/ directory. It can be useful to have the build output in a separate directory as it can provide easier cleaning (just nuke the directory) and packaging. This is what the variant_dir keyword argument is for. The root SConstruct file: Now calling scons will result in the build output residing below a new 'build' directory. It also copies the source files across to support legacy tools that will only spit their output into the directory containing the input. This duplication can be turned off by adding the 'duplicate=0' keyword argument to the SConscript() call. To clean up the build output one uses the regular 'scons -c' command. However, this will not remove the newly created build directory. This might be a performance improvement to save having to copy files from src to build each time. However, the build directory is also not removed when 'duplicate=0' is specified, i.e., when the build directory contains no copied source files. If you want the build directory nuked when you call 'scons -c' you can identify it for cleaning using the Clean() function. This function takes two arguments: the first a target and the second the pathname to remove. This tells SCons to remove pathname when it considers the target in its dependency checking. Directories are implicitly targets and '.' is always considered so this provides our target. SConstruct becomes: Calling 'scons -c' now will remove the build directory. (tested on Mac OS X with SCons 1.2.0.d20090223).