Python and XML

Parse an XML Document using Python

Jonathan Banks

A little background before I walk through a simple example that pulls node data out of XML using Python. I've used Perl for years and absolutely love it for accomplishing a wide variety of tasks, ranging from complex object-oriented solutions to simple text parsing.

The power and simplicity of Perl is astounding but secretly I enjoy programming in it because the MIS folks I work with can't wrap their heads around the arcane syntax and who would want MIS guys monkeying around with real code? I'm kidding of course but Perl users must admit that their code is not easily maintainable.

Recently I've had reason to use Python to set up the backend of a Blackberry mobile project and I must say that I really like it. Initially I had trouble simply typing in code as muscle memory forces me to use bracketing for code blocks.

You can imagine the frustration of having to constantly delete brackets, so I quickly solved this problem by downloading and installing the PyDev Eclipse plugin found here. Armed with code completion, syntax highlighting, and code analysis, my development time has been reduced significantly.

Setup

One of the first processes I needed to automate with Python was simple parsing of XML content, and I couldn't find a quick, simple example of how it's done (hence this small tutorial, which I hope helps others out). After some experimentation I determined that the minidom module would do what I wanted and is extremely lightweight, so we'll use it for the example code. For the purposes of this article I'll use the following sample XML file, courtesy of Microsoft and edited to save some space:

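<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
</catalog>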

Parsing the XML

What we want to do is walk the child nodes of each book and get the text data of each one for processing. With Python it's simple: import the parse() function from xml.dom.minidom and the Node class from xml.dom, read in the XML file using parse(), and then iterate through the childNodes of each book. For simplicity's sake, in the example I'll just print the data for each node to output. Here's the Python code to do just that:

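from xml.dom import Node
from xml.dom.minidom import parse

# Read the whole file into an in-memory DOM tree
xmlDoc = parse("library.xml")
for node1 in xmlDoc.getElementsByTagName("book"):
    for node2 in node1.childNodes:
        # Skip the whitespace text nodes that sit between the child elements
        if node2.nodeType == Node.ELEMENT_NODE:
            print(node2.childNodes[0].data)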


As you can see, with just a handful of lines of code we've read the entire XML file into memory using DOM, parsed it, pulled out the book elements, and printed the text of each child element to output. The check that the node is an ELEMENT_NODE is critical: the whitespace between tags is stored as TEXT_NODE children of each book, and we do not want to iterate through those in this example. If you pull that test out, the code will attempt to get childNodes[0] from a TEXT_NODE and will fail with "IndexError: tuple index out of range", since a text node has no child nodes.
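To see the element check at work without needing a file on disk, here's a minimal sketch that parses an inline string with minidom's parseString(). The SAMPLE string, its tag names, and the two helper functions are my own stand-ins for illustration, not part of library.xml. The extra firstChild guard is also an addition, there only so an empty element wouldn't raise.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Inline stand-in for library.xml (tag names are assumed for illustration)
SAMPLE = """<catalog>
  <book>
    <author>Ralls, Kim</author>
    <genre>Fantasy</genre>
    <price>5.95</price>
  </book>
</catalog>"""


def book_values(xml_text):
    """Return one list of child-element text values per <book>."""
    doc = parseString(xml_text)
    books = []
    for book in doc.getElementsByTagName("book"):
        values = []
        for child in book.childNodes:
            # Whitespace between tags shows up as TEXT_NODE children; skip it
            if child.nodeType == Node.ELEMENT_NODE and child.firstChild is not None:
                values.append(child.firstChild.data)
        books.append(values)
    return books


def child_node_types(xml_text):
    """Return the nodeType of every direct child of the first <book>."""
    doc = parseString(xml_text)
    book = doc.getElementsByTagName("book")[0]
    return [child.nodeType for child in book.childNodes]


print(book_values(SAMPLE))
print(child_node_types(SAMPLE))
```

The second print shows text nodes and element nodes alternating, which is why the loop can't simply assume every child is an element.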

If you need to pull out the attributes of an element, you can iterate through them using the node's attributes member, a dictionary-like NamedNodeMap. In this case, if node2 had attributes (and it doesn't in my example), you could assign the map to a variable with 'attributes = node2.attributes' and then iterate through the attribute names using the 'keys()' method on attributes.
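As a rough sketch of that attribute loop, assuming an element that actually carries attributes: the id and lang attributes below are made up for illustration and are not in my sample file.

```python
from xml.dom.minidom import parseString

# Hypothetical snippet: the id and lang attributes are invented for this example
doc = parseString('<book id="bk101" lang="en"><genre>Computer</genre></book>')
book = doc.documentElement

# attributes is a dictionary-like map of attribute name to Attr node
attributes = book.attributes
pairs = {name: attributes[name].value for name in attributes.keys()}
print(pairs)

# When you already know the attribute name, getAttribute() is the shorter route
print(book.getAttribute("id"))
```

Each entry in the map is an Attr node rather than a plain string, so the text lives in its value member.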

Perl Counterexample

To parse the same XML in Perl you'd have to write the following:

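use strict;
use warnings;
use XML::DOM;    # also exports node type constants such as ELEMENT_NODE

my $file = 'library.xml';
my $parser = XML::DOM::Parser->new();
my $xmldoc = $parser->parsefile($file);
foreach my $book ($xmldoc->getElementsByTagName('book')){
   foreach my $tag ( $book->getChildNodes() ) {
      if($tag->getNodeType == ELEMENT_NODE){
         # Print the element's text plus a newline, matching Python's print
         print $tag->getFirstChild->getNodeValue, "\n";
      }
   }
}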

Looks similar and in reality it is mostly identical code but in my opinion it is definitely not as clean as the Python code. Don't get me wrong - I love Perl and am not saying that other languages beat it out - if that were the case I'd have posted a better example to stoke the flames beneath the Python vs Perl zealots. Nonetheless, I must say I'm beginning to enjoy Python quite a bit.

Wrap up

If you came here looking for something other than a simple way to parse XML using Python, such as adding to the ridiculous flame wars that go on between Perl and Python evangelists, look here, here, and here for more on the debate. If you want a pretty good reference for using XML with Python, look here.

Monty Python

Side note: I always thought Python was named after the snake. However, according to the source itself, it's really named after Monty Python's Flying Circus, which really strikes a chord with me since I love several of the Python movies, particularly the Holy Grail. Maybe I should change the picture at the top of this post to a frame from this: