XML::Twig for Large XML Files in Perl

November 20, 2008. Filed under perl

Last week I was using XML::Simple for some simple XML manipulation tasks in Perl. This week I needed to do the same thing, but as the file size went from 11k to over 80 megs, I ran into a genre of problem I hadn't ever dealt with before: really large files.[1]

You'd imagine that someone passing themselves off as a professional developer would have experienced this particular pain before, but I hadn't. It's kind of a fun problem set, because it forces you to start thinking seriously about time and space constraints. Out of curiosity, my first approach was to just rerun the XML::Simple script I had written. That approach involved a long wait, a lot of RAM, and giving up. The problem is that XML::Simple doesn't use SAX or even DOM for handling XML files; it uses the user-friendly convert-to-native-datatypes approach, which, despite being my favorite approach when workable, was like bringing a Barbie Power Wheels to a demolition derby when I tried it on the 200k+ elements I needed to read-manipulate-write.
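For context, the XML::Simple approach slurps the entire document into native hashes and arrays in one gulp. A minimal sketch of what my script looked like (using the machines file format from the example below):

use XML::Simple;

# reads the whole file into nested hashes/arrays at once;
# delightful at 11k, hopeless at 80 megs
my $machines = XMLin('file.xml', ForceArray => [ 'server', 'proxy' ]);

for my $server (@{ $machines->{server} }) {
    print "ip: $server->{ip}\n";
}

That all-in-memory representation is exactly what makes XML::Simple both so pleasant for small files and so unworkable at scale.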

So I switched libraries from XML::Simple to XML::Twig, and it's been a pleasant transition. Instead of taking A Long Time to read in the 80 meg file, it takes about five minutes to read in and write out the file, with a couple of minutes added on if you do any heavy processing; seven minutes is a markedly better deal than never. Memory consumption is quite reasonable too: processing the 80 meg file starts out around 10 megs and creeps upward to around 20 as the script runs, while CPU usage hovers close to 100% of one core.[2]

Selectively Walking a Tree with XML::Twig

Let's say you have an XML file that contains a variety of element types, but you're only interested in reading the elements of types server and proxy.

Further, you want to print the ip of each server and proxy, where the structure of the file looks like this:

<machines>
    <server>
        <ip>127.0.0.1</ip>
        <name>WeirdNamingSchemeA</name>
    </server>
    <proxy>
        <ip>127.0.0.2</ip>
        <etc>abc</etc>
    </proxy>
</machines>

Then we just need to string together this code:

use XML::Twig;

sub print_ip {
    my ($twig, $ele) = @_;
    my $ip = $ele->first_child('ip')->text;
    print "ip: $ip\n";
}

my $roots = { server => 1, proxy => 1 };
my $handlers = { 'machines/server' => \&print_ip,
                 'machines/proxy'  => \&print_ip };
my $twig = XML::Twig->new(TwigRoots    => $roots,
                          TwigHandlers => $handlers);
$twig->parsefile('file.xml');

The TwigRoots parameter picks which elements actually get built in memory (everything else is discarded as it streams past, which is where the memory savings come from), and TwigHandlers determines how to handle them.
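As an aside, the handler keys aren't limited to simple paths: per the XML::Twig docs they can also be bare tag names or include attribute tests. For instance (the status attribute here is hypothetical):

my $handlers = { 'server'                => \&print_ip,   # any server, anywhere
                 'proxy[@status="live"]' => \&print_ip }; # only live proxies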

Modifying Elements with XML::Twig

Although you may have to wander through the documentation a bit, you can do pretty much any manipulation you can imagine with your handler function.

In this snippet we'll trim some child nodes and then insert new ones. Here we'll just write the handler; the surrounding code is the same as in the example above (just replace print_ip with tweak_server in $handlers).

sub tweak_server {
  my ($twig, $server) = @_;
  my $ip = $server->first_child('ip')->text;
  $server->cut_children;            # remove all children
  $server->del_atts;                # remove all attributes
  $server->set_atts({ ip => $ip }); # add new attributes

  # one way to add new nodes...
  $server->insert_new_elt('first_child',              # position to insert
                          'processed_by',             # name of new element
                          { name => "an attribute" }, # attributes
                          "value");                   # inner text

  # another way to add new nodes
  my $ele = XML::Twig::Elt->new('a_sub_element');
  $ele->paste('first_child', $server);

  # print the modified node to standard out,
  # and free $server from memory
  $twig->flush;
}
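Run against the server element from the sample file earlier, tweak_server should print something along these lines (modulo whitespace):

<server ip="127.0.0.1"><a_sub_element/><processed_by name="an attribute">value</processed_by></server>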

It's not as intuitive as using native data structures, but it's usable and quick.
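One related note: flush prints each processed chunk as it frees it. If you only want to read elements and discard them, XML::Twig also provides purge, which frees the memory without printing anything. A sketch of a read-only handler:

sub read_only_handler {
    my ($twig, $ele) = @_;
    # ... inspect $ele here ...
    $twig->purge;   # free everything processed so far, print nothing
}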

Writing Elements with XML::Twig

The last ingredient in this stew is writing the modified file to disk.

use XML::Twig;

my $in = 'in.xml';
my $out = 'out.xml';
open my $out_fh, '>', $out or die "failed to open $out: $!";
 
sub handle_node {
    my ($twig, $ele) = @_;
    my $ip = $ele->first_child('ip')->text;
    $ele->cut_children;
    $ele->set_atts({ ip => $ip });
    $twig->flush($out_fh);
}

my $roots = { server => 1, proxy => 1 };
my $handlers = { 'machines/server' => \&handle_node,
                 'machines/proxy'  => \&handle_node };
my $twig = XML::Twig->new(TwigRoots    => $roots,
                          TwigHandlers => $handlers);
$twig->parsefile($in);
close $out_fh;
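For the sample machines file from earlier, out.xml should come out looking something like this (give or take whitespace and the XML declaration):

<machines><server ip="127.0.0.1"/><proxy ip="127.0.0.2"/></machines>

Note that anything outside the TwigRoots is dropped from the output by default; the docs describe a TwigPrintOutsideRoots option if you need to pass the rest of the document through untouched.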

With these examples, you too can start writing Perl scripts that process 80 meg files in seven minutes. If that doesn't win you a promotion, I can't imagine what would.


  1. It also feels like most developers have ignored this problem. Even Emacs chokes on large XML files, especially ones without line breaks. I think the read-a-line-from-a-file paradigm is going to need to be replaced by a chunk-and-stream approach.

    Although making it equally easy to use seems near impossible.

  2. I think it would be possible to parallelize the three parts of the script: one thread reading in data, one thread manipulating data, and one thread writing data. Given that the process is CPU bound, and assuming my code isn't horribly flawed, the easiest fix would probably be switching to C or Java. But who wants to do that?
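    For what it's worth, the skeleton of that pipeline might look something like this with core threads and Thread::Queue, where read_chunk and process are hypothetical placeholders:

    use threads;
    use Thread::Queue;

    my $to_work  = Thread::Queue->new;
    my $to_write = Thread::Queue->new;

    # worker thread: transform chunks as they arrive
    my $worker = threads->create(sub {
        while (defined(my $chunk = $to_work->dequeue)) {
            $to_write->enqueue(process($chunk));  # process() is a placeholder
        }
        $to_write->enqueue(undef);                # pass the "done" sentinel along
    });

    # writer thread: serialize results to disk
    my $writer = threads->create(sub {
        open my $fh, '>', 'out.xml' or die "failed to open out.xml: $!";
        while (defined(my $chunk = $to_write->dequeue)) {
            print $fh $chunk;
        }
        close $fh;
    });

    # the main thread acts as the reader
    while (defined(my $chunk = read_chunk())) {   # read_chunk() is a placeholder
        $to_work->enqueue($chunk);
    }
    $to_work->enqueue(undef);                     # signal end of input
    $_->join for $worker, $writer;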