XML::Twig for Large XML Files in Perl
Last week I was using XML::Simple to do some simple XML manipulation tasks in Perl. This week I needed to do the same thing, but as the file size went from 11k to over 80 megs I ran into a genre of problems I hadn't dealt with before: really large files [1].
You'd imagine that someone passing themselves off as a professional developer would have run into this particular pain before, but I hadn't. It's kind of a fun problem set, because it forces you to start thinking seriously about time and space constraints. Out of curiosity, my first approach was to just run the XML::Simple script I'd already written; that amounted to waiting a long time, burning a lot of RAM, and giving up. The problem is that XML::Simple doesn't use SAX or even DOM for handling XML files, it uses the user-friendly convert-to-native-datatype approach, which (despite being my favorite approach when workable) was like bringing a Barbie Power Wheels to a demolition derby when I tried it on the 200k+ elements I needed to read-manipulate-write.
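For a sense of what that style looks like, here's a minimal sketch of the XML::Simple approach (not my actual script, and it assumes the machines.xml structure used in the examples below; the real job was a full read-manipulate-write, not just printing):

use XML::Simple;

# XMLin slurps the entire document into nested hashes and arrays,
# which is exactly what hurts once the file hits 80 megs
my $machines = XMLin('machines.xml', ForceArray => [ 'server', 'proxy' ]);

for my $server (@{ $machines->{server} || [] }) {
    print "ip: $server->{ip}\n";
}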
So I switched libraries from XML::Simple to XML::Twig, and it's been a pleasant transition. Instead of taking A Long Time to read in the 80 meg file, it takes about five minutes to read it in and write it out (with a couple of minutes added on top if you do any heavy processing); seven minutes is a markedly better deal than never. Memory consumption is quite reasonable: processing the 80 meg file starts out around 10 megs and creeps up toward 20 as the script runs, and CPU usage hovers close to 100% of one core [2].
Selectively Walking a Tree with XML::Twig
Let's say you have an XML file that contains a variety of element types, but you're only interested in reading the elements of types server and proxy. Further, you want to print each server's and each proxy's ip, where the structure of the file looks like this:
<machines>
  <server>
    <ip>127.0.0.1</ip>
    <name>WeirdNamingSchemeA</name>
  </server>
  <proxy>
    <ip>127.0.0.2</ip>
    <etc>abc</etc>
  </proxy>
</machines>
Then we just need to string together this code:
use XML::Twig;

sub print_ip {
    my ($twig, $ele) = @_;
    my $ip = $ele->first_child('ip')->text;
    print "ip: $ip\n";
}

my $roots    = { server => 1, proxy => 1 };
my $handlers = { 'machines/server' => \&print_ip,
                 'machines/proxy'  => \&print_ip };

my $twig = XML::Twig->new( TwigRoots    => $roots,
                           TwigHandlers => $handlers );
$twig->parsefile('file.xml');
The TwigRoots parameter lets you pick which elements to handle, and TwigHandlers lets you determine how to handle them.
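One detail worth flagging for big files: handled elements still accumulate in the twig until you flush or purge them. For a read-only pass like print_ip, adding a purge call at the end of the handler (which is what the XML::Twig documentation recommends for huge documents) keeps memory flat:

sub print_ip {
    my ($twig, $ele) = @_;
    print "ip: ", $ele->first_child('ip')->text, "\n";
    $twig->purge;    # discard everything parsed so far, freeing the memory
}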
Modifying Elements with XML::Twig
Although you may have to wander through the documentation a bit, you can do pretty much any manipulation you can imagine with your handler function.
In this snippet we'll trim some child nodes and then insert new ones. Here we'll just write the handler; the surrounding code is the same as in the example above (just replace print_ip with tweak_server in $handlers).
sub tweak_server {
    my ($twig, $server) = @_;
    my $ip = $server->first_child('ip')->text;
    $server->cut_children;               # remove all children
    $server->del_atts;                   # remove all attributes
    $server->set_atts({ ip => $ip });    # add attributes
    # one way to add new nodes...
    $server->insert_new_elt('first_child',              # position to insert
                            'processed_by',             # name of new element
                            { name => "an attribute" }, # attributes
                            "value");                   # inner text
    # another way to add new nodes
    my $ele = XML::Twig::Elt->new('a_sub_element');
    $ele->paste('first_child', $server);
    # print the modified node (and everything parsed before it)
    # to standard out, then free it from memory
    $twig->flush;
}
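For the <server> element from the earlier sample, that handler should flush out something along these lines (the exact whitespace and empty-tag style depend on XML::Twig's output settings):

<server ip="127.0.0.1"><a_sub_element/><processed_by name="an attribute">value</processed_by></server>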
It's not as intuitive as using native data structures, but it's usable and quick.
Writing Elements with XML::Twig
The last ingredient in this stew is writing the modified file to disk.
use XML::Twig;

my $in  = 'in.xml';
my $out = 'out.xml';
open FILE, '>', $out or die "failed to open $out: $!";

sub handle_node {
    my ($twig, $ele) = @_;
    my $ip = $ele->first_child('ip')->text;
    $ele->cut_children;
    $ele->set_atts({ ip => $ip });
    $twig->flush(\*FILE);
}

my $roots    = { server => 1, proxy => 1 };
my $handlers = { 'machines/server' => \&handle_node,
                 'machines/proxy'  => \&handle_node };

my $twig = XML::Twig->new( TwigRoots    => $roots,
                           TwigHandlers => $handlers );
$twig->parsefile($in);
close FILE;
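One caveat with this setup: as I understand it, because TwigRoots only builds the server and proxy elements, anything outside those subtrees doesn't make it into the output file. If you want the rest of the document copied through untouched, XML::Twig has a TwigPrintOutsideRoots option that (as I read the docs) sends everything outside the roots straight to a filehandle:

my $twig = XML::Twig->new( TwigRoots             => $roots,
                           TwigHandlers          => $handlers,
                           TwigPrintOutsideRoots => \*FILE );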
With these examples, you too can start writing Perl scripts that process 80 meg files in seven minutes. If that doesn't win you a promotion, I can't imagine what would.
[1] It also feels like most developers have ignored this problem. Even Emacs chokes on large XML files, especially if they don't have line breaks. I think the readline-and-file paradigm is going to need to be replaced by the chunk-and-stream approach. Although making it equally easy to use seems near impossible.
[2] I think it would be possible to parallelize the three stages of the script: one thread reading in data, one thread manipulating data, and one thread writing data. Given that the process is CPU bound, and assuming my code isn't horribly flawed, the easiest fix would probably be switching to C or Java. But who wants to do that?