I, like most people, never realized I'd be dealing with large files. Oh, I knew there would be some files with megabytes of data, but I never suspected I'd be begging Perl to process hundreds of megabytes of XML, nor that this week I'd be asking Python to process 6.4 gigabytes of CSV into 6.5 gigabytes of XML1.
As a few out-of-memory experiences will teach you, the trick for dealing with large files is pretty easy: use code that treats everything as a stream. For inputs, read from disk in chunks. For outputs, frequently write to disk and let system memory forge onward unburdened.
When reading and writing files yourself, this is easier to do correctly...
from __future__ import with_statement # for python 2.5
with open('data.in','r') as fin:
with open('data.out','w') as fout:
for line in fin:
fout.write(','.join(line.split(' ')))
...than it is to do incorrectly...
with open('data.in','r') as fin:
data = fin.read()
data2 = [ ','.join(x.split(' ')) for x in data ]
with open('data.out','w') as fout:
fout.write(data2)
...at least in simple cases.
Loading Large CSV Files in Python
Python has an excellent csv library, which can handle large files right out of the box. Sort of.
>> import csv
>> r = csv.reader(open('doc.csv', 'rb'))
>>> for row in r:
... print row
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: field larger than field limit (131072)
Staring at the module documentation2, I couldn't find anything
of use. So I cracked open the csv.py file and confirmed what the _csv
in the error message suggests: the bulk of the module's code (and the
input parsing in particular) is implemented in C rather than Python.
After a while staring at that error, I began dreaming of
how I would create a stream pre-processor using StringIO,
but it didn't take too long to figure out I would need to recreate
my own version of csv in order to accomplish that.
So back to the blogs, one of which held the magic
grain of information I was looking for: csv.field_size_limit.
>>> import csv
>>> csv.field_size_limit()
131072
>>> csv.field_size_limit(1000000000)
131072
>>> csv.field_size_limit()
1000000000
Yep. That's all there is to it. The sucker just works after that.
Well, almost. I did run into an issue with a NULL byte 1.5 gigs into the data. Because the streaming code is written using C based IO, the NULL byte shorts out the reading of data in an abrupt and non-recoverable manner. To get around this we need to pre-process the stream somehow, which you could do in Python by wrapping the file with a custom class that cleans each line before returning it, but I went with some command line utilities for simplicity.
cat data.in | tr -d '\0' > data.out
After that, the 6.4 gig CSV file processed without any issues.
Creating Large XML Files in Python
This part of the process, taking each row of csv and converting
it into an XML element, went fairly smoothly thanks to the
xml.sax.saxutils.XMLGenerator class. The API for creating
elements isn't an example of simplicity, but it is--unlike many
of the more creative schemes--predictable, and has one killer
feature: it correctly writes output to a stream.
As I mentioned, the mechanism for creating elements was a bit
verbose, so I made a couple of wrapper functions to simplify
(note that I am sending output to standard out, which lets me
simply print strings to the file I am generating, for example
creating the XML file's version declaration).
import sys
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl
g = XMLGenerator(sys.stdout, 'utf-8')
def start_tag(name, attr={}, body=None, namespace=None):
attr_vals = {}
attr_keys = {}
for key, val in attr.iteritems():
key_tuple = (namespace, key)
attr_vals[key_tuple] = val
attr_keys[key_tuple] = key
attr2 = AttributesNSImpl(attr_vals, attr_keys)
g.startElementNS((namespace, name), name, attr2)
if body:
g.characters(body)
def end_tag(name, namespace=None):
g.endElementNS((namespace, name), name)
def tag(name, attr={}, body=None, namespace=None):
start_tag(name, attr, body, namespace)
end_tag(name, namespace)
From there, usage looks like this:
print """<?xml version="1.0" encoding="utf-8'?>"""
start_tag(u'list', {u'id':10})
for item in some_list:
start_tag(u'item', {u'id': item[0]})
tag(u'title', body=item[1])
tag(u'desc', body=item[2])
end_tag(u'item')
end_tag(u'list')
g.endDocument()
The one issue I did run into (in my data) was some
pagebreak characters floating around (^L aka 12 aka x0c)
which were tweaking the XML encoder, but you can strip
them out in a variety of places, for example by rewriting
the main loop:
for item in some_list:
item = [ x.replace('\x0c','') for x in item ]
# etc
Really, the XMLGenerator just worked, even when dealing
with a quite large file.
Performance
Although my script created a different mix of XML elements than the above example, it wasn't any more complex, and had fairly reasonable performance. Processing of the 6.4 gig CSV file into a 6.5 gig XML file took between 19 - 24 minutes, which means it was able to read-process-write about five megabytes per second.
In terms of raw speed, that isn't particularly epic, but performing
a similar operation (was actually XML to XML rather than CSV to XML)
with Perl's XML::Twig it took eight minutes to process a ~100
megabyte file, so I'm pretty pleased with the quality of the Python
standard library and how it handles large files.
The breadth and depth of the standard library really makes Python a joy to work with for these simple one-shot scripts. If only it had Perl's easier to use regex syntax...
This is a peculiar nature of data, which makes it different from media: data files can--with a large system--become infinitely large. Media files, on the other hand, can be extremely dense (a couple of gigs for a high quality movie), but conform to predictable limits.
If you are dealing with large files, you're probably dealing with a company's logs from the last decade or the entire dump of their MySQL database.↩
I really want to like the new Python documentation. I mean, it certainly looks much better, but I think it has made it harder to actually find what I'm looking for. I think they've hit the same stumbling block as the Django documentation: the more you customize your documentation, the greater the learning curve for using your documentation.
I think the big thing is just the incompleteness of the documentation that gives me trouble. They are certain to cover all the important and frequently used components (along with helpful overviews and examples), but the new docs often don't even mention less important methods and objects.
For the time being, I am throwing around a lot more
dircommands.↩
I ended up using the SAX generator code in Python myself when I was doing large file transformations in code_swarm.
yieldmakes stream-based code pleasant and easy in Python. Plus, as you said, you're much more likely to end up with correct xml with an xml generation library than doing it by hand.Likewise, I'd suggest strongly against the
"\n".join(map(",".join,table))method of CSV generation. It works fine until there's a delimiter in your data. Please, think of the person that has to parse your data and use a library for generating data files.Why am I bitching about this? As luck, or perhaps some darker power, would have it, I've been writing a TSV parser*, and it appears as though AppleWorks 6 takes the lazy method and doesn't escape its delimiters, making it actually impossible to deterministically parse its output.
*Why am I not following my own advice and using a library? Apparently noone has ever written a TSV/CSV parser in javascript, so I've hacked together some awful code I should have known to do as a state machine from the beginning.
So.. how are you handling the non-deterministic parsing? If you have a list of headers in the first row, then you can identify the number of expected elements per row. So you know any row with more than N unescaped commas is malformed. What then?
(N/2)-1columns from the front, and the firstN/2columns from the rear, and then make everything inbetween into one column.That's a pretty good outline of the worst-case options, and I'll probably go with option 1 if my back's against the wall. At the moment, no one but me has tried AppleWorks 6 with the tool, so it might be a feature that nobody actually wants. Everyone else is using Excel, and I haven't yet seen it output nonsense, so here's hoping it never comes to that point.
Fundamentally, the lesson is that if you don't escape delimiters, you're going to destroy data.
In case you're curious, the app is at http://mqlx.com/reconciliation/recon.html though it won't be for too much longer.
Well, I'm glad that helped; ironically, I was barking completely up the wrong tree when I wrote that. I was in the process of converting a MySQL dump of all the world's airports into an SQLite database file, and I hit on the idea of trimming off the various sql statements from it and converting it to a tab-delimited text file, then using csvreader to extract the rows and writing them out to the db.
In fact, what happened was that the parsing didn't work and the whole db was stuffed into the first csvreader row, hence the row out of range error.
Throughput is quite good - I'm processing 29mb csv and inputting to a database via django. It takes about 30minutes for 200k+ records (there are a few joined tables on these too).
My app isn't time critical, but it does feel like it could be quicker*.
You need to turn debug off so that django isn't caching EVERY sql query as it goes.
I learned that the hard way trying to import some 9GB of data.
Re: Throwing around a lot more dir commands.
Personally, I'm a fan of the new Sphinx generated python documentation. For exploring docstrings and class attributes in interactive python console, you might try IPython.
In IPython:
Great post... the
withkeyword is killer.http://developer.berlios.de/projects/vos-project/
sorry for spamming, just, in case anyone need it.
THANKS!! saved my day!
You might also want to look at vtd-xml which has the next generation capability for handling huge XmL files,
the extended edition of vtd-xml supports documents up to 256GB in size and partial delay loading of XML