January 25, 2009.
This past week I was doing a lot of XML creation in Python, and I wasn't able to find a script for stripping illegal XML entities from a file using Python that didn't require installing additional packages (which can be a pain without root permissions on the machine).
This script is a simple command-line tool that takes a file as input, strips out the ASCII control characters that are illegal in XML (and also any additional characters you specify), and writes the cleaned text to standard output. A per-line report of what was removed goes to standard error.
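The core idea fits in a few lines. Here is a minimal sketch, assuming Python 3 (the full script in this post targets Python 2); the names `ILLEGAL_XML_CHARS` and `strip_illegal` are made up for illustration.

```python
import re

# ASCII control characters that XML 1.0 does not allow. Tab (0x09),
# newline (0x0A) and carriage return (0x0D) are legal and are kept.
# DEL (0x7F) is legal in XML 1.0; it is listed here only because the
# script's pattern removes it too.
ILLEGAL_XML_CHARS = re.compile("[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def strip_illegal(text):
    """Return (cleaned_text, number_of_characters_removed)."""
    return ILLEGAL_XML_CHARS.subn("", text)

cleaned, removed = strip_illegal("ok\x00 line\x1f\n")
print(repr(cleaned), removed)
```

`subn` returns the substitution count along with the new string, which is what makes the per-line report cheap to produce.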
Usage looks like this (you may specify input via standard input or by specifying a file):
    chmod +x ./strip_xml_entities.py
    ./strip_xml_entities.py some_file.txt > cleaned.txt
    Line 22, removed 1 character.
    Line 27, removed 1 character.
    Line 46, removed 1 character.
    ...
    Line 962, removed 1 character.
    Line 996, removed 1 character.
    Line 1005, removed 1 character.
    Line 1010, removed 1 character.
    Line 1020, removed 1 character.
    Line 1027, removed 1 character.
    Line 1031, removed 1 character.
    Stripped 72 characters from 1032 lines.
    ./strip_xml_entities.py < some_file.txt > cleaned.txt 2> log.txt
If you wish to specify additional characters to remove, use the -c parameter at the command line.
    ./strip_xml_entities.py -c 'abc' a.txt > a2.txt 2> log.txt
    ./strip_xml_entities.py -c '\x6e\xa9' b.txt > b2.txt 2> log.txt
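The value given to -c is spliced straight into a regular-expression character class, which is why escapes like `\x6e` work even though the shell passes them as literal backslash sequences. A rough sketch of that step, assuming Python 3; `BASE_CLASS` and `build_pattern` are hypothetical names, not part of the script.

```python
import re

# The script's built-in control-character ranges, written as regex escapes.
BASE_CLASS = r"\x00-\x08\x0B\x0C\x0E-\x1F\x7F"

def build_pattern(extra=""):
    # `extra` is inserted verbatim into the character class, so regex
    # escapes such as \x6e are interpreted by the regex engine. The flip
    # side: a literal ']', '-' or backslash would need escaping by the caller.
    return re.compile("[%s%s]" % (BASE_CLASS, extra))

# Equivalent of passing -c '\x6e\xa9' (the letter "n" and the copyright sign).
pattern = build_pattern(r"\x6e\xa9")
print(pattern.sub("", "banana \xa9 2009\x00"))
```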
The script deals with streams for both input and output, so it will work successfully on extremely large files (tested on a text file larger than 6 GB with over 42 million lines). As it stands it is "fast enough" for most uses. Here it strips 60,552 characters from 867,072 lines in a 101 MB file:
    time ./strip_xml_entities.py < data.txt > clean.txt 2> log.txt
    real 0m19.242s
    user 0m6.991s
    sys 0m1.407s
If anyone needs more performance, there is some tweaking that can be done, but roughly two-thirds of the time is already being spent on system I/O, so it probably won't get much better.
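One possible tweak, which is not what the script does and which I have not timed: work on raw bytes in fixed-size chunks and delete the unwanted bytes with `bytes.translate`, avoiding the regex engine and the per-line loop. This is a sketch assuming Python 3 and ASCII or UTF-8 input (bytes below 0x80 never occur inside a UTF-8 multi-byte sequence, so deleting them is safe); `DELETE` and `strip_stream` are made-up names. It gives up the per-line report and only counts the total.

```python
import io

# Bytes to delete: the same ASCII control characters the script's regex removes.
DELETE = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D)) + b"\x7f"

def strip_stream(src, dst, chunk_size=1 << 20):
    """Copy src to dst in chunks, dropping DELETE bytes; return the count removed."""
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        # With table=None, translate() only deletes the listed bytes.
        cleaned = chunk.translate(None, DELETE)
        removed += len(chunk) - len(cleaned)
        dst.write(cleaned)
    return removed

src = io.BytesIO(b"good\x00 data\x1f\n")
dst = io.BytesIO()
print(strip_stream(src, dst), dst.getvalue())
```

For real files the streams would be binary ones, for example `sys.stdin.buffer` and `sys.stdout.buffer`.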
    #!/usr/bin/env python
    # Written for Python 2 (reload(sys) and sys.setdefaultencoding do not exist in Python 3).
    import sys, re
    from optparse import OptionParser

    # coarse hack for coercing input to utf-8
    reload(sys)
    sys.setdefaultencoding('utf_8')

    def strip_chars(f, extra=u''):
        # control characters to remove, plus any extras supplied via -c
        remove_re = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F%s]' % extra)
        i = 0          # number of lines read so far
        stripped = 0   # total characters removed
        for line in f:
            i = i + 1
            new_line, count = remove_re.subn('', line)
            if count > 0:
                plur = ((count > 1) and u's') or u''
                sys.stderr.write('Line %d, removed %s character%s.\n' % (i, count, plur))
            sys.stdout.write(new_line)
            stripped = stripped + count
        sys.stderr.write('Stripped %d characters from %d lines.\n' % (stripped, i))

    def main():
        p = OptionParser("usage: strip_xml_entities.py file/to/parse.xml")
        p.add_option('-c', '--chars', dest='chars',
                     help="additional CHARS to strip", metavar="CHARS")
        (options, args) = p.parse_args()
        extra = options.chars or u''
        # if positional arg, use that file, otherwise stdin
        fin = (len(args) and open(args[0], 'r')) or sys.stdin
        strip_chars(fin, extra)
        fin.close()

    if __name__ == '__main__':
        sys.exit(main())
Let me know if there are questions or problems.