Stripping Illegal Characters from XML in Python

01/25/2009

This past week I was doing a lot of XML creation in Python, and I couldn't find a script for stripping illegal XML characters from a file that didn't require installing additional packages (which can be a pain without root permissions on the machine).

This script is a simple command-line tool that takes a file as input, strips out the ASCII control characters that are illegal in XML (and also any additional characters you specify), and writes the cleaned text to standard output.
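
As a minimal sketch of what "illegal" means here, assuming Python 3 and the same character ranges the script below removes: among the ASCII control characters, XML 1.0 allows only tab, newline and carriage return. The name strip_illegal is made up for this example.

```python
import re

# ASCII control characters other than tab (\x09), newline (\x0A) and
# carriage return (\x0D). DEL (\x7F) is included to match the script,
# although XML 1.0 only discourages it rather than forbidding it.
ILLEGAL_XML_CHARS = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]')

def strip_illegal(text):
    """Return (cleaned_text, number_of_characters_removed)."""
    return ILLEGAL_XML_CHARS.subn('', text)

cleaned, removed = strip_illegal('name\x00:\tvalue\x1f\n')
print(repr(cleaned), removed)
```

The tab and the trailing newline survive; the NUL and the unit separator do not.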

Usage looks like this (input can come from a file argument or from standard input; the per-line report is written to standard error):

chmod +x ./strip_xml_entities.py
./strip_xml_entities.py some_file.txt > cleaned.txt
Line 22, removed 1 character.
Line 27, removed 1 character.
Line 46, removed 1 character.
...
Line 962, removed 1 character.
Line 996, removed 1 character.
Line 1005, removed 1 character.
Line 1010, removed 1 character.
Line 1020, removed 1 character.
Line 1027, removed 1 character.
Line 1031, removed 1 character.
Stripped 72 characters from 1032 lines.
./strip_xml_entities.py < some_file.txt > cleaned.txt 2> log.txt

If you wish to remove additional characters, pass them with the -c (or --chars) option.

./strip_xml_entities.py -c 'abc' a.txt > a2.txt 2> log.txt
./strip_xml_entities.py -c '\x6e\xa9' b.txt > b2.txt 2> log.txt
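
A sketch of how the extra characters end up in the pattern, assuming Python 3 and the same %s interpolation the script uses. The string is placed inside a regular-expression character class unescaped, so sequences such as \x6e are interpreted by the regex engine, while characters that are special inside a class (], ^, -, \) need a backslash in front. build_pattern is a hypothetical helper name, not part of the script.

```python
import re

BASE = r'\x00-\x08\x0B\x0C\x0E-\x1F\x7F'

def build_pattern(extra=''):
    # extra is dropped into the character class as-is, so regex escapes
    # such as \x6e work, and ']' has to be written as '\]'.
    return re.compile('[%s%s]' % (BASE, extra))

# Literal characters: removes every 'a', 'b' and 'c' (plus the NUL).
print(repr(build_pattern('abc').sub('', 'a cab\x00 ride')))

# Escapes as the shell passes them inside single quotes: \x6e is 'n',
# \xa9 is the copyright sign.
print(repr(build_pattern(r'\x6e\xa9').sub('', 'none \xa9 2009')))
```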

The script reads and writes streams one line at a time, so memory use does not grow with file size and it works on extremely large files (tested on a text file larger than 6 GB with over 42 million lines). As it stands it is "fast enough" for most uses. Here it strips 60,552 characters from 867,072 lines in a 101 MB file:

time ./strip_xml_entities.py < data.txt > clean.txt 2> log.txt
real 0m19.242s
user 0m6.991s
sys 0m1.407s

If anyone needs more performance, there is some tweaking that can be done, but roughly two-thirds of the wall-clock time is already spent outside the script's own code (system calls and waiting on I/O), so it probably won't get much better.
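
One possible tweak, sketched here under the assumption of Python 3 and byte-oriented input: bytes.translate with a delete argument removes a fixed set of bytes without going through the regex engine, and reading fixed-size chunks avoids per-line overhead. It gives up the per-line report, so it is a different trade-off and not a drop-in replacement. strip_stream and the chunk size are illustrative choices, and I have not timed it against the script.

```python
import io

# Bytes removed by the script's default pattern: control characters
# other than tab, newline and carriage return, plus DEL.
DELETE = bytes(
    b for b in range(0x80)
    if (b < 0x20 and b not in (0x09, 0x0A, 0x0D)) or b == 0x7F
)

def strip_stream(src, dst, chunk_size=1 << 20):
    """Copy src to dst in chunks, deleting the bytes in DELETE.

    Returns the number of bytes removed.
    """
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        cleaned = chunk.translate(None, DELETE)
        removed += len(chunk) - len(cleaned)
        dst.write(cleaned)
    return removed

# In-memory streams stand in for real files here.
src = io.BytesIO(b'ok\x00line\n\x1fnext\tline\x7f\n')
dst = io.BytesIO()
count = strip_stream(src, dst)
print(count, dst.getvalue())
```

This is safe on UTF-8 input because bytes below 0x80 never occur inside a multi-byte sequence. For a command-line filter, sys.stdin.buffer and sys.stdout.buffer could be passed as src and dst.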

#!/usr/bin/env python
import sys, re, codecs
from optparse import OptionParser

# coarse hack for coercing input to utf-8 (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf_8')

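def strip_chars(f,extra=u''):
    # control characters other than tab, newline and carriage return,
    # plus DEL; extra is placed inside the character class unescaped
    remove_re = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F%s]'
                           % extra)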
    i = 0
    stripped = 0
    for line in f:
        i = i + 1
        new_line, count = remove_re.subn('', line)
        if count > 0:
            plur = ((count > 1) and u's') or u''
            sys.stderr.write('Line %d, removed %s character%s.\n'
                             % (i, count, plur))
        sys.stdout.write(new_line)
        stripped = stripped + count
    # i now equals the number of lines read
    sys.stderr.write('Stripped %d characters from %d lines.\n'
                     % (stripped, i))

def main():
    p = OptionParser("usage: strip_xml_entities.py file/to/parse.xml")
    p.add_option('-c','--chars',dest='chars',
                 help="additional CHARS to strip",
                 metavar="CHARS")
    (options, args) = p.parse_args()
    extra = options.chars or u''

    # if positional arg, use that file, otherwise stdin
    fin = open(args[0], 'r') if args else sys.stdin
    strip_chars(fin, extra)
    fin.close()

if __name__ == '__main__':
    sys.exit(main())
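
The reload(sys) / sys.setdefaultencoding hack above only exists on Python 2. For anyone on Python 3, here is a rough sketch of the same loop. It assumes text-mode streams and takes them as parameters so it can be tried without touching real files; the parameter names fin, fout and log are choices made for this example, not part of the original script.

```python
import io
import re

def strip_chars(fin, fout, log, extra=''):
    """Copy fin to fout line by line, removing illegal XML characters."""
    remove_re = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F%s]' % extra)
    lines = 0
    stripped = 0
    for lines, line in enumerate(fin, 1):
        new_line, count = remove_re.subn('', line)
        if count:
            log.write('Line %d, removed %d character%s.\n'
                      % (lines, count, 's' if count > 1 else ''))
        fout.write(new_line)
        stripped += count
    log.write('Stripped %d characters from %d lines.\n' % (stripped, lines))
    return stripped

# In-memory streams stand in for stdin, stdout and stderr.
fin = io.StringIO('clean\nbad\x00\x01\nalso\x1f bad\n')
fout, log = io.StringIO(), io.StringIO()
strip_chars(fin, fout, log)
print(fout.getvalue(), log.getvalue(), sep='')
```

With real input this could be called as strip_chars(sys.stdin, sys.stdout, sys.stderr), or with a file opened via open(path, encoding='utf-8') in place of sys.stdin.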

Let me know if there are questions or problems.

All Rights Reserved, Will Larson 2007 - 2014.