This past week I was generating a lot of XML in Python, and I couldn't find a script for stripping illegal XML characters from a file that didn't require installing additional packages (which can be a pain without root permissions on the machine).
This script is a simple command-line tool that takes a file as input, strips out the ASCII control characters that are illegal in XML (plus any additional characters you specify), and writes the cleaned text to standard output. Progress messages go to standard error, so they never mix with the cleaned text.
Usage looks like this (input can come from standard input or from a file named on the command line):
chmod +x ./strip_xml_entities.py
./strip_xml_entities.py some_file.txt > cleaned.txt
Line 22, removed 1 character.
Line 27, removed 1 character.
Line 46, removed 1 character.
...
Line 962, removed 1 character.
Line 996, removed 1 character.
Line 1005, removed 1 character.
Line 1010, removed 1 character.
Line 1020, removed 1 character.
Line 1027, removed 1 character.
Line 1031, removed 1 character.
Stripped 72 characters from 1032 lines.
./strip_xml_entities.py < some_file.txt > cleaned.txt 2> log.txt
If you wish to specify additional characters to remove, use the -c option on the command line.
./strip_xml_entities.py -c 'abc' a.txt > a2.txt 2> log.txt
./strip_xml_entities.py -c '\x6e\xa9' b.txt > b2.txt 2> log.txt
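The second example works because the shell passes the script the literal text \x6e\xa9 (backslashes included), and the script drops that text straight into a regular-expression character class, where the re module decodes the \xNN escapes itself (here the letter n and the copyright sign). A minimal sketch of that behavior, written for Python 3 purely for illustration (the script itself targets Python 2); build_remover is a hypothetical helper name, not something the script defines:

```python
import re

# Same base character class as the script, with room for caller-supplied extras.
BASE_CLASS = r'\x00-\x08\x0B-\x0C\x0E-\x1F\x7F'

def build_remover(extra=''):
    """Compile a pattern matching the base control characters plus `extra`.

    `extra` is placed inside [...] unescaped, so regex escapes such as \\x6e
    are decoded by the re module, and characters like ']' or '^' would have
    to be escaped by the caller.
    """
    return re.compile('[%s%s]' % (BASE_CLASS, extra))

# What the shell hands over for -c '\x6e\xa9': literal backslashes.
arg = r'\x6e\xa9'
remover = build_remover(arg)

# Removes every 'n', every copyright sign, and the \x07 bell character.
cleaned = remover.sub('', 'n\u00a9 banana \x07bell')
print(repr(cleaned))
```

The flip side is that -c takes a character-class fragment rather than a plain list of characters, so a value such as 'a-z' would be read as a range.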
The script reads and writes streams, so it works on extremely large files (tested on a text file larger than 6 GB with over 42 million lines). As it stands it is "fast enough" for most uses. Here it strips 60,552 characters from 867,072 lines in a 101 MB file:
time ./strip_xml_entities.py < data.txt > clean.txt 2> log.txt
real 0m19.242s
user 0m6.991s
sys 0m1.407s
If anyone needs more performance, there is some tweaking that can be done, but nearly two-thirds of the wall-clock time above (everything except the 7 seconds of user time) is already going to I/O and system overhead, so it probably won't get much better.
#!/usr/bin/env python
# Written for Python 2.5; reload() and sys.setdefaultencoding() are Python 2 only.
import sys, re
from optparse import OptionParser

# coarse hack for coercing input to utf-8
reload(sys)
sys.setdefaultencoding('utf_8')

def strip_chars(f, extra=u''):
    # Control characters to drop; tab (\x09), LF (\x0A) and CR (\x0D) are kept.
    # \x7F (DEL) is allowed by XML 1.0 but is stripped here too.
    # `extra` is inserted into the character class as-is.
    remove_re = re.compile(u'[\x00-\x08\x0B-\x0C\x0E-\x1F\x7F%s]'
                           % extra)
    i = 1
    stripped = 0
    for line in f:
        new_line, count = remove_re.subn('', line)
        if count > 0:
            plur = ((count > 1) and u's') or u''
            sys.stderr.write('Line %d, removed %s character%s.\n'
                             % (i, count, plur))
        # every line is written out, changed or not
        sys.stdout.write(new_line)
        stripped = stripped + count
        i = i + 1
    # i is one past the last line read, so report i - 1
    sys.stderr.write('Stripped %d characters from %d lines.\n'
                     % (stripped, i - 1))

def main():
    p = OptionParser("usage: strip_xml_entities.py file/to/parse.xml")
    p.add_option('-c', '--chars', dest='chars',
                 help="additional CHARS to strip",
                 metavar="CHARS")
    (options, args) = p.parse_args()
    extra = options.chars or u''
    # if positional arg, use that file, otherwise stdin
    fin = (len(args) and open(args[0], 'r')) or sys.stdin
    strip_chars(fin, extra)
    fin.close()

if __name__ == '__main__':
    sys.exit(main())
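On the performance note above: the post does not say which tweaks were in mind, so the following is only one guess at what such a tweak could look like, not the author's method. It is a Python 3 sketch that deletes the default control bytes with bytes.translate instead of a regular expression. Working on raw bytes is safe for UTF-8 input because these byte values never occur inside a multi-byte sequence. The names DELETE and strip_stream are hypothetical, the extra -c characters are not handled, and I have not benchmarked it against the script.

```python
import io

# Byte values \x00-\x08, \x0B-\x0C, \x0E-\x1F and \x7F: the script's default set.
DELETE = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D)) + b'\x7f'

def strip_stream(src, dst):
    """Copy binary stream src to dst without the DELETE bytes.

    Returns the number of bytes removed.
    """
    removed = 0
    for line in src:
        # translate(None, DELETE) leaves bytes unmapped and drops those in DELETE
        cleaned = line.translate(None, DELETE)
        removed += len(line) - len(cleaned)
        dst.write(cleaned)
    return removed

# Small in-memory demonstration; real use would pass files opened in 'rb'/'wb' mode.
src = io.BytesIO(b'ok\x00\n\ttab\x1f\x7f\n')
dst = io.BytesIO()
removed = strip_stream(src, dst)
print(removed, dst.getvalue())
```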
Let me know if there are questions or problems.
This is for... python 2.5? or 2.6? I'm pretty sure it's not 3.0...
This is written for 2.5. It is not 3.0, which doesn't seem to be in heavy usage yet.
Could you please explain why these characters are stripped? Why are these the exact set of illegal characters in XML? Do you have a link?
Thanks
Take a look here.
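For reference, the XML 1.0 specification defines the legal characters in its Char production: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. Everything below #x20 other than tab, line feed and carriage return is therefore illegal, and that is the core of the set the script strips. #x7F is permitted by that production, though the script removes it as well. A small Python 3 sketch of the rule, where is_xml_char is a hypothetical name used only for illustration:

```python
def is_xml_char(ch):
    """Return True if the one-character string ch matches XML 1.0's Char production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

# The ASCII code points that XML 1.0 rejects: 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F.
illegal_ascii = [cp for cp in range(0x80) if not is_xml_char(chr(cp))]
print([hex(cp) for cp in illegal_ascii])
```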
First off, I am new to Python.
I am having trouble with an XML file that for some reason contains weird characters. I ran your script on it, but it did not find any of them, even though I know for a fact that the file contains some invalid characters. It's an InDesign INX file, and I need to find those illegal characters and eliminate them. A document I have states that the disallowed characters in INX are:
0x0000-0x0008 0x000B 0x000C 0x000E-0x001F 0xD800-0xDFFF
Is there a way to change your script to also check for that last range and eliminate everything that falls within it? I've been trying some things, but I only manage to generate errors in the script... oops...
Thanks in advance,
Andreas
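One way to cover the extra 0xD800-0xDFFF range, sketched for Python 3 rather than the 2.5 the script was written for. It assumes the file has already been decoded to text, for instance with open(path, encoding='utf-8', errors='surrogatepass') so that stray surrogates survive decoding instead of raising an error. As far as I can tell the original script matches against undecoded input, so this range cannot simply be appended to its pattern. INX_ILLEGAL and strip_inx are hypothetical names:

```python
import re

# The ranges quoted from the INX document: control characters plus surrogates.
# The raw string lets the re module decode the \x and \u escapes itself.
INX_ILLEGAL = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF]')

def strip_inx(text):
    """Return (cleaned_text, number_removed) for an already-decoded string."""
    return INX_ILLEGAL.subn('', text)

# One control character and two lone surrogates are removed; the tab is kept.
cleaned, removed = strip_inx('a\x01b\ud800c\udfffd\te')
print(repr(cleaned), removed)
```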
Thanks for the code. I get a MemoryError after a few minutes of processing. I am running it on an XML file that is about 2 GB in size, on a computer with 16 GB of RAM.
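A possible cause, assuming the file contains very long lines (XML written without line breaks can be one multi-gigabyte line): the script iterates line by line, so each line has to fit in memory together with its cleaned copy. Below is a hedged Python 3 sketch that reads fixed-size binary chunks instead. Because every match is a single byte, a chunk boundary can never split one. The trade-off is that per-line reporting is lost. strip_chunks and the 1 MiB default chunk size are illustrative choices, not part of the original script:

```python
import io
import re

# The script's default set, expressed as a bytes pattern.
CONTROL = re.compile(rb'[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]')

def strip_chunks(src, dst, chunk_size=1024 * 1024):
    """Copy binary stream src to dst in chunks, dropping control bytes.

    Returns the number of bytes removed.
    """
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty bytes object means end of stream
            break
        cleaned, count = CONTROL.subn(b'', chunk)
        dst.write(cleaned)
        removed += count
    return removed

# Demonstration with a tiny chunk size so the input spans several reads.
src = io.BytesIO(b'<a>\x00one long line\x1f with no newline</a>')
dst = io.BytesIO()
removed = strip_chunks(src, dst, chunk_size=8)
print(removed, dst.getvalue())
```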