This past week I was generating a lot of XML in Python,
and I couldn't find a Python script for stripping illegal XML
characters from a file that didn't require installing
additional packages (which can be a pain without root permissions
on the machine).
This script is a simple command-line tool that takes a
file as input, strips out the ASCII characters that are
illegal in XML (and also any additional characters you
specify), and writes the cleaned text to standard output.
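To make the stripping step concrete, here is a minimal sketch of the idea in Python 3. It is not the script itself: the names `ILLEGAL_XML_ASCII`, `make_delete_table`, and `strip_illegal` are made up for illustration, and it assumes the illegal set is the ASCII control characters that XML 1.0 excludes (everything below 0x20 except tab, line feed, and carriage return).

```python
# ASCII control characters that XML 1.0 does not allow: every code point
# below 0x20 except tab (0x09), line feed (0x0A) and carriage return (0x0D).
# Assumption: this is the set the script removes by default.
ILLEGAL_XML_ASCII = "".join(
    chr(code) for code in range(0x20) if code not in (0x09, 0x0A, 0x0D)
)


def make_delete_table(extra_chars=""):
    """Build a str.translate table that deletes illegal and extra characters."""
    return {ord(ch): None for ch in ILLEGAL_XML_ASCII + extra_chars}


def strip_illegal(text, table):
    """Return the cleaned text and the number of characters removed."""
    cleaned = text.translate(table)
    return cleaned, len(text) - len(cleaned)


if __name__ == "__main__":
    # Hypothetical example: also remove the pipe character.
    table = make_delete_table(extra_chars="|")
    print(strip_illegal("a\x00b|c\x0bd\n", table))  # ('abcd\n', 3)
```

Mapping a code point to `None` in a `str.translate` table deletes it, so the removed count is just the difference in length.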
Usage looks like this (input can come either from a file
named on the command line or from standard input):

    chmod +x ./strip_xml_entities.py
    ./strip_xml_entities.py some_file.txt > cleaned.txt
    Line 22, removed 1 character.
    Line 27, removed 1 character.
    Line 46, removed 1 character.
    Line 962, removed 1 character.
    Line 996, removed 1 character.
    Line 1005, removed 1 character.
    Line 1010, removed 1 character.
    Line 1020, removed 1 character.
    Line 1027, removed 1 character.
    Line 1031, removed 1 character.
    Stripped 72 characters from 1032 lines.

    ./strip_xml_entities.py < some_file.txt > cleaned.txt 2> log.txt

If you wish to specify additional characters to remove,
use the -c parameter at the command line.
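A command line like the one described could be parsed with the standard `argparse` module, roughly as sketched below. This is a guess at the interface, not the script's actual code: it assumes `-c` takes the extra characters as a single literal string and that a missing file argument means "read standard input". The names `parse_args`, `extra_chars`, and `infile` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical command line of the form: [-c CHARS] [FILE]."""
    parser = argparse.ArgumentParser(
        description="Strip characters that are illegal in XML."
    )
    # Assumption: -c takes the additional characters as one literal string.
    parser.add_argument("-c", dest="extra_chars", default="",
                        help="additional characters to remove")
    # Assumption: with no FILE argument, input comes from standard input.
    parser.add_argument("infile", nargs="?", default=None,
                        help="input file (default: standard input)")
    return parser.parse_args(argv)
```

Under those assumptions, `parse_args(["-c", "|~", "some_file.txt"])` would yield `extra_chars == "|~"` and `infile == "some_file.txt"`.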
The script streams both its input and its output, so it
works on extremely large files (tested on a text file
larger than 6 GB with over 42 million lines). As it
stands it is "fast enough" for most uses. Here it strips
60,552 characters from 867,072 lines in a 101 MB file:

    time ./strip_xml_entities.py < data.txt > clean.txt 2> log.txt

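The streaming approach described above can be sketched as a loop that holds only one line in memory at a time. This is an illustration in Python 3 rather than the script's own code: `strip_stream` is a made-up name, the log messages simply imitate the output shown earlier, and it assumes the input can be decoded as text in the stream's encoding.

```python
import io
import sys


def strip_stream(infile, outfile, log, illegal_chars):
    """Copy infile to outfile line by line, deleting illegal_chars.

    Only one line is held in memory at a time, so the size of the
    input file does not affect memory use.
    """
    table = {ord(ch): None for ch in illegal_chars}
    total_removed = 0
    line_count = 0
    for line_count, line in enumerate(infile, start=1):
        cleaned = line.translate(table)
        removed = len(line) - len(cleaned)
        if removed:
            total_removed += removed
            plural = "" if removed == 1 else "s"
            log.write("Line %d, removed %d character%s.\n"
                      % (line_count, removed, plural))
        outfile.write(cleaned)
    log.write("Stripped %d characters from %d lines.\n"
              % (total_removed, line_count))
    return total_removed, line_count


if __name__ == "__main__":
    # Small in-memory demo; a real run would pass sys.stdin or an open file.
    source = io.StringIO("ok\nb\x00ad\n")
    strip_stream(source, sys.stdout, sys.stderr, "\x00")
```

Writing the per-line report to a separate log stream (standard error in the demo) keeps it out of the cleaned output, which is what makes the `2> log.txt` redirection possible.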
If anyone needs more performance, there is some tweaking that
can be done, but two-thirds of the time is already spent
on system I/O, so it probably won’t get much better.