Deron Meranda
deron.meranda at gmail.com
Wed Jan 4 14:17:44 EST 2006
> >> 2. html(encode|decode) for translating all defined entity characters;
> >> currently pythons default cgi.escape only translates "&", "<" and
> >> ">", and xml.sax.saxutils.escape is both silly if you're not using
> >> sax for anything else, and not convenient as you have to give it the
> >> translation table.
> >
> > I probably understand your question wrong, but...
> > &<> are the only characters that need escaping, so cgi.escape should
> > be sufficient. International characters can be output using the right
> > codepage in your header, or by using &...; markups.

Remember that cgi.escape can take an optional second parameter, which
when True will also escape the double-quote character ("). That is
quite useful when the string is part of an attribute value within an
(X)HTML tag.

Then, really, you should have no need to escape any other character.
The best technique is to simply output UTF-8 documents, writing the raw
Unicode characters into the document. If you are using unicode strings
in Python code, just use the encode string method to get UTF-8 bytes,
such as:

    u'hello'.encode('utf8')

Using UTF-8, you don't need any entity references at all, other than
the four for &, <, >, and ". Those are exactly the ones cgi.escape
produces when the second parameter is True.

> > > Also a simple method to unescape these is quite useful aswell.

For a simple way, look at the standard htmlentitydefs module. It should
be pretty easy to write a search-replace function which uses those
dictionaries.

Of course, doing it correctly is pretty hard, especially in full XML,
where you may have things like CDATA sections. So usually you need to
use a full XML parser.

--
Deron Meranda
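
A rough sketch of the suggestions in the message above, for illustration
only. It assumes Python 3, where html.escape and html.entities stand in
for the cgi.escape and htmlentitydefs named in the message. The helper
names escape_attr and unescape_entities are made up for this example.
Note that html.escape with quote=True also escapes the single quote,
which the message does not mention.

```python
import re
from html import escape                    # assumed Python 3 stand-in for cgi.escape
from html.entities import name2codepoint   # assumed Python 3 stand-in for htmlentitydefs

# Matches decimal (&#233;), hexadecimal (&#xE9;) and named (&eacute;) references.
_ENTITY_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def escape_attr(text):
    """Escape &, <, > and quotes so text is safe inside an attribute value."""
    return escape(text, quote=True)


def unescape_entities(text):
    """Naive search-and-replace unescape driven by the entity table.

    This is not a parser: it knows nothing about CDATA sections,
    comments or markup, so it only suits plain text fragments.
    """
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2] in ("#x", "#X"):
                return chr(int(ref[2:], 16))
            if ref.startswith("#"):
                return chr(int(ref[1:]))
        except (ValueError, OverflowError):
            return match.group(0)          # out-of-range number: leave as-is
        codepoint = name2codepoint.get(ref)
        if codepoint is None:
            return match.group(0)          # unknown name: leave as-is
        return chr(codepoint)

    # A single pass, so "&amp;lt;" becomes "&lt;" and is not unescaped twice.
    return _ENTITY_RE.sub(replace, text)


# Encode the finished document as UTF-8 instead of using entity references.
utf8_bytes = "h\u00e9llo".encode("utf8")
```

Replacing in a single pass is deliberate: chaining one str.replace call
per entity would turn "&amp;lt;" into "<" rather than "&lt;".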