Deron Meranda
deron.meranda at gmail.com Wed Jan 4 14:17:44 EST 2006
> >> 2. html(encode|decode) for translating all defined entity characters;
> >> currently Python's default cgi.escape only translates "&", "<" and
> >> ">", and xml.sax.saxutils.escape is both silly if you're not using
> >> sax for anything else, and not convenient as you have to give it the
> >> translation table.
> >
> >
> > I probably understand your question wrong, but...
> > &<> are the only characters that need escaping, so cgi.escape should
> > be sufficient. International characters can be output using the right
> > codepage in your header, or by using &...; markups.
Remember that cgi.escape can take an optional second parameter,
which when True will also escape the double-quote, ", character.
That is quite useful when the string is part of an attribute value
within an (X)HTML tag.  Then, really, you should have no need to escape
any other character.
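
A rough sketch of the two cases above. It assumes Python 3, where cgi.escape
has since been removed; html.escape is used as a stand-in because it takes a
similar quote argument (note that with quote=True it also escapes the single
quote). The helper name link is made up for illustration.

```python
from html import escape  # assumed Python 3 stand-in for cgi.escape

# Text content: only &, <, and > are escaped; quotes are left alone.
body = escape('AT&T <rocks> "really"', quote=False)

def link(url, label):
    """Hypothetical helper: escape the attribute value and the text separately."""
    return '<a href="%s">%s</a>' % (escape(url, quote=True),
                                    escape(label, quote=False))

tag = link('/search?q="x"&lang=en', 'Q & A')
```

Here body keeps its double quotes, while the href value in tag has them
turned into &quot; so they cannot end the attribute early.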
The best technique is to simply output UTF-8 documents, writing
the raw unicode characters into the document.  If using unicode
strings in python code, just use the encode string method to get
it into utf8, such as: u'hello'.encode('utf8')
Using UTF-8, you don't need any entity references at all, other than the
four: &, <, >, and " -- which is what cgi.escape does when its second
parameter is True.
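
As a minimal sketch of that approach, assuming Python 3 (where every str is
already Unicode and html.escape stands in for cgi.escape): escape only the
markup characters, leave the accented letter alone, and encode the finished
text to UTF-8 at the very end. The variable names are only illustrative.

```python
from html import escape  # assumed Python 3 stand-in for cgi.escape

text = 'Caf\u00e9 prices: 3 < 5 & 5 > 3'

# Escape only the markup-significant characters; the e-acute stays as is.
fragment = '<p>%s</p>' % escape(text, quote=False)

# Encode the whole document to UTF-8 bytes just before writing it out.
payload = fragment.encode('utf8')
```

The resulting bytes contain the raw two-byte UTF-8 sequence for the accented
letter and no &eacute; reference. The page still has to declare UTF-8 as its
charset for a browser to decode it correctly.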
> > > Also a simple method to unescape these is quite useful as well.
For a simple way, look at the standard htmlentitydefs module.
It should be pretty easy to write a search-replace function which
uses those dictionaries.
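
Something along these lines might do as a starting point. It is only a
sketch, written for Python 3, where htmlentitydefs is named html.entities;
the function name unescape_entities and the regular expression are my own
assumptions, and it handles only well-formed references that end in a
semicolon.

```python
import re
from html.entities import name2codepoint  # htmlentitydefs in Python 2

# Matches &name;, &#123; and &#x7B; style references.
_ENTITY_RE = re.compile(r"&(#[xX]?[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def unescape_entities(text):
    """Replace entity references in one pass; leave unknown ones untouched."""
    def replace(match):
        body = match.group(1)
        if body.startswith("#"):
            try:
                if body[1] in "xX":
                    return chr(int(body[2:], 16))
                return chr(int(body[1:], 10))
            except (ValueError, OverflowError):
                return match.group(0)  # malformed or out-of-range number
        codepoint = name2codepoint.get(body)
        return chr(codepoint) if codepoint is not None else match.group(0)
    return _ENTITY_RE.sub(replace, text)
```

Doing the substitution in a single pass matters: the text &amp;lt; must come
out as &lt; and not as <, which is what chained str.replace calls would
produce. Newer Python 3 versions also ship html.unescape, which covers more
cases than this sketch.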
Of course, doing it correctly is pretty hard, especially in full XML,
where you may have things like CDATA sections, inside which entity
references are not expanded.  So usually you need to use a full XML parser.
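
To illustrate the CDATA point, here is a small sketch using
xml.etree.ElementTree, chosen only because it is in the standard library of
later Python versions; any conforming XML parser should behave the same way.
The sample documents are made up.

```python
import xml.etree.ElementTree as ET

# Entity references are expanded in normal text, but CDATA is taken literally.
doc = '<p>1 &lt; 2<![CDATA[ && 3 < 4]]></p>'
text = ET.fromstring(doc).text

# Inside CDATA, "&lt;" is four ordinary characters, not a reference.
literal = ET.fromstring('<p><![CDATA[&lt;]]></p>').text
```

A plain search-replace over the raw document would wrongly turn that second
&lt; into <, while the parser knows to leave it alone.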
--
Deron Meranda