Deron Meranda
deron.meranda at gmail.com
Mon Aug 28 15:54:39 EDT 2006
On 8/28/06, Julien Cigar <jcigar at ulb.ac.be> wrote:
> On the project I'm currently working on, everything is in unicode :
> - locales on the server (LANG=en_US.UTF-8)
> - the PostgreSQL database

Always remember that a Unicode string is an abstract concept. A UTF-8
encoded byte stream is one (among many) possible concrete representations
of a Unicode string. In the Python language, the 'unicode' type is also
abstract. You have to (currently) convert it to a 'str' type to be able
to do any I/O on it.

> I'm using the Psycopg2 module to interact with PostgreSQL, and SimpleTAL
> for the template engine.
> Those two libraries requires type unicode instead of type str, otherwise
> I get errors (ContextContentException: Found non-unicode string in
> Context! for SimpleTal, and a "Can't adapt ...." error with psycopg2).
> It's still a little obscure for me why it doesn't work with type str ...

I'm not familiar with SimpleTal, but I suspect that it is trying to
prevent the caller from making dumb mistakes. By only accepting 'unicode'
strings it is in full control over any encoding representation it needs
to do internally to talk to PostgreSQL. If you pass it just a 'str' it
has to assume that the caller has already performed some sort of
encoding, which could be a rather unsafe assumption.

> The solution I found (which works) was to .decode('utf-8') or
> unicode(mystr, 'utf-8') the POSTed data, but I wondered if it's not
> dangerous or incorrect to do like that ?

Encoding unicode into a UTF-8 str is always safe:

    u'\u2022'.encode('utf8')  ->  '\xe2\x80\xa2'

Decoding a UTF-8 str back into unicode is sometimes safe:

    '\xe2\x80\xa2'.decode('utf8')  ->  u'\u2022'

but

    '\xf2\x81\x88'.decode('utf8')  ->  UnicodeDecodeError

Also, depending on how your Python was compiled, it may only be able to
represent the BMP portion of the Unicode defined characters (the first
65536 characters).
So this may or may not work:

    '\xf4\x8f\xbf\xbf'.decode('utf8')  ->  u'\U0010ffff'

Doing the decode the way you are doing it is probably the best thing
to do. But do be aware that it is possible for a malicious UA to
intentionally send you bogus data which would cause the decode to fail.
Either just let the UnicodeDecodeError bubble up, or catch it and send
back an HTTP 400 error (or something else).

Another thing you need to consider is trying to make sure that the
POSTed data is in fact UTF-8 encoded to begin with. Unfortunately,
although HTTP provides a way to explicitly send the encoding to the
browser (via the charset parameter of the Content-Type header), there
is not a very good way for the reverse direction. Actually it could
have been done, but in practice it's not.

Usually the best insurance is to just always send all HTML or XML pages
to the browser already in UTF-8 whether they need to be or not. Do this
and you'll rarely get bit. By default, browsers encode POSTed text data
in the same character encoding as the page they received which contains
the <form> element. Although there is also the accept-charset attribute
of <form>, it's best not to try to use it.

> To my knowledge, Apache does not make conversion of encoding,
> so it should be done at the mod_python level, right ?

Conversion should actually be at the application layer. The content
is really just a sequence of bytes. The interpretation of those into
characters is something that Apache should not do, and it is
questionable even if mod_python should do it (except perhaps for the
PSP template part of mod_python rather than the core part). As far as
Apache or mod_python are concerned the content could even be something
other than text, so character conversion may not even be a defined
operation.

If your byte streams are UTF-8 encoded (which you can control via the
Content-Type header, etc.) then using the .decode() and .encode()
methods is perhaps the best way to convert them to/from python unicode
objects.
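A minimal sketch of the catch-and-return-400 option described above,
assuming the raw POSTed value is available to the application as a byte
string. The name decode_post_field and its (status, text) return value
are hypothetical and not part of mod_python. Bytes literals (b'...') are
used so that the sketch runs on Python 2.6 and later, including versions
where str is already a Unicode type.

```python
# Illustrative only: decode_post_field is a made-up helper, and the
# (status, text) return shape is an assumption, not a mod_python API.

def decode_post_field(raw, encoding="utf-8"):
    """Decode one POSTed field given as bytes; return (status, text).

    status is 200 together with the decoded text, or 400 together with
    None when the bytes are not valid in the expected encoding.
    """
    try:
        return 200, raw.decode(encoding)
    except UnicodeDecodeError:
        # Malformed input (possibly sent on purpose): treat it as a
        # bad request instead of letting the exception bubble up.
        return 400, None


ok = decode_post_field(b"\xe2\x80\xa2")    # the valid bullet example
bad = decode_post_field(b"\xf2\x81\x88")   # the truncated sequence
```

In a real handler the 400 status would be turned into whatever error
response the application normally sends.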
If you're using a web framework on top of mod_python, then any such
conversion may be done or simplified by that layer.

IMHO, the real source of confusion is that Python's str type is quite
ill-defined. Really there should only be one character string type
(basically the same as the unicode type) and a byte string type, which
is not mapped to characters in any way at all. There has been periodic
discussion on the python dev list, but for now we have to live with
the str type.

> Is there a cleaner solution, which works in all cases ?

Not really. At least not without being quite risky and prone to subtle
errors. Of course being Python, you can write your own simplifying
wrappers or decorators since you know your application's needs.

--
Deron Meranda