Martin
gzlist at googlemail.com
Sat Jun 23 14:13:32 EDT 2007
Have seen that Graham has responded now, but a few things need clarifying.

On 20/06/07, Anastasios Hatzis <ah at hatzis.de> wrote:
> So, when calling this page the HTML output for first section is rendered with
> umlaut (ä, ü, ...). Value is <type 'str'> ... well, why not, as long as it is
> UTF-8...

Important to remember that the utf-8 encoding is not python 'unicode' -
to get a utf-8 byte-string into a unicode object, you must decode it.

> I do not understand why I'm getting this error. How does 'ascii' come into
> this?

Call sys.getdefaultencoding() from within mod_python on your server; you
should see that it is 'ascii'. This is the right thing - unless you
specify an explicit codec, python will use the 'lowest common
denominator' codec (that only deals with the range(0,128) and throws
an exception otherwise) rather than risk silent corruption.

So, let's re-create your problem in the console:

>>> import unicodedata
>>> unicodedata.name(u"\u00e4")
'LATIN SMALL LETTER A WITH DIAERESIS'
>>> u"\u00e4".encode('utf8')
'\xc3\xa4'
>>> u"something" + _
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Whenever you try to treat string objects as unicode objects or vice
versa, python has to implicitly encode or decode something. Generally
you want to be explicit about which you are using, and not mix them
together. For instance, you might get bytes from the wire, decode, do
program logic on unicode, encode to file.

> How do I know which encoding is really applied (in param.value and in
> msg)?

This is the fun bit - you don't. If the page containing the form is in
a specific encoding, the form data will generally be posted in the same
encoding - but you can't rely on this. One thing not to do is trust
that byte strings you are given are in some encoding, and then put them
straight into the output page/file/database.
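
To make the "bytes from the wire, decode, logic on unicode, encode on
output" idea concrete, here is a minimal sketch. It is written for
Python 3, where the byte string is `bytes` and the unicode object is
`str` (under the Python 2 used in this thread those are `str` and
`unicode`). The names decode_field and render_message are made up for
illustration and are not part of mod_python or PSP, and the utf-8
default is only an assumption about what the form posted.

```python
def decode_field(raw, encoding="utf-8"):
    """Decode one byte string from the wire, failing loudly on bad input."""
    # "strict" means malformed bytes raise UnicodeDecodeError instead of
    # being silently replaced or dropped.
    return raw.decode(encoding, "strict")


def render_message(fields, encoding="utf-8"):
    """Build the message as text, then encode exactly once on the way out.

    `fields` is a list of (name, value) pairs of byte strings, a stand-in
    for whatever the form handler actually hands you (hypothetical shape).
    """
    lines = []
    for raw_name, raw_value in fields:
        # Boundary in: bytes become text here and nowhere else.
        name = decode_field(raw_name, encoding)
        value = decode_field(raw_value, encoding)
        # Program logic works on text only, so nothing is mixed implicitly.
        lines.append(name + ": " + value + "\r\n")
    msg = "".join(lines)
    # Boundary out: text becomes bytes again just before writing.
    return msg.encode(encoding)


if __name__ == "__main__":
    # Well-formed utf-8 input passes straight through.
    good = [(b"city", "K\u00f6ln".encode("utf-8"))]
    print(render_message(good))

    # The same text posted as latin-1 is not valid utf-8 and is rejected.
    bad = [(b"city", "K\u00f6ln".encode("latin-1"))]
    try:
        render_message(bad)
    except UnicodeDecodeError as exc:
        print("rejected:", exc.reason)
```

The point of the two explicit boundaries is that a wrongly encoded
submission fails at the decode step, where you can still reject the
request, rather than ending up as garbage in the page or database.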

So, while I don't know the specifics of the PSP handler, in general you
should avoid doing:

> msg = u''
> for param in store.list:
>     msg += param.name + ': ' + param.value + '\r\n' # UnicodeDecodeError!

Instead do something like:

from mod_python import apache # needed for SERVER_RETURN and HTTP_BAD_REQUEST

buf = []
for param in store.list:
    buf.extend([param.name, ': ', param.value, '\r\n']) # operating on bytes
try:
    msg = "".join(buf).decode('utf-8') # creating msg as a unicode object
except UnicodeDecodeError:
    # really need to send a nice message back to the poster of the form
    # telling them to fix their encoding, but this will do instead
    raise apache.SERVER_RETURN(apache.HTTP_BAD_REQUEST)

And then if you need to put msg in the output page, you'd do
msg.encode('utf-8') before writing it out.

Martin