[mod_python] UnicodeDecodeError with util.FieldStorage(req).Field.value

Sat Jun 23 14:13:32 EDT 2007

Have seen that Graham has responded now, but a few things need clarifying.

On 20/06/07, Anastasios Hatzis <ah at hatzis.de> wrote:
> So, when calling this page the HTML output for first section is rendered with
> umlaut (ä, ü, ...). Value is <type 'str'> ... well, why not, as long as it is
> UTF-8...

Important to remember that UTF-8 is an encoding, not the same thing as
a Python 'unicode' object. A UTF-8 string is a byte string ('str'); to
get it into a unicode object, you must decode it.

> I do not understand why I'm getting this error. How does 'ascii' come into
> this?

Call sys.getdefaultencoding() from within mod_python on your server;
you should see that it is 'ascii'. This is the right thing: unless you
specify an explicit codec, Python will use the 'lowest common
denominator' codec (which only handles bytes in range(0, 128) and
raises an exception otherwise) rather than risk silent corruption.

So, let's re-create your problem in the console:

>>> import unicodedata
>>> unicodedata.name(u"\u00e4")
'LATIN SMALL LETTER A WITH DIAERESIS'
>>> u"\u00e4".encode('utf8')
'\xc3\xa4'
>>> u"something" + _
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Whenever you try to treat string objects as unicode objects or vice
versa, Python has to implicitly encode or decode something. Generally
you want to be explicit about which you are using, and not mix them
together. For instance, you might get bytes from the wire, decode them,
do program logic on unicode, and encode again when writing to a file.
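
As a rough sketch of that decode / logic / encode flow (the input bytes
and the output file are made up for illustration, and the code is
written so that it should also run on Python 3):

```python
import io
import os
import tempfile

# Hypothetical bytes as they might arrive from the wire, assumed to be UTF-8.
wire_bytes = b"name: J\xc3\xa4ger\r\n"

# Decode once at the boundary: bytes -> unicode.
text = wire_bytes.decode("utf-8")

# Program logic operates on unicode only.
shouted = text.upper()

# Encode once on the way out: unicode -> bytes.
path = os.path.join(tempfile.mkdtemp(), "out.txt")
with io.open(path, "wb") as fh:
    fh.write(shouted.encode("utf-8"))
```

The point is only that the codec is named explicitly at both edges, so
nothing in between depends on the default encoding.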

> How do I know which encoding is really applied (in param.value and in
> msg)?

This is the fun bit - you don't. If the page containing the form is
served in a specific encoding, the form data will generally be posted
in the same encoding - but you can't rely on this. One thing not to
do is trust that the byte strings you are given are in some encoding,
and then put them straight into an output page, file or database.
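
To illustrate why a guess is not enough, here is a small sketch (the
helper name is_valid_utf8 is made up for this example). A wrong guess
does not always raise an error; sometimes it silently produces garbage:

```python
utf8_bytes = u"\u00e4".encode("utf-8")      # two bytes: 0xc3 0xa4
latin1_bytes = u"\u00e4".encode("latin-1")  # one byte: 0xe4

# Wrong guess, but no exception: latin-1 maps every byte to a character,
# so the two UTF-8 bytes come out as two unrelated characters.
mojibake = utf8_bytes.decode("latin-1")

def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

utf8_ok = is_valid_utf8(utf8_bytes)
latin1_ok = is_valid_utf8(latin1_bytes)
```

A strict UTF-8 decode at least fails loudly on most non-UTF-8 input,
which is why decoding explicitly and handling the error is safer than
passing the bytes through untouched.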

So, while I don't know the specifics of the PSP handler, in general
you should avoid doing:
> msg = u''
> for param in store.list:
>     msg += param.name + ': ' + param.value + '\r\n' # UnicodeDecodeError!

Instead do something like:
from mod_python import apache

buf = []
for param in store.list:
    # still operating on byte strings here
    buf.extend([param.name, ': ', param.value, '\r\n'])
try:
    # decode once, creating msg as a unicode object
    msg = "".join(buf).decode('utf-8')
except UnicodeDecodeError:
    # really need to send a nice message back to the poster of the form
    # telling them to fix their encoding, but this will do instead
    raise apache.SERVER_RETURN(apache.HTTP_BAD_REQUEST)

And then if you need to put msg in the output page, you'd do
msg.encode('utf-8') before writing it out.
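
For trying the whole round trip outside of mod_python, something like
the following should work. It is only a sketch: build_message and the
sample field are invented here, plain (name, value) byte-string pairs
stand in for the FieldStorage entries, and a ValueError stands in for
the HTTP 400 response:

```python
def build_message(pairs):
    """Join (name, value) byte-string pairs and decode the result once.

    Raises ValueError if the joined bytes are not valid UTF-8.
    """
    buf = []
    for name, value in pairs:
        buf.extend([name, b": ", value, b"\r\n"])  # bytes only
    try:
        return b"".join(buf).decode("utf-8")       # unicode from here on
    except UnicodeDecodeError:
        raise ValueError("form data is not valid UTF-8")

# Hypothetical form field holding UTF-8 bytes.
msg = build_message([(b"city", b"Z\xc3\xbcrich")])

# Encode explicitly again just before writing to the page.
page_bytes = msg.encode("utf-8")
```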

Martin