[mod_python] Params submitted using utf-8

Wed Jan 7 07:32:31 EST 2004

On Wed, 2004-01-07 at 02:40, jalil at securia.com wrote:
> - When I read in a parameter value and print the type of the string, it 
> is "str" and not "unicode". I know unicode is not "utf-8" and I think 
> this is fine

Yes, this should be a utf-8 encoded string. Personally I think no
application should convert this data to a unicode object because you
always want to know the encoding it came from. For example utf-8 ->
unicode object -> iso8859-15 could mean loss of characters.
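
A minimal sketch of that loss, assuming Python 3, where the submitted
value is a bytes object (on Python 2 it would be a plain str). The
sample text and the variable names are made up for illustration:

```python
# Assumption: "raw" stands in for a utf-8 encoded form value.
raw = "price: 10 \u20ac, \u4e2d".encode("utf-8")

# utf-8 -> unicode object: nothing is lost at this step.
text = raw.decode("utf-8")

# unicode object -> iso8859-15: the euro sign exists in that charset,
# but the CJK character does not, so it is replaced by "?".
lossy = text.encode("iso8859-15", errors="replace")

# Decoding the iso8859-15 bytes again does not give back the original.
round_trip = lossy.decode("iso8859-15")
```

Without errors="replace" the encode step raises UnicodeEncodeError
instead of silently dropping the character.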

> - When I try to decode the value into "utf-8" and turn it into unicode 
> in python, I get an exception (decoding error - invalid data). Why is 
> that? HTML uses Unicode codepoints and I am sending in utf-8 encoding, 
> so why I get invalid data?

Are you using the unicode constructor or a decode method? Also, a bad
browser could send something that is not valid utf-8, so you should
always expect decoding errors here.
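
One way to expect those errors is sketched below, assuming Python 3
bytes input. decode_param is a hypothetical helper name, not part of
mod_python:

```python
def decode_param(raw, encoding="utf-8"):
    """Decode a submitted value, reporting whether it was well formed.

    Hypothetical helper: returns (text, ok). When the bytes are not
    valid in the given encoding, the bad bytes are replaced with
    U+FFFD and ok is False, so the caller can reject or log the input.
    """
    try:
        return raw.decode(encoding), True
    except UnicodeDecodeError:
        return raw.decode(encoding, errors="replace"), False


good_text, good_ok = decode_param(b"caf\xc3\xa9")  # valid utf-8
bad_text, bad_ok = decode_param(b"caf\xe9")        # latin-1 bytes, not utf-8
```

Whether to reject the request or keep the replaced text when ok is
False is an application decision.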

Here is a nice introduction to the whole encode/decode confusion:

http://www.vandervossen.net/2003/07/unicode_in_python

> Second, although I set the charset in my returned data to utf-8, 
> the browser doesn't select utf-8 as encoding.

If the document isn't utf-8 encoded and the headers claim it is, the
browser could notice the mismatch and switch to the encoding it detects
instead.

> So, I thought maybe I 
> should try to convert into unicode b/f putting the data into my table. 
> Is this right? Should I  do anything b/f storing and/or sending the data?

Yes, you should check the input. You don't have to do this by converting
the string into a unicode object, but you can do it by checking the byte
sequences (see the utf-8 specification for more information).
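
A rough sketch of such a byte-level check follows, assuming the
well-formed byte ranges from the utf-8 specification (RFC 3629).
is_valid_utf8 is a hypothetical name, and this is an illustration
rather than a tuned implementation:

```python
def is_valid_utf8(data):
    """Return True if data (bytes) is well-formed utf-8.

    Hypothetical helper. It walks the bytes and checks each lead byte
    and its continuation bytes, rejecting overlong forms, surrogates
    and code points above U+10FFFF.
    """
    data = bytearray(data)  # indexing yields ints on Python 2 and 3
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # single-byte (ASCII)
            i += 1
            continue
        # need = number of continuation bytes; lo..hi = allowed range
        # for the first continuation byte after this lead byte.
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes surrogates
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlong forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # nothing above U+10FFFF
        else:
            return False                   # stray continuation or bad lead
        if i + need >= n:
            return False                   # sequence is cut off
        if not lo <= data[i + 1] <= hi:
            return False
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

With this, is_valid_utf8(b"caf\xc3\xa9") is True and
is_valid_utf8(b"caf\xe9") is False, and the value never has to be
turned into a unicode object.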

Manfred