[mod_python] StringField encoding?

Sun Jun 19 14:09:57 EDT 2005

Nicolas Lehuen wrote:
> Two things to remember:
> 
> 1) str are sequences of bytes (think bytes, not chars)

This clears up a misconception I had. Thanks.

> 2) unicode strings are sequences of unicode characters
> 
> An encoding is simply a mapping (sometimes an incomplete mapping)
> between the two forms.
> 
> The UTF-8 encoding of a unicode string is a sequence of bytes that can
> perfectly well be stored in a str. There is no information loss; you could
> store Japanese kanji or even Klingon in UTF-8 form in a str instance.
> In fact, any binary encoding of unicode can be stored into a str,
> since a str can contain any binary content.
> 
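
A minimal sketch of the round trip described above. It is written for
Python 3, which is an assumption on my part: there, bytes plays the role of
the str discussed in this thread and str plays the role of unicode. The
variable names are illustrative only.

```python
# Python 3 sketch: "bytes" here corresponds to the Python 2 "str" in this
# thread, and "str" corresponds to the Python 2 "unicode".
text = "caf\u00e9 \u6f22\u5b57"  # "cafe" with an accented e, then two kanji

# Encoding maps characters to bytes; the result is plain binary data.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")

# Decoding with the same encoding gets the original characters back.
roundtrip_utf8 = utf8_bytes.decode("utf-8")
roundtrip_utf16 = utf16_bytes.decode("utf-16")

# 7 characters, but 12 bytes in UTF-8: think bytes, not chars.
char_count = len(text)
byte_count = len(utf8_bytes)
```

The two lengths differ because the accented letter takes two bytes in UTF-8
and each kanji takes three.
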
> There are two ways of losing information:
> 
> 1) You pick an encoding which doesn't know how to encode your unicode
> characters. For example, the ASCII encoding won't be able to encode
> European accented characters, not to mention Japanese kanji.
> Fortunately, UTF-8 and UTF-16 can encode any unicode string, so there
> should not be any encoding problem here.
> 
> 2) You forget what encoding was used to get the sequence of bytes
> found in a str and try to decode it with the wrong encoding. This may
> be the problem you have with StringField: you get a sequence of bytes,
> but you don't know the encoding you should use (the parameter you
> should pass to str.decode()). If you think that Apache and mod_python
> are passing you UTF-8 encoded strings, then decode("UTF8") should be
> sufficient. If it's not, then it means that the encoding used by
> Apache and/or mod_python is not UTF8...
> 
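
The two failure modes above can be reproduced in a few lines. Again this is
only a sketch in Python 3 (bytes standing in for the str of this thread), and
the choice of Latin-1 as the "wrong" encoding is just an example I picked,
not something Apache or mod_python is known to use.

```python
text = "\u00e9t\u00e9"  # French "ete" with both e's accented

# 1) The chosen encoding cannot represent the characters at all.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# 2) The bytes are fine, but they are decoded with the wrong encoding.
raw = text.encode("utf-8")     # what arrives over the wire, as bytes
wrong = raw.decode("latin-1")  # no exception, but the text is garbled
right = raw.decode("utf-8")    # correct, because it matches the encoder

# The opposite mistake usually fails loudly instead of garbling silently.
latin1_raw = text.encode("latin-1")
try:
    latin1_raw.decode("utf-8")
    utf8_decode_failed = False
except UnicodeDecodeError:
    utf8_decode_failed = True
```

Note that the second case raises nothing when decoding as Latin-1, since
every byte value is valid there; the only symptom is wrong characters.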

Ok, now I see the light. Now I understand exactly what is going on with the 
string and I can decode('utf-8') it. Thanks again :)

-- 
dharana