[mod_python] StringField encoding?

Nicolas Lehuen nicolas.lehuen at gmail.com
Sun Jun 19 13:42:31 EDT 2005


Two things to remember:

1) A str is a sequence of bytes (think bytes, not characters).
2) A unicode string is a sequence of Unicode characters.

An encoding is simply a mapping (sometimes an incomplete mapping)
between the two forms.

The UTF-8 encoding of a unicode string is a sequence of bytes that fits
perfectly well in a str. There is no information loss: you could
store Japanese kanji or even Klingon in UTF-8 form in a str instance.
In fact, any binary encoding of Unicode can be stored in a str,
since a str can hold arbitrary binary content.
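
To make the round trip concrete, here is a minimal sketch. It uses no mod_python at all, only the built-in encode() and decode() methods, and the sample text (the two kanji for "Japan", written as escapes) is my own choice. It is written so that it runs on Python 2, where the encoded result is a str, and on Python 3, where the same byte-string type is called bytes.

```python
# Two kanji, U+65E5 and U+672C, written as escapes so that the source
# file itself stays plain ASCII.
text = u"\u65e5\u672c"

# unicode -> bytes: the result is a plain byte string
# (str on Python 2, bytes on Python 3).
encoded = text.encode("utf-8")

# bytes -> unicode: decoding with the same encoding restores the text.
decoded = encoded.decode("utf-8")

char_count = len(text)     # counted in characters
byte_count = len(encoded)  # counted in bytes
round_trip_ok = (decoded == text)
```

The character count and the byte count differ (each of these kanji takes three bytes in UTF-8), but nothing is lost: decoding gives back exactly the original unicode string.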

There are two ways of losing information:

1) You pick an encoding that cannot represent your unicode
characters. For example, the ASCII encoding cannot encode
European accented characters, let alone Japanese kanji.
Fortunately, UTF-8 and UTF-16 can encode any unicode string, so
choosing one of them avoids this problem.

2) You forget which encoding was used to produce the sequence of bytes
found in a str and try to decode it with the wrong one. This may
be the problem you have with StringField: you get a sequence of bytes,
but you don't know which encoding to use (the parameter you
should pass to str.decode()). If you think that Apache and mod_python
are passing you UTF-8 encoded strings, then decode("UTF8") should be
sufficient. If it is not, then the encoding used by
Apache and/or mod_python is not UTF-8...
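
Both failure modes can be reproduced with plain byte strings. The sketch below is only an illustration under my own assumptions: the sample word and the variable names are made up, it does not touch a real StringField, and it says nothing about which encoding Apache or mod_python actually hands you. It runs on both Python 2 and Python 3.

```python
text = u"caf\u00e9"  # "cafe" with an accented final e (U+00E9)

# Failure mode 1: the chosen encoding cannot represent the character.
try:
    text.encode("ascii")
    ascii_worked = True
except UnicodeEncodeError:
    ascii_worked = False

utf8_bytes = text.encode("utf-8")         # works for any unicode string
latin1_bytes = text.encode("iso-8859-1")  # works here, would fail on kanji

# Failure mode 2: decoding with an encoding other than the one used.
# Sometimes this gives no error at all, just the wrong characters...
mojibake = utf8_bytes.decode("iso-8859-1")

# ...and sometimes it raises, because the bytes are not valid in the
# encoding you guessed.
try:
    latin1_bytes.decode("utf-8")
    wrong_decode_raised = False
except UnicodeDecodeError:
    wrong_decode_raised = True

# Decoding with the matching encoding recovers the original text.
recovered = utf8_bytes.decode("utf-8")
```

So if decode('utf-8') raises UnicodeDecodeError on your form data, that is a strong hint that the bytes were not UTF-8 in the first place. If it succeeds but the text looks garbled, the bytes were probably decoded with the wrong encoding at some earlier step.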

Regards,
Nicolas

2005/6/19, dharana <dharana at dharana.net>:
> But if I understand it correctly (I doubt it anyway), how are you going to store
> a UTF-8 string in a str without losing information? Let's say, a Japanese kanji?
> 
> If every string is converted to str (when assigning to StringField) instead of
> unicode it won't be possible to recover the original string. Please tell me I'm
> wrong.
> 
> Nick wrote:
> > You could try explicitly setting the media type in the content type to be
> > ISO-8859-1 (or whatever character set you want to use) instead of unicode.
> >
> > Nick
> >
> > dharana wrote:
> >
> >> Hello,
> >>
> >> I have a form in one page. I send it with accented chars. Apache is
> >> configured to send content as UTF-8 and browser is Firefox so I
> >> presume Modpython gets utf-8 encoded data.
> >>
> >> StringField inherits from str so, in this case, what kind of encoding
> >> should I presume it has? I'm having trouble trying to decode('utf-8')
> >> the StringField instance, that's why I ask.
> >
> >
> >
> >
> 
> --
> dharana
> 
> _______________________________________________
> Mod_python mailing list
> Mod_python at modpython.org
> http://mailman.modpython.org/mailman/listinfo/mod_python
>


