Nicolas Lehuen
nicolas.lehuen at gmail.com
Sun Jun 19 13:42:31 EDT 2005
Two things to remember:

1) str objects are sequences of bytes (think bytes, not chars)
2) unicode strings are sequences of Unicode characters

An encoding is simply a mapping (sometimes an incomplete mapping) between the two forms.

The UTF-8 encoding of a unicode string is a sequence of bytes that can perfectly well be stored in a str. There is no information loss: you could store Japanese kanji or even Klingon in UTF-8 form in a str instance. In fact, any binary encoding of Unicode can be stored in a str, since a str can contain any binary content.

There are two ways of losing information:

1) You pick an encoding which doesn't know how to encode your Unicode characters. For example, the ASCII encoding won't be able to encode European accented characters, not to mention Japanese kanji. Fortunately, UTF-8 and UTF-16 can encode any unicode string, so there should not be any encoding problem here.

2) You forget what encoding was used to get the sequence of bytes found in a str and try to decode it with the wrong encoding.

This may be the problem you have with StringField: you get a sequence of bytes, but you don't know the encoding you should use (the parameter you should pass to str.decode()).

If you think that Apache and mod_python are passing you UTF-8 encoded strings, then decode("UTF8") should be sufficient. If it's not, then it means that the encoding used by Apache and/or mod_python is not UTF8...

Regards,
Nicolas

2005/6/19, dharana <dharana at dharana.net>:
> But if I understand it correctly (I doubt it anyway). How are you going to store
> an UTF-8 string into a str without losing information? Let's say, a Japanese kanji?
>
> If every string is converted to str (when assigning to StringField) instead of
> unicode it won't be possible to recover the original string. Please tell me I'm
> wrong.
>
> Nick wrote:
> > You could try explicitly setting the media type in the content type to be
> > ISO-8859-1 (or whatever character set you want to use) instead of unicode.
> >
> > Nick
> >
> > dharana wrote:
> >
> >> Hello,
> >>
> >> I have a form in one page. I send it with accented chars. Apache is
> >> configured to send content as UTF-8 and browser is Firefox so I
> >> presume Modpython gets utf-8 encoded data.
> >>
> >> StringField inherits from str so, in this case, what kind of encoding
> >> should I presume it has? I'm having trouble trying to decode('utf-8')
> >> the StringField instance, that's why I ask.
> >
> --
> dharana
>
> _______________________________________________
> Mod_python mailing list
> Mod_python at modpython.org
> http://mailman.modpython.org/mailman/listinfo/mod_python
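
As a rough illustration of the two ways of losing information described in the reply, here is a minimal sketch. It assumes Python 3, where the types named in the message are renamed: Python 2's str corresponds to bytes, and Python 2's unicode corresponds to str. The sample text and variable names are made up for illustration and have nothing to do with mod_python's own API.

```python
# Python 3 sketch. Assumption: Python 2's `str` maps to `bytes` here,
# and Python 2's `unicode` maps to `str`.

# Illustrative sample: an accented letter plus two kanji.
text = "caf\u00e9 \u6f22\u5b57"

# UTF-8 and UTF-16 can encode any Unicode string, so the round trip is lossless.
utf8_bytes = text.encode("utf-8")
utf8_round_trip = utf8_bytes.decode("utf-8")
utf16_round_trip = text.encode("utf-16").decode("utf-16")

# Loss case 1: the chosen encoding cannot represent the characters.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Loss case 2: the bytes are decoded with the wrong encoding.
# ISO-8859-1 accepts every byte value, so this fails silently with garbage.
wrong_decode = utf8_bytes.decode("iso-8859-1")

# The reverse mistake fails loudly: these ISO-8859-1 bytes are not valid UTF-8.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    utf8_decode_failed = False
except UnicodeDecodeError:
    utf8_decode_failed = True

print(utf8_round_trip == text, ascii_failed, wrong_decode != text, utf8_decode_failed)
```

If decoding a form value as UTF-8 raises an error like the last case, that suggests the bytes were produced with some other encoding, which is the situation the reply describes.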