dharana
dharana at dharana.net
Sun Jun 19 14:09:57 EDT 2005
Nicolas Lehuen wrote:
> Two things to remember:
>
> 1) str are sequences of bytes (think bytes, not chars)

This removes a wrong idea I had. Thanks

> 2) unicode strings are sequences of unicode characters
>
> An encoding is simply a mapping (sometimes an incomplete mapping)
> between the two forms.
>
> The UTF-8 encoding of a unicode string is a sequence of bytes that can
> perfectly well be stored in a str. There is no information loss, you could
> store Japanese kanji or even Klingon in UTF-8 form in a str instance.
> In fact, any binary encoding of unicode can be stored in a str,
> since a str can contain any binary content.
>
> There are two ways of losing information:
>
> 1) You pick an encoding which doesn't know how to encode your unicode
> characters. For example, the ASCII encoding won't be able to encode
> European accented characters, not to mention Japanese kanji.
> Fortunately, UTF-8 and UTF-16 can encode any unicode string, so there
> should not be any encoding problem here.
>
> 2) You forget what encoding was used to get the sequence of bytes
> found in a str and try to decode it with the wrong encoding. This may
> be the problem you have with StringField: you get a sequence of bytes,
> but you don't know the encoding you should use (the parameter you
> should pass to str.decode()). If you think that Apache and mod_python
> are passing you UTF-8 encoded strings, then decode("UTF8") should be
> sufficient. If it's not, then it means that the encoding used by
> Apache and/or mod_python is not UTF-8...
>

Ok, now I see the light. Now I understand exactly what is going on with
the string and I can decode('utf-8') it.

Thanks again :)

--
dharana
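
For anyone reading this thread later, here is a minimal sketch of the two points above: the lossless UTF-8 round trip and the two ways of losing information. It is written for a current Python 3 interpreter, where `bytes` plays the role that `str` had in this discussion and `str` plays the role of `unicode`. The sample text and the variable names are made up for illustration and do not come from mod_python.

```python
# Python 3 naming: bytes corresponds to the old str, str to the old unicode.
text = "caf\u00e9 \u6f22\u5b57"  # "café" followed by two kanji (illustrative sample)

# Encoding to UTF-8 yields plain bytes; decoding with the same codec is lossless.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Loss case 1: the chosen encoding cannot represent the characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Loss case 2: the bytes are decoded with the wrong encoding.
# Latin-1 maps every byte to some character, so no error is raised,
# but the result is not the original text.
wrong = utf8_bytes.decode("latin-1")
```

Note that the second failure is silent: decoding with the wrong codec may succeed and simply produce garbage, which is why the encoding has to be known rather than guessed.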