Nicolas Lehuen
nicolas at lehuen.com
Tue Feb 14 14:23:57 EST 2006
PSP should encode the unicode strings in the same character set the PSP page is written in. Now you've got two problems :). Somehow, the PSP calling code should pass an encoding name to the PSP evaluator, which would stringify values like this:

    def stringify(value, encoding):
        if value is None:  # I guess it does that, I never used PSP :)
            return ''
        elif isinstance(value, unicode):
            return value.encode(encoding)
        else:
            return str(value)

There are three places where the encoding could be defined:

1) In the PSP file, thanks to a special tag. XML files use the <?xml version="1.0" encoding="iso-8859-1"?> scheme, Python files use # -*- coding: iso-8859-1 -*-, HTML files use the META content-type header, and so on and so forth. The PSP constructor would extract the encoding name from the file and use it when transforming unicode strings into bytes.

2) When the PSP file is compiled, as an argument to the PSP constructor.

3) When the PSP file is run, as a dictionary member. This one is not a good idea, since there is no reason the static part of the PSP file shouldn't have a static encoding, and changing the output encoding at runtime is bound to cause problems.

In any case, you are bound to have problems when using non-unicode strings, should their encoding differ from the one used in the PSP file. The best thing to do IMHO would be to transform everything into Unicode, then convert it into an encoding supported by the client. That is to say, parse and store the PSP file as unicode strings, decoding the bytes from the file according to the encoding obtained from one of the first two methods above. Then turn input values into unicode (byte strings being converted to unicode according to their encoding) and build a fully unicode result document. mod_python would then have to select an encoding (based on content negotiation with the browser) to write the unicode document as bytes, not forgetting to specify the encoding in the "Content-Type: text/html; charset=XXXX" header.
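In modern Python 3 terms (where str plays the role of the unicode type discussed here, and bytes plays the role of the old str), the "everything in Unicode, encode only at the very end" pipeline might be sketched like this. The template text and helper functions are invented for illustration, not mod_python API:

```python
# Sketch of the all-Unicode pipeline: decode on the way in,
# work in text, encode once on the way out.

def load_template(raw_bytes, source_encoding):
    # Decode the PSP file's bytes into text as soon as it is read.
    return raw_bytes.decode(source_encoding)

def render(template, values):
    # Build a fully Unicode result document; no bytes involved yet.
    return template % values

def write_response(document, output_encoding):
    # Only here are characters turned into bytes, and the chosen
    # charset is advertised in the Content-Type header.
    headers = {'Content-Type': 'text/html; charset=%s' % output_encoding}
    return headers, document.encode(output_encoding)

raw = 'caf\xe9: %(price)s EUR'.encode('iso-8859-1')  # file stored as Latin-1
template = load_template(raw, 'iso-8859-1')
document = render(template, {'price': '2.50'})
headers, body = write_response(document, 'utf-8')
```

The point of the sketch is that the source encoding (method 1 or 2 above) is used exactly once, at load time, and the client-facing encoding exactly once, at write time; everything in between is pure text.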
But then again, this complicates method 1, since if another encoding is selected for the output, the encoding information embedded inside the document (the META tag, for instance) has to be modified to match.

In other words: forget about byte strings, do everything in Unicode, and try to never automatically convert str into unicode or vice-versa, because this is bound to fail. This automatic conversion is really a weak spot of Python; it is really a shame. Anyway...

Unicode isn't really complicated. You have to forget that unicode and str are nearly the same in Python. They are not. Unicode strings are arrays of thingies that represent characters - don't even think of these thingies as 16 or 32 bit integers. They are abstract values that each represent a given character. The problem is that "thingies" do not play well with electronics, so you have to convert them into bits and bytes to store or exchange them. Encodings are simply a mapping from thingies to bytes and vice versa.

The trick is that encodings are built so that they are somewhat optimal for a given set of languages. As a consequence, most encodings cannot encode the whole set of Unicode characters. For example, the ASCII encoding can only encode 128 different characters, namely numbers and the upper and lower case non-accented latin alphabet (as used in, surprise, the United States). ASCII only requires 7 bits to do so, which was good for antique communication systems. Nowadays, ASCII is OK if you're restricted to the English-speaking world, but as soon as a damn French guy wants to write you about how his café is better than the stuff you get at Starbucks, well, he cannot, because the Unicode "é" thingy cannot be encoded into ASCII. So you don't have to hear his rambling, and you're much better off that way. But that's another story.
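For what it's worth, Python 3 eventually fixed exactly this weak spot by refusing any automatic conversion: text and bytes never mix implicitly, so the failure is immediate and obvious instead of depending on a hidden default encoding. A small sketch:

```python
# Python 3's answer to automatic str/unicode conversion:
# mixing text and bytes fails loudly instead of guessing an encoding.

text = 'caf\xe9'               # a sequence of abstract characters
data = text.encode('utf-8')    # an explicit conversion to bytes

try:
    broken = text + data       # Python 2 would have "helpfully" converted here
except TypeError as exc:
    message = str(exc)         # "can only concatenate str (not ...) to str"

# The only safe route is an explicit decode with a known encoding.
repaired = text + data.decode('utf-8')
```

The explicit decode call is the whole design point: the encoding name has to appear somewhere in the program, instead of being supplied by an invisible platform default.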
A big bunch of different encodings use 8 bits to represent, say, a majority of the characters used in West European languages (ISO-8859-1, AKA ISO-Latin-1, and its cousin ISO-Latin-15, which replaces the little-used currency sign at 0xA4 with the very useful euro sign), Cyrillic alphabets (the Russians even have multiple mutually incompatible encodings for their alphabet), and so on. Then there is a set of more universal encodings: UTF-16 (16 bits per code unit; characters outside the Basic Multilingual Plane need two units), UTF-32 (32 bits per character), and the famous UTF-8, which occidental developers love because latin characters only use 8 bits, and oriental developers hate because their characters sometimes require 24 bits.

An important thing to remember is that a piece of text can be represented either as a Unicode string, or as a byte array plus the name of the encoding chosen to encode it. A BYTE ARRAY WITHOUT ENCODING INFORMATION CANNOT BE CONSIDERED AS TEXT. And that's the big problem of Python (which is not alone here, if that's a relief): byte arrays (str in Python parlance) are considered as text in some default encoding, which varies considerably from place to place. As soon as you start exchanging byte arrays with someone without specifying the encoding in one way (the Content-Type header) or another (some content-type information embedded in the array, assuming that the encoding name is itself encoded in ASCII), mayhem is sure to follow.

The Windows platform is a good example: you get one encoding in the command processor (in France it is CP850), another in the GUI (usually it is "ANSI", which is quite similar, but not equal, to ISO-8859-1), and sometimes another in your Python source file (most of the standard library is written in ASCII, which is fortunately compatible with both). But the fun part is that when running a Python program (as opposed to typing stuff at the interactive prompt), the default encoding becomes ASCII!
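The rule that a byte array only becomes text once you attach an encoding name is easy to demonstrate (a Python 3 sketch, where bytes plays the role of the old str):

```python
# The same byte sequence is different text under different encodings,
# which is why bytes without an encoding name cannot be called text.

data = b'caf\xe9'

as_latin1 = data.decode('iso-8859-1')   # 'café'
as_cp850 = data.decode('cp850')         # same bytes, a different last character
# data.decode('utf-8') would raise UnicodeDecodeError: the lone 0xe9
# byte is not a valid UTF-8 sequence.

# Conversely, one character maps to different bytes per encoding:
latin1_bytes = 'caf\xe9'[-1].encode('iso-8859-1')  # b'\xe9'      (one byte)
utf8_bytes = 'caf\xe9'[-1].encode('utf-8')         # b'\xc3\xa9'  (two bytes)
```

Three of the encodings mentioned above, three different readings of four bytes: that is the whole argument for shipping the encoding name alongside the bytes.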
As a result, accented character output is nearly always wrong when writing to the console, unless the programmer has thought about converting strings to CP850. I wish the situation were as simple as in the Java world, where all strings are Unicode strings, period...

Joel Spolsky has written a very good article about Unicode: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" http://www.joelonsoftware.com/articles/Unicode.html His explanations are even better than mine; I don't know why I have written this, and why you read it ;)

Regards,
Nicolas

2006/2/14, Gregory (Grisha) Trubetskoy <grisha at modpython.org>:
>
> I'm a bit unicode-ignorant - what should PSP do? The idea was that a
> variable referred to in a PSP page would be an object that could stringify
> itself by implementing a __str__(), but obviously this doesn't work with
> unicode at all. But I'm not sure how self-representation works in the
> unicode world...
>
> Grisha
>
> On Mon, 13 Feb 2006, Dan Eloff wrote:
>
> > Actually I was just about to post a question about this. The psp generated
> > code surrounds everything with str() before writing it, so it doesn't work
> > with unicode at all.
> >
> > -Dan
>
> _______________________________________________
> Mod_python mailing list
> Mod_python at modpython.org
> http://mailman.modpython.org/mailman/listinfo/mod_python