Behnam Esfahbod ZWNJ
behnam at zwnj.org
Sat Jan 3 17:49:07 EST 2009
Thanks Clodoaldo.

UTF-8 works fine. What I was reporting is that non-UTF charsets are not supported in Publisher (and probably some other handlers as well). As a result, you'll get Python errors or wrong results when dealing with non-ASCII characters in a browser/site with a non-UTF *default* charset.

Example:
- You have IE6 and you haven't changed the charset (it must be MS or ISO Western).
- Open this page: http://asiwg.org/~behnam/tr31_zwnj/
- Copy Á (U+00C1 LATIN CAPITAL LETTER A WITH ACUTE) to the field and press submit.
- You'll get this Python error:

"""
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 0: unexpected end of data
"""

What I'm saying is that it's not so hard to detect the content encoding from the HTTP request header and pass that encoding to the decoder. And I say it's important because you don't always get an exception; sometimes you just get the wrong answer. For example, try a non-Latin character (Cyrillic, Arabic, etc.) in the example above.

-Behnam

On Sun, Jan 4, 2009 at 1:47 AM, Clodoaldo Pinto Neto <clodoaldo.pinto.neto at gmail.com> wrote:
> 2009/1/3 Behnam Esfahbod ZWNJ <behnam at zwnj.org>:
>> Hi list,
>>
>> When browsers need to send Unicode characters (i.e. U+06FA, EXTENDED
>> ARABIC-INDIC DIGIT ONE) in a non-Unicode (i.e. Western ISO-8859-1)
>> encoded HTTP request, they escape Unicode characters in HTML escape
>> formats. For example above, the string "۱" will be sent to the
>> server.
>
> iso-8859-1 is 256 bytes long only. If you want all the unicode code
> points represented you should use utf-8. utf-32 also can represent all
> unicode code points but consumes more bandwidth and i don't know if it
> is as well supported as utf-8, which is universal.
>
>>
>> I'm using mod_pythons's Publisher handler, and in these cases, i get
>> the escaped string, not the original Unicode text. Is it a bug in
>> mod_python, or a non-standard feature of common browsers/app-servers,
>> or both?
>
> Try to use utf-8 and see what you get.
>
> Regards, Clodoaldo
>
>>
>> Best,
>> -Behnam
>>
>> Hint: U+06FA, EXTENDED ARABIC-INDIC DIGIT ONE = ۱
>>
>>
>> --
>> ' بهنام اسفهبد
>> ' Behnam Esfahbod
>> '
>> * .. http://behnam.esfahbod.info
>> * ` *
>> * o * http://zwnj.org
>>
>> _______________________________________________
>> Mod_python mailing list
>> Mod_python at modpython.org
>> http://mailman.modpython.org/mailman/listinfo/mod_python
>>
>

--
' بهنام اسفهبد
' Behnam Esfahbod
'
* .. http://behnam.esfahbod.info
* ` *
* o * http://zwnj.org
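
The suggestion in the message above (read the encoding from the HTTP request header and hand it to the decoder) could look roughly like the sketch below. It is written for Python 3 using only the standard library; it is not mod_python's or Publisher's actual code, and the helper names `charset_from_content_type` and `decode_field` are made up for illustration. The ISO-8859-1 fallback is also an assumption: in practice many browsers omit the charset parameter on form posts, so a server still has to guess in that case.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header is missing or carries no
    charset parameter. (Hypothetical helper, not a mod_python API.)
    """
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset or None.
    return msg.get_content_charset() or default


def decode_field(raw, content_type, fallback="iso-8859-1"):
    """Decode raw form-field bytes using the charset the request declared.

    If the declared (or default UTF-8) charset is unknown or does not fit
    the bytes, fall back to `fallback`. ISO-8859-1 maps every byte to a
    code point, so the fallback never raises, but it can still be a
    wrong guess.
    """
    charset = charset_from_content_type(content_type)
    try:
        return raw.decode(charset)
    except (UnicodeDecodeError, LookupError):
        return raw.decode(fallback)
```

With this, the byte from the traceback above decodes as expected when the request declares its charset: `decode_field(b"\xc1", "application/x-www-form-urlencoded; charset=ISO-8859-1")` returns `"Á"`. Without a declared charset the function tries UTF-8 first and only then falls back, which avoids the exception but cannot rule out a silently wrong result for other single-byte encodings.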