[mod_python] Decoding HTML escape characters in HTTP Requests

Sat Jan 3 17:49:07 EST 2009

Thanks Clodoaldo.  UTF-8 works fine.  What I was reporting is that
non-UTF charsets are not supported in Publisher (and probably some
other handlers as well).

As a result of this problem, you'll get Python errors or wrong results
when dealing with non-ASCII characters in a browser/site with a non-UTF
*default* charset.

Example:
- You have IE6 and you haven't changed the charset (it must be MS or
ISO Western).
- Open this page: http://asiwg.org/~behnam/tr31_zwnj/
- Copy Á (U+00C1 LATIN CAPITAL LETTER A WITH ACUTE) to the field and
press submit
- You'll get this Python error: """
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 0:
unexpected end of data
"""
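
The failure can be reproduced outside mod_python. This is only a minimal
sketch, assuming the browser sent the character as the single byte 0xC1
(its ISO-8859-1 encoding). It is written for Python 3, so the exact error
message differs from the Python 2.5 traceback above, and try_decode is a
hypothetical helper name, not part of mod_python:

```python
# U+00C1 (A with acute) encoded the way a Western-default browser would send it.
raw = "\u00c1".encode("iso-8859-1")   # one byte: 0xC1


def try_decode(data, charset):
    """Return the decoded text, or None if the bytes are invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(raw, "utf-8")          # fails: 0xC1 is not valid UTF-8
as_latin1 = try_decode(raw, "iso-8859-1")   # succeeds
```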

My point is that it's not hard to detect the content encoding from the
HTTP request header and pass that encoding to the decoder.  It matters
because you don't always get an exception; sometimes you silently get
the wrong answer.  For example, try a non-Latin character (Cyrillic,
Arabic, etc.) in the example above.
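
Roughly what I have in mind, as a sketch only and not Publisher's actual
code.  It assumes the request carries a charset parameter in its
Content-Type header (browsers do not always send one, so a default is
still needed), it is written for Python 3, and the function names
charset_from_content_type and decode_field are made up for illustration:

```python
from email.message import Message
from urllib.parse import unquote_to_bytes


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    if not header_value:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_field(raw_value, content_type):
    """Percent-decode a form value, then decode it with the request's charset."""
    charset = charset_from_content_type(content_type)
    data = unquote_to_bytes(raw_value.replace("+", " "))
    try:
        return data.decode(charset)
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes or unknown charset name: substitute U+FFFD
        # instead of raising, so the handler does not crash.
        return data.decode("utf-8", errors="replace")


field = decode_field(
    "%C1", "application/x-www-form-urlencoded; charset=ISO-8859-1"
)
```

With the charset taken from the header, the byte 0xC1 from the example
above decodes to U+00C1 instead of raising UnicodeDecodeError.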

-Behnam

On Sun, Jan 4, 2009 at 1:47 AM, Clodoaldo Pinto Neto
<clodoaldo.pinto.neto at gmail.com> wrote:
> 2009/1/3 Behnam Esfahbod ZWNJ <behnam at zwnj.org>:
>> Hi list,
>>
>> When browsers need to send Unicode characters (e.g. U+06F1, EXTENDED
>> ARABIC-INDIC DIGIT ONE) in a non-Unicode (e.g. Western ISO-8859-1)
>> encoded HTTP request, they escape Unicode characters in HTML escape
>> formats.  For the example above, the string "&#1777;" will be sent to
>> the server.
>
> iso-8859-1 is 256 bytes long only. If you want all the unicode code
> points represented you should use utf-8. utf-32 also can represent all
> unicode code points but consumes more bandwidth and i don't know if it
> is as well supported as utf-8, which is universal.
>
>>
>> I'm using mod_python's Publisher handler, and in these cases, I get
>> the escaped string, not the original Unicode text.  Is it a bug in
>> mod_python, or a non-standard feature of common browsers/app-servers,
>> or both?
>
> Try to use utf-8 and see what you get.
>
> Regards, Clodoaldo
>
>>
>> Best,
>> -Behnam
>>
>> Hint: U+06F1, EXTENDED ARABIC-INDIC DIGIT ONE = &#1777;
>>
>>
>> --
>>    '     بهنام اسفهبد
>>    '     Behnam Esfahbod
>>   '
>>  *  ..   http://behnam.esfahbod.info
>>  *  `  *
>>  * o *   http://zwnj.org
>>
>

-- 
    '     بهنام اسفهبد
    '     Behnam Esfahbod
   '
  *  ..   http://behnam.esfahbod.info
 *  `  *
  * o *   http://zwnj.org