[mod_python] encoding

Joshua "Jag" Ginsberg listspam at flowtheory.net
Tue Aug 29 10:40:54 EDT 2006

> >
> > So when you .decode('utf8') a string encoded in UTF-8 you are taking a
> >   
> That was my question: how can I be sure that a string is always encoded in
> UTF-8 when the user submits the form?

From (shameless plug) http://starboard.flowtheory.net/blog/?q=node/206 :

"... So it all comes down to the question: what character set did the
member's web browser encode the form data in?

You may be unaware that your browser can be configured to use different
character set encodings, but it makes a big difference. By default, most
browsers will use the UTF-8 character set encoding, in which case a form
field containing the character "é" is submitted as "%C3%A9". However, if
your browser uses the Latin1 encoding, it will submit the same letter as
"%E9". So how do we tell which character set your browser was using?

The HTTP 1.1 standard requires that when providing posted form data to a
web server, the request must specify the character set used as a part of
the "Content-Type" header. However, none of the major browsers on the
market do this because too many i18n-unaware programmers have written
server-side scripts that don't understand the syntax used. So Microsoft
employed a quite clever hack: if you add a hidden form variable named
_charset_ to your form, it will automatically populate the value of this
form variable with the character set encoding the client is using. And
in fact, Mozilla followed suit. But other browsers, such as Konqueror,
have not.

So effectively, to handle all cases properly, you may have to put some
UTF-8 encoded characters into a hidden form field and wait for the user
to submit the form. If the UTF-8 encoded characters come back the same,
then they submitted the form as UTF-8. If not, you can use
trial and error to work out what encoding they used."
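
For what it's worth, here is a rough Python sketch of that fallback, not
anything mod_python provides. It assumes you can get at the raw,
still percent-encoded form data as a string. The names enc_check,
SENTINEL, CANDIDATES and detect_form_charset are made up for
illustration, and the candidate list is only a guess at what is worth trying.

```python
import codecs
from urllib.parse import parse_qs, quote

# Hypothetical sentinel placed in a hidden form field when the page is served.
SENTINEL = "é"
# Encodings to try, in order. This list is an assumption, not a standard.
CANDIDATES = ("utf-8", "iso-8859-1", "windows-1252")


def detect_form_charset(raw_query, sentinel_field="enc_check"):
    """Guess which charset a URL-encoded form body was submitted in.

    Returns an encoding name, or None if nothing matched.
    """
    # 1. Trust the _charset_ hidden field if the browser filled it in.
    fields = parse_qs(raw_query, encoding="ascii", errors="replace")
    declared = fields.get("_charset_", [""])[0]
    if declared:
        try:
            codecs.lookup(declared)  # make sure Python knows this name
            return declared
        except LookupError:
            pass  # unknown name: fall through to the sentinel check

    # 2. Otherwise decode under each candidate and see whether the
    #    sentinel comes back unchanged.
    for candidate in CANDIDATES:
        try:
            decoded = parse_qs(raw_query, encoding=candidate, errors="strict")
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        if decoded.get(sentinel_field, [""])[0] == SENTINEL:
            return candidate
    return None


# Simulate what a UTF-8 browser and a Latin1 browser would send.
utf8_body = "enc_check=" + quote(SENTINEL, encoding="utf-8")       # %C3%A9
latin1_body = "enc_check=" + quote(SENTINEL, encoding="latin-1")   # %E9
print(detect_form_charset(utf8_body))
print(detect_form_charset(latin1_body))
```

Note that a single "é" cannot tell Latin1 apart from Windows-1252, since
both encode it as the same byte. The first match in CANDIDATES wins, so a
longer sentinel would be needed to separate close relatives like those.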
