Joshua "Jag" Ginsberg
listspam at flowtheory.net
Tue Aug 29 10:40:54 EDT 2006
> > > > So when you .decode('utf8') a string encoded in UTF-8 you are taking a
>
> That was my question: how can we be sure that a string is always encoded
> in UTF-8 when the user submits the form?

From (shameless plug) http://starboard.flowtheory.net/blog/?q=node/206 :

"... So it all originates from the question: what character set did the
member's web browser encode the form data in? You may be unaware of the
fact that your browser can be configured to use different character set
encodings, but it makes a big difference. By default, most browsers will
use the UTF-8 character set encoding, in which case form data that includes
the character "é" will be submitted as "%C3%A9". However, if your browser
uses the Latin-1 encoding, it will submit the same letter as "%E9".

So how do we tell which character set your browser was using? The HTTP 1.1
standard requires that when providing posted form data to a web server, the
request must specify the character set used as part of the "Content-Type"
header. However, none of the major browsers on the market do this, because
too many i18n-unaware programmers have written server-side scripts that
don't understand the syntax. So Microsoft employed a quite clever hack: if
you add a hidden form variable named _charset_ to your form, the browser
will automatically populate that variable with the character set encoding
the client is using. Mozilla followed suit, but other browsers, such as
Konqueror, have not.

So effectively, to handle all cases properly, you may have to put some
UTF-8 encoded characters into a hidden form field and wait for the user to
submit the form. If the UTF-8 encoded characters come back the same, then
the form was submitted as UTF-8. If not, you can use trial and error to
work out which encoding was used. ..."
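To make the trial-and-error part concrete, here is a rough Python sketch of
how you might combine the two tricks. The probe field name (_charset_probe_)
and the list of candidate encodings are just assumptions for the example,
nothing standardized:

# Sketch: guess which charset the browser used to encode posted form
# data. Assumes the form carried two hidden fields:
#   <input type="hidden" name="_charset_">  (auto-filled by IE/Mozilla)
#   <input type="hidden" name="_charset_probe_" value="é">  (our probe)

PROBE = '\u00e9'  # the "é" we planted in the hidden probe field

# Candidate encodings to try, most likely first; pure assumption here.
CANDIDATES = ['utf-8', 'iso-8859-1', 'windows-1252']

def detect_form_charset(raw_fields):
    """raw_fields maps field names to the raw bytes the browser sent."""
    # 1. Trust _charset_ if the browser filled it in (the MS/Mozilla hack).
    declared = raw_fields.get('_charset_')
    if declared:
        return declared.decode('ascii')
    # 2. Otherwise round-trip the probe: the first candidate encoding
    #    that decodes the probe bytes back to "é" is our best guess.
    probe = raw_fields.get('_charset_probe_', b'')
    for encoding in CANDIDATES:
        try:
            if probe.decode(encoding) == PROBE:
                return encoding
        except UnicodeDecodeError:
            continue
    return None  # no luck; fall back to a site-wide default

if __name__ == '__main__':
    print(detect_form_charset({'_charset_probe_': b'\xc3\xa9'}))  # utf-8
    print(detect_form_charset({'_charset_probe_': b'\xe9'}))      # iso-8859-1
    print(detect_form_charset({'_charset_': b'windows-1252'}))    # windows-1252

With a UTF-8 submission the probe arrives as the two bytes 0xC3 0xA9 and
matches on the first try; a Latin-1 submission arrives as the single byte
0xE9, which is not valid UTF-8 on its own, so it falls through to
iso-8859-1.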
-jag