Deron Meranda
deron.meranda at gmail.com
Wed May 17 11:02:03 EDT 2006
On 5/16/06, Wouter van Marle <wouter at squirrel-systems.com> wrote:
> For html compatibility reasons, I store all the search data in the MySQL
> database in html-quoted format. The term in question that I ran into is:
> "Jubiläumsmodell", stored in the database as "Jubil&auml;umsmodell".
>
> The select in the html source, as presented to the browser, is fine (the
> html line is abbreviated for here; it contains many more options):
> <select name=model>
> <option value="Jubil&auml;umsmodell 40 Jahre">Jubil&auml;umsmodell 40
> Jahre</option>
> </select>
>
> It is rendered correctly of course by the browser.
> Now when the user clicks go, and the POST is generated, the character is
> sent back in what I guess is UTF-8 encoding

In my experience, browsers will POST in the same character set/encoding
as the HTML page containing the <form> was encoded in. So if you output
your HTML page in UTF-8, then the POST will also be UTF-8.

You can influence that by declaring the charset explicitly in the
Content-Type HTTP header of the page (for example,
"Content-Type: text/html; charset=utf-8"), or by setting the optional
accept-charset attribute of the <form> element. Either way, you should
use one of these methods rather than leaving it up to the browser.

My personal opinion is to always use UTF-8 when communicating with
browsers. It is the best supported across all browsers, and will cause
you the least grief.

> (I assume it's that, as that's my default encoding) (again only a snippet):
> &model=Jubil%C3%A4umsmodell+40+Jahre

Yes, [0xc3, 0xa4] is in fact the two-byte UTF-8 encoding of the Unicode
character U+00E4 (the a-diaeresis).

> So the end of the whole story, the actual question is now: how do I get
> from this utf-8 encoding back to html quoted encoding, for searching in
> the database?

What makes this more difficult is that you are using entity names
rather than numeric character references.
For example, in HTML, all of these "escape" strings can be used to
reference the same character:

    &auml;    &#228;    &#xE4;

The second, the decimal numeric character reference, is the most
universal. In fact, a numeric reference is what you must use if you ever
want to use some other XML-based language rather than HTML, since plain
XML predefines only a handful of entity names.

Anyway, the first thing to do is to get the UTF-8 encoded string turned
back into a Python unicode string, such as

    s = s.decode('utf8')

Now that two-byte sequence has turned into a single Unicode character,
\u00e4 (shown as \xe4 in the string's repr).

Then you probably want to define a function which HTML-encodes a Unicode
string, turning all non-ASCII characters into references, perhaps
something like

    def encodechar( c ):
        # Replace non-ASCII and control characters with a decimal reference.
        codepoint = ord(c)
        if codepoint >= 128 or codepoint < 32:
            return '&#%d;' % codepoint
        else:
            return c

And HTML-encode your string like:

    encoded_s = ''.join( [encodechar(c) for c in s] )

If you are really tied to using entities rather than character
references, you may be able to do something like this instead:

    import htmlentitydefs

    def encodechar( c ):
        codepoint = ord(c)
        # Prefer a named entity when one is defined for this code point.
        entityname = htmlentitydefs.codepoint2name.get( codepoint, None )
        if entityname:
            return '&%s;' % entityname
        elif codepoint >= 128 or codepoint < 32:
            return '&#%d;' % codepoint
        else:
            return c

Keep in mind though that not all browser versions understand the
complete list of HTML entity names. That's why I would strongly
recommend against using entity names.

Also note that if you need apostrophes escaped too, you'll need to put
in another special case for that, as htmlentitydefs does not have an
entry for "apos" (it is an XML entity, not one defined by HTML 4).

> And no, I'm not going to put the info in the database in
> utf-8, that must remain html quoted for many other locations where it
> goes wrong otherwise.

Your choice. Actually though, MySQL handles Unicode brilliantly, and if
you could ever convert to that you'll make things a lot easier on
yourself in the future.

-- 
Deron Meranda
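
As an illustration only, here is a minimal end-to-end sketch of the steps described in the message, written for Python 3 rather than the Python 2 of the original code. It assumes the raw form value is available as the percent-encoded text from the POST snippet; the names encode_char, html_encode and the use_names flag are hypothetical and not part of the original post. In Python 3 the entity table lives in html.entities instead of htmlentitydefs.

```python
from html.entities import codepoint2name
from urllib.parse import unquote_plus


def encode_char(c, use_names=False):
    """Return c unchanged, or as an HTML entity/character reference."""
    codepoint = ord(c)
    # Optionally prefer a named entity (assumption: names are wanted at all).
    name = codepoint2name.get(codepoint) if use_names else None
    if name:
        return "&%s;" % name
    # Non-ASCII and control characters become decimal references.
    if codepoint >= 128 or codepoint < 32:
        return "&#%d;" % codepoint
    return c


def html_encode(text, use_names=False):
    """HTML-encode every character of a str."""
    return "".join(encode_char(c, use_names) for c in text)


# The form value as it appears in the POST body snippet.
raw = "Jubil%C3%A4umsmodell+40+Jahre"

# Undo the %XX escapes, turn '+' into a space, and decode the bytes as UTF-8.
text = unquote_plus(raw, encoding="utf-8")

print(html_encode(text))                  # Jubil&#228;umsmodell 40 Jahre
print(html_encode(text, use_names=True))  # Jubil&auml;umsmodell 40 Jahre
```

With use_names=True the result matches the "html-quoted" form described for the database, so it could be used as the search key. Note that in this mode characters such as & and < are also turned into named entities, because they appear in codepoint2name.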