Wouter van Marle
wouter at squirrel-systems.com
Wed May 17 11:17:29 EDT 2006
Dear Deron,

Thank you for the comments. I understand your ideas; unfortunately they
do not solve my problem.

The info I get comes from another website, and that source gives me the
info in the ampersand form (that third-party site is a Netscape server,
by the way! Didn't know they are still in use, very remarkable). And I
like that.

The main reason to continue using that format is the " (double quote)
and ' (single quote). These characters occur in the data that I try to
store in the MySQL database, and they fantastically mess up the
queries.... imagine:

    s = "this is 'a' string"

then say

    query = """ SELECT * FROM base WHERE field = "%s";""" % s

But what about when s can be 'this is "a" string', or
s = """this is a 5", 'b' sized thing"""?

Now if I keep those " and ' characters in ampersand-quoted form, no
problem. And as a result I (have to - conversion issue; urllib.unquote()
is not selective) keep the rest of the strange characters (usually
umlauts and so on) also in that form.

If there is no easier solution I'll have to just use urllib.unquote()
and encode into UTF-8 for storing, and then replace the " characters
again with the quoted form. That may be the best solution.

Wouter.

On Wed, 2006-05-17 at 11:02 -0400, Deron Meranda wrote:
> On 5/16/06, Wouter van Marle <wouter at squirrel-systems.com> wrote:
> > For html compatibility reasons, I store all the search data in the MySQL
> > database in html-quoted format. The term in question that I ran into is:
> > "Jubiläumsmodell", stored in the database as "Jubil&auml;umsmodell".
> >
> > The select in the html source, as presented to the browser, is fine (the
> > html line is abbreviated for here; it contains many more options):
> > <select name=model>
> > <option value="Jubil&auml;umsmodell 40 Jahre">Jubil&auml;umsmodell 40
> > Jahre</option>
> > </select>
> >
> > It is rendered correctly of course by the browser.
> > Now when the user clicks go, and the POST is generated, the character is
> > sent back in what I guess is UTF-8 encoding
>
> From my experience, browsers will POST in the same character set/
> encoding as whatever the HTML page containing the <form> was
> encoded in. So if you output your HTML page in utf-8, then the POST
> will also be UTF-8.
>
> You can of course affect that somewhat if you explicitly output the
> Content-Type HTTP header with a charset parameter, or the optional
> accept-charset attribute of the <form> element.
>
> Regardless though, you should use one of the above methods
> rather than leaving it up to the browser. My personal opinion
> is to always use UTF-8 when communicating with browsers.
> It is the best supported across all browsers, and will cause you
> the least grief.
>
> > (I assume it's that, as that's my default encoding) (again only a snippet):
> > &model=Jubil%C3%A4umsmodell+40+Jahre
>
> Yes, [0xc3, 0xa4] is in fact the two-byte UTF-8 encoding of the
> unicode character U+00E4 (the a-diaeresis).
>
> > So the end of the whole story, the actual question is now: how do I get
> > from this utf-8 encoding back to html quoted encoding, for searching in
> > the database?
>
> What makes this more difficult is that you are using entity names rather
> than character references. For example, in HTML, all of these
> "escape" strings can be used to reference the same character:
>
>    &auml;
>    &#228;
>    &#xe4;
>
> The second, the decimal numeric character reference, is the most
> universal. In fact it is what you must use if you ever want to use some
> other XML-based languages rather than HTML.
>
> Anyway, the first thing to do is to get the UTF-8 encoded string
> turned back into a Python unicode string, such as
>
>    s = s.decode('utf8')
>
> Now that two-byte sequence has turned into a single unicode
> character, \u00e4 (displayed as \xe4 when you print it).
>
> Then you probably want to define a function which HTML-encodes
> a Unicode string, turning all non-ASCII characters into references,
> perhaps something like
>
>    def encodechar( c ):
>        codepoint = ord(c)
>        if codepoint >= 128 or codepoint < 32:
>            return '&#%d;' % codepoint
>        else:
>            return c
>
> And HTML-encode your string like:
>
>    encoded_s = ''.join( [encodechar(c) for c in s] )
>
> If you are really tied to using entities rather than character references,
> you may be able to do something like this instead:
>
>    import htmlentitydefs
>
>    def encodechar( c ):
>        codepoint = ord(c)
>        entityname = htmlentitydefs.codepoint2name.get( codepoint, None )
>        if entityname:
>            return '&%s;' % entityname
>        elif codepoint >= 128 or codepoint < 32:
>            return '&#%d;' % codepoint
>        else:
>            return c
>
> Keep in mind though that not all browser versions understand
> the complete list of HTML entity names. That's why I would
> strongly recommend against using entity names.
>
> Also note that if you need apostrophes escaped too, you'll need to
> put in another special case for that as the htmlentitydefs does
> not have an entry for "apos" for some reason.
>
> > And no, I'm not going to put the info in the database in
> > utf-8, that must remain html quoted for many other locations where it
> > goes wrong otherwise.
>
> Your choice. Actually though, MySQL handles Unicode brilliantly,
> and if you could ever convert to that you'll make things a lot
> easier on yourself in the future.
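
The round trip discussed in this thread (percent-encoded UTF-8 form data, then Unicode text, then HTML-quoted text) could be sketched roughly as below. This is only an illustration under some assumptions: it targets Python 3, so `urllib.parse.unquote_plus` and `html.entities.codepoint2name` stand in for the Python 2 `urllib.unquote` and `htmlentitydefs.codepoint2name` used in the messages; the names `html_quote` and `form_value_to_html` are hypothetical; and always escaping quotes and markup characters is a guess at what the database expects, based on the quoting problem described above.

```python
from html.entities import codepoint2name
from urllib.parse import unquote_plus


def html_quote(text, use_names=True):
    """HTML-quote a Unicode string (hypothetical helper).

    With use_names=True, entity names such as &auml; are preferred where
    one exists; otherwise decimal character references are used.
    """
    out = []
    for c in text:
        codepoint = ord(c)
        name = codepoint2name.get(codepoint) if use_names else None
        if name:
            out.append("&%s;" % name)
        elif codepoint >= 128 or codepoint < 32 or c in "\"'&<>":
            # Covers the apostrophe too, which has no entry in
            # codepoint2name and so falls back to &#39;.
            out.append("&#%d;" % codepoint)
        else:
            out.append(c)
    return "".join(out)


def form_value_to_html(raw_value, use_names=True):
    """Turn one percent-encoded form value into HTML-quoted text.

    Assumes the browser submitted the form as UTF-8.
    """
    # unquote_plus turns '+' into a space and decodes %XX bytes as UTF-8.
    text = unquote_plus(raw_value, encoding="utf-8")
    return html_quote(text, use_names=use_names)


print(form_value_to_html("Jubil%C3%A4umsmodell+40+Jahre"))
# prints: Jubil&auml;umsmodell 40 Jahre
```

Because both kinds of quote come out in escaped form, the result contains no literal " or ' characters. Whether it matches the rows already in the database depends on the stored data using the same entity names, which is not something the thread confirms.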