[mod_python] Problem with html quoted/unquoted

Wed May 17 11:02:03 EDT 2006

On 5/16/06, Wouter van Marle <wouter at squirrel-systems.com> wrote:
> For html compatibility reasons, I store all the search data in the MySQL
> database in html-quoted format. The term in question that I ran into is:
> "Jubiläumsmodell", stored in the database as "Jubil&auml;umsmodell".
>
> The select in the html source, as presented to the browser, is fine (the
> html line is abbreviated for here; it contains many more options):
> <select name=model>
>         <option value="Jubil&auml;umsmodell 40 Jahre">Jubil&auml;umsmodell 40
> Jahre</option>
> </select>
>
> It is rendered correctly of course by the browser.
> Now when the user clicks go, and the POST is generated, the character is
> sent back in what I guess is UTF-8 encoding

From my experience, browsers will POST in the same character
encoding as the HTML page containing the <form> was served
in.  So if you output your HTML page in UTF-8, then the POST
will also be UTF-8.

You can of course control that by explicitly declaring the charset
in the Content-Type HTTP header (for example
"text/html; charset=utf-8"), or by setting the optional
accept-charset attribute of the <form> element.

Regardless though, you should use one of the above methods
rather than leaving it up to the browser.  My personal opinion
is to always use UTF-8 when communicating with browsers.
It is the best supported across all browsers, and will cause you
the least grief.
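
As a rough sketch of doing both at once (declaring the charset in the
Content-Type header and on the form), assuming a mod_python-style
request object with a content_type attribute and a write() method.
The function name and form action are made up, and FakeRequest is only
a stand-in so the snippet runs outside Apache (Python 3 syntax):

```python
class FakeRequest:
    """Stand-in for a mod_python request object (assumption, for local runs)."""

    def __init__(self):
        self.content_type = None
        self.body = b""

    def write(self, data):
        self.body += data


def send_search_form(req):
    # Declare the page encoding in the Content-Type response header.
    req.content_type = "text/html; charset=utf-8"
    page = (
        '<form method="post" action="search" accept-charset="utf-8">\n'
        '<select name="model">\n'
        '<option value="Jubil&auml;umsmodell 40 Jahre">'
        'Jubil&auml;umsmodell 40 Jahre</option>\n'
        '</select>\n'
        '</form>\n'
    )
    # Send bytes that really are UTF-8, matching the declared charset.
    req.write(page.encode("utf-8"))


req = FakeRequest()
send_search_form(req)
```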

> (I assume it's that, as that's my default encoding) (again only a snippet):
> &model=Jubil%C3%A4umsmodell+40+Jahre

Yes, [0xc3, 0xa4] is in fact the two-byte UTF-8 encoding of the
unicode character U+00E4 (the a-diaeresis).
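
You can check this yourself.  A small sketch in Python 3 syntax (the
posted value is copied from your snippet; the variable names are just
for illustration):

```python
from urllib.parse import unquote_to_bytes

posted = "Jubil%C3%A4umsmodell+40+Jahre"

# In form data "+" stands for a space; the %XX escapes are raw bytes.
raw = unquote_to_bytes(posted.replace("+", " "))

# The bytes 0xC3 0xA4 decode to the single character U+00E4.
text = raw.decode("utf-8")
```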

> So the end of the whole story, the actual question is now: how do I get
> from this utf-8 encoding back to html quoted encoding, for searching in
> the database?

What makes this more difficult is that you are using entity names rather
than character references.  For example, in HTML, all of these
"escape" strings can be used to reference the same character:

   &auml;
   &#228;
   &#xe4;

The second form, the decimal numeric character reference, is the
most universal.  In fact it is what you must use if you ever want to
move to other XML-based languages rather than HTML, since plain XML
predefines only a handful of entity names.
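
To see that the named, decimal, and hexadecimal forms all denote the
same character, here is a quick check in Python 3 syntax (html.unescape
is a Python 3 standard library function; the list of forms is mine, for
the a-diaeresis U+00E4):

```python
import html

forms = ["&auml;", "&#228;", "&#xe4;"]

# Each reference should decode to the same one-character string.
decoded = [html.unescape(f) for f in forms]
```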

Anyway, the first thing to do is to get the UTF-8 encoded string
turned back into a Python unicode string, such as

   s = s.decode('utf8')

Now that two-byte sequence has turned into a single Unicode
character, \u00e4 (shown as \xe4 in the string's repr).

Then you probably want to define a function which HTML-encodes
a Unicode string, turning all non-ASCII characters into references,
perhaps something like

def encodechar( c ):
    codepoint = ord(c)
    if codepoint >= 128 or codepoint < 32:
        return '&#%d;' % codepoint
    else:
        return c

And HTML-encode your string like:

   encoded_s = ''.join( [encodechar(c) for c in s] )
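
Putting the decode and encode steps together, here is a sketch in
Python 3 syntax (where bytes and str are separate types); html_encode
is a helper name I made up, and the input bytes are your example value:

```python
def encodechar(c):
    # Replace non-ASCII and control characters with decimal references.
    codepoint = ord(c)
    if codepoint >= 128 or codepoint < 32:
        return "&#%d;" % codepoint
    return c


def html_encode(s):
    return "".join(encodechar(c) for c in s)


# The bytes as they arrive after URL-decoding the POST body.
raw = b"Jubil\xc3\xa4umsmodell 40 Jahre"
encoded_s = html_encode(raw.decode("utf-8"))
```

Note this produces "&#228;" rather than "&auml;", so it will only match
rows in your database that were stored with numeric references.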

If you are really tied to using entities rather than character references,
you may be able to do something like this instead:

import htmlentitydefs

def encodechar( c ):
    codepoint = ord(c)
    entityname = htmlentitydefs.codepoint2name.get( codepoint, None )
    if entityname:
        return '&%s;' % entityname
    elif codepoint >= 128 or codepoint < 32:
        return '&#%d;' % codepoint
    else:
        return c

Keep in mind though that not all browser versions understand
the complete list of HTML entity names.  That's why I would
strongly recommend against using entity names.

Also note that if you need apostrophes escaped too, you'll need to
put in another special case for that, as htmlentitydefs does not
have an entry for "apos" (it is not an HTML 4 entity name).
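
A sketch of the entity-name variant with that apostrophe special case,
in Python 3 syntax (where the module is called html.entities instead of
htmlentitydefs).  Emitting "&#39;" for the apostrophe is my choice here,
and html_encode is again a made-up helper name:

```python
from html.entities import codepoint2name


def encodechar(c):
    codepoint = ord(c)
    if c == "'":
        # No HTML 4 entity name exists, so use a numeric reference.
        return "&#39;"
    name = codepoint2name.get(codepoint)
    if name:
        # Covers named characters, including & < > and the double quote.
        return "&%s;" % name
    if codepoint >= 128 or codepoint < 32:
        return "&#%d;" % codepoint
    return c


def html_encode(s):
    return "".join(encodechar(c) for c in s)


search_term = html_encode("Jubil\u00e4umsmodell 40 Jahre")
```

The resulting search_term uses "&auml;", the same form as the value
stored in your database.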

> And no, I'm not going to put the info in the database in
> utf-8, that must remain html quoted for many other locations where it
> goes wrong otherwise.

Your choice.  Actually though, MySQL handles Unicode brilliantly,
and if you can ever convert to it you'll make things a lot
easier on yourself in the future.
-- 
Deron Meranda