[mod_python] Unicode convenience functions

Sun Nov 5 13:09:22 EST 2006

Because mod_python was unicode-crippled last time I checked, I first wrote
error-prone code, then added a gazillion .encode('utf8')s, and then finally
wrote these helper functions and refactored my code to use them.

To me they're a lot more comfortable than bare mod_python behaviour,
particularly since I'm writing something that occasionally needs to
handle unicode for both input and output. I'm posting them in case
they are useful to someone else.

It uses UTF8, period. It should perhaps be configurable, set on the
req object with utf8 merely as the default, but even hardcoded
it's fairly sane: aside from the UTF family (utf8, utf16, utf32) and
gb18030, nothing encodes anywhere near enough codepoints to be considered
a practical encoding for unicode-as-in-*unicode*.

The major thing here is wrapping req.write with a filtering function.
On the handler side of things this means you add one line
and can forget the trouble existed:

def utf8write(req,s):
    """ Outputs str objects unchanged, unicode as UTF8.
        utf8writer returns a wrapper around this function, which you
        should assign to your req.write.
        Currently str()s anything that is not a string, which is probably
        a little too dynamic and not a real feature, but useful for
        debugging your apps. """
    if isinstance(s,str):
        req.oldwrite(s)
    elif isinstance(s,unicode):
        req.oldwrite(s.encode('utf8'))
    else:
        req.oldwrite( str(s) )
        #a stricter version would instead do:
        #raise TypeError('req.write only takes strings')

def utf8writer(req,mime='text/html;charset=utf-8'):
    """ You can use this to replace req.write with a unicode-capable
        writer (using utf8).
        Use by putting the following in the handler before any writing:
          req.write = utils.utf8writer(req)
        Sets content_type to HTML using charset utf-8 (note the dash!).
        If you want something else, use e.g.
          req.write = utils.utf8writer(req,mime='text/plain;charset=utf-8')
        Since utf8 is currently hardcoded, you always need that charset bit.
        Call it only once per request: a second call would store the
        wrapper itself in req.oldwrite and recurse forever.
    """
    req.content_type=mime  #only set here to keep this a single line in your code
    req.oldwrite=req.write #keep a reference to the actual writer around
    return lambda s: utf8write(req,s)
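
For trying the wrapping idea outside Apache, something like the following
may help. It is only a sketch: FakeRequest is a made-up stand-in (not
mod_python's request object), utf8writer_closure is a hypothetical variant
name, and text_type is an alias introduced so the block runs on Python 2.6+
and 3. The variant keeps the original writer in a closure rather than on
req.oldwrite, so calling it twice does not recurse.

```python
text_type = type(u'')  # assumption: unicode on Python 2, str on Python 3


class FakeRequest(object):
    """Hypothetical stand-in for the request object; records what is written."""

    def __init__(self):
        self.content_type = None
        self.chunks = []

    def write(self, data):
        # Assumption for this sketch: only byte strings are accepted.
        if not isinstance(data, bytes):
            raise TypeError('write only takes byte strings')
        self.chunks.append(data)


def utf8writer_closure(req, mime='text/html;charset=utf-8'):
    """Variant of utf8writer that remembers the old writer in a closure."""
    req.content_type = mime
    oldwrite = req.write  # the real writer, captured once

    def write(s):
        if isinstance(s, text_type):
            s = s.encode('utf8')             # text is encoded as UTF8
        elif not isinstance(s, bytes):
            s = text_type(s).encode('utf8')  # debugging convenience, as in utf8write
        oldwrite(s)                          # bytes pass through unchanged

    return write


req = FakeRequest()
req.write = utf8writer_closure(req)
req.write(u'caf\xe9 ')   # unicode gets encoded
req.write(b'<br>')       # bytes pass through
req.write(42)            # anything else is stringified
output = b''.join(req.chunks)
```

After this, output holds the UTF8 bytes of everything written and
req.content_type carries the charset, which is all the handler has to care
about.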

Because I'm bad and py.xml is misdesigned (string based rather than
object based, which bites one in the arse in this sort of case),
I write my html as strings and so need functions for url encoding,
basically drop-in replacements for urllib.quote and urllib.urlencode.

import urllib

def utf8quote(s):
    """ Returns string as url-encoded UTF8 bytes
        (that is, urllib.quote(s.encode('utf8')) ) """
    return urllib.quote(s.encode('utf8'))

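def utf8dictquote(d,joinOn='&amp;'):
    """ Acts like urllib.urlencode (url encode for dict) but encodes
        vars and vals as utf8. Joins on '&amp;' by default, which suits
        pasting the result into HTML; pass joinOn='&' for a bare URL. """
    parts=[]
    for var in d:
        val=d[var]
        #note: unicode() assumes ASCII when given a str, so non-ASCII
        #keys and values should already be unicode objects
        if type(var) != unicode:
            var=unicode(var)
        if type(val) != unicode:
            val=unicode(val)
        parts.append( '%s=%s'%(utf8quote(var), utf8quote(val)) )
    return joinOn.join(parts)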
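
To see what these two produce, here is a small self-contained sketch. The
compact re-definitions, the text_type alias and the import fallback are only
there so the block runs on its own on Python 2.6+ and 3; the /search URL is
made up. One difference from urllib.urlencode worth knowing: urlencode goes
through quote_plus and turns spaces into '+', while this gives '%20'.

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

text_type = type(u'')  # unicode on Python 2, str on Python 3


def utf8quote(s):
    return quote(s.encode('utf8'))


def utf8dictquote(d, joinOn='&amp;'):
    parts = []
    for var in sorted(d):  # sorted only to keep the demo output stable
        parts.append('%s=%s' % (utf8quote(text_type(var)),
                                utf8quote(text_type(d[var]))))
    return joinOn.join(parts)


query = utf8dictquote({u'q': u'caf\xe9 au lait', u'page': 2})
# query == 'page=2&amp;q=caf%C3%A9%20au%20lait'
link = u'<a href="/search?%s">next</a>' % query  # hypothetical URL
```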

And, because I like to be robust to input: some browsers may still send
form values in the outdated but once-standard latin1.
Actually, the reason I did this is not so much forms as the fact
that characters typed into the browser's location bar got encoded as
latin1 anyway, even when the page (and possibly browser) default clearly
wasn't latin1.

def getfirst_unicode(form,var,ifAbsent=None):
    """ Like form.getfirst(), but decodes utf8 (tries latin1 if that fails).
        Returns what you pass in as ifAbsent if there is no such
        variable in the form *OR* if it didn't decode nicely.
        (ifAbsent is None by default, but making it u'' or 0 may be
        convenient for you) """
    s=form.getfirst(var)
    if s is None:
        return ifAbsent
    s=utf8_or_latin1_to_unicode(s)
    if s is None:
        return ifAbsent
    return s

The ifAbsent argument allows me to handle absent parameters quickly:
  r = getfirst_unicode(form,'regular')
  i = int(getfirst_unicode(form,'amount',0))
  s = getfirst_unicode(form,'strrring',u'') #, etc.

The integer case would otherwise be something like:
  i = form.getfirst('amount')
  if i is None:
     i=0
  i=int(i)
...and I got tired of typing about five cases like that whenever I
wanted to be robust to bad users.
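
Note that int() around getfirst_unicode still raises ValueError when the
parameter is present but not a number. If that matters, a small wrapper
along these lines could cover it. getfirst_int and FakeForm are hypothetical
names for this sketch (FakeForm just mimics form.getfirst over a dict of
byte strings), and the decoding is inlined so the block runs on its own.

```python
class FakeForm(object):
    """Hypothetical stand-in for a form object: a dict of byte strings."""

    def __init__(self, fields):
        self.fields = fields

    def getfirst(self, name, default=None):
        return self.fields.get(name, default)


def getfirst_int(form, var, ifAbsent=0):
    """Like int(getfirst_unicode(...)), but also falls back on non-numeric input."""
    raw = form.getfirst(var)
    if raw is None:
        return ifAbsent
    try:
        text = raw.decode('utf8')
    except UnicodeDecodeError:
        text = raw.decode('latin1')
    try:
        return int(text)
    except ValueError:
        return ifAbsent      # present, but not a number


form = FakeForm({'amount': b'12', 'broken': b'twelve'})
present = getfirst_int(form, 'amount')       # 12
missing = getfirst_int(form, 'nothere', 5)   # 5
garbage = getfirst_int(form, 'broken')       # 0
```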

The utf8-or-latin1 function mentioned is fairly trivial:

def utf8_or_latin1_to_unicode(s):
    """ Tries to decode a string as utf8 first, then as Latin1 (iso8859-1).
        Returns None if both fail, though the Latin1 step accepts any
        byte string, so in practice that does not happen."""
    try:
        return s.decode('utf8')
    except UnicodeDecodeError:
        try:
            return s.decode('latin1')
        except UnicodeDecodeError: #kept for safety; latin1 maps all 256 byte values
            return None

An alternative is replacing all undecodable bytes, e.g. with
s.decode('utf8','replace'), but you probably don't want that to
happen without being told, so you get to do that yourself...
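
A quick way to see the fallback and the 'replace' alternative side by side,
as a sketch that runs on Python 2.6+ and 3. The function body is repeated
here without the inner try, on the understanding that latin1 decoding
accepts every byte value. One caveat: latin1 input whose bytes happen to
form valid utf8 gets read as utf8, which is rare for real text but possible.

```python
def utf8_or_latin1_to_unicode(s):
    try:
        return s.decode('utf8')
    except UnicodeDecodeError:
        return s.decode('latin1')  # every byte value is a latin1 character


from_utf8 = utf8_or_latin1_to_unicode(b'caf\xc3\xa9!')   # valid utf8: u'caf\xe9!'
from_latin1 = utf8_or_latin1_to_unicode(b'caf\xe9!')     # fallback:   u'caf\xe9!'
replaced = b'caf\xe9!'.decode('utf8', 'replace')         # u'caf\ufffd!'
```

Both inputs end up as the same text, whereas 'replace' silently swaps the
offending byte for U+FFFD.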

Comments are welcome.
It's possible I broke some code; I was editing it in the gmail composer :)

Cheers,
--Bart Alewijnse