Bart
scarfboy at gmail.com
Sun Nov 5 13:09:22 EST 2006
Because m_p was unicode-crippled last time I checked, I first wrote error-prone code, then added a gazillion .encode('utf8')s, then finally wrote these helper functions and refactored for them. To me they're a lot more comfortable than bare mod_python behaviour, particularly since I'm writing something that occasionally needs to handle unicode both for input and output, so I'm posting this in case it is useful to someone else.

It uses UTF8, period. It should perhaps be configurable, set on the req object and have utf8 merely as a default, but even hardcoded it's fairly sane, since aside from utf16 and perhaps gb18030 nothing actually encodes even nearly enough codepoints to be considered a practical encoding for unicode-as-in-*unicode*.

The major thing here is wrapping req.write with a filtering function. On the handler side of things this means you add one line and can forget the trouble existed:

    def utf8write(req,s):
        """ Outputs str type()s unchanged, unicode as UTF8.
            The thing that gets returned from utf8writer and you should
            assign to your req.write.

            Currently str()s anything not a string, which is probably a
            little too dynamic and not a real feature, but useful for
            debugging your apps.
        """
        #raise TypeError('req.write only takes strings')
        if type(s)==str:
            req.oldwrite(s)
        elif type(s)==unicode:
            req.oldwrite(s.encode('utf8'))
        else:
            req.oldwrite( str(s) ) #a stricter version would probably raise a ValueError

    def utf8writer(req,mime='text/html;charset=utf-8'):
        """ You can use this to replace req.write with a unicode-capable
            writer (using utf8). Use by putting the following in the
            handler before any writing:
               req.write = utils.utf8writer(req)

            Sets content_type to HTML using charset utf-8 (note the dash!).
            If you want something else, use e.g.
               req.write = utils.utf8writer(req,mime='text/plain;charset=utf-8')
            Since utf8 is currently hardcoded, you always need that charset bit.
        """
        req.content_type=mime   #note this is only here to keep this a single line in your code
        req.oldwrite=req.write  #keep reference to the actual writer around
        return lambda s: utf8write(req,s)

Because I'm bad and py.xml is misdesigned (string based rather than object based, which bites one in the arse in this sort of case), I write my html as strings and so need functions for url encoding, basically drop-in replacements for urllib.quote and urllib.urlencode.

    import urllib

    def utf8quote(s):
        """ Returns string as url-encoded UTF8 bytes
            (that is, urllib.quote(s.encode('utf8')) )
        """
        return urllib.quote(s.encode('utf8'))

    def utf8dictquote(d,joinOn='&'):
        """ Acts like urllib.urlencode (url encode for dict) but
            encodes vars and vals as utf8
        """
        parts=[]
        for var in d:
            val=d[var]
            if type(var) != unicode:
                var=unicode(var)
            if type(val) != unicode:
                val=unicode(val)
            parts.append( '%s=%s'%(utf8quote(var), utf8quote(val)) )
        return joinOn.join(parts)

And, because I like to be robust to input and some browsers may still send form values in the outdated but once standard latin1, there is a getfirst that decodes. Actually, the reason I did this is not so much forms, but the fact that characters added to the browser's location bar got encoded as latin1 even when the page (and possibly browser) default clearly wasn't.

    def getfirst_unicode(form,var,ifAbsent=None):
        """ like form.getfirst(), and decodes utf8 (tries latin1 if that fails).
            Returns what you pass it in ifAbsent if there is no such variable
            in the form *OR* if it didn't decode nicely.
            (ifAbsent is None by default, but making it u'' or 0 may be
            convenient for you)
        """
        s=form.getfirst(var)
        if s is None:
            return ifAbsent
        s=utf8_or_latin1_to_unicode(s)
        if s is None:
            return ifAbsent
        return s

The ifAbsent allows me to handle absence of parameters quickly:

    r = getfirst_unicode(form,'regular')
    i = int(getfirst_unicode(form,'amount',0))
    s = getfirst_unicode(form,'strrring',u'')
    #, etc.

The integer case would otherwise be something like:

    i = form.getfirst(var)
    if i is None:
        i=0
    i=int(i)

...and I got tired of typing about five cases like that whenever I wanted to be robust to bad users.

The utf8-or-latin1 function mentioned is fairly trivial:

    def utf8_or_latin1_to_unicode(s):
        """ Tries to decode a string as utf8 first, then as Latin1 (iso8859-1).
            Returns None if both fail."""
        try:
            return s.decode('utf8')
        except UnicodeDecodeError:
            try:
                return s.decode('latin1')
            except UnicodeDecodeError: #I believe this *is* technically possible
                return None

An alternative is replacing all high bytes, but you probably don't want that to happen without being told, so you get to do that yourself...

Comments are welcome. It's possible I broke some code, I was editing it in the gmail composer :)

Cheers,
--Bart Alewijnse
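
A note on the UTF-8-then-Latin-1 fallback, as a minimal sketch rather than part of the posted helpers: assuming Python 3, where raw form values would arrive as bytes, the same idea could look like the following. The name decode_utf8_or_latin1 is made up for illustration. Python's latin-1 codec maps all 256 byte values to code points, so in this sketch the second decode does not fail and there is no None case to handle.

```python
def decode_utf8_or_latin1(raw):
    """Decode bytes as UTF-8, falling back to Latin-1.

    Hypothetical Python 3 helper, not one of the functions from the post.
    """
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # The latin-1 codec accepts every byte value 0x00-0xFF,
        # so this fallback cannot raise UnicodeDecodeError.
        return raw.decode('latin-1')


# The same character (e with acute accent) sent in two encodings:
as_utf8 = decode_utf8_or_latin1(b'caf\xc3\xa9')    # valid UTF-8
as_latin1 = decode_utf8_or_latin1(b'caf\xe9')      # invalid UTF-8, valid Latin-1
```

The catch is that bytes which are neither valid UTF-8 nor really Latin-1 still decode to something, so the fallback hides bad input rather than reporting it.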