Anthony L.
anthony at ataribaby.org
Wed Nov 23 16:33:25 EST 2005
Thanks for the suggestions. I am looking into the xml.sax.saxutils modules Jim mentioned as well as Python's HTML parsing module. I recalled an earlier technique I used for blocking undesirable characters using JavaScript, and came up with the following: x = 'U at s%er#_N$a^m!e%-<' set = '0123456789 _- abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.' print ''.join( [c for c in x if c in set] ) This is very simple, easily extensible by customizing the set variable. It allows me to keep in all my allowable characters including underscores and dashes. If I wanted to use this to sanitize a user input for an email address, I could add the period and amphora to the set variable. This feels pretty fast, though I can't quantify speed, and I'm tempted to use it everywhere I don't need to be conscious of markup code and SQL injection. But is this a good Pythonic way of doing things? Also, unicode. I'd like to allow input using characters from the Latin-1 set, but I can't figure out how. I did the following: x = u'Üsernåmë' set = u'åëüÜabcdefghijklmnopqrstuvwxyz' print ''.join( [c for c in x if c in set] ) But it's not enough and will return a UnicodeEncodeError. Of course, it's probably not even proper to include the actual decoded characters within the variable. Can someone point me to an exhaustive source for unicode on python that has many examples? Anthony
|