[SPAM] Re: [mod_python] Sanitizing user input... but not totally.

Wed Nov 23 16:33:25 EST 2005

Thanks for the suggestions. I am looking into the xml.sax.saxutils  
modules Jim mentioned as well as Python's HTML parsing module.

I recalled an earlier technique I used for blocking undesirable  
characters using JavaScript, and came up with the following:

	x = 'U at s%er#_N$a^m!e%-<'

	set = '0123456789 _- 
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.'

	print ''.join( [c for c in x if c in set] )

This is very simple, easily extensible by customizing the set  
variable. It allows me to keep in all my allowable characters  
including underscores and dashes. If I wanted to use this to sanitize  
a user input for an email address, I could add the period and amphora  
to the set variable. This feels pretty fast, though I can't quantify  
speed, and I'm tempted to use it everywhere I don't need to be  
conscious of markup code and SQL injection. But is this a good  
Pythonic way of doing things?

Also, unicode. I'd like to allow input using characters from the  
Latin-1 set, but I can't figure out how. I did the following:

	x = u'Üsernåmë'

	set = u'åëüÜabcdefghijklmnopqrstuvwxyz'

	print ''.join( [c for c in x if c in set] )

But it's not enough and will return a UnicodeEncodeError. Of course,  
it's probably not even proper to include the actual decoded  
characters within the variable. Can someone point me to an exhaustive  
source for unicode on python that has many examples?

Anthony