[mod_python] Problem with PSP and unicode

Thu Feb 16 12:16:04 EST 2006

Hi Nicolas, as usual you're right on the money. I meant to post this
yesterday, but I always forget to check the To: field when I reply and it
always ends up going somewhere other than the list (to Gustavo this time,
sorry about that Gustavo.)

Because of the somewhat unusual way in which I use psp, I'm writing my own
implementation right now. I treat psp files as templates and nest them, and
this unfortunately writes the templates from the innermost to the outermost,
which is not what I want. I've been getting around this by replacing
req.write with a buffered function (only while running a psp template) and
then setting it back when I'm done. This, combined with the unicode issues
and some issues with the way indentation is handled, has persuaded me it's
time to do things myself so I can make sure it works for me.
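
A minimal sketch of the req.write workaround described above, written for
current Python. FakeRequest, buffered_write and render_nested are
hypothetical names used only for illustration; the real mod_python request
object is not modelled beyond its write() method.

```python
import io
from contextlib import contextmanager


class FakeRequest:
    """Hypothetical stand-in for a request object; only write() is modelled."""

    def __init__(self):
        self.sent = []

    def write(self, data):
        self.sent.append(data)


@contextmanager
def buffered_write(req):
    """Temporarily replace req.write with a buffer, then restore it."""
    original_write = req.write
    buf = io.StringIO()
    req.write = buf.write  # the instance attribute shadows the method
    try:
        yield buf
    finally:
        req.write = original_write  # always put the real write back


def render_nested(req):
    # The inner template runs first, but its output is captured, not sent.
    with buffered_write(req) as inner:
        req.write("<p>inner</p>")
    # The outer template can now place the captured text where it belongs.
    req.write("<html>" + inner.getvalue() + "</html>")
```

The try/finally is there so the original write comes back even if the inner
template raises.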

The way I'm handling the issues with encoding is by simply adding two
optional arguments (input and output encoding) to the constructor that both
default to utf-8. I thought about auto detecting the input encoding, but I'd
rather not force that. You can always do an auto detect yourself and pass
the result as the input encoding anyway.
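
As a rough illustration of that constructor, assuming current Python: the
Template class and the sniff_bom helper below are hypothetical names, not
part of mod_python, and the detection shown only looks at a byte order mark.

```python
import codecs


class Template:
    """Hypothetical template class, not the mod_python PSP API."""

    def __init__(self, source, input_encoding="utf-8", output_encoding="utf-8"):
        # codecs.lookup raises LookupError early for an unknown encoding name
        # and gives back a normalized name.
        self.input_encoding = codecs.lookup(input_encoding).name
        self.output_encoding = codecs.lookup(output_encoding).name
        self.source = source  # raw bytes of the template


def sniff_bom(data, default="utf-8"):
    """Very rough caller-side detection: checks for a UTF-8 or UTF-16 BOM only."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return default


# Detection stays outside the class; the caller passes the result in.
raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
template = Template(raw, input_encoding=sniff_bom(raw))
```

Keeping detection in the caller means the class never guesses, which matches
the intent of not forcing auto detection on anyone.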

With the same input and output encoding, no conversion is done on the
static content, because I encode static content directly into the output
encoding. I've also done things in such a way that regular strings can be
written as utf-8 without any conversion as well. For now I do this by
assuming the string is ascii and therefore valid utf-8. I'm 90% sure that's
fine, because passing a string in any other encoding is just looking for
trouble. There's no way I can guess what it's encoded as, so I can't fix it
or raise an error. This puts the responsibility for encoding regular strings
on the developer, where it should be. For any other object I convert to
unicode and then encode it with the output encoding. This ensures
__unicode__ is called in preference to __str__.
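
The rules above could be sketched roughly as follows. This is written for
current Python, where bytes plays the role of the old regular string and str
plays the role of unicode; encode_static and encode_value are hypothetical
names, and passing bytes through for every output encoding is a
simplification of the utf-8 case described above.

```python
import codecs


def encode_static(raw, input_encoding="utf-8", output_encoding="utf-8"):
    """Static template bytes: skip the round trip when both encodings match."""
    if codecs.lookup(input_encoding).name == codecs.lookup(output_encoding).name:
        return raw
    return raw.decode(input_encoding).encode(output_encoding)


def encode_value(value, output_encoding="utf-8"):
    """Dynamic values written from template code."""
    if isinstance(value, bytes):
        # Already-encoded data is passed through untouched; the caller is
        # responsible for it being valid in the output encoding.
        return value
    if not isinstance(value, str):
        value = str(value)  # in Python 2 this step would be unicode(value)
    return value.encode(output_encoding)
```

For example, encode_value(15) goes through str() first and is then encoded,
while encode_value(b"abc") is returned as-is.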

My trouble currently is how to implement the encoding and decoding. The only
part I have working is converting the unicode objects to the output encoding
(by using PyString_AsEncodedObject). I'd like to do the conversion using
python's encoding/decoding abilities and without creating redundant extra
copies (like those created by copying a char buffer into a python string and
then converting.)

I'd love to hear what you think about this. I will send you my code when
it's finished and you can pull it apart and reuse the bits you like.

-Dan

On 2/16/06, Dan Eloff <dan.eloff at gmail.com> wrote:
>
> Gustavo, here's an example. Suppose some code enforces a maximum length on
> a string. If it's counting on a default encoding of 1 byte per char, it
> might do something like len(s) <= 15. For ascii or iso-8859-1 this would
> work. Or the code might use indices or slices (and a lot of code does!). If
> suddenly you have utf-8 encoded Chinese, your string is going to triple in
> length, and those functions will have unpredictable behaviour. You could
> think of any number of scenarios, even in the python library. I just
> wouldn't feel comfortable about changing the default encoding; you never
> know where it will come back to haunt you. What's so hard about using
> unicode strings in your program and then encoding when you send output
> somewhere?
>
> Of course, as we've noticed in this thread, there is third party code
> that just isn't unicode safe, like psp in mod_python, and that will break
> if you don't mess with the default encoding. It all boils down to seeing
> what will work best in your situation. For myself, I would rather do it the
> right way up front in the hopes of saving hassles down the road.
>
> -Dan
>
> On 2/16/06, Gustavo Córdova Avila <gustavo.cordova at q-voz.com> wrote:
> >
> > Dan Eloff wrote:
> >
> > > Gustavo
> > >
> > > "So, if you can configure your default encoding"
> > >
> > > That would be a bad idea. You don't want to force the interpreter to
> > > use a different encoding, it could cause you all manner of grief with
> > > code that wasn't written for that, and if it's third party code you
> > > can't do much about anything that does break.
> >
> >
> > Hi Dan, thanks for replying.
> >
> > Do you have any examples of the above?  I haven't had any troubles
> > whatsoever using a default encoding, even when the script source is
> > different from the default encoding (I always use the "-*- coding: xxx
> > -*-" bit at the top of the source file), so while I don't doubt at all
> > what you're saying, having a practical case where fiddling with the
> > default encoding breaks things is good.
> >
> > Good morning, y'all.
> >
> > -gus
> >
> >
> >
> >
>