[mod_python] Problem with PSP and unicode

Thu Feb 16 14:46:48 EST 2006

2006/2/16, Dan Eloff <dan.eloff at gmail.com>:
> With the same input and output encoding, there aren't any conversions done to
> the static content, because I encode static content directly into the output
> encoding. I've also done things in such a way that regular strings can be
> encoded as utf-8 without any conversion as well. For now I do this by
> assuming the string is ascii and therefore valid utf-8. I'm 90% sure that's
> fine, because passing a string in any other encoding is just looking for
> trouble, there's no way I can guess what it's encoded as so I could fix it
> or raise an error, this puts the responsibility for encoding regular strings
> on the developer, where it should be. But for any other object I convert to
> unicode and then encode it with the output encoding. This ensures
> __unicode__ is called preferentially over __str__.
>
> My trouble currently is how to implement the encoding and decoding, the only
> part I have working is converting the unicode objects to the output encoding
> (by using PyString_AsEncodedObject). I'd like to do the conversion using
> Python's encoding/decoding abilities and without creating redundant extra
> copies (like those created by copying a char buffer into a python string and
> then converting.)
>
> I'd love to hear what you think about this, I will send you my code when
> it's finished and you can pull it apart and reuse the bits you like.
>
> -Dan
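
For reference, the conversion rules described in the quoted message can
be sketched roughly as follows. This is only an illustration in Python 3
syntax, where bytes stands in for the "regular string" and str for the
unicode object. The function name to_output_bytes and its default
encoding are made up for this example and are not part of PSP.

```python
def to_output_bytes(value, output_encoding="utf-8"):
    """Turn a value into bytes ready to be written to the response.

    Hypothetical helper, not part of mod_python or PSP.
    """
    if isinstance(value, bytes):
        # Already-encoded string: passed through untouched. Getting the
        # encoding right is the developer's responsibility.
        return value
    if not isinstance(value, str):
        # Any other object: convert to text first, then encode.
        # (Python 3 str() plays the role of Python 2 unicode() here.)
        value = str(value)
    return value.encode(output_encoding)
```

Passing byte strings through unchanged is what makes the "same input and
output encoding" case free of conversions for static content.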

If you are using the C API, I don't think there is a more efficient
way than using PyString_AsEncodedObject. Just try not to think about
the various extra objects that are created in the process...

In-place unicode to string conversion is quite difficult to achieve,
because in the general case you may end up with more bytes than
unicode characters. The Java NIO library neither encodes nor decodes
in place, and given the effort they put into avoiding data copies, I
think it's pretty safe to say that it's not easily feasible.
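
To make the size problem concrete, here is a small sketch in Python 3
syntax (where str is the unicode type). The helper name encoded_growth
is made up for this example. It just compares the number of characters
with the number of bytes after encoding.

```python
def encoded_growth(text, encoding="utf-8"):
    """Return (number of characters, number of encoded bytes)."""
    return len(text), len(text.encode(encoding))


# Pure ASCII text keeps the same length once encoded as UTF-8...
print(encoded_growth(u"abc"))
# ...but accented letters and symbols need two or three bytes each.
print(encoded_growth(u"caf\u00e9 \u20ac"))
```

Since the encoded form can be longer than the source, a buffer sized for
the characters cannot in general be reused to hold the bytes.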

I really think you should first worry about getting the input and
output encodings right, then profile your code to see where the
performance problems are. Most likely, you'll find that they are in
mod_python rather than in the psp module ;).

To me, the fact that PSP is implemented in C is a perfect case of
premature optimisation. It's quite a complicated piece of C code,
parts of it being generated by flex. As a result, it's difficult to
maintain, and solving the Unicode issues in this context could be quite
difficult. And it's not as if it has been proven that writing this
module in pure C is really useful as far as overall performance is
concerned.

If it were up to me, I would reimplement it in pure Python, and I
don't think the performance loss would be so big. Granted, compilation
would be a little slower, but once the PSP is compiled, performance
should be exactly the same - and implementing a compiler cache is
extremely easy. Plus, with the code being easier to maintain, we could
easily optimize it. Call it "doing it the PyPy way" if you like ;).
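
As a rough idea of what such a compiler cache could look like in pure
Python, here is a minimal sketch. Everything in it is hypothetical:
translate_psp is a placeholder for a real PSP-to-Python translator, and
get_compiled and _cache are names invented for this example. It assumes
that the file's modification time is a good enough invalidation key.

```python
import os

# path -> (modification time, compiled code object)
_cache = {}


def translate_psp(source):
    """Placeholder for a real PSP-to-Python translator (hypothetical).

    Here the source is assumed to be plain Python already.
    """
    return source


def get_compiled(path):
    """Return a code object for path, recompiling only if the file changed."""
    mtime = os.path.getmtime(path)
    entry = _cache.get(path)
    if entry is not None and entry[0] == mtime:
        return entry[1]  # cache hit: no parsing, no compilation
    with open(path) as f:
        source = f.read()
    code = compile(translate_psp(source), path, "exec")
    _cache[path] = (mtime, code)
    return code
```

Once a page is in the cache, serving it only costs a dictionary lookup
and a stat call before the code object is executed.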

Regards,
Nicolas