[mod_python] Problem with PSP and unicode

Thu Feb 16 15:17:21 EST 2006

"in-place unicode to string conversion is quite difficult to achieve"

Oh, you're right about that, and I'd never attempt it. I was merely hoping
to avoid redundant copies, like creating a Python string by copying the
character buffer byte for byte when the buffer could be reused.

You're right, of course, about doing the psp module in C or C++. Now that I
think about it, I don't know why I am rewriting it in C++ instead of Python,
other than because mod_python did that. Now I feel stupid, lol. The parsing
would be much faster in C++, but that means very little unless you have to
reparse frequently (which depends on how often the Apache process gets
killed and restarted). Parsing the psp files is probably the most expensive
part of my website initialization, simply because I have so many psp pages
(and the number is growing). I haven't profiled it, but even just reading in
all those files should be more expensive than the rest of the
initialization process.
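
A minimal sketch of how reparsing could be avoided within one process,
assuming Python 3 and a pure-Python psp module. The names `load_compiled`
and `translate` are hypothetical and are not part of mod_python; `translate`
stands for whatever function turns template text into Python source.

```python
import os

_cache = {}  # path -> (mtime, compiled code object)


def load_compiled(path, translate):
    """Return a code object for the template at `path`.

    The template is recompiled only when the file's modification time
    differs from the one recorded in the cache.
    """
    mtime = os.path.getmtime(path)
    cached = _cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    with open(path, encoding="utf-8") as f:
        source = f.read()
    code = compile(translate(source), path, "exec")
    _cache[path] = (mtime, code)
    return code
```

This cache lives in memory, so it is lost whenever the process is
restarted. Surviving a restart would need the compiled code written to disk
(for example with the standard `marshal` module), which is not shown here.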

The flex business was a mess. I was planning to just ignore it all and
hand-code the parser. I only use a subset of the psp features anyway, so I
only need support for <%%> and <%=%>.
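
A rough sketch of what such a hand-coded parser might look like in Python 3,
handling only the two tag forms. `psp_to_python` and `render` are
hypothetical names, and the sketch deliberately ignores indentation
tracking, so code blocks are limited to straight-line statements.

```python
import re

# Matches <%= expr %> or <% code %>; group 1 is "=" for expression tags.
_TAG = re.compile(r"<%(=?)(.*?)%>", re.DOTALL)


def psp_to_python(source):
    """Translate a template into Python source that calls write()."""
    lines = []
    pos = 0
    for match in _TAG.finditer(source):
        static = source[pos:match.start()]
        if static:
            # Static text becomes a write() call on a string literal.
            lines.append("write(%r)" % static)
        is_expr = match.group(1) == "="
        body = match.group(2).strip()
        if body and is_expr:
            lines.append("write(%s)" % body)
        elif body:
            # Assumption: no nested blocks, so every line is flush left.
            lines.extend(line.strip() for line in body.splitlines())
        pos = match.end()
    if source[pos:]:
        lines.append("write(%r)" % source[pos:])
    return "\n".join(lines)


def render(source, **namespace):
    """Run a template and return the text it writes."""
    out = []
    namespace["write"] = lambda value: out.append(str(value))
    exec(compile(psp_to_python(source), "<psp>", "exec"), namespace)
    return "".join(out)
```

For example, `render("Hello <%= name %>!", name="Dan")` runs the generated
code with `name` in scope and collects what it writes.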

The C++ code will still run slightly faster I imagine, mostly because the
write function is then implemented natively and doesn't have the overhead of
python function calls inside the function. Big deal, it wouldn't even be
noticed in the face of all the other things going on in serving up a page in
mod_python and in running the psp and database queries etc.
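
For comparison, a pure-Python write function following the rules described
in the quoted message below could be sketched like this. It assumes
Python 3, where `bytes` and `str` stand in for the byte strings and unicode
objects discussed in this thread; `make_writer` and `chunks` are
hypothetical names.

```python
def make_writer(chunks, encoding="utf-8"):
    """Return a write() function that appends encoded output to `chunks`."""

    def write(value):
        if isinstance(value, bytes):
            # Assumed to be in the output encoding already: no conversion.
            chunks.append(value)
        elif isinstance(value, str):
            chunks.append(value.encode(encoding))
        else:
            # Any other object is converted to text first, then encoded.
            chunks.append(str(value).encode(encoding))

    return write
```

Joining `chunks` with `b"".join(chunks)` then yields the page body in the
output encoding.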

So I'm going to leave the C++ code now, and do this in Python. I'm just
annoyed at myself for having prototyped everything in C++ before writing it
in Python. At least it shouldn't take me long.

Thanks a lot for your input!

-Dan

On 2/16/06, Nicolas Lehuen <nicolas at lehuen.com> wrote:
>
> 2006/2/16, Dan Eloff <dan.eloff at gmail.com>:
> > With the same input and output encoding, no conversion is done to the
> > static content, because I encode static content directly into the output
> > encoding. I've also done things in such a way that regular strings can be
> > encoded as utf-8 without any conversion as well. For now I do this by
> > assuming the string is ascii and therefore valid utf-8. I'm 90% sure
> > that's fine, because passing a string in any other encoding is just
> > looking for trouble: there's no way I can guess what it's encoded as, so
> > I can't fix it or raise an error. This puts the responsibility for
> > encoding regular strings on the developer, where it should be. But for
> > any other object I convert to unicode and then encode it with the output
> > encoding. This ensures __unicode__ is called preferentially over __str__.
> >
> > My trouble currently is how to implement the encoding and decoding. The
> > only part I have working is converting the unicode objects to the output
> > encoding (by using PyString_AsEncodedObject). I'd like to do the
> > conversion using python's encoding/decoding abilities and without
> > creating redundant extra copies (like those created by copying a char
> > buffer into a python string and then converting).
> >
> > I'd love to hear what you think about this, I will send you my code when
> > it's finished and you can pull it apart and reuse the bits you like.
> >
> > -Dan
>
> If you are using the C API, I don't think there are more efficient
> ways than using PyString_AsEncodedObject and trying not to think about
> the various extra objects that are created in the process...
>
> In-place unicode to string conversion is quite difficult to achieve,
> because in the general case you may end up with more bytes than
> unicode characters. The Java NIO library neither encodes nor decodes
> in-place, and given the efforts they made to remove data copies, I think
> it's pretty safe to say that it's not easily feasible.
>
> I really think you should first worry about having the input and
> output coding right, then profile your code to see where the
> performance problems are. Most likely, you'll find that they are in
> mod_python rather than in the psp module ;).
>
> To me, the fact that PSP is implemented in C is a perfect case of
> premature optimisation. It's a quite complicated piece of C code,
> parts of it being generated by flex. As a result, it's difficult to
> maintain and solving the Unicode issues in this context could be quite
> difficult. And it's not like it's proven that writing this module in
> pure C is really useful as far as overall performance is concerned.
>
> If it were up to me, I would reimplement it in pure Python, and I
> don't think the performance loss would be so big. Granted, compilation
> would be a little slower, but once the PSP is compiled, performance
> should be exactly the same - and implementing a compiler cache is
> extremely easy. Plus, with the code being easier to maintain, we could
> easily optimize it. Call it "doing it the PyPy way" if you like ;).
>
> Regards,
> Nicolas
>