[mod_python] Background threads in mod_python

Mon Jul 31 15:09:09 EDT 2006

On 7/31/06, Christian Gross <christianhgross at gmail.com> wrote:
> Mike Looijmans wrote:
> > It really depends on WHAT you intend to do. I can think of no reason
> > whatsoever to start a thread in the handler that somehow "survives"
> > the request. Once that thread has created some "answer", where can it
> > send that answer?
> >
> There are lots of needs for this in an Ajax world. For example Jetty
> (http://www.mortbay.com/MB/log/gregw/?permalink=ScalingConnections.html),
> Apache 2.2, and IIS for a few years, can do asynchronous processing. There
> are two ways to do asynchronous processing. The first is to lock onto
> the request and hold it and move it into a "secondary" processing area.
> The second is to start a task and then ask if any data has been
> generated. The running task will store in a cache that is then picked up
> by another request at another time.
>
> The big idea here is to mimic an architecture where the server "calls"
> the client, which is very popular in Ajax.

I am still failing to see why Ajax makes this different.  And I don't
understand what you mean by the server "calling" the client.

I think you may be influenced too much by the "Java way".  Apache httpd is a
web server.  It is not an all-encompassing, kitchen-sink-included
application runtime environment.  The core function of Apache is to
service individual HTTP requests.

HTTP does not have "asynchronous" calls as you're using that term.
Each HTTP request is synchronous and self-contained.  And each HTTP
request is (mostly) independent of any others.

If you want to have decidedly non-HTTP semantics, such as having one
HTTP request start up a background activity, and have a later HTTP
request check up on its progress, you've got to do so outside the
scope of Apache...because it's outside the scope of HTTP.

The common way to do this is to use a non-Apache process to perform
that work, and use a communication mechanism between Apache and that
process.  There are dozens of kinds of IPC (inter-process
communication) mechanisms to do this sort of workflow.  Although using
a nested XML-RPC call is one way, you can also use things like named
pipes, or even passing work requests around inside a database.
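
As a rough illustration of the "work requests inside a database" option,
here is a minimal sketch in Python.  It assumes an SQLite file that both
the Apache handler and a separate worker process can open; the table
layout and the function names (open_queue, submit_job, claim_next_job,
finish_job, job_status) are made up for this example and are not part of
mod_python or Apache.

```python
import json
import sqlite3


def open_queue(path):
    """Open the shared job table, creating it on first use (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " params TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued',"
        " result TEXT)"
    )
    conn.commit()
    return conn


def submit_job(conn, params):
    """Web handler side: record the work request and return immediately."""
    cur = conn.execute("INSERT INTO jobs (params) VALUES (?)", (json.dumps(params),))
    conn.commit()
    return cur.lastrowid


def claim_next_job(conn):
    """Worker side: take the oldest queued job, or return None if there is none."""
    row = conn.execute(
        "SELECT id, params FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, params = row
    # Re-check the status in the UPDATE so two workers cannot claim the same row.
    cur = conn.execute(
        "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
        (job_id,),
    )
    conn.commit()
    if cur.rowcount != 1:
        return None
    return job_id, json.loads(params)


def finish_job(conn, job_id, result):
    """Worker side: store the result so a later HTTP request can pick it up."""
    conn.execute(
        "UPDATE jobs SET status = 'done', result = ? WHERE id = ?",
        (json.dumps(result), job_id),
    )
    conn.commit()


def job_status(conn, job_id):
    """Web handler side: report (status, result) for a job, or None if unknown."""
    row = conn.execute(
        "SELECT status, result FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return None
    status, result = row
    return status, (json.loads(result) if result is not None else None)
```

One request handler would call submit_job and hand the job id back to
the browser; a later request would call job_status with that id.  The
worker loop that calls claim_next_job and finish_job runs as its own
process, outside of Apache.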

The point is that you should keep web transactions inside Apache and
non-web stuff outside of Apache.  It's compartmentalization, and it
can have lots of benefits.

> > When Apache is using multiple processes, it will terminate child
> > processes for various reasons (for example, when
> > max_requests_per_child has been reached). That will also terminate any
> > thread you created in that process.
> >
> Yeah I was looking at this and it is a pain. On Windows Apache does not do
> this. They use only two processes and use threads within those processes.

IMHO, Windows has an over-reliance on threads which can lead to some
poor architectural designs.  If everything is a thread, then every
problem starts looking like a needle (or something like that :)

You need to keep in mind that with Apache not only can it create and
destroy processes and threads at its own whim (between requests), but
it can also spawn many different processes and you have no control
over which HTTP request will get sent to which process or thread.
This is of course a frequent source of confusion over how to store
persistent server-side state.
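
To make the confusion concrete, here is a small sketch in Python.  It
does not use Apache at all; it simply starts two separate Python
processes as stand-ins for two Apache child processes.  The names
WORKER_SOURCE, handle_request and run_worker are invented for the
example.  Each process keeps its own copy of a module-level counter, so
neither one ever sees the requests handled by the other.

```python
import subprocess
import sys

# Source for a stand-in "child process": it handles three requests and
# prints the value of its own in-memory counter afterwards.
WORKER_SOURCE = """
hits = 0

def handle_request():
    global hits
    hits += 1
    return hits

for _ in range(3):
    last = handle_request()
print(last)
"""


def run_worker():
    """Run one stand-in child process and return the counter value it reports."""
    out = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout.strip())


# Two processes, three requests each: six requests in total, but every
# process only counted its own three.
counts = [run_worker() for _ in range(2)]
print(counts)
```

Any state that has to be visible to every request therefore needs to
live somewhere all the processes can reach, such as a file, a database,
or a separate service.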

> > I suspect your software gets a "job" from the user, and reports back
> > to the user that the job has been started. The job itself must keep
> > running for some time after that. The job creates a file or something
> > similar which can make the user conclude that the job has finished.
> >
> I would not say "job". I would say long running tasks generating data.
> For example I like to read real-time feeds, and want to generate the
> data. But the client is in control of the task using parameters that are
> sent to server.

The real question is whether the computation is within the scope of a
single HTTP request or not (regardless of how long it takes, although
most user agents, proxies, and even Apache impose an upper limit on
the time of a request).  If the results of assembling your feeds are
to be given back to the UA within the same request, then just do it
(there's no implicit need to launch off additional threads).  However
if it really is a background activity that is a side-effect of the
request (in that the result of the request is not dependent upon the
result of the task), then it is best to communicate that to some
service which lives outside of Apache.

> > A web service used within a web server is not redundant, it's just a
> > way to delegate tasks to other machines or processes. This is
> > typically done for security reasons.
> >
> I don't really buy this argument. Let's say I do what you recommend and
> that is call an XML-RPC service. Why do I need Apache in the first
> place? While I might get security I don't get any added value. If I am
> using web services then Apache is pretty well useless anyways because
> most web services don't use HTTP security. Most web service
> infrastructures use the WS-* specs, or some home-baked tokens that are
> added to the XML package.

You have to answer the question of whether you need Apache.  Apache is
a tool for handling HTTP requests, and a very nice and efficient one
at that.  But it is not a CORBA ORB (or its poor cousin, a web
service thingamabob), nor is it a TPM.

The whole web services fad really abuses the semantics of HTTP beyond
recognition anyway, so perhaps Apache really has no use to you.  I
find it unfortunate that they call it "web" services, since it really
has so very little to do with the "web".  But that's a different rant.

BTW, do you have any mod_python specific questions, or is this really
more a general Apache architecture discussion?  If you want to know
how to do IPC and such I'm sure we can help.
-- 
Deron Meranda