blais at furius.ca
Wed Jan 4 11:19:45 EST 2006
On 1/4/06, Graham Dumpleton <grahamd at dscpl.com.au> wrote:
>
> The stack trace is a bit bogus from what I can tell. In the various MPMs
> I looked at, the ap_graceful_stop_signalled() function simply sets a
> variable and returns. It doesn't go calling apr_pool_destroy().

I'm still not sure how the normal termination message is communicated
between the Apache parent and its child. I thought it was supposed to go
via the scoreboard, but the stack trace seems to indicate a signal.
When I attach gdb to the running child before stopping Apache, the child
gets SIGTERM right away when I stop Apache, and not when a timeout
occurs.

I guess I should dig into Apache and libc now to find out how the normal
termination is supposed to occur (that won't happen for another 2 weeks;
I have some important work to move on to for a deadline).

Note that if I attach gdb before I shut down Apache, I can't reproduce
this stack trace. I need to attach after I terminate Apache to get it.

> Anyway, seeing the stack trace I can see where the problem lies and can
> simulate the situation with a test case.
>
> What it all comes down to is the signal handler for a SIGTERM in the
> child process is registered as:
>
>    apr_signal(SIGTERM, just_die);
>
> Thus when the SIGTERM is received it calls just_die(). The just_die()
> function calls clean_child_exit(), which if there is found to be a
> memory pool in existence for the child process calls apr_pool_destroy()
> on that memory pool.
>
> The problem then is that mod_python registers a cleanup handler
> associated with that memory pool, namely python_finalize(). Ie., it
> calls:
>
>    apr_pool_cleanup_register(p, NULL, python_finalize,
>                              apr_pool_cleanup_null);
>
> This means that when that memory pool is destroyed, the
> python_finalize() function is being called, which is wrong in that
> situation for a couple of reasons.

Maybe we should change the way python_finalize() is being triggered.
Any ideas?
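To make the chain described above concrete, here is a minimal
stand-alone C sketch. It uses neither APR nor mod_python: every mock_
name is a hypothetical stand-in for the real function it is named
after, and the "pool" holds a single cleanup only. It is meant to show
nothing more than how a cleanup registered on a pool ends up running
inside the SIGTERM handler once that handler destroys the pool.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for an APR pool:
   it remembers exactly one cleanup callback. */
typedef int (*cleanup_fn)(void *data);

struct mock_pool {
    cleanup_fn cleanup;
    void *data;
};

static struct mock_pool *child_pool = NULL;
static volatile sig_atomic_t in_signal_handler = 0;
static volatile sig_atomic_t finalize_ran_in_handler = 0;

/* Stand-in for apr_pool_cleanup_register(). */
static void mock_cleanup_register(struct mock_pool *p, void *data,
                                  cleanup_fn fn)
{
    assert(p != NULL);
    p->cleanup = fn;
    p->data = data;
}

/* Stand-in for apr_pool_destroy(): runs the registered cleanup once. */
static void mock_pool_destroy(struct mock_pool *p)
{
    if (p->cleanup != NULL) {
        cleanup_fn fn = p->cleanup;
        p->cleanup = NULL;
        fn(p->data);
    }
}

/* Stand-in for python_finalize(): only records where it was called. */
static int mock_python_finalize(void *data)
{
    (void)data;
    if (in_signal_handler)
        finalize_ran_in_handler = 1;
    return 0;
}

/* Stand-in for just_die(). The real one goes on to exit the process;
   this mock returns so the result can be inspected. */
static void mock_just_die(int signo)
{
    (void)signo;
    in_signal_handler = 1;
    if (child_pool != NULL)
        mock_pool_destroy(child_pool);
    in_signal_handler = 0;
}

/* Returns 1 if the cleanup ran while the signal handler was active. */
int demo_cleanup_runs_in_handler(void)
{
    static struct mock_pool pool;

    pool.cleanup = NULL;
    pool.data = NULL;
    child_pool = &pool;

    mock_cleanup_register(&pool, NULL, mock_python_finalize);
    signal(SIGTERM, mock_just_die);
    raise(SIGTERM);   /* simulate the parent's SIGTERM */

    return finalize_ran_in_handler;
}
```

Calling demo_cleanup_runs_in_handler() returns 1 here, i.e. the mock
finalizer ran in signal-handler context, which is the situation the
stack trace points at.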
>
> The first reason is that complex things should not be done from inside
> of signal handlers unless the code which is called is heavily protected
> against being called by signal handlers when in critical sections. There
> is no way that general Python API functions are going to fall into that
> category.

Indeed.

> The second reason is that at the time that the signal occurs, the main
> program thread is already deep within Python code and probably has
> various locks acquired. When the signal handler calls into
> Py_Finalize() it is

I don't know about that; the trace does not indicate that we're
processing a request at all. But it could happen, I suppose.

What we could/should do on that signal is simply set a variable that
later makes the wait loop exit. That loop must be somewhere within the
Apache libs. This way we could terminate properly without being in a
signal handler.

> most likely reaching a point where it wants to acquire the same lock
> as the main program thread has and it effectively deadlocks as the
> signal handler can't proceed until it gets the lock, but the main
> program thread can't give it up while the signal handler is running.
>
> At least this is the case on UNIX systems, where signal handlers
> interrupt the execution of the main program thread, unlike Win32 where
> signal handlers are a distinct thread in their own right.
>
> My immediate question is why does Py_Finalize() even need to be called
> within the context of the child process if it is simply being killed off
> anyway. I know that for the Apache main process if doing a restart that
> Py_Finalize() needs to be called as the same process is kept around,
> but for a child process I don't see the point except maybe to flush
> out stderr/stdout which aren't typically used in mod_python anyway.

I'm still not convinced whether it is being killed off or asked to go
down gracefully.

> Time now to work out why python_finalize() needs to be called.
> Maybe it can't simply not do anything when called in the context of the
> child process.

> Anyway, one could put:
>
>    if (child_init_pool)
>      return APR_SUCCESS;
>
> at the start of python_finalize() and that would at least avoid any
> problems with the signal handler trying to do complicated stuff like
> call into Python and cause a deadlock.

> Can you at least try the above little addition to python_finalize() and
> see if it makes any difference in your specific case.

Oh yes. Problem completely gone... but then again, python_finalize() is
no longer being called for any of the children (I checked with some
logging traces), whereas before some of the children managed to
terminate gracefully.

Hmm, I think we either need to find a way to terminate outside of a
signal handler, or forgo calling Py_Finalize() entirely (I don't like
the latter "solution").
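For the first option, terminating outside of the signal handler, the
usual pattern looks roughly like the sketch below. This is plain C, not
Apache or mod_python code: run_child_loop(), finalize_interpreter() and
the signal_at parameter are hypothetical names made up for the
illustration, and raise() stands in for the SIGTERM sent by the parent.
The handler only sets a flag; the cleanup runs after the loop, in
normal program context.

```c
#include <assert.h>
#include <signal.h>

/* Written by the handler, read by the loop. sig_atomic_t is the
   object type the C standard allows a handler to write safely. */
static volatile sig_atomic_t shutdown_requested = 0;

/* Counts how often the hypothetical cleanup ran. */
static int finalize_calls = 0;

/* The handler does nothing but record the request:
   no locks, no allocation, no calls into an interpreter. */
static void on_sigterm(int signo)
{
    (void)signo;
    shutdown_requested = 1;
}

/* Hypothetical stand-in for whatever would call Py_Finalize(). */
static void finalize_interpreter(void)
{
    finalize_calls++;
}

/* Hypothetical child loop. Performs up to max_iterations units of
   "work", sends itself SIGTERM after signal_at of them, and returns
   the number of units completed before the loop noticed the flag. */
int run_child_loop(int max_iterations, int signal_at)
{
    int done = 0;

    assert(max_iterations >= 0);
    shutdown_requested = 0;
    signal(SIGTERM, on_sigterm);

    while (!shutdown_requested && done < max_iterations) {
        done++;                 /* one unit of work */
        if (done == signal_at)
            raise(SIGTERM);     /* simulate the parent's signal */
    }

    /* Cleanup happens here, outside of the signal handler. */
    finalize_interpreter();
    return done;
}
```

With run_child_loop(10, 3) the loop stops after the third unit of work
and the cleanup runs exactly once. A real child that is blocked in a
system call when the signal arrives would also need that call to be
interrupted before the loop can see the flag; the sketch does not cover
that part.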