[mod_python] Re: child process XXXX still did not exit, sending a SIGTERM

Graham Dumpleton grahamd at dscpl.com.au
Mon Jan 2 18:15:38 EST 2006


Maybe mod_python has to be a bit more orderly as far as how it ends
interpreters. Ie., maybe it should end each interpreter it has created
explicitly using Py_EndInterpreter().

This though may be dependent on first getting changes described in:

  http://issues.apache.org/jira/browse/MODPYTHON-77

incorporated. Specifically, the attachment:

  http://issues.apache.org/jira/secure/attachment/12320489/grahamd_20051105.tar.gz

Actually, this change may in itself fix the issue as it saves away the
first interpreter state as "main_interpreter" and that will then be the
interpreter state used to do the cleanup. This isn't the case at the
moment.

Can you apply that patch and see if things improve?

Beyond that, as I suggest, maybe it needs to be more orderly and
end each interpreter besides "main_interpreter" and then finally call
Py_Finalize().

Graham


Martin Blais wrote ..
> I think there is a deadlock somewhere, I start apache2 in debug mode
> and eventually manage to interrupt in this stack frame (where it
> remains stuck, cont'inuing does not help):
> 
> 
>     #0  0xb7b9b054 in __pthread_sigsuspend () from /lib/libpthread.so.0
>     #1  0xb7b9ae98 in __pthread_wait_for_restart_signal () from
> /lib/libpthread.so.0
>     #2  0xb7b9bd6b in sem_wait@@GLIBC_2.1 () from /lib/libpthread.so.0
>     #3  0xb79d943f in PyThread_acquire_lock () from
> /usr/lib/apache2/modules/mod_python.so
>     #4  0xb79d262e in PyThreadState_Delete () from
> /usr/lib/apache2/modules/mod_python.so
>     #5  0xb79d21e8 in PyInterpreterState_Delete () from
> /usr/lib/apache2/modules/mod_python.so
>     #6  0xb79d3318 in Py_Finalize () from /usr/lib/apache2/modules/mod_python.so
>     #7  0xb794eeea in python_finalize (data=0x0) at mod_python.c:1823
>     #8  0xb7c7a6ed in apr_pool_cleanup_run () from /usr/lib/libapr-0.so.0
>     #9  0xb7c79f8d in apr_pool_destroy () from /usr/lib/libapr-0.so.0
>     #10 0x0806860c in ap_graceful_stop_signalled ()
>     #11 0xb7b9def5 in __pthread_sighandler () from /lib/libpthread.so.0
>     #12 <signal handler called>
>     #13 0xb7b240b8 in poll () from /lib/libc.so.6
>     #14 0xb7c7b2af in apr_poll () from /usr/lib/libapr-0.so.0
>     #15 0x08068952 in ap_graceful_stop_signalled ()
>     #16 0x08067f0d in ap_graceful_stop_signalled ()
>     #17 0x08067954 in ap_mpm_run ()
>     #18 0x0806f02f in main ()
> 
> 
> So I've added traces to python_finalize():
> 
> 
> static apr_status_t python_finalize(void *data)
> {
>     interpreterdata *idata;
>     pid_t pid = getpid();
> 
>     idata = get_interpreter(NULL, NULL);
> 
>     fprintf(stderr, "python_finalize %d\n", pid);
>     fflush(stderr);
> 
>     if (idata) {
> 
>       fprintf(stderr, "  python_finalize %d - before \n", pid);
>       fflush(stderr);
> 
>       Py_Finalize();
> 
>       fprintf(stderr, "  python_finalize %d - after\n", pid);
>       fflush(stderr);
> 
> #ifdef WITH_THREAD
>         PyEval_ReleaseLock();
> #endif
> 
>       fprintf(stderr, "  python_finalize %d - after\n", pid);
>       fflush(stderr);
>     }
> 
> 
>     return APR_SUCCESS;
> 
> }
> 
> 
> 
> And I get this log, which confirms that indeed the child never leaves
> Py_Finalize().
> 
> 
> 
> ==> /var/log/apache2/error_log <==
> [Mon Jan 02 17:29:17 2006] [error] python_cleanup 29677 - interp
> [Mon Jan 02 17:29:17 2006] [error] python_cleanup 29678 - interp
> [Mon Jan 02 17:29:17 2006] [error] python_cleanup 29677 - before
> python_finalize 29680
>   python_finalize 29680 - before
>   python_finalize 29680 - after
>   python_finalize 29680 - after
> python_finalize 29681
>   python_finalize 29681 - before
>   python_finalize 29681 - after
>   python_finalize 29681 - after
> python_finalize 29686
>   python_finalize 29686 - before
>   python_finalize 29686 - after
>   python_finalize 29686 - after
> [Mon Jan 02 17:29:17 2006] [error] python_cleanup 29678 - before
> python_finalize 29679
>   python_finalize 29679 - before
>   python_finalize 29679 - after
>   python_finalize 29679 - after
> python_finalize 29688
>   python_finalize 29688 - before
>   python_finalize 29688 - after
>   python_finalize 29688 - after
> dracohandler.py: child 29677 finishing childExit
> [Mon Jan 02 17:29:18 2006] [error] python_cleanup 29677 - after
> [Mon Jan 02 17:29:18 2006] [error] python_cleanup 29677 - released
> python_finalize 29677
>   python_finalize 29677 - before
> dracohandler.py: child 29678 finishing childExit
> [Mon Jan 02 17:29:18 2006] [error] python_cleanup 29678 - after
> [Mon Jan 02 17:29:18 2006] [error] python_cleanup 29678 - released
> python_finalize 29678
>   python_finalize 29678 - before
>   python_finalize 29677 - after
>   python_finalize 29677 - after
> python_finalize 29687
>   python_finalize 29687 - before
>   python_finalize 29687 - after
>   python_finalize 29687 - after
> [Mon Jan 02 17:29:21 2006] [warn] child process 29678 still did not
> exit, sending a SIGTERM
> [Mon Jan 02 17:29:23 2006] [warn] child process 29678 still did not
> exit, sending a SIGTERM
> [Mon Jan 02 17:29:25 2006] [warn] child process 29678 still did not
> exit, sending a SIGTERM
> [Mon Jan 02 17:29:27 2006] [error] child process 29678 still did not
> exit, sending a SIGKILL
> [Mon Jan 02 17:29:28 2006] [info] removed PID file
> /var/run/apache2.pid (pid=29673)
> [Mon Jan 02 17:29:28 2006] [notice] caught SIGTERM, shutting down
> 
> 
> 
> 
> 
> 
> 
> On 1/2/06, Martin Blais <blais at furius.ca> wrote:
> > Thanks Graham.
> >
> > Looks like a new directive only available in 2.2...   will update now.
> >
> > More info:  I'm starting to hack into mod_python to find out what's
> > happening there, and python_finalize() is not called for the offending
> > processes.
> >
> >
> >
> >
> >
> > On 1/2/06, Graham Dumpleton <grahamd at dscpl.com.au> wrote:
> > > Instead of using "apachectl restart", try using "apachectl graceful"
> to see
> > > if that makes any difference.
> > >
> > > You may also want to adjust GracefulShutdownTimeout for this case to
> > > only allow a certain time for child processes to shutdown.
> > >
> > >   http://httpd.apache.org/docs/2.2/mod/mpm_common.html#gracefulshutdowntimeout
> > >
> > > By fiddling with this, you may be able to see if given time that the
> child
> > > processes do actually shutdown, or whether they actually hang. If they
> > > hang, then you may be able to attach using gdb to the child process
> and
> > > work out where it is hanging.
> > >
> > > It may just be a case that with Python loaded it takes longer to shutdown
> > > than Apache expects it to and so it abruptly kills it. I haven't been
> able
> > > to find any directive for specifying a shutdown timeout for "restart"
> which
> > > it waits before it kills the child processes.
> > >
> > > Graham
> > >
> > > Martin Blais wrote ..
> > > > On 1/2/06, Martin Blais <blais at furius.ca> wrote:
> > > > > Before someone asks, I did not get seg faults in my log, so that's
> not
> > > > > the problem.
> > > > >
> > > > > On closer inspection though, I do have a few SSL error messages
> like
> > > > these:
> > > > >
> > > > > [Mon Jan 02 15:07:30 2006] [info] (70014)End of file found: SSL
> > > > > handshake interrupted by system [Hint: Stop button pressed in
> > > > > browser?!]
> > > > > [Mon Jan 02 15:07:30 2006] [info] Connection to child 0 closed
> with
> > > > > abortive shutdown(server furius.dyndns.biz:443, client 127.0.0.1)
> > > > > [Mon Jan 02 15:07:31 2006] [info] Connection to child 9 established
> > > > > (server furius.dyndns.biz:443, client 127.0.0.1)
> > > > > [Mon Jan 02 15:07:31 2006] [info] Seeding PRNG with 136 bytes of
> entropy
> > > > >
> > > > > ... the number of which does not match the number of zombie children
> > > > > (does not mean that it's not the problem though).
> > > > >
> > > > > Any information/tips welcome.
> > > >
> > > > It's not the SSL errors, I just reproduced the zombie child error
> > > > without getting the SSL errors (it's my automated test code that
> > > > somehow creates the SSL problems, fiddling a lot with the browser
> does
> > > > not cause SSL errors, but will result in children not wanting to
> die).
> > > >
> > > >
> > > > >
> > > > >
> > > > > On 1/2/06, Martin Blais <blais at furius.ca> wrote:
> > > > > > Hi
> > > > > >
> > > > > > I'm getting messages like this in my apache2 log when I stop
> it after
> > > > > > running requests to a mod_python-based web app:
> > > > > >
> > > > > > ==> /var/log/apache2/error_log <==
> > > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21806 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21768 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21792 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21806 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21768 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21792 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21806 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21768 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21792 still did
> not
> > > > > > exit, sending a SIGTERM
> > > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21806 still
> did not
> > > > > > exit, sending a SIGKILL
> > > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21768 still
> did not
> > > > > > exit, sending a SIGKILL
> > > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21792 still
> did not
> > > > > > exit, sending a SIGKILL
> > > > > > [Mon Jan 02 13:04:19 2006] [info] removed PID file
> > > > > > /var/run/apache2.pid (pid=21754)
> > > > > > [Mon Jan 02 13:04:19 2006] [notice] caught SIGTERM, shutting
> down
> > > > > >
> > > > > > I've been tracking the problem, and on exit, the cleanup function
> (in
> > > > > > Python) that I registered with the mod_python server does get
> called,
> > > > > > and exits.
> > > > > >
> > > > > > I saw somewhere that recompiling expat, and then all the rest
> (in my
> > > > > > case, python-2.4.2, apache2, mod_python-3.1.4, psycopg2) might
> fix
> > > > the
> > > > > > issue (it would be due to a library incompatibility).  I did
> this,
> > > > and
> > > > > > the problem still occurs.  I also tried using mod_python-3.2.5b
> and
> > > > > > the problem persists.  This is all running on Linux (Gentoo).
> > > > > >
> > > > > > Any clue on how to debug this issue further?  I'm currently trying
> > > > to
> > > > > > run gdb on a non-daemon apache2, or trying to attach to a running
> > > > > > child.  The problem is a bit tricky to fix, since it does not
> happen
> > > > > > always -- but always happens at least after I run a lot of requests
> > > > > > e.g. my automated test suite for my web app.
> > > > > >
> > > > > > Any hints would be appreciated.
> > > > > >
> > > > > > cheers,
> > > > > >
> > > > >
> > > >
> > > > _______________________________________________
> > > > Mod_python mailing list
> > > > Mod_python at modpython.org
> > > > http://mailman.modpython.org/mailman/listinfo/mod_python
> > >
> >


More information about the Mod_python mailing list