Martin Blais
blais at furius.ca
Mon Jan 2 17:31:29 EST 2006
I think there is a deadlock somewhere, I start apache2 in debug mode and eventually manage to interrupt in this stack frame (where it remains stuck, cont'inuing does not help): #0 0xb7b9b054 in __pthread_sigsuspend () from /lib/libpthread.so.0 #1 0xb7b9ae98 in __pthread_wait_for_restart_signal () from /lib/libpthread.so.0 #2 0xb7b9bd6b in sem_wait@@GLIBC_2.1 () from /lib/libpthread.so.0 #3 0xb79d943f in PyThread_acquire_lock () from /usr/lib/apache2/modules/mod_python.so #4 0xb79d262e in PyThreadState_Delete () from /usr/lib/apache2/modules/mod_python.so #5 0xb79d21e8 in PyInterpreterState_Delete () from /usr/lib/apache2/modules/mod_python.so #6 0xb79d3318 in Py_Finalize () from /usr/lib/apache2/modules/mod_python.so #7 0xb794eeea in python_finalize (data=0x0) at mod_python.c:1823 #8 0xb7c7a6ed in apr_pool_cleanup_run () from /usr/lib/libapr-0.so.0 #9 0xb7c79f8d in apr_pool_destroy () from /usr/lib/libapr-0.so.0 #10 0x0806860c in ap_graceful_stop_signalled () #11 0xb7b9def5 in __pthread_sighandler () from /lib/libpthread.so.0 #12 <signal handler called> #13 0xb7b240b8 in poll () from /lib/libc.so.6 #14 0xb7c7b2af in apr_poll () from /usr/lib/libapr-0.so.0 #15 0x08068952 in ap_graceful_stop_signalled () #16 0x08067f0d in ap_graceful_stop_signalled () #17 0x08067954 in ap_mpm_run () #18 0x0806f02f in main () So I've added traces to python_finalize(): static apr_status_t python_finalize(void *data) { interpreterdata *idata; pid_t pid = getpid(); idata = get_interpreter(NULL, NULL); fprintf(stderr, "python_finalize %d\n", pid); fflush(stderr); if (idata) { fprintf(stderr, " python_finalize %d - before \n", pid); fflush(stderr); Py_Finalize(); fprintf(stderr, " python_finalize %d - after\n", pid); fflush(stderr); #ifdef WITH_THREAD PyEval_ReleaseLock(); #endif fprintf(stderr, " python_finalize %d - after\n", pid); fflush(stderr); } return APR_SUCCESS; } And I get this log, which confirms that indeed the child never leaves Py_Finalize(). ==> /var/log/apache2/error_log <== [Mon Jan 02 17:29:17 2006] [error] python_cleanup 29677 - interp [Mon Jan 02 17:29:17 2006] [error] python_cleanup 29678 - interp [Mon Jan 02 17:29:17 2006] [error] python_cleanup 29677 - before python_finalize 29680 python_finalize 29680 - before python_finalize 29680 - after python_finalize 29680 - after python_finalize 29681 python_finalize 29681 - before python_finalize 29681 - after python_finalize 29681 - after python_finalize 29686 python_finalize 29686 - before python_finalize 29686 - after python_finalize 29686 - after [Mon Jan 02 17:29:17 2006] [error] python_cleanup 29678 - before python_finalize 29679 python_finalize 29679 - before python_finalize 29679 - after python_finalize 29679 - after python_finalize 29688 python_finalize 29688 - before python_finalize 29688 - after python_finalize 29688 - after dracohandler.py: child 29677 finishing childExit [Mon Jan 02 17:29:18 2006] [error] python_cleanup 29677 - after [Mon Jan 02 17:29:18 2006] [error] python_cleanup 29677 - released python_finalize 29677 python_finalize 29677 - before dracohandler.py: child 29678 finishing childExit [Mon Jan 02 17:29:18 2006] [error] python_cleanup 29678 - after [Mon Jan 02 17:29:18 2006] [error] python_cleanup 29678 - released python_finalize 29678 python_finalize 29678 - before python_finalize 29677 - after python_finalize 29677 - after python_finalize 29687 python_finalize 29687 - before python_finalize 29687 - after python_finalize 29687 - after [Mon Jan 02 17:29:21 2006] [warn] child process 29678 still did not exit, sending a SIGTERM [Mon Jan 02 17:29:23 2006] [warn] child process 29678 still did not exit, sending a SIGTERM [Mon Jan 02 17:29:25 2006] [warn] child process 29678 still did not exit, sending a SIGTERM [Mon Jan 02 17:29:27 2006] [error] child process 29678 still did not exit, sending a SIGKILL [Mon Jan 02 17:29:28 2006] [info] removed PID file /var/run/apache2.pid (pid=29673) [Mon Jan 02 17:29:28 2006] [notice] caught SIGTERM, shutting down On 1/2/06, Martin Blais <blais at furius.ca> wrote: > Thanks Graham. > > Looks like a new directive only available in 2.2... will update now. > > More info: I'm starting to hack into mod_python to find out what's > happening there, and python_finalize() is not called for the offending > processes. > > > > > > On 1/2/06, Graham Dumpleton <grahamd at dscpl.com.au> wrote: > > Instead of using "apachectl restart", try using "apachectl graceful" to see > > if that makes any difference. > > > > You may also want to adjust GracefulShutdownTimeout for this case to > > only allow a certain time for child processes to shutdown. > > > > http://httpd.apache.org/docs/2.2/mod/mpm_common.html#gracefulshutdowntimeout > > > > By fiddling with this, you may be able to see if given time that the child > > processes do actually shutdown, or whether they actually hang. If they > > hang, then you may be able to attach using gdb to the child process and > > work out where it is hanging. > > > > It may just be a case that with Python loaded it takes longer to shutdown > > than Apache expects it to and so it abruptly kills it. I haven't been able > > to find any directive for specifying a shutdown timeout for "restart" which > > it waits before it kills the child processes. > > > > Graham > > > > Martin Blais wrote .. > > > On 1/2/06, Martin Blais <blais at furius.ca> wrote: > > > > Before someone asks, I did not get seg faults in my log, so that's not > > > > the problem. > > > > > > > > On closer inspection though, I do have a few SSL error messages like > > > these: > > > > > > > > [Mon Jan 02 15:07:30 2006] [info] (70014)End of file found: SSL > > > > handshake interrupted by system [Hint: Stop button pressed in > > > > browser?!] > > > > [Mon Jan 02 15:07:30 2006] [info] Connection to child 0 closed with > > > > abortive shutdown(server furius.dyndns.biz:443, client 127.0.0.1) > > > > [Mon Jan 02 15:07:31 2006] [info] Connection to child 9 established > > > > (server furius.dyndns.biz:443, client 127.0.0.1) > > > > [Mon Jan 02 15:07:31 2006] [info] Seeding PRNG with 136 bytes of entropy > > > > > > > > ... the number of which does not match the number of zombie children > > > > (does not mean that it's not the problem though). > > > > > > > > Any information/tips welcome. > > > > > > It's not the SSL errors, I just reproduced the zombie child error > > > without getting the SSL errors (it's my automated test code that > > > somehow creates the SSL problems, fiddling a lot with the browser does > > > not cause SSL errors, but will result in children not wanting to die). > > > > > > > > > > > > > > > > > > On 1/2/06, Martin Blais <blais at furius.ca> wrote: > > > > > Hi > > > > > > > > > > I'm getting messages like this in my apache2 log when I stop it after > > > > > running requests to a mod_python-based web app: > > > > > > > > > > ==> /var/log/apache2/error_log <== > > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21806 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21768 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21792 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21806 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21768 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21792 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21806 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21768 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21792 still did not > > > > > exit, sending a SIGTERM > > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21806 still did not > > > > > exit, sending a SIGKILL > > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21768 still did not > > > > > exit, sending a SIGKILL > > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21792 still did not > > > > > exit, sending a SIGKILL > > > > > [Mon Jan 02 13:04:19 2006] [info] removed PID file > > > > > /var/run/apache2.pid (pid=21754) > > > > > [Mon Jan 02 13:04:19 2006] [notice] caught SIGTERM, shutting down > > > > > > > > > > I've been tracking the problem, and on exit, the cleanup function (in > > > > > Python) that I registered with the mod_python server does get called, > > > > > and exits. > > > > > > > > > > I saw somewhere that recompiling expat, and then all the rest (in my > > > > > case, python-2.4.2, apache2, mod_python-3.1.4, psycopg2) might fix > > > the > > > > > issue (it would be due to a library incompatibility). I did this, > > > and > > > > > the problem still occurs. I also tried using mod_python-3.2.5b and > > > > > the problem persists. This is all running on Linux (Gentoo). > > > > > > > > > > Any clue on how to debug this issue further? I'm currently trying > > > to > > > > > run gdb on a non-daemon apache2, or trying to attach to a running > > > > > child. The problem is a bit tricky to fix, since it does not happen > > > > > always -- but always happens at least after I run a lot of requests > > > > > e.g. my automated test suite for my web app. > > > > > > > > > > Any hints would be appreciated. > > > > > > > > > > cheers, > > > > > > > > > > > > > > > _______________________________________________ > > > Mod_python mailing list > > > Mod_python at modpython.org > > > http://mailman.modpython.org/mailman/listinfo/mod_python > > >
|