[mod_python] Re: child process XXXX still did not exit, sending a SIGTERM

Martin Blais blais at furius.ca
Mon Jan 2 17:31:29 EST 2006


I think there is a deadlock somewhere, I start apache2 in debug mode
and eventually manage to interrupt in this stack frame (where it
remains stuck, cont'inuing does not help):


    #0  0xb7b9b054 in __pthread_sigsuspend () from /lib/libpthread.so.0
    #1  0xb7b9ae98 in __pthread_wait_for_restart_signal () from
/lib/libpthread.so.0
    #2  0xb7b9bd6b in sem_wait@@GLIBC_2.1 () from /lib/libpthread.so.0
    #3  0xb79d943f in PyThread_acquire_lock () from
/usr/lib/apache2/modules/mod_python.so
    #4  0xb79d262e in PyThreadState_Delete () from
/usr/lib/apache2/modules/mod_python.so
    #5  0xb79d21e8 in PyInterpreterState_Delete () from
/usr/lib/apache2/modules/mod_python.so
    #6  0xb79d3318 in Py_Finalize () from /usr/lib/apache2/modules/mod_python.so
    #7  0xb794eeea in python_finalize (data=0x0) at mod_python.c:1823
    #8  0xb7c7a6ed in apr_pool_cleanup_run () from /usr/lib/libapr-0.so.0
    #9  0xb7c79f8d in apr_pool_destroy () from /usr/lib/libapr-0.so.0
    #10 0x0806860c in ap_graceful_stop_signalled ()
    #11 0xb7b9def5 in __pthread_sighandler () from /lib/libpthread.so.0
    #12 <signal handler called>
    #13 0xb7b240b8 in poll () from /lib/libc.so.6
    #14 0xb7c7b2af in apr_poll () from /usr/lib/libapr-0.so.0
    #15 0x08068952 in ap_graceful_stop_signalled ()
    #16 0x08067f0d in ap_graceful_stop_signalled ()
    #17 0x08067954 in ap_mpm_run ()
    #18 0x0806f02f in main ()


So I've added traces to python_finalize():


static apr_status_t python_finalize(void *data)
{
    interpreterdata *idata;
    pid_t pid = getpid();

    idata = get_interpreter(NULL, NULL);

    fprintf(stderr, "python_finalize %d\n", pid);
    fflush(stderr);

    if (idata) {

      fprintf(stderr, "  python_finalize %d - before \n", pid);
      fflush(stderr);

      Py_Finalize();

      fprintf(stderr, "  python_finalize %d - after\n", pid);
      fflush(stderr);

#ifdef WITH_THREAD
        PyEval_ReleaseLock();
#endif

      fprintf(stderr, "  python_finalize %d - after\n", pid);
      fflush(stderr);
    }


    return APR_SUCCESS;

}



And I get this log, which confirms that indeed the child never leaves
Py_Finalize().



==> /var/log/apache2/error_log <==
[Mon Jan 02 17:29:17 2006] [error] python_cleanup 29677 - interp
[Mon Jan 02 17:29:17 2006] [error] python_cleanup 29678 - interp
[Mon Jan 02 17:29:17 2006] [error] python_cleanup 29677 - before
python_finalize 29680
  python_finalize 29680 - before
  python_finalize 29680 - after
  python_finalize 29680 - after
python_finalize 29681
  python_finalize 29681 - before
  python_finalize 29681 - after
  python_finalize 29681 - after
python_finalize 29686
  python_finalize 29686 - before
  python_finalize 29686 - after
  python_finalize 29686 - after
[Mon Jan 02 17:29:17 2006] [error] python_cleanup 29678 - before
python_finalize 29679
  python_finalize 29679 - before
  python_finalize 29679 - after
  python_finalize 29679 - after
python_finalize 29688
  python_finalize 29688 - before
  python_finalize 29688 - after
  python_finalize 29688 - after
dracohandler.py: child 29677 finishing childExit
[Mon Jan 02 17:29:18 2006] [error] python_cleanup 29677 - after
[Mon Jan 02 17:29:18 2006] [error] python_cleanup 29677 - released
python_finalize 29677
  python_finalize 29677 - before
dracohandler.py: child 29678 finishing childExit
[Mon Jan 02 17:29:18 2006] [error] python_cleanup 29678 - after
[Mon Jan 02 17:29:18 2006] [error] python_cleanup 29678 - released
python_finalize 29678
  python_finalize 29678 - before
  python_finalize 29677 - after
  python_finalize 29677 - after
python_finalize 29687
  python_finalize 29687 - before
  python_finalize 29687 - after
  python_finalize 29687 - after
[Mon Jan 02 17:29:21 2006] [warn] child process 29678 still did not
exit, sending a SIGTERM
[Mon Jan 02 17:29:23 2006] [warn] child process 29678 still did not
exit, sending a SIGTERM
[Mon Jan 02 17:29:25 2006] [warn] child process 29678 still did not
exit, sending a SIGTERM
[Mon Jan 02 17:29:27 2006] [error] child process 29678 still did not
exit, sending a SIGKILL
[Mon Jan 02 17:29:28 2006] [info] removed PID file
/var/run/apache2.pid (pid=29673)
[Mon Jan 02 17:29:28 2006] [notice] caught SIGTERM, shutting down







On 1/2/06, Martin Blais <blais at furius.ca> wrote:
> Thanks Graham.
>
> Looks like a new directive only available in 2.2...   will update now.
>
> More info:  I'm starting to hack into mod_python to find out what's
> happening there, and python_finalize() is not called for the offending
> processes.
>
>
>
>
>
> On 1/2/06, Graham Dumpleton <grahamd at dscpl.com.au> wrote:
> > Instead of using "apachectl restart", try using "apachectl graceful" to see
> > if that makes any difference.
> >
> > You may also want to adjust GracefulShutdownTimeout for this case to
> > only allow a certain time for child processes to shutdown.
> >
> >   http://httpd.apache.org/docs/2.2/mod/mpm_common.html#gracefulshutdowntimeout
> >
> > By fiddling with this, you may be able to see if given time that the child
> > processes do actually shutdown, or whether they actually hang. If they
> > hang, then you may be able to attach using gdb to the child process and
> > work out where it is hanging.
> >
> > It may just be a case that with Python loaded it takes longer to shutdown
> > than Apache expects it to and so it abruptly kills it. I haven't been able
> > to find any directive for specifying a shutdown timeout for "restart" which
> > it waits before it kills the child processes.
> >
> > Graham
> >
> > Martin Blais wrote ..
> > > On 1/2/06, Martin Blais <blais at furius.ca> wrote:
> > > > Before someone asks, I did not get seg faults in my log, so that's not
> > > > the problem.
> > > >
> > > > On closer inspection though, I do have a few SSL error messages like
> > > these:
> > > >
> > > > [Mon Jan 02 15:07:30 2006] [info] (70014)End of file found: SSL
> > > > handshake interrupted by system [Hint: Stop button pressed in
> > > > browser?!]
> > > > [Mon Jan 02 15:07:30 2006] [info] Connection to child 0 closed with
> > > > abortive shutdown(server furius.dyndns.biz:443, client 127.0.0.1)
> > > > [Mon Jan 02 15:07:31 2006] [info] Connection to child 9 established
> > > > (server furius.dyndns.biz:443, client 127.0.0.1)
> > > > [Mon Jan 02 15:07:31 2006] [info] Seeding PRNG with 136 bytes of entropy
> > > >
> > > > ... the number of which does not match the number of zombie children
> > > > (does not mean that it's not the problem though).
> > > >
> > > > Any information/tips welcome.
> > >
> > > It's not the SSL errors, I just reproduced the zombie child error
> > > without getting the SSL errors (it's my automated test code that
> > > somehow creates the SSL problems, fiddling a lot with the browser does
> > > not cause SSL errors, but will result in children not wanting to die).
> > >
> > >
> > > >
> > > >
> > > > On 1/2/06, Martin Blais <blais at furius.ca> wrote:
> > > > > Hi
> > > > >
> > > > > I'm getting messages like this in my apache2 log when I stop it after
> > > > > running requests to a mod_python-based web app:
> > > > >
> > > > > ==> /var/log/apache2/error_log <==
> > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21806 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21768 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:12 2006] [warn] child process 21792 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21806 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21768 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:14 2006] [warn] child process 21792 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21806 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21768 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:16 2006] [warn] child process 21792 still did not
> > > > > exit, sending a SIGTERM
> > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21806 still did not
> > > > > exit, sending a SIGKILL
> > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21768 still did not
> > > > > exit, sending a SIGKILL
> > > > > [Mon Jan 02 13:04:18 2006] [error] child process 21792 still did not
> > > > > exit, sending a SIGKILL
> > > > > [Mon Jan 02 13:04:19 2006] [info] removed PID file
> > > > > /var/run/apache2.pid (pid=21754)
> > > > > [Mon Jan 02 13:04:19 2006] [notice] caught SIGTERM, shutting down
> > > > >
> > > > > I've been tracking the problem, and on exit, the cleanup function (in
> > > > > Python) that I registered with the mod_python server does get called,
> > > > > and exits.
> > > > >
> > > > > I saw somewhere that recompiling expat, and then all the rest (in my
> > > > > case, python-2.4.2, apache2, mod_python-3.1.4, psycopg2) might fix
> > > the
> > > > > issue (it would be due to a library incompatibility).  I did this,
> > > and
> > > > > the problem still occurs.  I also tried using mod_python-3.2.5b and
> > > > > the problem persists.  This is all running on Linux (Gentoo).
> > > > >
> > > > > Any clue on how to debug this issue further?  I'm currently trying
> > > to
> > > > > run gdb on a non-daemon apache2, or trying to attach to a running
> > > > > child.  The problem is a bit tricky to fix, since it does not happen
> > > > > always -- but always happens at least after I run a lot of requests
> > > > > e.g. my automated test suite for my web app.
> > > > >
> > > > > Any hints would be appreciated.
> > > > >
> > > > > cheers,
> > > > >
> > > >
> > >
> > > _______________________________________________
> > > Mod_python mailing list
> > > Mod_python at modpython.org
> > > http://mailman.modpython.org/mailman/listinfo/mod_python
> >
>



More information about the Mod_python mailing list