Graham Dumpleton
grahamd at dscpl.com.au
Sun Oct 10 14:55:03 EDT 2004
Apologies in advance, this is a long and complicated email. ;-)

On 09/10/2004, at 6:42 PM, Graham Dumpleton wrote:

> On 09/10/2004, at 6:10 PM, Nicolas Lehuen wrote:
>
>> Hi Graham, what are the pros and cons of using the imp module versus
>> an execfile call or an exec call?
>>
>> Using Apache 2.0.52 on win32, with a multithreaded MPM, I have
>> noticed a few problems with mod_python.publisher which reloads the
>> same module many times when many threads try to load it concurrently.
>> It seems that either mod_python.publisher or the imp module does not
>> properly handle this situation. The result is that I have many
>> instances of the same module in memory, so I get many connection
>> pools instead of only one and so on. That's why I decided to fix this
>> and other problems (the famous 'only one index.py in your
>> application') by reimplementing my own variant of the publisher.
>
> You have asked me about the one thing that I can't yet comment on.
> That is, how threading affects all this.

I have now done the bit of research on the threading stuff that I needed
to do. The main thing I wasn't sure about before was the relationship
between interpreters and threads in a threaded MPM, and whether or not
multiple threads pertaining to distinct requests could be operating in
the same interpreter at the same time. From what I have seen by way of
past comments in the mailing list archive, there can indeed be multiple
request threads running within one interpreter. It concerns me that I
couldn't find a good bit of official documentation on the web site which
explains this relationship, as I would have thought it was quite an
important piece of information. Is there some and I simply missed it?

Anyway, back to your observation about mod_python.publisher reloading
the same module many times; I believe I can explain that one. I'll try
and put down my thoughts on some other stuff as well, but perhaps not
all in this email, as I'll try and focus more on problems/issues with
the imp module and its load_module() function.

Please, if anyone sees that I say something wrong here, correct it. This
is all mainly based on reading and theory and not actual first hand
experience of the particular problem being mentioned.

Quoting the "imp" module documentation, the first important bit of
information to be aware of is:

  On platforms with threads, a thread executing an import holds an
  internal lock until the import is complete. This lock blocks other
  threads from doing an import until the original import completes,
  which in turn prevents other threads from seeing incomplete module
  objects constructed by the original thread while in the process of
  completing its import (and the imports, if any, triggered by that).

Thus, once you get inside an import, including one performed using
imp.load_module(), you are guaranteed that no other import will be able
to be commenced by another thread. This doesn't stop the same thread
from doing further imports which are nested within the initial import
though.

Now, going by past comments on the mailing list, it seems people
therefore believe that initialising globals in an imported module, as
the module is being imported, is safe. I'm not sure that this is
correct, and a great deal of care would still need to be taken in
initialising global data. Basically, this is because
apache.import_module() isn't strictly safe in a threaded environment,
plus there are other issues as well.
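For reference, here is a bare-bones sketch of the usual way a module
gets loaded via the imp module. I believe apache.import_module() does
something essentially like this internally, but treat that as an
assumption on my part; the module name and path below are just
placeholders.

    import imp

    def load(name, path=None):
        # find_module() returns an open file object, the path name and
        # a description tuple; load_module() then executes the module's
        # code while holding the interpreter-wide import lock.
        fob, fname, desc = imp.find_module(name, path)
        try:
            return imp.load_module(name, fob, fname, desc)
        finally:
            if fob:
                fob.close()

    module = load("mptest", ["/some/handler/directory"])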
To illustrate the problems, consider the following handler code, which
follows what some have suggested on the mailing list is the approach to
use to ensure that your content handler is thread safe.

    from threading import Lock

    lock = Lock()

    myglobal1 = None
    myglobal2 = None
    myglobal3 = None

    # ... arbitrary code to initialise globals ...

    def handler(req):
        lock.acquire()
        # ... use and manipulate my global data ...
        lock.release()

Now, what the import lock first ensures is that you cannot have two
brand new instances of a module being imported at the same time. It also
ensures that if the module already exists, you cannot have two module
reloads occurring at the same time. Since variable assignments and
executable code at global scope only get executed at the time of import,
all the import lock is doing is ensuring that you can't have that code
being executed within two threads at the same time.

My understanding is that the import lock cannot ensure that a separate
thread which is already executing code within the module can't access
the global data while a module reload is occurring. I think this because
I can't see how it could be implemented in a practical way. It would
somehow necessitate a lock on every module which would be acquired
whenever any code in that module is executing, so as to prevent a reload
of the module at the same time.

What this means is that if an HTTP request came in which caused
handler() above to be executed, and while that request was being
handled, the modification time of the file was first changed and then a
new HTTP request came in before the first had even finished, you have
the potential for problems with the above code as it is written.

Before I get to why, one needs to understand what happens when a module
reload occurs. Specifically, when a module reload occurs, the existing
data within the module is not thrown away. That is, where a brand new
import starts out with an empty module, a module reload starts out with
the existing module. Thus, when a module reload occurs and global
assignments are performed, they are overlaid on or replace what was
there before.

To illustrate this, create a content handler which is triggered using
SetHandler and which contains the following code. Note that this ignores
threading issues for now.

    from mod_python import apache

    _temp = globals().keys()
    _temp.sort()
    apache.log_error("globals "+str(_temp),
                     apache.APLOG_NOERRNO|apache.APLOG_ERR)

    if globals().has_key("_count"):
        _count = _count + 1
        apache.log_error("reload "+str(_count),
                         apache.APLOG_NOERRNO|apache.APLOG_ERR)
    else:
        apache.log_error("import",
                         apache.APLOG_NOERRNO|apache.APLOG_ERR)
        _count = 0

    _value = 0
    apache.log_error("value "+str(_value),
                     apache.APLOG_NOERRNO|apache.APLOG_ERR)
    _value = _value + 1

    def handler(req):
        req.content_type = "text/plain"
        req.send_http_header()
        req.write("Hello World!")
        return apache.OK

The first time this is accessed you should see the following in the
Apache log file.

    [Sun Oct 10 11:36:14 2004] [notice] mod_python: (Re)importing module 'mptest'
    [Sun Oct 10 11:36:14 2004] [error] globals ['__builtins__', '__doc__', '__file__', '__name__', 'apache']
    [Sun Oct 10 11:36:14 2004] [error] import
    [Sun Oct 10 11:36:14 2004] [error] value 0

That is, the code detected that it was a first time import and has
logged "import". Now touch the code file and access it again, and when
you hit the same child Apache process you should see the following.
    [Sun Oct 10 11:36:22 2004] [notice] mod_python: (Re)importing module 'mptest'
    [Sun Oct 10 11:36:22 2004] [error] globals ['__builtins__', '__doc__', '__file__', '__mtime__', '__name__', '_count', '_temp', '_value', 'apache', 'handler']
    [Sun Oct 10 11:36:22 2004] [error] reload 1
    [Sun Oct 10 11:36:22 2004] [error] value 0

Touch it once more, and you should see the following.

    [Sun Oct 10 11:36:33 2004] [notice] mod_python: (Re)importing module 'mptest'
    [Sun Oct 10 11:36:33 2004] [error] globals ['__builtins__', '__doc__', '__file__', '__mtime__', '__name__', '_count', '_temp', '_value', 'apache', 'handler']
    [Sun Oct 10 11:36:33 2004] [error] reload 2
    [Sun Oct 10 11:36:33 2004] [error] value 0

Note how, on the second access, the "_count" variable and the "handler"
function already exist in the set of globals. This shows how a reload
will overlay what is already there. The check against the presence of
"_count" can be used to determine that the global scope code being
executed is running because of a reload and not an initial import. We
use that check here so that on a reload we only increment the value of
the existing variable rather than setting it back to 0. If you don't
have this check, as is the case with "_value", the value gets reset to
whatever its starting value was when a reload occurs. Thus, even though
"_value" was incremented, it gets set back to 0 upon the next reload.

Now think about the consequences of this for the original example. First
off, the lock variable will get reassigned on a reload of the module. In
my understanding of Python, this means that if handler() were executing
in a separate thread and had acquired the lock but not yet released it,
and the reload occurred in a new thread, the global lock is going to get
replaced with a new one. When the handler executing in the original
thread goes to release the lock, it will call release() on a different
lock.

The next problem is that the intent of the lock is to protect access to
the global variables. The handler correctly acquires the lock in order
to make use of the global variables, but the code executed by the
subsequent reload does not. Thus, not only does the reload replace the
lock and the global variables, the modification of the global variables
is not done within the context of the lock. This means that the handler
which is trying to use the global variables may find they have suddenly
been changed when it was expecting to have exclusive access.

In the context of threading and module reloading, the code presented to
me would thus appear to be quite dangerous. If a module is written like
this, and my understanding of reloading is correct, one would definitely
want to heed the advice of disabling automatic module reloading in a
production system.

What one could do to make the code more robust is explicitly check when
a reload is occurring, and not set or change any global variables. If it
was still necessary somehow to modify the global variables, the code
executing at module scope when the module is being reloaded could itself
acquire the lock so that there is no conflict on access. Thus, the code
would be something like the following.

    from threading import Lock

    if not globals().has_key("lock"):
        # This is a brand new import and we are inside the scope of the
        # import lock, so this should all be safe to execute.
        lock = Lock()

        myglobal1 = None
        myglobal2 = None
        myglobal3 = None

        # ... arbitrary code to initialise globals ...

    lock.acquire()
    # ... extra code to fiddle the globals on a reload ...
    lock.release()

    def handler(req):
        lock.acquire()
        # ... use and manipulate my global data ...
        lock.release()
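To make the earlier danger concrete, here is a standalone sketch, plain
Python with nothing mod_python specific, of what an unguarded reload
effectively does to a module-level lock that a handler thread still
holds:

    import threading

    lock = threading.Lock()
    lock.acquire()            # a handler thread enters its critical section

    lock = threading.Lock()   # an unguarded module reload rebinds "lock"

    lock.release()            # the handler's eventual release now operates
                              # on the new, never-acquired lock; this raises
                              # an error, and the original lock is left held

With the has_key() guard in place, a reload leaves the original lock
object bound to "lock" and this situation cannot arise.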
Unfortunately we're not done yet, as there are still a few more dangers
lurking here, and I haven't even started on the problems with
apache.import_module() yet. The next thing to be careful of is something
like the following.

    def modify1(req):
        lock.acquire()
        # ...
        lock.release()
        return value

    def modify2(req, value):
        lock.acquire()
        # ...
        lock.release()

    def handler(req):
        value = modify1(req)
        modify2(req, value)
        # ...

Maybe I am being a bit paranoid on this one, but what happens if the
operation of modify1() and modify2() is tightly linked, such that
modify2() expects modify1() to return data in a certain way? Imagine now
that a module reload occurred between the call to modify1() and the call
to modify2(), that the code for both functions got changed, and that the
protocol for what format modify2() expects got changed along with it.
When the handler now executes modify2() it is the newer function, and it
finds data set up in a different way than expected, because it was the
old modify1() that was previously called to produce the value it is
operating on.

The point I am trying to make is that perhaps locking shouldn't pertain
just to the global data, and that executable code should in some way be
protected as well. It needs more exploration, but perhaps the module
should read something like the following (untested).

    from threading import Lock

    if not globals().has_key("datalock"):
        # This is a brand new import and we are inside the scope of the
        # import lock, so this should all be safe to execute.
        datalock = Lock()
        codelock = Lock()

        myglobal1 = None
        myglobal2 = None
        myglobal3 = None

        # ... arbitrary code to initialise globals ...

    codelock.acquire()
    datalock.acquire()

    # ... extra code to fiddle the globals on a reload ...

    def modify1(req):
        datalock.acquire()
        # ...
        datalock.release()
        return value

    def modify2(req, value):
        datalock.acquire()
        # ...
        datalock.release()

    def _handler(req):
        value = modify1(req)
        modify2(req, value)
        # ...

    def handler(req):
        # code for this method should never be changed
        codelock.acquire()
        try:
            result = _handler(req)
        finally:
            codelock.release()
        return result

    datalock.release()
    codelock.release()

What is being attempted here is to use a lock to ensure that the code
itself is not replaced while the handler is being executed. The reason
for the two level handler arrangement is that the lock actually needs to
be acquired outside of the real handler method being called. If it is
done inside, it is too late, as Python has already grabbed the existing
code for the handler, and if that gets replaced prior to the lock being
acquired, everything may have changed. Am I being too paranoid? I don't
understand enough about how code gets replaced while a module reload is
occurring.

One final thing to cover before moving on to looking at
apache.import_module(). Set aside for now the fact that we did all this
thread stuff to make things work a bit better within a threaded
environment, as the next bit pertains even to a prefork system.

Remember how "_value" in our original example was replaced with 0 again
when the module was reloaded. Imagine now that it was a quite
complicated Python class instance, or even a Python class instance which
wrapped up C code for something like a database connection pool. With
the way the code was written, the existing instance of the class would
get thrown away when a reload occurs.

If that class was never designed to be discarded, which may well be the
case in a database pooling scheme, or if the class had a __del__ method
and there were reference cycles that couldn't be broken by the garbage
collector, the original class instance would hang around. This may
simply result in ever growing memory use, or continual growth in the
number of socket connections to a back end database, as older instances
of the database pooling interface never get cleaned up.
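The __del__ plus reference cycle case is easy to demonstrate in plain
Python, independent of mod_python; this is just an illustrative sketch
with made-up class names:

    import gc

    class Pool:
        def __del__(self):
            # pretend this closes database connections
            pass

    a = Pool()
    b = Pool()
    a.other = b
    b.other = a     # reference cycle between the two instances

    del a, b        # drop our references; the cycle keeps them alive
    gc.collect()

    # Because the cycle contains objects with __del__ methods, the
    # collector refuses to guess a destruction order; the instances
    # end up in gc.garbage and are never freed.
    print gc.garbage

A pool instance orphaned by a module reload in this fashion would hang
on to its database connections indefinitely.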
To see how this latter issue is relevant to the original problem you
saw, we now need to look at apache.import_module(). The problem with the
module importing and caching system in the mod_python.apache module is
that it doesn't do any thread locking at all. That is, many threads can
at the same time be consulting the module cache, checking modification
times, using imp.find_module() and then calling imp.load_module().

Think about the case where a module hasn't previously been loaded and a
huge number of requests suddenly come in which require it to be loaded.
Because there is no thread locking on access to the cache, many threads
may decide at the same time that the module needs to be loaded. All of
these threads will continue through to call imp.load_module(). It has
already been explained that there is locking on module importation, so
it isn't possible to import a brand new module from two places at the
same time. This doesn't however stop all those extra threads from still
calling imp.load_module() once the initial thread has already imported
it. Why is this bad? Well, to quote the Python documentation for
imp.load_module():

  This function does more than importing the module: if the module was
  already imported, it is equivalent to a reload()!

Thus, all those extra calls into imp.load_module() from the other
threads that had decided at the same time to load the module get turned
into module reloads. Link this with what was described above for the
case of a database pooling instance being replaced on module reload and
you might see what is happening. It isn't that multiple instances of the
module corresponding to the handler are loaded into memory at the same
time, but that the one module is reloaded many times in quick
succession. In each case the database pooling instance would get thrown
away, but if it wasn't implemented so as to be able to clean up after
itself when destroyed, if it even is destroyed, it will just hang around
and accumulate memory, and potentially hold open connections to the
database as well.
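One way to avoid this thundering herd of reloads would be to serialise
the cache check and the load. This is only a sketch of the idea, not a
patch against the actual apache.import_module() code, and it uses
sys.modules to stand in for mod_python's own cache purely for
illustration. imp.acquire_lock() and imp.release_lock() do exist (Python
2.3 and later) and take the same interpreter-wide import lock described
earlier.

    import imp
    import sys

    def import_module_once(name, path=None):
        # Hold the import lock across both the cache check and the
        # load, so only one thread can decide the module needs loading.
        imp.acquire_lock()
        try:
            if sys.modules.has_key(name):
                # Already imported by some other thread; return it
                # instead of triggering an implicit reload().
                return sys.modules[name]
            fob, fname, desc = imp.find_module(name, path)
            try:
                return imp.load_module(name, fob, fname, desc)
            finally:
                if fob:
                    fob.close()
        finally:
            imp.release_lock()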
Okay, I think I am done for now. This doesn't contrast imp vs execfile
etc, but I'll get to that in another mail. Hope this has helped. If
anyone who understands this better than I do detects falsehoods here,
please, please correct them and explain what actually happens.

Can someone please confirm that multiple threads corresponding to
different HTTP requests can be operating within one interpreter at the
same time? Certain bits of the above analysis assume that is the case.

--
Graham Dumpleton (grahamd at dscpl.com.au)