Graham Dumpleton
grahamd at dscpl.com.au
Sun Oct 10 14:55:03 EDT 2004
Apologies in advance, this is a long and complicated email. ;-)

On 09/10/2004, at 6:42 PM, Graham Dumpleton wrote:

> On 09/10/2004, at 6:10 PM, Nicolas Lehuen wrote:
>
>> Hi Graham, what are the pros and cons of using the imp module versus
>> an execfile call or an exec call?
>>
>> Using Apache 2.0.52 on win32, with a multithreaded MPM, I have
>> noticed a few problems with mod_python.publisher which reloads the
>> same module many times when many threads try to load it concurrently.
>> It seems that either mod_python.publisher or the imp module does not
>> properly handle this situation. The result is that I have many
>> instances of the same module in memory, so I get many connection
>> pools instead of only one and so on. That's why I decided to fix this
>> and other problems (the famous 'only one index.py in your
>> application') by reimplementing my own variant of the publisher.
>
> You have asked me about the one thing that I can't yet comment on.
> That is, how threading affects all this.

I have now done the bit of research on the threading stuff that I needed
to do. The main thing I wasn't sure about before was the relationship
between interpreters and threads in a threaded MPM, and whether or not
multiple threads pertaining to distinct requests could be operating in
the same interpreter at the same time. From what I have seen by way of
past comments in the mailing list archive, there can indeed be multiple
request threads running within one interpreter. It concerns me that I
couldn't find a good bit of official documentation on the web site which
explains this relationship, as I would have thought it was quite an
important piece of information. Is there some and I simply missed it?

Anyway, back to your observation about mod_python.publisher reloading
the same module many times; I believe I can explain that one. I'll try
and put down my thoughts on some other stuff as well, but perhaps not
all in this email, as I'll try and focus more on problems/issues with
the imp module and its load_module() function.

Please, if anyone sees that I say something wrong here, correct it. This
is all mainly based on reading and theory and not actual first hand
experience of the particular problem being mentioned.

Quoting the "imp" module documentation, the first important bit of
information to be aware of is:

  On platforms with threads, a thread executing an import holds an
  internal lock until the import is complete. This lock blocks other
  threads from doing an import until the original import completes,
  which in turn prevents other threads from seeing incomplete module
  objects constructed by the original thread while in the process of
  completing its import (and the imports, if any, triggered by that).

Thus, once you get inside an import, including one performed using
imp.load_module(), you are guaranteed that no other import will be able
to be commenced by another thread. This doesn't stop the same thread
from doing further imports which are nested within the initial import
though.

Now, going by past comments on the mailing list, it seems people
therefore believe that initialising globals in an imported module, as
the module is being imported, is safe. I'm not sure that this is
correct, and a great deal of care would still need to be taken in
initialising global data. Basically, this is because
apache.import_module() isn't strictly safe in a threaded environment,
plus there are other issues as well.
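For reference, here is a bare-bones sketch of the usual way a module
gets loaded via the imp module. I believe apache.import_module() does
something essentially like this internally, but treat that as an
assumption on my part; the module name and path below are just
placeholders.

    import imp

    def load(name, path=None):
        # find_module() returns an open file object, the path name and
        # a description tuple; load_module() then executes the module's
        # code while holding the interpreter-wide import lock.
        fob, fname, desc = imp.find_module(name, path)
        try:
            return imp.load_module(name, fob, fname, desc)
        finally:
            if fob:
                fob.close()

    module = load("mptest", ["/some/handler/directory"])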
To illustrate the problems, consider the following handler code, which
follows what some have suggested on the mailing list is the approach to
use to ensure that your content handler is thread safe.

    from threading import Lock

    lock = Lock()

    myglobal1 = None
    myglobal2 = None
    myglobal3 = None

    # ... arbitrary code to initialise globals ...

    def handler(req):
        lock.acquire()
        # ... use and manipulate my global data ...
        lock.release()

Now, what the import lock first ensures is that you cannot have two
brand new instances of a module being imported at the same time. It also
ensures that if the module already exists, you cannot have two module
reloads occurring at the same time. Since variable assignments and
executable code at global scope only get executed at the time of import,
all the import lock is doing is ensuring that you can't have that code
being executed within two threads at the same time.

My understanding is that the import lock cannot ensure that a separate
thread which is already executing code within the module can't access
the global data while a module reload is occurring. I think this because
I can't see how it could be implemented in a practical way. It would
somehow necessitate a lock on every module which would be acquired
whenever any code in that module is executing, so as to prevent a reload
of the module at the same time.

What this means is that if an HTTP request came in which caused
handler() above to be executed, and while that request was being
handled, the modification time of the file was first changed and then a
new HTTP request came in before the first had even finished, you have
the potential for problems with the above code as it is written.

Before I get to why, one needs to understand what happens when a module
reload occurs. Specifically, when a module reload occurs, the existing
data within the module is not thrown away. That is, where a brand new
import starts out with an empty module, a module reload starts out with
the existing module. Thus, when a module reload occurs and global
assignments are performed, they are overlaid on or replace what was
there before.

To illustrate this, create a content handler which is triggered using
SetHandler and which contains the following code. Note that this ignores
threading issues for now.

    from mod_python import apache

    _temp = globals().keys()
    _temp.sort()
    apache.log_error("globals "+str(_temp),
                     apache.APLOG_NOERRNO|apache.APLOG_ERR)

    if globals().has_key("_count"):
        _count = _count + 1
        apache.log_error("reload "+str(_count),
                         apache.APLOG_NOERRNO|apache.APLOG_ERR)
    else:
        apache.log_error("import",
                         apache.APLOG_NOERRNO|apache.APLOG_ERR)
        _count = 0

    _value = 0
    apache.log_error("value "+str(_value),
                     apache.APLOG_NOERRNO|apache.APLOG_ERR)
    _value = _value + 1

    def handler(req):
        req.content_type = "text/plain"
        req.send_http_header()
        req.write("Hello World!")
        return apache.OK

The first time this is accessed you should see the following in the
Apache log file.

    [Sun Oct 10 11:36:14 2004] [notice] mod_python: (Re)importing module 'mptest'
    [Sun Oct 10 11:36:14 2004] [error] globals ['__builtins__', '__doc__', '__file__', '__name__', 'apache']
    [Sun Oct 10 11:36:14 2004] [error] import
    [Sun Oct 10 11:36:14 2004] [error] value 0

That is, the code detected that it was a first time import and has
logged "import". Now touch the code file and access it again, and when
you hit the same child Apache process you should see the following.
    [Sun Oct 10 11:36:22 2004] [notice] mod_python: (Re)importing module 'mptest'
    [Sun Oct 10 11:36:22 2004] [error] globals ['__builtins__', '__doc__', '__file__', '__mtime__', '__name__', '_count', '_temp', '_value', 'apache', 'handler']
    [Sun Oct 10 11:36:22 2004] [error] reload 1
    [Sun Oct 10 11:36:22 2004] [error] value 0

Touch it once more, and you should see the following.

    [Sun Oct 10 11:36:33 2004] [notice] mod_python: (Re)importing module 'mptest'
    [Sun Oct 10 11:36:33 2004] [error] globals ['__builtins__', '__doc__', '__file__', '__mtime__', '__name__', '_count', '_temp', '_value', 'apache', 'handler']
    [Sun Oct 10 11:36:33 2004] [error] reload 2
    [Sun Oct 10 11:36:33 2004] [error] value 0

Note how, on the second access, the "_count" variable and the "handler"
function already exist in the set of globals. This shows how a reload
will overlay what is already there. The check against the presence of
"_count" can be used to determine that the global scope code being
executed is running because of a reload and not an initial import. We
use that check here so that on a reload we only increment the value of
the existing variable rather than setting it back to 0. If you don't
have this check, as is the case with "_value", the value gets reset to
whatever its starting value was when a reload occurs. Thus, even though
"_value" was incremented, it gets set back to 0 upon the next reload.

Now think about the consequences of this for the original example. First
off, the lock variable will get reassigned on a reload of the module. In
my understanding of Python, this means that if handler() were executing
in a separate thread and had acquired the lock but not yet released it,
and the reload occurred in a new thread, the global lock is going to get
replaced with a new one. When the handler executing in the original
thread goes to release the lock, it will call release() on a different
lock.

The next problem is that the intent of the lock is to protect access to
the global variables. The handler correctly acquires the lock in order
to make use of the global variables, but the code executed by the
subsequent reload does not. Thus, not only does the reload replace the
lock and the global variables, the modification of the global variables
is not done within the context of the lock. This means that the handler
which is trying to use the global variables may find they have suddenly
been changed when it was expecting to have exclusive access.

In the context of threading and module reloading, the code presented to
me would thus appear to be quite dangerous. If a module is written like
this, and my understanding of reloading is correct, one would definitely
want to heed the advice of disabling automatic module reloading in a
production system.

What one could do to make the code more robust is explicitly check when
a reload is occurring, and not set or change any global variables. If it
was still necessary somehow to modify the global variables, the code
executing at module scope when the module is being reloaded could itself
acquire the lock so that there is no conflict on access. Thus, the code
would be something like the following.

    from threading import Lock

    if not globals().has_key("lock"):
        # This is a brand new import and we are inside the scope of the
        # import lock, so this should all be safe to execute.
        lock = Lock()

        myglobal1 = None
        myglobal2 = None
        myglobal3 = None

        # ... arbitrary code to initialise globals ...

    lock.acquire()
    # ... extra code to fiddle the globals on a reload ...
    lock.release()

    def handler(req):
        lock.acquire()
        # ... use and manipulate my global data ...
        lock.release()
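To make the earlier danger concrete, here is a standalone sketch, plain
Python with nothing mod_python specific, of what an unguarded reload
effectively does to a module-level lock that a handler thread still
holds:

    import threading

    lock = threading.Lock()
    lock.acquire()            # a handler thread enters its critical section

    lock = threading.Lock()   # an unguarded module reload rebinds "lock"

    lock.release()            # the handler's eventual release now operates
                              # on the new, never-acquired lock; this raises
                              # an error, and the original lock is left held

With the has_key() guard in place, a reload leaves the original lock
object bound to "lock" and this situation cannot arise.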
Unfortunately we're not done yet, as there are still a few more dangers
lurking here, and I haven't even started on the problems with
apache.import_module() yet. The next thing to be careful of is something
like the following.

    def modify1(req):
        lock.acquire()
        # ...
        lock.release()
        return value

    def modify2(req, value):
        lock.acquire()
        # ...
        lock.release()

    def handler(req):
        value = modify1(req)
        modify2(req, value)
        # ...

Maybe I am being a bit paranoid on this one, but what happens if the
operation of modify1() and modify2() is tightly linked, such that
modify2() expects modify1() to return data in a certain way? Imagine now
that a module reload occurred between the call to modify1() and the call
to modify2(), that the code for both functions got changed, and that the
protocol for what format modify2() expects got changed along with it.
When the handler now executes modify2() it is the newer function, and it
finds data set up in a different way than expected, because it was the
old modify1() that was previously called to produce the value it is
operating on.

The point I am trying to make is that perhaps locking shouldn't pertain
just to the global data, and that executable code should in some way be
protected as well. It needs more exploration, but perhaps the module
should read something like the following (untested).

    from threading import Lock

    if not globals().has_key("datalock"):
        # This is a brand new import and we are inside the scope of the
        # import lock, so this should all be safe to execute.
        datalock = Lock()
        codelock = Lock()

        myglobal1 = None
        myglobal2 = None
        myglobal3 = None

        # ... arbitrary code to initialise globals ...

    codelock.acquire()
    datalock.acquire()

    # ... extra code to fiddle the globals on a reload ...

    def modify1(req):
        datalock.acquire()
        # ...
        datalock.release()
        return value

    def modify2(req, value):
        datalock.acquire()
        # ...
        datalock.release()

    def _handler(req):
        value = modify1(req)
        modify2(req, value)
        # ...

    def handler(req):
        # code for this method should never be changed
        codelock.acquire()
        try:
            result = _handler(req)
        finally:
            codelock.release()
        return result

    datalock.release()
    codelock.release()

What is being attempted here is to use a lock to ensure that the code
itself is not replaced while the handler is being executed. The reason
for the two level handler arrangement is that the lock actually needs to
be acquired outside of the real handler method being called. If it is
done inside, it is too late, as Python has already grabbed the existing
code for the handler, and if that gets replaced prior to the lock being
acquired, everything may have changed. Am I being too paranoid? I don't
understand enough about how code gets replaced while a module reload is
occurring.

One final thing to cover before moving on to looking at
apache.import_module(). Set aside for now the fact that we did all this
thread stuff to make things work a bit better within a threaded
environment, as the next bit pertains even to a prefork system.

Remember how "_value" in our original example was replaced with 0 again
when the module was reloaded. Imagine now that it was a quite
complicated Python class instance, or even a Python class instance which
wrapped up C code for something like a database connection pool. With
the way the code was written, the existing instance of the class would
get thrown away when a reload occurs.

If that class was never designed to be discarded, which may well be the
case in a database pooling scheme, or if the class had a __del__ method
and there were reference cycles that couldn't be broken by the garbage
collector, the original class instance would hang around. This may
simply result in ever growing memory use, or continual growth in the
number of socket connections to a back end database, as older instances
of the database pooling interface never get cleaned up.
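The __del__ plus reference cycle case is easy to demonstrate in plain
Python, independent of mod_python; this is just an illustrative sketch
with made-up class names:

    import gc

    class Pool:
        def __del__(self):
            # pretend this closes database connections
            pass

    a = Pool()
    b = Pool()
    a.other = b
    b.other = a     # reference cycle between the two instances

    del a, b        # drop our references; the cycle keeps them alive
    gc.collect()

    # Because the cycle contains objects with __del__ methods, the
    # collector refuses to guess a destruction order; the instances
    # end up in gc.garbage and are never freed.
    print gc.garbage

A pool instance orphaned by a module reload in this fashion would hang
on to its database connections indefinitely.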
To see how this latter issue is relevant to the original problem you
saw, we now need to look at apache.import_module(). The problem with the
module importing and caching system in the mod_python.apache module is
that it doesn't do any thread locking at all. That is, many threads can
at the same time be consulting the module cache, checking modification
times, using imp.find_module() and then calling imp.load_module().

Think about the case where a module hasn't previously been loaded and a
huge number of requests suddenly come in which require it to be loaded.
Because there is no thread locking on access to the cache, many threads
may decide at the same time that the module needs to be loaded. All of
these threads will continue through to call imp.load_module(). It has
already been explained that there is locking on module importation, so
it isn't possible to import a brand new module from two places at the
same time. This doesn't however stop all those extra threads from still
calling imp.load_module() once the initial thread has already imported
it. Why is this bad? Well, to quote the Python documentation for
imp.load_module():

  This function does more than importing the module: if the module was
  already imported, it is equivalent to a reload()!

Thus, all those extra calls into imp.load_module() from the other
threads that had decided at the same time to load the module get turned
into module reloads. Link this with what was described above for the
case of a database pooling instance being replaced on module reload and
you might see what is happening. It isn't that multiple instances of the
module corresponding to the handler are loaded into memory at the same
time, but that the one module is reloaded many times in quick
succession. In each case the database pooling instance would get thrown
away, but if it wasn't implemented so as to be able to clean up after
itself when destroyed, if it even is destroyed, it will just hang around
and accumulate memory, and potentially hold open connections to the
database as well.
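One way to avoid this thundering herd of reloads would be to serialise
the cache check and the load. This is only a sketch of the idea, not a
patch against the actual apache.import_module() code, and it uses
sys.modules to stand in for mod_python's own cache purely for
illustration. imp.acquire_lock() and imp.release_lock() do exist (Python
2.3 and later) and take the same interpreter-wide import lock described
earlier.

    import imp
    import sys

    def import_module_once(name, path=None):
        # Hold the import lock across both the cache check and the
        # load, so only one thread can decide the module needs loading.
        imp.acquire_lock()
        try:
            if sys.modules.has_key(name):
                # Already imported by some other thread; return it
                # instead of triggering an implicit reload().
                return sys.modules[name]
            fob, fname, desc = imp.find_module(name, path)
            try:
                return imp.load_module(name, fob, fname, desc)
            finally:
                if fob:
                    fob.close()
        finally:
            imp.release_lock()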
Okay, I think I am done for now. This doesn't contrast imp vs execfile
etc, but I'll get to that in another mail. Hope this has helped. If
anyone who understands this better than I do detects falsehoods here,
please, please correct them and explain what actually happens.

Can someone please confirm that multiple threads corresponding to
different HTTP requests can be operating within one interpreter at the
same time? Certain bits of the above analysis assume that is the case.

--
Graham Dumpleton (grahamd at dscpl.com.au)