Graham Dumpleton
grahamd at dscpl.com.au
Sat Jul 9 02:40:40 EDT 2005
Note that this has been pushed to the mod_python developers list. If you are interested in following this discussion, it is suggested that you subscribe to that list instead. Anyone replying to this email, please remove the standard mod_python list from the cc to keep it just on the developers list.

Sorry, it is a long email. I was probably deluded in thinking I could defer discussions on implementation, but then I guess this isn't strictly implementation, as it is still more at the level of how it could work. You might want to digest the email for some time before replying. :-)

On 09/07/2005, at 12:49 AM, Jorey Bump wrote:

> dharana wrote:
>
>> Oh, ok. I feel better now. But I think I should comment that then,
>> for me, the new mechanism won't make a difference in the "apachectl
>> restart" routine. I will continue to do restart apache if the new
>> mechanism doesn't supports packages. I've abstracted a lot of things
>> in my custom handler/webapp so the only thing I'm changing nowadays
>> are inside packages (99% of the time).
>
> Same here. My published modules typically have only a few functions in
> them, and they are usually interfaces to code from package imports.
> The reload mechanism would have to support packages to avoid being
> another pandora's box.

I'll try and expand on some of the issues about Python packages and why it becomes so complicated, and provide direction on how it all can work. I know it can work as I already have an implementation similar to what is described. :-)

The import_module() function currently allows you to import the top level of a Python package, or a sub module/package within a Python package. Its current support for importing a sub module/package is broken in various ways, as I documented in my article. You probably would not notice these issues if your use of Python packages is simplistic, ie., merely to have a bunch of modules kept within a specific namespace for convenience.
For example, an empty __init__.py in the top level directory and a series of file based modules contained within that directory. Also, the fact that modules are currently reloaded on top of the existing instance does lessen the possibility of problems.

Once a Python package starts to get more complicated and you have internal dependencies between different parts of the package itself, you will more easily start to encounter problems. This is because dependency cycles can develop between the different parts of the Python package. The consequence of this is that even if only one part of the Python package changes, the only sane thing to do is to reload the complete package. As such, at the moment the only reliable way to reload a Python package properly is to restart Apache.

If an updated import_module() function attempted to support automatic reloading of packages properly, it would have to take a similar approach, whereby if even a single part of the Python package is changed, it would have to ensure that the whole Python package was reloaded. I haven't quite worked out whether this is even totally possible yet. Even getting to the point where I am has exceedingly complicated the design of the module importing system.

The reason for the extra complexity is that in order to support identically named modules/packages in different directories, you cannot store modules in sys.modules, as they are keyed based on the module/package name. The problem now is that the way Python internally supports the "import" statement inside a Python package, whereby it will first look for a module within the package itself, requires that the component modules of the Python package be stored in sys.modules. I don't know why, but it doesn't work if they aren't. Thus you have a contradiction.
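
To illustrate the sys.modules half of this, here is a minimal sketch in plain Python, with nothing mod_python specific in it. The module name clash_demo and the two temporary directories are made up for the example. Once a module is cached in sys.modules under its name, a second file with the same name in another directory is never loaded:

```python
import importlib
import os
import sys
import tempfile

def write_module(directory, name, body):
    """Create directory/name.py containing body."""
    with open(os.path.join(directory, name + ".py"), "w") as f:
        f.write(body)

# Two directories, each holding a different module called clash_demo.
dir_a = tempfile.mkdtemp()
dir_b = tempfile.mkdtemp()
write_module(dir_a, "clash_demo", "WHERE = 'a'\n")
write_module(dir_b, "clash_demo", "WHERE = 'b'\n")

sys.path.insert(0, dir_a)
importlib.invalidate_caches()
first = importlib.import_module("clash_demo")

# Give dir_b priority on sys.path and import the same name again.
sys.path.insert(0, dir_b)
importlib.invalidate_caches()
second = importlib.import_module("clash_demo")

# sys.modules is keyed on the name alone, so the cached module from
# dir_a is returned and the file in dir_b is never executed.
print(second is first)
print(second.WHERE)
```
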
The parts of a Python package must be present in sys.modules for imports internal to the Python package to work, but putting them in sys.modules runs up against the goal of being able to use the same name for a module/package in different directories.

To achieve both goals entails still storing the Python package parts in sys.modules, but without internally assigning them their true names. Instead they are given a unique name derived from the path name of the file containing the code for that component. Ie., __name__ within the module and the key used to store it in sys.modules use this magic generated name.

The problem now is that when an import hook is asked to find the sub part of a Python package, rather than use the true name of the package in the dotted path, eg., "a.b", it will sometimes use the magic name for the parent parts of the path, eg., "magic_name.b". Therefore you have to do some translation of the magic names back and forth.

This all gets even worse when one looks at the issue of modules reloading on top of the existing instance, which can cause loss of access to resources without them being closed off properly, and other problems in a multithreaded MPM. To get around the issues, you keep around the old instance of a module/package while importing it again into a separate object space. For packages, which have to be in sys.modules, it means that the magic name has to incorporate an instance count such that the name changes for each revision of the package.

I don't expect most to understand all this, I just hope you believe me when I say that trying to provide support for reloading of Python packages is hard. My worry is that the addition of all this extra complexity will make the code fragile, as it may unknowingly include too many dependencies on subtle nuances of how the Python module import system works.
Thus, it may work fine now (if it can be made to work in the first place), but the next major version of Python may break it and we may not always be able to work out how to tweak it to work with a newer version of Python.

What does this all mean?

First is that I really don't see it as practical that import_module() be able to support Python packages, whether that means importing from the root of the package or a sub part of the package, as part of a scheme to handle automatic module reloading. To do so adds too much complexity, with the code being much more fragile as a result.

I will point out now that this would not affect the ability to specify a Python package or sub part of a package in a handler directive such as PythonHandler. This is because the dispatch callback for a handler is more complicated than it being just a call to import_module(). As such, within the dispatch callback, it can do some checks to ascertain how it should import a module. For example, if the module specified by the directive is located in the directory the handler pertains to, it would use import_module(), specifying the exact directory the module should be loaded from. As a fallback, it would then use the standard Python module import mechanism to otherwise import the module. This would cause sys.path to be consulted and the handler could be a Python package as a result. No automatic module reloading would be available for a handler module found along sys.path in this way.

Because the dispatch callback would explicitly check the directory for which the directive is defined to find the module, and explicitly tell import_module() this directory, there would not actually be a need for that directory to be put in sys.path. This means there would be no overlap of directories for which import_module() is being used and those in sys.path.
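
As a rough illustration of loading a file based module from an explicitly named directory without that directory being on sys.path, here is a minimal sketch. It uses the importlib machinery of current Python rather than anything in mod_python, and load_from_directory(), magic_name() and handler_demo are names made up for the example. Modules are named after a hash of the file path, so two handlers with the same file name in different directories do not clash:

```python
import hashlib
import importlib.util
import os
import sys
import tempfile

def magic_name(path):
    """Hypothetical unique module name derived from the file's path."""
    digest = hashlib.sha256(os.path.abspath(path).encode("utf-8")).hexdigest()
    return "_mp_" + digest[:16]

def load_from_directory(name, directory):
    """Load directory/name.py if present; return None if there is no such file.

    Only file based modules are considered, and sys.path is not consulted.
    """
    path = os.path.join(directory, name + ".py")
    if not os.path.isfile(path):
        return None
    spec = importlib.util.spec_from_file_location(magic_name(path), path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file; sys.modules is untouched
    return module

# Two handler directories, each with its own handler_demo.py.
dir_a = tempfile.mkdtemp()
dir_b = tempfile.mkdtemp()
for directory, who in ((dir_a, "a"), (dir_b, "b")):
    with open(os.path.join(directory, "handler_demo.py"), "w") as f:
        f.write("WHO = %r\n" % who)

mod_a = load_from_directory("handler_demo", dir_a)
mod_b = load_from_directory("handler_demo", dir_b)
missing = load_from_directory("no_such_module", dir_a)
```

Note that this sketch leaves sys.modules alone entirely, which is only adequate for plain file based modules. It does not attempt the harder case of package parts, which need to be in sys.modules for their internal imports to work.
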
The "import" statement within a handler in the directory would be made to work without use of sys.path, using import hooks and other means, so that it transparently flows through and uses import_module() instead.

For those cases where someone does not want to put a handler in the directory the directive applies to, but still wants to make use of automatic module reloading, there could be an alternate path definition, for example PythonImportModulePath. This path would be searched for the file based module and if found, import_module() used to explicitly load it from that directory.

Thus, for a handler directive, steps would be:

1. Look in the directory the directive is defined for and if the handler module is there, load it using import_module(). It can only be a file based module. The name of the module doesn't have to be unique.

2. Look in directories defined by PythonImportModulePath and if found there, load it using import_module(). It again can only be a file based module. The name of the module doesn't have to be unique (but should it be?).

3. If still not found, use the inbuilt Python __import__ mechanism to try and find the handler module. In this case, it could be a file based module or package, but it must be on sys.path. The PythonPath setting can still be used to extend sys.path with alternate places to look. The name of the module/package must be unique within the set of all modules and packages available by searching sys.path.

Next issue is how import_module() would work when called explicitly within a handler. Up till now, if no path has been supplied to import_module() it would search sys.path. This I believe needs to change. Instead of searching sys.path, it should search PythonImportModulePath. The reason for this is that it retains the clear separation between directories where the standard Python builtin import mechanism is used and those where import_module() is used.
Doing this avoids the problems caused by a module being imported from different places using the different mechanisms.

The trick here is how import_module() gets access to the value of PythonImportModulePath, as it is generally only accessible via the req object and a user may not have that available to supply to import_module(). There is a similar problem with how the log and autoreload options have to be passed to import_module().

The solution here is that mod_python itself, which already holds the req object when it calls the handler, should provide a means to access the current request object from code which doesn't otherwise have it. This could be as _apache.current_request(). In general it would be the intention that this only be used for Python code within mod_python itself, like import_module(), but it may also be useful to implementors of other handlers extending on mod_python.

With this function, import_module() can internally derive the value of PythonImportModulePath as well as PythonDebug and PythonAutoReload. Knowing the latter two makes the log and autoreload options redundant and addresses other issues I raised in my article.

As to when a path is explicitly supplied to import_module(), this would still behave the same way, with that path being searched for the module.

Note though that when import_module() is used, it would only look for file based modules. It would not pick up Python packages even if located in the directories where it was told to look. Because this would break backward compatibility, what one might do instead is provide a new apache.import_file() function. This function would strictly enforce only being able to load file based modules. The import_module() function then could implement the steps:

1. If no explicit search path is provided, set the search path to the value of PythonImportModulePath.

2. Look in directories defined by the search path and if found there, load it using import_file(). It can only be a file based module. The name of the module doesn't have to be unique.

3. If still not found, use the inbuilt Python __import__ mechanism to try and find the module. In this case, it could be a file based module or package, but it must be on sys.path. The PythonPath setting can still be used to extend sys.path with alternate places to look. The name of the module/package must be unique within the set of all modules and packages available by searching sys.path. No automatic reloading would apply to modules/packages found this way.

The import_file() function would though have to raise some unique exception type so that import_module() knows that the module could not be found and can thus move on to trying __import__ instead.

If import_module() is implemented like this, then the existing code to find the handler probably wouldn't need to be changed and neither would any user code. Whereas before any modules found on sys.path using import_module() would have automatic module reloading applied to them, this would no longer occur, and thus users would need to shift any additional search directories added using PythonPath into PythonImportModulePath instead. That way module reloading would work on them. Because of the multiple levels of functions, a name other than PythonImportModulePath may be more appropriate.

Next is what happens when the "import" statement is used within a handler. At the moment it will allow importing of modules from directories specified by the handler directive and sys.path. In this case, import hooks as per PEP 302 would be used to customise how the "import" statement works. It would work similarly to the steps above, except that the search path would be the combination of the directory in which the importer's code is located, the handler root directory and the value of PythonImportModulePath. If the module isn't in those locations it would fall through to using standard import from sys.path.
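
The two level import_module()/import_file() lookup described above, which the "import" hook would reuse with a longer search path, might be sketched as follows. This is only a sketch under assumptions: it is written against current Python's importlib, the exception name ModuleFileNotFound is invented, and the default_path argument merely stands in for the value of PythonImportModulePath, which in the real thing would come from the request object. Reloading and caching are left out entirely:

```python
import importlib
import importlib.util
import os
import tempfile

class ModuleFileNotFound(ImportError):
    """Hypothetical unique exception: no file based module on the search path."""

def import_file(name, search_path):
    """Load name.py from the first directory in search_path that has it.

    Packages (directories) are deliberately ignored.
    """
    for directory in search_path:
        path = os.path.join(directory, name + ".py")
        if os.path.isfile(path):
            spec = importlib.util.spec_from_file_location(name, path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            return module
    raise ModuleFileNotFound(name)

def import_module(name, path=None, default_path=()):
    # Step 1: no explicit path means use the configured default.
    search_path = list(path) if path is not None else list(default_path)
    try:
        # Step 2: file based modules only.
        return import_file(name, search_path)
    except ModuleFileNotFound:
        # Step 3: fall back to the standard mechanism and sys.path.
        return importlib.import_module(name)

# Demo: one module found by file, one found only via sys.path.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "greeting_demo.py"), "w") as f:
    f.write("HELLO = 'hi'\n")

found = import_module("greeting_demo", path=[demo_dir])
fallback = import_module("json", path=[demo_dir])
no_path = import_module("json")
```

Catching only the dedicated exception matters here: an ImportError raised from inside the module's own code propagates to the caller instead of silently triggering the sys.path fallback.
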
Obviously if found on sys.path, there is no automatic module reloading, and because Python packages aren't supported by import_module(), they have to be on sys.path.

Note that this would always add a search of the local directory where the Python code is located first. This is different to now for the case of a subdirectory, but I really feel that this makes more sense when you consider that for the nearest parallel, ie., that of packages, it will always find a module in the same directory first. I know this isn't a true Python package, but the fact that modules can be spread over a directory hierarchy when dispatchers such as mod_python.publisher are used, yet there is still only one handler root, makes it similar in some ways.

It also must be noted that "import" would only try and use the mod_python module importing system if the module in which it was used was itself loaded using the mod_python module importing system in the first place. Thus, the use of "import" in a module/package anywhere along sys.path would not be impacted and would not start using the mod_python module importing system.

As a final summary:

1. Existing code wouldn't need to be changed if two levels of import_module()/import_file() are implemented as described. Ie., where import_module() gives the appearance of providing the same features as before.

2. Python packages would have to be on sys.path though. This would only be an issue where someone had put the Python package in the handler root directory, rather than elsewhere, and hadn't extended the PythonPath setting explicitly.

3. Python packages would not be candidates for automatic module reloading.

4. Simple file based modules outside of the handler directories which were previously identified by setting PythonPath would not be candidates for automatic module reloading unless those directories are moved from PythonPath to PythonImportModulePath.

5. The import statement would now transparently make use of the mod_python module importing system in modules to import file based modules not on sys.path as appropriate. These modules would be candidates for automatic module reloading. This bit of magic would only occur for modules imported using the mod_python module importing system in the first place.

Hmmm, rambled on more than I should have there. Gotta rush out for a dinner date now. :-)

Graham