This week I’ve been thinking about how we’ll manage Python packages for the ACCESS community at NCI to add better support for Python 3. I’ve started adding Python 3 support to my own scripts (using travis.org to check they actually work), and it’s not been that bad, and we’re starting to get users ask on the helpdesk for Python 3 versions of the climate-focused packages we support.

There are a couple inconvenient differences between Python 2 and Python 3. Print is now a function, rather than a statement, so brackets must be added anywhere you print. A number of built-in packages were renamed, meaning occasionally you need to go searching for the new name when you get an import error. Finally files you open are now interpreted as Unicode, rather than plain bytes as Python 2 and most Unix tools do, so data files need to be opened specifically in binary mode.

Past these differences, which can be worked around easily enough using the __future__ and six packages, the core of the language is pretty much the same. The majority of core libraries like numpy and pandas have been converted to work with either version, but many libraries and scripts remain Python 2 only, while some newer ones only support Python 3.

Currently NCI only support numpy, scipy and matplotlib on the Raijin supercomputer, and ask users to install extra packages to their home directory using Pip. This is because managing the dependencies between library versions gets very complicated quite quickly when supporting a larger number of packages for a wide range of different use cases, especially as some researchers using the machine want extreme stability so that their results are reproducible, while others want to make use of the latest and greatest features.

The ACCESS community separately maintains a number of python libraries in the /projects/access area. These are both libraries that are complicated for researchers to install, for instance file format libraries that require binary dependencies to be compiled, as well as odds and ends like pip and pytest that are helpful for general development as well as climate and weather research.

We support these libraries using the same environment module system that NCI uses for the software it supports. Environment modules are little TCL scripts that modify your $PATH, allowing multiple versions of a software package to be installed alongside each other, so loading the modulepython/2.7.5 will add /apps/python/2.7.5/bin, and python/3.4.3 will add /apps/python/3.4.3/bin.

Our ACCESS pythonlib modules work much the same way, only modifying $PYTHONPATH so that the Python interpreter can find the libraries in the module. We’ve even got a nifty script that searches PyPI for a specific library version, installs that version into the apps/pythonlib directory, then installs a module file so that researchers can use it.

There are two downsides with this however. Firstly dependencies need to be installed and tracked separately, which is a manual process as PyPI doesn’t track this information, and secondly our setup only supports Python 2.7. As the different patch versions are mostly compatible with each other a library installed with 2.7.5 will still work with 2.7.6, so this is ok if we only support 2.7 variants.

If we start supporting Python 3 though we’ve got another axis to think about - both the library version as well as the Python version. Add tracking dependencies on to that, which is somewhat tedious using the TCL module files, and a dedicated package manager becomes attractive.

Anaconda is the natural choice for this - it has a sophisticated dependency solver and a database of packages, meaning we don’t need to worry about tracking dependencies ourselves. It also handles binary dependencies, so less need for us to work out random build systems, and has a large number of climate and weather related packages available.

Intel have also recently released on Anaconda a number of libraries optimised for their processors, which would be great for people doing number crunching. Previous attempts to compile these ourselves have resulted in a mess of linker errors.

Unfortunately NCI aren’t presently too keen on supporting Anaconda centrally, since it includes system packages like OpenSSL where getting prompt security updates are important. I’ve also got my own concerns about their packaging system, which is entirely separate from PyPI so if you create a package with dependencies that aren’t in the Conda ecosystem you must re-package them yourself, and manually monitor for updates from the developer.