Installing Chaste on Archer

(Last updated Nov 2016.)

General

Important: New users are encouraged to skim the Archer documentation first, in particular the getting started guide. Be aware that you have a /home and a /work partition, and of the differences between them.

In this document it is assumed that the dependencies will live in /work/.../chaste-libs and the Chaste code will live in /work/.../Chaste, where in both cases the "..." are something like "e462/e462/louiecn2".

I/O

Serious consideration should be given to input/output (I/O) on a system like Archer. At the very least, please read the I/O optimisation section of the tuning guide. Getting the wrong settings can make things 1-2 orders of magnitude slower!

Curious readers are advised to look at these slides (or these ones for more detail), and search around for Lustre tips such as this page.

Lustre striping

/work is a Lustre filesystem, where files can be distributed and broken up ("striped") over a large number of hard disks ("OSTs") to improve parallel performance, but it's up to you to make sure things are working at their best. Before you do anything (and before copying data across) you should set this up properly.

You basically have control over two parameters for every file and directory you own: the number of stripes, and the size of these stripes. For parallel access, slide 24 in the second link above contains a good rule of thumb:

  • If #files > #OSTs: set stripe_count=1. You will reduce Lustre contention and OST file locking this way and gain performance.
  • If #files == 1: set stripe_count=#OSTs (assuming you have more than 1 I/O client).
  • If #files < #OSTs: select stripe_count so that you use all OSTs. Example: you have 8 OSTs and write 4 files at the same time, then select stripe_count=2.

(There are 48 OSTs at time of writing.)
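If you want to check the current OST count for yourself, lfs df lists every OST in the filesystem, so something like the following should work (the grep is just a quick way of counting the OST lines):

lfs df /work | grep -c OST     # number of OSTs in the /work filesystem
lfs df -h /work                # per-OST usage, handy for spotting nearly-full OSTs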

Other good rules of thumb are:

  • Use a stripe count of 1 for directories with many small files.
  • Increase the stripe_count for parallel writes to the same file - approximately 1 stripe per GB of file size. E.g. you might try using 2 for files < 1 GB, 8 for files < 10 GB, and 24 for files < 100 GB, etc.
  • Set stripe count to a factor of the number of parallel processes for best symmetry/load balancing.

Suggestions for Chaste

Source code (many small files)

The source code consists of many small files, so we want striping off completely. To do this we'll make a Chaste directory and use setstripe so that when we check out the code it inherits the setting.

louiecn2@eslogin004:~> cd /work/e462/e462/louiecn2/
louiecn2@eslogin004:/work/e462/e462/louiecn2> mkdir Chaste
louiecn2@eslogin004:/work/e462/e462/louiecn2> lfs setstripe --stripe-size 1M --stripe-count 1 --stripe-index -1 Chaste

Note that we're also setting the stripe-index here just in case (it should ALWAYS be -1, as this allows the system to load balance). Note also that the stripe-size makes no difference when stripe-count=1; it's just a good default.
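To double-check that the directory picked up the defaults you just set, lfs getstripe with -d prints only the directory's own default layout (rather than recursing into its contents), e.g.:

lfs getstripe -d Chaste    # should report stripe_count 1 and stripe_size 1048576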

Large data files e.g. mesh files

The process for data files you're copying over (e.g. mesh files) and data files that get created at run-time is slightly different.

For large input files, the easiest solution is to change the directory settings before copying the data over (using scp or whatever), then change them back after. For example

louiecn2@eslogin005:/work/e462/e462/louiecn2/Chaste/projects/louiecn/test/data> lfs setstripe -c 8 .
(copy big mesh files into data directory)
louiecn2@eslogin005:/work/e462/e462/louiecn2/Chaste/projects/louiecn/test/data> lfs setstripe -c 1 .

The first command stripes any new files over 8 OSTs; the second restores the single-stripe default afterwards. You can confirm the settings worked by using lfs getstripe . to check the stripe_count. For example, for each file you should see something like

./FullMeshVolume_bin.ele
lmm_stripe_count:   8
lmm_stripe_size:    1048576
...

which means the file is sliced up over 8 OSTs.

I use the above rule of thumb about roughly 1 stripe per GB and use factors of 12, e.g.:

  • 1 stripe for small files (<100 MB)
  • 2 stripes for larger files (100 MB to a few GB)
  • 4 stripes for ~several GB files
  • 8 stripes for ~10 GB files
  • 12 stripes for ~15 GB files
  • 24 stripes... etc.

It's a bit of a pain having to set it, copy a file or two, set it again, etc., but it only has to be done once so it's worth doing this optimally.
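If you have a lot of mesh files of different sizes, one way to avoid the set/copy/set dance is to pre-create each destination file with its own stripe count and then copy into it: the layout is fixed when a file is created, so an ordinary cp into the existing (empty) file keeps the striping. A rough sketch, run from the destination directory on /work; the source path and size thresholds are just illustrative:

# Sketch: per-file striping based on size (thresholds/paths are illustrative)
for src in /path/to/source/*; do
    dest=$(basename "$src")
    gb=$(( $(stat -c %s "$src") / 1073741824 ))   # size in whole GB
    count=1
    [ "$gb" -ge 1 ]  && count=2
    [ "$gb" -ge 5 ]  && count=4
    [ "$gb" -ge 10 ] && count=8
    lfs setstripe -c "$count" "$dest"   # create an empty file with this layout
    cp "$src" "$dest"                   # copying into it preserves the layout
done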

Large HDF5 data files

Do these three things to make sure HDF5 files are striped properly.

  1. Use an MPICH_MPIIO_HINTS environment variable to specify striping numbers for newly-created .h5 files. (See example bashrc below!) striping_factor=8 will use 8 OSTs, but as above you might be better off using more or fewer, so experiment with the "one per GB" and/or "factor of # processes" rules. striping_unit=1048576 means use 1 MB stripes, which works for me but you might benefit from something larger.
  2. Whatever you set the stripe size to (previous point) you should let the cardiac problem know by adding a line in your test. E.g. if you used 1048576, call e.g. monodomain_problem.SetHdf5DataWriterTargetChunkSizeAndAlignment(1048576). This makes it possible for the writer to divide the results into chunks that each fit neatly into a stripe.
  3. Enable Hdf5DataWriter caching, again by adding a line to your test e.g. bidomain_problem.SetUseHdf5DataWriterCache(). This tells the writer to only write to disk after multiple timesteps, which massively improves bandwidth.

If you have data with "bad" stripe settings

If you've already got data on the system and it's got sub-optimal settings (check with lfs getstripe ...), use the following template:

mv dir old-dir
mkdir dir
lfs setstripe -i -1 -s 1M -c 1 dir
cp -a old-dir/* dir/

This moves the old directory somewhere safe, creates a new directory, sets the stripe properties, and copies the backed-up contents to the new directory, where it inherits the striping. You can then delete old-dir.
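Before deleting old-dir it's worth sanity-checking that the new copy is complete and actually picked up the new layout, e.g. something along these lines:

lfs getstripe -d dir      # confirm the new directory default
diff -r old-dir dir       # confirm the copy is complete
rm -rf old-dir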

Dependencies and environment variables

Adapt the following (i.e. replace with your user name and paths) and put it at the end of your ~/.bashrc file to set things up automatically every time you log in:

export WORK=/work/e462/e462/louiecn2
alias cdchaste='cd $WORK/Chaste'

module swap PrgEnv-cray PrgEnv-intel
module load cray-petsc cray-hdf5-parallel vtk boost xerces-c cray-tpsl svn python-compute craype-hugepages2M

export CHASTE_LIBS=$WORK/chaste-libs
export CHASTE_LOAD_ENV=1
export CHASTE_TEST_OUTPUT=$WORK/testoutput

export PATH=$CHASTE_LIBS/bin:$PATH
export LD_LIBRARY_PATH=$WORK/Chaste/lib:$LD_LIBRARY_PATH # Lets us use cl=1
export PYTHONPATH=$CHASTE_LIBS/lib/python:$PYTHONPATH

# Convenient alias for scons, call it whatever you like, and invoke from Chaste directory like this:
# Sco global/test/TestChasteBuildInfo.hpp
function Sco {
    scons -j8 b=IntelHpc co=1 br=1 do_inf_tests=0 $1
}

# Stripe h5 files over 8 OSTs. More/fewer may be better depending on your problem 
# size and number of cores! Also note the 1 M stripe size (1048576 bytes)
export MPICH_MPIIO_HINTS="*.h5:striping_factor=8:striping_unit=1048576"

# Build dynamic exes
#export CRAYPE_LINK_TYPE=dynamic

Here we set MPICH_MPIIO_HINTS so that new .h5 files are striped over 8 OSTs (out of 48 at time of writing). This is sensible for HDF5 files on the order of 10 GB; for larger files you might benefit from more, even up to using all the OSTs. Another thing to consider: if you're doing simulations on a relatively small number of processes, e.g. on 1 or 2 compute nodes (24 or 48 cores), then you could try replacing the striping_factor with the number of nodes (so that each node has a dedicated writer) or even the number of processes (so that every single process is a writer). As it's just an environment variable you could set it in the job script itself so that it's tailored to the specific job and node count.
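For example, in the job script for a 2-node (48-core) run you might override the bashrc default just before the aprun line, so that each node gets a dedicated writer (the value 2 is just illustrative):

# Per-job override: stripe new .h5 files over one OST per node for this 2-node job
export MPICH_MPIIO_HINTS="*.h5:striping_factor=2:striping_unit=1048576"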

The final (commented-out) line sets CRAYPE_LINK_TYPE, which tells the compiler to use dynamic linking. I'd strongly recommend using this (i.e. uncomment the line), but do read what it means first here. The main effect is smaller executables that use much less memory, but be aware that it means the executable will use the versions of libraries specified by the module, which might change in future. In other words, if you compile a program with a module that later disappears (Archer periodically updates its modules) then the program will no longer work without recompiling.

On a related note, the module load line loads the default versions, which change over time and may not be the recommended versions. You can see which versions of things are installed using

module avail [modulefile]

and load them specifically, e.g. module load cray-petsc/3.4.2.0.

A useful list of libraries and their versions, and upcoming changes, can be found here.

Furthermore, the LD_LIBRARY_PATH line makes it possible to use Chaste's cl=1 option.

Installation

If you've only just done your ~/.bashrc file (the previous step), log out and log back in again before continuing!

SCons

Note: building with SCons was deprecated (in early 2016) in favour of CMake, but CMake has not yet been tested or configured on Archer. If anyone can look into this it would be useful! For now SCons is still acceptable.

From $CHASTE_LIBS:

wget http://downloads.sourceforge.net/project/scons/scons/2.3.0/scons-2.3.0.tar.gz
tar zxf scons-2.3.0.tar.gz
cd scons-2.3.0
python setup.py install --prefix=$CHASTE_LIBS
cd ..
rm -rf scons-2.3.0.tar.gz scons-2.3.0
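As a quick sanity check that the install went to the right place (assuming the PATH line from the bashrc above is active):

which scons        # should point at $CHASTE_LIBS/bin/scons
scons --version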

PyCml dependencies

See InstallPyCml for more explanation.

You will need to do the following to make easy_install work:

Create ~/.pydistutils.cfg with the following content (replacing with your path to chaste-libs):

[install]
install_lib = /work/.../chaste-libs/lib/python
install_scripts = /work/.../chaste-libs/bin

and make the lib/python directory, i.e.

mkdir -p /work/.../chaste-libs/lib/python

Then, again from $CHASTE_LIBS:

wget http://peak.telecommunity.com/dist/ez_setup.py
python ez_setup.py
easy_install "python-dateutil==1.5"
easy_install "Amara==1.2.0.2"
easy_install rdflib

(We don't need to do lxml as it's already installed.)
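As a rough sanity check that the packages installed into $CHASTE_LIBS/lib/python and are importable, something like the one-liner below should run cleanly; note the lowercase module names are an assumption about how these packages import:

python -c "import dateutil, amara, rdflib; print 'PyCml dependencies OK'"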

XSD

For XSD we get the binary. Again, from $CHASTE_LIBS:

wget http://www.codesynthesis.com/download/xsd/3.3/linux-gnu/x86_64/xsd-3.3.0-x86_64-linux-gnu.tar.bz2
tar -xjf xsd-3.3.0-x86_64-linux-gnu.tar.bz2
mv xsd-3.3.0-x86_64-linux-gnu xsd
ln -s $CHASTE_LIBS/xsd/bin/xsd $CHASTE_LIBS/bin/xsd
rm -f xsd-3.3.0-x86_64-linux-gnu.tar.bz2

As documented elsewhere, unfortunately there is a small bug with GCC and XSD. Fix it by modifying $CHASTE_LIBS/xsd/libxsd/xsd/cxx/zc-istream.txx and changing line 35 to read

this->setg(

instead of

setg(
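If you'd rather apply that one-line fix from the command line, something like the following sed should do it (assuming line 35 is still the offending line in this version of XSD):

sed -i '35s/setg(/this->setg(/' $CHASTE_LIBS/xsd/libxsd/xsd/cxx/zc-istream.txx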

Building Chaste

From /work/.../, check out a working copy of the source code according to ChasteGuides/AccessCodeRepository, i.e.

git clone -b develop https://chaste.cs.ox.ac.uk/git/chaste.git Chaste

(Make sure you followed the I/O instructions earlier first!)

Edit your SConscript file; see InstallGuides/CheckoutUserProject#Important:touseexistingChastecode.

If the profile has loaded correctly then svn and scons should be in your PATH, and you can compile with:

scons build=IntelHpc co=1 ...

or if you used my example bashrc you can use the handy Sco function. Either way, you can't run anything on the login nodes, so it's important to use compile_only=1 (or co=1) to stop the tests running right away. Parallel tests need to be run through the queue.

Note that the IntelHpc build type has CC flags -DNDEBUG -O3 -no-prec-div which mean "turn off asserts", "aggressive optimisation", and "use less-accurate divisions", respectively. In my testing they have a large effect on performance (especially -DNDEBUG) but they might not be suitable for all simulations (especially -no-prec-div).

If you're getting weird test failures, try turning the asserts back on by leaving out the -DNDEBUG flag. It's not pretty, but you can do this by removing it from the IntelHpc build type, i.e. edit python/BuildTypes.py as follows and recompile:

Index: python/BuildTypes.py
===================================================================
--- python/BuildTypes.py    (revision 26677)
+++ python/BuildTypes.py    (working copy)
@@ -1104,7 +1104,7 @@
         self.rdynamic_link_flag = '-dynamic'
         self.tools['mpicxx'] = 'CC'
         self.build_dir = 'intelhpc'
-        self._cc_flags = ['-DNDEBUG','-O3','-no-prec-div']
+        self._cc_flags = ['-O3','-no-prec-div']
         self.is_optimised = True

See ChasteGuides/DeveloperBuildGuide for more on the SCons arguments.

Running Chaste

Read this to learn about submitting jobs; there's too much to cover in this short guide. In particular, try out the handy bolt script, and note the commands qstat and qdel.

Once you've read the above, you might find the following example job script helpful:

#!/bin/bash --login
#PBS -l select=10
#PBS -N [job name]
#PBS -A [credit quota]
#PBS -l walltime=01:23:45
#PBS -m abe
#PBS -M [your email address]

# Switch to current working directory
cd $PBS_O_WORKDIR

# Run the parallel program
aprun -n 240 -N 24 -d 1 -S 12 -j 1 /work/.../Chaste/global/build/intelhpc/TestChasteBuildInfoRunner >& stdout.txt

This script asks for 10 nodes (-l select) and runs a total of 240 processes (-n), placed 24 per node (-N) and 12 per NUMA region (-S), without hyperthreading (-j 1) or OpenMP threading (-d 1). If you have no idea what this means then you probably just want to change the select=10, -n 240, and walltime values, so that -n is 24 times select.
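For instance, scaling the same script down to a 2-node, 48-process job would just mean (path elided as above):

#PBS -l select=2
...
aprun -n 48 -N 24 -d 1 -S 12 -j 1 /work/.../Chaste/global/build/intelhpc/TestChasteBuildInfoRunner >& stdout.txt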

The script may then be added to the job queue by typing

qsub [script name]

By appending >& stdout.txt you'll get some output in the directory from which qsub was invoked. Without this, you get output in files named after the job name (e.g. [job name].o1234567 and [job name].e1234567). Note: there is a performance penalty to doing this (which depends on the amount of output).

Happy supercomputing!


You can probably ignore the information below; it's been hanging around since this was a HECToR install guide, just in case it becomes useful again.


CrayPat

CrayPat is the Cray profiling suite. There is a section in the user manual above about automatic profile generation. It is not always successful.

The process is "compile, use pat_build to instrument executable, run, use pat_report to examine profiling data".

Load the CrayPat module before compiling.

If the automatic profiling process doesn't work then there are alternatives:

Sampling

The simplest profiling to perform is sampling

module load xt-craypat
scons build=... 
pat_build \
   notforrelease/build/craygcc_ndebug/TestChasteBenchmarksForPreDiCTRunner \
   TestChasteBenchmarksForPreDiCTRunner+pat

The above example creates an instrumented executable called "TestChasteBenchmarksForPreDiCTRunner+pat" from the original executable. Run this instrumented executable as normal, and then use pat_report to analyse either the .xf file (for a small number of processes) or the directory produced.

pat_report -O profile TestChasteBenchmarksForPreDiCTRunner+pat+21007-12441sdt

This will give time spent in individual functions and the computational imbalance in those functions. It will not give information on the calltree.

There are several other pat_report options, including:

pat_report -T -O profile # report all functions, not just most important
pat_report -s pe=ALL # report values for all processes

See the man page and the CrayPat documentation for more information.

MPI profiling

CrayPat has a series of tracegroups that can be profiled, one of which is MPI.

pat_build -g mpi executable-name instrumented-executable-name

To get a calltree of where the MPI time is spent:

pat_report -O calltree filename.xf

Other -O options include load_balance and callers, among others. See the man page for more details.

Other profiling

CrayPat has other tracegroups (io, hdf5, lustre, system, blas, math, ...) which work in the same fashion as the mpi tracegroup.

GUI

Apprentice2 is a GUI for looking at the results (packages are also available for download so you can examine profiling results locally; see the user manual):

module load apprentice2
app2