Installing Chaste on Archer

(Last updated Feb 2016.)

General

Important: New users are encouraged to skim the Archer documentation first, in particular the getting started guide. Be aware that you have both a /home and a /work partition, and of the differences between them.

In this document it is assumed that the dependencies will live in /work/.../chaste-libs and the Chaste code will live in /work/.../Chaste, where in both cases the "..." are something like "e462/e462/louiecn2".

I/O

Serious consideration should be given to input/output (I/O) on a system like Archer. At the very least, please read the section on I/O optimisation in the tuning guide. Getting the wrong settings can make things 1-2 orders of magnitude slower!

Curious readers are advised to look at these slides (or these ones for more detail), and search around for Lustre tips such as this page.

Lustre striping

/work is a Lustre filesystem, where files can be distributed and broken up ("striped") over a large number of hard disks ("OSTs") to improve parallel performance, but it's up to you to make sure things are working at their best. Before you do anything (and before copying data across) you should set this up properly.

You basically have control over two parameters for every file and directory you own: the number of stripes, and the size of these stripes. For parallel access, slide 24 in the second link above contains a good rule of thumb:

  • If #files > #OSTs: set stripe_count=1. This reduces Lustre contention and OST file locking, and so improves performance.
  • If #files == 1: set stripe_count=#OSTs (assuming you have more than one I/O client).
  • If #files < #OSTs: select stripe_count so that you use all OSTs. Example: you have 8 OSTs and write 4 files at the same time, so select stripe_count=2.

(There are 48 OSTs at time of writing.)
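
If you want to check the current number for yourself, one way (assuming a standard Lustre client, so the exact output format may differ) is to count the OST lines reported by lfs df:

# Each OST backing /work appears as its own "...-OSTxxxx_UUID" line
lfs df /work | grep -c OST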

Other good rules of thumb are:

  • Use a stripe count of 1 for directories with many small files.
  • Increase the stripe_count for parallel writes to the same file - approximately 1 stripe per GB of file size.
  • Set the stripe count to a factor of the number of parallel processes.

Suggestions for Chaste

Source code (many small files)

The source code is many small files, so we want striping off completely. To do this we'll make a Chaste directory and use setstripe so that when we check out the code it inherits the setting.

louiecn2@eslogin004:~> cd /work/e462/e462/louiecn2/
louiecn2@eslogin004:/work/e462/e462/louiecn2> mkdir Chaste
louiecn2@eslogin004:/work/e462/e462/louiecn2> lfs setstripe --stripe-size 1M --stripe-count 1 --stripe-index -1 Chaste

Note that we're also setting the stripe-index here just in case (it should ALWAYS be -1, as this allows the system to load balance). Note also that the stripe-size makes no difference when stripe-count=1; it's just a sensible default.
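
You can double-check the directory's new default settings with lfs getstripe; the -d flag below limits the output to the directory itself rather than recursing into its contents (the output format varies a little between Lustre versions, but you should see stripe_count 1 and stripe_offset -1):

# Show only the default striping settings of the Chaste directory
lfs getstripe -d Chaste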

Large data files e.g. mesh files

The process is slightly different for data files you're copying over (e.g. mesh files) and data files that get created at run-time.

For large input files, the easiest solution is to change the directory settings before copying the data over (using scp or whatever), then change them back afterwards. For example:

louiecn2@eslogin005:/work/e462/e462/louiecn2/Chaste/projects/louiecn/test/data> lfs setstripe -c -1 .
(copy big mesh files into data directory)
louiecn2@eslogin005:/work/e462/e462/louiecn2/Chaste/projects/louiecn/test/data> lfs setstripe -c 1 .

The first command stripes any new files over every OST (-1 means use all); the second restores the single-stripe setting afterwards. You can confirm that the stripe settings worked by running lfs getstripe . and checking the stripe_count. For each file you should see something like

./FullMeshVolume_bin.ele
lmm_stripe_count:   48
lmm_stripe_size:    1048576
...

which means the file is sliced up over all 48 OSTs.

Large HDF5 data files

The HDF5 file is trickier as it gets made by the program, so the solution involves doing three things to your code. (Note the links below are to an older version of the code so the line numbers will be different!)

  1. Uncomment the H5Pset_alignment call to ensure each HDF5 chunk fits neatly into a Lustre stripe, reducing contention.
  2. Increase the chunk size parameter `target_size_in_bytes` from 1024*1024/8 (128 KB) to 1024*1024 (1 MB).
  3. Use an MPICH_MPIIO_HINTS environment variable to specify striping settings for newly-created .h5 files (see the example bashrc below!).

By doing these things, the HDF5 chunks will be 1 MB, aligned to 1 MB blocks, and striped in 1 MB stripes. Perfect!
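
After a run you can confirm the hints were honoured by checking the striping of the output file itself; the path below is illustrative (use whatever output folder and .h5 file name your simulation actually produces), and the lmm_stripe_count reported should match your striping_factor:

# "[output folder]/results.h5" is a placeholder for your simulation's HDF5 output
lfs getstripe $CHASTE_TEST_OUTPUT/[output folder]/results.h5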

If you have data with "bad" stripe settings

If you've already got data on the system and it's got sub-optimal settings (check with lfs getstripe ...), use the following template:

mv dir old-dir
mkdir dir
lfs setstripe -i -1 -s 1M -c 1 dir
cp -a old-dir/* dir/

This moves the old directory somewhere safe, creates a new directory, sets the stripe properties, and copies the backed-up contents to the new directory, where it inherits the striping. You can then delete old-dir.

Dependencies and environment variables

Adapt the following (i.e. replace with your user name and paths) and put it at the end of your ~/.bashrc file to set things up automatically every time you log in:

export WORK=/work/e462/e462/louiecn2
alias cdchaste='cd $WORK/Chaste'

module swap PrgEnv-cray PrgEnv-intel
module load cray-petsc cray-hdf5-parallel vtk boost xerces-c cray-tpsl svn python-compute

export CHASTE_LIBS=$WORK/chaste-libs
export CHASTE_LOAD_ENV=1
export CHASTE_TEST_OUTPUT=$WORK/testoutput

export PATH=$CHASTE_LIBS/bin:$PATH
export LD_LIBRARY_PATH=$WORK/Chaste/lib:$LD_LIBRARY_PATH # Lets us use cl=1
export PYTHONPATH=$CHASTE_LIBS/lib/python:$PYTHONPATH

# Convenient alias for scons, call it whatever you like, and invoke from Chaste directory like this:
# Sco global/test/TestChasteBuildInfo.hpp
function Sco {
    scons -j8 b=IntelHpc co=1 br=1 do_inf_tests=0 $1
}

# Stripe h5 files over 48 OSTs. Fewer may be better depending on your problem
# size and number of cores! Also specify a 1 MB stripe size (1048576 bytes)
export MPICH_MPIIO_HINTS="*.h5:striping_factor=48:striping_unit=1048576"

# Build dynamic exes
#export CRAYPE_LINK_TYPE=dynamic

Here we set MPICH_MPIIO_HINTS so that new .h5 files are striped over all OSTs (48 at time of writing). Using all the OSTs is good when the number of processes > number of OSTs. If you're running simulations on fewer processes than this, e.g. on 1 or 2 compute nodes (24 or 48 cores), then you could replace the striping_factor with the number of processes (24 or 48, say) so that there is a 1:1 mapping between writing processes and stripes. As it's just an environment variable, this can be set in the job script for maximum flexibility.
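
For example, a sketch of a job-script override for a 2-node (48-core) run might look like the following (the 48 here is the number of MPI processes, not the number of OSTs):

# In the job script, before the aprun line:
# stripe new .h5 files over 48 OSTs to match the 48 writing processes
export MPICH_MPIIO_HINTS="*.h5:striping_factor=48:striping_unit=1048576"
aprun -n 48 -N 24 -d 1 -S 12 -j 1 [your executable] >& stdout.txt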

The final line (CRAYPE_LINK_TYPE) tells the compiler to use dynamic linking. I'd strongly recommend using this (i.e. uncomment the line), but do read what it means first here. The main effect is smaller executables that use much less memory, but be aware that it means the executable will use the versions of libraries specified by the module, which might change in future.

On a related note, the module load line loads the default versions, which change over time and may not be the recommended versions. You can see which versions of things are installed using

module avail [modulefile]

and load them specifically, e.g. module load cray-petsc/3.4.2.0.

A useful list of libraries and their versions, and upcoming changes, can be found here.

Furthermore, the LD_LIBRARY_PATH line makes it possible to use Chaste's cl=1 option.

Installation

If you've only just done your ~/.bashrc file (the previous step), log out and log back in again before continuing!

SCons

Note: building with SCons will soon be deprecated (early 2016) in favour of CMake. This guide will be updated once the CMake system is ready.

From $CHASTE_LIBS:

wget http://downloads.sourceforge.net/project/scons/scons/2.3.0/scons-2.3.0.tar.gz
tar zxf scons-2.3.0.tar.gz
cd scons-2.3.0
python setup.py install --prefix=$CHASTE_LIBS
cd ..
rm -rf scons-2.3.0.tar.gz scons-2.3.0
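
As a quick sanity check (assuming the ~/.bashrc above has been loaded, so $CHASTE_LIBS/bin is on your PATH):

which scons        # should point at $CHASTE_LIBS/bin/scons
scons --version    # should report SCons 2.3.0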

PyCml dependencies

See InstallPyCml for more explanation.

You will need to do the following to make easy_install work:

Create ~/.pydistutils.cfg with the following content (replacing with your path to chaste-libs):

[install]
install_lib = /work/.../chaste-libs/lib/python
install_scripts = /work/.../chaste-libs/bin

and make the lib/python directory, i.e.

mkdir /work/.../chaste-libs/lib/python

Then, again from $CHASTE_LIBS:

wget http://peak.telecommunity.com/dist/ez_setup.py
python ez_setup.py
easy_install "python-dateutil==1.5"
easy_install "Amara==1.2.0.2"
easy_install rdflib

(We don't need to do lxml as it's already installed.)
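
If you want to check that easy_install respected the .pydistutils.cfg settings, the packages should have ended up under the lib/python directory you created:

ls $CHASTE_LIBS/lib/python   # should list eggs for python-dateutil, Amara and rdflib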

XSD

For XSD we get the binary. Again, from $CHASTE_LIBS:

wget http://www.codesynthesis.com/download/xsd/3.3/linux-gnu/x86_64/xsd-3.3.0-x86_64-linux-gnu.tar.bz2
tar -xjf xsd-3.3.0-x86_64-linux-gnu.tar.bz2
mv xsd-3.3.0-x86_64-linux-gnu xsd
ln -s $CHASTE_LIBS/xsd/bin/xsd $CHASTE_LIBS/bin/xsd
rm -f xsd-3.3.0-x86_64-linux-gnu.tar.bz2

As documented elsewhere, unfortunately there is a small bug with GCC and XSD. Fix it by modifying $CHASTE_LIBS/xsd/libxsd/xsd/cxx/zc-istream.txx and changing line 35 of zc-istream.txx to read

this->setg(

instead of

setg(
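
If you'd rather script the fix, a one-line sed does the same job, assuming line 35 is still the offending setg( call as in this version of XSD (only run it once, and keep the .bak backup it makes):

sed -i.bak '35s/setg(/this->setg(/' $CHASTE_LIBS/xsd/libxsd/xsd/cxx/zc-istream.txx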

Building Chaste

From /work/.../ check out a working copy of the source code using:

svn co https://chaste.cs.ox.ac.uk/svn/chaste/trunk Chaste --username [your Chaste username]

(Make sure you followed the I/O instructions earlier first!)

Edit your SConscript file, see InstallGuides/CheckoutUserProject#Important:touseexistingChastecode.

If the profile has loaded correctly then svn and scons should be in your PATH, and you can compile with:

scons build=IntelHpc co=1 ...

or, if you used my example bashrc, you can use the handy alias Sco. Either way, you can't run anything on the login nodes, so it's important to use compile_only=1 (or co=1) to stop the tests running right away. Parallel tests need to be run through the queue.

Note that the IntelHpc build type has CC flags -DNDEBUG -O3 -no-prec-div, which mean "turn off asserts", "aggressive optimisation", and "use less-accurate divisions", respectively. In my testing they have a large effect on performance (especially -DNDEBUG), but they might not be suitable for all simulations (especially -no-prec-div).

See ChasteGuides/DeveloperBuildGuide for more on the SCons arguments.

Running Chaste

Read this to learn about submitting jobs; there's too much to cover in this short guide. In particular, try out the handy bolt script, and note the commands qstat and qdel.
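
As a quick reference, the two commands you'll use most often for keeping an eye on jobs are:

qstat -u $USER    # list your queued and running jobs
qdel [job id]     # remove a job from the queue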

Once you've read the above, you might find the following example job script helpful:

#!/bin/bash --login
#PBS -l select=10
#PBS -N [job name]
#PBS -A [credit quota]
#PBS -l walltime=01:23:45
#PBS -m abe
#PBS -M [your email address]

# Switch to current working directory
cd $PBS_O_WORKDIR

# Run the parallel program
aprun -n 240 -N 24 -d 1 -S 12 -j 1 /work/.../Chaste/global/build/intelhpc/TestChasteBuildInfoRunner >& stdout.txt

This script asks for 10 nodes (-l), for a total of 240 processes (-n). It assigns them 24 per node (-N) and 12 per NUMA region (-S), without hyperthreading (-j) or OpenMP threading (-d). If you have no idea what this means then you probably want to just change the select=10, -n 240, and walltime bits, so that -n is 24 times select=.
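
For instance, a scaled-down version of the same script for a 2-node (48-core) run would change just those lines (path elided as before):

#PBS -l select=2
aprun -n 48 -N 24 -d 1 -S 12 -j 1 /work/.../Chaste/global/build/intelhpc/TestChasteBuildInfoRunner >& stdout.txt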

The script may then be added to the job queue by typing

qsub [script name]

By appending >& stdout.txt you'll get the program's output in the directory from which qsub was invoked. Without it, you get output in files named after the job (e.g. [job name].o1234567 and [job name].e1234567). Note: there can be a performance penalty to this redirection, depending on the amount of output.

Happy supercomputing!


You can probably ignore the information below; it has been hanging around since this was a HECToR install guide, just in case it becomes useful again.


CrayPat

CrayPat is the Cray profiling suite. There is a section in the user manual mentioned above about automatic profile generation, but it is not always successful.

The process is "compile, use pat_build to instrument executable, run, use pat_report to examine profiling data".

Load the CrayPat module before compiling.

If the automatic profiling process doesn't work then there are alternatives.

Sampling

The simplest profiling to perform is sampling:

module load xt-craypat
scons build=... 
pat_build \
   notforrelease/build/craygcc_ndebug/TestChasteBenchmarksForPreDiCTRunner \
   TestChasteBenchmarksForPreDiCTRunner+pat

The above example creates an instrumented executable called "TestChasteBenchmarksForPreDiCTRunner+pat" from the original executable. Run this instrumented executable as normal and then use pat_report to analyse either the .xf file (for a small number of processes) or the directory produced.

pat_report -O profile TestChasteBenchmarksForPreDiCTRunner+pat+21007-12441sdt

This will give time spent in individual functions and the computational imbalance in those functions. It will not give information on the calltree.

There are several other pat_report options, including:

pat_report -T -O profile # report all functions, not just most important
pat_report -s pe=ALL # report values for all processes

See the man page and the CrayPat documentation for more information.

MPI profiling

CrayPat has a series of tracegroups that can be profiled, one of which is MPI.

pat_build -g mpi executable-name instrumented-executable-name

To get a calltree of where the MPI time is spent:

pat_report -O calltree filename.xf

Other -O options include load_balance and callers, among others. See the man page for more details.

Other profiling

CrayPat has other tracegroups (io, hdf5, lustre, system, blas, math, ...) which work in the same fashion as the mpi tracegroup.
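
For example, to trace I/O and HDF5 calls as well as MPI you could build an instrumented executable with something like the following (check pat_build's man page for the exact set of tracegroup names available on your system):

pat_build -g mpi,io,hdf5 \
   TestChasteBenchmarksForPreDiCTRunner \
   TestChasteBenchmarksForPreDiCTRunner+pat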

GUI

Apprentice2 is a GUI for examining the results (packages are also available for download, so you can look at profiling results locally; see the user manual):

module load apprentice2
app2