This section contains some simple examples for you to familiarise yourself with the cluster. The relevant files can be found under /direct/Software/Training.

Hello World

In this simple example you will learn how to write a script that can be submitted to the batch system. You can then submit it and get some output back.

Copy this script into your account:


# a simple script
# PN Thu Feb 23 15:58:23 CET 2012

#$ -S /bin/bash
#$ -e <put the directory the script resides in here, e.g. /home/<your user name>/myfirstsge>
#$ -o <like -e>

echo "hello from $(hostname)"

Save this script under a name of your choice.

Submit the script:

qsub ./<script name>

Check its status:

qstat

You might see something like

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
    487 0.00000 <script>   lpartec      qw    03/12/2012 09:15:24                                    1

This means that your job has been assigned the ID 487, has no running priority yet (it is 0), is named after your script by default, is running under the user ID lpartec, is waiting (qw), was submitted at the time shown, and uses 1 slot. After some time it will get a higher priority (0.50500 or so), change to running (r), and be assigned a queue instance. At some point it will disappear from qstat (if not, check the FAQ section) and you will find two files called hello.e.<jobid> and hello.o.<jobid>. One holds the standard error, the other the standard output of your job. The standard output file contains something like

hello from nodeXXX

You could add the command env to your script and find out how your environment differs from interactive use.
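One way to make that comparison is to save the sorted environment to a file in both settings. This is only a sketch: the helper name dump_env and the file names are made up for illustration. JOB_ID is set by SGE inside a job; the fallback "interactive" is used when it is unset.

```shell
# Hypothetical helper: write the sorted environment to the given file.
dump_env() {
    env | sort > "$1"
}

# Inside a job SGE sets JOB_ID; outside of one we fall back to "interactive".
env_file="env.${JOB_ID:-interactive}"
dump_env "${env_file}"
echo "environment written to ${env_file}"
```

Run it once on the frontend and once from your job script, then compare the two files with diff.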

Now, modify the "echo" line of your script to look like this:

echo "hello from task ${SGE_TASK_ID} on $(hostname)"

To run your job again as a task array, use the -t switch:

qsub -t 10-19 ./<script name>

This feature is useful when working on a set of data files. For example, you could index your data files using the $SGE_TASK_ID variable.
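A minimal sketch of such indexing, assuming one input file per task. The naming scheme data_<task id>.dat is hypothetical, and the default of 10 only applies when the script is run outside the batch system.

```shell
#$ -S /bin/bash
#$ -cwd

# SGE sets SGE_TASK_ID for each task of an array job (10..19 for -t 10-19).
# The default of 10 is an assumption for dry runs outside the batch system.
task_id="${SGE_TASK_ID:-10}"

# Hypothetical naming scheme: one input file per task.
input_file="data_${task_id}.dat"

echo "task ${task_id} on $(hostname) would process ${input_file}"
```

Each of the ten tasks then picks up its own file without any change to the script.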


ACML

The AMD Core Math Library (ACML) is a collection of mathematical functions specially tuned for AMD processors. You can find more information on the ACML home page. A PDF document containing compilation and linking examples is at /opt/acml<version>/Doc/acml.pdf.

Example programs are found in the /opt/acml<version>/<compiler>/examples directory.

Under /direct/Software/Training/acml, you will find C++ source code that performs an LU decomposition of a matrix. You can use the supplied Makefile to build a binary and to check which flags are necessary.

The program reads a problem size from the command line, prints the ACML version information, and calculates the GFlop/s rate from the number of operations and the time spent in the dgetrf routine:

lpartec@marc2-h1:~/training/acml> ./dgetrf 200
ACML (AMD Core Math Library) version 5.1.0  (Mon Dec 12 01:58:29 CST 2011)
Copyright AMD,NAG 2011
Build system: Linux x86_64 denarius
Built using Fortran compiler: GNU Fortran (GCC) 4.6.0
   with flags:  -ffixed-line-length-132 -Wall -W -Wno-unused -Wno-uninitialized -fPIC -fno-second-underscore -fimplicit-none -m64 -DIS_64BIT -msse2 -O3
 and C compiler: gcc (GCC) 4.6.0
   with flags:  -Wall -W -Wno-unused-parameter -Wstrict-prototypes -Wwrite-strings -D_GNU_SOURCE -D_ISOC99_SOURCE -fPIC  -m64 -DIS_64BIT -mstackrealign -msse2 -O3
Problem size: 200, Number of runs: 100, time spent in dgetrf: 0.35 s, Number of operations: 5.36e+08, GFlop/s: 1.53143
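The last line of this output can be checked by hand: the rate is the number of operations divided by the time, scaled to 10^9. The helper name gflops below is made up for illustration, and awk is used because the shell itself has no floating-point arithmetic.

```shell
# Hypothetical helper: gflops OPS SECONDS
# Prints OPS / SECONDS / 1e9 with five decimals.
gflops() {
    awk -v ops="$1" -v secs="$2" 'BEGIN { printf "%.5f\n", ops / secs / 1e9 }'
}

# Values taken from the sample run above.
gflops 5.36e+08 0.35
```

With the rounded values printed by the program, this gives 1.53143, the same figure as in the sample output.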

The supplied shell script contains two directives:

  1. #$ -cwd: Tells SGE to change to the directory the job was submitted from. Thus, you can specify your binary as ./dgetrf instead of its absolute path. Note that ~ will not expand to your home directory.
  2. #$ -j y: Joins the standard error and standard output streams. Without it, the ACML version information goes to the .o file and the timing information goes to the .e file.

Don't forget to supply the problem size when submitting:

qsub ./<script name> <problem size>
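The supplied script is not reproduced here. As a rough sketch, a job script using both directives and passing the problem size on to the binary could look like this. The default size of 200 and the fallback message are assumptions added for illustration.

```shell
#$ -S /bin/bash
#$ -cwd
#$ -j y

# The problem size given after the script name on the qsub command line
# arrives as the first positional argument. The default of 200 is an assumption.
size="${1:-200}"

if [ -x ./dgetrf ]; then
    ./dgetrf "${size}"
else
    echo "no dgetrf binary in $(pwd); build it with the supplied Makefile first"
fi
```

Because of -cwd the binary can be called as ./dgetrf, and because of -j y all output ends up in a single file.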


MPI

MPI (Message Passing Interface) is a standard for parallel programming based on message passing. ParaStation MPI is an efficient implementation of it.

A minimal program is found at /direct/Software/Training/mpi. To compile the source code, you can use the MPI compiler wrapper:

mpicc -o hello hello.c

The program prints out its MPI rank and the host it is running on.

To protect the frontend from overload, parallel programs will not run on it, so you have to submit them to the cluster via SGE. A script file is provided for this. When submitting it, you must specify the parallel environment it should run in and the number of ranks, e.g.

qsub -pe parastation_rr 16 ./<script name>

SGE will reserve the required number of slots and provide the environment variable $NSLOTS, which can be used in the -np argument:

mpiexec -np ${NSLOTS} ./hello

ParaStation will ignore any --machinefile or similar switches in batch operation.

In case you see

ips_proto_connect: Couldn't connect to nodeXXX

in your output, disable the PSM layer by setting

export PSP_PSM=0

in your script file.
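Putting these pieces together, a job script for the MPI example might look like the following sketch. It is not the supplied script. The fallback of 1 slot and the existence checks are assumptions added so that the script also behaves sensibly outside the batch system.

```shell
#$ -S /bin/bash
#$ -cwd
#$ -j y

# Uncomment the next line if the output shows
# "ips_proto_connect: Couldn't connect to nodeXXX".
# export PSP_PSM=0

# NSLOTS is provided by SGE; the default of 1 is an assumption for dry runs.
nslots="${NSLOTS:-1}"
echo "starting ${nslots} rank(s)"

if command -v mpiexec >/dev/null 2>&1 && [ -x ./hello ]; then
    mpiexec -np "${nslots}" ./hello
else
    echo "mpiexec or ./hello not found; nothing started"
fi
```

Submitted with -pe parastation_rr 16, the script would start 16 ranks, since SGE sets NSLOTS to the number of slots it reserved.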

High Performance Linpack

HPL is the benchmark used to rank supercomputers in the TOP500 list. It solves a system of linear equations in a distributed fashion.

The source code is available for download; a copy is also at /direct/Software/Training/hpl/hpl-2.0.tar.gz. To get started, unpack it:

gzip -dc hpl-2.0.tar.gz | tar xvf -

In the resulting directory, you can find some example Makefiles in the setup directory. Copy Make.Linux_ATHLON_FBLAS into the HPL top directory and rename it to something that identifies the MPI and performance library used, e.g. Make.ps_acml. In there, you need to change the following settings:

  1. ARCH: The name of your "architecture" (ps_acml in our example)
  2. TOPdir: The location of your hpl top directory
  3. MPdir: The top directory of your MPI implementation, e.g. /opt/parastation/mpi2
  4. MPlib: The libraries needed for MPI. mpif77 -show tells you the details
  5. LAdir: The top directory of your performance lib, e.g. /opt/acml5.1.0/gfortran64
  6. LAlib: The libraries for your performance lib, e.g. $(LAdir)/lib/libacml.a
  7. CC: Your C compiler, usually gcc
  8. LINKER: Your linker, gfortran
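If you prefer not to edit the file by hand, some of these settings can be scripted. The sketch below assumes that each setting sits on its own line in the form "NAME = value", starting at the beginning of the line. The helper name set_make_var is made up for illustration. TOPdir and MPlib are left out because they depend on your directory layout and on what mpif77 -show reports.

```shell
# Hypothetical helper: set_make_var FILE NAME VALUE
# Replaces the line "NAME = ..." in FILE with "NAME = VALUE".
# VALUE must not contain "|", "&" or "\" because it is pasted into a sed replacement.
set_make_var() {
    sed "s|^$2 *=.*|$2 = $3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Example values taken from the list above; adjust them to your installation.
makefile="Make.ps_acml"
if [ -f "${makefile}" ]; then
    set_make_var "${makefile}" ARCH   'ps_acml'
    set_make_var "${makefile}" MPdir  '/opt/parastation/mpi2'
    set_make_var "${makefile}" LAdir  '/opt/acml5.1.0/gfortran64'
    set_make_var "${makefile}" LAlib  '$(LAdir)/lib/libacml.a'
    set_make_var "${makefile}" CC     'gcc'
    set_make_var "${makefile}" LINKER 'gfortran'
fi
```

The pattern only matches a name that is followed directly by spaces and an equals sign, so setting ARCH does not touch a variable such as ARCHIVER.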

Once your Make.<ARCH> is set up, type

make arch=<ARCH>

and you will get a binary xhpl in the directory ./bin/<ARCH>, plus a default steering file (HPL.dat) that uses 4 processes. Write a wrapper script:


#$ -pe parastation_rr 4

mpiexec -np ${NSLOTS} ./xhpl

Once you have it running, you can tune your HPL.dat file to measure the parallel performance of your system.