Hints

Some tips for using the Marc2 cluster

MPI Jobs

Open MPI

The Open MPI modules are provided mainly for debugging purposes.
They are shipped with the QLogic/Intel InfiniBand suite and are
intended for opening support cases with QLogic/Intel and for running tests.
I would prefer Parastation MPI in production because
it is easier to use, and there is no longer a significant
performance difference between the two.

Parastation MPI

The following modules are available (a usage sketch follows the lists below):

-bash-4.1$ module avail

For the gcc compiler suite:

  • parastation/mpi2-gcc-5.0.27-1(default)
  • parastation/mpi2-gcc-mt-5.0.27-1 (combined mpi and openMP)

For the intel suite:

  • parastation/mpi2-intel-5.0.27-1
  • parastation/mpi2-intel-mt-5.0.27-1 (combined mpi and openMP)

For the pgi compiler:

  • parastation/mpi2-pgi-5.0.27-1
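
For example, loading one of these modules and building and launching a small MPI program could look like the following sketch (mpicc and mpiexec are the usual wrapper names shipped with Parastation MPI; the program name and rank count are only examples):

module load parastation/mpi2-gcc-5.0.27-1   # or one of the other modules above

mpicc -O2 -o hello hello.c                  # compile with the Parastation wrapper
mpiexec -np 64 ./hello                      # start 64 ranks via the Parastation process manager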

The AMD Architecture

The physical architecture can be displayed with two tools:

  • numactl --hardware
  • likwid-topology

Output of numactl --hardware

available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 32758 MB
node 0 free: 31705 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 32768 MB
node 1 free: 31780 MB
node 2 cpus: 32 33 34 35 36 37 38 39
node 2 size: 32768 MB
node 2 free: 31921 MB
node 3 cpus: 40 41 42 43 44 45 46 47
node 3 size: 32768 MB
node 3 free: 31841 MB
node 4 cpus: 48 49 50 51 52 53 54 55
node 4 size: 32768 MB
node 4 free: 31896 MB
node 5 cpus: 56 57 58 59 60 61 62 63
node 5 size: 32768 MB
node 5 free: 31764 MB
node 6 cpus: 16 17 18 19 20 21 22 23
node 6 size: 32768 MB
node 6 free: 31934 MB
node 7 cpus: 24 25 26 27 28 29 30 31
node 7 size: 32752 MB
node 7 free: 31921 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  22  16  22  16  22 
  1:  16  10  22  16  16  22  22  16 
  2:  16  22  10  16  16  16  16  16 
  3:  22  16  16  10  16  16  22  22 
  4:  16  16  16  16  10  16  16  22 
  5:  22  22  16  16  16  10  22  16 
  6:  16  22  16  22  16  22  10  16 
  7:  22  16  16  22  22  16  16  10 

And likwid-topology:

-------------------------------------------------------------
CPU type:	AMD Interlagos processor 
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets:	4 
Cores per socket:	16 
Threads per core:	1 
-------------------------------------------------------------
HWThread	Thread		Core		Socket
0		0		0		0
1		0		1		0
2		0		2		0
3		0		3		0
4		0		4		0
5		0		5		0
6		0		6		0
7		0		7		0
8		0		8		0
9		0		9		0
10		0		10		0
11		0		11		0
12		0		12		0
13		0		13		0
14		0		14		0
15		0		15		0
16		0		0		3
17		0		1		3
18		0		2		3
19		0		3		3
20		0		4		3
21		0		5		3
22		0		6		3
23		0		7		3
24		0		8		3
25		0		9		3
26		0		10		3
27		0		11		3
28		0		12		3
29		0		13		3
30		0		14		3
31		0		15		3
32		0		0		1
33		0		1		1
34		0		2		1
35		0		3		1
36		0		4		1
37		0		5		1
38		0		6		1
39		0		7		1
40		0		8		1
41		0		9		1
42		0		10		1
43		0		11		1
44		0		12		1
45		0		13		1
46		0		14		1
47		0		15		1
48		0		0		2
49		0		1		2
50		0		2		2
51		0		3		2
52		0		4		2
53		0		5		2
54		0		6		2
55		0		7		2
56		0		8		2
57		0		9		2
58		0		10		2
59		0		11		2
60		0		12		2
61		0		13		2
62		0		14		2
63		0		15		2
-------------------------------------------------------------
Socket 0: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 )
Socket 1: ( 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 )
Socket 2: ( 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 )
Socket 3: ( 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 )
-------------------------------------------------------------

*************************************************************
Cache Topology
*************************************************************
Level:	1
Size:	16 kB
Cache groups:	( 0 ) ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) ( 6 ) ( 7 ) ( 8 ) ( 9 ) ( 10 ) ( 11 ) ( 12 ) ( 13 ) ( 14 ) ( 15 ) ( 32 ) ( 33 ) ( 34 ) ( 35 ) ( 36 ) ( 37 ) ( 38 ) ( 39 ) ( 40 ) ( 41 ) ( 42 ) ( 43 ) ( 44 ) ( 45 ) ( 46 ) ( 47 ) ( 48 ) ( 49 ) ( 50 ) ( 51 ) ( 52 ) ( 53 ) ( 54 ) ( 55 ) ( 56 ) ( 57 ) ( 58 ) ( 59 ) ( 60 ) ( 61 ) ( 62 ) ( 63 ) ( 16 ) ( 17 ) ( 18 ) ( 19 ) ( 20 ) ( 21 ) ( 22 ) ( 23 ) ( 24 ) ( 25 ) ( 26 ) ( 27 ) ( 28 ) ( 29 ) ( 30 ) ( 31 )
-------------------------------------------------------------
Level:	2
Size:	2 MB
Cache groups:	( 0 1 ) ( 2 3 ) ( 4 5 ) ( 6 7 ) ( 8 9 ) ( 10 11 ) ( 12 13 ) ( 14 15 ) ( 32 33 ) ( 34 35 ) ( 36 37 ) ( 38 39 ) ( 40 41 ) ( 42 43 ) ( 44 45 ) ( 46 47 ) ( 48 49 ) ( 50 51 ) ( 52 53 ) ( 54 55 ) ( 56 57 ) ( 58 59 ) ( 60 61 ) ( 62 63 ) ( 16 17 ) ( 18 19 ) ( 20 21 ) ( 22 23 ) ( 24 25 ) ( 26 27 ) ( 28 29 ) ( 30 31 )
-------------------------------------------------------------
Level:	3
Size:	6 MB
Cache groups:	( 0 1 2 3 4 5 6 7 ) ( 8 9 10 11 12 13 14 15 ) ( 32 33 34 35 36 37 38 39 ) ( 40 41 42 43 44 45 46 47 ) ( 48 49 50 51 52 53 54 55 ) ( 56 57 58 59 60 61 62 63 ) ( 16 17 18 19 20 21 22 23 ) ( 24 25 26 27 28 29 30 31 )
-------------------------------------------------------------

*************************************************************
NUMA Topology
*************************************************************
NUMA domains: 8 
-------------------------------------------------------------
Domain 0:
Processors:  0 1 2 3 4 5 6 7
Memory: 31705.1 MB free of total 32758.1 MB
-------------------------------------------------------------
Domain 1:
Processors:  8 9 10 11 12 13 14 15
Memory: 31781.2 MB free of total 32768 MB
-------------------------------------------------------------
Domain 2:
Processors:  32 33 34 35 36 37 38 39
Memory: 31921.6 MB free of total 32768 MB
-------------------------------------------------------------
Domain 3:
Processors:  40 41 42 43 44 45 46 47
Memory: 31841.1 MB free of total 32768 MB
-------------------------------------------------------------
Domain 4:
Processors:  48 49 50 51 52 53 54 55
Memory: 31896.7 MB free of total 32768 MB
-------------------------------------------------------------
Domain 5:
Processors:  56 57 58 59 60 61 62 63
Memory: 31764.8 MB free of total 32768 MB
-------------------------------------------------------------
Domain 6:
Processors:  16 17 18 19 20 21 22 23
Memory: 31935 MB free of total 32768 MB
-------------------------------------------------------------
Domain 7:
Processors:  24 25 26 27 28 29 30 31
Memory: 31920.9 MB free of total 32752 MB
-------------------------------------------------------------

What numactl calls a node is called a domain by likwid-topology.

Eight cores share one level 3 cache.

One socket contains two domains; multiplied by four sockets, that gives 64 cores per machine.

Two integer cores (CPUs) on one die share one 256-bit FPU, so MPI binds the ranks as follows.

The logical cpu map looks like this:

0 2 4 6 8 10 12 14 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 16 18 20 22 24 26 28 30
1 3 5 7 9 11 13 15 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 17 19 21 23 25 27 29 31

This means the first rank of an MPI job is placed on CPU 0, the next rank on CPU 2, and so on.

If there are only 32 ranks per node, every rank has one FPU exclusively.
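
To verify where a rank or thread actually runs, its CPU affinity can be inspected with standard Linux tools, for example:

# show the CPU list the current shell is allowed to run on
taskset -cp $$

# alternatively, read the affinity mask straight from /proc
grep Cpus_allowed_list /proc/self/status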

For best results, set the following environment variables.

In a bash environment:

export PSM_RANKS_PER_CONTEXT=4  # to guarantee no context overbooking
export PSP_OPENIB=0             # don't use
export PSP_OFED=0               # don't use
export PSP_PSM=2                # use with highest priority
export PSP_SHM=0                # don't use
export PSP_TCP=0                # don't use
export PSP_SCHED_YIELD=0        # use busy polling (default)

# Not all environment variables are sent to the MPI ranks;
# use this to define which additional variables are sent as well.
export PSI_EXPORTS="PSM_MQ_RNDV_IPATH_THRESH,PSM_MQ_RNDV_SHM_THRESH"

export PSM_MQ_RNDV_IPATH_THRESH=4000    # eager/rendezvous threshold
export PSM_MQ_RNDV_SHM_THRESH=4000      # same for shared memory
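
Putting this together, a job script could look like the following sketch (the SGE parallel environment name is only a placeholder for whatever is configured on the cluster, $NSLOTS is filled in by SGE, and only part of the variable list above is repeated):

#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -pe <your_parallel_environment> 64   # placeholder: use the PE configured on the cluster

module load parastation/mpi2-gcc-5.0.27-1

# environment variables from the list above (shortened here)
export PSM_RANKS_PER_CONTEXT=4
export PSP_PSM=2
export PSP_SHM=0
export PSI_EXPORTS="PSM_MQ_RNDV_IPATH_THRESH,PSM_MQ_RNDV_SHM_THRESH"
export PSM_MQ_RNDV_IPATH_THRESH=4000
export PSM_MQ_RNDV_SHM_THRESH=4000

mpiexec -np $NSLOTS ./yourjob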

If you can guarantee that all ranks are on the same machine, you can set

export PSP_PSM=0
export PSP_SHM=1

to enable shared memory.

Please don't mix PSP_PSM=1 and PSP_SHM=1.

In a csh environment, use setenv accordingly, for example:
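
setenv PSM_RANKS_PER_CONTEXT 4
setenv PSP_OPENIB 0
setenv PSP_OFED 0
setenv PSP_PSM 2
setenv PSP_SHM 0
setenv PSP_TCP 0
setenv PSP_SCHED_YIELD 0
setenv PSI_EXPORTS "PSM_MQ_RNDV_IPATH_THRESH,PSM_MQ_RNDV_SHM_THRESH"
setenv PSM_MQ_RNDV_IPATH_THRESH 4000
setenv PSM_MQ_RNDV_SHM_THRESH 4000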

Threaded OpenMP Jobs

As mentioned above, it is important for performance to bind a job to a core (CPU). There are four tools to do this:

  • numactl
  • taskset
  • likwid-pin
  • dell_affinity

It's a matter of taste which tool to use. For threaded applications I like likwid-pin.
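
For an 8-thread OpenMP job on domain 0, the tools can be used roughly as follows (likwid-pin pins each thread to its own core, while numactl and taskset only restrict the allowed CPU and memory set; dell_affinity is site-specific and not shown here):

export OMP_NUM_THREADS=8

# pin thread i to core i of NUMA/memory domain 0
likwid-pin -c M0:0-7 ./yourjob

# restrict the job to cores 0-7 and the memory of domain 0
numactl --physcpubind=0-7 --membind=0 ./yourjob

# restrict the job to cores 0-7 (no memory policy)
taskset -c 0-7 ./yourjob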

If you have exclusive access to one machine, you can place your tasks optimally. As numactl shows:

node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  22  16  22  16  22 
  1:  16  10  22  16  16  22  22  16 
  2:  16  22  10  16  16  16  16  16 
  3:  22  16  16  10  16  16  22  22 
  4:  16  16  16  16  10  16  16  22 
  5:  22  22  16  16  16  10  22  16 
  6:  16  22  16  22  16  22  10  16 
  7:  22  16  16  22  22  16  16  10 

For example: if you have a job with 8 threads, place it on one "node" (domain); cost = 10. If you have a job with 16 threads, place the first 8 threads on domain 0 and the next 8 threads on domain 1, 2, 4 or 6 (cost = 16), but not on domain 3, 5 or 7.

Bind your job to domain 0, cores 0-7, and domain 1, cores 0-7:

/home/software/likwid/icc/bin/likwid-pin -c M0:0-7@M1:0-7 ./yourjob
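
If you prefer numactl, roughly the same placement can be sketched as follows (unlike likwid-pin, numactl only confines the job to these domains and does not pin individual threads to individual cores):

OMP_NUM_THREADS=16 numactl --cpunodebind=0,1 --membind=0,1 ./yourjob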

This is the ideal placement, but the real world isn't ideal. Why?

In a batch-based cluster such as MarC2 with SGE as the job submission system, you normally do not have exclusive access to the machines.

One strategy, in line with the MPI placement policy described above, is to find out how many jobs are already on the machine and bind yours to the next free even-numbered cores if there are fewer than 32 jobs on this machine; otherwise bind it to the next free odd-numbered cores. OpenMP and serial jobs don't spawn across machines.
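
As a purely illustrative sketch of this idea (the busy-core count below is only a rough heuristic of my own, not an official recipe), a job could pick even or odd cores like this:

#!/bin/bash
# Rough heuristic sketch: count how many distinct cores currently run a process,
# then pin an 8-thread job to even-numbered or odd-numbered cores accordingly.
BUSY=$(ps -e -o psr= | sort -un | wc -l)

if [ "$BUSY" -lt 32 ]; then
    CORES="0,2,4,6,8,10,12,14"    # even cores: each thread gets its own FPU
else
    CORES="1,3,5,7,9,11,13,15"    # node is busy: fall back to the odd cores
fi

OMP_NUM_THREADS=8 taskset -c "$CORES" ./yourjob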

to be continued