Marc2_AdminGuide/4_SystemServices

System services and configuration

This chapter describes installed system services and configuration details.

User Management

LDAP client configuration files:

  • /etc/pam_ldap.conf
    uri ldap://auth01.hrz.uni-marburg.de ldap://auth02.hrz.uni-marburg.de
    base ou=Accounts,o=Universitaet Marburg,c=DE
    ssl start_tls
    tls_cacertfile /etc/pki/CA/certs/Kette-Deutsche-Telekom-complete.pem
    rootbinddn uid=marc2,ou=Proxy,o=Universitaet Marburg,c=DE
    binddn uid=marc2,ou=Proxy,o=Universitaet Marburg,c=DE
    
  • /etc/pam_ldap.secret → /etc/openldap/ldap.secret
  • /etc/nslcd.conf
    uid nslcd
    gid ldap
    uri ldap://auth01.hrz.uni-marburg.de ldap://auth02.hrz.uni-marburg.de
    base   passwd   ou=Accounts,o=Universitaet Marburg,c=DE
    base   shadow   ou=Accounts,o=Universitaet Marburg,c=DE
    base   group    ou=Accounts,o=Universitaet Marburg,c=DE
    filter passwd (&(objectClass=posixAccount)(UniMrDarf=marc2)(!(UniMrDatLoeBV=*)))
    filter group  (objectClass=posixGroup)
    map    group  uniqueMember     member
    ssl start_tls
    tls_reqcert demand
    tls_cacertfile /etc/pki/CA/certs/Kette-Deutsche-Telekom-complete.pem
    scope sub
    binddn uid=marc2,ou=Proxy,o=Universitaet Marburg,c=DE
    bindpw <see: /etc/openldap/ldap.secret>
    nss_initgroups_ignoreusers ALLLOCAL
    
  • /etc/ldap.conf → /etc/openldap/ldap.conf (only on servers with UMRnet IP address, i.e. marc2-h1, marc2-h2, marc2-fh, marc2-fs1)
  • /etc/ldap.secret → /etc/openldap/ldap.secret (only on servers with UMRnet IP address, i.e. marc2-h1, marc2-h2, marc2-fh, marc2-fs1)
  • /etc/openldap/ldap.conf
    uri ldap://auth01.hrz.uni-marburg.de ldap://auth02.hrz.uni-marburg.de
    ldap_version 3
    rootbinddn uid=marc2,ou=Proxy,o=Universitaet Marburg,c=DE
    base ou=Accounts,o=Universitaet Marburg,c=DE
    bind_policy soft
    ssl start_tls
    TLS_CACERT /etc/pki/CA/certs/Kette-Deutsche-Telekom-complete.pem
    
  • /etc/pki/CA/certs/Kette-Deutsche-Telekom-complete.pem
  • TBD: establish replication server on the master nodes and configure the compute nodes and the file server nodes to reference the master nodes' LDAP server.

Locally defined users lpartec and lcircular exist for testing and system administration purposes.

After changing these files, restart the nslcd service.
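
For example (assuming the stock init script, and with <username> standing for any LDAP account entitled to Marc2), restart the service and verify that LDAP lookups work:

# /etc/init.d/nslcd restart
# getent passwd <username>
# id <username>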

Configuring the Module Environment

The location of the modulefiles can be configured in

/usr/share/Modules/init/.modulespath 

The default set of modulefiles at login time is defined in /etc/profile.d/zmodules-local.sh:

if [ `/usr/bin/id -nu` != root ]; then 
  module load acml gcc parastation 
fi 

and applies to all non-root users. If the default version of a specific modulefile needs to be changed, adapt the file .version in the appropriate module directory, e.g. /usr/share/ModulesLocal/parastation/.version:

#%Module1.0 
## 
## Default version: 
## 
set ModulesVersion "mpi2-gcc-5.0.27-1" 
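
As a quick check (a sketch; the exact output format depends on the installed Modules version), the version set in .version should now be flagged as the default:

# module avail parastation
# module load parastation && module list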

Refer also to the Users Guide.

Firewall

On both head nodes (marc2-h1, marc2-h2) and the file servers marc2-fh and marc2-fs1, access from outside of Marc2 is restricted by a firewall. Allowed ports (see also #6):

Port        Node
223 (sshd)  marc2-h1, marc2-h2, marc2-fh, marc2-fs1
389         marc2-h1
636         marc2-h1

marc2-h1: iptables -S; iptables -t nat -S

-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-N RH-Firewall-1-INPUT
-A INPUT -j RH-Firewall-1-INPUT 
-A FORWARD -j RH-Firewall-1-INPUT 
-A RH-Firewall-1-INPUT -i lo -j ACCEPT 
-A RH-Firewall-1-INPUT -i eth0 -j ACCEPT 
-A RH-Firewall-1-INPUT -i eth1 -j ACCEPT 
-A RH-Firewall-1-INPUT -i eth2 -j ACCEPT 
-A RH-Firewall-1-INPUT -i br1 -j ACCEPT 
-A RH-Firewall-1-INPUT -i tap0 -j ACCEPT 
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 223 -j ACCEPT 
-A RH-Firewall-1-INPUT -p icmp -m icmp --icmp-type any -j ACCEPT 
-A RH-Firewall-1-INPUT -p esp -j ACCEPT 
-A RH-Firewall-1-INPUT -p ah -j ACCEPT 
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 636 -j ACCEPT 
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 389 -j ACCEPT 
-A RH-Firewall-1-INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT 
-A RH-Firewall-1-INPUT -p tcp -m state --state NEW -m tcp --dport 223 -j ACCEPT 
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited 

-P PREROUTING ACCEPT
-P POSTROUTING ACCEPT
-P OUTPUT ACCEPT
-A POSTROUTING -o br3 -j MASQUERADE 

marc2-fs1 and marc2-fh: iptables -S; iptables -t nat -S

-P INPUT ACCEPT
-P FORWARD ACCEPT
-P OUTPUT ACCEPT
-A INPUT -i lo -j ACCEPT 
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT 
-A INPUT -p icmp -j ACCEPT 
-A INPUT -i lo -j ACCEPT 
-A INPUT -p tcp -m state --state NEW -m tcp --dport 223 -j ACCEPT 
-A INPUT -p tcp -m state --state NEW -m tcp --dport 5901 -j ACCEPT 
-A INPUT -s 172.26.9.0/24 -j ACCEPT 
-A INPUT -s 172.26.10.0/24 -j ACCEPT 
-A INPUT -j REJECT --reject-with icmp-host-prohibited 

-P PREROUTING ACCEPT
-P POSTROUTING ACCEPT
-P OUTPUT ACCEPT
-A POSTROUTING -o em1 -j MASQUERADE

All other ports will be blocked. See iptables for details.
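
After modifying rules, they should be saved so they survive a reboot (a sketch; on this RHEL/CentOS 6 generation the stock init script writes the rules to /etc/sysconfig/iptables):

# service iptables save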

As marc2-fs2 does not have an external uplink, no firewall is configured.

SSH service

On both head nodes, the sshd daemon is configured to use port 223.
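
A minimal sketch of the corresponding directive (assuming the stock /etc/ssh/sshd_config location; restart sshd after changes):

Port 223

# /etc/init.d/sshd restart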

rsyslog service

The rsyslog service runs on both frontends. The compute nodes are configured to forward their logging to one of the two frontends, preferably to marc2-h1. There, the log files are written into the directory /var/log/Nodes. This directory is hosted on the drbd device and might be moved between the two frontends.
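
On a compute node, the forwarding rule in /etc/rsyslog.conf typically looks like this (a sketch, assuming the head node's internal address from the DNS clients section; the node's actual configuration file is authoritative):

*.* @172.26.9.241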

Documentation is found in the man page man rsyslogd and, for the configuration, man rsyslog.conf.

You can use the command pssyslog <node> to follow the messages from a particular node.

Note: You may have to restart the syslog service (/etc/init.d/rsyslog restart) after a failover/failback, see here.

DNS service

The DNS (Domain Name System) service is responsible for translating host names into IP addresses within the cluster. Translation requests happen on numerous occasions: when submitting a job, when trying to mount Lustre or NFS volumes, when querying the batch system, or when the administrator connects to a node via ssh to perform maintenance tasks. There are two kinds of DNS requests: forward lookups, in which a host name is translated into an IP address, and reverse lookups, in which an IP address is translated back into a host name.

DNS service overview

For the Marc2 setup, the service is provided by the head node marc2-h1. The login node marc2-h2 runs a slave server, therefore both head nodes answer DNS queries. All other nodes, in particular all compute nodes, query the head nodes. Requests not resolvable by the internal DNS service are forwarded to an external nameserver.

DNS master configuration

The DNS server on marc2-h1 uses /etc/named.conf as its basic configuration file. All Marc2-specific information is included in /etc/named/named.conf.include. The particular host definitions are in /var/named/data/*.

When updating this information on marc2-h1, don't forget to increment the serial number. Otherwise, the changes will not be propagated to the slave server running on marc2-h2. Run

# service named reload

to load the modified data and to trigger an update of the slave server on marc2-h2.
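
The serial number lives in the SOA record of the zone files under /var/named/data/. A sketch of such a record (hypothetical names and values; only the serial matters here):

@   IN  SOA  marc2-h1.marc2. root.marc2-h1.marc2. (
            2012030401  ; serial -- increment on every change
            3600        ; refresh
            900         ; retry
            604800      ; expire
            86400 )     ; minimum TTL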

DNS slave configuration

The DNS server on marc2-h2 is configured as a slave server. The file /etc/named/named.conf.include implements the corresponding settings. The host definitions are synchronised from marc2-h1 into /var/named/slaves/*.

DNS clients

A typical resolv.conf file for a compute node will be:

nameserver 172.26.9.241
nameserver 172.26.9.242
search marc2

Note: for all nodes, the file /etc/nsswitch.conf should have an entry

hosts:  files dns
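
To verify name resolution from a node, a forward and a reverse lookup can be tested, e.g. (node001 is used here only as an example host name):

# host node001
# host 172.26.9.241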

NFS service

NFS-based file systems are available from node-f1 for /home and from marc2-h1 for /localhome (for administrative purposes only). The /home filesystem is mounted automatically during system boot on all nodes. The /localhome filesystem is mounted automatically during system boot on the compute nodes.
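
On a client, the corresponding fstab entries look roughly like this (a sketch; the NFS server names are assumptions based on the mounts shown elsewhere in this guide, the node's actual /etc/fstab is authoritative):

marc2-fh-em2:/home    /home       nfs   defaults  0 0
marc2-h1:/localhome   /localhome  nfs   defaults  0 0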

/etc/fstab

#
# /etc/fstab
# Created by anaconda on Tue Dec 13 13:21:22 2011
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/vg_file1_00-lv_root /                       ext4    defaults        1 1
/dev/mapper/vg_file1_01-lv_home /home                   ext4   defaults        1 1
UUID=0774f34c-7214-41ca-a2ae-5716c09509a3 /boot                   ext4    defaults        1 2
/dev/mapper/vg_file1_00-lv_swap swap                    swap    defaults        0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0

/etc/exports

/home 	172.26.9.0/255.255.255.0(rw,no_root_squash,async) 172.26.10.0/255.255.255.0(rw,no_root_squash,async)

Important: Do not start the NFS server manually; it is managed by the HA setup.

Quota service (/home)

Quotas are managed on the /home file server marc2-fh. Currently no group quotas are enforced.

The default block quota is 1000000000(soft) / 1200000000(hard).

The default file quota is 1000000(soft) / 1200000(hard).

The default grace period is 7 days.
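
For scripted changes, the same defaults can be applied non-interactively with setquota (a sketch; the argument order is block-soft block-hard inode-soft inode-hard):

# setquota -u <username> 1000000000 1200000000 1000000 1200000 /home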

Frequently used commands include:

# repquota -a

(list quotas per user)

# edquota <username>

(configure quota for a single user)

# edquota -p gebhardt <username>

(configure quota for a single user, copy from template user gebhardt)

# edquota -t

(edit the default grace periods; use edquota -T <username> to set grace times for a single user)

FhGFS service

Modular Disk Storage Manager

The MD3220 + MD1220 is divided into two 18 TB RAID 6 LUNs with 2 hot spares. The configuration client can be launched using

[root@marc2-fs1 ~]#  /opt/dell/mdstoragemanager/client/SMclient


/etc/fstab

# /etc/fstab
# Created by anaconda on Tue Dec 13 16:13:01 2011
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/vg_file2-lv_root /                       ext4    defaults        1 1
UUID=07f3cf41-6e32-498a-af71-815e994d34ac /boot                   ext4    defaults        1 2
/dev/mapper/marc2-fs1_fhgfs_meta /meta		     ext4    defaults,user_xattr,noatime,nodiratime 1 2 
/dev/mapper/md3220-vol0 /scratch		     xfs     defaults,largeio,inode64,swalloc 0 0
/dev/mapper/vg_file2-lv_swap swap                    swap    defaults        0 0
marc2-fh-em2:/home	/home	nfs	defaults 0 0
tmpfs                   /dev/shm                tmpfs   defaults        0 0
devpts                  /dev/pts                devpts  gid=5,mode=620  0 0
sysfs                   /sys                    sysfs   defaults        0 0
proc                    /proc                   proc    defaults        0 0

FhGFS

"FraunhoferFS (short: FhGFS) is the high-performance parallel file system from the Fraunhofer Competence Center for High Performance Computing. The distributed metadata architecture of FhGFS has been designed to provide the scalability and flexibility that is required to run today's most demanding HPC applications."

marc2-fs1 and marc2-fs2 offer the /scratch filesystem in an NFS-like manner.

Two daemons are running on marc2-fs1 and -fs2:

  • fhgfs-meta: serves the filesystem metadata, e.g. where files are located, who may access them, etc.
  • fhgfs-storage: provides the distributed storage

On marc2-fs1, two additional daemons are running:

  • fhgfs-mgmtd: the management daemon, which serves as a central point of system configuration information for clients and servers
  • fhgfs-admon: tracks the system health and provides a graphical frontend for monitoring services

All daemons are started automatically at system boot via their corresponding init scripts. The configuration files are located under /etc/fhgfs.

/etc/fhgfs/fhgfs-meta.conf

# This is a config file for Fraunhofer parallel file system metadata nodes.
# http://www.fhgfs.com


logLevel                  = 3
logNoDate                 = false
logStdFile                = /var/log/fhgfs-meta.log
logNumLines               = 50000
logNumRotatedFiles        = 5

connPortShift             = 0
connMgmtdPortUDP          = 8008
connMgmtdPortTCP          = 8008
connMetaPortUDP           = 8005
connMetaPortTCP           = 8005
connUseSDP                = false
connUseRDMA               = true
connBacklogTCP            = 64
connMaxInternodeNum       = 16
connInterfacesFile        =
connNetFilterFile         = /etc/fhgfs/connNetFilterFile.txt
connNonPrimaryExpiration  = 10000

storeMetaDirectory        = /meta/fhgfs/fhgfs_meta
storeAllowFirstRunInit    = true
storeUseExtendedAttribs   = true

tuneNumWorkers            = 0
tuneTargetChooser         = randomized

sysMgmtdHost              = marc2-fs1-ib1

runDaemonized             = true

/etc/fhgfs/fhgfs-mgmtd.conf

# This is a config file for Fraunhofer parallel file system management nodes.
# http://www.fhgfs.com

logLevel                       = 2
logNoDate                      = false
logStdFile                     = /var/log/fhgfs-mgmtd.log
logNumLines                    = 50000
logNumRotatedFiles             = 5

connPortShift                  = 0
connMgmtdPortUDP               = 8008
connMgmtdPortTCP               = 8008
connBacklogTCP                 = 64
connInterfacesFile             =
connNetFilterFile              = /etc/fhgfs/connNetFilterFile.txt

storeMgmtdDirectory            = /data/fhgfs/fhgfs_mgmtd
storeAllowFirstRunInit         = true

tuneNumWorkers                 = 4
tuneMetaNodeAutoRemoveMins     = 0
tuneStorageNodeAutoRemoveMins  = 0
tuneClientAutoRemoveMins       = 30
tuneMetaSpaceLowLimit          = 10G
tuneMetaSpaceEmergencyLimit    = 3G
tuneStorageSpaceLowLimit       = 512G
tuneStorageSpaceEmergencyLimit = 10G

sysAllowNewServers             = true
sysForcedRoot                  =
sysOverrideStoredRoot          = false

runDaemonized                  = true

/etc/fhgfs/fhgfs-storage.conf

# This is a config file for Fraunhofer parallel file system storage nodes.
# http://www.fhgfs.com


logLevel               = 3
logNoDate              = false
logStdFile             = /var/log/fhgfs-storage.log
logNumLines            = 50000
logNumRotatedFiles     = 5

connPortShift          = 0
connMgmtdPortUDP       = 8008
connMgmtdPortTCP       = 8008
connStoragePortUDP     = 8003
connStoragePortTCP     = 8003
connUseSDP             = false
connUseRDMA            = true
connBacklogTCP         = 64
connInterfacesFile     =
connNetFilterFile      = /etc/fhgfs/connNetFilterFile.txt

storeStorageDirectory  = /scratch/fhgfs/fhgfs_storage
storeAllowFirstRunInit = true

tuneNumWorkers         = 8
tuneWorkerBufSize      = 4m
tuneFileReadSize       = 32k
tuneFileWriteSize      = 64k

sysMgmtdHost           = marc2-fs1-ib1

runDaemonized          = true

/etc/fhgfs/fhgfs-admon.conf

# This is a config file for the Fraunhofer parallel file system Admon daemon.
# http://www.fhgfs.com


logLevel                 = 2
logNoDate                = false
logStdFile               = /var/log/fhgfs-admon.log
logNumLines              = 50000
logNumRotatedFiles       = 2

connPortShift            = 0
connMgmtdPortUDP         = 8008
connMgmtdPortTCP         = 8008
connAdmonPortUDP         = 8007
connMaxInternodeNum      = 3
connNetFilterFile        = /etc/fhgfs/connNetFilterFile.txt
connNonPrimaryExpiration = 10000

tuneNumWorkers           = 4

sysMgmtdHost             = marc2-fs1-ib1

runDaemonized            = true

httpPort                 = 8000
queryInterval            = 5
databaseFile             = /var/lib/fhgfs/fhgfs-admon.db
clearDatabase            = false


Launching the GUI

  [root@marc2-fs1 client]# java -Dsun.java2d.pmoffscreen=false -jar /opt/fhgfs/fhgfs-admon-gui/fhgfs-admon-gui.jar

Clients

On all clients two services must be running.

  • fhgfs-client
  • fhgfs-helperd

Both daemons are started during system boot and mount the remote filesystem.
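
A quick check that the client mount came up (a sketch using the standard init scripts):

# /etc/init.d/fhgfs-helperd status
# /etc/init.d/fhgfs-client status
# df -h /scratch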

/etc/fhgfs/fhgfs-client.conf

# This is a config file for Fraunhofer parallel file system clients.
# http://www.fhgfs.com


logLevel                      = 3
logClientID                   = false
logHelperdIP                  =

connPortShift                 = 0
connMgmtdPortUDP              = 8008
connMgmtdPortTCP              = 8008
connClientPortUDP             = 8004
connHelperdPortTCP            = 8006
connUseSDP                    = false
connUseRDMA                   = true
connRDMABufSize               = 8192
connRDMABufNum                = 128
connMaxInternodeNum           = 6
connInterfacesFile            =
connNetFilterFile             =
connNonPrimaryExpiration      = 10000
connCommRetrySecs             = 600

tuneNumWorkers                = 0
tunePreferredMetaFile         =
tunePreferredStorageFile      =
tuneFileCacheType             = buffered
tuneRemoteFSync               = true
tuneUseGlobalFileLocks        = false

sysMgmtdHost                  = 172.26.9.244
sysCreateHardlinksAsSymlinks  = true
sysMountSanityCheckMS         = 11000

/etc/fhgfs/fhgfs-mounts.conf

/scratch /etc/fhgfs/fhgfs-client.conf

Time service

Time keeping of the servers and nodes is done via the ntp protocol (http://www.ntp.org). A server connected to a precise clock passes on its time to clients, using statistical methods to compensate for the signal travel time over the network. The service is implemented via the ntp daemon, which does two things: it updates its own time from the server and passes the time on to potential clients.

Within Marc2, a cascaded setup is used: The head nodes (marc2-h1 and marc2-h2) synchronize to the default CentOS time servers [0-2].centos.pool.ntp.org, the compute nodes synchronize to the head node.

The important configuration file is /etc/ntp.conf, which includes the server directives:

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
server ntp-1.uni-marburg.de
server ntp-2.uni-marburg.de
server ntp-3.uni-marburg.de
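
On the compute nodes, /etc/ntp.conf references the head nodes instead. A minimal sketch (assuming the head nodes' internal addresses from the DNS clients section; the nodes' actual ntp.conf is authoritative):

server 172.26.9.241
server 172.26.9.242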

You can check the current state of time synchronisation with the command

ntpq -p

(ntptrace isn't installed).

Openais High-Availability (HA) Setup

The two front end nodes marc2-h1 and marc2-h2 are configured as nodes of a high availability cluster. The pacemaker stack is used. Services which don't bring their own HA functionality are managed here. These are:

  • A file system containing the administrative files (system images, software repository),
  • An NFS server exporting the above file system,
  • An IP address pointing to the "active" headnode.

Both nodes check each other's health (heartbeat) via the dedicated direct link between the eth2 interfaces of the front ends. The assigned addresses are:

marc2-h1 10.99.1.1
marc2-h2 10.99.1.2

The firewall should pass all traffic on this link. In the unlikely event of a failure of one of the nodes, the surviving node will use the IPMI interface to power cycle ("STONITH") the other node. If the link breaks down, the pacemaker cluster stack can't assess which of the servers is healthy and a mutual power cycle might occur. It is therefore mandatory to keep the heartbeat network alive. It appears that, unlike the RHCS, pacemaker cannot assign weights to nodes which would ensure that one node survives a crash of the heartbeat network.

The HA file system is provided via drbd (a disk replicating block device). The LVM partitions /dev/System/drbd0 on both frontends are controlled by the drbd system and kept in sync. The configuration is found under /etc/drbd/shared.conf. The drbd stack provides a device, /dev/drbd0, which carries the actual file system. The drbd configuration is of the Active/Passive type, i.e. only one node can mount the filesystem at a time.

Bringing up the HA cluster after a clean shutdown

  1. On the -h1 machine, mount the /localhome:
    mount /localhome
    
  2. The HA cluster is started via the command aisexec, executed on both nodes. An /etc/init.d script isn't available in the current version of RHEL 6.2. Verify the stack is running with the ps axf command. You should see the following processes:
    ...
     4271 ?        Ssl    7:44 corosync
     4276 ?        S      0:38  \_ /usr/lib64/heartbeat/stonithd
     4277 ?        S      1:03  \_ /usr/lib64/heartbeat/cib
     4278 ?        S      4:24  \_ /usr/lib64/heartbeat/lrmd
     4279 ?        S      0:00  \_ /usr/lib64/heartbeat/attrd
     4280 ?        S      0:10  \_ /usr/lib64/heartbeat/pengine
     4281 ?        S      1:19  \_ /usr/lib64/heartbeat/crmd
    ...
    
  3. Check that the services are coming up: crm_mon (or crm_mon -1, if you want to see it once). The output will go through several stages (both nodes unclean, nodes and services starting, …) and should stabilise at this:
    ============
    Last updated: Sun Mar  4 22:43:19 2012
    Last change: Thu Mar  1 11:39:52 2012 via cibadmin on marc2-h1
    Stack: openais
    Current DC: marc2-h1 - partition with quorum
    Version: 1.1.5-8.el6-b933cbea41b5737b442f8a0a9c6e1eddc9f41375
    2 Nodes configured, 2 expected votes
    8 Resources configured.
    ============
    
    Online: [ marc2-h1 marc2-h2 ]
    
     h1-stonith     (stonith:fence_ipmilan):        Started marc2-h2
     h2-stonith     (stonith:fence_ipmilan):        Started marc2-h1
     Master/Slave Set: msDRBDshared [DRBDshared]
         Masters: [ marc2-h1 ]
         Slaves: [ marc2-h2 ]
     Resource Group: GROUP-h1
         DRBDshared_fs      (ocf::heartbeat:Filesystem):    Started marc2-h1
         NFSSERVER  (ocf::heartbeat:nfsserver):     Started marc2-h1
         IP (ocf::heartbeat:IPaddr2):       Started marc2-h1
    
    You can check the state of the drbd device through the proc filesystem:
    marc2-h1:~ # cat /proc/drbd
    version: 8.3.12 (api:88/proto:86-96)
    GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by dag@Build64R6, 2011-11-20 10:57:03
     0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
        ns:9600520 nr:0 dw:9600520 dr:959382 al:806 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0
    
    The role (ro:) should be "Primary" on marc2-h1, the state (ds:) should be UpToDate/UpToDate.
  4. Mount the /direct/Software/ directories on both frontends:
    marc2-h1:~ # mount /direct/Software
    
    This is an NFS mount which makes the configuration a little more symmetric between both nodes.
  5. Restart the syslog daemon if necessary

Bringing up the HA cluster after a crash

After a crash, the drbd device might take some time to sync. This may take longer than the HA control allows, which will lead to a STONITH of the node. Thus, you need to make sure the drbd device is clean. See also the drbd documentation.

  1. On the -h1 machine, mount /localhome.
  2. Start the drbd device manually on the crashed node after it's back up:
    ...:~ # /etc/init.d/drbd start
    
  3. Watch the file /proc/drbd. The file might indicate synchronisation in progress.
  4. Should this fail and the state remain at "ds: Secondary/Unknown?" or similar, try a
    ...:~ # drbdadm connect shared
    
    on either the victim or the survivor node. Synchronisation should start.
  5. After bringing the drbd device into the "UpToDate/UpToDate" state, shut down the drbd service again (you might have to do it twice if a message like "drbd module is still loaded" appears):
    ...:~ # /etc/init.d/drbd stop
    
  6. Start aisexec as soon as possible after manually stopping the drbd device, so that the changes are minimal.
  7. Restart the rsyslog daemon if necessary

Moving services back after a crash

Should a node have crashed and its resource been taken over by the other node, these are the commands to perform the failback:

  1. Use crm_mon to show where the service is currently running.
  2. Use crm resource cleanup <resource> to clean any erroneous state.
  3. Use crm resource unmove <resource> to remove node binding constraints of the resource. The resource might fail back now.
  4. Use crm resource move <resource> <node> to manually shift the resource from its current location to <node>.
  5. Restart the syslog daemon if necessary
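
Combining steps 2 to 4, a typical failback of the resource group to marc2-h1 looks like this (a sketch, using the resource name GROUP-h1 from the crm_mon output above):

# crm resource cleanup GROUP-h1
# crm resource unmove GROUP-h1
# crm resource move GROUP-h1 marc2-h1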

Other commands/Further reading

The administration of the pacemaker cluster stack is done through the crm command line tool. For details, see the pacemaker FAQ, wiki and documentation pages.

Batch system (SGE)

The batch system lives in /home/sge; the SGE cell is called default. The servers marc2-h1 and marc2-h2 act in a master/shadow configuration. The cell directory itself is hosted on the file server, so it is shared between all the nodes.

The batch system is started manually on both frontends. It can't be started unless the file server provides the /home file system. Once this prerequisite is met, start the server through the command

/etc/init.d/sgemaster.marc2 start

on both nodes. You might see messages about the qmaster role of the frontends, depending on the contents of the file /opt/sge/default/common/act_qmaster.
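
To check which frontend currently holds the qmaster role and whether the scheduler answers (a sketch):

# cat /opt/sge/default/common/act_qmaster
# qstat -g c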

ParaStation

The ParaStation documentation provides the necessary information. The administrative commands concerning node imaging and administration are found here.

Concerning the administration of processes running under ParaStation control, use the psiadmin command. This command controls the ParaStation daemons ("psid") running throughout the cluster. The tool offers both an interactive and a batch mode. Examples:

  1. psiadmin -c help: online help
  2. psiadmin -c 'l v': list the versions of daemons used
  3. psiadmin -c 'l p': list the processes running under ParaStation control
  4. psiadmin -c 'l l': list the load on all nodes
  5. psiadmin -c 'l hw': list the available hardware
  6. psiadmin -c 'set cpumap "0 4 8 12 16 ... 63"': set the order in which ParaStation assigns processes

psh Parallel Shell

ParaStation offers a parallel shell. It communicates either via the ParaStation daemons ("pssh") or ssh and offers a file transfer mode, too. The output is collected and matched between the nodes, e.g.

marc2-h1:~ # psh date
=== node[001-047,049-057]
Mon Mar 12 14:56:20 CET 2012
=== node[048,058-067,069-088]
Mon Mar 12 14:56:21 CET 2012
=== node068
Mon Mar 12 14:56:24 CET 2012

It also changes directories on the remote side, e.g.

marc2-h1:/etc/yum # psh ls
=== node[001-088]
pluginconf.d
protected.d
vars
version-groups.conf

The configuration file is ~/.psh.d/default. Typically, it looks like this on the frontend:

## psh defaults
## uncomment all options you need...

## default nodelist:
node node[001-088]

## default remote command. Use pssh for fast communication via psid
rcmd pssh
#rcmd ssh -x

## dont start more than 1 remote shell at the same time?
# max 1

## be always verbose ?
# verbose

On the golden client, the node list should comprise all the nodes attached to its image. This will allow comfortable distribution of files after small changes. (Don't forget to pull an image, though, and remember that this is the way to update nodes.)

Examples:

  1. psh uname -r: show the kernel version on all nodes
  2. psh -autocd "" pshealthcheck -v manual > manual.hc.log 2>&1: run the HealthCheck testset "manual", don't change directory on the remote side.
  3. psh mount /home: mount the /home file system on all nodes
  4. psh /etc/init.d/fhgfs-client stop: stop the fhgfs on all nodes
  5. psh -s myspecial.conf: Distribute the file myspecial.conf onto all nodes

pssh Remote shell

Just for the sake of completeness: the remote shell pssh behaves very much like ssh, except that it communicates via the ParaStation daemons. Sometimes this will allow you to log in, even if ssh stops working.

HealthCheck

The ParaStation Healthcheck was developed to improve the stability of cluster systems. To enable the Healthcheck to recognize all sorts of problems, various tests have been developed. If any critical problem is found, the corresponding node is removed from the batch system. Compute jobs that were supposed to run on the faulty node are simply re-queued and executed immediately if enough resources are available. This automatic reaction to problems ensures that compute jobs only run on clean nodes, thereby preventing job crashes.

Usage

Just log in to any (compute) node and call the pshealthcheck script.

marc2-h1:~ # pshealthcheck manual
Starting ParaStation Healthcheck with testset 'manual'

The Testsets

The Healthcheck supports different testsets (formerly called testlevels) to handle the numerous requirements it has to fulfill.

Testset All

The testset all contains every available test of the Healthcheck. It is normally not used in production and is mainly a container for the tests themselves.

Testset Prologue

The testset prologue will be executed by the batch system on every participating node, right before a new job gets to run. This testset executes a selected number of tests to ensure that the batch job will find a clean environment to run in. These tests are designed to have a short runtime, so that the job start is delayed as little as possible.

For integration into the SGE, please see here.

Testset Epilogue

The testset epilogue will be executed by the batch system on every participating node after the batch job has finished. In particular, errors triggered by the stress a compute job puts on a node will be found by these tests. Because these tests run very frequently, they have a very short runtime.

For integration into the SGE, please see here.

Testset Manual

The testset manual contains all the available tests with the exception of the stress tests. Its purpose is to quickly check the status of a selected node.

Testset Reboot

The testset reboot will be run automatically on every startup of a node. All the available tests with the exception of the stress tests will be executed. Hence nodes with software or hardware problems will be detected and removed as soon as possible, before they can have a bad influence.

Testset Stress-Test

The testset stress-test is supposed to trigger certain problems which only appear if a node is under high load. Especially after a hardware replacement, these tests can confirm that the node has been repaired successfully.

Testset IB-Test

The testset ib-test will perform all available infiniband tests. This also includes a bandwidth test, which is measured with the golden client marc2-h1 as server node. If there are any problems, make sure that qperf runs in server mode on the golden client.
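
Started without arguments, qperf acts as the server side, so preparing the golden client is simply (a sketch):

marc2-h1:~ # qperf &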

Testset Cron1h

The testset cron1h is performed periodically on the admin nodes every hour. See /etc/cron.d/pshealthcheck. In addition to typical tests also performed by other testsets, tests like node state monitoring or logfile analysis may be configured.

The tests themselves

Name Description Class
healthcheck_version Check if pshealthcheck is up to date Warning
bios_date Test the bios date Warning
bios_version Test the bios version Warning
cpu_count Count the number of cpus Error
cpu_type Check the cpu type of all available cpus Error
cpu_speed Check the cpu speed of all available cpus Warning
disc_free Check for available disc space in MB or percent Warning
disc_smart Check the smart health status of a harddisk Error
daemons Use pidof to test if various daemons are currently running Error
hpl Perform a HP Linpack to stress the node Error
infiniband_bandwidth Test the bandwidth of the infiniband network with testnode marc2-h1 Error
infiniband_counters Check all infiniband error counters Warning
infiniband_speed Test if infiniband is running with full speed Error
infiniband_phy_state Test the physical infiniband connection state Error
infiniband_state Test the infiniband connection state Warning
ipmi Get the ipmi address from the nameserver and test if it is set correctly in the bmc Warning
kernel_version Test the current kernel version Warning
ldap Tests basic ldap configuration and functionality Error
md5sum Test various files against a given md5sum Error
memory_free Check the memory which is available for compute processes Error
memory_mcelog Invoke the mcelog command to check for memory errors Error
memory_not_reclaimable Test if too much memory is consumed by not reclaimable slab Error
memory_size Check the size of the total memory Error
memory_speed Check the memory speed Error
memory_stress Stress the node with a memory test to trigger errors Error
mounts Check for various mount points Error
nameserver Check if the nameserver is reachable and name resolving is possible Error
net_counters Check ethernet error counters Warning
net_ping Ping various addresses to check for connection errors Error
net_speed Check the ethernet link speed Error
pbs_note Warn if a node is in pbs state 'free' and has a note set Warning
pbs_prologue Warn if a node is in pbs state 'free' and has prologue/epilogue errors set Warning
ports_udp Test if a daemon is listening on a specific udp port Error
ports_tcp Test if a daemon is listening on a specific tcp port Error
psid Various test for the parastation psid daemon Error
service Test if various services are running (status) and if they are in current runlevel (chkconfig) Error
software_versions Test if at least the given rpm version is installed Error
h200 Tests H200 RAID controller disk states Error
nodestates Monitors if any node state of the batch queueing system (SGE) changed Error

Important directories

Configuration

The current configuration of a testset may be obtained by using the command pshcgetconf.

Overall configuration file:

/etc/parastation/general/healthcheck.conf
/etc/parastation/healthcheck/testconf.d/tests.conf

The unified test configuration is maintained on marc2-h1. If changes are needed, modify /etc/parastation/healthcheck/testconf.d/tests.conf on marc2-h1 and copy it over to all other nodes.

Particular testset configuration:

/etc/parastation/healthcheck/testsets/{testset}/testset.conf

Note: please apply changes to the master copy residing on the master node and propagate testset.conf to all nodes.

Logfiles

If option -l is specified, the results are logged via syslog and will be included in files called /var/log/Nodes/messages-{nodename} residing on the corresponding admin nodes.
Run pssyslog -c on the master for quick checks (see pssyslog(8) for details), e.g.:

pssyslog -cl 3 node[001-01]

The tests themselves

/opt/parastation/lib/checks/*

pshealthcheck command

The healthcheck script on every node supports the following options:

marc2-h1:~ # pshealthcheck -h
Usage: pshealthcheck <options> <testset>

   -c <path>   sets the path to look for test configurations files
                (override global configfile)
   -p <path>   sets the path to look for testset scripts
                (override global configfile)
   -t <time>   specify/override the testset timeout
   -l          turns on logging via syslog (stdout by default)
   -n          do not run tests, only show what would be done
   -v          increase verbosity (-vv is not supported, use -v -v instead)
   -x          extends the testset statistical summary by the name of all tests
               with warning or error state. (twice for listing of ok tests)
   --no-update do not execute "psconfig update" before reading configuration
   <testset>   name of the testset to run (mandatory)

psmaintenance

See man psmaintenance.

pschecknodes

See man pschecknodes.

infinibandwatch/inspectibdiagnet

Note: Currently not part of the system monitoring. TBD by Circular.

Healthcheck/Sun Grid Engine Integration

The healthcheck is integrated into the Sun Grid Engine batch system via prologue and epilogue scripts. Two scripts are available so far:

  • /opt/sge/default/common/prolog/prolog.sh
  • /opt/sge/default/common/epilog/epilog.sh

They are linked to the all.q:

prolog                root@/opt/sge/default/common/prolog/prolog.sh $job_id \
                      $job_owner
epilog                root@/opt/sge/default/common/epilog/epilog.sh $job_id \
                      $job_owner

Please note that they run with root permissions and are passed the job id and the job owner as arguments. They have to run as root as they will offline the node (qmod -d ...) in case of a problem. If this happens, the job will be requeued. The prolog and epilog will also fan out to all nodes requested in a parallel environment and run the necessary checks.

The healthcheck is also run during startup of a node. In case of problems, the node is offlined, too. See testset reboot and appropriate action for details.

Once the healthcheck has struck, the node will be in state disabled. To find out why, log onto the node as root and run the prologue, reboot or manual testset:

node025:~ # pshealthcheck -v prologue 
...
[EE] ... look for the problem here ... 

After fixing the problem, enable the node again:

qmod -e "*@node025.marc2"

ParaStation GridMonitor

The GridMonitor is running on the virtual machine marc2-tracvm, see http://marc2-tracvm.marc2/gridmon.

ParaStation Trac and Ticketsystem

The Trac and ticket system is running on the virtual machine marc2-tracvm, see http://marc2-tracvm.marc2/trac. For details on the ParaStation trac tools see man pstrac.

To start and stop the virtual machine, use

marc2-h1 # virsh start trac
marc2-h1 # virsh shutdown trac
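
To check whether the virtual machine is currently running (a sketch):

marc2-h1 # virsh list --all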

Note: the virtual machine's files reside on the /shared file system, therefore it is not available without this file system. A (printed) copy of the Administration Guide might be handy, see obtaining a PDF version of this Admin Guide.

Maintaining a change log

The pschglog command adds and lists change log entries. See man pschglog for details. It is maintained on marc2-h1. Some tools automatically add entries, like psnodes-getimage.
