Marc2_AdminGuide/A_FAQ

Frequently asked questions

How to install new software

To install additional software or software packages, or to update existing packages on the compute nodes, the following steps are required (running as root):

  1. Log in to the golden client (see Image usage).
  2. Install all packages, e.g. using yum.
  3. Configure as required.
  4. On the master node, run psnodes-getimage (see Updating an image).
  5. Synchronize all other nodes using psnodes-update (see Updating nodes).
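
The steps above could be scripted roughly as follows. This is only a dry-run sketch that prints the commands instead of running them: the host name golden-client and the package name example-package are hypothetical placeholders, and any arguments that psnodes-getimage and psnodes-update require are deliberately not shown (see the referenced sections for those).

```shell
#!/bin/bash
# Dry-run sketch of the update workflow: prints the commands, runs nothing.
GOLDEN_CLIENT="${GOLDEN_CLIENT:-golden-client}"   # hypothetical host name
PACKAGE="${PACKAGE:-example-package}"             # hypothetical package name

plan_update() {
    # Steps 1-3: install (and then configure) on the golden client
    echo "ssh root@${GOLDEN_CLIENT} yum install -y ${PACKAGE}"
    # Step 4: pull the updated image (run on the master node)
    echo "psnodes-getimage   # arguments: see Updating an image"
    # Step 5: push the image to the remaining nodes
    echo "psnodes-update     # arguments: see Updating nodes"
}

plan_update
```

Printing the plan first makes it easy to review the sequence before running the commands by hand as root.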

OpenMPI 2.x peculiarities

Force process spawn via qrsh

OpenMPI v2.0.x and v2.1.0 have a bug in their SGE integration: https://github.com/open-mpi/ompi/issues/2947

In short, jobs are spawned via ssh when they should be spawned via SGE's qrsh, which leads to 'Host verification failed' errors when spawning on a node that has no entry in the user's ~/.ssh/known_hosts file. To work around this, add 'plm_rsh_agent=foo' (that's a *literal* 'foo') to /home/software/openmpi-2.1.0/etc/openmpi-mca-params.conf. This removes 'ssh' as the default spawning method and allows the check for qrsh to take precedence (a fallback to ssh is still possible if no qrsh is found).
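
The parameter can be added by hand with any editor. As a minimal sketch, the following helper appends the setting only if the file does not already set that key; the function name ensure_mca_param is a hypothetical helper, not part of OpenMPI, and the file path is passed in as an argument.

```shell
#!/bin/bash
# Append "key=value" to an MCA parameter file unless the file already
# contains a line that sets that key (to any value).
ensure_mca_param() {
    local file="$1" key="$2" value="$3"
    if grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$file" 2>/dev/null; then
        echo "unchanged"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
        echo "added"
    fi
}

# Usage (as root), with MCA_FILE set to the parameter file named above:
# ensure_mca_param "$MCA_FILE" plm_rsh_agent foo
```

Note that the helper does not overwrite an existing plm_rsh_agent line, so a different value already present in the file still has to be corrected manually.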

Working with starter_method scripts

OpenMPI treats the spawning commands as shell commands and will add environment variables to the job script path string, which causes starter_method scripts that use 'exec ${@}' to fail. Additionally, some of SGE's environment variables might get reset by mpiexec. As a workaround, the starter_method should look like this:

#!/bin/bash

# <everything the starter method should do goes here>

DEFSHELL=/bin/bash
LOGINFLAG=""

if [ "X$SGE_STARTER_SHELL_PATH" = "X" ] ; then
        SHELL=$DEFSHELL
else
        # yet another sanity check
        if [ ! -x "$SGE_STARTER_SHELL_PATH" ] ; then
                SHELL=$DEFSHELL
        else
                SHELL=$SGE_STARTER_SHELL_PATH
        fi
fi

# Omit setting the login flag to prevent problems with the module environment
#if [ "X$SGE_STARTER_USE_LOGIN_SHELL" == "Xtrue" ] ; then
#        LOGINFLAG="-l"
#fi

exec $LOGINFLAG $SHELL -c "$*"

See http://gridengine.org/pipermail/users/2014-April/007472.html for an explanation. On MaRC2, the login flag should *not* be set because it interferes with the module environment: a login shell would load the default modules, which can conflict with modules loaded inside the job script, and it produces numerous error messages in the job's stderr.
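
The difference between 'exec ${@}' and passing the command through a shell can be reproduced outside of SGE. The sketch below uses a made-up command with a leading environment assignment (FOO=bar printenv FOO) as a stand-in for what OpenMPI hands to the starter method; the actual strings OpenMPI generates will look different.

```shell
#!/bin/bash
# Hypothetical spawn command: an environment assignment followed by a program.

run_via_exec() {
    # Subshell, so a failing exec does not terminate the caller.
    # After expansion, "FOO=bar" is taken as the program name,
    # not as a variable assignment, so the exec fails.
    ( exec ${@} ) 2>/dev/null
}

run_via_shell() {
    # The joined string is parsed again by a shell,
    # which recognizes the leading assignment.
    /bin/bash -c "$*"
}

run_via_exec FOO=bar printenv FOO || echo "exec \${@} failed with status $?"
run_via_shell FOO=bar printenv FOO
```

The first call fails because no program named FOO=bar exists, while the second prints bar. This is why the workaround script ends with a shell invocation using -c "$*" rather than a plain exec of its arguments.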

Disabling automatic core binding

OpenMPI performs core binding automatically by default as of version 1.8.0. This can lead to problems in a shared-node cluster, because OpenMPI does not check with SGE which cores to bind to and has no way to determine whether other (Open)MPI applications are trying to use the same core(s). Thus, in a shared-node cluster like MaRC2, multiple MPI jobs can end up competing for the same cores on a machine. It is therefore advisable to disable OpenMPI's core binding by using the --bind-to switch:

mpiexec --bind-to none [MPI parameters] <executable> [execution parameters]
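
To avoid forgetting the switch in job scripts, the call could be wrapped in a small shell function. This is a sketch only: mpiexec_nobind is a hypothetical helper name, and ./my_mpi_program is a placeholder executable.

```shell
#!/bin/bash
# Wrapper that always disables OpenMPI's automatic core binding.
# mpiexec_nobind is a hypothetical helper, not part of OpenMPI.
mpiexec_nobind() {
    mpiexec --bind-to none "$@"
}

# Example call (placeholder executable). --report-bindings makes mpiexec
# print each rank's binding, which helps confirm that none is applied:
# mpiexec_nobind --report-bindings -n 4 ./my_mpi_program
```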