
Batch system

Purpose of a batch system

The batch system schedules the workload on the cluster in such a way that all users can benefit according to the site policy. It should prevent users from monopolising the resources. Typically this is done by enqueuing several jobs into a queue. The batch system then assigns priorities to the queued jobs. These priorities can be calculated from the time a job has been waiting, resources already used during the last few days, membership in a special project, requested resources, … (see Job priorities). The queued jobs are re-ordered according to priority and executed in that order.

Sun Grid Engine

The installed version of Sun Grid Engine is 6.2u5.

The documentation for the Sun Grid Engine can be found at

If you find the time, you're encouraged to read through it to learn about the concepts.

A first example

Jobs are submitted to the system via scripts. Resources are requested by specifying certain directives.

To start, here's the classical "hello world" example:

#!/bin/bash

# hello.sh
# a minimal example

#$ -S /bin/bash
#$ -e $HOME/sge
#$ -o $HOME/sge
#$ -l h_rt=360

echo "Hello world from $(hostname)"

exit 0

This simple script tells the Sun Grid Engine which shell to run in and where to write the stderr and stdout output. It does so via directives, which are marked by the comment sign followed by a dollar sign (#$). Here,

  • -S specifies the shell to use (/bin/bash in this case),
  • -e specifies the location where your stderr goes,
  • -o specifies the location where your stdout goes,
  • -l h_rt specifies the runtime (wallclock) limit in seconds (this is mandatory; your job won't start otherwise).

Finally, the script echoes the name of the machine it is executed on and exits. Assuming you have saved this file under the name "hello.sh" in your storage, you can place it into the queue for execution (your prompt may vary):

lpartec@marc2-h1:~/sge>qsub ./hello.sh
Your job 115 ("hello.sh") has been submitted

Please note that it's not necessary to grant execution rights to the script. You can check the state of your job using the qstat command:

lpartec@marc2-h1:~/sge> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------   
    115 0.50500 hello.sh   lpartec      Eqw   03/01/2012 15:21:06                                    1

Our job is in an error state (E) and queued waiting (qw). To find out why, use the -explain switch of qstat. Look for the "error reason":

lpartec@marc2-h1:~/sge> qstat -j 115 -explain E
==============================================================
job_number:                 115
exec_file:                  job_scripts/115
submission_time:            Thu Mar  1 15:21:06 2012
owner:                      lpartec
uid:                        1000
group:                      partec
gid:                        1000
...
stderr_path_list:           NONE:NONE:/home/partec/sge
mail_list:                  lpartec@marc2-h1
notify:                     FALSE
job_name:                   hello.sh
stdout_path_list:           NONE:NONE:/home/partec/sge
jobshare:                   0
shell_list:                 NONE:/bin/bash
env_list:                   
script_file:                ./hello.sh
error reason    1:          03/01/2012 15:21:16 [1000:24143]: error: can't open output file "/home/partec/sge": No such file or 
...

The problem is that the directory /home/partec/sge doesn't exist. There's not much we can do about this (the correct home directory would be /localhome/lpartec/sge in this case), so the only option is to delete the job:

lpartec@marc2-h1:~/sge> qdel 115
lpartec has deleted job 115

To see if we get it right, we can override the settings in the script from the command line:

lpartec@marc2-h1:~/sge> qsub -e $(pwd) -o $(pwd) -l h_rt=360 ./hello.sh
Your job 116 ("hello.sh") has been submitted

where $(pwd) evaluates to the current working directory. Our job is now running:

lpartec@marc2-h1:~/sge> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------    
    116 0.50500 hello.sh   lpartec      r     03/01/2012 15:28:16 all.q@node082.marc2                1

It has produced some output files:

lpartec@marc2-h1:~/sge> ls -l hello.sh*
-rw-r--r-- 1 lpartec partec 180 Mar  1 15:27 hello.sh
-rw-r--r-- 1 lpartec partec   0 Mar  1 15:28 hello.sh.e116
-rw-r--r-- 1 lpartec partec  19 Mar  1 15:28 hello.sh.o116

Indeed the output file (hello.sh.o116) contains:

lpartec@marc2-h1:~/sge> cat hello.sh.o116
hello from node082

With this knowledge, you can now fix your script to use the proper locations for the output. Be aware, though, that shell command substitution like the $(pwd) we used on the command line won't work inside #$ directives, since they are not evaluated by a shell; you have to hard-wire the exact location (pseudo variables such as $HOME are expanded by the Grid Engine itself):

Bad:

#$ -e $(pwd)
#$ -o $(pwd)

Good:

#$ -e /your/output/directory
#$ -o /your/output/directory

Better:

#$ -e $HOME/directory
#$ -o $HOME/directory

Further reading: man sge_intro

About directories (Thanks to Mr. Reuter):

If you just want the output to be created in the current working directory instead of your home directory's top level, submitting with -cwd is sufficient. If -cwd can't be used directly because the mount point of /home differs between submit host and execution host, the file sge_aliases can translate the paths (man sge_aliases):

/home/ * * /remotehome/users/

i.e. /home/reuti is /remotehome/users/reuti on the execution host. Submitting with -cwd from /home/reuti/sge will change the working directory on the execution host to /remotehome/users/reuti/sge.
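
For illustration, a minimal sketch of submitting with -cwd so that the output files land in the directory you submit from (myjob.sh is a placeholder script without -e/-o directives; this assumes the directory is visible on the execution host, if necessary via an sge_aliases translation as described above):

cd $HOME/sge
qsub -cwd -l h_rt=360 ./myjob.sh
# stdout/stderr files myjob.sh.o<jobid> and myjob.sh.e<jobid>
# are then written to $HOME/sge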

The important commands

The main commands you will need deal with the creation, diagnosis and control of batch jobs. They all come with well-written man pages. Please consult the man pages first if you have a question.

qstat

This command tells you about the state of the cluster: The running jobs, state of nodes, … Some examples:

  • Show your jobs:
    lpartec@marc2-h1:~> qstat
    job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
    -----------------------------------------------------------------------------------------------------------------
        107 0.55500 hello.sh   lpartec      r     02/29/2012 14:34:04 all.q@node046.marc2              128
    
    Use the option -u "*" to see jobs from all users.
  • Select full view:
    lpartec@marc2-h1:~> qstat -f
    queuename                      qtype resv/used/tot. load_avg arch          states
    ---------------------------------------------------------------------------------
    ...
    ---------------------------------------------------------------------------------
    all.q@node011.marc2            BIP   0/64/64        8.00     lx24-amd64    
        107 0.55500 hello.sh   lpartec      r     02/29/2012 14:34:04    64        
    ---------------------------------------------------------------------------------
    ...
    
  • Diagnose job priority: qstat -pri or qstat -urg (see man sge_priority or Job priorities)
  • Show details about your job: qstat -f -j <jobid>
  • Diagnose your job: qstat -j <jobid> -explain a|A|c|E
    Please note that, due to a memory leak bug in SGE 6.2u5, gathering of scheduler information had to be disabled, thus it must be queried explicitly:
    qalter -w v <jobid>
  • Diagnose a node: qstat -q <queue> -explain a|A|c|E

qsub

This command is used to submit the actual jobs. You can use all the options specified here also in your script by using the #$ directive comment.

  • qsub -l h_rt=<runtime limit> <your job script.sh> <your job script's arguments>: Simple job submission via script
  • qsub -l h_rt=<runtime limit> -b y <your binary> <your binary's arguments>: To submit a binary directly
  • qsub -l h_rt=<runtime limit> -t <n1>-<n2> …: To submit a job array. n1 has to be bigger than 0.
  • qsub -l h_rt=<runtime limit> -wd <directory> …: To specify the working directory of a job (-cwd uses the current directory)
  • qsub -l h_rt=<runtime limit> -q <queue-list> …: To specify a list of queues
  • qsub -l h_rt=<runtime limit> -m bea …: Receive notification emails on begin, end or abort (use -M to specify a custom email address)
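
As a sketch, here are some of these switches combined (the script name, binary path and email address are placeholders):

# submit myjob.sh with a one-hour limit and email notification
qsub -l h_rt=3600 -m bea -M someone@example.org ./myjob.sh

# submit a binary directly, as a job array with 10 tasks
qsub -l h_rt=3600 -b y -t 1-10 /path/to/my_binary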

qmod

This command can control/administer running jobs.

  • qmod -sj <job id>: suspends the job (via kill -SIGSTOP)
  • qmod -usj <job id>: releases the job from suspension (via kill -SIGCONT)
  • qmod -cj <jobid>: clears the error state of a job

qalter

This command can modify pending or running jobs. This is useful if you have to re-shuffle the order of your jobs. Most of the switches available with qsub are available here, so you can fix parameters you forgot or specified wrongly at submission time. Note that you can only reduce resource requests, not increase them.

  • qalter -a <time> <jobid>: Re-define the time a job is eligible for execution.
  • qalter -wd </some/path> <jobid>: Re-define the working directory. No effect if the job is already in state "r"
  • qalter -q <new queue> <jobid>: Move the job from the previously specified queue to the new queue. No effect if the job is already in state "r"

There is also a workaround for getting scheduler information for your job:

  • qalter -w v <jobid>: Show scheduler information

qdel

Terminates jobs.

  • qdel <job id>: Deletes the job
  • qdel -u <user>: Deletes jobs of user <user>

qconf

This command allows you to query properties of the batch system.

  • qconf -sql: List all queues
  • qconf -sel: List all execution nodes
  • qconf -se <node>: Show properties of execution node <node>
  • qconf -sq <queue>: Show properties of queue <queue>
  • qconf -spl: List all parallel environments (see below)
  • qconf -sp <pe>: Show properties of parallel environment <pe>

Batch jobs/Resources

You can find several example scripts for the Sun Grid Engine in $SGE_ROOT/examples/jobs. The topics treated probably cover most of what you will need:

  • Job Arrays: Multiple tasks running under one job ID
  • Job Dependencies: A chain of jobs running in a certain order (see the sketch after this list)
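
A common way to build such a chain is the -hold_jid switch of qsub; here is a brief sketch (the script names are placeholders):

# first job: preprocessing
qsub -N prep -l h_rt=3600 ./preprocess.sh

# second job: starts only after the job named "prep" has finished
qsub -hold_jid prep -l h_rt=36000 ./compute.sh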

As mentioned before, the purpose of a batch system is to make resources available to users in a fair way and at the same time maximise the load of the machine. Among these resources are compute time, memory, licenses, … They are specified with the -l switch. Resources come in two flavours:

  • hard: the job will not start until the resource is available
  • soft: Sun Grid Engine will try to allocate the resource, but the job might start without it

The complete list of resources is available via man 5 complex and man 5 queue_conf.

Some of the more important resources are:

  • [h|s]_rt: The real ("wallclock") time a job may run. When reaching the hard limit, the job is killed via a SIGKILL. If the soft limit is reached, your job will receive a SIGUSR1 signal. After the time specified in the queue parameter notify, it is then killed with SIGKILL.
  • [h|s]_cpu: The CPU time your job (all processes) has consumed. Otherwise, same behaviour as the _rt parameter. For parallel jobs, the limit is applied to each slot.
  • [h|s]_fsize: The filesize, set via ulimit -f
  • [h|s]_data: The data seg size, set via ulimit -d
  • [h|s]_stack: The stack size, set via ulimit -s
  • [h|s]_core: The core file size, set via ulimit -c
  • [h|s]_vmem: The virtual memory size, set via ulimit -v
  • h_rss: resident set size limit. Job will be killed if exceeded.
  • mem_free: Node must have at least this free memory.

See also man ulimit
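
On the command line, -l requests are hard by default; the -soft switch marks the requests following it as soft (and -hard switches back). A sketch with arbitrary values:

qsub -l h_rt=86400,h_vmem=4G -soft -l mem_free=8G ./myjob.sh
# h_rt and h_vmem are hard requests (the job will not start until they can be granted),
# mem_free is a soft request (preferred, but the job may start without it)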

Interactive jobs

Interactive jobs can be launched in different ways:

  • qlogin: Interactive login session
  • qrsh: Interactive rsh session
  • qsh: Interactive X-windows session

What's the difference? qlogin gives you an interactive login session on a compute node, qrsh can additionally run a single command remotely (similar to rsh), and qsh opens an xterm on the execution host (which requires X forwarding). A short sketch follows below.
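
For example (remember that h_rt is mandatory here as well):

# interactive shell on a compute node for up to one hour
qlogin -l h_rt=3600

# run a single command on a compute node and return
qrsh -l h_rt=600 hostname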

Task arrays

Task arrays allow you to submit a group of jobs under one job id. This is useful e.g. if you have a set of data which is split into chunks ("farming"). You can identify your data set by evaluating the environment variable SGE_TASK_ID. Note: Due to a bug in the Sun Grid Engine software, tasks may occasionally fail and display a 'job script not found' error. To circumvent this, we recommend using

qsub -b y <sge_parameters> <jobscript>

when submitting task arrays. Be aware that all SGE parameters (such as -l h_rt=200000 etc.) must be given on the command line, as the script will not be evaluated by the scheduler in this case; you then also have to make the script executable first (using chmod ugo+x <jobscript>), and the script file mustn't be moved or changed until the whole array job has completed.

Alternatively, if you want to leave the parameters within the job script for ease of reuse, you can submit task arrays via a custom wrapper that extracts all SGE parameters from the file and submits them. In this case, use

module load tools/python-3.5 tools/arsub
arsub.py -t <start-end:stepsize> <jobscript>
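
For illustration, a minimal task-array sketch that uses SGE_TASK_ID to select an input chunk (my_program and the input file naming are placeholders):

#!/bin/bash

# taskarray.sh - processes chunk number $SGE_TASK_ID

#$ -S /bin/bash
#$ -l h_rt=3600
#$ -e $HOME/sge
#$ -o $HOME/sge

./my_program input.${SGE_TASK_ID}.dat

Such a script would be submitted e.g. with qsub -t 1-10 ./taskarray.sh, or via qsub -b y / arsub.py as described above.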

Parallel environments

By setting a parallel environment, you may define the number of slots (i.e. CPU cores) needed by your job, as well as the way the slots are allocated among the compute nodes (e.g. all slots on one node).

Select the parallel environment with the -pe switch, for example:

  • orte_sl04 / orte_sl04_long / orte_sl08 / orte_sl08_long / orte_sl16 / orte_sl16_long / orte_sl32 / orte_sl32_long / orte_sl64 / orte_sl64_long: Allocate multicore jobs using 4 / 8 / 16 / 32 / 64 cores per node
    (e.g. -pe orte_sl08 64 uses 8 cores on 8 nodes while -pe orte_sl16 64 uses 16 cores on 4 nodes)
  • smp / smp_long / orte_smp / orte_smp_long: Single-node allocation of slots (symmetric multiprocessing, all slots on one node)
    Hint: Use a * wildcard to allow any of those environments, e.g. "-pe orte_smp* 4" to allocate 4 slots on one node, either on orte_smp or orte_smp_long (depending on "-l h_rt")
  • Use "qconf -spl" to get a list of all available parallel environments, "qconf -sp <pe>" will show configuration details of parallel environment <pe>.

Parallel environments also provide the environment variable NSLOTS, which can be used for the -np option of the mpiexec command:

#!/bin/bash

#$ -S /bin/bash
#$ -e /localhome/lpartec/src/mpi
#$ -o /localhome/lpartec/src/mpi
#$ -wd /localhome/lpartec/src/mpi
#$ -pe orte_sl64* 128

# load the proper module
. /etc/profile.d/modules.sh
module load parastation/mpi2-gcc-5.0.27-1

# run the mpiexec
mpiexec -np ${NSLOTS} ./hello

Environment variables

The following environment variables are set in your job's environment (among others):

SGE_O_WORKDIR The directory from which you submitted the job (or, if applicable, the working directory specified at submission)
TMPDIR The location of a node-local scratch directory provided to you; it disappears after the job has finished
SGE_TASK_ID The id of a task in your task array
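
A common pattern using these variables is to work in the node-local scratch directory and copy the results back at the end. A sketch (my_program and the file names are placeholders; it assumes the submit directory is also visible on the compute node):

# inside your job script:
cd $TMPDIR                           # node-local scratch, fast but temporary
cp $SGE_O_WORKDIR/input.dat .        # stage in from the submit directory
./my_program input.dat > result.dat
cp result.dat $SGE_O_WORKDIR/        # stage out before TMPDIR disappears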

Environment variables (HKHLR compatible)

The following environment variables are intended to be available on all HKHLR sites (KA, MR, GI, F, DA). You should preferably use these variables for portable job scripts.

HOME User home directory
HPC_LOCAL The location of a node-local scratch directory provided to you; it disappears after the job has finished
HPC_SCRATCH The location of a (cluster-wide) global scratch directory provided to you (= /scratch/username on MaRC2)
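
The same staging pattern as above, written portably with the HKHLR variables (again a sketch with placeholder names):

cd $HPC_LOCAL                         # node-local scratch
cp $HPC_SCRATCH/input.dat .           # global scratch (= /scratch/username on MaRC2)
./my_program input.dat > result.dat
cp result.dat $HPC_SCRATCH/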

Queues

Jobs scheduled for execution are assigned to one of the following queues, depending on the h_rt and parallel environment (and maybe queue list) you chose:

serial_test   (max. h_rt=3600,   i.e.  1 hour)
serial        (max. h_rt=259200, i.e.  3 days)
serial_long   (max. h_rt=864000, i.e. 10 days)
parallel_test (max. h_rt=3600,   i.e.  1 hour)
parallel      (max. h_rt=259200, i.e.  3 days)
parallel_long (max. h_rt=864000, i.e. 10 days)
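
For illustration (a sketch, assuming no additional queue restrictions apply):

# serial job with h_rt=3600: fits serial_test (and the longer serial queues)
qsub -l h_rt=3600 ./myjob.sh

# parallel job with h_rt=864000: only parallel_long has a high enough limit
qsub -l h_rt=864000 -pe "orte_smp*" 8 ./myjob.sh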

Old and new MaRC2 nodes:

The MaRC2 cluster was expanded in Sep/Oct 2014 by 8 additional compute nodes (node089-096). Therefore, we have defined host groups for the different hardware generations:

-l proc=amd6276 (=> node001 ... 088)
-l proc=amd6376 (=> node089 ... 096)

In order to execute e.g. a serial job only on the second generation nodes, you may submit your job as follows:

qsub -l h_rt=100,proc=amd6376 hello-world

Job priorities

Submitted jobs are ordered and executed (i.e. assigned to queues) by job priority. The job priority usually depends on several distinct values (see man sge_priority).

Current status (05-Sep-2014):

On MaRC2, the job priority only depends on allocated slots and waiting time (both the higher the better).

In detail, the priority is calculated as follows:

priority = 1*normalized_posix_priority + 0.1*normalized_urgency + 0.01*normalized_tickets
urgency = resource_requirement_contribution + waiting_time_contribution + deadline_contribution
resource_requirement_contribution = 1000*allocated_slots
waiting_time_contribution = 0.740740*seconds_since_submit

where normalized_posix_priority, normalized_tickets and deadline_contribution are usually constant.
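
As a rough worked example (assuming the constant contributions are zero), a job requesting 64 slots that has been waiting for one hour gets

urgency ≈ 1000*64 + 0.740740*3600 ≈ 64000 + 2667 = 66667

so, all else being equal, it outranks a freshly submitted 32-slot job (urgency ≈ 32000).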

Hint: Call qstat -pri or qstat -urg to see your jobs' distinct priority or urgency values.
Use qconf -ssconf and qconf -sc to show scheduler and complex configuration.