Run a computation

You submit batch jobs to the scheduler with details of which resources (for example number of processors, memory) are required for the job to run. The scheduler queues the job until the requested resources are available then executes the job on whichever compute node(s) are free at the time.

Jobs may be queued for some time before execution, so it is important to ensure that they do not require any user interaction at run time. It is also possible to run interactive jobs.

DIaL2.5@Leicester has a number of queues:

  • dirac25x – Runs jobs on the DIaL skylake cluster. This is the default queue; if you do not specify a queue, your job will run here.
  • highmem - Runs jobs on the high-memory (1.5TB) nodes.
  • superdome - Runs jobs on the superdome system (144 cores, 6TB RAM)
  • dirac2 - Runs jobs on the Data Intensive 2 (formerly complexity) sandybridge cluster.
  • devel – this is for code development and testing and has limits placed on it so that it cannot be used for large production runs. To enable quicker turnaround time for development work, jobs in this queue have an increased priority. Jobs in this queue will run on the skylake cluster.
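
The default queue is used automatically, but you can direct a job to a particular queue by passing the -q option to qsub, either on the command line or as a directive in the submission script. For example, to submit to the devel queue:

$ qsub -q devel <job submission file>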

Submit a job

Use the qsub command to submit a simulation to the job scheduler. You should not run any intensive or long-lived processes directly on the login nodes.

You can control the behaviour of qsub using directives, specified either on the command line or in a job submission file. For example:

$ qsub <job submission file>

The qsub command returns a unique identifier (jobID) for the job. This is how the system refers to the job during its lifetime.
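
For example (the exact form of the job ID, including the server hostname suffix, varies by system; the one shown here is illustrative):

$ qsub myjob.sub
123456.master.cluster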

Resources

The scheduler will use the resource information when finding a suitable spare slot on the cluster. There are default values for each resource and maximum values that can be requested. It is always a good idea to request the minimum amount of each resource that your job actually needs: the more resources requested, the longer the job is likely to be queued waiting for them to become free.

You do not need to choose a queue since production jobs are routed automatically to the appropriate queue. When you submit a job, you may need to specify:

  • Number of processor cores (procs or nodes x ppn)
  • Amount of memory (pvmem or vmem)
  • Execution time (walltime)
  • DiRAC Project
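
Taken together, a typical set of resource directives might look like the following sketch (the values shown are illustrative only):

#PBS -l walltime=12:00:00
#PBS -l nodes=2:ppn=36
#PBS -l pvmem=4gb
#PBS -A <projectcode>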

Processor cores

The cluster is built as a number of compute nodes (nodes), each with a number of processor cores (ppn).

The number of processor cores requested will be the product of these two values, and this would be expressed as nodes=N:ppn=P (where N and P are appropriate values). Alternatively jobs can simply request a number of procs.

  • OpenMP/threaded jobs can request nodes=1:ppn=P or procs=P, where P is between 1 and 36 (144 on the superdome).
  • MPI jobs can make use of any combination of nodes and ppn values, within the limits dictated by the system size.

Best practice for MPI jobs is to minimise nodes and maximise ppn values in order to get the greatest performance.

The scheduler only allows one job to run on each node so requests for less than the full number of cores per node will leave the remaining cores idle.
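
For example, to request 72 cores as two full skylake nodes:

#PBS -l nodes=2:ppn=36

or, equivalently in terms of core count, leaving the layout to the scheduler:

#PBS -l procs=72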

Memory

Each standard skylake compute node has 192GB of physical memory and 16GB of swap available, giving a total of 208GB of virtual memory. Use pvmem to request a portion of the virtual memory per process or vmem to make a request for the job as a whole.

  • For MPI parallel jobs, request pvmem.
  • For threaded/OpenMP parallel jobs, request vmem.
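
For example, to request 2GB of virtual memory per process for an MPI job:

#PBS -l pvmem=2gb

or 8GB of virtual memory for a whole single-node threaded job:

#PBS -l vmem=8gb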

Execution time

The walltime is the amount of time that the job should run for. Setting a value for walltime is effectively mandatory, as the default value has been deliberately set very low. It can be expressed as a number of seconds, or more usefully as hours, minutes and seconds:

walltime=hh:mm:ss

walltime is the maximum 'real' time that a job will execute for. Once the time is up, the job will be killed.

The maximum walltime which can be requested is 5 days (120 hours).
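
For example, to request 12 hours and 30 minutes of walltime:

#PBS -l walltime=12:30:00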

DiRAC Project

To run jobs against the allocation of CPU hours awarded to the DiRAC project, the job submission should specify the project code. Jobs using allocated time are given a much higher priority in the queue. Projects are specified with the submission option:

-A <projectcode>

Job types

Example submission file

All the examples below use the same basic job submission file. This example creates a job called JobName that requests six hours of walltime with 1GB of memory per process. Job events (begin, end and abort) will be reported to the specified email address.

#PBS -N JobName
#PBS -l walltime=06:00:00
#PBS -l pvmem=1gb
#PBS -m bea
#PBS -M <email address>
#PBS -A <project name>

Serial jobs

Serial jobs run as a single process and thus require only a single processor core for execution.

In this case the only addition required to the example job script is the command to execute, with any arguments needed. For a program called mycode.exe this would be:

/path/to/mycode.exe

You would need to substitute the real full path to the mycode.exe executable file.

Full submission script:

#PBS -N JobName
#PBS -l walltime=06:00:00
#PBS -A <project name>
# Execute the job code
/path/to/mycode.exe

OpenMP/threaded jobs

OpenMP and threaded jobs are parallel in nature, but only scale as far as the resources available within a single node. They cannot take advantage of processors across multiple compute nodes.

For these jobs, you must additionally request the number of processors that the job will require. As the job cannot be spread across more than one node, nodes must equal 1, and ppn can be any value from 1 to the number of physical cores per node (36 for skylake nodes, 16 for sandybridge nodes, 144 on the superdome). Note that, by default, only one job is run per node, so requests for fewer than the full number of cores will still be allocated a complete node. In the example below, four cores on a single node have been requested:

#PBS -l nodes=1:ppn=4

The environment variable OMP_NUM_THREADS must be set to the same value as ppn. To avoid having to edit this, the value can be calculated at submission time from within the script. The following line calculates the value that OMP_NUM_THREADS should be by counting the number of lines in the special file pointed to by PBS_NODEFILE.

export OMP_NUM_THREADS=$(cat $PBS_NODEFILE | wc -l)

You also need to add the command to execute the job code. In this example it's a program called myopenmp.exe:

/path/to/myopenmp.exe

You would need to substitute the real full path to the myopenmp.exe executable file.

Complete job submission script, based on the example above:

#!/bin/bash
#
#PBS -N JobName
#PBS -l walltime=06:00:00
#PBS -l vmem=1gb
#PBS -l nodes=1:ppn=4

# Set OMP_NUM_THREADS for OpenMP jobs

export OMP_NUM_THREADS=$(cat $PBS_NODEFILE | wc -l)


# Execute the job code

/path/to/myopenmp.exe

As OMP_NUM_THREADS is calculated automatically, if the number of threads required for the job changes, you only need to change the value of ppn.

MPI jobs

MPI jobs are parallel in nature and can scale beyond the bounds of a single compute node. How far such jobs can scale is entirely dependent on the nature of the program code and the problem it is designed to solve.

MPI jobs can take full advantage of the size of the cluster but are harder to write and debug than serial or OpenMP/threaded programs. You also need to consider the cluster's architecture when instructing the scheduler how to distribute the job.

As with OpenMP/threaded jobs, the nodes and ppn resources specify how an MPI job should be distributed across the cluster. The product of the nodes and ppn values determines the number of processes requested. An MPI job that requires 64 cores could be described with the following combination of nodes and ppn values:

#PBS -l nodes=2:ppn=32

As a rule of thumb, MPI jobs should be distributed so as to maximise the ppn value, as MPI communication within a node will always be faster than between nodes.
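
For example, both of the following request 64 cores, but the second packs them onto fewer nodes and will normally perform better:

#PBS -l nodes=4:ppn=16
#PBS -l nodes=2:ppn=32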

You usually start MPI jobs with the mpirun command. To run an MPI executable called mympicode.exe, use the following command in the submission script:

mpirun ./mympicode.exe

The mpirun command is provided by the appropriate MPI module. Therefore you need to load the correct module in the submission script. Which one should be loaded depends on the compiler used to build the executable. In this example the Intel compilers and MPI libraries will be used:

module load intel/compilers/18.0.1
module load intel/mpi/18.0.1

OpenMPI and IntelMPI modules are currently supported.

Complete job submission script, based on the example above:

#PBS -N JobName
#PBS -l walltime=06:00:00
#PBS -l nodes=8:ppn=16
#PBS -m bea
#PBS -M <email address>
#PBS -A <project name>

# Make the Intel compiler and MPI libs available
module load intel/compilers/18.0.1
module load intel/mpi/18.0.1

# Execute the MPI job code
mpirun /path/to/mympicode.exe

Hybrid MPI/OpenMP

Assume that we need 144 cores for our hybrid MPI/OpenMP simulation. We request 4 nodes (with 36 cores on each node) and place 2 MPI processes per node, where each MPI process runs 18 OpenMP threads.

#PBS -N JobName
#PBS -l walltime=06:00:00
#PBS -l nodes=4:ppn=36
#PBS -A <project name>

module load intel/mpi/18.0.1

# Ignore any placement instructions from the job scheduler.
export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=0

# Pin each MPI process and all of its OpenMP threads to the same socket.
export I_MPI_PIN_DOMAIN=sock

# Set threads
export OMP_NUM_THREADS=18

# 8 processes in total, with two processes on each of the four nodes.
mpirun -perhost 2 -np 8 ./hybrid_omp_mpi.exe

To check that placement happened correctly you can add

export I_MPI_DEBUG=4

to your job submission file (at any point before you call mpirun). This will show the CPU affinity for each MPI process; the information is displayed as the MPI rank followed by the process ID, the host name and a list of physical processor IDs, for example:

  [0] MPI startup(): 0       4421     xnode001    {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17}

This tells us that MPI rank 0 (process ID 4421), running on xnode001, can use cores 0-17. Finally, you can see which socket these physical processor IDs belong to using the lstopo command:

module load hwloc
lstopo --physical  --no-caches --no-io --no-bridges --no-icaches

Machine (192GB total)
  NUMANode P#0 (96GB) + Package P#0
    Core P#0 + PU P#0
    Core P#1 + PU P#1
    Core P#2 + PU P#2
    Core P#3 + PU P#3
    Core P#4 + PU P#4
    Core P#8 + PU P#5
    Core P#9 + PU P#6
    Core P#10 + PU P#7
    Core P#11 + PU P#8
    Core P#16 + PU P#9
    Core P#17 + PU P#10
    Core P#18 + PU P#11
    Core P#19 + PU P#12
    Core P#20 + PU P#13
    Core P#24 + PU P#14
    Core P#25 + PU P#15
    Core P#26 + PU P#16
    Core P#27 + PU P#17
  NUMANode P#1 (96GB) + Package P#1
    Core P#0 + PU P#18
    Core P#1 + PU P#19
    Core P#2 + PU P#20
    Core P#3 + PU P#21
    Core P#4 + PU P#22
    Core P#8 + PU P#23
    Core P#9 + PU P#24
    Core P#10 + PU P#25
    Core P#11 + PU P#26
    Core P#16 + PU P#27
    Core P#17 + PU P#28
    Core P#18 + PU P#29
    Core P#19 + PU P#30
    Core P#20 + PU P#31
    Core P#24 + PU P#32
    Core P#25 + PU P#33
    Core P#26 + PU P#34
    Core P#27 + PU P#35

Array jobs

An array job lets you submit the same job dozens or even hundreds of times with different inputs. This is a lot quicker than manually submitting the job multiple times.

In this example the serial job mycode.exe will be submitted 10 times. Each job is still a serial job requesting only one processor core, but 10 instances will be queued. The qsub option for creating an array job is -t:

#PBS -t 1-10

Each instance of the job gets a job ID as usual, but also an array ID. This is available within the environment variable PBS_ARRAYID, and can be used within the script or program code to derive input values for the jobs. The array values can be expressed in a variety of ways but must always be integers. For example:

#PBS -t 0,2,4,6,8,10,12,14,16,18

This would still be an array of 10 jobs, but the PBS_ARRAYID value would only be set to even values.
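
As a sketch of how the array ID can be used to select per-task input (the input file naming scheme here is hypothetical), the job script might contain:

# Each array task reads its own input file: input_1.dat, input_2.dat, ...
/path/to/mycode.exe input_${PBS_ARRAYID}.dat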

Array jobs are given a job ID of the form JobID[ArrayID]

Complete job submission script, based on the example above:

#PBS -N JobName
#PBS -l walltime=06:00:00
#PBS -A <project name>
#PBS -t 1-10

# Execute the serial job code
/path/to/mycode.exe

Use the Torque commands such as qstat, qalter and qdel to view or change properties for all tasks of the array job or for specific tasks or task ranges. For example:

qdel 12345[] -t 34-56

will delete tasks 34-56 of job 12345. Note that the syntax for referencing an array job with the Torque commands is <jobid>[] rather than simply <jobid>.

Interactive jobs

Interactive jobs are those jobs that require some sort of interaction with a user or expect to be able to display graphical output. Interactive jobs may be serial, OpenMP/threaded or MPI types, but the submission process is slightly different.

Rather than create a job submission script, interactive jobs are best started by passing the resource requests to the qsub command directly. The important option is -I which instructs the system to initiate an interactive job. Other options should be as they are for a normal job.

$ qsub -A <project name> -I -l walltime=01:00:00 -l pvmem=4gb -l nodes=1:ppn=36

You may want to use the devel queue for interactive jobs, since these jobs are typically short and need to run immediately to be useful. This example provides access to 64 CPUs (32 CPUs on each of 2 nodes) for one hour:

qsub -A <project name> -I -q devel -l walltime=01:00:00,nodes=2:ppn=32

The scheduler will queue the job as normal at this point until the requested resources are available. Once they are, the scheduler will start a new login shell on whichever node becomes available. Then you can run anything within the limits of the requested resource. Once the walltime limit has been reached, the session will be closed.

When the job starts, the user will see a command prompt on one of the compute nodes. The working environment is the same as it would be for a non-interactive (batch) job, making this a good way to debug problems with batch jobs.

For interactive jobs which need a graphical display, add the -X option to the qsub command. This will forward X requests back to the login node. Additionally for X-forwarding in interactive jobs to work, you must be logged into DiRAC via SSH with X-forwarding enabled.
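
For example (the login hostname is a placeholder; use the address you normally connect to):

$ ssh -X <username>@<login node>
$ qsub -A <project name> -I -X -l walltime=01:00:00 -l nodes=1:ppn=1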


Monitor jobs

Use the showq command to show all jobs in the queue. Its output is split into three sections:

  • Active jobs
  • Eligible jobs
  • Blocked jobs

Active jobs

Active jobs are those that are running or starting and consuming resources. You will see the following information:

  • Job id
  • Job owner
  • Job state
  • Number of processors allocated to the job
  • Time remaining until the job completes (HH:MM:SS)
  • Time the job started

All active jobs are sorted in "Earliest Completion Time First" order.

Eligible jobs

Eligible Jobs are those that are queued and eligible to be scheduled. They are all in the Idle job state and do not violate any fairness policies or have any job holds in place.

The jobs in the Idle section display the same information as the Active Jobs section except:

  • CPULIMIT is specified rather than job time remaining
  • QUEUETIME is displayed rather than job start time

The jobs in this section are ordered by job priority. Jobs in this queue are considered eligible for both scheduling and backfilling.

Blocked jobs

Blocked jobs are those that are ineligible to be run or queued. If you find that your jobs are "blocked" (usually shown as Held or Deferred) you can find out more information using the command:

$ checkjob -v <jobid>

In particular you are looking for the section that reads:

available for 8 tasks     - node[015-155]
rejected for CPU          - node[014,035-243]
rejected for State        - node[009-256]
rejected for Swap         - node[039-040]
rejected for Reserved     - node[001-008]

This tells you the reason why particular nodes are unable to run your job. Jobs may also end up in the blocked section because a back-end service or process was not working at the time you submitted the job.

Deferred jobs

Finally, jobs may be shown as Deferred if you have a lot of outstanding work (i.e. running and/or idle jobs). There is no need to do anything as this is the expected behaviour. The job scheduler will reassess 'Deferred' jobs once every hour to see if the condition that was preventing them from starting has been resolved. If so the job will be moved to the Idle (queuing) state and when resources become available the job will start running as normal.

Detailed queue information

To get detailed information about running jobs use the command

$ showq -r

It is useful to look at the EFFIC field, which shows the CPU efficiency of your job between 0 and 100%. Parallel jobs are generally less efficient than serial ones, but very low efficiencies (less than, say, 50%) may indicate that you have a lot of interprocess communication (you might want to consider using fewer cores), or that you are doing a lot of I/O. Similarly, the command

$ showq -i

shows detailed information about idle (queuing) jobs, including the job priority. To show only your own jobs you can use the command

$ showq -u <username>

To show jobs listed by name (rather than job ID)

$ showq -n

To show even more information about jobs

$ showq -v
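
Most of the flags shown above can be combined. For example, to show detailed information about your own running jobs:

$ showq -r -u <username>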

In addition to the showq command (which is part of Moab), there is a second command for monitoring jobs called qstat (which is part of Torque).

$ qstat

One useful thing that this command can provide is a list of the nodes that have been allocated to your job:

$ qstat -n

Other useful commands

You can get an estimated job start time using

$ showstart <jobid>

You can see a summary of historical and current utilization by user, using

$ showstats -u

In particular look at the WCAcc field - your average Wallclock Accuracy. The more accurate you are with your walltime estimates the better chance your job has of being "backfilled" (i.e. run ahead of time). You can see how much time a completed job actually used with

$ tracejob -n 30 <jobid>

This asks tracejob to search for the given job, anywhere in the last 30 days' worth of job log files.

Scheduling priorities

Scheduling Principles

  • DIaL2.5 is designed as a capability system which is optimised for running larger jobs. Large jobs are given the highest priority in the queue.
  • Some smaller jobs still need to run, and these will generally fit in gaps on the system around the large jobs. Where this doesn't happen, jobs also have their priority increased as their time queued increases.
  • Jobs which are running against RAC allocated CPU time resources have higher priority than those for which the allocation has run out.
  • Development jobs have high priority but are limited in size and number per user. A small number of nodes are set aside during the working day to assist rapid turnaround of development jobs.

Understanding the priority of each job

This is best done by looking at the output from the command mdiag -p as shown below:

# mdiag -p
diagnosing job priority information (partition: ALL)

Job                    PRIORITY*   Cred(  QOS:Class)  Serv(QTime)   Res( Proc)
 Weights               --------    1(10000000:  1)     1(    5)      1(  250)

51955                  11259525    97.7(  1.0:10000)   0.0(705.0)   2.3(1024.)
51371                  11145590    98.7(  1.0:10000)   0.6(13118)   0.7(320.0)
51394                  11143915    98.7(  1.0:10000)   0.6(12783)   0.7(320.0)
51813                  11143110    98.7(  1.0:10000)   0.1(3022.)   1.1(512.0)
51834                  11142830    98.7(  1.0:10000)   0.1(2966.)   1.1(512.0)

Job 51955 gains 97.7% of its score through a Quality of Service (QoS) weighting (1.0x10,000,000) due to there being allocated resources remaining, plus 1,000,000x1 for being in the standard queue (devel queue scores higher). It has been queued for 705 minutes, which is scaled by a factor of 5. Finally, it gains priority through requesting 1024 cores, which has a weighting of 250. The final score is then generated as follows:

[10,000,000 (allocation remaining) or 0 (no allocation)] + [1,000,000 (default queue) or 1,500,000 (devel queue)] + 5x (minutes queued) + 250x (cores requested)

For job 51955 this calculation is:

1x (1.0x10,000,000 + 1,000,000x1) + 1x (705x5) + 1x (1024x250) = 11,259,525

The next two jobs in the queue (51371 and 51394) only request 320 cores, giving them a lower score. However, they have been queuing for a significant amount of time so have jumped ahead of the last two jobs shown here which each request 512 cores.

Note that the mdiag output truncates the class (queue) component of the score. Most jobs have a class score of 1,000,000 (not the 10000 shown in the output), while devel queue jobs gain 1,500,000.