PBS Batch System
User jobs are run in batch mode on the cluster's compute nodes.
The batch system is the TORQUE Resource Manager or OpenPBS.
The executables for manipulating the batch system are located in /usr/local/pbs/bin.
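If these commands (qsub, qstat, etc.) are not already on your PATH, you can add the directory in your shell, for example (assuming a bash login shell):

$ export PATH=/usr/local/pbs/bin:$PATH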
Example PBS Batch Script
A batch script describes the sequence of programs to be run.
Batch scripts are ordinary shell scripts, written for bash, sh, or csh,
with additional PBS option lines, prefixed by #PBS, which the shell treats as comments.
PBS options may be specified at the beginning of a batch script, or they may be
given as command line options to the qsub command when the script is submitted
to the batch system. The available options are described in the unix man page for qsub.
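For example, the resource request and job name used in the script below could instead be supplied on the qsub command line; the following is shown only as an illustration of that form:

$ qsub -N cpi-example -l nodes=1:ppn=3,walltime=00:01:00,pmem=1gb batch-example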
The example bash script below runs the MPI example program cpi
(see /usr/local/mpich-gcc/example/cpi.c) in parallel.
The source for this script is here: batch-example, and output from a run is here: cpi-example.out.
In the example shown below, lines 2 - 8 are PBS options.
Option "-S" selects bash as the shell for the batch job.
Option "-N" sets the name this job will have in the queue.
Option "-j oe" redirects the stderr stream to stdout.
Option "-o" sets the name of the file that will hold the job's stdout;
job output is moved to this file when the batch job finishes.
Option "-m n" disables email notification of job status by PBS.
In line 7, "-l nodes=1:ppn=3,walltime=00:01:00,pmem=1gb" lists
the resources this job is requesting. A job requiring more than one node
must specify a value for nodes, either in the batch file or when the job is
submitted. The value of ppn determines how many processors per node are assigned to your job.
This example job requests one node with three processors per node: "nodes=1:ppn=3".
A batch job must also specify the maximum walltime it expects to take to complete
("walltime=00:01:00") and the maximum amount of memory per core it needs ("pmem=1gb").
A job exceeding the requested time or memory limits will be terminated.
See the qsub documentation for descriptions of the environment variables
such as PBS_O_WORKDIR and PBS_NODEFILE
that PBS sets for batch jobs.
#! /bin/bash
#PBS -S /bin/bash
#PBS -N cpi-example
#PBS -j oe
#PBS -o ./cpi-example.out
#PBS -m n
#PBS -l nodes=1:ppn=3,walltime=00:01:00,pmem=1gb
#PBS -q normal
cd ${PBS_O_WORKDIR}
# print identifying info for this job
echo "Job ${PBS_JOBNAME} submitted from ${PBS_O_HOST} started "`date`" jobid ${PBS_JOBID}"
# count the number of nodes listed in PBS_NODEFILE
nodes=$(( $(wc --lines < ${PBS_NODEFILE}) ))
echo "Job allocated $nodes nodes"
# Always use rcp or rsync to stage any large input files from the head node (lqcd)
# to your job's control worker node.
# All worker nodes have attached disk storage in $COSMOS_DIR
#
# Copy below is commented since there are no files to transfer
# See below for the use of fcp
# fcp -c rcp -r $HOME/myInputDataDir $COSMOS_DIR
echo "example MPI program see: /usr/local/mvapich/examples/cpi.c"
application=${PBS_O_WORKDIR}/cpi
echo
cpus=$nodes
echo "=== Run MPI application on $cpus cpus (1 cpu per node) ==="
mpirun -np $cpus $application
# Always use rcp or rsync to copy any large result files you want to keep back
# to the head node before exiting your script. The $COSMOS_DIR area on the
# workers is wiped clean between jobs.
#
# There were no output files created in this example.
# rcp -p $COSMOS_DIR/myOutputDataFile lqcd:/data/raid1/data/myDataArea
Queue Policy and Queue Names
The available queues on fulla are:
normal (16GB per node, 121 nodes)
large (32GB per node, 3 nodes)
huge (128GB per node, 1 node)
Use the following command to see a particular queue's setting:
$ qmgr -c "list queue normal"
Queue normal
    queue_type = Execution
    total_jobs = 1
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
    resources_max.walltime = 36:00:00
    resources_default.neednodes = dual
    resources_default.nodes = 1
    resources_default.walltime = 24:00:00
    mtime = Tue Feb 5 14:26:44 2008
    resources_assigned.ncpus = 0
    resources_assigned.nodect = 8
    enabled = True
    started = True
Because the large and huge queues are small, access to them is restricted to prevent misuse. If you need access to more than 16GB of memory per node, send an e-mail with justification to the admin mailing list.
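Once access has been granted, a batch script selects one of these queues with the -q option; for example (an illustrative request only, sized to the per-node memory advice below):

#PBS -q large
#PBS -l nodes=1:ppn=8,walltime=12:00:00,pmem=3900mb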
Important!
Batch scripts must specify the walltime and memory limits they request; an attempt to submit a script without both of these limits specified will result in an error. The memory limit can be specified either as pmem=<request> or as vmem=<request>. The first form specifies the memory limit per core, while the second specifies the total virtual memory for the job on a single node. The first form is often more convenient for MPI jobs, while the second is more suitable for OpenMP and serial jobs.
The memory specification in <request> is an integer followed by a unit, for example 1500mb, 2gb, etc. Do not use a decimal value: 1.5gb is a mistake; use 1500mb instead.
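For example (illustrative values only), the first line below requests 1900mb per core for an 8-core job, while the second requests 4gb of total virtual memory for a serial job:

#PBS -l nodes=1:ppn=8,walltime=01:00:00,pmem=1900mb
#PBS -l nodes=1,walltime=01:00:00,vmem=4gb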
Important! Memory Limits
Normal queue nodes do not have exactly 16GB of memory; they have slightly less,
only 16,000MB (1GB = 1024MB). Thus, if you request ppn=8,pmem=2gb, your job will
never run - there will never be a node with that much memory. While you can
request ppn=8,pmem=2000mb, this is not a good idea either, as it leaves no
memory for the operating system. We recommend that you never request more than
8*1900mb=15200mb total on a normal queue node, or more than 8*3900mb=31200mb on
a large queue node. Of course, since the nodes have shared memory, requests like
ppn=4,pmem=3800mb, ppn=2,pmem=7600mb, or ppn=1,pmem=15200mb are perfectly fine.
Managing Data Files in Batch Scripts
The recommended way to read a large data file from within a batch job is to
first stage the file to the local disk attached to the worker
before your application opens the file.
Each worker has a staging area called $COSMOS_DIR.
To avoid network congestion, under no circumstances should a batch script
transfer large amounts of data (> 100MB) from the home area. All large data
files must be located on the /lustre filesystem. The home area is provided for
keeping source files and additional libraries, compiling executables, making
plots, etc. All analysis of large data files must be done on the /lustre filesystem.
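A minimal sketch of this staging pattern inside a batch script, using rsync and purely hypothetical file names and /lustre paths (adjust to your own data layout):

# stage a (hypothetical) input file to the worker's local staging area
rsync -a lqcd:/lustre/myproject/input.dat $COSMOS_DIR/
cd $COSMOS_DIR
# ... run the analysis against the local copy here ...
# copy any (hypothetical) results back before the script exits;
# $COSMOS_DIR is wiped clean between jobs
rsync -a $COSMOS_DIR/output.dat lqcd:/lustre/myproject/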
Submitting and Monitoring a Batch Job
Batch scripts are submitted for execution using the PBS qsub command.
See the unix man page and the PBS documentation for
a description of all of the command line
options for qsub.
The example batch script specifies the required number of nodes and
walltime with the #PBS -l line, hence,
they need not be repeated as arguments to the qsub command.
Use qsub to submit the example script to PBS:
$ qsub batch-example
87.fulla.fnal.gov
Use the qstat command to view queued jobs.
$ qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1689.fulla CL6_512 drudd 0 Q normal
1695.fulla run.sh vanconant 00:01:52 R normal
Status "R" for a job indicates it is running. Status "Q" indicate
a job is queued and waiting to run. See the qstat man pages for
a descrition of all the command line options available.
A graphical display of cluster status is available with the command:
pbstop
Another useful command to check the status of your job is:
/usr/local/maui/bin/checkjob job_number