FNAL - KICP Joint Cluster

Cosmos Cluster Documentation

PBS Batch System

User jobs are run in batch mode on the cluster's compute nodes. The batch system is the TORQUE Resource Manager or OpenPBS. The executables for manipulating the batch system are located in /usr/local/pbs/bin.
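
If these commands are not already on your PATH, they can be added in your shell startup file; a minimal sketch, assuming a bash login shell:

$ export PATH=/usr/local/pbs/bin:$PATH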

Example PBS Batch Script

A batch script describes the sequence of programs to be run. Batch scripts are ordinary shell scripts, written for bash, sh, or csh, with additional PBS option lines prefixed by #PBS; because these lines begin with #, the shell treats them as comments. PBS options may be specified at the beginning of a batch script, or they may be given as command line options to the qsub command when the script is submitted to the batch system. The available options are described in the Unix man page for qsub.
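
For example, instead of embedding the options in the script, the same resource request and job name could be supplied on the qsub command line at submission time (batch-example is the script used later on this page):

$ qsub -N cpi-example -l nodes=1:ppn=3,walltime=00:01:00,pmem=1gb batch-example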

The example bash script below runs the MPI example program cpi (see /usr/local/mpich-gcc/example/cpi.c) in parallel. The source for this script is available here: batch-example, and output from a run is here: cpi-example.out.

In the example shown below, Lines 2 - 8 are PBS options.

Option "-S" selects bash as the batch job shell.

Option "-N" is the name this job will have in the queue.

Option "-j oe" redirects stream stderr to stdout.

Option "-o" sets the name of the file containing stdout from the job. Job output will be moved to this file when the batch job finishes.

Option "-m n" disables email notification of job status by PBS.

In line 7, "-l nodes=1:ppn=3,walltime=00:01:00,pmem=1gb" lists the resources this job is requesting. A job requiring more than one node must specify a value for nodes, either in the batch file or when the job is submitted. The value of ppn determines how many processors are assigned to your job. This example job requests one node with three processors per node: "nodes=1:ppn=3". A batch job must also specify the maximum walltime it expects to need ("walltime=00:01:00") and the maximum amount of memory per core it needs ("pmem=1gb"). A job exceeding the requested time or memory limits will be terminated.
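
As a further illustration, a larger multi-node request might look like the line below; the node and core counts are only placeholders, and the memory figure follows the per-core guidance in the Memory Limits section further down this page:

#PBS -l nodes=4:ppn=8,walltime=12:00:00,pmem=1900mb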

See the qsub documentation for descriptions of the environment variables such as PBS_O_WORKDIR and PBS_NODEFILE that PBS sets for batch jobs.

  1. #! /bin/bash
  2. #PBS -S /bin/bash
  3. #PBS -N cpi-example
  4. #PBS -j oe
  5. #PBS -o ./cpi-example.out
  6. #PBS -m n
  7. #PBS -l nodes=1:ppn=3,walltime=00:01:00,pmem=1gb
  8. #PBS -q normal
  9. cd ${PBS_O_WORKDIR}
  10. # print identifying info for this job
  11. echo "Job ${PBS_JOBNAME} submitted from ${PBS_O_HOST} started "`date`" jobid ${PBS_JOBID}"
  12. # count the number of nodes listed in PBS_NODEFILE
  13. nodes=$[`cat ${PBS_NODEFILE} | wc --lines`]
  14. echo "Job allocated $nodes nodes"
  15. # Always use rcp or rsync to stage any large input files from the head node (lqcd)
  16. # to your job's control worker node.
  17. # All worker nodes have attached disk storage in $COSMOS_DIR
  18. #
  19. # Copy below is commented since there are no files to transfer
  20. # See below for the use of fcp
  21. # fcp -c rcp -r $HOME/myInputDataDir $COSMOS_DIR
  22. echo "example MPI program see: /usr/local/mvapich/examples/cpi.c"
  23. application=${PBS_O_WORKDIR}/cpi
  24. echo
  25. cpus=$nodes
  26. echo "=== Run MPI application on $cpus cpus (1 cpu per node) ==="
  27. mpirun -np $cpus $application
  28. # Always use rcp or rsync to copy any large result files you want to keep back
  29. # to the head node before exiting your script. The $COSMOS_DIR area on the
  30. # workers is wiped clean between jobs.
  31. #
  32. # There were no output files created in this example.
  33. # rcp -p $COSMOS_DIR/myOutputDataFile lqcd:/data/raid1/data/myDataArea
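
Note that the script expects the cpi executable to exist in the submission directory (${PBS_O_WORKDIR}). If you have not built it yet, it can be compiled from the example source with the MPI compiler wrapper, assuming mpicc from the cluster's MPI installation is on your PATH:

$ mpicc -o cpi /usr/local/mpich-gcc/example/cpi.c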

Queue Policy and Queue Names

The available queues on fulla are:

    normal (16GB per node, 121 nodes)
    large  (32GB per node, 3 nodes)
    huge   (128GB per node, 1 node)
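
A job selects one of these queues with the -q option, either on the qsub command line or in the script itself, as in line 8 of the example above. For instance, a job that has been granted access to the large queue would use:

#PBS -q large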

Use the following command to see a particular queue's settings:

$ qmgr -c "list queue normal"
Queue normal
        queue_type = Execution
        total_jobs = 1
        state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0 
        resources_max.walltime = 36:00:00
        resources_default.neednodes = dual
        resources_default.nodes = 1
        resources_default.walltime = 24:00:00
        mtime = Tue Feb  5 14:26:44 2008
        resources_assigned.ncpus = 0
        resources_assigned.nodect = 8
        enabled = True
        started = True

Because the large and huge queues are small, access to them is restricted to prevent misuse. If you need access to more than 16GB of memory per node, send an e-mail with justification to the admin mailing list.

Important! Batch scripts must specify the walltime and memory limits they request; an attempt to submit a script without both of these limits specified will result in an error. The memory limit can be specified either as pmem=<request> or as vmem=<request>. The first form specifies the memory limit per core, while the second specifies the total virtual memory for the job on a single node. The first form is often more convenient for MPI jobs, while the second is more suitable for OpenMP and serial jobs.

The memory specification in <request> is an integer number followed by a unit, for example 1500mb or 2gb. Do not use a decimal value: 1.5gb is a mistake; use 1500mb instead.
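
As a sketch of the two forms, an 8-process MPI job might request its memory per core with pmem, while a serial or OpenMP job on the same node might request the node total with vmem; the walltime and sizes below are placeholders chosen to stay within the normal queue limits described next:

#PBS -l nodes=1:ppn=8,walltime=04:00:00,pmem=1900mb
#PBS -l nodes=1:ppn=8,walltime=04:00:00,vmem=15200mb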

Important! Memory Limits

Normal queue nodes do not have exactly 16GB of memory; they have slightly less, only 16,000MB (1GB=1024MB). Thus, if you request ppn=8,pmem=2gb, your job will never run - there will never be a node with that much memory. While you can request ppn=8,pmem=2000mb, this is not a good idea either, as it leaves no memory for the operating system. We recommend that you never request more than 8*1900mb=15200mb in total on a normal queue node, or more than 8*3900mb=31200mb on a large queue node. Of course, since the nodes have shared memory, requests like ppn=4,pmem=3800mb, ppn=2,pmem=7600mb, or ppn=1,pmem=15200mb are perfectly fine.

Managing Data Files in Batch Scripts

The recommended way to read a large data file from within a batch job is to first stage the file to the local disk attached to the worker before your application opens the file. Each worker has a staging area called $COSMOS_DIR. To avoid network congestion, under no circumstances should a batch script transfer large amounts of data (> 100MB) from the home area. All large data files must be located on the /lustre filesystem. The home area is provided for keeping source files and additional libraries, compiling executables, making plots, etc. All analysis of large data files must be done on the /lustre filesystem.
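
A minimal staging pattern inside a batch script might look like the lines below; the file and directory names are placeholders, and rsync is assumed to be available on the worker nodes:

# stage input data from /lustre to the worker's local staging area
rsync -a /lustre/myproject/myInputDataDir $COSMOS_DIR/
# ... run the application against $COSMOS_DIR/myInputDataDir ...
# copy results back to /lustre before the script exits; $COSMOS_DIR is wiped between jobs
rsync -a $COSMOS_DIR/myOutputDataFile /lustre/myproject/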

Submitting and Monitoring a Batch Job

Batch scripts are submitted for execution using the PBS qsub command. See the Unix man page and the PBS documentation for a description of all of the command line options for qsub. The example batch script specifies the required number of nodes and walltime on the #PBS -l line, so they need not be repeated as arguments to the qsub command. Use qsub to submit the example script to PBS:

$ qsub batch-example
87.fulla.fnal.gov
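
Options given on the qsub command line generally take precedence over the corresponding #PBS lines in the script, which is convenient for one-off changes; for example, to submit the same script with a longer walltime:

$ qsub -l walltime=00:30:00 batch-example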

Use the qstat command to view queued jobs.

$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1689.fulla                CL6_512          drudd                  0 Q normal         
1695.fulla                run.sh           vanconant       00:01:52 R normal      
Status "R" for a job indicates it is running. Status "Q" indicate a job is queued and waiting to run. See the qstat man pages for a descrition of all the command line options available.

A graphical display of cluster status is available with the command:

pbstop

Another useful command to check the status of your job is:

/usr/local/maui/bin/checkjob job_number
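
For example, using one of the job ids from the qstat listing above:

/usr/local/maui/bin/checkjob 1695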
