Submitting Jobs via PBS


PBS, the Portable Batch System, is a networked subsystem for submitting, monitoring, and controlling a workload of batch jobs on one or more systems (here, Beagle and Clark).  With PBS, jobs can be scheduled for execution on clark according to scheduling policies that attempt to fully utilize system resources without overcommitting them, while being fair to all users.  For more information about PBS, see the online manual page, which can be viewed by executing the command:

man pbs
Long-running jobs on clark must be submitted to run under PBS.  Short test runs can be run from an interactive login session, but they should be limited to no more than 10 minutes of CPU time.

Jobs are submitted to be run under PBS via the qsub command.  For complete details on using qsub, see the online manual page, which can be viewed by executing the command:
man qsub
A job is represented by a shell script file that contains the commands to be run.  A simple PBS job script file might contain lines like the following:
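#PBS -l cput=10:00:00,ncpus=2,mem=2gb

# The job starts in your home directory, so this path is relative to it.
cd project1 || exit 1

# Run the program from the project directory.
./myprog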
At its simplest, the job script file is just a sequence of commands to be executed, but in general it can be any script.  The #PBS line is a special comment that passes job parameters to PBS.  Specify values for cput (CPU time required), ncpus (number of CPUs required), and mem (amount of main memory required) that are appropriate for your job.  The parameters on the #PBS line can also be given to the qsub command on the command line when the job is submitted, but it is convenient to put them in the job file so that they are not forgotten.
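
As a rough illustration of the command-line form, the sketch below passes the same resource list to qsub with the -l option instead of a #PBS line.  The script name myjob.sh is a made-up placeholder, and the command is only printed, not run, so the snippet also works on a machine without PBS.

```shell
# Resource list taken from the sample job script: CPU time, CPU count, memory.
resources="cput=10:00:00,ncpus=2,mem=2gb"

# Hypothetical script name, used only for illustration.
scriptfile="myjob.sh"

# Build the command line; -l passes a resource list to qsub.
cmd="qsub -l $resources $scriptfile"

# Print instead of executing, so nothing is submitted by accident.
echo "$cmd"
```

To actually submit the job on clark, run the printed command from the directory that holds the script.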

The cput parameter specifies the maximum amount of CPU time that the job will be allowed to consume.  If a job uses more CPU time than the amount specified by cput, it will be terminated by PBS.  If cput is not specified, a default time limit of 10 minutes (10:00) will be used.
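
To make the time format concrete, here is a small sketch that converts a cput-style value to seconds.  It assumes the value is written as hh:mm:ss, mm:ss, or plain seconds; the function name cput_to_seconds is invented for this example and is not part of PBS.

```shell
# Hypothetical helper: convert hh:mm:ss, mm:ss, or ss to a number of seconds.
cput_to_seconds() {
    # Split on ":" and accumulate left to right, multiplying by 60 each step.
    echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

cput_to_seconds 10:00:00   # the sample job script's limit; prints 36000
cput_to_seconds 10:00      # the default limit; prints 600
```

A helper like this can be handy for checking whether a measured run time fits under the limit you plan to request.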

The ncpus resource specification is very important for job scheduling on clark.  Job scheduling is based on the load level of the system and the number of CPUs in use.  The PBS scheduler cannot "look inside" a job to determine how many CPUs it will actually use, so you must tell the scheduler via the ncpus resource.  Please be sure that the value of ncpus is correct for your job.  If you request more CPUs than your job will actually use, your job may wait in the queue even though enough CPUs are available for it.  If you request fewer CPUs than your job actually uses, your job may be cancelled.  If ncpus is not specified, PBS assumes that 1 CPU is needed.  (Note: clark has 64 CPUs.  However, because clark is shared by many users, jobs that request more than 8 CPUs may have to wait longer in the job queue, depending on system load.)
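
The figures in the note above can be turned into a quick sanity check before submitting.  The following is only a sketch: check_ncpus is a made-up helper, not a PBS command, it expects a whole number, and treating requests above 64 as invalid is an assumption based on the CPU count quoted above.

```shell
# Hypothetical helper: classify an ncpus request using the figures quoted above
# (clark has 64 CPUs; requests for more than 8 may wait longer in the queue).
check_ncpus() {
    n="$1"
    if [ "$n" -lt 1 ] || [ "$n" -gt 64 ]; then
        # Assumption: a request outside 1..64 cannot be satisfied on clark.
        echo "invalid"
    elif [ "$n" -gt 8 ]; then
        echo "may wait longer"
    else
        echo "ok"
    fi
}

check_ncpus 2    # prints ok
check_ncpus 16   # prints may wait longer
```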

The mem parameter specifies the maximum amount of main memory that the job is expected to use.  Please specify as accurate an estimate as possible.  If the value of mem is higher than necessary, then your job may wait in the queue, even though CPU and memory resources are available.  If the value of mem is lower than what your job actually uses, system memory may be overcommitted, causing performance problems and possibly causing your job to be terminated.

Once the job script has been created, it can be submitted for execution via the qsub command, as follows:
qsub scriptfile
The qsub command will respond with a line like the following:
nnnn.clark
where nnnn is the job number assigned to the job, and clark is the name of the PBS server to which the job was submitted, i.e., the local system.  The job number can be used to identify your job with other PBS commands, like qstat and qdel.  The job number is also used to identify output files for your job.
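
If you submit jobs from a script, it can be useful to split the identifier into its two parts.  The sketch below uses a made-up identifier, 1234.clark, in place of real qsub output; the variable names are illustrative only.

```shell
# Made-up job identifier in the nnnn.clark form described above.
# On clark you would capture the real one with: jobid=$(qsub scriptfile)
jobid="1234.clark"

jobnum="${jobid%%.*}"   # everything before the first dot
server="${jobid#*.}"    # everything after the first dot

echo "job number: $jobnum"
echo "server:     $server"
```

The full identifier can then be passed to other PBS commands, for example qstat "$jobid" or qdel "$jobid".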

When the job is executed, PBS will create an environment for it that is as much like your normal login environment as possible.  The job will run under your userid, and the initial working directory will be your home directory.  If the files your job uses are not in your home directory, use pathnames relative to your home directory or full pathnames, as appropriate.

After the job completes execution, there will be two output files created in the directory from which the job was submitted:
scriptfile.ennnn  (for output written to the error output stream)

scriptfile.onnnn  (for output written to the standard output stream)
where scriptfile is the name of the script file you specified on the qsub command (or STDIN if you did not specify a script file, and entered commands for the job from standard input), and nnnn is the job number.
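
As a small worked example of this naming scheme, the sketch below builds the two file names from a script name and a job number.  Both values are made up for illustration.

```shell
scriptfile="myjob.sh"   # hypothetical script name
jobnum="1234"           # made-up job number

# Standard output goes to scriptfile.onnnn, error output to scriptfile.ennnn.
stdout_file="${scriptfile}.o${jobnum}"
stderr_file="${scriptfile}.e${jobnum}"

echo "$stdout_file"   # prints myjob.sh.o1234
echo "$stderr_file"   # prints myjob.sh.e1234
```

Once the job has finished, the files can be inspected with ordinary commands such as cat or less.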

Several job queues have been created on clark.  In general, however, you should not specify a queue name when submitting a job unless the system administrator asks you to do so for specific types of jobs.  By default, jobs are routed into an appropriate execution queue based on their resource requirements.  The one exception is short test jobs: a PBS queue called test has been created for test runs that need no more than 10 minutes of CPU time.  You can submit a job to the test queue by specifying the queue name on the qsub command:
qsub -q test scriptfile
Do not specify the ncpus parameter when submitting jobs to the test queue.  Doing so may cause your test jobs to wait much longer in the queue before executing.

Use the qstat command to see which jobs are queued and executing on clark.  For example:
qstat -a
See the man page for qstat for complete details.  

For more information about accessing clark in general, see Accessing Clark.