batchq 7 NIH batch queuing system 26 Sept 1992
NAME
batchq - description of the batch queue system on the HP-UX machines
COMMANDS
submit QUEUE [ JOBID ]                          submit a job to a queue
qb TOKEN [opts] | [opts] QUEUE [ QUEUE ... ]    status for one or more queues
dq QUEUE                                        remove queue entries
QUEUE :: the name of a batch queue; prompted for if not supplied
JOBID :: used instead of the timestamp to name the job script and log file
TOKEN :: a predefined collection of queue names
opts :: additional options; see qb man page
See submit(1), qb(1), dq(1)
LOCAL SPECIFICS
The recommended "run time" for a single host batch job is no more than
about 8-12 hours; this provides the most equitable access to the
available queues, and avoids the loss of more than 8-12 hours of work
due to hardware or network failures. Nominal queue time limits are
listed for each queue with the qb command; when a job exceeds that
limit, you and the sysadmins will receive a mail message identifying the
job. Limits for parallel queues are much shorter, since the same
calculation should be about 3.5 times faster than for a single
processor. Since there are limited parallel queues at present, please do
NOT submit jobs to multiple parallel queues. Note that the parallel
queues are somewhat more sensitive to job failures; users are strongly
advised to test all input scripts on a single processor queue (with at
least a 1/4 reduction in the number of steps).
There are currently about 16 queues, most of them on workstations in
NIH/DCRT/LSB, Bldg 12A. This is subject to change as hosts, especially
those used for parallel calculations, are added (or possibly taken
down). The queue names correspond to the simple hostnames, e.g. par5,
par11, and deimos. (The last of these resides in the FDA Biophysics Lab;
not all "visitor" accounts are honored on those machines.) The queue
data resides in /batch on each host, whether or not the host has a
queue; the 'master' copy of the queue database resides in the /batch
directory on par10.
Current configuration:
Parallel queues (additional hosts):
par0 (par1f, par2f, par3f)
par11 (par12f, par13f, par14f)
j201 (j201)
t2 Terra 16 processors
Single processor queues:
par4 par5 par6 par8 par9 par15(+) par16(+) par17(*) par18
(+) no "large" CHARMM jobs (*) for "large" jobs
Private queues: Yong Lee (UID yongslee)
par7 j202
FDA Biophysics Lab queues:
phobos deimos europa ceres titan(*)
(*) multiprocessor; requires charmm-pvm
SYNOPSIS
The batch system permits controlled usage of Unix computers via a queue
of job entries; all jobs for a given user are run in the order
submitted, but the entries are dynamically interleaved with the jobs of
other users submitted to the same queue. The net effect is that each
user in a queue gets one job run before the user owning the current (or
first) job gets another job run. This provides more equitable access to
a computing resource than a simple first-in, first-out list. The batch
server process is actually a script run as a root 'cron' process every 4
minutes, and starts automatically at boot time.
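The interleaving rule described above can be modelled in a few lines.
This is only an illustrative sketch in Python, not the batch server
itself (which is a /bin/csh script); the function name interleave and
the sample user and job names are hypothetical. It assumes the queue is
a list of (user, jobid) pairs in submission order, and it computes the
run order in one pass, whereas the real system interleaves entries
dynamically as jobs are submitted.

```python
from collections import OrderedDict, deque


def interleave(entries):
    """Return a run order in which each user gets one job per round.

    entries: list of (user, jobid) tuples in submission order.
    Jobs of a single user keep their submission order.
    """
    # Group jobs per user, remembering the order in which users first appear.
    per_user = OrderedDict()
    for user, jobid in entries:
        per_user.setdefault(user, deque()).append(jobid)

    order = []
    while per_user:
        # One round: every user still waiting gets exactly one job run.
        for user in list(per_user):
            jobs = per_user[user]
            order.append((user, jobs.popleft()))
            if not jobs:
                del per_user[user]
    return order


if __name__ == "__main__":
    submitted = [
        ("alice", "a1"), ("alice", "a2"), ("alice", "a3"),
        ("bob", "b1"), ("carol", "c1"), ("bob", "b2"),
    ]
    for user, jobid in interleave(submitted):
        print(user, jobid)
```

With a simple first-in, first-out list, alice's three jobs would all run
before bob's first; under the interleaved order bob and carol each get a
turn after alice's first job.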
Batch jobs are scripts using /bin/csh, with commands entered by the user
embedded in a common framework which provides, in a log file, job
information such as: start and stop times, host name, queue name,
echoed commands, and any output not redirected. The script is built by
the 'submit' command, which also creates an entry in the proper queue;
both the script and the log file produced when the job is run reside in
whatever directory the user was in when the 'submit' command was
invoked.
The status of one, some, or all queues may be monitored with the 'qb'
command, which reports the status of the csh process started by the
job script for any running jobs, and all processes belonging to the user
who submitted the running job. Note that the csh process and any
processes spawned belong to the user who submitted the job, as do any
files created. The submitter of a job may thus use the kill command to
terminate the job or any processes belonging to the job. The 'dq'
command may also be used to kill the active job; it also removes
pending entries.
The queue data resides in the /batch directory. The file 'batch_queues'
contains the data describing the queues: the queue name, the host name,
and the pathname of the directory containing the queue file itself.
Each active /batch directory contains a 'queue.txt' file, which is where
'submit' places the queue entries. Only the superuser can enable or
disable a queue on a given host, although it may be disabled on a
temporary basis by placing the string 'ON STRIKE' in the queue.txt file;
the current job finishes, but no more are run until the string is
removed. Note that the queue.txt file should contain at least 1 line of
text with 3 words, with only a single space between words. The default
string for an idle queue is 'no job running', and the job data supplied
by submit (userid, jobid, working directory) is in the same format. The
only exception is the 'ON STRIKE' string, but it must *not* be the first
line.
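As a rough illustration of the queue.txt rules just described, the
following Python sketch checks a file's contents and lists its job
entries. It is not part of the batch system; the function name
parse_queue and the sample entries are hypothetical. It assumes that
every line other than the 'ON STRIKE' marker is either the idle string
or a "userid jobid directory" entry, and that blank lines can be
ignored.

```python
IDLE = "no job running"
STRIKE = "ON STRIKE"


def parse_queue(text):
    """Validate queue.txt contents; return strike state and job entries."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("queue.txt needs at least one line")
    if STRIKE in lines[0]:
        raise ValueError("'ON STRIKE' must not be the first line")

    on_strike = False
    entries = []
    for ln in lines:
        if STRIKE in ln:
            # The marker is the only line exempt from the 3-word format.
            on_strike = True
            continue
        words = ln.split(" ")
        # Exactly three words, single spaces only (no empty fields).
        if len(words) != 3 or "" in words:
            raise ValueError("expected 3 single-spaced words: %r" % ln)
        if ln != IDLE:
            entries.append(tuple(words))  # (userid, jobid, working directory)
    return {"on_strike": on_strike, "entries": entries}


if __name__ == "__main__":
    # Hypothetical contents: two entries, queue temporarily disabled.
    sample = "rvenable run1 /u/rvenable/md\nyongslee test2 /u/yongslee/min\nON STRIKE\n"
    print(parse_queue(sample))
```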
USER SETUP
In order to use the queuing system, each user must have the following:
- the directory /usr/local/bin in the [command search] "path" variable
- the following entries in the file ~/.rhosts (user's home directory),
plus machines normally used interactively (e.g. par10):
par0f.mgsl.dcrt.nih.gov
par0.mgsl.dcrt.nih.gov
par11f.mgsl.dcrt.nih.gov
par11.mgsl.dcrt.nih.gov
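The two requirements above can be checked before a first submission.
The Python sketch below is only an illustration and is not supplied
with the batch system; the function name missing_setup is hypothetical.
It assumes the search path is given as a colon-separated PATH string,
and that each ~/.rhosts line begins with a host name.

```python
REQUIRED_DIR = "/usr/local/bin"
REQUIRED_HOSTS = [
    "par0f.mgsl.dcrt.nih.gov",
    "par0.mgsl.dcrt.nih.gov",
    "par11f.mgsl.dcrt.nih.gov",
    "par11.mgsl.dcrt.nih.gov",
]


def missing_setup(path_value, rhosts_text):
    """Return a list of human-readable problems; empty if setup looks fine."""
    problems = []
    if REQUIRED_DIR not in path_value.split(":"):
        problems.append("path lacks " + REQUIRED_DIR)
    # The first word of each non-blank .rhosts line is taken as the host.
    listed = {ln.split()[0] for ln in rhosts_text.splitlines() if ln.strip()}
    for host in REQUIRED_HOSTS:
        if host not in listed:
            problems.append(".rhosts lacks " + host)
    return problems


if __name__ == "__main__":
    import os

    rhosts = os.path.expanduser("~/.rhosts")
    text = open(rhosts).read() if os.path.exists(rhosts) else ""
    for problem in missing_setup(os.environ.get("PATH", ""), text):
        print(problem)
```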
FILES
/usr/local/bin:
    qb              query batch queues; user command
    submit          prepares script and queue entry; user command
    dq              removes queue entries
/batch:
    batch_queues    the list of active queues
    queue.txt       queue entries
SEE ALSO
qb(1), submit(1), dq(1), kill(1), remsh(1), charmm(7)
AUTHORS
Initial Aegis batch system by BR Brooks, with contributions by RM Venable.
Unix /bin/csh translation and enhancements by RM Venable.
BUGS
You tell me: rvenable@deimos.cber.nih.gov
Information and HTML Formatting Courtesy of:
NHLBI/LBC Computational Biophysics Section
FDA/CBER/OVRR Biophysics Laboratory