batchq 7 NIH batch queuing system 26 Sept 1992

NAME

batchq - description of the batch queue system on the HP-UX machines

COMMANDS


submit QUEUE [ JOBID ]                           submit a job to a queue
qb TOKEN [opts] | [opts] QUEUE [ QUEUE ... ]     status for one or more queues
dq QUEUE                                         remove queue entries

QUEUE :: the name of a batch queue; prompted for if not supplied
JOBID :: used instead of the timestamp to name the job script and log file
TOKEN :: a predefined collection of queue names
opts :: additional options; see qb man page

See submit(1), qb(1), dq(1)

LOCAL SPECIFICS



The recommended "run time" for a single host batch job is no more than about 8-12 hours; this provides the most equitable access to the available queues, and avoids the loss of more than 8-12 hours of work due to hardware or network failures. Nominal queue time limits are listed for each queue by the qb command; when a job exceeds that limit, you and the sysadmins will receive a mail message identifying the job.

Limits for parallel queues are much shorter, since the same calculation should run about 3.5 times faster than on a single processor. Since there are only a few parallel queues at present, please do NOT submit jobs to multiple parallel queues. Note that the parallel queues are somewhat more sensitive to job failures; users are strongly advised to test all input scripts on a single processor queue first (with at least a 1/4 reduction in the number of steps).

There are currently about 16 queues, principally on machines in the group of workstations in NIH/DCRT/LSB, Bldg 12A. This is subject to change, since the configuration of hosts, especially for parallel calculations, may change as hosts are added (or possibly go down). The queue names correspond to the simple hostnames, e.g. par5, par11, and deimos. (The last of these resides in the FDA Biophysics Lab; not all "visitor" accounts are honored on those machines.) The queue data resides on each host in /batch whether it has a queue or not; the 'master' copy of the queue database resides in the /batch directory on par10.

Current configuration:

Parallel queues (additional hosts):
par0 (par1f, par2f, par3f)
par11 (par12f, par13f, par14f)
j201 (j201)
t2 Terra 16 processors

Single processor queues:
par4 par5 par6 par8 par9 par15(+) par16(+) par17(*) par18
(+) no "large" CHARMM jobs (*) for "large" jobs

Private queues: Yong Lee (UID yongslee)
par7 j202

FDA Biophysics Lab queues:
phobos deimos europa ceres titan(*)
(*) multiprocessor; requires charmm-pvm


SYNOPSIS



The batch system permits controlled usage of Unix computers via a queue of job entries; all jobs for a given user are run in the order submitted, but the entries are dynamically interleaved with the jobs of other users submitted to the same queue. The net effect is that each user in a queue gets one job run before the user owning the current (or first) job gets another job run. This provides more equitable access to a computing resource than a simple first-in, first-out list.

The batch server process is actually a script run as a root 'cron' process every 4 minutes, and starts automatically at boot time. Batch jobs are scripts using /bin/csh, with commands entered by the user embedded in a common framework which provides, in a log file, job information such as: start and stop times, host name, queue name, echoed commands, and any output not redirected. The script is built by the 'submit' command, which also creates an entry in the proper queue; both the script and the log file produced when the job is run reside in whatever directory the user was in when the 'submit' command was invoked.

The status of one, some, or all queues may be monitored with the 'qb' command, which reports the status of the csh process started by the job script for any running jobs, and all processes belonging to the user who submitted the running job. Note that the csh process and any processes spawned belong to the user who submitted the job, as do any files created. The submitter of a job may thus use the kill command to terminate the job or any processes belonging to the job. The 'dq' command may also be used to kill the active job; it also removes pending entries.

The queue data resides in the /batch directory. The file 'batch_queues' contains the data describing the queues: the queue name, the host name, and the pathname of the directory containing the queue file itself. Each active /batch directory contains a 'queue.txt' file, which is where 'submit' places the queue entries.

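
The interleaving of queue entries described at the start of this section can be illustrated with a short sketch. This is not the batch server's own code (the server is a csh script); it is a minimal Python model that assumes entries are plain (user, jobid) pairs and that users take turns in the order of their first submission. The function name interleave and the sample user names are hypothetical.

```python
from collections import OrderedDict, deque


def interleave(entries):
    """Reorder (user, jobid) entries round-robin by user.

    entries is a list of (user, jobid) tuples in submission order.
    Each user's jobs keep their submitted order, but every user with a
    pending job gets one turn before any user gets a second one.
    """
    # Group jobs per user, remembering the order in which users first appear.
    per_user = OrderedDict()
    for user, jobid in entries:
        per_user.setdefault(user, deque()).append((user, jobid))

    ordered = []
    while per_user:
        # One pass gives each remaining user exactly one job.
        for user in list(per_user):
            jobs = per_user[user]
            ordered.append(jobs.popleft())
            if not jobs:
                del per_user[user]
    return ordered


if __name__ == "__main__":
    submitted = [("ann", "a1"), ("ann", "a2"), ("ann", "a3"),
                 ("bob", "b1"), ("cy", "c1"), ("bob", "b2")]
    for user, jobid in interleave(submitted):
        print(user, jobid)
```

With a simple first-in, first-out list, the user "ann" above would run three jobs before anyone else ran one; with the interleaved order, "bob" and "cy" each get a job after ann's first.
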
Only the superuser can enable or disable a queue on a given host, although a queue may be disabled on a temporary basis by placing the string 'ON STRIKE' in its queue.txt file; the current job finishes, but no more jobs are run until the string is removed. Note that the queue.txt file should contain at least 1 line of text with 3 words, with only a single space between words. The default string for an idle queue is 'no job running', and the job data supplied by submit (userid, jobid, working directory) is in the same format. The only exception is the 'ON STRIKE' string, which must *not* be the first line.
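
The queue.txt rules above can be summarized as a small checker. This is a sketch only, not part of the batch system: it assumes that 'ON STRIKE' appears on a line of its own (the text above says only that the string is placed in the file), and the function name check_queue_text is hypothetical.

```python
STRIKE = "ON STRIKE"


def check_queue_text(text):
    """Check the contents of a queue.txt file against the stated rules.

    Returns (on_strike, problems): on_strike is True if an 'ON STRIKE'
    line is present, and problems is a list of rule violations found.
    """
    lines = text.splitlines()
    problems = []
    on_strike = False

    if not lines:
        problems.append("file is empty; at least one 3-word line is expected")

    for number, line in enumerate(lines, start=1):
        # Assumption: the strike marker occupies a whole line.
        if line == STRIKE:
            on_strike = True
            if number == 1:
                problems.append("line 1: 'ON STRIKE' must not be the first line")
            continue
        # Every other line must be exactly 3 words with single spaces.
        words = line.split(" ")
        if len(words) != 3 or "" in words:
            problems.append(
                "line %d: expected 3 words separated by single spaces" % number)

    return on_strike, problems


if __name__ == "__main__":
    print(check_queue_text("no job running\n"))
    print(check_queue_text("no job running\nON STRIKE\n"))
```
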

USER SETUP


In order to use the queuing system, each user must have the following:

- the directory /usr/local/bin in the [command search] "path" variable

- the following entries in the file ~/.rhosts (user's home directory),
plus machines normally used interactively (e.g. par10):

par0f.mgsl.dcrt.nih.gov
par0.mgsl.dcrt.nih.gov
par11f.mgsl.dcrt.nih.gov
par11.mgsl.dcrt.nih.gov
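
A quick way to confirm the setup above is to check both requirements from a script. The following is a sketch, not a tool supplied with the batch system; it assumes the colon-separated PATH environment variable mirrors the csh "path" variable and that each ~/.rhosts line begins with a host name. The function names missing_rhosts and path_has_bin are hypothetical.

```python
import os

# Hosts listed in the USER SETUP section above.
REQUIRED_HOSTS = [
    "par0f.mgsl.dcrt.nih.gov",
    "par0.mgsl.dcrt.nih.gov",
    "par11f.mgsl.dcrt.nih.gov",
    "par11.mgsl.dcrt.nih.gov",
]
BIN_DIR = "/usr/local/bin"


def missing_rhosts(rhosts_text, required=REQUIRED_HOSTS):
    """Return the required hosts that do not appear in the .rhosts text."""
    present = set()
    for line in rhosts_text.splitlines():
        fields = line.split()
        if fields:
            # The first field of each entry is the host name.
            present.add(fields[0].lower())
    return [host for host in required if host not in present]


def path_has_bin(path_value, bin_dir=BIN_DIR):
    """Return True if bin_dir is one of the colon-separated PATH entries."""
    return bin_dir in path_value.split(":")


if __name__ == "__main__":
    rhosts_file = os.path.expanduser("~/.rhosts")
    try:
        with open(rhosts_file) as handle:
            contents = handle.read()
    except OSError:
        contents = ""
    print("missing from .rhosts:", missing_rhosts(contents))
    print("/usr/local/bin in PATH:", path_has_bin(os.environ.get("PATH", "")))
```
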

FILES



/usr/local/bin:

qb        query batch queues; user command
submit    prepares script and queue entry; user command
dq        removes queue entries

/batch:

batch_queues    the list of active queues
queue.txt       queue entries

SEE ALSO



qb(1), submit(1), dq(1), kill(1), remsh(1), charmm(7)

AUTHORS



Initial Aegis batch system by BR Brooks, with contributions by RM Venable.
Unix /bin/csh translation and enhancements by RM Venable.

BUGS



You tell me: rvenable@deimos.cber.nih.gov


Information and HTML Formatting Courtesy of:

NHLBI/LBC Computational Biophysics Section
FDA/CBER/OVRR Biophysics Laboratory