LoBoS Pack Software

Version 0.1

 

Document Revision History

Date      Author  Description
07/17/97  EMB     Initial Version
10/09/97  EMB     Recent additions to code

 

1. Description

 

The queue manager is responsible for providing users with an interface to the LoBoS system which:

 

  1. Provides users with the ability to submit and cancel jobs
  2. Monitors the health of the individual nodes and network connections
  3. Dispatches jobs to LoBoS based on availability and other criteria
  4. Monitors system, queue and job status

 

The following assumptions are inherent in the design of the queueing system:

 

  1. Each master and processing node has a unique IP address accessible to the net (i.e. unrestricted)
  2. Each node is connected to the NIH network
  3. Each node is equipped with Network File Service (NFS) software

 

The system manages resources using three distinct pieces of code:

 

Program         Description                          Configuration file        Output
Pulse           Nodes exchange datagrams reporting   /lobos/config/pulse.conf  /lobos/node_status
                on CPUs, network links, disk space   /lobos/config/nodes       /lobos/node.log
                                                     /lobos/config/topology
lobosq          Accepts job requests and             /lobos/config/nodes       /lobos/job_status
                dispatches them                      /lobos/config/topology    /lobos/job.log
LoBoS Web Page  Displays of system and job status    /lobos/config/nodes       Display
                                                     /lobos/config/topology
                                                     /lobos/node_status
                                                     /lobos/job_status

 

 

2. Commands

 

In general, when a job is submitted the following sequence occurs:

    1. A job file is created in the current directory with a name of the form username.yymmddhhmmss
    2. The job file is then copied (rcp) to the master nodes.
    3. When the job is queued for execution, a file username.yymmddhhmmss.run is created, and output from the job appears in username.yymmddhhmmss.log
    4. When the job completes, username.yymmddhhmmss.run is deleted.
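
The naming convention in the sequence above can be sketched in Python. This is only an illustration: the function names job_id and job_files are hypothetical and are not part of lobosq.

```python
from datetime import datetime

def job_id(username, when):
    """Build a job identifier of the form username.yymmddhhmmss."""
    return f"{username}.{when.strftime('%y%m%d%H%M%S')}"

def job_files(job):
    """Return the file names associated with a job (job file, .run, .log)."""
    return {"job": job, "run": job + ".run", "log": job + ".log"}

# Example timestamp chosen for illustration only.
stamp = datetime(1997, 10, 8, 16, 18, 3)
jid = job_id("billings", stamp)
print(jid)
print(job_files(jid))
```

The .run file exists only while the job is queued or running, so its presence could serve as a simple "job still active" test.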

 

2.1 To Submit a job

 

Type:

lobosq <CR>

Enter the commands…

<ctrl>-d

 

or

 

lobosq < filename

 

2.2 To cancel a job

 

Type:

lobosq can username.yymmddhhmmss <CR>

 

2.3 To check on job status

 

Use the web to monitor the job prior to execution.

Use the username.yymmddhhmmss.log file to monitor the job during execution.

 

3. LoBoS Queue Directives

 

The following directives may be used in the scripts submitted to LoBoS.

 

Directive    Min  Max  Default  Units    Description
#ETIME       1         60       Minutes  Estimated time of job
#CONFidence  0    100  90       Percent  Confidence that ETIME is correct (future)
#NPROCS      1    128  4        CPUs     Number of requested processors
                                         Syntax: n or nlow-nhigh (future)
#NPROC0                None     CPU      Requested CPU to use for NODE0 (future)
#TPIP        1    2    1        Threads  Number of threads per IP
#ADMIN                 None              Schedules job immediately, permanently (root use only)
#RTIME                          Time     Schedule job at the requested time (root use only)
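
As a rough illustration of how the numeric directives might be read from a submitted script, the following Python sketch applies the defaults and limits listed in the table. It assumes each directive appears on its own line as "#NAME value" separated by whitespace; the exact syntax accepted by lobosq is not specified here, and the names LIMITS and parse_directives are hypothetical.

```python
# (min, max, default) copied from the directive table; None means no limit is listed.
LIMITS = {
    "ETIME": (1, None, 60),
    "NPROCS": (1, 128, 4),
    "TPIP": (1, 2, 1),
}

def parse_directives(script):
    """Return directive values for a script, falling back to the table defaults."""
    values = {name: default for name, (_, _, default) in LIMITS.items()}
    for line in script.splitlines():
        parts = line.split()
        # Assumed syntax: exactly "#NAME value" on a line of its own.
        if len(parts) != 2 or not parts[0].startswith("#"):
            continue
        name = parts[0][1:]
        if name not in LIMITS:
            continue
        low, high, _ = LIMITS[name]
        value = int(parts[1])
        if value < low or (high is not None and value > high):
            raise ValueError(f"{name}={value} is outside the allowed range")
        values[name] = value
    return values

example = "#ETIME 120\n#NPROCS 8\necho hello\n"
print(parse_directives(example))
```

Directives marked (future) or (root use only) are left out of the sketch, since their behaviour is not described in enough detail to model.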

           

 

 

 

4. Configuration Files

 

These configuration files collectively define the role and resources of nodes
as well as the network topology. The files must be identical on all nodes of
the system. They comprise the database available to each node to determine its
role and network connections.

 

 

4.1 Node Configuration

 

The file is called /lobos/config/nodes

/lobos/config/nodes----------------------------------------------------------------

# Description of the nodes within the LoBoS system

#

# To do:

# Must add hub type

# type 0=compute;non-zero is rank of master node

# The format for the file is:

# Host CPU- Logical Physical

#id xxx.xxx.xxx.xxx name type CPUs Type clock speed RAM swap links x y x y

1 165.112.185.1 pe1 0 2 1 200 1.0 128 512 2 0.878 0.437 0.160 0.850

2 165.112.185.2 pe2 0 2 1 200 1.0 128 512 2 0.804 0.460 0.120 0.850

3 165.112.185.3 pe3 0 2 1 200 1.0 128 512 2 0.864 0.510 0.080 0.850

4 165.112.185.4 pe4 0 2 1 200 1.0 128 512 2 0.786 0.519 0.040 0.850

65 165.112.184.11 master1 1 2 1 200 1.0 128 128 2 0.250 0.850 0.260 0.850

66 165.112.184.12 master2 2 2 1 200 1.0 128 128 2 0.400 0.900 0.340 0.850

67 165.112.184.13 master3 3 2 1 200 1.0 128 128 2 0.600 0.900 0.260 0.680

68 165.112.184.14 master4 4 2 1 200 1.0 128 128 2 0.750 0.850 0.340 0.680

69 999.999.999.999 hub1 0 0 1 0 0.0 0 0 0 0.400 0.400 0.300 0.100

70 999.999.999.999 hub2 0 0 1 0 0.0 0 0 0 0.600 0.400 0.300 0.300

71 999.999.999.999 Gigabit 0 0 1 0 0.0 0 0 0 0.100 0.150 0.300 0.500

--------------------------------------------------------------------------------
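
A minimal Python sketch of reading this file is given below. The field names are one reading of the header comment (15 whitespace-separated columns, ending in a logical x y pair and a physical x y pair); they are assumptions, as is the parse_nodes helper itself.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    ip: str
    name: str
    node_type: int      # 0 = compute; non-zero is rank of master node
    cpus: int
    cpu_type: int
    clock: int
    speed: float
    ram: int
    swap: int
    links: int
    logical_xy: tuple
    physical_xy: tuple

def parse_nodes(text):
    """Parse node rows, skipping comments, rulers and anything without 15 fields."""
    nodes = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 15 or f[0].startswith("#"):
            continue
        nodes.append(Node(int(f[0]), f[1], f[2], int(f[3]), int(f[4]), int(f[5]),
                          int(f[6]), float(f[7]), int(f[8]), int(f[9]), int(f[10]),
                          (float(f[11]), float(f[12])),
                          (float(f[13]), float(f[14]))))
    return nodes

sample = """# type 0=compute;non-zero is rank of master node
1 165.112.185.1 pe1 0 2 1 200 1.0 128 512 2 0.878 0.437 0.160 0.850
65 165.112.184.11 master1 1 2 1 200 1.0 128 128 2 0.250 0.850 0.260 0.850
"""
nodes = parse_nodes(sample)
masters = [n.name for n in nodes if n.node_type != 0]
print(masters)
```

Note that hubs currently carry type 0 as well (see the "To do" comment in the file), so a type of 0 alone does not distinguish a compute node from a hub.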

 

4.2 Network Topology

 

The file is called /lobos/config/topology

/lobos/config/topology------------------------------------------------------

# Description of the topology within the LoBoS system

#

#

# The format for the file is:

# name1 name2 IP1 IP2

master1 hub1 165.112.184.11 165.112.184.1

master2 hub1 165.112.184.12 165.112.184.1

master3 hub1 165.112.184.13 165.112.184.1

master4 hub1 165.112.184.14 165.112.184.1

master1 hub2 165.112.184.11 165.112.184.1

master2 hub2 165.112.184.12 165.112.184.1

master3 hub2 165.112.184.13 165.112.184.1

master4 hub2 165.112.184.14 165.112.184.1

pe1 pe2 165.112.185.1 165.112.185.2

pe2 pe3 165.112.185.2 165.112.185.3

pe3 pe4 165.112.185.3 165.112.185.4

pe4 pe5 165.112.185.4 165.112.185.5

pe1 hub1 165.112.185.1 999.999.999.999

pe3 hub1 165.112.185.3 999.999.999.999

pe2 hub2 165.112.185.2 999.999.999.999

pe4 hub2 165.112.185.4 999.999.999.999

pe37 Gigabit 165.112.185.37 999.999.999.999

pe38 Gigabit 165.112.185.38 999.999.999.999

pe39 Gigabit 165.112.185.39 999.999.999.999

pe40 Gigabit 165.112.185.40 999.999.999.999

--------------------------------------------------------------------------------
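
The topology file can be turned into a neighbour table with a few lines of Python. The sketch below assumes that every row describes a bidirectional link and ignores the two IP columns; parse_topology is a hypothetical helper, not part of the LoBoS software.

```python
from collections import defaultdict

def parse_topology(text):
    """Map each node name to the set of names it is linked to."""
    neighbours = defaultdict(set)
    for line in text.splitlines():
        f = line.split()
        # Data rows have exactly four fields: name1 name2 IP1 IP2.
        if len(f) != 4 or f[0].startswith("#"):
            continue
        name1, name2, _ip1, _ip2 = f
        neighbours[name1].add(name2)
        neighbours[name2].add(name1)  # assumption: links work in both directions
    return neighbours

sample = """# name1 name2 IP1 IP2
pe1 pe2 165.112.185.1 165.112.185.2
pe2 pe3 165.112.185.2 165.112.185.3
pe2 hub2 165.112.185.2 999.999.999.999
"""
links = parse_topology(sample)
print(sorted(links["pe2"]))
```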

 

 

4.3 Status

 

/lobos/node_status---------------------------------------------------------------

# Node status file

# Version 0.10 Date: Sat Sep 20 13:53:04 1997

node 01:53:04 pe1 0

node 01:53:04 pe2 0

node 01:53:04 pe3 0

node 01:53:04 pe4 0

node 01:53:04 master1 0

node 01:53:04 master2 0

node 01:53:04 master3 0

node 01:53:04 master4 0

node 01:53:04 hub1 0

node 01:53:04 hub2 0

node 01:53:04 Gigabit 0

link 10:50:24 pe1 pe2 165.112.185.1 165.112.185.2 1

link 10:50:24 pe1 hub1 165.112.185.1 231.231.231.231 1

link 10:50:24 pe2 pe3 165.112.185.2 165.112.185.3 1

link 10:50:24 pe2 hub2 165.112.185.2 231.231.231.231 1

link 10:50:24 pe3 pe4 165.112.185.3 165.112.185.4 1

link 10:50:24 pe3 hub1 165.112.185.3 231.231.231.231 1

link 10:50:24 pe4 pe5 165.112.185.4 165.112.185.5 1

link 10:50:24 pe4 hub2 165.112.185.4 231.231.231.231 1

link 10:50:24 master1 master1 165.112.184.11 165.112.184.11 1

link 10:50:24 master1 hub1 165.112.184.11 231.231.231.231 1

link 10:50:24 master1 hub2 165.112.184.11 231.231.231.231 1

link 10:50:24 master2 master1 165.112.184.12 165.112.184.11 1

link 10:50:24 master2 hub1 165.112.184.12 231.231.231.231 1

link 10:50:24 master2 hub2 165.112.184.12 231.231.231.231 1

link 10:50:24 master3 hub1 165.112.184.13 231.231.231.231 1

link 10:50:24 master3 hub2 165.112.184.13 231.231.231.231 1

link 10:50:24 master4 hub1 165.112.184.14 231.231.231.231 1

link 10:50:24 master4 hub2 165.112.184.14 231.231.231.231 1

--------------------------------------------------------------------------------
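
A short Python sketch of reading the status file follows. It relies only on the two record shapes visible in the listing ("node time name status" and "link time name1 name2 ip1 ip2 status") and does not interpret the status codes, whose meaning is not defined here. The parse_status name is hypothetical.

```python
def parse_status(text):
    """Split a node_status listing into node records and link records."""
    nodes, links = {}, {}
    for line in text.splitlines():
        f = line.split()
        if not f:
            continue
        if f[0] == "node" and len(f) == 4:
            # name -> (time of report, status code)
            nodes[f[2]] = (f[1], int(f[3]))
        elif f[0] == "link" and len(f) == 7:
            # (name1, name2) -> (time of report, status code)
            links[(f[2], f[3])] = (f[1], int(f[6]))
    return nodes, links

sample = """# Node status file
node 01:53:04 pe1 0
node 01:53:04 master1 0
link 10:50:24 pe1 pe2 165.112.185.1 165.112.185.2 1
"""
node_state, link_state = parse_status(sample)
print(node_state)
print(link_state)
```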

 

 

4.4 Pulse configuration

 

The file is called /lobos/pulse_config

/lobos/pulse_config---------------------------------------------------------------

# Configuration information for reporting status of the machine

#

# keyword parameters

HeartBeat 61

NodeTimeOut 90.

LinkTimeOut 90.

MasterTimeOut 90.

--------------------------------------------------------------------------------
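
The keyword/parameter layout lends itself to a very small parser, sketched here in Python. All values are read as floating-point numbers, which accepts both "61" and "90."; the units of the values are not stated in the file and none are assumed. parse_pulse_config is a hypothetical name.

```python
def parse_pulse_config(text):
    """Read 'keyword parameter' lines into a dictionary of floats."""
    settings = {}
    for line in text.splitlines():
        f = line.split()
        # Skip comments, rulers and anything that is not a keyword/value pair.
        if len(f) != 2 or f[0].startswith("#"):
            continue
        settings[f[0]] = float(f[1])
    return settings

sample = """# keyword parameters
HeartBeat 61
NodeTimeOut 90.
LinkTimeOut 90.
MasterTimeOut 90.
"""
config = parse_pulse_config(sample)
print(config)
```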

 

5. Software

 

5.1 pulse

 

The pulse software is implemented as a daemon which runs on each of the LoBoS
nodes. On a compute node the pulse daemon reports on each processor and network
connection defined in the configuration files for the node. On a master node
the pulse daemon listens for reports from other nodes and updates the file
/lobos/node_status. Eventually the master nodes will converse and provide a
fail-over capability.
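
One way a master node might decide that a compute node has gone quiet is sketched below in Python. This is an assumption about how a timeout such as NodeTimeOut could be applied (a node is overdue when no report has arrived within the timeout); the actual logic of the pulse daemon is not described in this document, and overdue_nodes is a hypothetical helper.

```python
def overdue_nodes(last_report, now, timeout):
    """Return the names of nodes whose last report is older than the timeout.

    last_report maps node name -> time of the most recent report; now and
    timeout are expressed in the same units as those times.
    """
    return sorted(name for name, seen in last_report.items()
                  if now - seen > timeout)

# Illustrative values only.
last_report = {"pe1": 100.0, "pe2": 30.0, "pe3": 45.0}
print(overdue_nodes(last_report, now=130.0, timeout=90.0))
```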

 

6. Lobosq

 

 

6.1 Job status

Job dispatching and assignment to individual CPUs are performed by lobosq,
which generates the following status file.

The file is called /lobos/job_status

/lobos/job_status-------------------------------------------------------------------------------
# Job Status Date:Wed Oct 8 16:23:00 1997
# job id                  uid  gid  pri CPUs Start Time           Est Time  Act Time  cpus
#xxxxxxxxxxxxxxxxxxxxxxxx xxxx xxxx xxx xxxx mmm dd hh:mm:ss      hhh:mm:ss hhh:mm:ss xxxxxxxxxxxxx
billings.971008161803     260  9000 1   2    1997 Oct 08 18:23:00 1:00:00   0:00:00   1,3
billings.971008161805     260  9000 1   2    1997 Oct 08 18:23:00 1:00:00   0:00:00   2,4
billings.971008162411     260  9000 1   2    1997 Oct 08 16:23:00 1:00:00   0:00:00   2,4
------------------------------------------------------------------------------------------------
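
A single row of the job status file can be unpacked as in the Python sketch below. It follows the layout of the sample rows (which include a year before the month, unlike the template line) and assumes every row carries a comma-separated CPU list; parse_job_line is a hypothetical helper.

```python
from datetime import datetime

def parse_job_line(line):
    """Split one job_status row into named fields."""
    f = line.split()
    return {
        "job_id": f[0],
        "uid": int(f[1]),
        "gid": int(f[2]),
        "pri": int(f[3]),
        "ncpus": int(f[4]),
        # Year, month name, day and time occupy four fields in the sample rows.
        "start": datetime.strptime(" ".join(f[5:9]), "%Y %b %d %H:%M:%S"),
        "est_time": f[9],
        "act_time": f[10],
        "cpus": [int(c) for c in f[11].split(",")],
    }

row = "billings.971008161803 260 9000 1 2 1997 Oct 08 18:23:00 1:00:00 0:00:00 1,3"
job = parse_job_line(row)
print(job["job_id"], job["start"], job["cpus"])
```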