Setting up OpenPBS To Manage Your Cluster
Contents
- Introduction
- Background
- Setting Up The Cluster
- Compiling the Code
- Setting Up the Execution Nodes
- Setting Up the Scheduler
- Setting Up the Queue Server
- Using OpenPBS
- Further Reading
- Thanks
Introduction
Since the development of Beowulf clusters using Linux running on commodity PCs in the early 1990s, parallel computing has become something of a commodity. Using COTS (commodity off-the-shelf) machines it's very easy to build a cluster of computers to run a variety of parallel codes. Of course, there's nothing holding you back from running multiple serial jobs as well. There are many aspects to cluster (or high performance) computing - these include the cluster architecture, the type of network, the nature of the parallel code and so on. However, an important component of a cluster is the job management system. For a 'cluster' of 2 or 3 machines it's feasible (though not much fun!) to check the status of each machine and run a job on a specific node. However, it means you have to check the load on 3 machines, copy files to and from the machines and so on. In addition, it's possible for a user to hog all the resources on a machine.
The role of job submission software is to simplify and automate the tasks of submitting, scheduling and tracking jobs. There are several job management systems available, but I'll be concentrating on OpenPBS, which is available both as a free download and as a commercial product (which comes with additional support). Not only is it available free, it's also open source. As a result you can dig into the code and modify it to suit your site's requirements. You can also get a Python interface to the OpenPBS API. In this document I discuss the installation and configuration of OpenPBS on our small cluster of 4 machines.
It is important to remember that PBS uses ports 15001-15004 by default, so it is essential that your firewall doesn't block these ports.
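For example, rules along the following lines would let the PBS traffic through a Linux packet filter (shown here with iptables; the interface name eth0, standing in for the internal interface, and the network 192.168.1.0/24 are assumptions you should adjust to your own setup):
# allow the PBS server, MOMs and scheduler to talk to each other over the internal network
/sbin/iptables -A INPUT -i eth0 -s 192.168.1.0/24 -p tcp --dport 15001:15004 -j ACCEPT
/sbin/iptables -A INPUT -i eth0 -s 192.168.1.0/24 -p udp --dport 15001:15004 -j ACCEPT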
Background
As mentioned above I'll
discuss how I set up a 4 node cluster - with all nodes being
classified as time shared and using the default C scheduler for
load balancing. This is quite a mouthful, but things
will be clearer as we progress with the
installation and configuration. Before I delve into the details of
how to go about setting up your job scheduling system I'll point
out some definitions and also set up some paths and variables
that I'll refer to later.
Definitions
The PBS manual defines a number of terms. I'll just go into those which are relevant to this discussion:
- Server: The server is the machine that the job queue will be running on. This is the machine where jobs are submitted. It's good practice to make this machine the only one in the cluster which is connected to the outside world. People shouldn't be able to access the cluster nodes from the outside. But we'll get back to this a little later.
- MOM: A MOM is the term used for the execution program, that is the program which actually executes jobs on the compute nodes. (The word MOM comes from the fact that it is the mother of all jobs on a given node).
- Node: This is the actual physical machine that runs a job. It can have one or more CPUs. An alternative term is execution host. Note that computers running a single copy of an OS (though there may be multiple CPUs) are regarded as a single node, whereas machines running multiple copies of an OS (e.g. an IBM SP) are regarded as multiple nodes.
- Scheduler: This is a program which runs on the server machine and handles the job of deciding which node will get a job. OpenPBS supplies several schedulers, including one which can be scripted to fit your purposes.
- Virtual processors: The number of processors a node is declared to have. The number of VPs can be more than the number of physical processors. Essentially it is virtual processors that are allocated rather than actual nodes. Thus a node declared to have 3 VPs will be able to accept up to 3 jobs (but see below).
- Exclusive nodes: Only one job is allocated to such a node. Until the job ends no new jobs can be run on that node. Also termed a cluster node.
- Time shared node: This term is used to indicate that a host can run multiple jobs regardless of the number of VPs. In such a case specifying virtual processors is meaningless, since multiple jobs will always be allocated to such a node.
- Load balancing: This is a scheduling policy whereby jobs are spread out over multiple time shared hosts to evenly balance the load on each host.
Paths & Variables
For the purpose of this article we'll decide on some variables (like host names, home directories, etc.):
- The server will be called zeus
- The internal interface for zeus will be 192.168.1.1 and will be called zeus1.
- The nodes will be called zeus2, zeus3, etc.
- The variable $PBS_HOME will be set to /usr/local/spool/pbs
rcp or ssh?
PBS supports both rcp and ssh for file transfers among the machines in the cluster. However, since part of our cluster (the server / login node) was connected to the outside world, the use of rcp on that machine would guarantee our being hacked. So in this document I ignore the rcp option and only consider ssh. Unless you have a totally isolated cluster (i.e., no connection to outside networks) I can see no reason to run the r* services.
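If you do end up needing passwordless ssh between the machines (say, because home directories are not shared), a minimal sketch with OpenSSH looks like the following; run it as each user, and note that the empty passphrase is what makes the login passwordless:
# generate a key pair without a passphrase (the file name is the OpenSSH default)
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# authorize the key; with /home shared over NFS this one step covers every node
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys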
Setting Up The Cluster
Our setup consisted of 3 nodes and 1 server machine. The nodes were all P4 1.5GHz machines with 60GB of hard disk space and 256MB RAM. The server node was a PIII 733 MHz with 256 MB RAM and a 120GB hard disk. Since I didn't want to waste any resources I also set the server machine to be a compute node, giving us 4 compute nodes in all. All 4 machines had RedHat 8.0 installed on them. The 3 node machines were put on an internal network (192.168.1.*) and connected to the server by a switch. The server's internal interface (192.168.1.1) was specified as the gateway for the node machines. On the server I set up ipchains to allow the internal machines to connect to the rest of the world. The following ipchains commands were placed in /etc/rc.d/rc.local on the server:
/sbin/modprobe ipchains
/sbin/ipchains -F forward
/sbin/ipchains -P forward DENY
/sbin/ipchains -A forward -s 192.168.1.0/24 -j MASQ
/sbin/ipchains -A forward -i eth1 -j MASQ
echo 1 > /proc/sys/net/ipv4/ip_forward
The server's own default gateway was set through its external interface. No doubt the above ipchains script could be improved (and should really use iptables) to allow only certain packets out, etc., but I'll leave that for another day.
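For the curious, a rough iptables equivalent of the masquerading rules above might look something like this (eth1 is again assumed to be the external interface, and this is only a sketch, not a hardened firewall):
/sbin/modprobe iptable_nat
/sbin/iptables -P FORWARD DROP
/sbin/iptables -A FORWARD -s 192.168.1.0/24 -j ACCEPT
/sbin/iptables -A FORWARD -m state --state ESTABLISHED,RELATED -j ACCEPT
/sbin/iptables -t nat -A POSTROUTING -o eth1 -s 192.168.1.0/24 -j MASQUERADE
echo 1 > /proc/sys/net/ipv4/ip_forward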
The next step is to decide on how the hard drive space will be used. Partitioning the drives is left up to the reader. However, there are two things to consider regarding the home (/home) space. The easy way out is to mount /home on the server across the nodes via NFS (or your choice of network filesystem), and that's what I did for this cluster. If you don't want to mount your home space then you will have to make sure to generate passwordless logins for everybody using SSH - this is a major pain and requires so much extra work as to be unfeasible if you have more than 2 users or more than 2 nodes! To export the /home directory I added the following to /etc/exports on the server (our nodes were numbered 192.168.1.*):
/home 192.168.1.*(rw)
and then ran
exportfs -vr
On the nodes you'll need to mount the exported directory. To do that I went to each node machine (as root) and did
mount zeus:/home /home -t nfs
You'll also need to add the following line to /etc/fstab on each node:
zeus:/home /home nfs defaults 0 0
Finally, make sure that the programs that will be run on the cluster are available to all the nodes, either by copying them onto the nodes or else by making the binary directory from the server available to the nodes.
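If you go the NFS route for the binaries as well, the same export/mount pattern works. As a sketch, assuming the applications live in a directory such as /opt/apps on the server (the path is just an example - don't export /usr/local wholesale, since the PBS binaries and $PBS_HOME must stay local to each node):
# on the server, add to /etc/exports and re-run exportfs -vr
/opt/apps 192.168.1.*(ro)
# on each node, add to /etc/fstab and mount it
zeus:/opt/apps /opt/apps nfs ro 0 0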
Compiling the Code
You can get the OpenPBS package in the form of RPMs or as a source tarball. I used the tarballs and that's what I'll be discussing now. To get access to the source you will need to register (free), after which you get access to the source as well as the documentation and mailing lists. NOTE: It seems gcc 3.2 cannot compile OpenPBS. Until it can, you will need gcc 2.95 or gcc 2.96. If you have both (i.e., 2.9x and 3.2) on your system then make sure that /usr/bin/gcc refers to version 2.95 or 2.96.
OK, let's assume you have the tarball. Extract it into some directory (say /root). The source tarball contains code for the server, scheduler and MOMs as well as several client programs and a GUI. You can compile all of them on each machine or else just compile the required components. For our setup I compiled everything for the server machine, but skipped the GUI stuff on the node machines.
To compile for the server I used the following configure command:
./configure --set-default-server=zeus --enable-syslog --with-scp \ --set-server-home=/usr/local/spool/pbs --enable-clients
This will
- Set the default server name to zeus (the hostname of the server machine without the domain)
- All the PBS stuff such as logs and configuration files will be under /usr/local/spool/pbs. It's a good idea to make this the same for all the machines. Also remember that each machine should have its own PBS home directory - i.e., it shouldn't be mounted from the server.
- Since we decided that the server machine will also act as an execution node, we compile the MOM as well
- The system will log both to the PBS home directory and to syslog (usually /var/log/messages on a RedHat system).
- All the clients (like qsub, qdel, pbsnodes, etc.) will be compiled. This also includes the GUI programs, which need Tcl/Tk to compile.
- ssh will be used for file transfers rather than rcp
make && make install
Next I copied the source tarball to each node machine and ran configure with the following command:
./configure --disable-gui --disable-server --set-default-server=zeus1 \ --enable-syslog --with-scp --set-server-home=/usr/local/spool/pbs \ --enable-mom --enable-clients
This configure command will
- Prevent compilation of the PBS server
- Prevent compilation of the GUI clients
- Enable logging to syslog and enable the use of scp rather than rcp.
- The other switches mean the same as they did for the server
make && make install
This should have all the programs and directories set up on the server and the nodes. Next I configured the server, the execution nodes and the scheduler.
Setting Up the Execution Nodes
Setting up the execution node is quite simple. On each execution node open up the file $PBS_HOME/mom_priv/config. (The first time round it won't exist, so just make a file with that name.) You need to add the following lines:
$logevent 0x1ff
$clienthost zeus1
$clienthost zeus.chem.psu.edu
$max_load 2.0
$ideal_load 1.0
$usecp zeus.chem.psu.edu:/home /home
The manual discusses the meanings in detail, but I'll give a quick overview of each item in the MOM config.
- $logevent: This indicates the level of logging that the MOM should do.
- $clienthost: The value of this keyword indicates which machines in the cluster are allowed to contact the MOM (to schedule a job, for example). In our case, the only machine that was meant to contact a MOM was the server node - thus we specify the names of the internal and external interfaces of the server machine (zeus1 and zeus.chem.psu.edu respectively).
- $max_load: The value of this keyword plays a role in scheduling which I'll talk about in more detail in the section discussing the scheduler. When the system load average goes above this value the MOM will refuse any new jobs.
- $ideal_load: The value of this keyword also plays a role in scheduling. After the load average has crossed $max_load, the MOM will only take on new jobs after the load average has gone below this value.
- $usecp: This is a very important keyword for the setup described
here. After a job has finished the MOM will transfer the output files etc back to the
host which supplied the job (in this case the server zeus). Since we compiled the code
with the --with-scp flag it will try to use scp. However, if you have not set up
passwordless logins, then the scp will fail and you will not get back your output files.
Hence, this keyword instructs the MOM that if the host to which it is trying to transfer files matches the above host (the matching rule is described in a little more detail in the manual), then it should use cp, which is fine for us since our home directories are mounted via NFS. The paths specified indicate that anything under /home on the execution node should be copied under /home on the remote host (i.e., zeus.chem.psu.edu in this case).
It is important to specify the FQDN for the server. This is because the server gets its hostname with the gethostbyname() C library call and uses that in communications with the MOMs. Hence if you just gave zeus as the host it would never match the server-supplied hostname, and so the MOM would use scp rather than cp and you would be left scratching your head! Also note that if the MOM detects that it is going to be copying files to the local machine (i.e., itself) then it will use cp by default. Thus if you run a MOM on the server machine you can skip this line.
Once the config file is in place, start the MOM:
/usr/local/sbin/pbs_mom
Remember that you should start the MOM as root. It will become a daemon and drop root privileges. To make sure that the MOM restarts automatically in the event of a reboot, execute it from /etc/rc.d/rc.local.
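As a sketch, the lines added to /etc/rc.d/rc.local might look like this (the paths are the install defaults used above; the pbs_server and pbs_sched lines apply to the server machine only and are explained in the following sections):
# on every execution node
/usr/local/sbin/pbs_mom
# on the server machine only (omit -t create after the very first start)
/usr/local/sbin/pbs_server
/usr/local/sbin/pbs_sched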
Setting Up the Scheduler
Scheduling is a vital part of a job management system. By
implementing a scheduling policy you can instruct the server to
allow short running jobs to run before long running ones. You can
also specify that certain jobs should be run at certain times of the
day. In general the scheduler for a site determines which job is
run and where & when it should be run.
OpenPBS comes with a number of different schedulers and scheduling policies - the default C scheduler, a Tcl based scheduler (which you can script using Tcl) and a BASL scheduler which can be programmed in a C-like language. You can also get a Python based scheduler (you'll need the Python interface to PBS) and CLAN. You can get a number of ready-made schedulers at the OpenPBS site. You can also use the scheduling systems that come with OpenPBS to implement your own scheduling policy. I used the default C scheduler with its FIFO policy.
Some features of the default scheduler include
- All jobs in a given queue are considered for execution before another queue is examined.
- Queues are ordered by queue priority and jobs within queues are ordered by the requested CPU time.
- Prime time is from 4:00am to 5:30pm.
To set up the C scheduler I edited the file $PBS_HOME/sched_priv/sched_config. To allow for load balancing I added the line
load_balancing: true ALL
This option will allow for load balancing between a list of time shared hosts obtained from the server. There are lots more options that can be set which control how and when jobs should be run, what time of the day is regarded as prime time, whether certain queues have higher priority, etc. However, for our purposes, we just needed to allow jobs to be distributed amongst our nodes - nodes with low load averages should receive jobs before systems with high load averages.
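As an example of the kind of thing you can tweak, a few other entries from the sample sched_config shipped with the C scheduler look like this (check the comments in your own copy of the file before relying on them):
sort_by: shortest_job_first    ALL
help_starving_jobs: true       ALL
max_starve: 24:00:00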
After the scheduler is configured I started the scheduler (which will become a daemon) with
/usr/local/sbin/pbs_sched
Setting Up the Queue Server
The final step in the setup is to start the server and create the queues which will receive jobs. The server will submit a job to a queue depending on what type of queues are available and what type of resources have been requested for the job. For large clusters it's possible to set up more than one queue - say, a queue for short running jobs and another queue for long running jobs. Each queue can be assigned a priority which the scheduler examines when deciding which queue to run a job from. In addition you can also set up feeder (or routing) queues whose job is to feed execution queues from which the actual jobs will be run. More details regarding queue setup can be found in the manual.
The server
The first thing to do is to start the server:
/usr/local/sbin/pbs_server -t create
The -t option should only be given the first time you start the server. At this point the server is in an idle state; that is, it won't contact the scheduler or run jobs. So the next step is to configure the server and the queue(s). I set the server into active mode (i.e., it will run jobs) by
qmgr -c "set server scheduling=true"
To configure the server I used the qmgr program. Typing qmgr at the command line will put you at the qmgr prompt where you can enter commands. We first create a queue and set it as the default queue by doing
create queue qsar queue_type=execution
set server default_queue=qsar
This gives us a single execution queue (i.e., a queue from which jobs will be run) called qsar. By setting the default queue to qsar, we can skip the queue specification in our job submission scripts. Of course, if you make more than one queue you will probably want to specify a queue when you submit your jobs.
This effectively configures your server and queue. However, it's good practice to set a few more things before unleashing users onto your cluster. I used the following configuration for the server:
set server default_node=zeus2
set server acl_hosts=zeus.chem.psu.edu,ra.chem.psu.edu
set server acl_host_enable=true
set server managers=rajarshi@zeus.chem.psu.edu
set server query_other_jobs=true
OK, let's go through each command one at a time.
- set server default_node=zeus2 This command specifies that if no node is specified in a job submission script the job should be sent to zeus2. Since I set up the system for load balancing, this is not always what happens, since zeus2 might be too loaded to take on a job.
- set server acl_hosts=zeus.chem.psu.edu,ra.chem.psu.edu This sets up access control based on hostnames. The two hosts named are the only ones allowed to submit jobs.
- set server acl_host_enable=true Enable the access control list on the server
- set server managers=rajarshi@zeus.chem.psu.edu This allows the named user to make changes to the server and queue settings. Useful, since you don't have to be root to make changes.
- set server query_other_jobs=true Normally when a user submits a series of jobs he can only see the status of his own jobs via the qstat command. Setting this option to true allows him to get the status of all the jobs running on the system.
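Once these are set it's worth double-checking what the server thinks its configuration is; the standard clients make this easy:
qmgr -c "print server"
qstat -q
The first command dumps the full server and queue configuration, while the second summarizes the queues and their limits.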
The queue
The next step is to configure the queue. The important things to set up are the default, maximum and minimum resources. In this context resources refers to CPU time and memory. In the case of our cluster I only set the CPU times with the following:
set queue qsar resources_min.cput=1, resources_max.cput=240:00:00
set queue qsar resources_default.cput=120:00:00
set queue qsar enabled=true, started=true
The first two commands set up the minimum, maximum and default CPU time for a job. I set them to 1 minute, 10 days and 5 days respectively. To set memory limits simply replace cput with mem. Note that in all queue commands (set queue) you must specify the queue name. If a resource (cput or mem) is unset it is taken to be infinite. The last line indicates that the queue can accept jobs and that those jobs can be processed by the scheduler and started.
It's recommended that you provide values for resources_min and resources_max, to prevent runaway jobs. It is also possible to provide resources_default values to the server by
set server resources_default.cput=1:00:00
set server resources_default.mem=4mb
These limit the resources a job can request on any queue on this server. These limits are only checked if
- A job does not specify any resource limits
- The queue has no specified resource limits
Adding nodes
The final step is to let the server know which execution nodes exist and what their properties are. Nodes can be created using
create node NODENAME ATTRIBUTE=value
Here NODENAME refers to the hostname (as opposed to the FQDN) of the execution node. There are 4 possible attributes that can be set:
- state This can be either free, down or offline
- properties This can be any string, or strings separated by commas, and can be used for grouping nodes when multiple nodes are required for a job.
- ntype This can be set to cluster or time-shared. In our case, since we simply wanted load balancing between execution nodes and more than one job could run on a node, we set each node to be time-shared.
- np This is set to the number of virtual processors (any positive integer). In our case I didn't need to set this attribute.
For example:
create node zeus2 ntype=time-shared,properties="fast,big"
You can also modify node settings or delete nodes using
set node NODENAME ATTRIBUTE[+|-]=value
delete node NODENAME
At this point, the queue and server are configured and ready to take jobs and distribute them to the nodes. For larger installations it's possible to set up multiple queues (say one for long jobs and one for short jobs) with different resource limits. You can also assign queue priorities, so that jobs from one queue get preference over jobs from the other. You can refer to the manual for more details.
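Before handing the cluster over to users it's worth confirming that the server can actually see all the MOMs; the pbsnodes client reports each node and its state:
pbsnodes -a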
Using OpenPBS
Now that you have the job management system up and running, you'll want to start submitting jobs. The program qsub allows you to submit a job. Though you can specify various options to qsub on the command line, it's easier to put them in a job script and run your job using
qsub JOBSCRIPT
A minimal job script is sketched below.
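Something along these lines will do; the job name, resource requests and program are placeholders, and the qsar queue is the one created earlier:
#!/bin/sh
# request a job name, the qsar queue, 24 hours of CPU time and 100 MB of memory,
# and merge stdout and stderr into a single output file
#PBS -N test_job
#PBS -q qsar
#PBS -l cput=24:00:00
#PBS -l mem=100mb
#PBS -j oe

# the job starts in the owner's home directory; move to where qsub was invoked
cd $PBS_O_WORKDIR
./my_program input.dat > output.dat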
Once you have your jobs running you can view the status of jobs by running qstat. Running jobs can be killed by using qdel.
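For example (the job ID shown is just an illustration - qsub prints the real one when the job is submitted):
qsub job.sh        # prints a job id such as 12.zeus
qstat              # show the status of your jobs
qstat -a           # a more detailed listing, including requested resources
qdel 12.zeus       # kill the job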
Further Reading
The best place to get all the details regarding OpenPBS is the OpenPBS website. Some other useful links include:
- The page at http://mrccs.man.ac.uk/hpctec/impact/cms/links/index.shtm with lots of links to cluster management software
- A very useful page for high performance cluster management is http://supercluster.org/. This group developed Maui, which is a highly regarded scheduler.