Setting up OpenPBS To Manage Your Cluster

Rajarshi Guha


Introduction

Since the development of Beowulf clusters running Linux on commodity PCs in the early 1990s, parallel computing has become something of a commodity itself. Using commodity off-the-shelf (COTS) machines it's very easy to build a cluster of computers to run a variety of parallel codes. Of course, there's nothing holding you back from running multiple serial jobs as well.

There are many aspects to cluster (or high performance) computing - these include the cluster architecture, the type of network, the nature of the parallel code, etc. However, an important component of any cluster is the job management system. For a 'cluster' of 2 or 3 machines it's feasible (though not much fun!) to check the status of each machine by hand and run a job on a specific node. But that means you have to check the load on 3 machines, copy files to and from the machines, and so on. In addition, it's possible for a user to hog all the resources on a machine.

The role of job submission software is to simplify and automate the tasks of submitting, scheduling and tracking jobs. There are several job management systems available, but I'll be concentrating on OpenPBS, which is available both as a free download and as a commercial product (which comes with additional support). Not only is it available for free, it's also open source. As a result you can dig into the code and modify it to suit your site's requirements. You can also get a Python interface to the OpenPBS API. In this document I discuss the installation and configuration of OpenPBS on our small cluster of 4 machines.

It is important to remember that PBS uses ports 15001-15004 by default, so it is essential that your firewall doesn't block these ports.
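For example, since this cluster uses ipchains (see below), rules along the following lines on the server would let the node machines reach the PBS daemons. This is only a sketch - it assumes the internal interface is eth0 and that the nodes sit on 192.168.1.0/24, so adjust it to match your own firewall:

    # allow PBS traffic (TCP, plus UDP for the resource monitor on 15003)
    # from the internal network to ports 15001-15004 on this machine
    /sbin/ipchains -A input -i eth0 -p tcp -s 192.168.1.0/24 -d 0/0 15001:15004 -j ACCEPT
    /sbin/ipchains -A input -i eth0 -p udp -s 192.168.1.0/24 -d 0/0 15001:15004 -j ACCEPT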


Background

As mentioned above, I'll discuss how I set up a 4 node cluster - with all nodes classified as time-shared and using the default C scheduler for load balancing. This is quite a mouthful, but things will become clearer as we progress with the installation and configuration. Before I delve into the details of setting up the job scheduling system, I'll point out some definitions and also set up some paths and variables that I'll refer to later.

Definitions

The PBS manual defines a number of terms. I'll just go into those which are relevant to this discussion:

    Node - a machine on which jobs can be executed.
    Time-shared node - a node on which more than one job may run at a time (as
        opposed to cluster nodes, which are allocated exclusively to a job).
    Server (pbs_server) - the daemon that accepts jobs and places them in queues.
    MOM (pbs_mom) - the daemon that runs on every execution node and actually
        starts and monitors jobs.
    Scheduler (pbs_sched) - the daemon that decides which job runs where and when.
    Queue - a named container of jobs on the server from which the scheduler
        selects jobs to run.

Paths & Variables

For the purpose of this article we'll fix some names and paths (host names, home directories, etc.). The server is called zeus (zeus.chem.psu.edu to the outside world, zeus1 on its internal interface) and the compute nodes are zeus2 and so on. Home directories live under /home, which is exported from the server, and the PBS working directory (referred to as $PBS_HOME below) is /usr/local/spool/pbs.

rcp Or ssh?

PBS supports both rcp and ssh for file transfers among the machines in the cluster. However, since part of our cluster (the server / login node) was connected to the outside world, using rcp on that machine would practically guarantee our being hacked. So in this document I ignore the rcp option and only consider ssh. Unless you have a totally isolated cluster (i.e., no connection to outside networks) I can see no reason to run the r* services.

Setting Up The Cluster

Our setup consisted of 3 nodes and 1 server machine. The nodes were all P4 1.5GHz machines with 60GB of hard disk space and 256MB RAM. The server was a PIII 733MHz with 256MB RAM and a 120GB hard disk. Since I didn't want to waste any resources, I also set the server machine up as a compute node, giving us 4 compute nodes in all. All 4 machines had RedHat 8.0 installed. The 3 node machines were put on an internal network (192.168.1.*) and connected to the server by a switch. The server's internal interface (192.168.1.1) was specified as the gateway for the node machines. On the server I set up ipchains to allow the internal machines to connect to the rest of the world.

The following ipchain commands were placed in /etc/rc.d/rc.local on the server:

    /sbin/modprobe ipchains
    /sbin/ipchains -F forward
    /sbin/ipchains -P forward DENY
    /sbin/ipchains -A forward -s 192.168.1.0/24 -j MASQ
    /sbin/ipchains -A forward -i eth1 -j MASQ       
    echo 1 > /proc/sys/net/ipv4/ip_forward

Traffic from the internal interface of the server is thus forwarded out via the external interface. No doubt the above ipchains script could be improved (and should really use iptables) to allow only certain packets out, etc. But I'll leave that for another day.
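For what it's worth, a rough iptables equivalent (untested on this particular cluster, and assuming eth1 is the external interface) would look something like:

    # masquerade the internal network behind the external interface
    /sbin/iptables -t nat -A POSTROUTING -o eth1 -s 192.168.1.0/24 -j MASQUERADE
    # let internal hosts out, and let replies back in
    /sbin/iptables -A FORWARD -s 192.168.1.0/24 -j ACCEPT
    /sbin/iptables -A FORWARD -d 192.168.1.0/24 -m state --state ESTABLISHED,RELATED -j ACCEPT
    /sbin/iptables -P FORWARD DROP
    echo 1 > /proc/sys/net/ipv4/ip_forward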

The next step is to decide on how the hard drive space will be used. Partitioning the drives is left up to the reader. However, there are two things to consider regarding the home (/home) space. The easy way out is to mount /home from the server across the nodes via NFS (or your choice of network filesystem), and that's what I did for this cluster. If you don't want to share your home space, then you will have to set up passwordless SSH logins for everybody (a sketch of that setup appears at the end of this section) - enough extra work per user and per node to make it unattractive for anything but the smallest clusters. To export the /home directory I added the following to /etc/exports on the server (our nodes were numbered 192.168.1.*):

    /home 192.168.1.*(rw)
and then ran
    exportfs -vr
On the nodes you'll need to mount the exported directory on the nodes. To do that I went to each node machine (as root) and did
    mount zeus:/home /home -t nfs
You'll also need to add the following line to /etc/fstab on each node:
    zeus:/home        /home           nfs     defaults     0 0
Finally make sure that the programs that will be run on the cluster are available to all the nodes, either by copying them onto the nodes or else by making the binary directory from the server available to the nodes.
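If you do skip the NFS route, the per-user key setup mentioned above looks roughly like this (the paths are the usual OpenSSH defaults; with a shared /home this only needs to be done once, otherwise the public key has to be copied to every node):

    # run as the user whose logins should be passwordless
    ssh-keygen -t rsa -N ""                             # accept the default key location
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys     # authorize the key
    chmod 600 ~/.ssh/authorized_keys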

Compiling the Code

You can get the OpenPBS package in the form of RPMs or as a source tarball. I used the tarballs and that's what I'll be discussing here. To get access to the source you will need to register (free), after which you get access to the source as well as the documentation and mailing lists.

NOTE: It seems gcc 3.2 cannot compile OpenPBS. Until it can, you will need gcc 2.95 or gcc 2.96. If you have both (i.e., 2.9x and 3.2) on your system, then make sure that /usr/bin/gcc refers to version 2.95 or 2.96.
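A quick way to check, and to point the build at the older compiler without touching /usr/bin/gcc (the path below is just an example - use wherever gcc 2.95/2.96 lives on your system):

    gcc --version                       # what does /usr/bin/gcc point to?
    # if it reports 3.2, tell configure to use the older compiler instead:
    CC=/usr/bin/gcc296 ./configure ...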

OK, let's assume you have the tarball. Extract it into some directory (say /root). The source tarball contains code for the server, scheduler and MOMs, as well as several client programs and a GUI. You can compile all of them on each machine or just compile the required components. For our setup I compiled everything for the server machine, but skipped the GUI on the node machines.

To compile for the server I used the following configure command:

    ./configure --set-default-server=zeus --enable-syslog --with-scp \
    --set-server-home=/usr/local/spool/pbs --enable-clients

This will set zeus as the default server that the client commands talk to, log messages via syslog in addition to the PBS log files, use scp rather than rcp when delivering files, place the PBS working directories under /usr/local/spool/pbs (this is the $PBS_HOME referred to below) and build the client commands.

After configure I did,
    make && make install
Next I copied the source tarball to each node machine and ran configure with the following command:
    ./configure --disable-gui --disable-server --set-default-server=zeus1 \
    --enable-syslog --with-scp --set-server-home=/usr/local/spool/pbs \
    --enable-mom --enable-clients

This configure command will do the same as above, except that it skips the GUI and the server, builds only the MOM and the client commands, and points the default server at zeus1 (the server's name on the internal network).

As before, after the configure script I did,
    make && make install
This should have all the programs and directories set up on the server and the nodes. Next I configured the server, the execution nodes and the scheduler.

Setting Up the Execution Nodes

Setting up an execution node is quite simple. On each execution node open up the file $PBS_HOME/mom_priv/config. (The first time round it won't exist, so just create a file with that name.) You need to add the following lines:
    $logevent 0x1ff
    $clienthost zeus1
    $clienthost zeus.chem.psu.edu
    $max_load 2.0
    $ideal_load 1.0
    $usecp zeus.chem.psu.edu:/home /home
The manual discusses the meanings in detail, but briefly: $logevent 0x1ff turns on logging of all event types; the $clienthost lines name the hosts (here the server, by both its internal and external names) that are allowed to connect to the MOM; $max_load and $ideal_load are the load averages above which the node reports itself busy and below which it reports itself free again; and $usecp tells the MOM to use a plain cp, rather than scp, for files destined for /home, since it is NFS mounted from the server anyway. The values for $max_load and $ideal_load given above are specific to our cluster; you'll probably want to tweak them. After you have written the configuration file, you can start the MOM by
    /usr/local/sbin/pbs_mom
Remember that you should start the MOM as root; it will detach and run as a daemon. To make sure the MOM restarts automatically after a reboot, execute it from /etc/rc.d/rc.local (see the sketch below).
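For example, the relevant lines in /etc/rc.d/rc.local on our machines would look something like this (the scheduler and server daemons are described in the following sections and run only on the server machine):

    # on every compute node
    /usr/local/sbin/pbs_mom

    # on the server only (it is also a compute node, so it runs a MOM too)
    /usr/local/sbin/pbs_sched
    /usr/local/sbin/pbs_server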


Setting Up the Scheduler

Scheduling is a vital part of a job management system. By implementing a scheduling policy you can instruct the server to allow short running jobs to run before long running ones, or specify that certain jobs should only be run at certain times of day. In general, the scheduler for a site determines which job is run and where and when it should be run. OpenPBS comes with a number of different schedulers - the default C scheduler, a Tcl based scheduler (which you can script using Tcl) and a BASL scheduler which can be programmed in a C-like language. You can also get a Python based scheduler (you'll need the Python interface to PBS) and CLAN, and a number of ready-made schedulers are available at the OpenPBS site. Alternatively, you can use the scheduling systems that come with OpenPBS to implement your own scheduling policy. I used the C scheduler with the default FIFO policy to implement ours.

Some features of the default scheduler include sorting jobs by various keys (shortest job first, for example), running jobs on a queue-by-queue basis, helping jobs that have been starved of resources, fair share scheduling, separate prime time and non-prime time behaviour, and load balancing across time-shared hosts.

It's important to note that though the C scheduler is described as FIFO, this term really applies to the queues and the server - both are FIFO in nature. The actual decision of which jobs are run, and when, can be completely redefined by implementing a scheduling policy of your own.

To set up the C scheduler I edited the file $PBS_HOME/sched_priv/sched_config. To allow for load balancing I added the line

    load_balancing: true ALL
This option allows for load balancing between a list of time-shared hosts obtained from the server. There are many more options that control how and when jobs should be run, what time of day is regarded as prime time, whether certain queues have higher priority, and so on. For our purposes, we just needed jobs to be distributed amongst the nodes - nodes with low load averages should receive jobs before nodes with high load averages.
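For reference, the sample sched_config shipped with OpenPBS should contain entries along these lines (the option names are from the sample file as I remember it and the values are purely illustrative - check the comments in your own copy):

    sort_by: shortest_job_first    ALL
    help_starving_jobs: true       ALL
    max_starve: 24:00:00
    strict_fifo: false             ALL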

After the scheduler is configured I started the scheduler (which will become a daemon) with

    /usr/local/sbin/pbs_sched

Setting Up the Queue Server

The final step in the setup is to start the server and create the queues which will receive jobs. The server will place a job in a queue depending on what queues are available and what resources have been requested for the job. For large clusters it's possible to set up more than one queue - say, a queue for short running jobs and another for long running jobs. Each queue can be assigned a priority which the scheduler examines when deciding which queue to run a job from. In addition, you can set up feeder (or routing) queues whose job is to feed the execution queues from which jobs are actually run. More details regarding queue setup can be found in the manual.

The server

The first thing to do is to start the server by
        /usr/local/sbin/pbs_server -t create
The -t create option should only be given the first time you start the server. At this point the server is in an idle state; that is, it won't contact the scheduler or run jobs. So the next step is to configure the server and the queue(s). I set the server into active mode (i.e., it will run jobs) with
        qmgr -c "set server scheduling=true"
To configure the server I used the qmgr program. Typing qmgr at the command line will put you at the qmgr prompt, where you can enter commands. We first create a queue and make it the default queue by doing
        create queue qsar queue_type=execution
        set server default_queue=qsar
This gives us a single execution queue (i.e., a queue from which jobs will be run) called qsar. By making qsar the default queue, we can skip the queue specification in our job submission scripts. Of course, if you create more than one queue you will probably want to specify a queue when you submit your jobs.

This effectively configures your server and queue. However, it's good practice to set a few more things before unleashing users onto your cluster. I used the following configuration for the server:

        set server default_node=zeus2
        set server acl_hosts=zeus.chem.psu.edu,ra.chem.psu.edu
        set server acl_host_enable=true
        set server managers=rajarshi@zeus.chem.psu.edu
        set server query_other_jobs=true
OK, let's go through these one at a time. Setting default_node to zeus2 means that jobs which don't ask for a particular node end up there. The acl_hosts list, together with acl_host_enable=true, restricts job submission to the listed hosts. The managers attribute names the users (in user@host form) who are allowed to administer the server via qmgr. Finally, query_other_jobs=true lets users see the status of jobs belonging to other users when they run qstat.
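At any point you can dump the current server and queue configuration (as a series of qmgr commands, which can be saved and replayed later) with:

        qmgr -c "print server"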

The queue

The next step is to configure the queue. The important things to set up are the default, maximum and minimum resources. In this context, resources refers to CPU time and memory. In the case of our cluster I only set the CPU times, with the following:
        set queue qsar resources_min.cput=1, resources_max.cput=240:00:00
        set queue qsar resources_default.cput=120:00:00
        set queue qsar enabled=true, started=true
The first two commands set up the minimum, maximum and default CPU time for a job: 1 second, 10 days and 5 days respectively (a bare number is interpreted as seconds, while times can also be given as hours:minutes:seconds). To set memory limits, simply replace cput with mem. Note that in all the queue commands (set queue) you must specify the queue name. If a resource (cput or mem) is left unset, it is treated as infinite. The last line says that the queue can accept jobs and that those jobs can be processed by the scheduler and started.

It's recommended that you provide values for resources_min and resources_max to prevent runaway jobs. It is also possible to provide default resource values on the server itself:

        set server resources_default.cput=1:00:00
        set server resources_default.mem=4mb
These act as default limits for jobs on any queue on this server. They are only applied if
  1. A job does not specify any resource limits
  2. The queue has no specified resource limits
It is also possible to set the maximum limit on a resource (cput or mem) on the server. This limit is only checked if the corresponding limit is not set on the queue.

Adding nodes

The final step is to let the server know which execution nodes exist and what their properties are. Nodes can be created using
        create node NODENAME ATTRIBUTE=value
Here NODENAME refers to the short hostname (as opposed to the FQDN) of the execution node. There are 4 possible attributes that can be set: state, properties, ntype and np (the number of virtual processors). Thus to create a node on our cluster I did
        create node zeus2 ntype=time-shared,properties="fast,big"
You can also modify node settings or delete nodes using
        set node NODENAME ATTRIBUTE[+|-]=value
        delete node NODENAME
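Once the nodes have been created, a quick sanity check is to ask the server what it knows about them (pbsnodes is one of the client commands installed earlier; it lists each node and its state):

        pbsnodes -a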

At this point the queue and server are configured and ready to accept jobs and distribute them to the nodes. For larger installations it's possible to set up multiple queues (say, one for long jobs and one for short jobs) with different resource limits. You can also assign queue priorities, so that jobs from one queue get preference over jobs from another. You can refer to the manual for more details.

Using OpenPBS

Now that you have the job management system up and running, you'll want to start submitting jobs. The program qsub allows you to submit a job. Though you can specify various options to qsub on the command line, it's easier to put them in a job script and run your job using
        qsub JOBSCRIPT
A basic job script is sketched below.
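Something along these lines is a reasonable starting point (the job name, resource requests and program are placeholders - adjust them to your own jobs):

    #!/bin/sh
    ### job name; the -q line can be dropped if you rely on the default queue
    #PBS -N myjob
    #PBS -q qsar
    ### resources requested from the server
    #PBS -l cput=24:00:00
    #PBS -l mem=100mb
    ### merge stdout and stderr into a single output file
    #PBS -j oe

    # PBS starts the script in the user's home directory; move to the
    # directory the job was submitted from
    cd $PBS_O_WORKDIR
    ./myprogram < input.dat > output.dat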

Once you have your jobs running you can view the status of jobs by running qstat. Running jobs can be killed by using qdel.
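For example (the job identifier shown here is made up - qsub prints the real one when you submit):

    qsub job.sh              # prints a job id, e.g. 42.zeus
    qstat -a                 # show the status of all jobs
    qdel 42.zeus             # kill that job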

Further Reading

The best place to get all the details regarding OpenPBS is the OpenPBS website, which also hosts the documentation and mailing lists.

Thanks

Thanks go to Jason Holmes for pointing me in the right direction regarding home directories and Jon Serra for patiently waiting around for PBS to get up and running.