How to Build a Beowulf

I've set up two Beowulfs so far, and in both cases it involved gathering material from various Web sites and somehow putting it all together. I got everything up and running, but it was quite a "time sink" for me, so I was interested to receive a book entitled "How to Build a Beowulf". Finally, information regarding Beowulfs would be available in one place and I could save my bandwidth for other stuff!


Title: How to Build a Beowulf
Authors: Thomas Sterling, John Salmon, Donald J. Becker, and Daniel F. Savarese
Publisher: The MIT Press
Purchase URL: http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=3898

My overall impression of the book is that it is targeted toward scientists and engineers who want to use Beowulfs as computational tools. As a result, certain areas might seem to go into unnecessary detail about computer hardware, software, operating systems, and so on. However, the book provides a good description of the issues involved in setting up, maintaining, and running applications on Beowulf clusters. Much of the material may seem redundant to an experienced system or network administrator, but keep in mind the target audience: scientists and engineers who solve computational problems, not network problems. The book gives a good all-around picture of the various components of a Beowulf.

The first two chapters give a brief overview of what Beowulfs are and what types of problems can be solved with them. There is a nice comparison with proprietary parallel processing systems. The concise description of the hardware and software aspects of building a Beowulf cluster provides a good foundation for the later chapters, which go into more detail on the points raised in these first two chapters.

The rest of the book covers the hardware and software aspects of Beowulfs in detail. None of the chapters ties itself to particular products, which makes sense. However, the specific examples that are included are quite outdated. This is to be expected, since the book was written before 1999, but updated examples would have been welcome. That said, the descriptions of components such as motherboards, RAM, and the PCI bus are general enough to provide a foundation for further reading in more specialized literature.

The chapter on Beowulf nodes is good, but for a reader familiar with computer hardware it can get a little tedious, with superfluous detail. However, given the book's target audience, I feel the extra detail is justified. I learned a few things about the PCI bus myself! The chapter describes the components that make up a machine and then explains how to assemble one. Though the description is quite general, it is good preparation for anyone planning to build machines for a cluster. Networking is covered in great detail, including the large variety of network hardware and topologies available and the roles of the various network components. Most Beowulf installations will use Fast Ethernet, the most cost-effective option. In many cases, however, the network becomes the bottleneck and high-performance network components are required; high-end hardware and protocols such as FDDI, ATM, and Myrinet are all discussed. The software side of the network is also described in detail, including TCP/IP packets, sockets, RPC, and distributed filesystems (NFS and AFS). There is also a section on Java/RMI and CORBA, and the r* commands are mentioned. Overall, this is a very detailed and useful chapter, covering one of the areas of cluster building that can most affect performance.
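To make the socket layer concrete, here is a minimal sketch (not taken from the book) of a TCP round trip in Python: a throwaway echo server on the loopback interface and a client that sends one message and reads it back. The function names `start_echo_server` and `echo_roundtrip` are hypothetical, and on a real cluster the peers would talk across the interconnect rather than to 127.0.0.1.

```python
import socket
import threading


def start_echo_server(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Listen on an OS-chosen TCP port and echo back one client connection."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    port = srv.getsockname()[1]

    def serve_once() -> None:
        conn, _addr = srv.accept()
        with conn:
            # Echo until the client shuts down its sending side.
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                conn.sendall(data)

    threading.Thread(target=serve_once, daemon=True).start()
    return srv, port


def echo_roundtrip(message: bytes, host: str = "127.0.0.1") -> bytes:
    """Send `message` to a fresh echo server and return what comes back."""
    srv, port = start_echo_server(host)
    try:
        with socket.create_connection((host, port)) as cli:
            cli.sendall(message)
            cli.shutdown(socket.SHUT_WR)  # signal end of our data to the server
            chunks = []
            while True:
                data = cli.recv(4096)
                if not data:  # server closed the connection
                    break
                chunks.append(data)
        return b"".join(chunks)
    finally:
        srv.close()
```

Calling `echo_roundtrip(b"hello")` should return `b"hello"`. RPC systems and MPI implementations layer message framing, naming, and error handling on top of exactly this kind of byte stream.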

After the first five chapters, the book turns to the software needed to manage and program a Beowulf. The chapter on managing clusters of machines is quite useful. It raises the issues of security and access to the cluster, and a good part of it is devoted to how to allow access to the cluster and how to implement a firewall (though the example uses ipfwadm!). The section on cloning nodes is very useful and goes a long way toward making installations on a large number of machines much more consistent. Some basic system administration and management tips are also provided. This part of the book falls short in two ways. First, though it does a good job of discussing basic security, it relies on rsh for system administration tasks; SSH is mentioned as a viable option, but I would have preferred greater stress on SSH. Second, there is hardly any discussion of job submission and scheduling, which are very important components of a cluster shared by several users running several jobs. Packages such as DQS and PBS could have been discussed.
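Cloned nodes share one disk image, so the pieces that must differ per node (hostname and address) are usually generated rather than edited by hand. Below is a minimal sketch that produces a shared `/etc/hosts`-style table, assuming a hypothetical naming scheme (`head`, `node001`, ...) and sequential addresses on a private subnet; neither the scheme nor the function `cluster_hosts` comes from the book.

```python
import ipaddress


def cluster_hosts(head_ip: str, count: int, prefix: str = "node") -> list[str]:
    """Return hosts-file lines for a head node plus `count` cloned nodes.

    Hypothetical layout: the head node sits at head_ip and compute nodes
    take the next `count` addresses, named prefix001, prefix002, ...
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    head = ipaddress.IPv4Address(head_ip)
    lines = [f"{str(head)}\thead"]
    for i in range(1, count + 1):
        # IPv4Address supports integer addition, giving the next address.
        lines.append(f"{str(head + i)}\t{prefix}{i:03d}")
    return lines


if __name__ == "__main__":
    print("\n".join(cluster_hosts("10.0.0.1", 4)))
```

The same generated file can then be copied unchanged to every clone, which keeps name resolution identical across the cluster.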

The last three chapters are devoted to programming a Beowulf. Readers should not expect to become experts in parallel programming from these chapters, but as a general introduction to the various aspects of parallel code, they do their job. Issues such as granularity, synchronization, latency, and bandwidth are all discussed. A whole chapter is devoted to MPI. Again, the reader will not reach MPI guru status from this chapter, but it does provide a concise overview of the functionality MPI provides. Features such as synchronous and asynchronous I/O, data types, and parallel data structures are all covered. I was quite impressed with this chapter, as it provides a very clear and informative description of MPI, with non-trivial examples.
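The interplay of granularity, latency, and bandwidth is often summarized with a simple linear cost model, T(n) = latency + n / bandwidth. The sketch below implements that textbook model; it is not a formula quoted from this book. The 100 microsecond latency is an assumed illustrative value, while 12.5 MB/s is simply 100 Mbit/s Fast Ethernet expressed in bytes per second.

```python
FAST_ETHERNET_BYTES_PER_S = 100e6 / 8  # 100 Mbit/s expressed in bytes per second
ASSUMED_LATENCY_S = 100e-6             # illustrative assumption, not a measured value


def message_time(n_bytes: float,
                 latency_s: float = ASSUMED_LATENCY_S,
                 bandwidth: float = FAST_ETHERNET_BYTES_PER_S) -> float:
    """Linear point-to-point cost model: T(n) = latency + n / bandwidth."""
    return latency_s + n_bytes / bandwidth


def comm_time(total_bytes: float, n_messages: int, **model: float) -> float:
    """Time to send total_bytes as n_messages equal, back-to-back messages."""
    return n_messages * message_time(total_bytes / n_messages, **model)


if __name__ == "__main__":
    coarse = comm_time(1_000_000, 1)     # coarse-grained: one 1 MB message
    fine = comm_time(1_000_000, 1000)    # fine-grained: a thousand 1 KB messages
    print(f"1 x 1 MB:    {coarse:.4f} s")
    print(f"1000 x 1 KB: {fine:.4f} s")
```

Under these assumptions the same megabyte takes more than twice as long when split into a thousand small messages, because every message pays the fixed latency. That is the usual argument for coarse-grained decompositions on commodity networks.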

The final chapter provides a much more detailed example of MPI programming, using parallel sorting as its basis. It concisely describes the pitfalls to look out for and the features of good parallel algorithms. Considerations such as communication costs (in terms of time), load balancing, and redundancy are all described, and the analysis of the sort algorithms' time and bandwidth costs is detailed. Overall, the last chapters are very useful. Admittedly, they won't make you an expert, but they provide a very good sampling of the problems and pitfalls of programming Beowulf clusters, along with tips on how to solve or work around them.
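As a hedged illustration of the communication and load-balancing issues raised there, here is a single-process simulation of sample sort, a common parallel sorting scheme; it is not necessarily the algorithm the book analyzes. Each "rank" is just a Python list, and the comments mark where a real MPI program would gather samples and perform an all-to-all exchange. The count of elements that change ranks serves as a rough proxy for network traffic.

```python
import bisect


def sample_sort(ranks: list[list[int]], oversample: int = 4) -> tuple[list[list[int]], int]:
    """Simulate sample sort over len(ranks) processes.

    Returns the per-rank sorted buckets and the number of elements that
    had to move to a different rank (a rough proxy for network traffic).
    """
    p = len(ranks)
    # Step 1: each rank sorts its local data and picks evenly spaced samples.
    local = [sorted(chunk) for chunk in ranks]
    samples: list[int] = []
    for chunk in local:
        if chunk:
            step = max(1, len(chunk) // oversample)
            samples.extend(chunk[::step][:oversample])
    # Step 2: gather all samples (a gather in a real MPI program) and
    # choose p - 1 splitters that divide the key range into p buckets.
    samples.sort()
    splitters = [samples[i * len(samples) // p] for i in range(1, p)] if samples else []
    # Step 3: route every element to its bucket (an all-to-all exchange in MPI).
    buckets: list[list[int]] = [[] for _ in range(p)]
    moved = 0
    for src, chunk in enumerate(local):
        for x in chunk:
            dst = bisect.bisect_right(splitters, x)
            buckets[dst].append(x)
            if dst != src:
                moved += 1  # this element would cross the interconnect
    # Step 4: each rank sorts what it received; concatenating ranks 0..p-1
    # in order then yields the globally sorted sequence.
    return [sorted(b) for b in buckets], moved
```

Uneven bucket sizes show up directly as load imbalance, since the rank with the largest bucket finishes last. Raising `oversample` usually yields better splitters at the cost of gathering more samples, which is the kind of time-versus-communication trade-off the chapter analyzes.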

To sum up, the book provides an excellent description of the various components of Beowulf clusters. This is impressive, given how many variations are possible when constructing clustered machines. The authors' easygoing style makes the book an enjoyable read. It covers both hardware and software components in detail, with examples. In both cases, the descriptions are sufficiently detailed, yet general enough not to be tied to a particular piece of hardware or software. A few things do detract from the book's usefulness. It is definitely outdated: many of the hardware examples refer to very old CPUs (the Pentium II is the top-of-the-line CPU in this book), and the firewall example uses ipfwadm. Security is covered, but I feel that SSH should have been given more stress and the dangers of rsh made more apparent. Finally, job management and scheduling should have been discussed in more detail.

Overall, I'd recommend this book to scientists and engineers who want to harness the power of a Beowulf cluster but aren't exactly sure how to go about it.