Introduction

The ClusterVisionOS™ is a Linux-based cluster operating system and software environment specifically developed for use on ClusterVision's compute, storage and database clusters. It is flexible and easy to use, and it includes all software required for the effective use and management of clusters of any size. Software packages include compilers, mathematical libraries, MPI libraries, a choice of queuing and scheduling systems, and cluster management and monitoring software.
ClusterVisionOS™ Elements

The ClusterVisionOS™ includes the following elements:
- Linux Distribution: The ClusterVisionOS™ is available with a choice of different Linux distributions as its basis. Currently supported distributions include Scientific Linux, Red Hat Linux and SuSE Linux. Users and system administrators familiar with Scientific Linux, Red Hat or SuSE will therefore find it easy to use and administer. Effectively, the ClusterVisionOS™ is a set of tools tightly integrated into the regular Linux distributions without significantly affecting the base distribution. Applications that are certified to work on Red Hat or SuSE (such as Abaqus, Fluent or Oracle 10g) work transparently on the cluster, even if they have strict requirements.
- Image Management: The management of multiple images for the disk, slave and login nodes. Any changes made to the images are distributed across the cluster with a single command. In addition to this, multiple images can be created for one type of group in order to have multiple revisions. With the ClusterVisionOS™ tool "Trinity" it is easy to change the behaviour of nodes.
- Parallel Tools: A set of scalable tools for distributed Unix commands and management of the cluster. This includes power down/up, restarting, shutdown of the nodes and executing/copying files across the cluster. The tools are tightly integrated with the Image and Cluster Management.
- Application Management: A set of tools to manage applications and different versions of the same application. For example, users can freely choose between different versions of MPI libraries, compilers, scientific libraries etc. System administrators can release a beta version of an application and allow a subset of users to start testing.
- Resource Management: A queuing system for load balancing jobs across the cluster, tightly integrated with the MPI environment for total process management.
- Cluster Monitoring: Spot trends in, for example, CPU temperatures and fan speeds, and monitor the complete cluster.
- Cluster Security: The ClusterVisionOS™ can be well secured against attacks from your local network or the Internet. For example, ClusterVisionOS™ uses an easy-to-configure firewall to secure the master node.
- Compilers & Scientific Libraries: Each cluster is delivered with a set of compilers for creating optimally performing applications. In addition to this a number of scientific libraries are available.
- Linux Configuration: A typical Scientific Linux, Red Hat or SuSE installation is not configured to operate as a cluster node. Many services are installed (LDAP, http, dhcpd etc.) but not configured. In the ClusterVisionOS™ this is done optimally in order to scale across many nodes. For example, we configure the cluster such that: the internal network has a low latency; large arrays can be allocated in shared memory (important for many Quantum Mechanical applications); a huge number of files can be opened; etc. The environment is tested with a large number of scientific applications and users to make sure it is scalable and optimally performing.
The following sections will explain some of these elements in more detail.
Case Study: The Cambridge University Supercomputer
Introduction
The Cambridge University Supercomputer is a cluster built by ClusterVision in collaboration with Dell. With more than 1150 Intel® Xeon® CPUs and InfiniPath interconnect throughout, it entered the TOP500 at position 20.
It is an excellent reference for the ClusterVisionOS™ because of its complexity, short installation time, and intensive utilisation. It currently provides a central compute resource for the university to the full satisfaction of its managers and users.
Computational Units
The cluster uses the advanced multi-level image management in the ClusterVisionOS™ to create nine computational units (CUs), each with its own Sub-Master. Each CU has 65 nodes for computation. The CUs are managed by two failover Sub-Masters.
What is unique about the setup is that the CUs can run independently: if one CU fails, the other CUs will keep on functioning. Also, if one of the ClusterVisionOS™ Main Masters fails, the cluster will keep on running. User jobs, however, are not limited to a single CU and can span multiple CUs.
The construction using independent CUs allows for scaling up to far larger clusters with tens of thousands of servers. Users log into one of the four DNS-balanced ClusterVisionOS™ Login Nodes.
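As a rough consistency check on the figures quoted in this case study, the node and CPU counts can be worked out as below. The two-CPUs-per-node value is an assumption made for illustration; the text only states the number of CUs, the nodes per CU and "more than 1150" CPUs.

```python
# Figures taken from the case study text.
NUM_CUS = 9
NODES_PER_CU = 65

# Assumption (not stated in the text): each compute node has two CPU sockets.
CPUS_PER_NODE = 2

compute_nodes = NUM_CUS * NODES_PER_CU
total_cpus = compute_nodes * CPUS_PER_NODE

print(compute_nodes, total_cpus)  # 585 1170
```

Under that assumption the total is in line with the "more than 1150" CPUs mentioned above.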
Image Management

Image Management is one of the most important parts of the ClusterVisionOS™. Its effective, flexible and multi-level design distinguishes ClusterVisionOS™ from its competition and allows for very complex and large-scale cluster implementations.
Within the image management system, slave nodes install from software images located on the master node(s). Any information that is needed for the configuration of these nodes is retrieved from the master node.
The cluster nodes can boot from an unlimited number of unique images. The image can be chosen at boot time or preconfigured for each node. Groups of nodes use the same image, and typical slave services for Myrinet or any other fast network are started automatically if the corresponding hardware is present.
The images can differ in packages and even in Linux OS (Scientific Linux, Red Hat, SuSE, Debian, etc.). No change to the kernel is required to make it capable of running as part of the cluster. This means that if your software needs an 'official' Red Hat ES x.x kernel (for example if you use the Oracle 10g database) it will run with no problem. Any software that is certified to run under the Scientific Linux, Red Hat or SuSE distribution will work on the cluster.
Boot Phases
Four phases can be distinguished during the boot of a cluster node:
First Phase: Upon booting, a node retrieves its IP address and the Execution Environment from the master node using PXE. The cluster may be configured to skip this part, for example if a particular node should boot independently from the master node.
Second Phase: The Execution Environment will check with the master node which OS needs to be executed on the cluster node. The appropriate kernel/ramdisk is loaded from the master node and executed on the cluster node.
Third Phase: The node will mount a special environment from the master node in which it is decided what needs to happen to the node. Once the environment is mounted from the master node over NFS, there are a number of possibilities. For example:
- Burn-test a node in order to find hardware failures.
- Partition the hard disk and install an image.
- Synchronise (part of) the local hard drive with the image from the master node.
- Bash-shell diagnostics.
Fourth Phase: The node boots from the local hard drive. For this, the node does not require a restart; each phase logically follows the other.
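The four phases above can be sketched as a small decision function. This is only an illustration of the ordering described in the text. Two points are assumptions, not statements from the source: that a node configured to skip PXE goes straight to its local disk, and that only the install and synchronise actions continue to the fourth phase. All names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    """Third-phase possibilities listed in the text."""
    BURN_TEST = "burn-test the node"
    INSTALL = "partition the disk and install an image"
    SYNC = "synchronise the local disk with the image"
    SHELL = "start a Bash diagnostics shell"


PHASE_4 = "phase 4: boot from the local hard drive"


def boot_sequence(use_pxe, action):
    """Return the ordered boot steps for one node (illustrative only)."""
    if not use_pxe:
        # Assumption: a node that skips PXE boots independently of the master.
        return [PHASE_4]
    steps = [
        "phase 1: obtain IP address and Execution Environment via PXE",
        "phase 2: load the kernel/ramdisk selected by the master node",
        f"phase 3: mount the NFS environment, then {action.value}",
    ]
    # Assumption: burn-tests and diagnostic shells do not continue to phase 4.
    if action in (Action.INSTALL, Action.SYNC):
        steps.append(PHASE_4)
    return steps


for step in boot_sequence(True, Action.SYNC):
    print(step)
```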
Advantages
There are a number of important advantages to working within an image based cluster environment:
- The master node provides a single point of administration for the whole cluster. Replacing hardware and reinstalling software on slave nodes is easy.
- Only the master node's hard disk needs to be backed up.
- It only takes a single command or a reboot for a slave node to return to a known configuration state.
- Nodes are automatically partitioned, installed and configured from scratch within a few minutes per node.
- Multiple images for different groups of nodes within the cluster can easily be created and maintained. Multiple revisions can be kept for a single group, so if after changing something it stops working correctly you can always go back to the previous image.
- Groups of nodes can be synchronised with their image using one command.
- Group images are consistent: if one node in a group stops working and a reboot doesn't help, it is almost certainly a hardware problem.
- Different Linux distributions, or in principle even different operating systems (such as Microsoft Windows Compute Cluster) can be used within the cluster simultaneously.
- System administrators rarely need to log in to a slave node for administration purposes.
- Because of the software images, additional serial based access networks or KVM (Keyboard/Video/Mouse) switches are normally unnecessary.
One of the most important reasons for forcing synchronisation of node installations with images on the master is that it will keep your cluster tidy. If no image environment is used, a frequently encountered problem is that after a while each node will have its own unique installation due to upgrades that finish incorrectly or other errors. This makes tracing problems in your environment almost impossible. With the ClusterVisionOS™ you will not experience this problem.
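To illustrate what "drift" between a node and its image means, the following sketch compares file checksums of a reference image against those of a node. It is not how the ClusterVisionOS™ performs synchronisation (the text does not describe the mechanism). The function names and the in-memory representation of files are assumptions chosen to keep the example self-contained.

```python
import hashlib


def digest(content):
    """SHA-256 checksum of a file's content."""
    return hashlib.sha256(content).hexdigest()


def find_drift(image, node):
    """Compare a node against its image.

    Both arguments map a file path to the file's content as bytes.
    Returns the paths that are missing on the node, extra on the node,
    or present on both with different content.
    """
    image_sums = {path: digest(data) for path, data in image.items()}
    node_sums = {path: digest(data) for path, data in node.items()}
    return {
        "missing": sorted(p for p in image_sums if p not in node_sums),
        "extra": sorted(p for p in node_sums if p not in image_sums),
        "changed": sorted(
            p for p in image_sums
            if p in node_sums and image_sums[p] != node_sums[p]
        ),
    }


image = {"/etc/hosts": b"10.0.0.1 master\n", "/etc/motd": b"welcome\n"}
node = {"/etc/hosts": b"10.0.0.9 master\n", "/tmp/leftover": b"x"}
print(find_drift(image, node))
```

A node whose report is empty in all three categories matches its image; anything else is the kind of divergence that forced synchronisation removes.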
The software images and nodes are managed and configured using the Trinity tool.
Trinity
The Trinity tool is developed by ClusterVision and plays a central role in managing all the nodes and devices inside a cluster. It is used to create groups of nodes and assign software images to them. Trinity is also used to configure the power management, as each device can be assigned to one or more power outlets. To ease the administration and configuration of several key Linux services, Trinity generates all their configuration files from its device database. The services include DHCP, TFTP and DNS servers. The Trinity tool hides all the implementation details and provides a user-friendly, graphical interface for managing the nodes. When all the nodes and devices have been configured in Trinity, booting the nodes, even if they have never been started before, becomes a fully automated process that no longer requires user interaction.
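The idea of generating service configuration from a single device database can be sketched as follows. The `Device` record and the function name are hypothetical, and Trinity's real database layout and output are not described in this text. The output uses the standard `host` declaration syntax of the ISC DHCP server.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device record; Trinity's real schema is not documented here."""
    hostname: str
    mac: str
    ip: str


def dhcp_host_entries(devices):
    """Render one ISC dhcpd 'host' declaration per device, sorted by hostname."""
    blocks = []
    for dev in sorted(devices, key=lambda d: d.hostname):
        blocks.append(
            f"host {dev.hostname} {{\n"
            f"  hardware ethernet {dev.mac};\n"
            f"  fixed-address {dev.ip};\n"
            f"}}"
        )
    return "\n".join(blocks) + "\n"


devices = [
    Device("node002", "00:11:22:33:44:02", "10.141.0.2"),
    Device("node001", "00:11:22:33:44:01", "10.141.0.1"),
]
print(dhcp_host_entries(devices))
```

Keeping one source of truth and regenerating the files means the DHCP, TFTP and DNS views of a node cannot disagree with each other.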
Screen shots from the Trinity tool
Parallel Tools

The parallel tools are a set of distributed Unix commands for the management of clusters. They are tightly integrated with the Image & Cluster Management. The parallel tools can be used by the administrator to distribute images, manually monitor/configure and power up/down nodes, and much more. Features include:
- Smart broadcasting to find out which nodes are dead and should not receive commands.
- Issue commands on groups of nodes.
- Automatic killing of commands that take too much time.
- Direct access to APC power management of the nodes.
- Extensive command line options, including the use of wildcards (*).
- Parallel background execution for commands that take a long time.
- Enhanced copy mechanism for copying files and directories to the nodes and back using synchronisation tools.
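Several of the features listed above (wildcard selection of nodes, parallel execution, and killing commands that take too long) can be sketched in a few lines. This is not the ClusterVisionOS™ parallel tool itself. The function names are hypothetical, and the sketch assumes password-less SSH access to the nodes.

```python
import fnmatch
import subprocess
from concurrent.futures import ThreadPoolExecutor


def select_nodes(all_nodes, pattern):
    """Pick node names matching a shell-style wildcard such as 'node0*'."""
    return [name for name in all_nodes if fnmatch.fnmatch(name, pattern)]


def run_over_ssh(node, command, timeout):
    """Run one command on one node; the child is killed if it exceeds timeout."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", node, command],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    except OSError as exc:  # for example, no ssh client installed
        return f"error: {exc}"
    return "ok" if proc.returncode == 0 else f"exit {proc.returncode}"


def run_parallel(nodes, command, timeout=30.0, runner=run_over_ssh):
    """Run command on all nodes concurrently and return {node: status}."""
    if not nodes:
        return {}
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        statuses = pool.map(lambda node: runner(node, command, timeout), nodes)
        return dict(zip(nodes, statuses))
```

A call such as `run_parallel(select_nodes(names, "node*"), "uptime")` would then report one status per node. The `runner` parameter exists so the logic can be exercised without a real cluster.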
Screen shot from the Parallel Tool
Power Management
Integrated with the parallel tools is the power management of the nodes using APC Switched Rack PDUs (power distribution units) and similar solutions. Power management allows full control over the power supply to the cluster nodes. All cluster nodes can be switched on or off from the master node, individually or in groups. This offers several advantages to the system administrators of our clusters.
All nodes can be switched on and off from the master node. This is especially useful in clusters with many slave nodes. Even when a slave node hangs or cannot be reached over the network for other reasons, the remote power management can power-cycle the offending slave node. In combination with the Image & Cluster Management and PXE booting environment this makes it possible to always boot up a node in the cluster, even if the software on the node is entirely corrupted.
Another reason for having Power Management is that, when power returns after a failure that switched off the whole cluster, the APC Switched Rack PDUs prevent all cluster nodes from switching on at the same instant and causing a power surge. Without them, even if all the cluster nodes are set to stay switched off by default, the sudden return of power can trip circuit breakers, because nodes that are switched off still draw enough current to trigger this. We strongly recommend using APC Switched Rack PDUs or similar solutions in clusters of all sizes.
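The benefit of not switching everything on at once can be illustrated with a simple staggered power-on schedule. The batch size and delay are hypothetical values, and how a particular PDU is actually configured is outside the scope of this sketch.

```python
def power_on_schedule(outlets, batch_size, delay_s):
    """Return (start_time_in_seconds, outlets) pairs, one per batch.

    Outlets are switched on in groups of batch_size, with delay_s seconds
    between consecutive groups, so the inrush current is spread over time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    schedule = []
    for start in range(0, len(outlets), batch_size):
        batch_number = start // batch_size
        schedule.append((batch_number * delay_s, outlets[start:start + batch_size]))
    return schedule


outlets = ["node001", "node002", "node003", "node004", "node005"]
for when, batch in power_on_schedule(outlets, batch_size=2, delay_s=5.0):
    print(f"t+{when:>4.1f}s  switch on {', '.join(batch)}")
```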
The APC switched rack power distribution unit.
Application Management

The ClusterVisionOS™ provides application, library and compiler version management through "Application Modules". Application Modules are a very useful tool for managing different versions of the same application or combination of applications. For example, users can easily choose between pre-defined and compatible combinations of applications, libraries and compilers. System administrators can release a beta version of an application and allow a subset of users to start testing. With commands like "module list" and "module add intel-14.3", users can load different tools into their environment, whether they use tcsh, csh or bash as a shell. Administrators can also define a "default" environment for their users. Users can unload the default environment and load a different one.
Application modules become particularly useful if you have multiple applications that need different versions of compiler libraries available. Some applications are only certified to work with a specific compiler version.
Advantages
Advantages of the Application Modules include:
- Different versions of the same application can be installed and used by your users.
- Different libraries that have the same command structure can be installed, like MPICH and LAM/MPI.
- If you have two different types of networks, like Gigabit Ethernet and InfiniBand, you can load the environment that belongs to either of these networks in combination with the best performing compiler. For example, a compiler command like "mpicc" will use the correct compiler and MPI library depending on which environment is loaded.
- Applications & libraries are not installed in the usual /usr/bin/ or /usr/local/bin directories, but in a special directory called /cvos/shared/apps/. In this directory each application has its own sub directory. This will keep your environment clean and manageable.
- Extensive options for creating the correct module for each application, including the location of manual pages, paths to libraries and other environment variables are available.
- It is easy to add new modules and applications to the environment.
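Conceptually, loading a module amounts to adjusting environment variables so that the chosen version is found first. The sketch below models only that idea; it is not the module command itself. The sub-directories under /cvos/shared/apps/ and the function name are hypothetical.

```python
def load_module(env, prepend):
    """Return a copy of env with each directory prepended to its path variable."""
    new_env = dict(env)
    for var, directory in prepend.items():
        current = new_env.get(var, "")
        new_env[var] = f"{directory}:{current}" if current else directory
    return new_env


# Hypothetical module definition for an MPI library.
mpich_module = {
    "PATH": "/cvos/shared/apps/mpich/bin",
    "LD_LIBRARY_PATH": "/cvos/shared/apps/mpich/lib",
    "MANPATH": "/cvos/shared/apps/mpich/man",
}

env = {"PATH": "/usr/bin:/bin"}
loaded = load_module(env, mpich_module)
print(loaded["PATH"])  # /cvos/shared/apps/mpich/bin:/usr/bin:/bin
```

Because the module's directory comes first in `PATH`, a command such as "mpicc" resolves to the version belonging to the loaded environment.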
Resource Management

The ClusterVisionOS™ offers a choice between the Sun Grid Engine (SGE) and Torque (formerly known as OpenPBS) for job queuing, and Maui for job scheduling (license terms & conditions may apply). Other queuing systems such as PBS Pro, Platform LSF and MOAB are available as a cost option. SGE and Torque/Maui are batch job and compute resource management packages which accept batch and parallel jobs, preserve and protect the job until it is run, run the job, and deliver output back to the submitter. Some of the features include automatic load-leveling, file staging, job interdependency, security and authorisation, grid connection, username mapping, job accounting, job suspension and a graphical user interface.
The Maui scheduler acts as a service to the queuing system and enforces job scheduling policies. Maui controls which jobs are run where, when and in which order. Maui includes many advanced features such as back-fill scheduling, which can significantly increase the throughput through the batch queues.
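Back-fill scheduling can be illustrated with a simplified, textbook-style sketch: when the highest-priority job has to wait, lower-priority jobs may start in the idle nodes as long as they do not delay it. This is not Maui's actual algorithm, which is considerably more elaborate. The `Job` record and function name are hypothetical, and the walltime is assumed to be an upper bound supplied by the user.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    walltime: int  # requested run time, assumed to be an upper bound


def pick_jobs(free, running, queue, now=0):
    """Decide which queued jobs to start at time `now`.

    free    -- number of idle nodes
    running -- list of (end_time, nodes) for jobs already running
    queue   -- waiting jobs in priority order
    """
    started = []
    running = list(running)
    queue = list(queue)
    # Start jobs in strict priority order for as long as they fit.
    while queue and queue[0].nodes <= free:
        job = queue.pop(0)
        free -= job.nodes
        running.append((now + job.walltime, job.nodes))
        started.append(job.name)
    if not queue:
        return started
    # The head job is blocked: find when enough nodes will be free for it.
    head = queue[0]
    available = free
    reservation = None
    for end_time, nodes in sorted(running):
        available += nodes
        if available >= head.nodes:
            reservation = end_time
            break
    if reservation is None:
        return started  # head job can never fit; this sketch does not back-fill
    spare = available - head.nodes  # nodes the head job will not need
    # Back-fill: start later jobs only if they cannot delay the head job.
    for job in queue[1:]:
        if job.nodes > free:
            continue
        finishes_in_time = now + job.walltime <= reservation
        if finishes_in_time or job.nodes <= spare:
            free -= job.nodes
            if not finishes_in_time:
                spare -= job.nodes
            started.append(job.name)
    return started


queue = [Job("A", 8, 50), Job("B", 4, 200), Job("C", 2, 60), Job("D", 2, 100)]
print(pick_jobs(free=4, running=[(100, 6)], queue=queue))  # ['C', 'D']
```

In the example, job A needs 8 nodes and must wait until time 100. Jobs C and D fit in the 4 idle nodes and finish by then, so they run in the gap, while job B would still be running at time 100 and is held back.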
Resource management is of fundamental importance to most multi-user clusters. With resource management many users can freely submit jobs to the job queues of a cluster and these jobs will be automatically redistributed across the cluster. If the cluster is "full", jobs will stay queued until there is available space.
Tight Integration
The level of integration between the resource manager and the cluster environment is important. Although all resource managers can control a batch job with a single process, none of them can deal natively with parallel jobs. Typically, the resource manager only "knows" about the first process of a parallel job and nothing about the processes that this first process creates. In the ClusterVisionOS™ the resource manager is tightly integrated with the MPI environment and therefore knows about all processes. This matters because when a job fails or is deleted, all processes associated with this job should be killed. The ClusterVisionOS™ will take care of this.
Which queuing system suits you best is something we will discuss with you before the installation of the cluster. Training on the management and usage of the queuing system can be provided. These topics are also covered in our onsite training, and before cluster handover the resource manager is configured to your requirements.
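The bookkeeping problem can be illustrated on a single host: starting from the first process of a job, every process it created, directly or indirectly, has to be found before the job can be cleaned up. The sketch below does this for a given table of parent process IDs. It does not show how the ClusterVisionOS™ integrates with MPI, which the text does not describe, and it does not cover processes started on other nodes. The function name is hypothetical.

```python
def job_processes(root_pid, parent_of):
    """Return root_pid plus all of its descendants.

    parent_of maps each process ID to the ID of its parent process.
    """
    children = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    found = set()
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        if pid in found:
            continue
        found.add(pid)
        stack.extend(children.get(pid, []))
    return found


# 100 is the job's first process; 200 belongs to an unrelated job.
parent_of = {100: 1, 101: 100, 102: 100, 103: 101, 200: 1}
print(sorted(job_processes(100, parent_of)))  # [100, 101, 102, 103]
```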
The ClusterVisionOS™ offers a choice between various queuing systems, including the Grid Engine
Cluster Monitoring

The ClusterVisionOS™ includes a number of graphical tools for the monitoring and management of clusters and server farms. The main tools are Ganglia and Nagios.
Ganglia
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualisation. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency.
We have tightly integrated Ganglia within the ClusterVisionOS™ and adapted it to work with many more types of measurements. We have also changed the behaviour of Ganglia such that it scales to many nodes without inducing a continuous load on the slave nodes. Ganglia is important for cluster management as it will assist you in spotting trends in the cluster environment over a long period. This will enable you to predict potential failures early.
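Spotting a trend and predicting when it becomes a problem can be as simple as fitting a straight line to recent samples. The sketch below is a generic least-squares fit, not part of Ganglia. The function names, sample values and threshold are hypothetical.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for a list of (time, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def time_to_threshold(samples, threshold):
    """Time at which the fitted line reaches threshold, or None if not rising."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Daily CPU temperature readings (day, degrees Celsius), rising one degree per day.
readings = [(0, 40.0), (1, 41.0), (2, 42.0), (3, 43.0)]
print(time_to_threshold(readings, threshold=50.0))  # 10.0
```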
Screen shot from Ganglia graph
Nagios
Nagios is a server and network monitoring and management tool with many features that are useful for clusters, including:
- Monitoring of network services (LDAP, NFS, SMTP, HTTP, NNTP, PING, etc.).
- Monitoring of host resources (processor load, disk and memory usage, running processes, log files, etc.).
- Monitoring of environmental factors such as temperature. This is very useful in order to monitor the temperature of the cabinets.
- Ability to define network host hierarchy, allowing detection of and distinction between hosts that are down and those that are unreachable.
- Contact notifications when service or host problems occur and get resolved (via email, SMS, or other user-defined method).
- Optional escalation of host and service notifications to different contact groups.
- Ability to define event handlers to be run during service or host events for proactive problem resolution.
- Support for implementing redundant and distributed monitoring servers.
- External command interface that allows on-the-fly modifications to be made to the monitoring and notification behaviour through the use of event handlers, the web interface, and third-party applications.
- Retention of host and service status across program restarts.
- Scheduled downtime for suppressing host and service notifications during periods of planned outages.
- Ability to acknowledge problems via the web interface.
- Web interface for viewing current network status, notification and problem history, log file, etc.
- Simple authorization scheme that allows you to restrict what users can see and do from the web interface.
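The distinction between hosts that are down and hosts that are merely unreachable, mentioned in the list above, follows from the host hierarchy: a host that does not respond is only considered down if the device it sits behind is up. The sketch below models that rule for a hierarchy in which each host has at most one parent. It is a simplification, not Nagios code, and the host names are hypothetical.

```python
from typing import Dict, Optional


def classify_hosts(parent_of: Dict[str, Optional[str]],
                   responds: Dict[str, bool]) -> Dict[str, str]:
    """Classify each host as UP, DOWN or UNREACHABLE.

    parent_of maps a host to the device it is reached through (None for the
    top of the hierarchy). The hierarchy is assumed to contain no cycles.
    """
    states: Dict[str, str] = {}

    def state(host: str) -> str:
        if host in states:
            return states[host]
        if responds.get(host, False):
            result = "UP"
        else:
            parent = parent_of.get(host)
            if parent is None or state(parent) == "UP":
                result = "DOWN"
            else:
                result = "UNREACHABLE"
        states[host] = result
        return result

    for host in parent_of:
        state(host)
    return states


parent_of = {"master": None, "switch1": "master",
             "node001": "switch1", "node002": "switch1"}
responds = {"master": True, "switch1": False,
            "node001": False, "node002": False}
print(classify_hosts(parent_of, responds))
```

Here only the switch is reported as down; the two nodes behind it are unreachable, which points the administrator at the real fault.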
Alarm Raising with SMS and Service Monitoring
An optional feature in the ClusterVisionOS™ is the possibility to raise alarms using SMS messages. Services and environment variables like cabinet temperatures are monitored using Nagios, and if certain temperature thresholds are exceeded or when a service fails, an SMS or email message can be sent to the cluster administrator. Critical services include DHCP, TFTP, LDAP and NFS.
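A threshold check of the kind described above can be sketched as follows. The return codes follow the usual Nagios plugin convention (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The thresholds and the function name are hypothetical, and sending the SMS or email is left to the monitoring system's notification mechanism.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_temperature(celsius, warn=30.0, crit=35.0):
    """Return (status_code, message) for one cabinet temperature reading."""
    if celsius is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if celsius >= crit:
        return CRITICAL, f"TEMP CRITICAL - {celsius:.1f} C (limit {crit:.1f} C)"
    if celsius >= warn:
        return WARNING, f"TEMP WARNING - {celsius:.1f} C (limit {warn:.1f} C)"
    return OK, f"TEMP OK - {celsius:.1f} C"


for reading in (24.5, 31.0, 36.2, None):
    code, message = check_temperature(reading)
    print(code, message)
```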
Screen shot from Nagios
Cluster Security

A compute cluster is an attractive target for computer hackers. Hence cluster security should be of high importance to the cluster administrator. As large clusters usually have hundreds of different users, securing the cluster is not a trivial task. ClusterVision's turnkey clusters are preconfigured with a high level of security, making attacks from your local network or the Internet very difficult.
Firewall
The first level of security is an easy-to-configure and flexible firewall that secures the master node. For this we use Shorewall. The Shorewall Firewall is a high-level tool for configuring Netfilter. It allows the system administrator to describe firewall/gateway requirements using entries in a set of configuration files. Shorewall reads those configuration files and, with the help of the iptables utility, configures Netfilter to match your requirements.
Shorewall does not use Netfilter's ipchains compatibility mode and can thus take advantage of Netfilter's connection state tracking capabilities. Using Shorewall we usually only allow SSH traffic in and out of the cluster. However, this can be easily adapted. For example, it is easy to allow the nodes in the cluster to contact software license servers that are located on your local network.
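The policy described above (allow SSH, allow selected extras such as a license server, drop everything else) can be modelled as a first-match rule list with a default action. This is a conceptual sketch only, not Shorewall configuration, and it does not model connection state tracking. The zone names and the license server port are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    action: str  # "ACCEPT" or "DROP"
    source: str  # zone the traffic comes from
    dest: str    # zone the traffic goes to
    proto: str
    port: int


def decide(rules, source, dest, proto, port, default="DROP"):
    """Return the action of the first matching rule, else the default policy."""
    for rule in rules:
        if (rule.source, rule.dest, rule.proto, rule.port) == (source, dest, proto, port):
            return rule.action
    return default


rules = [
    Rule("ACCEPT", "net", "master", "tcp", 22),     # SSH into the cluster
    Rule("ACCEPT", "master", "net", "tcp", 22),     # SSH out of the cluster
    Rule("ACCEPT", "nodes", "net", "tcp", 27000),   # hypothetical license server port
]

print(decide(rules, "net", "master", "tcp", 22))  # ACCEPT
print(decide(rules, "net", "master", "tcp", 80))  # DROP
```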
Access Restriction
The second level of security is optional: users can be limited to have access to the master node and login nodes only. Users are then not allowed to log into any of the slave nodes, and their only method of running jobs on the cluster is through the resource manager. Between the nodes in the cluster only SSH and NFS traffic is allowed.
Software Updates
The third level of security is provided through software updates. Security patches and bug fixes are available directly from the underlying Linux distribution, such as Scientific Linux, Red Hat or SuSE. These updates can be automatically downloaded and installed on the master node and the software images for the other nodes. ClusterVision also provides repositories with updates to the cluster management software and other cluster specific packages. These updates can be automatically installed using yum. Once all the updates are installed, a simple command can be used to distribute the patched images across the cluster, making your cluster secure and up-to-date.
Linux Security
Besides these three levels of security, the normal Linux security is fully configured in ClusterVisionOS™. PAM modules, shadow passwords, and other typical Linux security features are enabled. Cluster security is a central part of the onsite training provided with each turnkey cluster.
Compilers & Libraries

Each cluster is delivered with a set of compilers for creating optimally performing applications. In addition to this a number of scientific libraries are pre-installed. The ClusterVisionOS™ includes (amongst others) the following compilers, debuggers and maths libraries:
GNU Compilers and Debugger
These are the standard Linux compilers. Included are the GNU C, C++, Objective C, Fortran 77/90/95, Java and Ada compilers and debuggers.
Intel Compilers
The Intel compiler suite includes Fortran 77/90/95, C, C++ and OpenMP compilers. The Intel compilers are optimised for the Intel processors and can make optimal use of specific Intel instruction sets such as SSE2 and SSE3. (The Intel compilers are a cost option.)
Intel Math Kernel Library
The Intel Math Kernel Library (MKL) contains the complete set of LAPACK routines, the complete set of functions from the basic linear algebra subprograms (BLAS), the extended BLAS (sparse) and a small set of BLAS-like functions to compute minima for various data types. The MKL is optimised for and supports all Intel processors, including the Itanium 2. (The MKL is a cost option.)
Portland Group Compilers
The Portland Group PGI Workstation compiler suite includes Fortran 77/90/95, HPF, C, C++ and OpenMP compilers, debuggers and profilers. The PGI compilers usually produce significantly faster binaries than the GNU compilers. (The PGI compilers are a cost option.)
PathScale Compiler Suite
The PathScale compiler suite includes C, C++, Fortran 77/90/95 and OpenMP compilers and debuggers. The PathScale compilers usually produce significantly faster binaries than the GNU compilers. (The PathScale compilers are a cost option.)
ATLAS and GOTO Libraries
The ATLAS (Automatically Tuned Linear Algebra Software) and GOTO libraries are highly optimised implementations of the BLAS (Basic Linear Algebra Subprograms) linear algebra kernel library. (License restrictions may apply to the GOTO libraries.)
ScaLAPACK
ScaLAPACK, or Scalable LAPACK, is a library of high performance linear algebra routines specifically developed for distributed memory message-passing MIMD computers which use MPI and/or PVM, such as Beowulf clusters. ScaLAPACK in the ClusterVisionOS™ is optimised to work with fast interconnects like Gigabit Ethernet, Myrinet and InfiniBand.
Global Arrays
Global Arrays (GA) allows you to create large matrices spanning multiple nodes. GA is used in a number of Quantum Mechanical packages like GAMESS-UK and MolPro, and its performance is crucial. In ClusterVisionOS™ GA is tested thoroughly.
MPI & PVM User Environment
The ClusterVisionOS™ offers a choice of message passing libraries, including LAM/MPI, MPICH, Open MPI and, if applicable, an MPI library optimised for a high-speed interconnect. Most available message passing libraries can be combined with all aforementioned compilers.
For customers who require training in MPI or PVM programming, we offer parallel programming courses as part of our professional services program.
Documentation

The ClusterVisionOS™ comes with a comprehensive administrator manual. A user manual is provided in the form of a central Wiki, accessible to all our customers. The user manual Wiki, or parts thereof, can also be installed locally and adapted to local preferences. If required, the user manual Wiki can be printed.