QsNetII Features and Benefits
The unique features of QsNetII that make it particularly suited as an interconnect for high performance computing clusters are:
Full, pageable 64-bit virtual memory support
Multiple, virtual, programmable network interfaces
Ultra-low latency short messaging
Optimized support for scalable global operations
Ability to scale the number of network connections with the number of CPUs for SMP nodes
Proven scalability to many thousands of processors
In addition, Quadrics is unique amongst high performance interconnect vendors in providing a complete, verified and fully supported solution encompassing both hardware and software. Quadrics provides a single open source software distribution that includes drivers, modules for using QsNetII within the OS, user-level libraries (MPI and Shmem), switch network management software and diagnostics.
Full pageable 64-bit virtual memory support
All Elan4 memory references are to full 64-bit virtual memory addresses. This enables zero-copy send and receive from anywhere in a process's memory space, even on systems with very large amounts of memory. Support for pageable memory means that memory accessed by the network adapter does not have to be locked down, so the whole of a process's address space can be enabled for RDMA. Locking down memory, by contrast, has a substantial CPU overhead and limits the amount of memory that can be enabled for RDMA operations at any time.
The Elan keeps its own set of page tables in local memory on the network interface card. This ensures that translations can be fetched at low latency and without consuming any host bus bandwidth. The Elan page tables are kept in sync with the main processor page tables by software, which provides the flexibility to support different CPU architectures with different page table formats. Elan4 supports two page sizes to allow for efficient support of large pages in systems with large amounts of physical memory.
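The idea of a software-synchronized translation table with two page sizes can be illustrated with a small model. This is only a sketch: the class name, the method names and both page sizes are hypothetical and are not taken from the Elan4 hardware or the Quadrics drivers.

```python
# Hypothetical page sizes, chosen only for illustration.
SMALL_PAGE = 8 * 1024
LARGE_PAGE = 4 * 1024 * 1024


class NicPageTable:
    """Toy model of a translation table held in NIC-local memory."""

    def __init__(self):
        # Maps (page_size, virtual_page_number) -> physical base address.
        self.entries = {}

    def load(self, vaddr, paddr, page_size):
        """Host software installs or updates a mapping."""
        if vaddr % page_size or paddr % page_size:
            raise ValueError("addresses must be page aligned")
        self.entries[(page_size, vaddr // page_size)] = paddr

    def unload(self, vaddr, page_size):
        """Host software removes a mapping, e.g. when the page is paged out."""
        self.entries.pop((page_size, vaddr // page_size), None)

    def translate(self, vaddr):
        """Return the physical address, or None if no mapping is loaded."""
        for page_size in (LARGE_PAGE, SMALL_PAGE):
            base = self.entries.get((page_size, vaddr // page_size))
            if base is not None:
                return base + vaddr % page_size
        return None
```

In this model a lookup never touches the host: the table lives with the adapter, and the host only calls `load` and `unload` when its own page tables change. How the real hardware handles a missing translation is not covered by this sketch.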
Programmable, virtual network interface
QsNetII makes use of the Elan4 thread processor and the Elan4 event engine to offload processing from the main CPU. Elan4 functional units are fully virtual in operation, so that each application can use its own network protocols on its own virtual network interface. This has several advantages:
Multiple versions of protocols can be run simultaneously on the same network interface, allowing, for example, a release and beta release version of MPI to be in use at the same time.
User code can be executed on the network interface without compromising security, allowing experimental code to be safely tested on a production system.
There is no layering of protocols, which would otherwise add overhead and latency.
The Elan4 thread processor and event engine are used to handle asynchronous communication operations, avoiding the need for interrupts or a dedicated handling thread on the main CPU. Examples of operations that benefit from this are:
MPI sender/tag match performed by the thread processor
Gather operations performed by the event engine
Reduction operations performed by the thread processor
Offloading such operations to the Elan leaves the main CPU(s) free. It also reduces susceptibility to OS noise (a main CPU not being available when one is required), which improves scalability on large systems.
MPI requires non-blocking operations to progress without the application making further MPI calls. The Elan4 thread processor manages progression independently of the main CPU. InfiniBand does not have this capability and must interrupt the main CPU or run a user-level thread in order to achieve independent progression. One of the most commonly used MPI implementations for InfiniBand (MVAPICH) does not meet the MPI standard in this respect: the user must make MPI calls in order to progress outstanding communication requests.
Other networks are capable of performing RDMA read and write operations that overlap communication and computation. However, before RDMA can be used to transfer a large message, the sender must know the target address: it must send a request to the receiving process and wait for a reply containing that address. This greatly reduces the scope for overlapping communication and computation. QsNetII provides full overlap. The sending process uses a queuing DMA to transmit the envelope information and a small payload to a hardware queue managed by the thread processor in the receiving node, which then completes the transfer.
Ultra-low latency short messages
The Elan4 Short Transaction Engine (STEN) is specifically designed to support short put operations. The main CPU simply writes the address, data and virtual process to a command queue, leaving the STEN to manage the network operations and set a completion event if required. This approach significantly increases the issue rate of short put and get operations, which is particularly important for operations that require large numbers of short messages, such as scatter-gather operations.
Support for scalable global operations
QsNetII provides hardware support for broadcast to a range of nodes. A broadcast DMA delivers data to a range of nodes in the same time it takes to send it to just one. Broadcast packets are routed to a switch high enough in the network to span the target range. The switches then broadcast the packets over multiple output links and combine the acknowledgements returned by the destination nodes. The source receives a single acknowledge confirming that the operation is complete on all nodes.
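To make "a switch high enough in the network to span the target range" concrete, the sketch below computes that height for a simple tree of switches. It assumes a k-ary tree with nodes numbered left to right; the function name and the default arity of 4 are assumptions for illustration, not details given in this document.

```python
def broadcast_level(first, last, arity=4):
    """Number of switch stages a packet climbs before one switch spans
    every node in first..last (0 means the range is a single node)."""
    if first < 0 or first > last:
        raise ValueError("expected 0 <= first <= last")
    level = 0
    # Climb one stage at a time until both ends share the same ancestor.
    while first != last:
        first //= arity
        last //= arity
        level += 1
    return level
```

Under these assumptions, nodes 4 to 7 sit under one first-stage switch (`broadcast_level(4, 7)` is 1), while the adjacent nodes 3 and 4 fall under different first-stage switches and need a second stage (`broadcast_level(3, 4)` is 2).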
QsNetII provides hardware support for network conditionals, for example, testing the value stored at a given virtual address across a range of nodes. These functions are used in our optimized collectives (libelan) and hence in MPI/Shmem.
Global reduction functions can be implemented directly on the Elan4 thread processor. The thread processor can perform efficient emulation of floating point operations such as fadd and fcompare, allowing global reduction operations such as GMAX and GSUM to be implemented without CPU intervention.
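A global reduction such as GSUM or GMAX only needs a pairwise add or compare applied repeatedly, which is why fadd and fcompare are sufficient. The sketch below shows one common way to organize this, a pairwise tree; the tree shape and the `tree_reduce` helper are assumptions, since this document does not describe how libelan schedules its reductions.

```python
import operator


def tree_reduce(values, op):
    """Combine one value per node pairwise, round by round.

    Returns (result, rounds), where rounds is the number of combining steps.
    """
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    rounds = 0
    while len(level) > 1:
        # Combine neighbours; an unpaired last value moves up unchanged.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


# GSUM and GMAX expressed with the two primitive operations.
gsum = lambda values: tree_reduce(values, operator.add)
gmax = lambda values: tree_reduce(values, max)
```

The number of rounds grows with the logarithm of the node count, so in this model each node's network processor performs only a handful of additions or comparisons per reduction.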
Multiple network connections
In systems constructed from high CPU count SMP nodes it is necessary to have multiple network connections in order to maintain a reasonable compute to communications ratio. A QsNetII system can have up to 16 parallel networks, or "rails".
This additional communications resource is utilized in a number of ways. The libelan library transparently stripes put, get and message passing operations over all available rails. Elan kernel comms makes use of multiple network rails to increase the bandwidth available to kernel services (Lustre and IP over Elan in particular). It also provides transparent failover in the event of nodes becoming disconnected from one rail.
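Striping can be pictured as cutting one transfer into a chunk per available rail. The sketch below uses an even split, which is an assumption: the actual policy used by libelan is not described here, and the `stripe` function is hypothetical.

```python
def stripe(length, rails):
    """Split a transfer of `length` bytes across the given rail ids.

    Returns a list of (rail, offset, size) tuples covering the whole
    transfer; rails that would receive zero bytes are omitted.
    """
    rails = list(rails)
    if length < 0 or not rails:
        raise ValueError("need a non-negative length and at least one rail")
    base, extra = divmod(length, len(rails))
    chunks = []
    offset = 0
    for i, rail in enumerate(rails):
        # The first `extra` rails carry one additional byte each.
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append((rail, offset, size))
        offset += size
    return chunks
```

Passing only the rails that are currently usable, for example `stripe(10, [0, 2])` after rail 1 is lost, gives a simple picture of failover: the same transfer is re-split over the rails that remain.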
Design for scalability
QsNetII is a connectionless network. All processes in a parallel program are given the same single capability, describing and controlling their right to access each other's virtual address space. In contrast, InfiniBand is connection-based: two processes that wish to communicate must go through a connection establishment phase (for each queue pair) or a key exchange (for RDMA) before data can be sent or received.
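A little arithmetic shows why per-peer connections limit scaling. The sketch below assumes the worst case, in which every process communicates with every other process and each pair needs exactly one connection; real applications and connection-based MPI implementations may set up fewer. The function name is hypothetical.

```python
def connection_state(nprocs):
    """State needed for a fully connected job of `nprocs` processes."""
    if nprocs < 1:
        raise ValueError("need at least one process")
    return {
        # Connection-based: one connection per peer, set up before use.
        "connections_per_process": nprocs - 1,
        "connections_in_job": nprocs * (nprocs - 1) // 2,
        # Connectionless: every process holds the same single capability.
        "capabilities_per_process": 1,
    }
```

For a 4096-process job this model gives 4095 connections per process and 8,386,560 in total, while the capability count per process stays at one regardless of job size.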
Conclusions
Quadrics focuses on high-end HPC clusters. Our products are designed solely for this market. We have world-leading experience in the design, build, installation and support of interconnects and associated software for production clusters. Quadrics hardware and software is an integral part of many of the leading production HPC clusters, including Thunder, a 4096-way Itanium II system at Lawrence Livermore National Laboratory, the most powerful system in the US and the second most powerful system in the world. Quadrics customers can rely on our ability to install these systems, get them accepted and support them throughout their lifetime.
Summary
| Features | Benefits |
| --- | --- |
| High speed, proprietary link | Implements memory-based operations for read, write, lock. |
| Elan 4 NIC ASIC | Offloads communications tasks from the main processor. |
| Elite 4 switch ASIC | Component for large scale networks > 4096 Nodes. |
| Patch Free Kernel | Compatible with Red Hat, SuSE, Debian®. |
| Redundant paths in network | Minimize congestion and provide High Availability. |
| CRC Checking on every packet | Production Supercomputer RAS. |
| QsNet Switch Packaging | Price/Performance for product markets. |

| Unique Features | Benefits |
| --- | --- |
| Full 64-bit NIC | Supports large memory nodes consistent with current 64-bit commodity architectures. |
| Full VM model | Avoids the need to lock down memory. |
| Programmable NIC | Provides for optimized support of higher level message passing APIs and local processing where appropriate. |
| Network barrier and broadcast | Efficient support for critical parallel application primitives. |
| Transparent rail striping | Bandwidth scales with the size of the SMP node. |
| Connectionless network | Avoids scaling limitations. |
| Short Transaction ENgine | Provides for ultra-low latency. |