Region-based techniques for modeling and enhancing cluster OpenMP performance

Cai, Jie

Date

2011

Authors

Cai, Jie

Abstract

Cluster OpenMP enables the use of the OpenMP shared memory programming model on clusters. Intel has released a cluster OpenMP implementation called Intel Cluster OpenMP (CLOMP). While this offers better programmability than message passing alternatives such as the Message Passing Interface (MPI), such convenience comes with overheads resulting from having to maintain the consistency of the underlying shared memory abstraction. CLOMP is no exception. This thesis introduces models for understanding the overheads of cluster OpenMP implementations like CLOMP and proposes techniques for enhancing their performance. Cluster OpenMP systems are usually implemented using page-based software distributed shared memory systems. A key issue for such systems is maintaining the consistency of the shared memory space. This forms a major source of overhead, and it is driven by detecting and servicing page faults. To understand these systems, we evaluate their performance with different OpenMP applications, and we also develop a benchmark, called MCBENCH, to characterize the memory consistency costs. Using MCBENCH, we discover that this overhead is proportional to the number of writers to the same shared page and the number of shared pages. Furthermore, we divide an OpenMP program into parallel and serial regions. Based on these regions, we develop two region-based models to rationalize the numbers and types of page faults and their associated performance costs. The models highlight the fact that the major overhead lies in servicing the type of page fault that requires data to be transferred across a network. With this understanding, we have developed three region-based prefetch (ReP) techniques based on the execution history of each region. The first ReP technique (TReP) considers temporal paging behaviour between consecutive executions of the same region. 
The second technique (HReP) considers both the temporal paging behaviour between consecutive region executions and the spatial paging behaviour within a region execution. The last technique (DReP) utilizes a novel stride-augmented run length encoding (sRLE) method to address both the temporal and spatial paging behaviour between consecutive region executions. The ReP techniques effectively reduce the number of page faults and aggregate data into larger transfers, which makes better use of the network bandwidth provided by the interconnects. All three ReP techniques are implemented in the runtime libraries of CLOMP to enhance its performance. Both the original and the enhanced CLOMP are evaluated using the NAS Parallel Benchmark OpenMP (NPB-OMP) suite and two LINPACK OpenMP benchmarks on two clusters connected with Ethernet and InfiniBand interconnects. The performance data is quantitatively analyzed and modeled. MCBENCH is used to evaluate the impact of the ReP techniques on memory consistency cost. The evaluation results demonstrate that, on average, CLOMP spends 75% and 55% of the overall elapsed time of the NPB-OMP benchmarks on Gigabit Ethernet and double data rate InfiniBand networks, respectively. These ratios are effectively reduced by approximately 60% and 40% after implementing the ReP techniques in the CLOMP runtime. For the LINPACK benchmarks, with the assistance of sRLE, DReP significantly outperforms the other ReP techniques, reducing page fault handling costs by 50% and 58% on the Ethernet and InfiniBand networks, respectively.
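
The abstract does not describe how sRLE encodes page accesses, so the following is only an illustrative sketch of what a stride-augmented run length encoding of page numbers might look like. It assumes the pages faulted in a region are available as a sorted list of distinct page numbers, and it compresses them greedily into (start, stride, count) runs. The names `srle_encode` and `srle_decode`, and the run layout, are hypothetical and are not taken from the thesis or from CLOMP.

```python
from typing import List, Tuple

# Hypothetical run layout: (first page, stride between pages, number of pages).
Run = Tuple[int, int, int]


def srle_encode(pages: List[int]) -> List[Run]:
    """Greedily compress a sorted list of distinct page numbers into runs.

    Assumption: a run covers pages start, start + stride, ...,
    start + (count - 1) * stride. A lone page is stored with stride 0.
    """
    runs: List[Run] = []
    i, n = 0, len(pages)
    while i < n:
        start = pages[i]
        if i + 1 == n:
            # Last page has no successor, so emit a single-page run.
            runs.append((start, 0, 1))
            break
        stride = pages[i + 1] - start
        count = 2
        # Extend the run while consecutive pages keep the same stride.
        while i + count < n and pages[i + count] - pages[i + count - 1] == stride:
            count += 1
        runs.append((start, stride, count))
        i += count
    return runs


def srle_decode(runs: List[Run]) -> List[int]:
    """Expand runs back into the list of page numbers they describe."""
    pages: List[int] = []
    for start, stride, count in runs:
        pages.extend(start + k * stride for k in range(count))
    return pages
```

For example, the page list `[10, 11, 12, 13, 20, 24, 28, 40]` would be stored as three runs rather than eight entries. A compact record of this kind is one plausible way for a prefetcher to remember both contiguous and regularly strided page accesses from a previous execution of a region.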