Developing Scientific Software for Low-power System-on-Chip Processors: Optimising for Energy

Varghese, Anish

Date

2019

Authors

Varghese, Anish

Abstract

Energy consumption has been identified as the major bottleneck in the push to increase the scale of current High Performance Computing (HPC) systems. Consequently, there has been an increased effort to investigate the suitability of low-power hardware for HPC. Low-power systems-on-chip (LPSoCs), which are widely used in mobile and embedded contexts, typically integrate multicore Central Processing Units (CPUs) and accelerators on a single chip, offering high floating point capabilities while consuming little power. While there are merits to using such low-power systems for scientific computing, there are a number of challenges in using them efficiently.

This thesis considers three issues: (i) development of applications which are able to use all the LPSoC processing elements effectively; (ii) measurement, understanding and modelling of the energy usage of an application executing on such platforms; and (iii) strategies for deciding the optimal partitioning of an application's workload between the different processing elements in order to minimise energy-to-solution. Each of these issues is investigated in the context of three applications: two core computational science kernels, namely matrix multiplication as an exemplar of dense linear algebra and stencil computation as an exemplar of grid-based numerical methods, and the complex block tridiagonal benchmark from the multizone NAS parallel benchmark suite.

To study the challenges associated with the development of scientific software for LPSoCs, two fundamentally different systems are considered: the Epiphany-IV Network-on-chip (NoC) and the Tegra systems. The former was a Kickstarter project which aimed to design an LPSoC that could scale to over 4096 cores with a peak performance in excess of 5 trillion single-precision floating point operations per second (TFLOP/s) while operating at an energy efficiency of 70 GFLOP/s per Watt. By contrast, the latter is a product range from the multinational company NVIDIA that combines its popular Graphics Processing Unit (GPU) technology with a general purpose ARM processor in a mass market LPSoC. This thesis reports the implementation of both the matrix multiplication and stencil kernels on both systems, and compares their performance, energy usage and programming challenges with those on conventional systems.

In order to analyse the energy efficiency of applications running on an LPSoC, the ability to measure its energy usage is crucial. However, very few platforms have internal sensors which provide details of energy usage, and when they do, the measurements obtained using such sensors are usually low-resolution and intrusive. This thesis presents a high-resolution, non-intrusive energy measurement framework along with an Application Programming Interface (API) which enables an application to obtain real-time measurements of its energy usage at the function level. Based on these measurements, a simple energy usage model is proposed to describe the energy usage as a function of how the workload is partitioned between the different computing devices. This model predicts the conditions under which energy minimisation occurs when using all available computing devices. This prediction is tested and demonstrated for the matrix multiplication and stencil kernels.

Given access to high-resolution, real-time energy measurements and a model describing energy usage as a function of how an application is partitioned between the available computing devices, this thesis explores various strategies for runtime energy tuning. Different scenarios are considered: offline pre-tuning; tuning based on estimates gained from solving a small fraction of the complete problem; and tuning based on iteratively solving fractions of the entire problem a small number of times, on the expectation that the full solution involves many such repetitions. The applicability of these strategies to the model kernels is discussed and tested.
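
To make the idea of energy as a function of workload partitioning concrete, the following is a minimal illustrative sketch in Python. It is not the model proposed in the thesis: the two-device setup, the `Device` fields, the `energy_to_solution` and `best_split` functions, and all numeric values are assumptions chosen only for illustration. The sketch assumes each device draws a constant active power while computing and a constant idle power while waiting for the other device, plus a constant base power for the rest of the board.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical device description (all values are illustrative)."""
    rate: float          # work units processed per second
    active_power: float  # watts drawn while computing
    idle_power: float    # watts drawn while waiting for the other device


def energy_to_solution(f, work, cpu, gpu, base_power):
    """Energy in joules when a fraction f of the work goes to the GPU.

    Both devices start together; the run ends when the slower one finishes.
    """
    t_gpu = f * work / gpu.rate
    t_cpu = (1.0 - f) * work / cpu.rate
    total = max(t_gpu, t_cpu)
    e_gpu = gpu.active_power * t_gpu + gpu.idle_power * (total - t_gpu)
    e_cpu = cpu.active_power * t_cpu + cpu.idle_power * (total - t_cpu)
    return base_power * total + e_gpu + e_cpu


def best_split(work, cpu, gpu, base_power, steps=1000):
    """Grid search over f in [0, 1] for the lowest modelled energy."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda f: energy_to_solution(f, work, cpu, gpu, base_power),
    )


# Assumed example values, not measurements from any real platform.
cpu = Device(rate=1.0, active_power=2.0, idle_power=0.5)
gpu = Device(rate=4.0, active_power=6.0, idle_power=1.0)

f_star = best_split(work=100.0, cpu=cpu, gpu=gpu, base_power=3.0)
print(f"modelled best GPU fraction: {f_star:.3f}")
```

Under these assumed values the modelled energy is piecewise linear in the split, so the minimum falls either at one end (a single device does all the work) or at the point where both devices finish together. Raising the assumed CPU active power far enough moves the minimum to the GPU-only end, which illustrates why using every available device is energy-optimal only under certain conditions.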