Khoshkbar Foroushha, Ali Reza
Description
Efficiently and effectively processing large volumes of
data (often at high velocity) using an optimal mix of
data-intensive systems (e.g., batch processing, stream
processing, NoSQL) is the key step in the big data
value chain. The availability and affordability of these
data-intensive systems as cloud managed services (e.g.,
Amazon Elastic MapReduce, Amazon DynamoDB) have enabled data
scientists and software engineers to deploy versatile data
analytics flow applications, such as click-stream analysis
and collaborative filtering, with less effort. Although easy to
deploy, run-time performance and elasticity management of
these complex data analytics flow applications has emerged
as a major challenge. As we discuss later in this
thesis, data analytics flow applications combine multiple
programming models to perform specialized and pre-defined
sets of activities, such as ingestion, analytics, and storage of
data. To support users across such heterogeneous workloads,
where they are charged for every CPU cycle used and every byte of data
transferred in or out of the cloud datacenter, we need a set of
intelligent performance and workload management techniques and
tools. Our research methodology investigates and develops these
techniques and tools by significantly extending well-known
formal models from other disciplines of computer
science, including machine learning, optimization, and control
theory.
To this end, this PhD dissertation makes the following
core research contributions: a) it investigates novel workload
prediction models (based on machine learning techniques, such
as Mixture Density Networks) to forecast how the performance
parameters of data-intensive systems are affected by run-time
variations in dataflow behaviour (e.g., data volume, data
velocity, query mix); b) it investigates a control-theoretic approach
to managing the elasticity of data-intensive systems so that
service level objectives are met. For the former (a),
we propose a novel application of Mixture Density Networks to
distribution-based resource and performance modelling of both
stream and batch processing data-intensive systems. We
argue that a distribution-based resource and performance
modelling approach, unlike existing single-point
techniques, can predict the whole spectrum of
resource usage and performance behaviours as probability
distribution functions, and therefore provides more valuable
statistical measures about system performance at
run-time. To demonstrate the usefulness of our technique, we
apply it to the following workload management activities:
i) setting predictable auto-scaling policies, which highlights the
potential of distribution prediction for the consistent
definition of cloud elasticity rules; and ii) designing a
predictive admission controller that efficiently
admits or rejects incoming queries based on
probabilistic service-level-agreement compliance goals.
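To illustrate why distribution prediction is more useful than a single-point forecast, the sketch below (a hypothetical example, not the thesis implementation; all function names, mixture parameters, and thresholds are assumed) computes the probability that query latency exceeds a service level objective under a Gaussian mixture of the kind a Mixture Density Network outputs, and uses that tail probability for an admission decision:

```python
import math

def gaussian_tail(x, mu, sigma):
    """P(X > x) for one Gaussian mixture component, via the error function."""
    return 0.5 * (1.0 - math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_exceedance(x, weights, means, sigmas):
    """P(latency > x) under a Gaussian mixture, as an MDN would predict."""
    return sum(w * gaussian_tail(x, m, s)
               for w, m, s in zip(weights, means, sigmas))

def admit_query(slo_ms, weights, means, sigmas, max_violation_prob=0.05):
    """Admit only if the predicted probability of an SLO breach is small."""
    return mixture_exceedance(slo_ms, weights, means, sigmas) <= max_violation_prob

# Hypothetical two-component predicted latency distribution (milliseconds):
# a fast common path and a slower heavy-tail path.
w, mu, sd = [0.7, 0.3], [120.0, 300.0], [20.0, 60.0]
p_violate = mixture_exceedance(250.0, w, mu, sd)  # P(latency > 250 ms)
```

A single-point (mean) prediction of roughly 174 ms would suggest the 250 ms SLO is safe, whereas the mixture reveals a substantial violation probability coming from the slower component.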
For the latter (b), we apply advanced techniques from control and
optimization theory to design an adaptive control scheme
that continuously detects and self-adapts to workload
changes in order to meet the users' service level objectives.
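The kind of feedback loop involved can be sketched as follows. This is a toy integral-style controller with an assumed latency model, not the control scheme developed in the thesis; the gain, bounds, and plant equation are all illustrative assumptions:

```python
def integral_controller(target, measured, nodes, gain=0.02,
                        min_nodes=1.0, max_nodes=64.0):
    """One elasticity step: scale out when measured latency exceeds the
    target, scale in otherwise; integral action accumulates in `nodes`."""
    error = measured - target            # positive error => under-provisioned
    nodes = nodes + gain * error
    return min(max(nodes, min_nodes), max_nodes)

# Toy plant (hypothetical): latency grows with load, shrinks with node count.
def plant_latency(load, nodes):
    return 50.0 + 8.0 * load / nodes

target_ms, nodes, load = 150.0, 2.0, 60.0
for _ in range(200):                     # closed loop drives latency to the SLO
    latency = plant_latency(load, nodes)
    nodes = integral_controller(target_ms, latency, nodes)
```

With these numbers the loop settles where the plant latency equals the 150 ms target (about 4.8 nodes); a real deployment would round to whole instances and handle measurement noise and workload drift, which is where the adaptive elements come in.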
Moreover, we develop a workload management tool
called Flower for end-to-end elasticity management of
different data-intensive systems across data analytics
flows. Through extensive numerical and empirical evaluation,
we validate the proposed models, techniques, and tools.