A Control-Theory Approach for Cluster Autonomic Management: Maximizing Usage While Avoiding Overload

Agustín Gabriel Yabo¹, Olivier Richard², Bruno Bzeznik², Bogdan Robu, Eric Rutten³

¹INRIA
²Univ. Grenoble Alpes
³INRIA Grenoble - Rhone-Alpes

Details

10:30 - 10:50 | Mon 19 Aug | Lau, 6-213 | MoA6.1

Session: Predictive Control 1

Abstract

In data centers, Cloud and HPC (High-Performance Computing) systems have become increasingly variable in their behavior, particularly in aspects such as performance and power consumption, and their growing unpredictability demands more runtime management. In this work, we describe results addressing autonomic administration in HPC systems for scientific workflow management through a systems and control theory approach. We propose a model described by specific parameters related to the key aspects of the infrastructure, from the Computer Science point of view, thus achieving a deterministic dynamical representation that captures the varying behaviors of the real computing system. We then propose a simple model-predictive control loop to achieve two different objectives: a) maximize cluster utilization by best-effort jobs, and b) control the file server's load due to the impact of the jobs. The accuracy of the prediction relies on a parameter estimation scheme based on the well-known EKF (Extended Kalman Filter), which adjusts the predictive model to the real system, making the approach adaptive to parametric variations in the infrastructure. We show an average performance improvement of 8%, and consequently a reduction in the total computation time, when implementing the closed-loop strategy in the real system. The problem is addressed in a general way, to allow implementation on similar HPC computing platforms, as well as scalability to different infrastructures.
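
As a rough illustration of the kind of loop the abstract describes, the sketch below pairs an online parameter estimate with a one-step predictive choice of how many best-effort jobs to admit. It is not the authors' model or controller. The first-order load model y[k+1] = A*y[k] + b*u[k], the constants, and the function names are hypothetical and chosen only for this example. Because this toy model is linear in the single unknown gain b, a plain scalar Kalman filter stands in for the EKF mentioned above, and the prediction horizon is one step.

```python
import random

# All values below are hypothetical and chosen only for illustration.
A = 0.8          # assumed known decay of file-server load per step
U_MAX = 60       # assumed maximum number of best-effort jobs
LOAD_REF = 6.0   # assumed target file-server load


def kalman_update(b_hat, p, y_prev, u_prev, y_now, q=1e-6, r=0.0025):
    """One scalar Kalman step for the unknown gain b in
    y[k+1] = A*y[k] + b*u[k] + noise (b modelled as a slow random walk)."""
    p = p + q                          # prediction: uncertainty grows slightly
    if u_prev == 0:
        return b_hat, p                # no jobs ran, so nothing to learn about b
    innovation = y_now - (A * y_prev + b_hat * u_prev)
    s = u_prev * p * u_prev + r        # innovation variance
    k = p * u_prev / s                 # Kalman gain
    return b_hat + k * innovation, (1.0 - k * u_prev) * p


def choose_jobs(b_hat, y_now):
    """One-step predictive choice: the largest job count whose predicted
    next load does not exceed LOAD_REF, clipped to [0, U_MAX]."""
    if b_hat <= 0:
        return 0
    u = (LOAD_REF - A * y_now) / b_hat
    return max(0, min(U_MAX, int(u)))  # truncate so the prediction stays at or below target


def simulate(steps=200, b_true=0.05, seed=0):
    """Closed loop against a simulated plant whose true gain is unknown
    to the controller. Returns a list of (jobs, load, b_hat) tuples."""
    rng = random.Random(seed)
    y, b_hat, p = 0.0, 0.2, 1.0        # deliberately poor initial guess for b
    history = []
    for _ in range(steps):
        u = choose_jobs(b_hat, y)
        y_next = A * y + b_true * u + rng.gauss(0.0, 0.05)
        b_hat, p = kalman_update(b_hat, p, y, u, y_next)
        y = y_next
        history.append((u, y, b_hat))
    return history


if __name__ == "__main__":
    jobs, load, b_hat = simulate()[-1]
    print(f"jobs={jobs} load={load:.2f} b_hat={b_hat:.3f}")
```

In this toy setting the two objectives reduce to a single rule: admit as many best-effort jobs as the predicted load allows. A longer horizon or a genuinely nonlinear model would call for a full MPC formulation and an EKF rather than these simplified stand-ins.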