Agustín Gabriel Yabo, Olivier Richard, Bruno Bzeznik, Bogdan Robu, Eric Rutten
10:30 - 10:50 | Mon 19 Aug | Lau, 6-213 | MoA6.1
In data centers, Cloud and HPC (High-Performance Computing) systems have become increasingly variable in their behavior, particularly in performance and power consumption, and their decreasing predictability demands more runtime management. In this work, we describe results addressing autonomic administration in HPC systems for scientific workflow management through a systems and control theory approach. We propose a model described by specific parameters related to the key aspects of the infrastructure, from the Computer Science point of view, thus achieving a deterministic dynamical representation that captures the varying behaviors of the real computing system. We then propose a simple model-predictive control loop to achieve two different objectives: a) maximize cluster utilization by best-effort jobs, and b) control the file server's load due to the impact of the jobs. The accuracy of the prediction relies on a parameter estimation scheme based on the well-known EKF (Extended Kalman Filter) to adjust the predictive model to the real system, making the approach adaptive to parametric variations in the infrastructure. We show an average performance improvement of 8%, and consequently a reduction in the total computation time, when implementing the closed-loop strategy in the real system. The problem is addressed in a general way, to allow implementation on similar HPC computing platforms, as well as scalability to different infrastructures.
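To make the estimate-then-predict loop concrete, the following is a minimal sketch and not the model or controller used in this work. It assumes a hypothetical first-order file-server load model, load[k+1] = a * load[k] + b * u[k], where u is the number of best-effort jobs admitted, a is known, and b is unknown. Because this toy model is linear in b, the estimator reduces to a scalar Kalman filter, the linear special case of the EKF, and the controller uses a one-step prediction horizon. All names and numeric values are illustrative assumptions, and the simulation is noise-free.

```python
def kalman_update(b_hat, p, u, z, q=1e-4, r=1e-2):
    """One scalar Kalman step for the unknown gain b (random-walk parameter).

    Measurement model (assumed): z = b * u + noise, so H = u.
    q and r are illustrative process and measurement noise variances.
    """
    p = p + q                        # predict: parameter may drift slowly
    s = u * p * u + r                # innovation variance
    k = p * u / s                    # Kalman gain
    b_hat = b_hat + k * (z - u * b_hat)
    p = (1.0 - k * u) * p
    return b_hat, p


def choose_jobs(load, b_hat, ref, a, u_max):
    """One-step predictive control: pick u so the predicted load hits ref."""
    if b_hat <= 1e-6:                # guard against a degenerate estimate
        return 0.0
    u = (ref - a * load) / b_hat
    return min(max(u, 0.0), u_max)   # respect actuator limits


def simulate(steps=50, a=0.5, b_true=0.2, ref=5.0, u_max=100.0):
    """Closed loop on the hypothetical plant; returns final load and estimate."""
    load, b_hat, p = 0.0, 0.5, 1.0   # deliberately wrong initial guess for b
    for _ in range(steps):
        u = choose_jobs(load, b_hat, ref, a, u_max)
        next_load = a * load + b_true * u   # "real" system response
        z = next_load - a * load            # part of the response due to u
        b_hat, p = kalman_update(b_hat, p, u, z)
        load = next_load
    return load, b_hat


if __name__ == "__main__":
    final_load, final_b = simulate()
    print(f"final load = {final_load:.3f}, estimated b = {final_b:.3f}")
```

In this sketch the controller always acts on the latest estimate of b, so a change in the plant gain would be absorbed by the filter over the following steps, which is the sense in which such a loop is adaptive to parametric variations.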