POWER PROFILING, ANALYSIS, LEARNING, AND MANAGEMENT FOR HIGH-PERFORMANCE COMPUTING
As the field of supercomputing continues its relentless push toward greater speeds and higher levels of parallelism, the power consumption of these large-scale systems is steadily transitioning from a burden to a serious problem. While the machines are highly scalable, the buildings, power supplies, and other supporting infrastructure are not. Even the most power-efficient systems today consume one to two megawatts per petaflop/s. Multiply that by 1,000 to reach the next generation of supercomputer (i.e., exascale), and the power necessary just to turn the machine on is simply impractical. Thus, power has become a primary design constraint for future supercomputing systems. It is therefore of paramount importance to understand exactly how current-generation systems use power and what implications this has for future systems. As the saying goes, you can't manage what you don't measure. This work addresses several large hurdles in fully understanding the power consumption of current systems and in making actionable decisions based on this understanding. First, by leveraging environmental data collected from runs of real leadership-class applications, we analyze power consumption and temperature as they pertain to scale on a production IBM Blue Gene/Q supercomputer. Then, through the development of a new power monitoring library, MonEQ, we quantitatively study how power is consumed in major portions of the system (e.g., CPU and memory) by profiling microbenchmarks. Expanding on this, we study how scale and network topology affect power consumption for several well-known benchmarks. To increase the effectiveness of our power monitoring library, we extend it to work with many of the most common classes of hardware available in today's HPC landscape. In doing so, we provide an in-depth analysis of what data is obtainable, what the process of obtaining it is like, and how data from different systems compares.
Next, utilizing the knowledge gained from these experiences, we develop a new scheduling approach that uses power data to keep a production system's power consumption under a user-specified power cap without modification to the applications running on the system. Finally, we extend this scheduling approach to handle more than one objective, so that the scheduler can optimize on multiple criteria instead of considering system utilization alone.