The rising prominence of large language models (LLMs) has driven a surge in demand for datacenter GPU compute capacity, prompting cloud providers and enterprises to expand their datacenter footprint. However, the growing power requirements of these ever-larger models pose a significant challenge.
A recent study reveals a notable opportunity to improve the power efficiency of LLM clusters through power oversubscription. This approach increases the number of servers that can be deployed in an existing datacenter and shortens deployment time, an important benefit given how long it takes to build new datacenters.
The researchers analyzed the power consumption patterns of several LLM configurations, distinguishing between inference and training workloads. Their findings show that LLM clusters serving inference do not sustain high average or peak power consumption. This conclusion, corroborated by data from production LLM clusters, suggests there is considerable headroom for applying power oversubscription to inference workloads.
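To make the kind of characterization described above concrete, here is a minimal sketch of sampling per-GPU power draw and comparing average and peak utilization against the board power limit. It assumes bare-metal access to GPU telemetry via the pynvml bindings and is purely illustrative; it is not taken from the study's tooling.

```python
# Illustrative sketch: sample GPU power via NVML (pynvml) and summarize
# average vs. peak utilization relative to the enforced power limit.
import time
import pynvml

def sample_power_utilization(gpu_index=0, samples=60, interval_s=1.0):
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # mW -> W
        readings_w = []
        for _ in range(samples):
            readings_w.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(interval_s)
        avg_w = sum(readings_w) / len(readings_w)
        peak_w = max(readings_w)
        return {
            "avg_w": avg_w,
            "peak_w": peak_w,
            "avg_util": avg_w / limit_w,   # low values indicate oversubscription headroom
            "peak_util": peak_w / limit_w,
        }
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    print(sample_power_utilization())
```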
Yet challenges arise because GPUs expose only limited telemetry and power controls in virtualized environments, which complicates building a reliable power oversubscription system.
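For context, these are the kinds of NVML controls an oversubscription system would typically rely on to shed power when a cluster approaches its budget; the sketch below is not POLCA's implementation, and both calls generally require administrative privileges and are often unavailable to guests in virtualized GPU environments, which is precisely the difficulty noted above.

```python
# Illustrative sketch of GPU power-management knobs exposed through NVML (pynvml).
import pynvml

def cap_gpu_power(gpu_index, target_watts):
    """Lower the board power limit, clamped to the range the device supports."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        target_mw = max(min_mw, min(max_mw, int(target_watts * 1000)))
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)
        return target_mw / 1000.0
    finally:
        pynvml.nvmlShutdown()

def cap_gpu_frequency(gpu_index, min_mhz, max_mhz):
    """Alternative knob: lock the SM clock range, trading throughput for power."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        pynvml.nvmlDeviceSetGpuLockedClocks(handle, min_mhz, max_mhz)
    finally:
        pynvml.nvmlShutdown()
```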
Enter POLCA, a framework for power oversubscription in GPU clusters. In simulations that replay real-world power patterns emulated with open-source models, POLCA enables the deployment of roughly 30% more servers in a GPU cluster dedicated to inference, with negligible performance degradation.
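As a back-of-the-envelope illustration of where such gains come from: if servers are provisioned against their observed peak power (with power capping as the safety net) rather than their nameplate power, the unused margin translates directly into extra servers. The numbers below are illustrative placeholders, not measurements from the study.

```python
# Illustrative arithmetic: extra servers that fit in the same power budget when
# provisioning for observed peak power instead of nameplate power.
def extra_server_fraction(provisioned_kw_per_server, observed_peak_kw_per_server):
    return provisioned_kw_per_server / observed_peak_kw_per_server - 1.0

# Example: if inference servers peak at ~77% of their provisioned power,
# roughly 30% more servers fit under the same budget.
print(f"{extra_server_fraction(10.0, 7.7):.0%}")
```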