The rise of artificial intelligence (AI) has made this a truly exciting time in the world of computing. But with excitement and possibility come questions and uncertainty, and many organisations with data centres are scrambling to add AI capacity to existing operations, says Jon Abbott, technologies director – global strategic clients at Vertiv.
At the same time, those planning new data centres are wondering exactly how much AI capacity they should include in their blueprints. What will the rack architecture look like and what power and cooling will be needed?
Answering these types of questions will be difficult, but one way to make it easier is to design critical digital infrastructure in collaboration with leaders in AI development. In a world where change is both rapid and certain, the wrong approach could mean investing in a solution that is out of date shortly after it has been deployed. By being smart and deliberate, organisations can avoid designing in obsolescence and instead develop something more efficient, reliable and long-lasting.
The need to support AI is already bringing major changes in data centre infrastructure, but the biggest developments are yet to come. Over the remainder of the decade, AI will be the driving force behind a wave of change, with new Graphics Processing Unit (GPU), row and rack architectures all reshaping IT infrastructure and critical digital infrastructure.
GPU architecture
Microprocessors are one of the key elements of an IT infrastructure that enables AI. The chips used to train AI models require a significant amount of power and generate a corresponding amount of heat. GPUs are the current chip of choice for running AI workloads because they support parallel compute workloads and can be up to 100x more power-efficient than general-purpose CPUs for these tasks.
However, compute requirements will continue to increase exponentially – far outpacing any increase in chip efficiency. Whilst GPUs produce much more compute for the same amount of power, the demands of AI mean that total power consumption will still rise. It is therefore important to know what chip makers are developing in order to anticipate and meet power challenges.
Row architecture
For a long time, data centres have worked with 3 MW blocks directly coupled to 3 MW generators over a 4,000-amp bus. Those days are ending. With the rise of AI, it is now common to talk about 10 MW blocks and even 20 MW blocks.
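The traditional pairing of a 3 MW block with a 4,000-amp bus can be sanity-checked with the three-phase power formula, P = √3 × V × I × PF. The sketch below assumes a 415 V line-to-line bus and unity power factor – figures not stated in the article, but common for UK/EU distribution:

```python
import math

def three_phase_power_mw(v_ll: float, amps: float, pf: float = 1.0) -> float:
    """Power in MW delivered by a three-phase feed: P = sqrt(3) * V_LL * I * PF."""
    return math.sqrt(3) * v_ll * amps * pf / 1e6

# Assumed 415 V line-to-line bus at the article's 4,000 A rating:
# works out to roughly the traditional 3 MW block.
print(round(three_phase_power_mw(415, 4000), 2))  # ≈ 2.88 MW
```

The same arithmetic shows why 10 MW and 20 MW blocks imply a step change in bus ratings, distribution voltage, or both.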
This means the industry faces a huge increase in the size and scale of the power systems needed, and the immediate challenge is finding where that power will come from. AI will also bring spiky loads associated with training generative AI models, alongside demand for continuous power to meet cooling needs.
Rack architecture
The rise of AI means that rack densities will climb steeply: rack architecture will increase from around 30 kW today to 300 kW-600 kW densities in the near term, and possibly to 1 MW per rack and above by 2030.
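Those density figures translate directly into how many racks a fixed power block can feed. A minimal sketch using the article's numbers (and ignoring cooling and distribution overhead, which would reduce the count further):

```python
def racks_per_block(block_mw: float, rack_kw: float) -> int:
    """How many racks of a given density a power block can feed,
    ignoring cooling and distribution overhead."""
    return int(block_mw * 1000 // rack_kw)

# Densities from the article: ~30 kW today, 300-600 kW near-term, 1 MW by 2030.
for density_kw in (30, 300, 600, 1000):
    print(density_kw, racks_per_block(3, density_kw))
# A 3 MW block drops from 100 racks at 30 kW to just 3 racks at 1 MW.
```

The same block that once powered a full data hall of racks soon feeds only a handful, which is why row and rack architecture must change together.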
Organisations that already have significant infrastructure investments in place should be thinking about how to make the most of those assets while preparing for a new scale of demand. For many businesses, this means retrofitting. Careful consideration is needed of the densities the retrofits must support and the cooling technologies they will require.
Organisations should plan for increased rack densities by deploying high-density rack power distribution units (PDUs) to manage power distribution more effectively in quickly evolving data centres.
Resilience to protect assets
There has been immense investment in valuable next-generation computing chips and server systems, much of it related to AI. This means that keeping systems up and running is more critical than ever. To get the most out of these assets, IT infrastructure will need to run 24/7, making resilience a top priority. In the future, resilience will be both more important and more complex.
AI workloads often require different levels of resilience for different functions. For example, the resilience needed for training is different to the resilience needed for inference. In the future, resilience will not always be about protecting the application but about protecting the value of the asset.