Data bloat happens. The plethora of modern IT systems running enterprise applications, with data services spanning the breadth of the planet’s cloud networks, means that data growth is, by some measures, out of control. Although we have ‘coping mechanisms’ in the form of data lakes (repositories designed to hold unstructured data flows that we are not initially able to process or use to any great functional degree) and data warehouses (where we have been able to apply a degree of order to our storage), we still live in a world of data bloat.
With Large Language Models (LLMs) now proliferating to serve the needs of generative Artificial Intelligence, many agree that the data overload situation will only be amplified further. Thanks to data protection and storage policies, video streams, online games and so on, data volumes have skyrocketed in recent years. And while cloud-based storage services and on-premises data stores are now comparatively cheap, datacenters still need physical space and a lot of energy.
Time for a data diet?
Data protection and recovery vendor Cohesity is among those highlighting the current overabundance of data. The company has compiled industry data highlighting the problems with datacenter energy and suggesting that efficiency gains are not keeping pace with data growth – a trend with direct cost and sustainability implications unless organizations start to lose data. Lose some of their data? Yes… it might just be time to shed a few pounds and go on a data diet.
The International Bureau of Weights and Measures is the oldest international scientific institution in the world. Since it was founded in 1875, the organization has been tasked with promoting a globally standardized system of units. At its last quadrennial conference (the member states meet every four years – this is not a place for quick fixes or snap decisions), with representatives from 62 member states in attendance, the committee resolved that, in view of the rapidly increasing volumes of data, it would introduce two new prefixes for data quantities for the first time since 1991: ronnabytes and quettabytes.
A ronnabyte has 27 zeros; a quettabyte has 30. Written out, the latter looks like this: 1,000,000,000,000,000,000,000,000,000,000.
If we wanted to store a quettabyte on modern smartphones, we would need so many devices that, lined up end to end, they would stretch around 93 million miles. This, Cohesity points out, corresponds roughly to the distance from the Earth to the sun. The company reminds us that the reason for the gigantic new data entities is the rapid growth in global data volumes. While people around the world generated just under two zettabytes of data in 2010, by 2022 this figure had risen to almost 104 zettabytes.
Environmental impact
“What all of this points to is a ‘big plate’ of data, hence the suggestion for a data diet,” said Mark Molyneux, chief technology officer for EMEA at Cohesity. “By this (admittedly somewhat cheeky) term, we mean that enterprises should use contemporary data classification and application analysis techniques to more directly distinguish mission-critical data from other residual information streams that – although still subject to appropriate levels of security and compliance – can be removed from the ingestion stream that a business opens itself to. Using data management processes empowered by modern Artificial Intelligence (AI) engines, we can act now, before the situation worsens to the point where our data backbones need to consider anything akin to gastric bypass surgery.”
Molyneux talks about a ‘worsening situation’ but, for now at least, the impact of data sprawl on the environment is still limited. According to the International Energy Agency, data volumes in datacenters more than tripled between 2015 and 2021, yet the energy consumption of datacenters has remained largely constant. This is mainly due to major efficiency gains and a shift towards more modern hyperscale datacenters.
“Datacenters have become more efficient, but they have almost reached the optimum level of efficiency they can achieve,” warns Cohesity’s Molyneux. “There are only marginal efficiency gains left. Estimates suggest that, with current forms of energy generation, the planet’s current pack of datacenters will collectively produce 496 million tons of carbon dioxide in 2030. That would be more than France emitted in total in 2021.”
AI is a big side order
Staying with the company’s calorie-counting analogy, we can certainly expect AI to add a lot of extra data to the consumption pile. A 2019 study by the Massachusetts Institute of Technology (MIT) concluded that training a single large neural network can produce as much carbon dioxide as five combustion engine cars over their entire life cycle. A 2021 study by Google and the University of California, Berkeley estimated that training GPT-3, the AI model behind the original version of ChatGPT, consumed 1,287 megawatt hours of electricity and thus emitted 502 tons of carbon dioxide. That would be equivalent to the electricity consumption of around 120 American households in one year.
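As a rough sanity check, that household comparison holds up under one simple assumption (an average American household using roughly 10.7 megawatt hours of electricity per year – a figure of our own, not one taken from the studies):

```python
# Back-of-envelope check on the GPT-3 training energy comparison.
# The household consumption figure is an assumption used for illustration only.
gpt3_training_mwh = 1287           # reported GPT-3 training energy, in megawatt hours
household_mwh_per_year = 10.7      # assumed average US household electricity use per year

equivalent_households = gpt3_training_mwh / household_mwh_per_year
print(f"Roughly {equivalent_households:.0f} US households for a year")  # prints ~120
```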
“We barely manage our digital footprint,” insists Molyneux. “Companies are often sitting on a huge mountain of ‘dark’ data, much of which they no longer need, but they still don’t delete it. This is often due to a lack of data classification; companies often don’t even know what data is still on their servers. The notion of a data diet describes an attitude change that organizations can adopt to reduce the overall volume of data they look to store. This change sees enterprises take a more proactive approach to the way they index, classify and amass data across the data management lifecycle. It also means taking positive steps to consolidate an organization’s data store workloads onto a single common platform.”
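To make that classification step concrete, here is a minimal sketch of what a first-pass ‘dark data’ inventory might look like. The two-year staleness threshold and the /data/shared path are illustrative assumptions rather than anything Cohesity prescribes, and a real data management platform would go much further (content inspection, PII detection, policy-driven retention):

```python
# Minimal sketch: flag files untouched for years as 'dark data' candidates for review.
# The staleness threshold and root path below are illustrative assumptions.
import time
from pathlib import Path

STALE_AFTER_SECONDS = 2 * 365 * 24 * 3600   # treat ~2 years without access as 'stale'
root = Path("/data/shared")                 # hypothetical file share to inventory

now = time.time()
dark_candidates = []
for path in root.rglob("*"):
    if path.is_file():
        stats = path.stat()
        if now - stats.st_atime > STALE_AFTER_SECONDS:
            dark_candidates.append((path, stats.st_size))

total_gb = sum(size for _, size in dark_candidates) / 1e9
print(f"{len(dark_candidates)} stale files, roughly {total_gb:.1f} GB of 'dark data' to review")
```

Whether access time, modification time or business context is the right staleness signal is itself a classification decision; the point is simply that the inventory has to exist before anything can go on a diet.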
Although there’s no Atkins Diet methodology on offer here, the Cohesity team do point to some proven practices that they say can lighten the data load that enterprise organizations consume on a daily, weekly and indeed annual basis.
Atkins for data?
The aforementioned process of indexing data as accurately as possible via a data management platform can help companies single out data streams that have become obsolete, redundant, orphaned or simply out of date. In line with this activity, de-duplication tools applied at the data platform level can reduce data storage loads by a surprisingly large amount: depending on the ‘type’ of data in question, by as much as 97 percent, although that figure may be open to debate.
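The mechanism behind reduction figures like that is simple enough to sketch: data is split into chunks, each chunk is hashed, and identical chunks are stored only once. The fixed 4 KB chunk size below is an illustrative assumption (production systems typically use variable-size, content-defined chunking), and this is a sketch of the general technique rather than of Cohesity’s own implementation:

```python
# Minimal sketch of fixed-size-chunk deduplication: identical chunks are stored
# once, keyed by their SHA-256 digest. The chunk size is an illustrative assumption.
import hashlib

CHUNK_SIZE = 4096  # 4 KB chunks; real platforms often use variable-size chunking

def dedup_store(paths):
    store = {}      # digest -> chunk bytes, stored once however often the chunk appears
    manifests = {}  # path -> ordered list of digests needed to reconstruct that file
    for path in paths:
        digests = []
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                digest = hashlib.sha256(chunk).hexdigest()
                store.setdefault(digest, chunk)  # only previously unseen chunks consume space
                digests.append(digest)
        manifests[path] = digests
    return store, manifests
```

The more duplication there is across backups, file copies and near-identical documents, the smaller the chunk store ends up relative to the raw input, which is where headline percentages of that kind come from.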
“There’s a key efficiency opportunity here for organizations to grasp in all industries. By cleaning up an organization’s data store, the business gets a win that spans four major ‘food’ groups. This approach can be said to a) reduce an organization’s carbon footprint by virtue of using a more accurate level of cloud resources, b) reduce the risk of litigation related to outdated Personally Identifiable Information (PII) residing on the company data layer, c) ensure that the firm’s approach to AI is founded on a base of the leanest and most accurate information resources spanning the organization itself and d) probably help the IT team lose weight as they become more nimble and less encumbered by late night data builds fuelled on takeaway pizza,” concluded Cohesity’s Molyneux.
The data diet might be a cute idea designed simply to get us thinking about information rationalization and management in new ways and yes, it’s a concept proposed by a data protection and data recovery vendor, so take it with a pinch of salt. Come to think of it, don’t take it with extra salt; your sodium intake is already high enough and we do need to be more careful with our plate of data.
Pass the salt-free seasoning please.