Deep Dive: Optimizing AI Data Storage Management
By viewing AI processing as part of a project data pipeline, enterprises can ensure their AI models are trained effectively and the storage selection is fit for purpose.
Optimizing storage for AI involves more than just choosing the right hardware; it requires a data management approach to successfully process the vast amounts of data large language models (LLMs) require.
Viewing AI processing as part of a project data pipeline helps enterprises ensure their generative AI models are trained effectively and that the storage selected is fit for purpose. Giving proper weight to the data storage requirements of AI also helps businesses keep their models both effective and scalable.
AI Data Pipeline Stages Aligned to Storage Needs
In an AI data pipeline, various stages align with specific storage needs to ensure efficient data processing and utilization. Here are the typical stages along with their associated storage requirements:
Data collection and pre-processing: The storage where the raw and often unstructured data is gathered and centralized (increasingly into Data Lakes) and then cleaned and transformed into curated data sets ready for training processes.
Model training and processing: The storage that feeds the curated data set into GPUs for processing. This stage of the pipeline also needs to store training artifacts such as the hyperparameters, run metrics, validation data, model parameters and the final production inferencing model. Pipeline storage requirements will differ depending on whether you are developing an LLM from scratch or augmenting an existing model, as with retrieval-augmented generation (RAG).
Inferencing and model deployment: The mission-critical storage where the trained model is hosted for making predictions or decisions based on new data. The outputs of inferencing are used by applications to deliver results, often embedded into information and automation processes.
Storage for archiving: Once the training stage is complete, various artifacts such as different sets of training data and different versions of the model need to be stored alongside the raw data. This is typically long-term retention, but the model data still needs to be available to pull out specific items related to past training.
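One way to keep the stage-to-storage alignment above explicit in tooling is a small lookup table. The sketch below is illustrative only: the class name `StageProfile`, the stage keys, and the priority labels are assumptions that summarize this article, not a vendor API or a prescribed design.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageProfile:
    """Hypothetical record pairing a pipeline stage with its storage needs."""
    stage: str
    typical_storage: str
    priorities: Tuple[str, ...]


# Summary of the four stages described in this article (labels are assumptions).
PIPELINE = (
    StageProfile("collection_preprocessing", "cloud object storage",
                 ("capacity scaling", "cost", "redundancy")),
    StageProfile("training", "local NVMe SSD or parallel file system",
                 ("throughput", "checkpoint writes", "reliability")),
    StageProfile("inference", "file system or NAS",
                 ("low latency", "availability", "security")),
    StageProfile("archive", "HDD- or tape-backed object storage",
                 ("durability", "cost", "online retrieval")),
)


def storage_for(stage: str) -> StageProfile:
    """Return the profile for a stage key, or raise KeyError if unknown."""
    for profile in PIPELINE:
        if profile.stage == stage:
            return profile
    raise KeyError(stage)


if __name__ == "__main__":
    for p in PIPELINE:
        print(f"{p.stage}: {p.typical_storage} ({', '.join(p.priorities)})")
```

Keeping the mapping in one place makes it easier to spot when a single on-premises system is being asked to serve stages with conflicting priorities.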
Cloud vs. On-Prem Typically Affects the Storage Used
A major decision before starting an AI project is whether to use cloud resources, on-premises data center resources, or both in a hybrid cloud setup.
For storage, the cloud offers various types and classes to match different pipeline stages, while on-premises options are often more limited, so a single universal system typically has to serve multiple workloads.
The most common hybrid pipeline division is to train in the cloud and run inference on-premises and at the edge.
Stage 1: Storage Requirements for Data Collection and Pre-Processing
During data collection, vast amounts of raw unstructured data are centralized from remote data centers and the IoT edge, demanding high aggregate performance to stream data efficiently. Individual internet connections aren't exceptionally fast, so the storage must keep pace with many parallel threads that collectively transfer terabytes of data.
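The effect of parallel streams on ingest time can be sketched with simple arithmetic. The function name and the example figures (10 TB, 500 Mbps per stream) are assumptions chosen for illustration, and the model assumes throughput scales linearly with the number of streams, which real networks only approximate.

```python
def transfer_hours(dataset_tb: float, per_stream_mbps: float, streams: int) -> float:
    """Estimate hours to move `dataset_tb` terabytes over parallel streams.

    Assumes decimal units (1 TB = 1e12 bytes) and perfectly linear scaling
    across streams; protocol overhead and retries are ignored.
    """
    total_bits = dataset_tb * 1e12 * 8
    aggregate_bits_per_s = per_stream_mbps * 1e6 * streams
    return total_bits / aggregate_bits_per_s / 3600


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"{n:>2} streams: {transfer_hours(10, 500, n):.1f} h")
```

Under these assumptions a single stream needs about 44 hours for 10 TB, while 16 streams need under 3 hours, which is why aggregate rather than per-connection performance is the figure to size for.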
Capacity scalability is equally crucial, as the storage solution must be able to expand cost-efficiently to accommodate growing datasets and increasing computational demands.
Balancing cost efficiency is essential to meet these scaling and performance demands within budget, ensuring the solution provides value without excessive expenditure. Additionally, redundancy is vital to prevent data loss through reliable backups and replication.
Security is paramount to protect sensitive data from breaches, ensuring the integrity and confidentiality of the information. Lastly, interoperability is necessary for seamless integration with existing systems, facilitating smooth data flow and management across various platforms and technologies.
The most prevalent storage used for data collection and pre-processing is highly redundant cloud object storage. Object storage was designed to work well over the internet, which suits data collection, and it is both scalable and cost-effective.
To maintain cost effectiveness at large scale, hard disk drive (HDD) devices are commonly used. However, as this storage sees more interaction, low-cost solid-state drives (SSD) are becoming increasingly relevant. This phase culminates in well-organized and refined curated data sets.
Stage 2a: Storage Requirements for Effective LLM Training
The storage needed to feed GPUs for LLM AI model processing must meet several critical requirements. Extreme performance is essential, requiring high throughput and rapid read/write speeds to feed the GPUs and maintain their continuous operation.
GPUs require a constant and fast data stream, underscoring the importance of storage that aligns with their processing capabilities. The storage must also absorb the frequent, large-volume checkpoint data dumps generated during training. Reliability is crucial to prevent interruptions in training, as any downtime or inconsistency could lead to significant overall pipeline delays.
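A rough sizing sketch shows why checkpoint dumps drive write bandwidth. All names and numbers here are assumptions: 12 bytes per parameter is only a placeholder for weights plus optimizer state (the real figure depends on precision and optimizer), and the 60-second window is an arbitrary target for how long a checkpoint write may take.

```python
def checkpoint_size_gb(params_billions: float, bytes_per_param: float) -> float:
    """Checkpoint size in GB (decimal).

    (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB,
    so the two 1e9 factors cancel.
    """
    return params_billions * bytes_per_param


def required_write_gbps(size_gb: float, window_s: float) -> float:
    """Sustained write rate in GB/s needed to finish within `window_s` seconds."""
    return size_gb / window_s


if __name__ == "__main__":
    size = checkpoint_size_gb(70, 12)      # hypothetical 70B-parameter model
    rate = required_write_gbps(size, 60)   # hypothetical 60 s write window
    print(f"checkpoint ~{size} GB, needs ~{rate} GB/s sustained writes")
```

With these placeholder inputs the checkpoint is about 840 GB and needs roughly 14 GB/s of sustained write bandwidth. That is well beyond a single drive, which helps explain the interest in distributed or parallel designs.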
Additionally, user-friendly interfaces are important as they simplify and streamline administrative tasks and allow data scientists to focus on AI-model development instead of storage management.
Most LLMs undergo training in the cloud, leveraging numerous GPUs. Curated datasets are copied from the cloud's object storage to local NVMe SSDs, which feed data to the GPUs at extremely high speed and require minimal storage management. Cloud providers such as Azure have automated processes to copy and cache this data locally.
However, relying solely on local storage can be inefficient; SSDs can remain unused, datasets need to be resized to fit, and the data transfer times can impede GPU usage. As a result, companies are exploring parallel file system designs that run in the cloud to process data through an NVIDIA direct connection.
Stage 2b: Storage Requirements for Effective RAG Training
During RAG training, private data is integrated with the generic LLM to create a new aggregate model. This decentralized approach means the LLM itself can be trained without access to an organization's confidential data. An optimal storage solution for this sensitive data is a system that can obscure personally identifiable information (PII).
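Obscuring PII before private data reaches the model can be sketched as a masking pass over the text. This is a deliberately minimal illustration, not how any particular storage system does it: the function name `mask_pii` and the two patterns (email addresses and 3-3-4 digit phone numbers) are assumptions, and production systems typically rely on far broader detection than two regular expressions.

```python
import re

# Simplified patterns for illustration; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_pii(text: str) -> str:
    """Replace email addresses and 3-3-4 phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(mask_pii("Contact jane.doe@example.com or 555-123-4567."))
```

Masking at the storage or ingestion layer means downstream cataloging and retrieval never see the raw identifiers.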
Recently, there has been a shift away from centralizing all the data and toward managing it onsite at remote data centers, then transferring it to the cloud for the processing stage.
Another approach involves pulling the data into the cloud using cloud-resident distributed storage systems. Effective storage solutions for RAG training must combine high performance with comprehensive data cataloging capabilities.
It is crucial to employ high-throughput storage, such as SSD-based distributed systems, to ensure sufficient bandwidth for feeding large datasets to GPUs.
Additionally, robust security measures, including encryption and access controls, are essential to protect sensitive data throughout the training process.
Competition is anticipated between parallel file systems and traditional network-attached storage (NAS). NAS has traditionally been the preferred choice for on-premises unstructured data, and this continues to be the case within many on-premises data centers.
Stage 3: Storage Requirements for Effective AI Inference and Model Deployment
Successful deployment of model inferencing requires high-speed, mission-critical storage. High-speed storage enables rapid access and processing of data, minimizing latency and enhancing real-time performance.
Additionally, performance-scalable storage systems are essential to accommodate growing datasets and increasing inferencing workloads. Security measures, including embedded ransomware protection, must be implemented to safeguard sensitive data throughout the inference process.
Inferencing involves processing unstructured data, which is effectively managed by file systems or NAS. Inference is the decision-making phase of AI and is closely integrated with content serving to ensure practical utility. It is commonly deployed across diverse environments spanning edge computing, real-time decision-making, and data center processing.
The deployment of inference demands mission-critical storage and often requires low-latency solution designs to deliver timely results.
Stage 4: Storage Requirements for Project Archiving
Ensuring long-term data retention requires robust durability to maintain the integrity and accessibility of archived data over extended periods.
Online retrieval is important to support the occasional need to access or restore archived data. Cost-efficiency is also critical: because archived data is accessed infrequently, it calls for low-cost storage options.
Online bulk-capacity object storage based on HDDs, or on tape front-ended by HDDs, is the most common approach for archiving in the cloud. Meanwhile, on-premises setups are increasingly considering active-archive tape for its cost-effectiveness and excellent sustainability characteristics.
The Importance of Scalability: The World of AI is Still Evolving
Different types of storage are commonly employed nowadays to optimize the AI data pipeline process. Looking ahead, Omdia anticipates there will be a greater emphasis on optimizing the overall AI data pipeline and development processes.
During data ingestion and pre-processing stages, scalable and cost-effective storage is used. It is projected that 70% of the project time will be dedicated to converting raw inputs into curated data sets for training. As early-stage AI initiatives are completed, challenges related to data discovery, classification, version control, and data lineage are expected to gain more prominence.
For model training, high-throughput SSD-based distributed storage solutions are crucial for delivering large volumes of data to GPUs, ensuring quick access for iterative training processes. While most cloud training currently relies on local SSDs, as the processes advance, organizations are expected to prioritize more efficient training methods and storage solutions. Consequently, there has been a recent increase in innovative SSD-backed parallel file systems developed by startups as alternatives to local SSDs. These new NVMe SSD storage systems are designed to handle the high throughput and low latency demands of AI workloads more efficiently by optimizing provisioned capacities and eliminating the need for data transfer actions to local drives.
For model inferencing and deployment, low-latency storage such as NVMe (Non-Volatile Memory Express) drives can provide rapid data retrieval and enhance real-time performance. As inference is beginning to progress, Omdia expects inferencing storage will grow at almost a 20% CAGR until 2028, nearly four times the storage used for LLM training.
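To get a feel for what a growth rate near 20% CAGR implies, the compounding can be worked through directly. The function name, the baseline index of 100, and the four-year horizon are assumptions for illustration; the article does not state a starting year or an absolute capacity.

```python
def project_capacity(base: float, cagr: float, years: int) -> float:
    """Compound `base` at annual rate `cagr` for `years` years."""
    return base * (1 + cagr) ** years


if __name__ == "__main__":
    base_index = 100.0  # hypothetical baseline, not an Omdia figure
    for elapsed in range(5):
        value = project_capacity(base_index, 0.20, elapsed)
        print(f"year +{elapsed}: {value:.1f}")
```

At 20% a year, capacity roughly doubles in four years (1.2 to the fourth power is about 2.07), so inference storage that is adequate today would need about twice the capacity by the end of that window.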
Throughout the entire pipeline, there is a heightened emphasis on data security and privacy, with advanced encryption and compliance measures being integrated into storage solutions to protect sensitive information. Ensuring secure data access and data encryption is crucial in any data pipeline.
Over time, storage systems might evolve into a single universal type that eliminates phase-specific issues like data transfers and the need to secure multiple systems. Utilizing a single end-to-end system would allow for efficient data collection, training, and inference within the same infrastructure.
This article originally appeared in the Omdia blog.