Three Ways to Use CPUs to Reduce GPU Usage
The surprising computer science of leveraging CPUs and vector databases to avoid “over-GPUing” apps and three steps toward making it happen.
Gartner recently predicted that “growth in 90% of enterprise deployments of GenAI will slow as costs exceed value,” while UK technology markets analyst CCS Insight noted that “Generative AI has a cold shower in 2024 as the reality of cost, risk, and complexity replaces the hype of 2023.” [1][2] Where are these costs coming from? GPU chips, for one. Some GPU chips are listing for more than $40,000 on eBay, and those compute cycles are largely consumed by generative AI training, scoring, experimentation, and new services. GPUs, as Elon Musk joked, are “considerably harder to get than drugs.” [3]
And it’s not only the cost of GPUs we should worry about, but also the environmental impact, starting with AI’s water and energy footprint. Microsoft’s water consumption rose 34% to nearly 1.7 billion gallons in 2022, driven largely by increased AI-related computing at sites such as its Iowa data centers. Running ChatGPT for just one day consumes as much energy as 33,000 US households. [4]
What if there was another way to scale generative AI computing without such a heavy reliance on GPU usage?
Back to the Computer Science Basics: Align Data and Compute
Central Processing Units (CPUs) are general-purpose chips that execute a computer’s logic. The humble CPU is often overlooked as a candidate for AI computation because “everyone knows” it’s designed for the basics: basic math, basic code execution, and basic data output.
Or are they?
Graphics Processing Units (GPUs) are purpose-built chips that, at their core (nerd pun intended), parallelize computing, which makes them ideal for generative AI, a technology built on matrix and vector computations during model training. [5]
But much of that computation is spent organizing data in memory to prepare it for parallel execution. By being proactive and organizing data on disk in a format that naturally matches how it will be used, we can eliminate a large amount of redundant data processing and feed that pre-organized data to far fewer good old-fashioned CPUs, getting the same job done with less effort.
This format is a vector database. Vector databases offer significant cost and energy efficiency advantages by reducing non-value-added compute cycles, data transformation, aggregation, and organization. Think about these three steps as your path to lower GPU usage.
Step one: rethink how you organize your data
Unfortunately, you can’t just install a vector database, push a button, and replicate the performance benefits of massively parallel processing. You must do a little systems-level thinking, coding, and data preparation.
First, organize data as vectors to prepare it for CPU processing. This requires understanding the type of AI analysis being used. For example, many algorithms take unstructured data and arrange it in temporal order because the parallelized elements of neural networks “scan” through data to generate predictions. Organizing data by vector pre-configures it in the way it will be used and reduces the computing cost of scanning through unstructured, unprepared data.
For example, approximate nearest neighbor (ANN) algorithms search over indexed vector embeddings rather than raw records, which can make retrieval workloads up to 100 times faster than non-vector data structures (e.g., relational or graph) at a fraction of the compute cost. They also make it practical to query additional data dimensions, such as time-stamped data, for more comprehensive insights.
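As a concrete illustration, here is a minimal CPU-only sketch using the open-source FAISS library with synthetic embeddings; the dimensionality and index parameters are illustrative assumptions, not tuned values.

```python
# Minimal CPU-only ANN sketch using FAISS; the 384-dim embeddings are synthetic
# and the index parameters (nlist, nprobe) are illustrative, not tuned values.
import numpy as np
import faiss

dim = 384
vectors = np.random.rand(100_000, dim).astype("float32")  # stand-in for real embeddings

# IVF index: cluster vectors into coarse buckets so each query scans only a few
# buckets instead of the whole collection -- that is where the compute savings come from.
nlist = 256
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(vectors)          # learn the coarse clusters (runs on CPU)
index.add(vectors)

index.nprobe = 8              # buckets scanned per query: a speed/recall trade-off
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0], distances[0])   # ids and distances of the 5 approximate nearest neighbors
```

The same idea carries over to any vector database that maintains an ANN index: the search stays index-bound rather than GPU-bound.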
An additional point to consider: individual CPU cores are optimized for single-threaded performance, which makes them well suited to tasks that require sequential processing, and many time-series applications require exactly that.
Step two: fine-tune your knowledge base
Most GenAI applications customize LLMs with proprietary data so query results are contextually relevant to the enterprise. There are two paths to doing this in a way CPUs can take advantage of: fine-tune the LLM, or use knowledge base (KB) embeddings.
The first path is to take pre-trained language models and adapt them to your task or domain, for example by training on task-specific data and public data sources. The second path, KB embeddings, represents structured knowledge from external sources (Wikipedia, Freebase, etc.) in a vector space. Each proprietary entity (e.g., your customers, employees, products, and partners) is also represented as a vector, so semantic relationships between the two can be computed directly.
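A minimal sketch of this idea, assuming the open-source sentence-transformers library, a small CPU-friendly encoder, and made-up KB and entity text:

```python
# Sketch: embed external knowledge and proprietary entities into one vector space
# on CPU. The model name and the example texts are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")  # small, CPU-friendly encoder

kb_entries = [
    "The ACME 3000 is an industrial pump rated for 500 litres per minute.",
    "The returns policy allows refunds within 30 days of purchase.",
]
entities = ["customer complaint: pump pressure drops after two hours of use"]

kb_vecs = model.encode(kb_entries, convert_to_tensor=True)
entity_vecs = model.encode(entities, convert_to_tensor=True)

# Cosine similarity links the proprietary entity to its most relevant KB entry.
scores = util.cos_sim(entity_vecs, kb_vecs)
print(kb_entries[int(scores.argmax())])
```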
While GPUs have conventionally been favored for fine-tuning, especially for large-scale ML operations, many companies are leaning toward KB embeddings as their primary architectural choice for AI platforms and enterprise use cases. By implementing a vector database within this framework, organizations can largely eliminate or significantly reduce the need for continuous fine-tuning and retraining, because the LLM is supplied with relevant, recent, and explainable content drawn from the company’s proprietary knowledge base.
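In practice, “providing the LLM with relevant content” usually takes the form of retrieval-augmented prompting rather than retraining. The sketch below is library-agnostic; `search_knowledge_base` and the prompt template are hypothetical stand-ins, not a specific product’s API.

```python
# Sketch of retrieval-augmented prompting: pull context from the vector store on CPU
# and hand only that context to the LLM, avoiding a fine-tuning/retraining pass.
# `search_knowledge_base` is a hypothetical stand-in for the ANN search shown earlier.

def search_knowledge_base(question: str, k: int = 3) -> list[str]:
    # Stand-in: a real implementation would embed `question` and query the vector index.
    return [
        "The returns policy allows refunds within 30 days of purchase.",
        "Refunds are issued to the original payment method within 5 business days.",
    ][:k]

def build_prompt(question: str) -> str:
    context = "\n".join(f"- {p}" for p in search_knowledge_base(question))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

print(build_prompt("How long do customers have to request a refund?"))
```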
While it is true that initial model training is best done on GPUs, remember that training your model isn’t the biggest time drain. Preparation, investigation, and interrogation of your training data can take even more time, and this is work that can and should be done off GPU servers.
Step three: use CPUs
Finally, use CPUs. With their lower energy consumption and cost-efficiency benefits, CPUs are a great choice for KB embedding architectures. Why?
CPUs have multiple cores that can handle multiple tasks simultaneously, making them well-suited for parallel processing. By strategically designing concurrent workflows, organizations can efficiently utilize CPU resources while keeping costs in check.
When it comes to creating vector embeddings, CPUs shine in scenarios that require efficient parallelization. By distributing the workload across multiple CPU cores, embedding generation can be expedited.
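A minimal sketch of that fan-out pattern, using Python’s standard process pool; the embedding function is a deterministic stand-in so the example stays self-contained (a real pipeline would load a CPU encoder once per worker process instead).

```python
# Sketch: fan embedding generation out across CPU cores with a process pool.
# The embedding function is a stand-in so the example is self-contained;
# a real pipeline would load a CPU encoder once per worker process.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def embed_batch(texts: list[str]) -> np.ndarray:
    # Stand-in embedding: a seeded pseudo-random 384-dim vector per text.
    return np.stack([
        np.random.default_rng(sum(t.encode())).random(384).astype("float32")
        for t in texts
    ])

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":
    docs = [f"document {i}" for i in range(10_000)]
    with ProcessPoolExecutor() as pool:            # one worker per CPU core by default
        batches = pool.map(embed_batch, chunks(docs, 500))
        embeddings = np.vstack(list(batches))
    print(embeddings.shape)                        # (10000, 384)
```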
CPUs offer a cost-effective solution for tasks that can be divided into parallel threads. The versatility of CPUs enables organizations to harness their power not only for embedding generation but also for handling concurrent tasks like data preprocessing and storage management, resulting in a comprehensive and efficient processing pipeline.
Furthermore, the advantages of leveraging CPU concurrency extend to similarity searches. With the ability to handle multiple search queries simultaneously, CPUs excel in speeding up similarity calculations across a vast dataset. By optimizing multi-threading techniques and parallel computing strategies, organizations can significantly reduce search times, enabling rapid and accurate retrieval of similar items.
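As a rough sketch of that, FAISS (assumed here, with synthetic data) will typically spread a batched similarity search across its CPU threads, so submitting queries in batches keeps every core busy without extra threading code:

```python
# Sketch: answer many similarity queries at once on CPU. FAISS typically spreads a
# batched search across its OpenMP threads, so one call with 64 query rows keeps
# all cores busy without extra threading code. Data here is synthetic.
import numpy as np
import faiss

dim = 384
index = faiss.IndexFlatL2(dim)
index.add(np.random.rand(200_000, dim).astype("float32"))  # stand-in corpus

faiss.omp_set_num_threads(8)   # match this to the host's core count (illustrative value)

queries = np.random.rand(64, dim).astype("float32")        # 64 "concurrent" queries
distances, ids = index.search(queries, 5)                   # one batched, parallel search
print(ids.shape)                                            # (64, 5)
```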
Better for your P&L, better for the planet
Vector databases and CPUs are an attractive alternative to GPUs when the nature of the application (e.g., similarity and anomaly search) aligns with the capabilities of CPUs. When it does, CPUs provide a solution that balances performance, scalability, and cost-efficiency, making them a compelling choice for organizations seeking the optimal price-to-performance balance.
And if lower energy usage, lower water usage, and a smaller carbon footprint matter to you, the combination of vector databases and CPUs presents a powerful win-win proposition, and you’ll spend a lot less time bidding for GPU chips on eBay.
CCS Insight, “Predictions 2024 and Beyond”
University of Washington, “UW researcher discusses just how much energy ChatGPT uses”