"Intelligence Too Cheap To Meter"
Thoughts on Apple's AI Play and On-Device AI
I came across the phrase "intelligence too cheap to meter" on Twitter a few weeks ago. As a guy who (1) does AI infra and (2) had never heard the phrase before, I looked it up and found the following on Wikipedia:
Too cheap to meter refers to a commodity so inexpensive that it is cheaper and less bureaucratic to simply provide it for a flat fee or even free and make a profit from associated services.
With GPU cloud market dynamics as they have been, free or nearly-free generative AI seemed quite a ways off. That is, until WWDC.
At WWDC, Apple unveiled Apple Intelligence, their newly enshrined AI stack for iOS and Mac developers.
The most interesting thing about the announcement wasn't the company's proposed Day 1 features; it was the vision articulated between the lines. Apple's demonstrated investment in on-device AI deployments set the stage for what could be a massive shift in how AI is developed and distributed to end users.
The Great Compute Migration
Sparing you the gory details, Apple's new AI stack will make it easier than ever for developers to ship on-device AI models.
Most AI users reacted to the news last week in one of a few ways, including:
"Great! Now my data will be handled privately."
and / or
"Great! Now I can use my AI assistant offline."
and / or
"So what? I don't care whether my AI operates in the cloud or on my device. I just want my apps to work."
My hunch is that there is a much bigger story to be told on the supply-side: On-device AI seems poised to fundamentally alter where and how capital and compute are allocated across the entire industry.
AI Today: Dominated by cloud providers and economies of (compute) scale
As many in the business already know, AI models are almost exclusively deployed on GPU servers in the cloud. Hundreds of millions of investment dollars have been deployed across the entire stack to develop new frontier models, procure more GPU server racks, and build brand-new data centers.
Despite this huge effort, shipping AI-powered products remains costly and fundamentally unscalable for most developers, who typically engage with AI cloud providers in one of two ways:
Per-token pricing, where the developer pays the cloud provider to use some hosted AI model at a rate of $X/1M tokens. The more the end users prompt the model for tokens, the more the developer has to pay.
Per-GPU-hour pricing, where the developer simply rents the GPU (and the AI model onboard) from the cloud provider at some markup. This is often favorable for developers with "bursty" batch use cases that enable high GPU utilization, but extremely unfavorable for realtime use cases, especially if the developer has low request volume.
If you've ever tried to build any software business from scratch, these two pricing models should make you uncomfortable. Per-token pricing scales operating costs linearly with usage, a departure from traditional software economics (i.e. scaling to handle 1000x more usage does not usually mean 1000x more spend on servers). The alternative scheme, per-GPU-hour pricing, fundamentally favors incumbents whose realtime request volumes are high enough to justify renting out GPUs 24/7.
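To make the difference concrete, here's a back-of-the-envelope sketch of how the two schemes scale with usage. All of the numbers below (token price, GPU hourly rate, tokens per request) are hypothetical placeholders, not quotes from any provider:

```python
# Illustrative cost scaling for the two dominant pricing schemes.
# Every constant here is an assumption for the sake of the example.

PRICE_PER_1M_TOKENS = 10.00   # hypothetical $/1M tokens for a hosted model
GPU_HOURLY_RATE = 2.50        # hypothetical $/hour to rent one GPU
TOKENS_PER_REQUEST = 1_000    # assumed average tokens per request

def per_token_cost(monthly_requests: int) -> float:
    """Cost grows linearly with usage."""
    total_tokens = monthly_requests * TOKENS_PER_REQUEST
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS

def per_gpu_hour_cost(num_gpus: int, hours_per_month: float = 730) -> float:
    """Cost is flat per GPU, no matter how busy the GPU actually is."""
    return num_gpus * hours_per_month * GPU_HOURLY_RATE

for requests in (10_000, 100_000, 1_000_000):
    print(f"{requests:>9} req/mo   per-token: ${per_token_cost(requests):>10,.2f}"
          f"   one rented GPU: ${per_gpu_hour_cost(1):,.2f}")
```

The per-token bill tracks usage one-for-one, while the rented GPU costs the same whether it serves one request a day or runs flat out, which is exactly why it only pencils out at high, steady request volume.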
Apple's new developer ecosystem demonstrates a third approach to shipping AI products: on-device AI deployment.
Living On The Edge
On-device deployment of AI models enables processing inference requests directly on the end user's machine. Models deployed at the edge are typically orders of magnitude smaller than the frontier models hosted by AI cloud providers, enabling fast inference with a low overall memory footprint.
For many developers, even a partial implementation of an on-device approach could greatly reduce costs in what is quickly becoming one of the largest spend categories in software. Depending on the product, a well executed on-device AI deployment could eliminate the need for an AI cloud provider entirely.
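As a rough illustration of what a "partial" on-device approach might look like, here is a minimal sketch of a hybrid serving policy: handle what you can locally and fall back to a hosted frontier model only when necessary. The function names and the routing heuristic are hypothetical placeholders, not any real API:

```python
# Hypothetical hybrid routing: cheap requests stay on-device, hard ones go to
# the cloud. The heuristic and function names are placeholders for this sketch.

def classify_difficulty(prompt: str) -> str:
    """Toy heuristic: very long or explicitly multi-step prompts go to the cloud."""
    if len(prompt) > 2_000 or "step by step" in prompt.lower():
        return "hard"
    return "easy"

def answer(prompt: str) -> str:
    if classify_difficulty(prompt) == "easy":
        return run_local_model(prompt)   # free after download, works offline
    return call_cloud_api(prompt)        # metered, but more capable

def run_local_model(prompt: str) -> str:
    raise NotImplementedError("e.g. a ~3B on-device model via MLX or llama.cpp")

def call_cloud_api(prompt: str) -> str:
    raise NotImplementedError("e.g. a hosted frontier model, billed per token")
```

Even if only the "easy" traffic stays on-device, the metered portion of the bill shrinks to whatever fraction of requests genuinely needs a frontier model.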
Wait, Don't Small AI Models Suck?
My most recent role was leading the ML team at Predibase, an AI infra company dedicated to enabling businesses to fine-tune and serve their own AI models in the cloud. When Meta's first Llama model dropped in February 2023, we carefully crafted our vision of the future: that, in the fullness of time, AI deployments would comprise collections of small models fine-tuned with specific tasks in mind. Our own research found that small models, when LoRA fine-tuned on well-defined tasks, were capable of outperforming massive frontier models. We also launched LoRAX, a production-ready, first-of-its-kind OSS package capable of serving hundreds of fine-tuned LoRAs concurrently on a single base model.
While Apple's newly announced OpenAI partnership ended up sweeping the media headlines, the most salient aspect of the announcement was the model architecture: a 3-billion-parameter foundation model shipped with a series of hot-swappable LoRA adapters for a variety of tasks.
Apple’s proposed model architecture closely aligned with our thinking at Predibase. However, the real kicker was that they implemented an efficient multi-LoRA serving framework on Apple Silicon. Individually, LoRA fine-tuned small AI models can be powerful, but lack the versatility of frontier models. Multi-LoRA serving directly mitigates this by enabling multiple fine-tuned LoRAs to be served from a single base model. This is a key piece of what will make on-device model inference truly viable.
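For intuition, here is a toy numpy sketch of why multi-LoRA serving is so cheap: every task shares the same frozen base weights, and each adapter contributes only two small low-rank matrices. The shapes, rank, and task names below are made up for illustration, not taken from Apple's or LoRAX's implementation:

```python
# Toy multi-LoRA serving: one frozen base weight, many hot-swappable adapters.
import numpy as np

d, r = 4096, 16                      # hidden size and LoRA rank (r << d)
W = np.random.randn(d, d) * 0.01     # frozen base weight, shared by all tasks

adapters = {                          # one (A, B) pair per fine-tuned task
    "summarize": (np.random.randn(r, d) * 0.01, np.random.randn(d, r) * 0.01),
    "rewrite":   (np.random.randn(r, d) * 0.01, np.random.randn(d, r) * 0.01),
}

def forward(x: np.ndarray, task: str) -> np.ndarray:
    """Compute y = x W^T + x (B A)^T: the shared base projection plus a low-rank task delta."""
    A, B = adapters[task]
    return x @ W.T + x @ (B @ A).T

x = np.random.randn(1, d)
print(forward(x, "summarize").shape)   # (1, 4096); switching tasks reuses W

# Memory math: the base layer stores d*d parameters; each adapter adds only
# 2*r*d, i.e. under 1% extra per task at rank 16, which is what makes
# hot-swapping dozens of task-specific adapters on one device plausible.
```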
Conclusion (and tl;dr)
Per-token pricing and high GPU rental costs currently dominate enterprise spend on AI. Small models (on the order of a few billion parameters) can be deployed directly onto end-user devices and, when fine-tuned on well-defined tasks, can outperform massive frontier models. Depending on the use case, on-device models could very well eliminate the need for an AI cloud provider, thereby paving a realistic path toward variations of intelligence that are actually too cheap to meter.
Some parting thoughts:
Advanced reasoning capabilities will be deployed in the cloud first. With orders of magnitude fewer parameters, on-device models will always lag behind frontier models on complex reasoning tasks, so some portion of workflows will still need to be handled by models deployed in the cloud.
The cloud will continue to be the best place to do fine-tuning. This article primarily discusses the costs associated with model inference; model training is a different story. Training on non-trivial data is ultimately a big batch job that allows for near-optimal GPU utilization, so renting a few A100s for a couple of hours will likely remain the most efficient option for those seeking to fine-tune a task-specific model.
For those interested in giving on-device model inference a shot, Apple's MLX project is an open-source effort to make Apple Silicon a user-friendly hardware accelerator for AI models. For developers building outside of the Apple ecosystem, there are also transformers.js and llama.cpp. And of course, for those looking to fine-tune and serve LoRAs in the cloud, check out LoRAX and Predibase.
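If you want to kick the tires on Apple Silicon, here's roughly what local inference looks like with the mlx-lm package (a companion to MLX, installable via pip install mlx-lm). This is a sketch assuming mlx-lm's load/generate API; swap in whichever quantized model and prompt you like:

```python
# Minimal local-inference sketch, assuming mlx-lm's load/generate API.
from mlx_lm import load, generate

# A small 4-bit quantized model from the mlx-community hub; any compatible model works.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.2-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Summarize the idea of 'intelligence too cheap to meter' in one sentence.",
    max_tokens=128,
)
print(response)
```

Once the weights are downloaded, every subsequent generation runs entirely on your machine: no API key, no meter.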
I'll be experimenting with local models in the coming months. Interested in the space? Subscribe below, shoot me an email ([email protected]) or find me on Twitter (@GeoffreyAngus). Thanks for reading!
Still trying to figure out how to get AI to "work" with your business? Have a bunch of cloud credits you don't know how to spend? Let's chat. Shoot me an email ([email protected]) or find me on Twitter (@GeoffreyAngus).