Local and Cloud AI: Better Together
June 2024
This piece was originally written and translated into Korean as a part of LG Technology Ventures' Monthly Newsletter to business units and strategic partners. It is republished here with permission. Sensitive information has been removed.
Last week, Apple announced its plans for the next version of iOS and introduced the world to an emerging trend in AI: local inference and multi-model querying.
The feature, advertised as an upgrade to Siri under the name “Apple Intelligence,” will only be available on the iPhone 15 Pro and newer. Apple has a history of gating software features behind new hardware to drive sales, even when there was no technical reason to do so. Notably, after acquiring Siri, Apple removed the app from older iPhones and forced users to upgrade to the iPhone 4S to use it. With Apple Intelligence, however, there is a real technical reason to limit the feature to newer phones. Although Apple’s GPUs and AI processors are extremely powerful, the limiting factor for running AI models is RAM. iPhones have historically shipped with substantially less RAM than their Android counterparts, and that paltry allocation means they can run only the smallest models locally, reaching out to a cloud-based model for more complex queries. A slow processor makes a model run slowly, but too little RAM means the model cannot run at all. For the first time in years, there is a compelling reason for iPhone users to upgrade their phones.
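To make the RAM constraint concrete, here is a rough back-of-envelope calculation. The numbers are my own, assuming 4-bit quantized weights and ignoring KV cache and runtime overhead:

```python
# Rough memory math for on-device LLMs.
# Assumptions (mine, not Apple's): weights quantized to 4 bits,
# ignoring KV cache, activations, and runtime overhead.

def weight_footprint_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Approximate RAM needed just to hold the model weights."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for name, params in [("TinyLlama (1.1B)", 1.1), ("Phi-3-mini (3.8B)", 3.8), ("8B-class model", 8.0)]:
    print(f"{name}: ~{weight_footprint_gb(params):.1f} GB of weights at 4-bit")

# On a 6 GB phone, the OS and apps already claim a few GB, so an 8B-class
# model (~4 GB of weights alone) is a tight or impossible fit, while a
# 1-4B model can run. More RAM, not a faster GPU, is what unlocks this.
```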
Small open-source models like Phi-3 and TinyLlama are quite capable at basic tasks, and can be powerful when paired with retrieval-augmented generation (RAG) over a user’s personal data (a project I am personally working on). A user does not need a particularly advanced LLM to find a relevant email or text message, or to cross-reference information between the two. One VC remarked, “Your butler doesn’t need a PhD,” meaning that trusted access to personal information makes even a “dumb” model more valuable than any “smart” model without it. I would counter that most executive assistants have a college degree, and that there are still strong reasons for a local LLM to be as powerful as possible: context and nuance matter when organizing and coordinating a schedule, or when presenting accurate information. That is why Apple offloads some queries to the cloud, and why some of those queries may include personal information. Apple will continue to increase the capabilities of its on-board LLMs through more on-device RAM and better training, until off-device requests are limited to only the most technical or sector-specific queries.
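As a sketch of how simple the retrieval side of this can be, here is a minimal local RAG loop over a handful of messages. The toy embedding, the message store, and the query_local_llm stub are placeholders of my own; the point is only the shape of the pipeline: embed, retrieve, then prompt a small on-device model with the retrieved context.

```python
# Minimal local-RAG sketch over a user's messages (toy example).
# The bag-of-words "embedding" and query_local_llm are stand-ins;
# a real pipeline would use a proper embedding model and an
# on-device LLM such as Phi-3 or TinyLlama.
import math
from collections import Counter

MESSAGES = [
    "Dinner with Sam moved to 7pm Thursday at the usual place.",
    "Your dentist appointment is confirmed for June 21 at 9am.",
    "Flight AA204 to Chicago departs Friday at 6:15am, gate B12.",
]

def embed(text: str) -> Counter:
    # Toy embedding: word counts. Real systems use a small embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    return sorted(MESSAGES, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

def query_local_llm(prompt: str) -> str:
    # Placeholder for an on-device model call.
    return f"[local model would answer based on: {prompt!r}]"

question = "When is my dentist appointment?"
context = "\n".join(retrieve(question))
print(query_local_llm(f"Context:\n{context}\n\nQuestion: {question}"))
```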
However, any data leakage to external LLMs, whether personal content included in the prompt or even just the topic of the prompt itself, can be a privacy risk for consumers and a security risk for enterprises. Elon Musk has announced that none of his employees will be permitted to use iPhones if ChatGPT cloud requests are integrated into iOS. Many companies have already banned access to LLMs at the office, and will likely extend that ban to corporate phones as well.
But there is an easy solution: Apple created an open market for search on its devices, and it could do the same for AI. For search on iOS, users can select which search engine they want to use, whether Bing, DuckDuckGo, or even a homemade one, but most keep the default, a position Google pays Apple roughly $20B per year for. Similarly, Apple could allow users to pick their own AI, whether from OpenAI, Google, or Anthropic, or an on-prem AI hosted by their employer or by themselves. Google and OpenAI would likely bid for the “default” position, while users would keep the freedom to pick the assistant they want, including the option of complete privacy. Apple would earn revenue by auctioning off the right to be the cloud AI solution, while simultaneously giving its users (personal and corporate) options for better privacy and security.
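As a hedged sketch of what such a choice might look like under the hood, consider a pluggable “default AI” setting analogous to the default search engine picker. The provider names, endpoints, and routing function below are illustrative placeholders of mine, not an actual Apple API:

```python
# Sketch of a pluggable "default AI" setting, analogous to the default
# search engine picker. Providers and endpoints are illustrative only.
from dataclasses import dataclass

@dataclass
class CloudAIProvider:
    name: str
    endpoint: str          # where off-device queries would be sent
    on_prem: bool = False  # e.g., a corporate or self-hosted model

PROVIDERS = {
    "openai": CloudAIProvider("OpenAI", "https://api.openai.example"),
    "google": CloudAIProvider("Google", "https://ai.google.example"),
    "anthropic": CloudAIProvider("Anthropic", "https://api.anthropic.example"),
    "corporate": CloudAIProvider("Acme on-prem LLM", "https://llm.internal.acme.example", on_prem=True),
}

def cloud_query(prompt: str, user_default: str = "openai") -> str:
    provider = PROVIDERS[user_default]
    # A real implementation would send the prompt to provider.endpoint;
    # here we only show where the user's (or employer's) choice takes effect.
    return f"[{provider.name} @ {provider.endpoint} handles: {prompt!r}]"

print(cloud_query("Summarize this 40-page contract", user_default="corporate"))
```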
Google could do something similar, especially because Android is a much more open OS and Android users expect customization. However, Google faces an innovator’s dilemma on Android: although most high-end Android phones could easily run advanced AI models locally, doing so would deprive Google of potential revenue from running those models on its cloud (and from inserting ads or sponsored suggestions).
Because of Google’s (and Microsoft’s) innovator’s dilemma, Apple has a unique opportunity to set standards for multi-model communication, not only for sending complex queries from an on-phone model to a cloud model, but potentially even for directing queries to a variety of different AIs depending on the context and subject.
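To make “directing queries to different AIs depending on the context and subject” concrete, here is a toy router. The heuristics, thresholds, and model names are assumptions of mine, not a description of how Apple (or anyone else) actually routes queries:

```python
# Toy query router: decide whether a request stays on-device or goes to
# a cloud model, and which cloud model, based on simple heuristics.
# The rules, thresholds, and model names are illustrative assumptions.

PERSONAL_HINTS = ("my calendar", "my email", "text message", "remind me")
SPECIALIST_TOPICS = {"medical": "medical-cloud-model", "legal": "legal-cloud-model"}

def route(query: str) -> str:
    q = query.lower()
    if any(hint in q for hint in PERSONAL_HINTS):
        return "local-model"            # personal data stays on-device
    for topic, model in SPECIALIST_TOPICS.items():
        if topic in q:
            return model                # domain-specific cloud expert
    if len(q.split()) > 40:
        return "general-cloud-model"    # long or complex prompts escalate
    return "local-model"                # default: keep it local

for q in ["Remind me to call Mom after my 3pm meeting",
          "Explain the legal difference between an LLC and a C-corp"]:
    print(f"{q!r} -> {route(q)}")
```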
LG has been actively following companies experimenting with multi-model structures. Mixtral is an open-source “mixture of experts” model. Edgerunner takes this further, leveraging a large number of small (~8B-parameter) models that it loads on demand based on which is most likely to provide an informative answer to a user’s query. Crestra uses a small model to provide immediate, low-latency responses to customer requests over the phone, and then polls a larger model to inform the smaller model while it is speaking.
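A minimal sketch of that last pattern, with both models stubbed out and made-up latencies: a fast small model answers immediately, while a slower, larger model runs in the background and enriches the next turn.

```python
# Sketch of the "small model talks now, big model catches up" pattern.
# Both model calls are stubs with invented latencies, not any vendor's API.
import asyncio

async def small_model(prompt: str, hint: str = "") -> str:
    await asyncio.sleep(0.05)   # fast, low-latency on-device/edge model
    suffix = f" (refined with: {hint})" if hint else ""
    return f"Quick answer to {prompt!r}{suffix}"

async def large_model(prompt: str) -> str:
    await asyncio.sleep(1.0)    # slower but more capable cloud model
    return "deeper context from the large model"

async def respond(prompt: str) -> None:
    big = asyncio.create_task(large_model(prompt))    # start the slow call early
    print(await small_model(prompt))                  # reply immediately
    print(await small_model(prompt, hint=await big))  # next turn uses the richer context

asyncio.run(respond("Why was I double-billed last month?"))
```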
There are clear advantages to a multi-model setup, and we believe that Apple’s bringing this approach to the mainstream will lead to an explosion in innovation, especially in models that are hyper-focused on specific use cases or knowledge domains.