What we know about Apple’s on-device AI

After Microsoft Build and Google I/O, Apple was under a lot of pressure to showcase its on-device AI capabilities at the 2024 Worldwide Developers Conference. Judging by the demos, Apple has done a great job of integrating generative AI into the user experience across all of its devices.

One of the most impressive aspects of the demos was how much of the workload runs on the devices themselves. Apple has leveraged its cutting-edge processors and a large body of open research to bring high-quality, low-latency AI features to its phones and computers. Here’s what we know about Apple’s on-device AI.

A model with 3 billion parameters

According to Apple’s Platforms State of the Union presentation and an accompanying blog post published on June 10, Apple uses a 3-billion-parameter model. Apple does not say explicitly which model it uses as the base. However, it has recently released several open models, including the OpenELM family of language models, which includes a 3-billion-parameter version.

OpenELM has been optimized for resource-constrained devices. For example, it makes small modifications to the basic transformer architecture to improve model quality without increasing the parameter count. The base model used in Apple devices may be a specialized version of OpenELM-3B.

OpenELM was trained on 1.8 trillion tokens of open datasets. According to the blog post, the new base model is trained on “licensed data, including data selected to enhance specific features, as well as publicly available data collected by our AppleBot web crawler.”

What is licensed data? From what we know, Apple has a $25-50 million deal with Shutterstock for images and a possible $50 million deal with major news and publishing organizations.

The model has been fine-tuned for instruction following through reinforcement learning from human feedback (RLHF) and a “rejection sampling fine-tuning algorithm with teacher committee.” RLHF uses human-annotated data to model user preferences and train language models to better follow instructions. It gained popularity with the release of ChatGPT.

Rejection sampling generates multiple examples at each training step and uses the one that scores best to update the model. The Llama-2 team also used rejection sampling to fine-tune its models. The “teacher committee” suggests that larger and more capable models were used as references to evaluate the quality of the training examples generated to fine-tune the on-device model. Many researchers use frontier models such as GPT-4 and Claude 3 as teachers in these scenarios. It’s unclear which models Apple used to evaluate the samples.
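The selection step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Apple’s algorithm: the names `rejection_sample`, `build_sft_dataset`, and `make_cycling_generator` are hypothetical, the “teachers” are plain scoring functions standing in for larger models, and averaging the committee’s scores is just one possible way to combine them.

```python
from itertools import cycle
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces: a generator maps a prompt to one sampled response,
# and a teacher maps (prompt, response) to a quality score (higher is better).
Generator = Callable[[str], str]
Teacher = Callable[[str, str], float]


def rejection_sample(prompt: str, generate: Generator,
                     teachers: Sequence[Teacher],
                     n_candidates: int = 4) -> Tuple[str, float]:
    """Sample several responses and keep the one the committee scores highest."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    # Assumption: the committee's verdict is the average of the teachers' scores.
    scores = [mean([t(prompt, c) for t in teachers]) for c in candidates]
    best = max(range(n_candidates), key=lambda i: scores[i])
    return candidates[best], scores[best]


def build_sft_dataset(prompts: Sequence[str], generate: Generator,
                      teachers: Sequence[Teacher],
                      n_candidates: int = 4) -> List[Tuple[str, str]]:
    """Collect (prompt, best response) pairs to fine-tune the small model on."""
    return [(p, rejection_sample(p, generate, teachers, n_candidates)[0])
            for p in prompts]


def make_cycling_generator(drafts: Sequence[str]) -> Generator:
    """Toy stand-in for a language model: returns fixed drafts in turn."""
    it = cycle(drafts)
    return lambda prompt: next(it)


if __name__ == "__main__":
    generate = make_cycling_generator(["Not sure.", "Paris.", "It is Paris, France."])
    teachers = [
        lambda p, r: 1.0 if "Paris" in r else 0.0,   # toy "correctness" teacher
        lambda p, r: 1.0 / (1 + len(r.split())),     # toy "brevity" teacher
    ]
    print(build_sft_dataset(["What is the capital of France?"], generate, teachers, 3))
```

In a real pipeline the rejected candidates are simply discarded, which is where the technique gets its name, and the kept pairs feed an ordinary supervised fine-tuning run.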


Apple has used several techniques to improve the capabilities of the models while maintaining their resource efficiency.

According to the blog post, the base model uses “grouped-query attention” (GQA), a technique developed by Google Research that speeds up inference without sharply increasing memory and compute requirements. (OpenELM also uses GQA.)
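The core idea of GQA is that several query heads share one key/value head, which shrinks the key/value cache that must stay in memory during generation. The sketch below shows only the head mapping and the resulting cache arithmetic, not a full attention implementation. The function names and the configuration (32 layers, 24 query heads, 128-dimensional heads, 16-bit values) are illustrative assumptions, not figures published by Apple.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the shared key/value head that a query head reads."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: keys and values for every layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to make the arithmetic concrete.
    layers, q_heads, head_dim, seq_len = 32, 24, 128, 4096
    mha = kv_cache_bytes(layers, q_heads, head_dim, seq_len)  # one KV head per query head
    gqa = kv_cache_bytes(layers, 8, head_dim, seq_len)        # 3 query heads per KV head
    print(f"multi-head cache:    {mha / 2**30:.2f} GiB")
    print(f"grouped-query cache: {gqa / 2**30:.2f} GiB")
    print("query head 5 reads KV head", kv_head_for_query_head(5, q_heads, 8))
```

With one KV head per query head this is standard multi-head attention, and with a single KV head it becomes multi-query attention. GQA sits between the two.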

According to Apple’s blog, the model uses “palletization,” a technique for compressing model weights using lookup tables and indexes to group similar model weights together. However, the presentation mentioned “quantization”, which is another compression technique that reduces the number of bits per parameter.
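To make the lookup-table idea concrete, here is a minimal sketch of palettization (the spelling used in Apple’s Core ML Tools documentation) on a flat list of weights. It clusters the weights with a simple one-dimensional k-means, stores the cluster centers as the palette, and replaces each weight with a small index. The function names and the clustering details are assumptions for illustration. Production tools work on tensors and use more careful clustering.

```python
from typing import List, Tuple


def nearest(palette: List[float], w: float) -> int:
    """Index of the palette entry closest to weight w."""
    return min(range(len(palette)), key=lambda j: abs(w - palette[j]))


def palettize(weights: List[float], n_bits: int = 2,
              n_iters: int = 20) -> Tuple[List[float], List[int]]:
    """Compress weights into a 2**n_bits-entry lookup table plus indices."""
    k = 2 ** n_bits
    lo, hi = min(weights), max(weights)
    # Start with palette entries spread evenly over the weight range.
    palette = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(n_iters):
        indices = [nearest(palette, w) for w in weights]
        for j in range(k):
            members = [w for w, idx in zip(weights, indices) if idx == j]
            if members:  # leave unused entries where they are
                palette[j] = sum(members) / len(members)
    return palette, [nearest(palette, w) for w in weights]


def reconstruct(palette: List[float], indices: List[int]) -> List[float]:
    """Decompress: look each index up in the palette."""
    return [palette[i] for i in indices]


def compressed_bits(n_weights: int, n_bits: int, palette_value_bits: int = 16) -> int:
    """Storage cost: one n_bits index per weight plus the palette itself."""
    return n_weights * n_bits + (2 ** n_bits) * palette_value_bits


if __name__ == "__main__":
    weights = [-0.52, -0.48, -0.05, 0.01, 0.03, 0.47, 0.50, 0.55]
    palette, indices = palettize(weights, n_bits=2)
    print("palette:", [round(p, 3) for p in palette])
    print("indices:", indices)
    print("restored:", [round(w, 3) for w in reconstruct(palette, indices)])
```

Uniform quantization, by contrast, would snap each weight to an evenly spaced grid of levels instead of to learned cluster centers. Both approaches reduce the number of bits stored per parameter.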

Moreover, the models will only run on MacBooks with M1 and later chips, and on the iPhone 15 Pro and Pro Max, which are equipped with the A17 Pro chip. This suggests that the model uses optimization techniques that are particularly suited to Apple chips, such as the “LLM in a flash” technique introduced late last year.

Reported results on the iPhone 15 Pro include “a first-token latency of approximately 0.6 milliseconds per prompt token and a generation rate of 30 tokens per second.” This means that if, for example, you send a 1,000-token prompt to the model, it will start responding within 0.6 seconds and then generate 30 tokens per second, which is very reasonable performance.
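The arithmetic behind those figures is easy to check. The short sketch below assumes that time to first token scales linearly with prompt length and that the generation rate stays constant, which is a simplification. The function names are made up for illustration.

```python
MS_PER_PROMPT_TOKEN = 0.6   # reported first-token latency per prompt token
TOKENS_PER_SECOND = 30.0    # reported generation rate


def time_to_first_token_s(prompt_tokens: int) -> float:
    """Seconds before the first output token appears."""
    return prompt_tokens * MS_PER_PROMPT_TOKEN / 1000.0


def total_response_time_s(prompt_tokens: int, output_tokens: int) -> float:
    """Seconds to process the prompt and then generate the full response."""
    return time_to_first_token_s(prompt_tokens) + output_tokens / TOKENS_PER_SECOND


if __name__ == "__main__":
    print(time_to_first_token_s(1000))        # 1,000-token prompt
    print(total_response_time_s(1000, 150))   # plus a 150-token reply
```

Under these assumptions, a 150-token reply to a 1,000-token prompt would take roughly 0.6 seconds of prompt processing plus 5 seconds of generation.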


Because there is only so much a small language model can do, Apple’s engineers have created fine-tuned versions of the base model to store on the device. However, to avoid storing multiple copies of the model, they use low-rank adaptation (LoRA) adapters.

LoRA is a technique that trains a small set of low-rank weight updates to adapt the model to a specific task while the base model’s weights stay frozen. Adapters store the LoRA weights and plug them into the base model at inference time. Each adapter is less than 100 megabytes, which allows a device to store and use multiple LoRA adapters for a variety of tasks such as proofreading, summarizing, replying to emails, and more.
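How an adapter plugs into a layer can be sketched as follows. In the standard LoRA formulation, a frozen weight matrix W is augmented with the product of two small matrices, B and A, scaled by alpha / r, where r is the rank. The pure-Python code below is an illustrative sketch of that formula only. The function names, the tiny matrices, and the rank-16, 2048-dimension example are assumptions, not details of Apple’s implementation.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 alpha: float = 1.0) -> Vector:
    """Compute W x + (alpha / r) * B (A x) without modifying W.

    W is d_out x d_in (frozen base weights), A is r x d_in, B is d_out x r.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Number of trainable values in one adapter pair (A and B)."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]           # shared base weights
    adapters = {                            # hypothetical per-task adapters
        "summarize": ([[1.0, 1.0]], [[2.0], [0.0]]),
        "proofread": ([[1.0, -1.0]], [[0.0], [1.0]]),
    }
    x = [3.0, 4.0]
    for task, (A, B) in adapters.items():
        print(task, lora_forward(W, A, B, x))
    # Illustrative sizes: a rank-16 adapter on a 2048 x 2048 layer.
    print(lora_param_count(2048, 2048, 16), "adapter values vs", 2048 * 2048, "base values")
```

Because the base weights are shared, switching tasks only means swapping which small A and B matrices are applied, which is why several adapters can sit on the device next to a single copy of the model.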

According to Apple, human evaluations show that its model is generally preferred over other models of the same size and some larger models, including Gemma-2B, Mistral-7B, Phi-3-mini, and Gemma-7B.

At first glance, Apple’s on-device AI shows how far you can go by combining small models with the right optimization techniques, data, and hardware. Apple has worked hard to strike the right balance between accuracy and a good user experience. It will be interesting to see how the demos hold up when the technology is rolled out to users in the fall.