On-device AI models and Core ML Tools: Insights from WWDC 2024

During Apple’s Worldwide Developers Conference (WWDC) 2024, the company unveiled a number of enhancements for deploying and running AI models on devices. Among the most important were major improvements to Core ML Tools, shipped in the 8.0b1 pre-release.

These updates are intended to improve the performance and efficiency of deploying machine learning (ML) models on Apple devices. Here’s a breakdown of these innovations, their impact on developers, and their benefits for end users.

Explanation of key terminology

Before we dive into the updates, let’s clarify some key terms:

Palettization

This technique reduces the precision of model weights by grouping them into clusters and representing each cluster with a single value stored in a lookup table. It is similar to reducing an image to a limited color palette, where one representative color stands in for many similar shades. In machine learning, palettization significantly reduces the size of a model by compressing its weight values.
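
As a minimal sketch of how this looks in practice with the coremltools.optimize.coreml APIs (assuming you already have a converted model at the placeholder path model.mlpackage):

    import coremltools as ct
    from coremltools.optimize.coreml import (
        OpPalettizerConfig,
        OptimizationConfig,
        palettize_weights,
    )

    # Load an existing Core ML model ("model.mlpackage" is a placeholder path).
    mlmodel = ct.models.MLModel("model.mlpackage")

    # Cluster each weight tensor into 2**4 = 16 centroids with k-means and
    # store the weights as 4-bit indices into a lookup table.
    op_config = OpPalettizerConfig(mode="kmeans", nbits=4)
    config = OptimizationConfig(global_config=op_config)

    compressed_model = palettize_weights(mlmodel, config)
    compressed_model.save("model_palettized.mlpackage")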

Quantization

Quantization is the process of reducing the precision of weights and activations, for example from 32-bit floating-point numbers to lower-precision values such as 8-bit integers. This compression technique reduces model size and can also speed up inference on hardware that computes faster at lower precision.
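
The weight-quantization flow mirrors the palettization sketch above (the model path is again a placeholder):

    import coremltools as ct
    from coremltools.optimize.coreml import (
        OpLinearQuantizerConfig,
        OptimizationConfig,
        linear_quantize_weights,
    )

    mlmodel = ct.models.MLModel("model.mlpackage")  # placeholder path

    # Map float weights to 8-bit integers using a symmetric scale.
    op_config = OpLinearQuantizerConfig(mode="linear_symmetric")
    config = OptimizationConfig(global_config=op_config)

    quantized_model = linear_quantize_weights(mlmodel, config)
    quantized_model.save("model_quantized.mlpackage")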

Block quantization

This variant of quantization divides the model weights into smaller blocks and quantizes each block individually. Because the quantization parameters are fitted to each block rather than to an entire tensor, the result is more accurate.

Pruning

Pruning is a model compression technique that eliminates weights that are not critical and have the least impact on the model’s predictions. The least important weights are set to zero, and the result can be stored efficiently using a sparse matrix representation.
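
A minimal sketch using the magnitude-based pruning API (the path is a placeholder):

    import coremltools as ct
    from coremltools.optimize.coreml import (
        OpMagnitudePrunerConfig,
        OptimizationConfig,
        prune_weights,
    )

    mlmodel = ct.models.MLModel("model.mlpackage")  # placeholder path

    # Zero out the 75% of weights with the smallest magnitude; the zeros
    # are stored compactly in a sparse representation.
    op_config = OpMagnitudePrunerConfig(target_sparsity=0.75)
    config = OptimizationConfig(global_config=op_config)

    pruned_model = prune_weights(mlmodel, config)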

Stateful models

Stateful models keep track of information that needs to persist across multiple inference passes; in other words, they maintain context between calls. This is especially important for tasks such as language modeling, where the model must remember previously generated words in order to produce coherent subsequent text.

Core ML Tools (coremltools) is a Python package for converting third-party models to a format suitable for Core ML (Apple’s framework for integrating machine learning models into applications). Core ML Tools supports conversion from popular libraries such as TensorFlow and PyTorch to the Core ML model package format.

The coremltools package allows you to:

  • Convert trained models from different libraries and platforms to the Core ML model package format.
  • Read, write, and optimize Core ML models to reduce storage space, reduce power consumption, and minimize inference latency.
  • Verify creation and conversion by making predictions with Core ML on macOS.
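
To illustrate the workflow end to end, here is a minimal sketch that converts a traced PyTorch model (torchvision’s MobileNetV2, used purely as an example) and then verifies it with a prediction:

    import torch
    import torchvision
    import coremltools as ct

    # Trace a trained PyTorch model so the converter can capture its graph.
    torch_model = torchvision.models.mobilenet_v2(weights="DEFAULT").eval()
    example_input = torch.rand(1, 3, 224, 224)
    traced_model = torch.jit.trace(torch_model, example_input)

    # Convert to the Core ML model package format.
    mlmodel = ct.convert(
        traced_model,
        inputs=[ct.TensorType(name="image", shape=example_input.shape)],
    )

    # Verify the conversion by making a prediction with Core ML on macOS.
    prediction = mlmodel.predict({"image": example_input.numpy()})
    mlmodel.save("MobileNetV2.mlpackage")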

Core ML provides a unified representation of all models, allowing your application to use Core ML APIs and user data to make predictions and fine-tune models directly on the user’s device. This approach eliminates the need for a network connection, keeps user data private, and makes the application more responsive. Core ML optimizes on-device performance by leveraging the CPU, GPU, and Neural Engine (NE) while minimizing memory usage and power consumption.

We’ve covered the theory and terminology; now it’s time to dive into the new features and changes in Core ML Tools 8.0b1.

New utilities and stateful models

The introduction of coremltools.utils.MultiFunctionDescriptor() and coremltools.utils.save_multifunction simplifies the creation of ML programs with multiple functions that can share weights among themselves. This makes models more versatile and easier to use, since a specific function can be loaded on its own for prediction.

Core ML now supports stateful models: recent changes to the converter enable generating models with the new state type introduced in iOS 18 and macOS 15. These models can carry information from one inference pass to the next, which is especially useful for tasks where the model must remember inputs it has seen in the past.
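
A minimal conversion sketch, following the pattern in the coremltools documentation: a PyTorch buffer registered with register_buffer is updated in place and exposed to Core ML as state (the toy model below is hypothetical):

    import torch
    import coremltools as ct

    class UpdateBufferModel(torch.nn.Module):
        """Toy model whose 'state_1' buffer persists across predictions."""

        def __init__(self):
            super().__init__()
            self.register_buffer("state_1", torch.zeros(3, dtype=torch.float32))

        def forward(self, x):
            self.state_1.add_(x)  # in-place buffer update becomes a state write
            return self.state_1 * 2.0  # placeholder computation

    traced = torch.jit.trace(UpdateBufferModel().eval(), torch.zeros(3))

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=(3,))],
        # Registered buffers listed here are exposed via the new state type.
        states=[ct.StateType(wrapped_type=ct.TensorType(shape=(3,)), name="state_1")],
        minimum_deployment_target=ct.target.iOS18,
    )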

Advanced compression techniques

Core ML Tools has expanded its range of compression capabilities to reduce model sizes while maintaining performance. The updated coremltools.optimize module now supports:

  • Block quantization: Allows finer-grained quantization control, because the model weights are divided into smaller sections that are quantized separately.
  • Grouped-channel palettization: Lets groups of channels share their own lookup tables, reducing the number of unique weight values while balancing flexibility and accuracy.
  • 4-bit weight quantization: Halves memory requirements compared to 8-bit quantization, further reducing model size.
  • 3-bit palettization: Extends the bit-depth options for palettization by representing weight clusters with only three bits, enabling higher compression.

These techniques, alongside established compression modes such as 8-bit lookup tables (LUTs) for palettization and weight pruning combined with quantization or palettization, offer effective tools for reducing model size while preserving performance.
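
The new modes are configured through the same OptimizationConfig mechanism as the earlier sketches. The parameter names below (dtype, granularity, block_size, group_size) follow the 8.0b1 release materials; treat this as a sketch rather than a definitive reference:

    import coremltools.optimize as cto

    # Block quantization: 4-bit weights, with a separate scale fitted to
    # each block of 32 values.
    block_quant_config = cto.coreml.OptimizationConfig(
        global_config=cto.coreml.OpLinearQuantizerConfig(
            mode="linear_symmetric",
            dtype="int4",
            granularity="per_block",
            block_size=32,
        )
    )

    # Grouped-channel palettization: a 3-bit lookup table shared by each
    # group of 16 channels.
    grouped_palettize_config = cto.coreml.OptimizationConfig(
        global_config=cto.coreml.OpPalettizerConfig(
            mode="kmeans",
            nbits=3,
            granularity="per_grouped_channel",
            group_size=16,
        )
    )

These configs are passed to linear_quantize_weights and palettize_weights exactly as in the earlier sketches; models compressed this way require iOS 18/macOS 15 as the minimum deployment target.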

Advanced API improvements: compression and quantization

The coremltools.optimize module received significant API updates to support advanced compression techniques. For example, a new API for quantizing activations based on calibration data can turn a W16A16 Core ML model (16-bit weights and activations) into a W8A8 model (8-bit weights and activations), improving performance while maintaining accuracy. Additionally, coremltools.optimize.torch gained both data-free compression methods and methods based on calibration data, making it easier to optimize PyTorch models for Core ML.
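
A sketch of the calibration-based flow, assuming the experimental activation-quantization API from the 8.0b1 materials (the model path, feature name, and random calibration data are all placeholders):

    import numpy as np
    import coremltools as ct
    import coremltools.optimize as cto

    # A Core ML model whose weights are already 8-bit (W8A16); placeholder path.
    mlmodel_w8 = ct.models.MLModel("model_w8.mlpackage")

    # Representative inputs for calibration (random stand-in data; "input"
    # is a placeholder feature name).
    sample_data = [
        {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
        for _ in range(16)
    ]

    act_config = cto.coreml.experimental.OpActivationLinearQuantizerConfig(
        mode="linear_symmetric"
    )
    config = cto.coreml.OptimizationConfig(global_config=act_config)

    # Calibrates activation ranges and emits a W8A8 model.
    mlmodel_w8a8 = cto.coreml.experimental.linear_quantize_activations(
        mlmodel_w8, config, sample_data
    )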

iOS 18/macOS 15 optimizations

The latest operating systems support new operations such as constexpr_blockwise_shift_scale, constexpr_lut_to_dense, and constexpr_sparse_to_dense, which are crucial for efficient model compression. Updates to Gated Recurrent Unit (GRU) operations and the addition of support for PyTorch’s scaled_dot_product_attention operation help transformer models and other complex architectures run well on Apple silicon. These updates ensure more efficient execution and better use of hardware capabilities.

Experimental torch.export conversion

Support for torch.export conversion lets you convert a PyTorch model to Core ML directly from an exported program.

This process includes:

  1. Import the necessary libraries.

  2. Export the PyTorch model with torch.export.

  3. Convert the exported program to a Core ML model with coremltools.convert.
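
Concretely, the three steps might look like this (torchvision’s MobileNetV3 is used purely as an example; torch.export support is still experimental):

    import torch
    import torchvision
    import coremltools as ct

    # 1. Import the libraries and load a trained PyTorch model.
    torch_model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()

    # 2. Export the model with torch.export.
    example_inputs = (torch.rand(1, 3, 224, 224),)
    exported_program = torch.export.export(torch_model, example_inputs)

    # 3. Convert the exported program to a Core ML model.
    mlmodel = ct.convert(exported_program)
    mlmodel.save("MobileNetV3.mlpackage")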

This streamlined process reduces the complexity of deploying PyTorch models on Apple devices while taking advantage of Core ML’s performance.

Multifunction models

Multifunction model support in Core ML Tools allows you to combine models with shared weights into a single ML program. This is beneficial for multi-task applications, such as pairing one feature extractor with both a classifier and a regressor. The MultiFunctionDescriptor and save_multifunction utilities ensure that shared weights are not duplicated, saving storage space and memory.
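
A minimal sketch of merging two models that share a backbone (the .mlpackage paths and function names are hypothetical):

    import coremltools as ct
    from coremltools.utils import MultiFunctionDescriptor, save_multifunction

    # Describe which function to take from each source model package.
    desc = MultiFunctionDescriptor()
    desc.add_function(
        "classifier.mlpackage", src_function_name="main", target_function_name="classify"
    )
    desc.add_function(
        "regressor.mlpackage", src_function_name="main", target_function_name="regress"
    )
    desc.default_function_name = "classify"

    # Shared weights are deduplicated in the combined package.
    save_multifunction(desc, "combined.mlpackage")

    # Load just one function for prediction.
    regressor = ct.models.MLModel("combined.mlpackage", function_name="regress")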

Performance improvements and bug fixes

The Core ML Tools 8.0b1 release also includes various bug fixes and optimizations that make development smoother. Known issues, such as conversion errors in some palettization modes and incorrect quantization scales, have been fixed to improve the reliability and accuracy of compressed models.

Benefits for end users

The improvements introduced in the coremltools 8.0b1 pre-release provide several key benefits to end users, improving the overall experience of AI-powered applications:

  • Improved performance: Smaller, optimized models load and run faster on devices, enabling quicker responses and smoother interactions.
  • Reduced application sizes: Compressed models take up less space, keeping apps lighter, which is especially useful for users with limited storage on their mobile devices.
  • Improved functionality: Multifunction and stateful models enable more complex and innovative features in applications, delivering more sophisticated capabilities and more intelligent behaviors.
  • Better battery life: Optimized model execution translates into lower energy consumption and longer battery life on mobile devices during intensive AI operations.
  • Enhanced privacy: On-device AI processes user data locally, eliminating the risk of sending it to external servers.

Conclusion

The coremltools 8.0b1 pre-release represents a significant step forward for on-device AI. Developers can now create more efficient, compact, and versatile ML models using improved compression techniques, stateful model support, and multifunction model tools. These advancements underscore Apple’s commitment to providing developers with robust tools to harness the power of Apple silicon, ultimately delivering faster and more efficient on-device AI applications.

As Core ML and its ecosystem evolve, the opportunities for innovation in AI-powered applications continue to expand, opening the door to more sophisticated and user-friendly experiences.

In an upcoming article, we will demonstrate these new features practically in a sample project, showing you how to apply them in real-world scenarios. Stay tuned for further information!