Description

The Apple Neural Engine (ANE), which ships in M3 / M4 / M5 Macs and in A17 Pro / A18 / A19 iPhones and iPads, delivers high-throughput int8 matrix multiplication at a fraction of the energy cost of the Metal GPU path. llama.cpp and ggml currently route all LLM inference through Metal or the CPU, leaving the ANE unused on every Apple device that ships one.
Draw Things recently demonstrated that the ANE can be integrated as a targeted accelerator for int8 matmul inside a custom inference stack, using CoreML exclusively as a front door to the ANE, while the host runtime keeps full ownership of intermediate allocations, the KV cache, kernel caching, and scheduling. This grant funds the integration of that architectural pattern into llama.cpp / ggml and the QVAC addon stack: the ANE accelerates matmul, Metal and the CPU continue to own everything else, and the existing Metal path remains the fallback.
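The division of labour described above (ANE for int8 matmul, Metal / CPU for everything else, Metal as the fallback) can be sketched as a small dispatch rule. This is a minimal illustration under stated assumptions, not llama.cpp or ggml code: `Backend`, `OpKind`, `DeviceCaps`, `pick_backend`, and `matmul_int8_ref` are hypothetical names introduced here, and the `has_ane` flag stands in for whatever runtime probe the real integration would use. The reference matmul only shows the arithmetic contract (int8 inputs, int32 accumulation) that an accelerated path would have to match.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical backend tags for illustration; not actual ggml identifiers.
enum class Backend { ANE, Metal, CPU };

// Hypothetical operation kinds; only int8 matmul is an ANE candidate here.
enum class OpKind { MatMulInt8, Attention, Softmax, Other };

// Assumed result of a runtime capability probe.
struct DeviceCaps {
    bool has_ane;
    bool has_metal;
};

// Route int8 matmul to the ANE when it is present; every other op, and
// every device without an ANE, falls back to Metal and then to the CPU.
Backend pick_backend(OpKind op, const DeviceCaps& caps) {
    if (op == OpKind::MatMulInt8 && caps.has_ane) {
        return Backend::ANE;
    }
    return caps.has_metal ? Backend::Metal : Backend::CPU;
}

// Reference int8 matmul with int32 accumulation, row-major:
// C[m x n] = A[m x k] * B[k x n].
std::vector<std::int32_t> matmul_int8_ref(const std::vector<std::int8_t>& a,
                                          const std::vector<std::int8_t>& b,
                                          int m, int k, int n) {
    assert(a.size() == static_cast<std::size_t>(m) * static_cast<std::size_t>(k));
    assert(b.size() == static_cast<std::size_t>(k) * static_cast<std::size_t>(n));
    std::vector<std::int32_t> c(static_cast<std::size_t>(m) * static_cast<std::size_t>(n), 0);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            std::int32_t acc = 0;
            for (int p = 0; p < k; ++p) {
                acc += static_cast<std::int32_t>(a[i * k + p]) *
                       static_cast<std::int32_t>(b[p * n + j]);
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}
```

In this sketch the host runtime still decides where each op runs and owns all buffers; the accelerator is only ever handed a matmul, which mirrors the "front door" role the text assigns to CoreML.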