THE LATEST NEWS
NXP’s Edge LLM Strategy: Kinara, RAG, Agents

SANTA CLARA, Calif. — At the Embedded Vision Summit 2025, Ali Ors, NXP’s global director of AI strategy and technologies, went into some detail on the company’s strategy for enabling LLM inference on edge devices.

NXP’s i.MX-8M+ and i.MX-95 application processors both have NXP’s NPU on chip. The i.MX-95 can handle inference for LLMs below around 4 billion parameters in automotive, industrial and smart appliance applications. For bigger LLMs, an external accelerator chip is required.

In February, NXP announced it would acquire AI accelerator chip startup Kinara. Ara-2, the startup’s LLM-focused chip, offers up to 40 eTOPS (the equivalent of a 40-TOPS GPU; Kinara’s accelerator is not a MAC array, so the company does not quote direct TOPS figures). Ara-2 can have up to 16 GB of LPDDR4 attached and supports high data-transfer-rate DDR. It connects to its host, which could be an NXP application processor or another CPU, through PCIe or USB.

Ors said that designs currently using the i.MX-8M+ or i.MX-95 can add Ara-2 to support bigger LLMs, and multiple accelerators can be supported if further expansion is required. Ara-2 can run multiple data streams and multiple models concurrently, which will be critical for agent-based workloads, Ors said, noting that this capability was particularly attractive to NXP.
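As a rough illustration of what multi-model, multi-stream execution looks like from the software side, here is a minimal Python sketch; the StubModel class, the queue-based streams and the thread pool are illustrative placeholders, not Kinara or NXP APIs.

from concurrent.futures import ThreadPoolExecutor
import queue

class StubModel:
    """Stand-in for a compiled model session; not a Kinara or NXP API."""
    def __init__(self, name):
        self.name = name
    def infer(self, frame):
        return f"{self.name}: processed {frame}"

def run_stream(model, frames):
    """Drain one input stream through one model until a None sentinel arrives."""
    results = []
    while (frame := frames.get()) is not None:
        results.append(model.infer(frame))
    return results

# Two independent models fed by two independent input streams.
detector, captioner = StubModel("person-detector"), StubModel("scene-captioner")
cam_a, cam_b = queue.Queue(), queue.Queue()
for i in range(3):
    cam_a.put(f"frame_a{i}")
    cam_b.put(f"frame_b{i}")
cam_a.put(None)
cam_b.put(None)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_stream, detector, cam_a),
               pool.submit(run_stream, captioner, cam_b)]
    print([f.result() for f in futures])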

Working with a discrete AI accelerator means NXP customers can take advantage of specialized architectures to get more performance than could be achieved on-chip in an application processor, Ors said. Discrete accelerators also give designs the versatility to adapt to changing needs, such as new operators and models, and to emerging paradigms like agentic AI and physical AI.

Kinara has shipped about half a million of its AI accelerator chips so far, mostly for pilot programs and use-case evaluations, in applications ranging from embedded systems to a Lenovo AI PC.

The startup already has a proof-of-concept up and running with NXP application processors; Ors said this does not represent any level of software integration, simply that both companies’ software stacks work as intended.

“Going after sockets and trying to win together with [Kinara], we saw they had significant design wins with some big names,” Ors told EE Times in an interview after his presentation. “So there was independent validation of their technology as well.”

As well as acquiring Kinara, NXP has introduced a new tool flow for running LLMs and generative AI at the edge. GenAI Flow has a library of functional building blocks, including wake event detectors, speech recognition and text-to-speech models. These models are critical in the embedded space as systems may not have a keyboard or screen, Ors pointed out. These building-block models can run on the host: on the application processor’s NPU in the case of the i.MX-95, or even on a CPU for older parts.
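A minimal sketch of how those building blocks might chain together on a screenless device, with every function standing in for one of the library’s models rather than any actual GenAI Flow API:

def wake_word_detected(audio):
    # Tiny always-on detector; a placeholder for a wake-event model.
    return audio.startswith("hey device")

def speech_to_text(audio):
    # Placeholder for an ASR model running on the host NPU or CPU.
    return audio.replace("hey device ", "")

def answer_with_llm(prompt):
    # Placeholder for the LLM, on the NPU or an attached accelerator.
    return f"(answer to: {prompt})"

def text_to_speech(text):
    # Placeholder for a TTS model that produces the spoken reply.
    return text.encode()

def handle(audio):
    if not wake_word_detected(audio):
        return None  # stay idle until the wake event fires
    return text_to_speech(answer_with_llm(speech_to_text(audio)))

print(handle("hey device how do I descale the machine"))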

NXP’s GenAI Flow software stack (Source: NXP)

GenAI Flow can split these parts of the workload between cores, but splitting individual models across more than one core gets “complicated,” Ors said.

“Models are dynamic and you have to be careful how you balance that,” he added. “Any time you start splitting the graph, you create bottlenecks that previously didn’t exist. When you’re doing a lot of passing back and forth, you might start spending more time on back and forth than on compute, so you might be better off running slower but all in one place versus trying to look for something else that’s available and passing [that core] the data—that data processing could be where you end up spending more time than on compute.”
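Ors’s tradeoff can be put in back-of-envelope terms: splitting a graph across two cores only pays off if the compute time saved exceeds the time added moving activations over the link. A small sketch, with made-up numbers purely for illustration:

def split_worthwhile(compute_ms, speedup, activation_mb, link_gbps):
    """True if splitting the graph across two cores beats keeping it on one."""
    # 1 Gb/s moves 1 Mb per millisecond, so megabytes * 8 / (Gb/s) gives milliseconds.
    transfer_ms = activation_mb * 8 / link_gbps
    split_ms = compute_ms / speedup + transfer_ms
    return split_ms < compute_ms

# A 50 ms block with an ideal 2x speedup, but 40 MB of activations over a ~4 Gb/s link:
print(split_worthwhile(50.0, 2.0, 40.0, 4.0))  # False: the transfer dominates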

GenAI Flow also has a tool for retrieval augmented generation (RAG). RAG techniques give LLMs context for edge use cases like automotive, industrial or healthcare, where grounding in reality is critical.

“With CNNs, the complexity is at an acceptable level that a lot of productization happens on custom models,” Ors said. “With LLMs, [custom models are] a lot more costly—you need a lot more data, a lot more compute and a lot more specific expertise, which is very hard to come by, and that expertise is costly, the compute is costly and the data curation is extremely costly.”

For this reason, most edge customers prefer to use open-source LLMs like Llama combined with techniques like RAG, which gives an LLM access to a database of factual content to draw on. For an edge application, the RAG database might be hundreds of kilobytes in size, running on the application processor. GenAI Flow includes a library of open-source models pre-optimized for NXP hardware.
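A minimal sketch of that pattern, with a tiny in-memory store of device-manual snippets and naive keyword-overlap scoring standing in for the embedding search a production RAG pipeline would use:

def retrieve(query, docs, k=2):
    """Return the k passages that share the most words with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

def build_prompt(query, docs):
    """Prepend retrieved passages so the LLM answers from them, not its training data."""
    context = "\n".join(retrieve(query, docs))
    return f"Use only this context to answer:\n{context}\n\nQuestion: {query}"

manuals = [
    "To calibrate the sensor, hold the reset button for five seconds.",
    "Replace the electrode pads after 30 uses.",
    "The device must not be used while charging.",
]
print(build_prompt("How do I calibrate the sensor?", manuals))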

“Those models are still trained on anything and everything, but RAG allows you to give context to that training,” Ors said. “If the LLM is running on a medical device, I can put in all the manuals on how to use that device so that when I ask it a question, it’s not going to answer based on pictures of cardiograms it’s seen on the internet. It’s going to respond based on what this device is really trying to do…RAG gives context and contextual awareness to open-source LLMs in domains where it’s extremely critical they respond in a factual way.”

RAG, crucially, does not modify models; it merely runs alongside them. This is something Ors said customers concerned about explainable AI and upcoming AI regulations are keen on: once a model is certified, adding context after deployment via RAG could avoid the need for further regulatory approvals.

Overall, GenAI Flow aims to remove customer friction points in deploying LLMs at the edge. While fully agentic AI at the edge is still a little way off, Ors said the approach shows a lot of promise for edge use cases. For example, in an industrial setting, a security camera might capture an accident at a work site. Agents could interpret the scene, call a supervisor or the emergency services, begin collecting reports of what happened, shut down the machinery involved, and so on. These actions are all “totally within the realms of possibility,” Ors said.
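A sketch of what such an agent dispatch loop could look like, assuming a scene-interpretation model that proposes actions by name and a whitelist of local tools; interpret_scene and the tool functions below are stand-ins, not any real NXP or Kinara API:

def send_alert(who, event):
    print(f"alerting {who}: {event['summary']}")

def plc_stop(machine_id):
    print(f"stopping machine {machine_id}")

def start_report(event):
    print(f"opening incident report for {event['summary']}")

TOOLS = {
    "notify_supervisor": lambda e: send_alert("supervisor", e),
    "call_emergency": lambda e: send_alert("emergency services", e),
    "stop_machinery": lambda e: plc_stop(e["machine_id"]),
    "open_incident": start_report,
}

def interpret_scene(frame):
    # Placeholder for the vision and LLM pipeline that decides what happened.
    return {"summary": "worker injury near press 7", "machine_id": 7,
            "actions": ["stop_machinery", "call_emergency", "open_incident"]}

def handle_event(frame):
    event = interpret_scene(frame)
    for action in event.get("actions", []):
        TOOLS.get(action, lambda e: None)(event)  # ignore anything not whitelisted

handle_event("camera_frame")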

From EETimes
