
Optimizing On-Device LLM Inference in React Native with Expo and Nitro Modules

Published Mar 26, 2026, 11:07 AM · 5 min read
React Native · Expo · On-Device AI · Nitro Modules · Mobile Performance
Learn how to implement high-performance on-device AI in React Native using the new Nitro Modules architecture and local LLM quantization strategies for low-latency mobile experiences.

As we move into early 2026, the architectural bottleneck for mobile AI has shifted from API latency to local execution efficiency. While cloud-based LLMs remain the standard for complex reasoning, the demand for privacy-first, offline-capable, and zero-latency features has pushed on-device inference into the mainstream.

With the recent stabilization of the Nitro Modules architecture in the React Native ecosystem, we finally have a high-performance bridge capable of handling the heavy lifting required for local model execution without the overhead of the legacy bridge or the complexity of manual JSI (JavaScript Interface) bindings.

The Shift to Local Inference

Running Large Language Models (LLMs) on mobile devices presents three primary engineering challenges: memory constraints, thermal throttling, and the communication overhead between the JavaScript thread and the native C++ inference engine.

In previous iterations of React Native, passing large tensors or frequent updates from a native model (like Llama 3.2 or Phi-4-mini) to the UI resulted in significant frame drops. The introduction of Nitro Modules—a highly optimized, type-safe native module system—allows us to share memory buffers directly between C++ and JavaScript, minimizing serialization costs.

Architecture: The Nitro Bridge for AI

Nitro Modules provide a direct path for passing ArrayBuffer and complex objects with near-zero overhead. When implementing an on-device LLM, we typically use a C++ backend (like llama.cpp or MLX) and wrap it in a Nitro Module.

Implementation Pattern

To implement a performant local inference engine, we define a strictly typed interface that handles the model lifecycle: loading, tokenization, and streaming inference.

// Nitro Module definition: a strictly typed spec for the native inference engine
import { type HybridObject } from 'react-native-nitro-modules';

export interface ModelConfig {
  contextSize: number; // maximum context window (in tokens) to allocate
  temp: number;        // sampling temperature
  useGpu: boolean;     // prefer GPU/NPU acceleration when available
}

// Illustrative stats shape; adjust the fields to whatever your engine reports
export interface InferenceStats {
  totalTokens: number;
  tokensPerSecond: number;
}

export interface InferenceEngine extends HybridObject {
  loadModel(path: string, config: ModelConfig): Promise<boolean>;
  generate(prompt: string, onToken: (token: string) => void): Promise<InferenceStats>;
  unload(): void;
}

By using onToken as a synchronous or fast-path callback, we can stream text directly into a SharedValue (if using Reanimated) or a state variable, ensuring the UI remains responsive at 120Hz even while the NPU (Neural Processing Unit) is under load.
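
As a minimal sketch of the consuming side (assuming the spec above is generated from a file named InferenceEngine.nitro.ts and registered under the name 'InferenceEngine', both of which are placeholders), streaming tokens into React state could look like this:

import { useCallback, useState } from 'react';
import { NitroModules } from 'react-native-nitro-modules';
import type { InferenceEngine } from './InferenceEngine.nitro';

// Create the hybrid object once; the engine itself lives on the native side
const engine = NitroModules.createHybridObject<InferenceEngine>('InferenceEngine');

export function useLocalCompletion() {
  const [output, setOutput] = useState('');

  const run = useCallback(async (prompt: string) => {
    setOutput('');
    // Append each token as it arrives so the UI streams text incrementally
    const stats = await engine.generate(prompt, (token) => {
      setOutput((prev) => prev + token);
    });
    console.log(`Decoded at ${stats.tokensPerSecond.toFixed(1)} tok/s`);
  }, []);

  return { output, run };
}

If you render through Reanimated, writing each token into a SharedValue instead of React state avoids triggering a re-render per token.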

Quantization and Memory Management

On mobile, raw model weights are non-starters. A 7B parameter model in FP16 takes ~14GB of RAM, exceeding the limits of almost all consumer smartphones. To make these models viable, we must employ 4-bit or even 3-bit quantization (GGUF or EXL2 formats).

The 2026 Standard: 4-bit GGUF

As of March 2026, 4-bit quantization has become the sweet spot for mobile deployment: roughly a 70% reduction in memory footprint compared to FP16, typically at the cost of only a 1-2% increase in perplexity. When deploying via Expo, we recommend using the expo-asset system to manage these large binary files, with one caveat: do not bundle weights in the main app binary if they exceed 100MB. Instead, download them in the background and verify them with a checksum.
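
As an illustrative sketch (the URL, file name, and checksum below are placeholders), a background download with integrity verification using expo-file-system might look like this:

import * as FileSystem from 'expo-file-system';

// Placeholder URL and checksum; substitute the values for your own model artifact
const MODEL_URL = 'https://example.com/models/model-q4_k_m.gguf';
const EXPECTED_MD5 = 'replace-with-published-md5';
const MODEL_PATH = `${FileSystem.documentDirectory}model-q4_k_m.gguf`;

export async function ensureModelDownloaded(): Promise<string> {
  // Skip the download if a verified copy already exists on disk
  const existing = await FileSystem.getInfoAsync(MODEL_PATH, { md5: true });
  if (existing.exists && existing.md5 === EXPECTED_MD5) {
    return MODEL_PATH;
  }

  // Resumable download, so an interrupted multi-hundred-MB transfer can continue
  const download = FileSystem.createDownloadResumable(MODEL_URL, MODEL_PATH);
  await download.downloadAsync();

  // Verify integrity before handing the file to the native engine
  const fresh = await FileSystem.getInfoAsync(MODEL_PATH, { md5: true });
  if (!fresh.exists || fresh.md5 !== EXPECTED_MD5) {
    await FileSystem.deleteAsync(MODEL_PATH, { idempotent: true });
    throw new Error('Model checksum mismatch; aborting load');
  }
  return MODEL_PATH;
}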

Performance Tradeoffs: CPU vs. GPU vs. NPU

Modern mobile chips (Apple A18/A19, Snapdragon Gen 4/5) feature dedicated NPUs. However, accessing these from React Native requires specific native drivers:

  1. CoreML (iOS): Best for Apple Silicon, provides the best energy efficiency.
  2. QNN / NNAPI (Android): Essential for leveraging the Hexagon DSP on Snapdragon devices.
  3. Vulkan/Metal: Good fallbacks for general GPU acceleration but higher power consumption.

When building your Nitro Module, implement fallback logic that detects hardware capabilities at runtime. If an NPU is unavailable, falling back to a highly optimized CPU implementation (using ARM NEON instructions) is often more stable than forcing a generic GPU shader that might crash on lower-end devices.
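
A rough sketch of that selection logic in TypeScript (the capability probe itself is not shown; how you detect backends is up to your own native layer) might look like:

// Illustrative accelerator selection; feed it whatever capability probe your engine exposes
type Backend = 'npu' | 'gpu' | 'cpu';

export function pickBackend(available: Backend[]): Backend {
  if (available.includes('npu')) return 'npu'; // CoreML on iOS, QNN/NNAPI on Android
  if (available.includes('gpu')) return 'gpu'; // Metal / Vulkan
  return 'cpu';                                // ARM NEON-optimized fallback
}

// Map the chosen backend onto the ModelConfig from the spec above
export function configFor(backend: Backend) {
  return {
    contextSize: backend === 'cpu' ? 2048 : 4096, // shrink the context on the slowest path
    temp: 0.7,
    useGpu: backend !== 'cpu',
  };
}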

Real-World Evaluation and Tracing

Deploying the model is only half the battle. Monitoring performance in the wild is critical. Unlike web-based LLMs where you track Time To First Token (TTFT) from the server, on-device AI requires tracking:

  • Model Load Time: How long it takes to map the weights into memory.
  • Tokens Per Second (TPS): The actual inference speed.
  • Battery Drain: The delta in mAh during an active session.
  • Thermal State: Monitoring ProcessInfo.thermalState on iOS to throttle context length if the device overheats.

Integrating a tracing system like LangSmith or an open-source alternative via a lightweight Nitro-based bridge allows you to capture these metrics without blocking the inference loop.
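
As a sketch of that idea (the endpoint and payload shape are assumptions, not a LangSmith API), a batched, fire-and-forget trace queue that never blocks token streaming could look like:

// Illustrative trace buffer; flushes in batches so logging never blocks inference
interface InferenceTrace {
  modelLoadMs: number;
  tokensPerSecond: number;
  thermalState?: string; // e.g. mirrored from ProcessInfo.thermalState on iOS
}

const pending: InferenceTrace[] = [];

export function recordTrace(trace: InferenceTrace): void {
  pending.push(trace); // cheap and synchronous; safe to call when a generation finishes
}

export async function flushTraces(endpoint: string): Promise<void> {
  if (pending.length === 0) return;
  const batch = pending.splice(0, pending.length);
  try {
    await fetch(endpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(batch),
    });
  } catch {
    pending.unshift(...batch); // keep metrics on network failure and retry later
  }
}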

Practical Implementation Steps

If you are starting a new React Native project with local AI requirements today, follow this roadmap (a consolidated sketch follows the list):

  1. Initialize with Expo: Use the latest Expo SDK (v53+), which has first-class support for the New Architecture.
  2. Select a Base Model: Start with Llama-3.2-1B or Phi-4-mini. Once quantized to 4-bit, these fit within the 1GB-2GB RAM envelope a mobile app can realistically claim.
  3. Leverage Nitro Modules: Avoid the legacy bridge. Use Nitro for the C++ bindings to your inference engine of choice.
  4. Optimize Assets: Use react-native-fs or expo-file-system to manage model weights outside of the JS bundle. Store them in the document directory (not the cache directory) so the OS does not clear them.
  5. Handle Concurrency: Always run inference on a dedicated background thread. Nitro Modules make this easier by allowing you to dispatch work to a custom C++ thread pool while keeping the JS thread free for UI interactions.
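
Putting those steps together, and reusing the hypothetical helpers sketched earlier (the module name and file paths are placeholders), the end-to-end flow might look like:

import { NitroModules } from 'react-native-nitro-modules';
import type { InferenceEngine } from './InferenceEngine.nitro';
import { ensureModelDownloaded } from './ensureModelDownloaded';

const engine = NitroModules.createHybridObject<InferenceEngine>('InferenceEngine');

export async function bootstrapAndGenerate(prompt: string): Promise<string> {
  // Steps 2 and 4: fetch and verify the quantized weights outside the JS bundle
  const modelPath = await ensureModelDownloaded();

  // Steps 3 and 5: load via the Nitro-backed engine; inference runs on a native thread
  await engine.loadModel(modelPath, { contextSize: 2048, temp: 0.7, useGpu: true });

  let text = '';
  await engine.generate(prompt, (token) => { text += token; });

  engine.unload(); // release the mapped weights when the session ends
  return text;
}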

Conclusion

The combination of Nitro Modules and highly optimized small language models has turned React Native into a formidable platform for AI-native applications. By moving inference to the edge, we eliminate API costs, improve user privacy, and create snappier experiences. The key to success lies not just in the model choice, but in the efficiency of the native-to-JS data pipeline and rigorous memory management.